Impressive Artificial Intelligence program that recreates faces from audio

Technology continues to advance by leaps and bounds, drawing on various fields to explore new capabilities. One of them is the ability to “reconstruct” a person’s face from a fragment of their voice.

The study Speech2Face, presented in 2019 at the Conference on Computer Vision and Pattern Recognition (CVPR), showed that an artificial intelligence (AI) can infer what a person looks like from short audio segments.

The paper explains that the goal of researchers Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, and Michael Rubinstein of MIT is not to reconstruct people’s faces identically, but to produce an image with the physical characteristics that correlate with the analyzed audio.

To achieve this, they designed and trained a deep neural network on millions of YouTube videos of people speaking. During training, the model learned to correlate voices with faces, allowing it to produce images with physical attributes similar to those of the speakers, including age, gender, and ethnicity.

The training was carried out in a self-supervised manner, exploiting the natural co-occurrence of faces and voices in Internet videos, without the need to explicitly model detailed physical features of the face.
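The core idea behind this kind of self-supervised setup can be sketched in a few lines: a trainable voice branch is regressed onto the features produced by a frozen face branch for the same video frames, so no manual labels are needed. The sketch below is a deliberately tiny linear stand-in, not the authors’ actual architecture (Speech2Face uses deep convolutional networks on spectrograms and a pretrained face-recognition model); all dimensions, variable names, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real model works with spectrograms and
# high-dimensional face-recognition features.
SPEC_DIM, FEAT_DIM, N = 32, 8, 200

# Toy "dataset": paired audio features and face features taken from the
# same videos. The hidden matrix W_true plays the role of the unknown
# voice-to-face correlation the network is supposed to discover.
spectrograms = rng.normal(size=(N, SPEC_DIM))
W_true = rng.normal(size=(FEAT_DIM, SPEC_DIM))
face_feats = spectrograms @ W_true.T  # output of the frozen "face branch"

# Self-supervised-style training: fit the voice branch (here a single
# linear map) to the face features with gradient descent on an L2 loss.
W_voice = np.zeros((FEAT_DIM, SPEC_DIM))
lr = 0.1
for _ in range(300):
    pred = spectrograms @ W_voice.T
    grad = (pred - face_feats).T @ spectrograms / N
    W_voice -= lr * grad

initial_loss = np.mean(face_feats ** 2)          # loss before training
final_loss = np.mean((spectrograms @ W_voice.T - face_feats) ** 2)
print(initial_loss, final_loss)
```

Because the supervision signal comes entirely from the face branch of the same clip, collecting more videos automatically yields more training pairs, which is what makes web-scale video a practical data source for this approach.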

“Our reconstructions, obtained directly from the audio, reveal the correlations between faces and voices. We numerically evaluate and quantify how, and in what way, our Speech2Face reconstructions from audio resemble real images of speakers’ faces.”

They noted that because the study touches on sensitive aspects, such as ethnicity and privacy, no specific physical traits have been added to the recreated faces. They also point out that, like any other machine-learning system, this one improves over time, since each use expands its library of knowledge.

Although the evidence shown indicates that Speech2Face achieves a high number of matches between faces and voices, it also had some failures, where the ethnicity, age, or gender did not match the voice sample used.

The model is designed to capture the statistical correlations that exist between facial features and voice. It should be remembered that the AI learned from YouTube videos, which are not a representative sample of the world’s population; for some languages, for example, the results show discrepancies with the training data.

Accordingly, the study itself recommends, in its conclusions, that those who decide to extend and modernize the system take into consideration a broader sample of people and voices, so that the machine-learning model has a wider repertoire for matching and recreating faces.

The program was also able to render the recreated faces as cartoons, which likewise bear a striking resemblance to the speakers in the analyzed audio.

Because this technology could also be used for malicious purposes, the face recreation only approximates the person’s appearance rather than producing an exact likeness, as that could pose a problem for people’s privacy. Even so, what the technology can do from audio samples alone is surprising.
