EMO (Emote Portrait Alive) is a new generative AI model developed by Alibaba's Institute for Intelligent Computing (IIC) that can "magically" make the subject of any portrait image speak and sing realistically.
In other words, Alibaba's AI takes a static reference image and a voice audio clip and produces a video in which the subject speaks or sings with natural expressions.
Previous AI systems only morphed the mouth and part of the face, while EMO can generate facial expressions, natural mouth movements, and precise lip synchronization, and can make the subject raise their eyebrows, frown, or even sway to the music.
Alibaba has released a few videos showing still images turned into videos that sing along to supplied songs. EMO supports English, Chinese, and many other languages.
Alibaba said that, to create realistic facial expressions, EMO was trained on a large amount of image, audio, and video data using its own diffusion model, called Audio2Video.
To address the major current challenge of realism and expressiveness in generating video from images and sound, the research team focused on the relationship and nuances between audio signals and facial movements. The approach bypasses intermediate 3D models and facial landmarks, while keeping transitions between frames seamless and preserving consistency throughout the video.
Alibaba has not revealed when it will release this AI to the public, but it has published information about EMO on GitHub and a research paper on arXiv.