
Disney And Other Researchers Are Developing A New Method For Automated Real-Time Lip Sync

Automated lip sync is not a new technology, but Disney Research, working with researchers at the University of East Anglia (England), Caltech, and Carnegie Mellon University, has added a twist to it: deep learning.

The researchers trained a deep neural network to generate real-time animated speech. In addition to automatically generating lip sync for English-speaking actors, the new software can be applied to singing or adapted for foreign languages. The technology was presented at the most recent SIGGRAPH computer graphics conference in Los Angeles.

“Realistic speech animation is essential for effective character animation,” said lead researcher Dr. Sarah Taylor, from UEA’s School of Computing Sciences. “Done badly, it can be distracting and lead to a box office flop. Doing it well, however, is both time-consuming and costly, as it has to be manually produced by a skilled animator. Our goal is to automatically generate production-quality animated speech for any style of character, given only audio speech as an input.”

The team of researchers designed a system that trains a computer to take spoken words from a voice actor, predict the mouth shape needed, and then animate the character’s lip sync.

The process requires both recorded audio and eight hours of video reference of a single speaker reciting more than 2,500 phonetically diverse sentences; the video is tracked to create a “reference face” animation model.
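
For readers curious what “training” means here, below is a minimal sketch in PyTorch of the kind of model the paper broadly describes: a sliding-window network that maps short windows of phoneme labels to windows of tracked reference-face parameters. The window length, layer sizes, and parameter counts are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch, not the authors' code: a sliding-window network
# mapping windows of one-hot phoneme labels to windows of tracked
# "reference face" parameters. All sizes below are assumptions.
import torch
import torch.nn as nn

NUM_PHONEMES = 41   # assumed phoneme inventory size
WINDOW = 11         # phoneme frames per window (illustrative)
FACE_PARAMS = 30    # tracked face-parameter dimensions (illustrative)

class SpeechAnimationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_PHONEMES * WINDOW, 2000),
            nn.Tanh(),
            nn.Linear(2000, 2000),
            nn.Tanh(),
            nn.Linear(2000, FACE_PARAMS * WINDOW),  # a window of face poses
        )

    def forward(self, phoneme_window):
        # phoneme_window: (batch, WINDOW, NUM_PHONEMES), one-hot labels
        out = self.net(phoneme_window.flatten(1))
        return out.view(-1, WINDOW, FACE_PARAMS)

model = SpeechAnimationNet()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on dummy tensors standing in for the tracked video:
phonemes = torch.zeros(32, WINDOW, NUM_PHONEMES)
phonemes[..., 0] = 1.0                        # fake one-hot labels
target_faces = torch.randn(32, WINDOW, FACE_PARAMS)
loss = loss_fn(model(phonemes), target_faces)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```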

The audio is transcribed into speech sounds (phonemes) using off-the-shelf speech recognition software. Those phonemes are then applied to the reference face, and the results can be retargeted to any real-time CG character rig.
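
As a rough illustration of that step, the trained network from the sketch above can be slid along the transcribed phoneme sequence, with the overlapping per-frame predictions blended into a single face-parameter track. The simple overlap averaging and the `retarget_to_rig` hook are assumptions for illustration, not the paper’s published interface.

```python
# Continues the sketch above (reuses model, WINDOW, FACE_PARAMS, NUM_PHONEMES).
import torch

def animate(model, phoneme_track):
    """Slide the window along a (T, NUM_PHONEMES) phoneme track and
    average the overlapping face-parameter predictions per frame."""
    T = phoneme_track.shape[0]
    accum = torch.zeros(T, FACE_PARAMS)
    counts = torch.zeros(T, 1)
    with torch.no_grad():
        for start in range(T - WINDOW + 1):
            window = phoneme_track[start:start + WINDOW].unsqueeze(0)
            pred = model(window)[0]               # (WINDOW, FACE_PARAMS)
            accum[start:start + WINDOW] += pred
            counts[start:start + WINDOW] += 1
    return accum / counts.clamp(min=1)            # (T, FACE_PARAMS)

# Hypothetical retargeting hook: a rig-specific linear map from
# reference-face parameters to a character rig's controls.
def retarget_to_rig(face_params, rig_matrix):
    return face_params @ rig_matrix               # (T, rig_controls)

phoneme_track = torch.zeros(100, NUM_PHONEMES)    # dummy transcription
phoneme_track[:, 0] = 1.0
face_track = animate(model, phoneme_track)        # (100, FACE_PARAMS)
rig_track = retarget_to_rig(face_track, torch.randn(FACE_PARAMS, 12))
```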

“Our automatic speech animation therefore works for any input speaker, for any style of speech, and can even work in other languages,” said Taylor. “Our results so far show that our approach achieves state-of-the-art performance in visual speech animation. The real beauty is that it is very straightforward to use, and easy to edit and stylize the animation using standard production editing software.”

While it would be easy to scoff at the quality of the technology based on these early examples, it’s not difficult to envision automated lip sync forming the foundation of most computer-generated characters in 10 to 20 years. In its current state, real-time lip sync could serve a valuable role in gaming applications, location-based animation projects, and TV series that require large volumes of modestly budgeted animation.

The Walt Disney Company is not the only entity exploring automated speech and facial animation technologies. An entire track at the most recent SIGGRAPH was devoted to the topic, with multiple technical papers presented on the latest developments.

The paper about this project – “A Deep Learning Approach for Generalized Speech Animation” – can be downloaded here.

  • Josh Evans

    Wow, that’s some cool stuff!! It’ll be interesting to see how technology like this opens the doors for tiny studios to be able to handle more and more complex projects!

    • Joshua Taback

      Maybe the tiny studios should practice their craft enough to be able to do the work themselves

      • Josh Evans

        It’s less about being ABLE to do something and more about being able to tell a story with fewer resources and on a tighter budget. Freeing up time at any stage in the pipeline allows smaller filmmakers to spend that time on other parts of production, increasing the possibility of competing with much larger studios. It frees up animators to focus more on refining character motion than on getting lost in the technical side. But hey, just like with any tool: if you don’t like the idea of the tech, then don’t use it. Easy.

  • Elsi Pote

    I know it’s all about turning around more and more content for cheap. But I don’t know why focus so much energy on lip sync when people pay attention to the eyes, unless you have a particular type of dyslexia or a related disorder, for that matter.

  • Davion Alexander Blackwíng

    Great! As if we needed more stuff to make 2D animation look as lifeless and Flash-y as it already looks.

    • Josh Evans

      Hopefully people don’t use it right out of the box but instead as a baseline from which to animate on top of in order to save time and money. But yeah, anything used without additional animation is going to look stale and formulaic.

  • Trokon Hufnagel

    “Then in a year or so, a computer programmer will figure out how to eliminate us completely by creating a program that can instantly design flat, hip, perspective-free characters and animate all the stock cool actions. The executives and soccer-moms will be really happy then because they will finally be completely free of us pesky artists and will be able to make the cartoons by themselves. And they will be the cool ones.

    And there will be no reason for cartoons or cartoonists to exist anymore.”

    -John Kricfalusi

    http://johnkstuff.blogspot.com/2009/09/hipster-toon-tude-retreat.html

  • Mesterius

    Awesome! I had been hoping to see this make a comeback in the digital world: https://www.youtube.com/watch?v=6MHg1-mpcUY

  • @SpitAndSpite

    “lead to a box office flop” lols – yea, that’s the ticket yea!

  • Matthew

    Maybe they can finally properly sync the redubbed dialogue on BEDKNOBS AND BROOMSTICKS.

  • RCooke

    The fact that they use the word “animationS” is clear proof they’re not animators. Using the “s” at the end of animation comes from the world of technical engineering (and Neal Gabler), not art. And certainly not animation.

    Lip sync is NOT the most important part of dialogue animation. Never has been. Body attitude is.

    • It’s sad when people who think they know it all really don’t.

  • Adobe Character Animator does this with 2D already. Impressive, but underwhelming results.

  • Bradley

    Great! Then one day this tech will be used against someone, and people will say, “Look at the terrible things they said.” Do people not see the grave consequences of this? Fake speeches will be wrongly attributed. This will take fake news to a whole new level.

    • GW

      I hadn’t thought about it. You’re right. I already realized that CGI’s advances would make it possible to fake just about anything, but I didn’t take into account how automation will make it easier for less skilled people to do it as well. It will make it so that just about anybody can record a soundalike voice and put words in somebody’s mouth. But you’d have to animate more than somebody’s lips to do that. You’d need convincing eye movements, other facial expressions, breathing, etc. Once that sort of thing is taken care of, then we should really be worried.

      • I’ve known for years that this would happen.

        • GW

          Why not type up a blog post about what’s coming next then? I’d read it.