simple lip synching

rouncer 103 Aug 02, 2010 at 19:45

The simplest lip synching I can think of is just measuring the amplitude of the waveform and opening the mouth in proportion to it.

Obviously this falls down for “th” and “m”, where the amplitude is high but the mouth is shut… what’s the next step after this before it starts getting impossible to understand? :)
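In code, that baseline is about this much (a rough Python sketch assuming 16-bit mono PCM; the frame size, smoothing, and gain are numbers I’ve pulled out of the air):

```python
# Minimal amplitude-driven lip sync (assumes 16-bit mono PCM).
import wave
import numpy as np

def mouth_open_curve(path, frame_ms=20, smoothing=0.5):
    """Return one mouth-open value in [0, 1] per frame of audio."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    frame_len = int(rate * frame_ms / 1000)
    openness, smoothed = [], 0.0
    for i in range(0, len(samples) - frame_len, frame_len):
        rms = np.sqrt(np.mean(samples[i:i + frame_len] ** 2))
        # Exponential smoothing so the jaw doesn't flap on every frame.
        smoothed = smoothing * smoothed + (1.0 - smoothing) * rms
        openness.append(min(1.0, smoothed * 8.0))  # arbitrary gain
    return openness
```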

14 Replies


rouncer 103 Aug 02, 2010 at 20:05

I’m versed in the FFT; I just don’t know what to do with it to detect the phonemes… how do you detect a phoneme from the data? Is it just a matter of a nearest-match test against a sample of that phoneme?

I suppose I could try that.
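Something like this, maybe (a naive sketch of that nearest test; I’m normalizing the spectra so the match keys on spectral shape rather than loudness, and assuming the templates are the same length as the analysis frame):

```python
# Naive nearest-template matcher: compare the magnitude spectrum of one
# audio frame against prerecorded viseme frames of the same length.
import numpy as np

def nearest_viseme(frame, viseme_templates):
    """frame: 1-D float array; viseme_templates: dict name -> raw frame."""
    def norm_spectrum(x):
        mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        return mag / (np.linalg.norm(mag) + 1e-9)  # ignore loudness
    spectrum = norm_spectrum(frame)
    best, best_dist = None, np.inf
    for name, template in viseme_templates.items():
        dist = np.linalg.norm(spectrum - norm_spectrum(template))
        if dist < best_dist:
            best, best_dist = name, dist
    return best
```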

roel 101 Aug 02, 2010 at 21:16

That’s a far from trivial subject, rouncer. I once did a course on speech recognition and had to build my own recognizer. IIRC, the input was something close to an FFT of the audio signal (I can’t recall the name), and there were hidden Markov models for each phoneme. My program then searched for the most probable path through all the models, and the phonemes of the chosen models were used to construct the word. So, that could be a solution. But I’d pick an easier one.

edit: these are older course notes, but they appear to follow the approach I described: http://www.kbs.twi.tudelft.nl/docs/syllabi/speech.pdf
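For what it’s worth, the path search at the heart of that approach is the Viterbi algorithm. A toy version, just to show the shape of it (everything here is illustrative; real scores come from trained models, not from anything in those notes):

```python
# Toy Viterbi decode: most probable phoneme-state sequence given
# per-frame scores.
import numpy as np

def viterbi(log_emit, log_trans, log_start):
    """log_emit: (T, N) per-frame log-likelihoods for N states;
    log_trans: (N, N) log transition probs; log_start: (N,) log priors."""
    T, N = log_emit.shape
    score = log_start + log_emit[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: prev i -> next j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(N)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # best state per frame
```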

Reedbeta 167 Aug 02, 2010 at 22:06

A classmate of mine built a homebrew lipsync system for his senior thesis. IIRC, he transformed the input waveform into Mel-frequency cepstral coefficients, or MFCCs (google it). Then he used some sort of neural network to recognize the visemes (the visual equivalent of phonemes, though there are fewer visemes, if you don’t count tongue positions).
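The feature step is easy to experiment with these days; here’s a minimal sketch using the librosa Python package (my choice for illustration, not what my classmate used; 13 coefficients is just a common default, and the filename is hypothetical):

```python
# MFCC features per frame via librosa; a classifier (neural net, nearest
# neighbour, ...) would then map each feature vector to a viseme.
import librosa

y, sr = librosa.load("speech.wav", sr=16000, mono=True)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # shape (13, frames)
features = mfcc.T  # one 13-dim vector per frame, ready for classification
```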

Anyway, as roel says it’s not a trivial subject, but it is also a very well researched one, and thankfully a lot easier than full speech recognition. I’m sure you can find some papers online.

Finally, I believe the amplitude version was used in Half-Life 1, FWIW.

rouncer 103 Aug 03, 2010 at 11:09

Well, I’ve got an experiment in mind. In no way would it be able to perform full speech recognition, but it could work for lipsync.
What I’ve got in mind is first recording all the “visemes” I wish to use and getting an FFT slice of each one.
Then it analyzes the waveform, and for each slice it’ll do an amplitude and phase check on each bin of each viseme, then pick the viseme with the least difference.

So I’ll report back on how successful this is in the near future.
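In rough Python, the plan looks like this (I’ve scored magnitudes only here, since raw phase jumps around with window alignment; adding the phase check back is a small change if it turns out to help):

```python
# Sliding-window version of the plan: one FFT slice per hop, scored
# against each prerecorded viseme slice, lowest difference wins.
import numpy as np

def viseme_timeline(samples, viseme_slices, win=1024, hop=512):
    """viseme_slices: dict name -> magnitude spectrum of length win//2 + 1."""
    window = np.hanning(win)
    timeline = []
    for i in range(0, len(samples) - win, hop):
        mag = np.abs(np.fft.rfft(samples[i:i + win] * window))
        diffs = {name: np.sum((mag - ref) ** 2)
                 for name, ref in viseme_slices.items()}
        timeline.append(min(diffs, key=diffs.get))
    return timeline  # best-matching viseme name per hop
```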

TheNut 179 Aug 03, 2010 at 16:16

I would have figured speech recognition to be easier than lip syncing. One just involves interpreting wave data. The other interprets it + maps it to an accurate animation + must include emotion = skynet. I always thought games just used a primitive form, with help from artists (visemes), to animate it. Sort of like using subtitles along with the wave data, so no interpretation is necessary.

JarkkoL 102 Aug 03, 2010 at 16:21

Games use lipsync software to extract phonemes from wav files, and artists create the phoneme poses manually. Then you just play back the wave file and interpolate the animation using the extracted phoneme info.
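The playback half is simple; assuming the extractor hands you a sorted list of (start_time, phoneme) events, a hypothetical blend function could look like this. Call it once per rendered frame with the audio clock and hand the weights to whatever blends your poses:

```python
# Hypothetical playback side: blend between artist-made poses at audio
# time t, crossfading over `blend` seconds at each phoneme boundary.
def viseme_weights(t, track, blend=0.08):
    """track: sorted list of (start_time, viseme_name); blend in seconds.
    Returns {viseme: weight} for the pose blender (weights sum to 1)."""
    weights = {}
    for i, (start, name) in enumerate(track):
        end = track[i + 1][0] if i + 1 < len(track) else float("inf")
        # Triangular ramp: fade in before `start`, fade out after `end`.
        if start - blend <= t < end + blend:
            fade_in = min(1.0, (t - (start - blend)) / blend)
            fade_out = min(1.0, ((end + blend) - t) / blend)
            w = max(0.0, min(fade_in, fade_out))
            weights[name] = max(weights.get(name, 0.0), w)
    total = sum(weights.values()) or 1.0
    return {n: w / total for n, w in weights.items()}
```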

rouncer 103 Aug 03, 2010 at 16:26

@TheNut

I would have figured speech recognition to be easier than lip syncing. One just involves interpreting wave data. The other interprets it + maps it to an accurate animation + must include emotion = skynet. I always thought games just used a primitive form, with help from artists (visemes), to animate it. Sort of like using subtitles along with the wave data, so no interpretation is necessary.

Well, you’re right if you want AAA animation, but if you’re happy with something a little less than perfect (yes, me…) then speech recognition needs to be more precise (’cause it’s all about recognizing a phoneme no matter WHO says it… something I’m NOT going to bother with). Unless you’re going for Walt Disney animation… then I guess I see your point.

Reedbeta 167 Aug 03, 2010 at 19:10

@TheNut

I would have figured speech recognition to be easier than lip syncing. One just involves interpreting wave data. The other interprets it + maps it to an accurate animation + must include emotion

Speech rec is harder because it must discriminate among a greater number of phonemes and then also put those together into words; lipsync doesn’t require extracting words. I’m not sure what you mean by including emotion. If you mean lipsynching while smiling/scowling/whatever, that can probably be done via additive or blended animations. (I’m assuming the emotion parts would be hand-animated rather than extracted from the tone of voice automatically, although that is also an interesting problem!)
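A rough sketch of that layering, treating poses as plain dicts of blend-shape weights (that representation is just an assumption for illustration):

```python
# Layering sketch: take the emotion pose everywhere, then pull just the
# mouth channels toward the current viseme pose.
def combine_pose(emotion_pose, viseme_pose, lip_keys, lip_weight=1.0):
    out = dict(emotion_pose)
    for key in lip_keys:  # e.g. "jaw_open", "lip_pucker", ...
        base = emotion_pose.get(key, 0.0)
        target = viseme_pose.get(key, 0.0)
        out[key] = base + lip_weight * (target - base)
    return out
```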

alphadog 101 Aug 03, 2010 at 19:18

Are you trying to lipsync to unknown recordings?

Because, if you have predetermined tracks, which is predominantly the case for games, then there are various ways to mark up the audio to articulate and sync the animations.

Obviously, if you are allowing users to input a wave and you want to lipsync an avatar to it, then that gets tricky, as the others have illustrated. But it may be useful to re-think the question before looking at the answers…
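For instance, the markup can be as simple as a sidecar file of timed viseme events; this format is made up, just to show the idea:

```python
# Hypothetical sidecar format, one "<time_in_seconds> <viseme>" event per
# line, e.g.:
#
#   0.000 rest
#   0.120 AH
#   0.310 M
#   0.455 EE
def load_viseme_track(path):
    """Parse the made-up markup above into a sorted (time, viseme) list."""
    track = []
    with open(path) as f:
        for line in f:
            line = line.split("#")[0].strip()  # allow trailing comments
            if line:
                time_s, name = line.split()
                track.append((float(time_s), name))
    return sorted(track)
```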

TheNut 179 Aug 03, 2010 at 19:58

@JarkkoL

Games use lipsync software to extract phonemes from wav files, and artists create the phoneme poses manually.

That would be pretty impressive, I think. Although I’ve never seen community content where lip sync worked for them, which is why I always figured game developers hand-cranked this stuff.

@Reedbeta

If you mean lipsynching while smiling/scowling/whatever, that can probably be done via additive or blended animations.

Yes, that’s what I mean. It’s not so much the solution as it is the effort. As others have said, this topic is quite advanced. Even if the lip syncing were done for you, simply integrating it is no walk in the park.

fireside 141 Aug 04, 2010 at 03:05

Playing around with this a little, I’ve found you can fool the mind pretty easily with random phonemes as long as the mouth isn’t moving when there is no amplitude.
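Which makes the cheap version nearly trivial; something like this (the threshold and hold time are guesses):

```python
# Random viseme while there's signal, mouth shut in silence.
import random

def random_viseme(rms, current, frames_held, visemes,
                  threshold=0.02, hold_frames=4):
    """Call once per audio frame; returns (viseme, frames_held)."""
    if rms < threshold:
        return "closed", 0                   # no amplitude -> mouth shut
    if current != "closed" and frames_held < hold_frames:
        return current, frames_held + 1      # hold each shape briefly
    return random.choice(visemes), 0
```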

rouncer 103 Aug 04, 2010 at 10:54

@fireside

Playing around with this a little, I’ve found you can fool the mind pretty easily with random phonemes as long as the mouth isn’t moving when there is no amplitude.

I might be half relying on this when my phoneme detector isn’t working at all, hehe.

Manually animating the face for every single bit of the spoken wave is something I’m trying to avoid. I know that’s another solution, but it’s a work-heavy one.

JarkkoL 102 Aug 04, 2010 at 11:07

What I meant is that artists just create a handful of poses for different phonemes and the tool does the rest, i.e. extracts the phonemes automatically from the wav files, which is the tricky part. Then you interpolate between the artist-created phoneme poses using the extracted data. There are existing tools for that (e.g. http://www.annosoft.com) if you don’t want to roll your own, and with the SDK you could integrate the lipsync into your shipped game as well, if you’ve got money to spare d: It’s cool if you can get a half-decent result on your own though, and maybe release an SDK under zlib ;)

Luz_Reyes 101 Oct 28, 2010 at 20:58

When creating lip-sync motions, it’s key that the character closes his mouth for certain sounds. The best way might be to simply film your mouth as it makes various sounds and look at what it does.