𝗧𝘄𝗼 𝗗𝗶𝗳𝗳𝘂𝘀𝗶𝗼𝗻 𝗦𝘁𝗲𝗽𝘀 𝗥𝗲𝗮𝗰𝗵 𝟯𝟭 𝗙𝗣𝗦

Diffusion models for lip sync finally reach real-time speeds.

Most people believe you need dozens of steps to make diffusion work. New research shows you only need two.

The Lip Forcing method changes how the pipeline works. It does not make the model bigger. It cuts the number of sampling steps the model needs.

Old systems required over 50 steps. This caused long delays. You could not use them for live interaction.

The new 1.3B student model hits 31 FPS. This is 17.6x faster than previous models of the same size.
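To put the two reported numbers in context, here is a quick back-of-the-envelope calculation. It assumes the 17.6x figure refers to throughput measured under the same conditions as the 31 FPS figure. The baseline values it prints are derived from those two numbers, not reported in the text.

```python
# Figures taken from the text above.
student_fps = 31.0
speedup = 17.6

# Implied throughput of the comparison model (derived, not a reported number).
baseline_fps = student_fps / speedup

# Time spent per frame at each rate, in milliseconds.
student_ms_per_frame = 1000.0 / student_fps
baseline_ms_per_frame = 1000.0 / baseline_fps

print(f"baseline: {baseline_fps:.2f} FPS, {baseline_ms_per_frame:.0f} ms per frame")
print(f"student:  {student_fps:.0f} FPS, {student_ms_per_frame:.1f} ms per frame")
```

Under these assumptions the comparison model runs at under 2 FPS, which is why it could not be used for live interaction.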

How does it work?

It uses a two-step inference schedule.
It removes classifier-free guidance at inference time.
It uses a Sync-Window DMD to keep audio and video aligned.
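The first two points can be sketched in a few lines. This is a minimal toy, not the Lip Forcing implementation: `toy_denoiser`, the timestep values `(1.0, 0.5)`, and the linear re-noising rule are all assumptions made for illustration. It shows the general shape of a two-step sampler with a single conditional pass per step, and counts how many network passes a schedule costs. The Sync-Window DMD training objective is not shown.

```python
import random

def toy_denoiser(x_noisy, t, audio):
    """Hypothetical stand-in for the student network.

    Returns an estimate of the clean sample. At high noise levels (t near 1)
    it leans on the audio conditioning; at low noise levels it trusts the input.
    """
    toy_denoiser.calls += 1
    return [(1.0 - t) * x + t * a for x, a in zip(x_noisy, audio)]

toy_denoiser.calls = 0

def sample_two_step(denoiser, audio, timesteps=(1.0, 0.5), seed=0):
    """Few-step sampling: predict clean sample, re-noise, predict again."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in audio]  # start from pure noise
    x0 = x
    for i, t in enumerate(timesteps):
        # One conditional pass only. With classifier-free guidance there
        # would be a second, unconditional pass here.
        x0 = denoiser(x, t, audio)
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            # Re-noise the clean estimate down to the next noise level
            # (assumed linear interpolation between sample and noise).
            x = [(1.0 - t_next) * v + t_next * rng.gauss(0.0, 1.0) for v in x0]
    return x0

def denoiser_passes(num_steps, use_cfg):
    """Classifier-free guidance needs two network evaluations per step."""
    return num_steps * (2 if use_cfg else 1)

audio_features = [0.0, 1.0, 0.5]  # placeholder conditioning values
frame = sample_two_step(toy_denoiser, audio_features)
print("network passes used:", toy_denoiser.calls)
print("50 steps with guidance:", denoiser_passes(50, use_cfg=True))
print("2 steps without guidance:", denoiser_passes(2, use_cfg=False))
```

The 50-step guided schedule is only an illustrative comparison. The text says old systems needed over 50 steps but does not say whether they used guidance.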

The speed comes with a small loss in visual fidelity. Lip synchronization stays high.

The limitations are clear.

It works on chunks of video, not the whole sequence at once.
It requires a large teacher model for training.
It currently works only on speaking faces.
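The first limitation is easier to picture with a small sketch. It splits a sequence of frames into fixed-size chunks that would be generated one after another. The function name `chunk_ranges`, the chunk size of 4, and the absence of overlap between chunks are assumptions for illustration. The text does not state how the method sizes or joins its chunks.

```python
def chunk_ranges(num_frames, chunk_size):
    """Return (start, end) frame index pairs, end exclusive, covering the video."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]

# A 10-frame clip processed in chunks of 4 frames (hypothetical sizes).
for start, end in chunk_ranges(10, 4):
    print(f"generate frames {start}..{end - 1}")
```

Each chunk is generated without seeing the chunks that follow it, so the model never has the whole sequence in view at once.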

If two steps work for lip sync, other video models may be able to follow this path. Heavy models could be replaced with lightweight students. This opens the door for live streaming filters and on-device animation.

We might see one-step models soon. That would cut generation latency even further.

Source: https://dev.to/olaughter/two-diffusion-steps-reach-31-fps-52pd
