You can check out more about us and our company in this video: https://bit.ly/3TV27rd
Our first API lets you lip-sync a person in a video to audio in any language, zero-shot. You can check out some examples here: https://bit.ly/3IT3UXk
Here’s a demo showing how it works and how to sync your first video / audio: https://bit.ly/4ablRwo
Our playground + API are live. You can play with our models here: https://app.synclabs.so/
Four years ago we open-sourced Wav2Lip (https://github.com/Rudrabha/Wav2Lip), the first model to lip-sync anyone to any audio w/o having to train for each speaker. Even now, it’s the most popular lip-syncing model to date (almost 9k GitHub stars).
Human lip-sync enables interesting features for many products – you can use it to seamlessly translate videos from one language to another, create personalized ads / video messages to send to your customers, or clone yourself so you never have to record a piece of content again.
We’re excited about this area of research / the models we’re building because they can be impactful in many ways:
[1] we can dissolve language as a barrier
check out how we used it to dub the entire 2-hour Tucker Carlson interview with Putin speaking fluent English: https://vimeo.com/914605299
imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.
realtime at the edge takes us further: live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.
[2] we can move the human-computer interface beyond text-based-chat
keyboards / mice are lossy + low bandwidth. human communication is rich and goes beyond just the words we say. what if we could compute through face-to-face interaction?
Many people get carried away w/ the fact LLMs can generate, but forget they can also read. The same is true for these audio/visual models — generation unlocks a portion of the use-cases, but understanding humans from video unlocks huge potential.
Embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way.
[3] and more
powerful models small enough to run at the edge could unlock a lot:
e.g. extreme compression for face-to-face video streaming; enhanced, spatial-aware transcription w/ lip-reading; detecting deepfakes in the wild; on-device real-time video translation; etc.
We’re building + scaling an API / SDK to bring this capability directly to the apps / services people already use. Lip-syncing is the first step, but we’re moving towards generating / modifying facial expressions, speech, and eventually a foundational approach to modify a human in video in any way you can imagine.
Our playground and API are live today. We’re early, but we’re iterating quickly. We’d love any feedback you have on the playground experience, the API developer experience, and the output quality of our models :) We appreciate you and your time.
You can play with it here for free: https://app.synclabs.so/