Show HN: sync. (YC W24) – an API for fast and affordable lip-sync at scale

2 years ago

Hey HN, we’re sync. (https://synclabs.so/). We’re building fast + lightweight audio-visual models to create, modify, and understand humans in video.

You can find out more about us and our company in this video: https://bit.ly/3TV27rd

Our first API lets you lip-sync a person in a video to audio in any language, zero-shot. You can check out some examples here: https://bit.ly/3IT3UXk

Here’s a demo showing how it works and how to sync your first video / audio: https://bit.ly/4ablRwo

Our playground + API are live. You can play with our models here: https://app.synclabs.so/
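To make the video + audio pairing concrete, here is a minimal sketch of what a lip-sync request could look like from Python. Everything specific in it is an assumption for illustration, not our documented API: the endpoint is a placeholder, and the `videoUrl` / `audioUrl` field names and the `x-api-key` header are hypothetical. The request is built but never sent, so it runs without a key.

```python
import json
from urllib import request

# Hypothetical placeholder endpoint, NOT the real API URL.
LIPSYNC_ENDPOINT = "https://api.example.com/lipsync"


def build_lipsync_request(video_url: str, audio_url: str, api_key: str) -> request.Request:
    """Build (but do not send) a POST request pairing a video with new audio."""
    if not video_url or not audio_url:
        raise ValueError("both video_url and audio_url are required")
    # Assumed field names: one URL for the source video, one for the target audio.
    payload = {"videoUrl": video_url, "audioUrl": audio_url}
    return request.Request(
        LIPSYNC_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        # Assumed auth scheme: an API key passed in a request header.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


req = build_lipsync_request(
    "https://example.com/speaker.mp4",
    "https://example.com/dubbed_audio.wav",
    "YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending it would be a call to `urllib.request.urlopen(req)` once the placeholder values are swapped for the ones in the real API docs.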

Four years ago we open-sourced Wav2Lip (https://github.com/Rudrabha/Wav2Lip), the first model to lip-sync anyone to any audio w/o having to train for each speaker. Even now, it's the most popular lip-syncing model to date (almost 9k GitHub stars).

Human lip-sync enables interesting features for many products – you can use it to seamlessly translate videos from one language to another, create personalized ads / video messages to send to your customers, or clone yourself so you never have to record a piece of content again.

We’re excited about this area of research / the models we’re building because they can be impactful in many ways:

[1] we can dissolve language as a barrier

check out how we used it to dub the entire 2-hour Tucker Carlson interview with Putin speaking fluent English: https://vimeo.com/914605299

imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.

realtime at the edge takes us further: live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.

[2] we can move the human-computer interface beyond text-based-chat

keyboards / mice are lossy + low bandwidth. human communication is rich and goes beyond just the words we say. what if we could compute w/ a face-to-face interaction?

Many people get carried away w/ the fact LLMs can generate, but forget they can also read. The same is true for these audio/visual models — generation unlocks a portion of the use-cases, but understanding humans from video unlocks huge potential.

Embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way.

[3] and more

powerful models small enough to run at the edge could unlock a lot:

e.g. extreme compression for face-to-face video streaming; enhanced, spatial-aware transcription w/ lip-reading; detecting deepfakes in the wild; on-device real-time video translation; etc.

We’re building + scaling an API / SDK to bring this capability directly to the apps / services people already use. Lip-syncing is the first step, but we’re moving towards generating / modifying facial expressions, speech, and eventually a foundational approach to modify a human in video in any way you can imagine.

Our playground and API are live today – we’re early, but we’re iterating quickly. We’d love any feedback you have on the playground experience, the API developer experience, and the quality of our models' output :) we appreciate you and your time.

You can play with it here for free: https://app.synclabs.so/