I built OpenShorts, an open-source tool that takes a long YouTube video (or a local file) and automatically generates vertical short clips ready for TikTok, Instagram Reels, and YouTube Shorts.
How it works:
1. Transcribes the video using faster-whisper
(CPU-optimized, word-level timestamps)
2. Sends the transcript to Gemini 2.0 Flash, which
identifies the 3–15 most "viral-worthy" moments
(15–60s each)
3. FFmpeg cuts each selected moment at its exact timestamps
4. AI-powered vertical reframing with two modes:
- TRACK mode: MediaPipe face detection + YOLOv8
fallback with stabilization ("Heavy Tripod" engine)
for single-subject scenes
- GENERAL mode: Blurred background layout for
groups/landscapes
5. Optional: AI subtitles, hook text overlays,
voice dubbing (ElevenLabs, 30+ languages), and
direct social posting
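As a rough illustration of steps 3 and 4 (not the actual OpenShorts code), the sketch below computes a full-height 9:16 crop window around a subject position and builds the FFmpeg arguments that cut and crop one clip. The function names `vertical_crop` and `ffmpeg_clip_args`, the file names, and the choice of a single static crop per clip are assumptions for the example; the real TRACK mode moves the window over time.

```python
from typing import List, Tuple


def vertical_crop(frame_w: int, frame_h: int, center_x: float) -> Tuple[int, int, int, int]:
    """Return (w, h, x, y) of a full-height 9:16 window centered near center_x."""
    crop_w = int(frame_h * 9 / 16)
    crop_w -= crop_w % 2               # even width suits common pixel formats
    crop_w = min(crop_w, frame_w)      # never wider than the source frame
    x = int(round(center_x - crop_w / 2))
    x = max(0, min(x, frame_w - crop_w))  # clamp the window inside the frame
    return crop_w, frame_h, x, 0


def ffmpeg_clip_args(src: str, start: float, end: float,
                     crop: Tuple[int, int, int, int], dst: str) -> List[str]:
    """Build an FFmpeg argument list that cuts [start, end] and applies the crop."""
    w, h, x, y = crop
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",           # seek to the clip start
        "-i", src,
        "-t", f"{end - start:.3f}",      # clip duration in seconds
        "-vf", f"crop={w}:{h}:{x}:{y}",  # FFmpeg crop filter: width:height:x:y
        "-c:a", "aac",
        dst,
    ]
```

For a 1920x1080 source, `vertical_crop(1920, 1080, 960)` yields a window 606 pixels wide centered in the frame, and the resulting list can be passed to `subprocess.run` on a machine with FFmpeg installed.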
The reframing engine was the hardest part. Naive
face tracking produces jittery, unwatchable output.
I built a SmoothedCameraman class with safe-zone
logic and a SpeakerTracker that prevents rapid
switching between detected faces. The system
pre-scans every scene to decide TRACK vs. GENERAL
before processing.
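To make the idea concrete, here is a minimal sketch of the three pieces described above: a camera that ignores small movements inside a safe zone and eases toward larger ones, a speaker lock that only switches after a new face persists for several frames, and a pre-scan rule for picking the mode. This is not the project's SmoothedCameraman or SpeakerTracker code; the class names `SafeZoneCamera` and `StickySpeaker`, the function `choose_mode`, and all default values (80 px zone, 0.1 smoothing, 15 frames, 0.8 ratio) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SafeZoneCamera:
    x: float                  # current camera center, in pixels
    safe_zone: float = 80.0   # half-width of the zone where the camera stays put
    smoothing: float = 0.1    # fraction of the excess distance covered per frame

    def update(self, target_x: float) -> float:
        offset = target_x - self.x
        if abs(offset) > self.safe_zone:
            # Ease only toward the point where the subject re-enters the zone.
            excess = offset - self.safe_zone if offset > 0 else offset + self.safe_zone
            self.x += self.smoothing * excess
        return self.x


@dataclass
class StickySpeaker:
    hold_frames: int = 15            # frames a new face must persist before a switch
    current: Optional[int] = None
    _candidate: Optional[int] = None
    _streak: int = 0

    def update(self, detected_id: int) -> int:
        if self.current is None or detected_id == self.current:
            self.current = detected_id
            self._candidate, self._streak = None, 0
            return self.current
        if detected_id == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = detected_id, 1
        if self._streak >= self.hold_frames:
            self.current = detected_id
            self._candidate, self._streak = None, 0
        return self.current


def choose_mode(face_counts: List[int], min_single_ratio: float = 0.8) -> str:
    """Pick TRACK when most sampled frames show exactly one face, else GENERAL."""
    if not face_counts:
        return "GENERAL"
    single = sum(1 for n in face_counts if n == 1)
    return "TRACK" if single / len(face_counts) >= min_single_ratio else "GENERAL"
```

The safe zone is what removes most of the jitter: detector noise of a few pixels never moves the camera at all, and larger moves are spread over many frames instead of snapping.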
Stack: Python/FastAPI backend, React/Vite
dashboard, Docker Compose for one-command setup.
All API keys (Gemini, ElevenLabs) stay client-side,
encrypted in localStorage — never stored on the
server.
Try it:
git clone ... && docker compose up --build
Then open localhost:5173 and enter a Gemini API key
and a YouTube URL.
MIT licensed. Feedback and PRs welcome.