Building A 2-Host Video Pipeline With AI
I wanted to move past short vertical videos.
Longer content needs a better format. A single robot voice reading a list is boring. People stop watching.
I built a system to create 10-minute videos with two hosts. They talk, they disagree, and they hand off topics naturally. This rhythm keeps people watching.
I built this from scratch to work inside GitHub Actions. It must run automatically every time I update a file.
Here is how the system works:
- Everything starts with a single JSON file.
- This file contains the script, the speakers, and the slide data.
- I use edge-tts for audio. It is free and requires no API keys.
- I use Pillow to turn JSON data into slide images.
- I use ffmpeg to stitch the audio and images into a video.
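The steps above can be sketched roughly as follows. This is a minimal illustration, not my exact code: the JSON schema (`segments`, `speaker`, `text`, `slide`), the function names, the slide layout, and the ffmpeg options are all assumptions chosen for the example.

```python
import json
from dataclasses import dataclass

# Hypothetical episode file: one entry per spoken line, with its slide data.
EXAMPLE = """
{"segments": [
  {"speaker": "A", "text": "Welcome back.", "slide": {"title": "Intro", "bullets": ["Two hosts"]}},
  {"speaker": "B", "text": "Let's get started.", "slide": {"title": "Agenda", "bullets": []}}
]}
"""


@dataclass
class Segment:
    speaker: str  # "A" or "B"
    text: str     # the line this host speaks
    slide: dict   # slide data, e.g. {"title": "...", "bullets": [...]}


def load_episode(raw: str) -> list[Segment]:
    """Parse the episode JSON (assumed schema) into a list of segments."""
    data = json.loads(raw)
    segments = []
    for i, item in enumerate(data["segments"]):
        if item["speaker"] not in ("A", "B"):
            raise ValueError(f"segment {i}: unknown speaker {item['speaker']!r}")
        segments.append(Segment(item["speaker"], item["text"], item.get("slide", {})))
    return segments


def render_slide(slide: dict, path: str, size=(1920, 1080)) -> None:
    """Draw a very plain slide with Pillow. Colors and positions are placeholders."""
    from PIL import Image, ImageDraw  # imported lazily so parsing works without Pillow

    img = Image.new("RGB", size, (18, 18, 28))
    draw = ImageDraw.Draw(img)
    # Uses Pillow's small default font; a real slide would load one via ImageFont.truetype.
    draw.text((120, 100), slide.get("title", ""), fill=(255, 255, 255))
    for n, bullet in enumerate(slide.get("bullets", [])):
        draw.text((160, 260 + 90 * n), f"- {bullet}", fill=(210, 210, 210))
    img.save(path)


def segment_cmd(slide_png: str, audio: str, out_mp4: str) -> list[str]:
    """ffmpeg arguments that hold one still image for the length of one audio clip."""
    return ["ffmpeg", "-y", "-loop", "1", "-i", slide_png, "-i", audio,
            "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
            "-c:a", "aac", "-shortest", out_mp4]


def concat_cmd(list_file: str, out_mp4: str) -> list[str]:
    """ffmpeg arguments that join the per-segment clips listed in list_file."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file,
            "-c", "copy", out_mp4]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Rendering one short clip per segment and then concatenating them is only one way to do the stitching.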
Key technical choices:
- Two Voices: I map Speaker A to one voice and Speaker B to another. I keep sentences under 25 words. This makes the AI sound more human.
- No Browsers: I do not use Playwright or Chrome to make slides. That takes too long in a CI pipeline. Pillow is much faster for rendering images.
- Smart Errors: I check the file size of every audio clip. Sometimes the TTS service returns an empty file. My script catches this before rendering, so a bad clip never breaks the video.
- Fast Rendering: A 10-minute video takes about 5 minutes to render in GitHub Actions. Most of that time is spent waiting for the audio API.
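Here is a minimal sketch of the two-voice mapping, the 25-word check, and the audio size check. The voice names, the 1 KB threshold, the retry count, and the helper names are assumptions for illustration. The only edge-tts calls used are `edge_tts.Communicate(text, voice)` and its `save()` method.

```python
import asyncio
import os
import re

# Assumed voice choices; any two distinct edge-tts voices would work.
VOICES = {"A": "en-US-GuyNeural", "B": "en-US-JennyNeural"}
MAX_WORDS = 25    # sentence length limit from the post
MIN_BYTES = 1024  # assumed threshold below which a clip counts as empty


def long_sentences(text: str, limit: int = MAX_WORDS) -> list[str]:
    """Return the sentences in text that have more than `limit` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > limit]


def clip_is_usable(path: str, min_bytes: int = MIN_BYTES) -> bool:
    """True if the audio file exists and is at least min_bytes in size."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


async def synthesize(speaker: str, text: str, path: str, retries: int = 3) -> None:
    """Generate one clip with the speaker's voice and fail loudly if it stays empty."""
    import edge_tts  # imported lazily so the pure helpers work without the package

    for attempt in range(1, retries + 1):
        await edge_tts.Communicate(text, VOICES[speaker]).save(path)
        if clip_is_usable(path):
            return
        await asyncio.sleep(2 * attempt)  # short pause before trying again (assumption)
    raise RuntimeError(f"audio for speaker {speaker} is still empty after {retries} tries")
```

A call such as `asyncio.run(synthesize("A", "Welcome back.", "clip_000.mp3"))` would produce one clip. Retrying before failing is my addition here; the key point is that an empty clip stops the run before ffmpeg sees it.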
The workflow is simple:
- I push a JSON file to a specific folder.
- GitHub Actions triggers the render.
- The system uploads the video to YouTube via API.
- The file moves to an uploaded folder.
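The hand-off at the end of this workflow might look like the sketch below. The folder layout and the `render` and `upload` callables are hypothetical placeholders: the real render step and the YouTube API call are not shown here.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_pending(
    pending: Path,
    uploaded: Path,
    render: Callable[[Path], Path],   # placeholder: turns a script file into a video file
    upload: Callable[[Path], None],   # placeholder: sends the video to YouTube
) -> list[Path]:
    """Render and upload every JSON script in `pending`, then move it to `uploaded`."""
    uploaded.mkdir(parents=True, exist_ok=True)
    done = []
    for script in sorted(pending.glob("*.json")):
        video = render(script)
        upload(video)
        # Move only after a successful upload, so a failed run leaves the script
        # in place to be picked up again.
        target = uploaded / script.name
        shutil.move(str(script), str(target))
        done.append(target)
    return done
```

Inside GitHub Actions the runner's checkout is temporary, so the move only persists if the workflow commits it back to the repository.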
This setup allows me to produce long-form educational content without manual editing. It turns a script into a finished video automatically.
Optional learning community: https://t.me/GyaanSetuAi
