Building A 2-Host Video Pipeline With AI

I wanted to move past short vertical videos.

Longer content needs a better format. A single robot voice reading a list is boring. People stop watching.

I built a system to create 10-minute videos with two hosts. They talk, they disagree, and they hand off topics naturally. This rhythm keeps people watching.

I built this from scratch to run inside GitHub Actions. The render starts automatically every time I push a script file.

Here is how the system works:

  • Everything starts with a single JSON file.
  • This file contains the script, the speakers, and the slide data.
  • I use edge-tts for audio. It is free and requires no API keys.
  • I use Pillow to turn JSON data into slide images.
  • I use ffmpeg to stitch the audio and images into a video.
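To make the single-file input concrete, here is a minimal sketch of how such a JSON file might be loaded and how one slide and one audio clip could be paired with ffmpeg. The field names (`segments`, `speaker`, `text`, `slide.title`) and both function names are assumptions for illustration, since the post does not show the actual schema. The ffmpeg flags are standard ones for turning a still image into video.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str      # "A" or "B"
    text: str         # the line this host says
    slide_title: str  # heading drawn on the slide for this line


def load_episode(raw: str) -> list[Segment]:
    """Parse the JSON text and reject segments that would break later steps."""
    data = json.loads(raw)
    segments = []
    for i, item in enumerate(data["segments"]):
        if item["speaker"] not in ("A", "B"):
            raise ValueError(f"segment {i}: unknown speaker {item['speaker']!r}")
        if not item["text"].strip():
            raise ValueError(f"segment {i}: empty text")
        segments.append(Segment(item["speaker"], item["text"], item["slide"]["title"]))
    return segments


def ffmpeg_segment_args(slide_png: str, audio_mp3: str, out_mp4: str) -> list[str]:
    """Build the ffmpeg command that pairs one still slide with one audio clip."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", slide_png,  # repeat the still image as video frames
        "-i", audio_mp3,
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                    # stop encoding when the audio ends
        out_mp4,
    ]


# Hypothetical episode file with one line per host.
EXAMPLE = """
{
  "title": "Why CI pipelines fail",
  "segments": [
    {"speaker": "A", "text": "Today we look at flaky builds.", "slide": {"title": "Flaky builds"}},
    {"speaker": "B", "text": "I disagree that they are rare.", "slide": {"title": "How common?"}}
  ]
}
"""
```

Calling `load_episode(EXAMPLE)` returns two `Segment` objects. The list from `ffmpeg_segment_args` can be passed to `subprocess.run` once the slide and audio files exist.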

Key technical choices:

  • Two Voices: I map Speaker A to one voice and Speaker B to another. I keep sentences under 25 words. This makes the AI sound more human.
  • No Browsers: I do not use Playwright or Chrome to make slides. That takes too long in a CI pipeline. Pillow is much faster for rendering images.
  • Smart Errors: I check the file size of every audio clip. Sometimes the API returns an empty file. My script catches this before the render fails.
  • Fast Rendering: A 10-minute video takes about 5 minutes to render in GitHub Actions. Most of that time is spent waiting for the audio API.
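The two-voice mapping, the 25-word limit, and the file-size check can be sketched as follows. This is an illustration under assumptions, not the original script: the two voice names, the 1024-byte threshold, and the retry loop are my choices, and all function names are hypothetical. The only edge-tts call used is `Communicate(...).save(...)`.

```python
from __future__ import annotations

import asyncio
import os
import re

# Assumed voice names; any two voices that edge-tts offers would do.
VOICES = {"A": "en-US-GuyNeural", "B": "en-US-JennyNeural"}
MAX_WORDS = 25
MIN_AUDIO_BYTES = 1024  # assumed cutoff for an "effectively empty" clip


def long_sentences(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return the sentences in text that exceed the word limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


def audio_is_usable(path: str, min_bytes: int = MIN_AUDIO_BYTES) -> bool:
    """True if the clip exists and is at least min_bytes in size."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


async def synthesize(speaker: str, text: str, out_path: str, retries: int = 3) -> None:
    """Generate one clip and retry if the saved file is too small (assumed policy)."""
    import edge_tts  # imported here so the helpers above work without it

    for attempt in range(1, retries + 1):
        await edge_tts.Communicate(text, voice=VOICES[speaker]).save(out_path)
        if audio_is_usable(out_path):
            return
        await asyncio.sleep(attempt)  # brief pause before trying again
    raise RuntimeError(f"empty audio for speaker {speaker} after {retries} attempts")
```

Running `long_sentences` over every line before any audio is requested keeps the word limit a cheap local check. `synthesize` needs network access and would be driven with `asyncio.run(...)`.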

The workflow is simple:

  1. I push a JSON file to a specific folder.
  2. GitHub Actions triggers the render.
  3. The system uploads the video to YouTube via API.
  4. The file moves to an uploaded folder.

This setup allows me to produce long-form educational content without manual editing. It turns a script into a finished video automatically.

Source: https://dev.to/morinaga/what-i-learned-building-a-scripted-two-host-video-pipeline-with-edge-tts-and-ffmpeg-41o6
