You Can't Benchmark AI With Real Meetings

I wanted to find the best AI notetaker, so I compared Granola, Fathom, and Otter.

I started by recording a real meeting. I ran the recording through all three tools. Then I realized my experiment was useless.

To score a transcript, you need a correct version to compare it against. In a real meeting, the only record of what happened is the transcript itself. I was grading the exam using the students' own answers. I had no answer key.

If you lack ground truth, manufacture it.

So I wrote the script for a two-person meeting first, then used ElevenLabs to turn that text into audio. Every word in the recording is a word I typed, so the script itself is a perfect answer key.
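
A minimal sketch of that idea in Python: keep the script as structured data, so the same source produces both the answer key and the text you hand to a text-to-speech service. The speaker names, the lines, and the `Turn`, `reference_transcript`, and `tts_jobs` names are all hypothetical, and the synthesis call itself is left out because it depends on the service you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


# Hypothetical script lines, written for illustration only.
SCRIPT = [
    Turn("Alice", "Q3 churn came in at 5.2%, down from 6.8% in Q2."),
    Turn("Bob", "Good. Should we move the plan from $16 to $19?"),
    Turn("Alice", "Let's check p95 latency on SSO logins first."),
]


def reference_transcript(script):
    """Return the answer key: one 'Speaker: text' line per turn."""
    return "\n".join(f"{turn.speaker}: {turn.text}" for turn in script)


def tts_jobs(script):
    """Group lines by speaker so each speaker can be given one voice.

    The turn index is kept so the audio clips can be put back in order.
    """
    jobs = {}
    for index, turn in enumerate(script):
        jobs.setdefault(turn.speaker, []).append((index, turn.text))
    return jobs


print(reference_transcript(SCRIPT))
```

Because the answer key and the audio come from the same list, they cannot drift apart when you edit the script.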

I stuffed the script with difficult terms:

  • Quarter labels (Q3, Q2)
  • Percentages (5.2%, 6.8%)
  • Dollar figures ($16 to $19)
  • Jargon (churn, cohort, SSO, p95)
  • Names and deadlines
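
One way to turn these categories into a checklist is to pull the hard terms out of the script automatically. The sketch below uses regular expressions I made up for this purpose. The patterns, the `critical_tokens` name, and the sample sentence are assumptions, and names and deadlines are left out because they need a hand-written list.

```python
import re

# Illustrative patterns for the categories above; jargon is a fixed word list.
PATTERNS = {
    "quarter": re.compile(r"\bQ[1-4]\b"),
    "percent": re.compile(r"\d+(?:\.\d+)?%"),
    "dollar": re.compile(r"\$\d+(?:\.\d+)?"),
    "jargon": re.compile(r"\b(?:churn|cohort|SSO|p95)\b"),
}


def critical_tokens(script_text):
    """Return every match per category, in the order it appears in the script."""
    return {name: pattern.findall(script_text) for name, pattern in PATTERNS.items()}


# Hypothetical sentence, not taken from the original test script.
sample = "Q3 churn was 5.2%, versus 6.8% in Q2. Pricing moves from $16 to $19."
for category, found in critical_tokens(sample).items():
    print(category, found)
```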

Here is what I learned from the results:

All three tools scored well on raw accuracy. Otter hit 99%. Fathom was the most precise with numbers and jargon. Granola kept the meaning but garbled a few lines.
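
The post does not say how raw accuracy was computed. A common choice is one minus the word error rate (WER): the word-level edit distance between the script and the transcript, divided by the number of words in the script. The sketch below assumes that definition. The `words` and `word_error_rate` names and the sample sentences are illustrative.

```python
PUNCTUATION = ".,!?;:\"'"


def words(text):
    """Lowercase, split on whitespace, and trim sentence punctuation."""
    stripped = (raw.strip(PUNCTUATION) for raw in text.lower().split())
    return [word for word in stripped if word]


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "Q3 churn came in at 5.2%."
hypothesis = "Q churn came in at 5.2%."
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}, accuracy: {1 - wer:.3f}")
```

In this example one wrong word out of six gives a WER of 1/6. On a full-length script the same single error would barely register, even though the quarter label is lost.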

But raw accuracy is the wrong metric to choose on. It is only the baseline. The real differences show up in two areas:

  1. Meaningful tokens: Otter had high accuracy but turned "Q3" into "Q". One dropped character barely moves the accuracy score, but in a business meeting it erases which quarter the numbers belong to.
  2. Speaker attribution: Otter was the only tool that correctly identified who spoke when. Granola gave me one long stream of text without names.
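
Both checks can be scored separately from overall accuracy. The sketch below is one possible way to do it, not the method used in the original test. It assumes you have a list of critical terms from your script, and that the tool's turns have already been aligned one-to-one with the script's turns and its speaker labels mapped to your names. A tool that gives no speaker labels is represented with `None`. All function names and sample data are hypothetical.

```python
def tokens(text):
    """Split on whitespace and trim sentence punctuation, keeping '$' and '%'."""
    return [raw.strip(".,!?;:") for raw in text.split()]


def critical_recall(critical_terms, hypothesis):
    """Return (fraction of critical terms found as whole tokens, missed terms)."""
    if not critical_terms:
        raise ValueError("critical_terms must not be empty")
    present = set(tokens(hypothesis))
    missed = [term for term in critical_terms if term not in present]
    return 1 - len(missed) / len(critical_terms), missed


def speaker_accuracy(reference_speakers, hypothesis_speakers):
    """Fraction of aligned turns whose speaker label matches the script."""
    if len(reference_speakers) != len(hypothesis_speakers):
        raise ValueError("turns must be aligned one-to-one before scoring")
    if not reference_speakers:
        raise ValueError("need at least one turn")
    pairs = zip(reference_speakers, hypothesis_speakers)
    return sum(ref == hyp for ref, hyp in pairs) / len(reference_speakers)


# Made-up transcript fragment that drops the digit from "Q3".
recall, missed = critical_recall(
    ["Q3", "5.2%", "$19"],
    "Q churn came in at 5.2%, and pricing moves to $19.",
)
print(f"critical recall: {recall:.2f}, missed: {missed}")

# A transcript with no speaker labels scores zero on attribution.
print(speaker_accuracy(["Alice", "Bob", "Alice"], [None, None, None]))
```

Matching whole tokens is deliberate: a substring check would count "Q" as a hit for "Q3" and hide exactly the error described above.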

The "best" tool depends on your goal:

  • Use Otter if you need to know who said what.
  • Use Fathom if you need perfect numbers and jargon.
  • Use Granola if you want a bot-free experience for solo notes.

You can use this method for any speech-to-text testing. Script your audio to get a repeatable test. Add difficult words to see where models fail. Rerun the same clip later to see whether a vendor has actually improved its model.
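
For the over-time comparison, it is enough to log each run of the same clip and diff the latest score against the previous one. A rough sketch follows. The record layout, the `change_since_last_run` name, the vendor label, and the numbers are all placeholders, not measured results.

```python
def change_since_last_run(history, vendor, metric):
    """Return latest minus previous value of a metric for one vendor.

    Returns None when the vendor has fewer than two recorded runs.
    """
    runs = [run for run in history if run["vendor"] == vendor]
    runs.sort(key=lambda run: run["date"])  # ISO dates sort correctly as strings
    if len(runs) < 2:
        return None
    return runs[-1][metric] - runs[-2][metric]


# Placeholder history for one clip; the values are invented for illustration.
history = [
    {"vendor": "vendor_a", "date": "2025-01-01", "wer": 0.04},
    {"vendor": "vendor_a", "date": "2025-04-01", "wer": 0.02},
    {"vendor": "vendor_b", "date": "2025-04-01", "wer": 0.03},
]

delta = change_since_last_run(history, "vendor_a", "wer")
print(f"vendor_a WER change: {delta:+.3f}")
```

A negative change in WER means the newer run made fewer errors on the identical audio, which is the only comparison that isolates the model from the meeting.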

Synthetic audio is clean and easy to produce, so it is not a perfect simulation of a messy four-person meeting. But it gives you a controlled baseline for comparing tools against each other.

Source: https://dev.to/tiennguyenftuk52/you-cant-benchmark-an-ai-notetaker-against-a-real-meeting-you-dont-know-the-right-answer-so-i-3llo
