Evaluating a C# LLM EventParser with Promptfoo

AI-assisted draft.

Testing regular code is simple. You call a function, get a result, and check if it matches your expectation.

Testing LLMs is different. An LLM might return "3 PM" in one run and "15:00" in another. Both are correct, but an exact-match test will reject one of them. You need to check whether the answer is good, not whether it is identical.
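To make the "3 PM" versus "15:00" problem concrete, here is a minimal Python sketch. It is only an illustration, not part of EventParser, and `parse_time` is a hypothetical helper. It shows that the two strings differ as text but describe the same time once parsed.

```python
from datetime import datetime, time

def parse_time(text: str) -> time:
    """Parse strings like '3 PM', '3:30 PM', or '15:00' into a time value."""
    for fmt in ("%I %p", "%I:%M %p", "%H:%M"):
        try:
            return datetime.strptime(text.strip(), fmt).time()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized time format: {text!r}")

first_run, second_run = "3 PM", "15:00"
print(first_run == second_run)                          # False: exact match fails
print(parse_time(first_run) == parse_time(second_run))  # True: same meaning
```

Hand-written normalization like this only covers formats you thought of in advance. For free-form fields such as a title or a location, that does not scale, which is where a judge model comes in.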

I built a small app called EventParser to test this. It takes a casual message like "Team sync on Friday at 3 PM" and turns it into structured data.

Here is how you can test it using Promptfoo and an LLM-as-a-judge workflow.

The Setup

The app uses a single prompt file: extract_event.txt. The C# code reads this file at runtime. Promptfoo reads the same file for testing. This ensures you test the exact prompt the app sends to the model, not a copy that can drift.
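The "one file, two readers" idea can be sketched as follows. This is Python rather than the app's C#, and it rests on assumptions: the contents of extract_event.txt are not shown in this article, so the demo writes a stand-in file, and the `{{message}}` placeholder name is hypothetical.

```python
import tempfile
from pathlib import Path

PLACEHOLDER = "{{message}}"  # assumed placeholder name, not confirmed by the article

def load_prompt(path) -> str:
    """Read the prompt template from disk, as both the app and the tests would."""
    return Path(path).read_text(encoding="utf-8")

def render_prompt(template: str, message: str) -> str:
    """Fill the user's message into the template."""
    if PLACEHOLDER not in template:
        raise ValueError("prompt template has no message placeholder")
    return template.replace(PLACEHOLDER, message)

# Demo with a stand-in file; the real prompt text will differ.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "extract_event.txt"
    prompt_path.write_text(
        "Extract the event from this message: {{message}}", encoding="utf-8"
    )
    template = load_prompt(prompt_path)
    print(render_prompt(template, "Team sync on Friday at 3 PM"))
```

Because both sides load the same path, a prompt edit shows up in the next evaluation run without any copy and paste.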

The Workflow

Instead of a human checking every output, we use a judge model. This process uses two roles:

• The model under test: The model providing the answer.
• The judge model: A faster, cheaper model that grades the answer.

How the Judge Decides

The judge uses a rubric. A rubric is a plain English rule. Instead of checking for a specific JSON string, you tell the judge what the answer should contain.

Example Rubric: "The answer should extract the event title, day, time, and location. It must not add details not mentioned in the message."
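The two roles and the rubric can be sketched in a few lines of Python. This shows the general LLM-as-a-judge pattern, not Promptfoo's internals or the EventParser code. The names `build_judge_prompt` and `grade`, the wording of the judge prompt, and both stub models are hypothetical. A real setup would replace the stubs with calls to actual models.

```python
import json
from typing import Callable

RUBRIC = (
    "The answer should extract the event title, day, time, and location. "
    "It must not add details not mentioned in the message."
)

def build_judge_prompt(message: str, answer: str, rubric: str) -> str:
    """Assemble the text sent to the judge model (illustrative wording)."""
    return (
        "You are grading an answer against a rubric.\n"
        f"Rubric: {rubric}\n"
        f"Original message: {message}\n"
        f"Answer to grade: {answer}\n"
        'Reply with JSON: {"pass": true or false, "reason": "..."}'
    )

def grade(
    message: str,
    model_under_test: Callable[[str], str],
    judge: Callable[[str], str],
    rubric: str = RUBRIC,
) -> dict:
    """Run the model under test, then ask the judge for a verdict."""
    answer = model_under_test(message)
    verdict = json.loads(judge(build_judge_prompt(message, answer, rubric)))
    return {"answer": answer, "pass": bool(verdict["pass"]), "reason": verdict["reason"]}

# Stubs standing in for real model calls, so the sketch runs offline.
def fake_model(message: str) -> str:
    return '{"title": "Team sync", "day": "Friday", "time": "15:00", "location": null}'

def fake_judge(prompt: str) -> str:
    return '{"pass": true, "reason": "stub verdict"}'

result = grade("Team sync on Friday at 3 PM", fake_model, fake_judge)
print(result["pass"], result["reason"])  # True stub verdict
```

The judge sees the original message, the answer, and the rubric together. That is what lets it accept "15:00" for "3 PM" while still rejecting details the message never contained.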

Testing for Errors

I intentionally broke the prompt by adding a bad instruction: "If the message mentions coffee, set the location to Starbucks."

When I ran the evaluation, the judge caught the error. The original message did not mention Starbucks, so the location in the output was invented rather than extracted. An exact-match test would miss this, but a judge model catches semantic errors.
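To show what "a location the message never mentioned" looks like in data, here is a small deterministic check in Python. This is a simpler, swapped-in technique, not what the judge model does internally. It only approximates the "must not add details" half of the rubric with a substring test. The sample message, the JSON field names, and `ungrounded_fields` are all hypothetical.

```python
import json
from typing import List, Sequence

def ungrounded_fields(
    message: str, extracted_json: str, fields: Sequence[str] = ("location",)
) -> List[str]:
    """Return the fields whose value does not appear anywhere in the message."""
    data = json.loads(extracted_json)
    text = message.lower()
    flagged = []
    for field in fields:
        value = data.get(field)
        # An empty or null value is fine: nothing was invented.
        if value and str(value).lower() not in text:
            flagged.append(field)
    return flagged

# Hypothetical input and the output produced under the broken prompt.
message = "Coffee with Sam on Friday at 3 PM"
output = '{"title": "Coffee with Sam", "day": "Friday", "time": "3 PM", "location": "Starbucks"}'
print(ungrounded_fields(message, output))  # ['location']
```

A substring test is too strict for fields the model may legitimately rephrase, such as the time, so it complements the judge rather than replacing it.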

Why this works:

• It matches reality: It accepts various correct formats like "3 PM" or "15:00".
• It uses readable rules: Plain English rubrics are easy to understand.
• It catches meaning bugs: It finds hallucinations and logic errors.
• It is cost-effective: You can use a cheap model to grade a more expensive one.

This approach makes LLM testing feel like real software testing.

Source: https://dev.to/bigboybamo/evaluating-a-c-llm-eventparser-with-promptfoo-4b87

