Evaluating a C# LLM EventParser with Promptfoo
Testing regular code is simple. You call a function, get a result, and check if it matches your expectation.
Testing LLMs is different. An LLM might return "3 PM" in one run and "15:00" in another. Both are correct, but an exact-match test will fail on one of them. You need to check whether the answer is good, not whether it is identical.
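To make the "3 PM" versus "15:00" problem concrete, here is a minimal sketch in Python (chosen only for brevity; the app itself is C#). The helper names normalize_time, exact_match, and same_time are my own illustrations, not part of EventParser or Promptfoo.

```python
from datetime import datetime


def normalize_time(text: str) -> str:
    """Return a 24-hour HH:MM string for inputs like '3 PM' or '15:00'."""
    cleaned = text.strip().upper()
    # Try a few common formats; this list is illustrative, not exhaustive.
    for fmt in ("%I %p", "%I:%M %p", "%H:%M"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%H:%M")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized time format: {text!r}")


def exact_match(expected: str, actual: str) -> bool:
    """The classic unit-test comparison: strings must be identical."""
    return expected == actual


def same_time(expected: str, actual: str) -> bool:
    """Compare meaning instead of spelling by normalizing both sides first."""
    return normalize_time(expected) == normalize_time(actual)
```

Hand-written normalizers like this only cover the formats you thought of in advance, which is one reason to let a judge model assess meaning instead.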
I built a small app called EventParser to test this. It takes a casual message like "Team sync on Friday at 3 PM" and turns it into structured data.
Here is how you can test it using Promptfoo and an LLM-as-a-judge workflow.
The Setup
The app uses a single prompt file: extract_event.txt. The C# code reads this file at runtime, and Promptfoo reads the same file for testing. This ensures you test the actual prompt the app sends to the model, not a copy that can drift out of sync.
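The single-source-of-truth idea can be sketched as follows, again in Python rather than the app's C#. I am assuming the prompt file contains a {{message}} placeholder; Promptfoo does use double-brace variables in prompt templates, but the variable name message and the render_prompt helper are hypothetical.

```python
from pathlib import Path

# Same file the test tool reads; the path is taken from the article.
PROMPT_PATH = Path("extract_event.txt")


def render_prompt(message: str, prompt_path: Path = PROMPT_PATH) -> str:
    """Load the shared prompt template and fill in the user's message.

    Assumes the template marks the insertion point with {{message}}.
    """
    template = prompt_path.read_text(encoding="utf-8")
    return template.replace("{{message}}", message)
```

Because the application and the evaluation both load the file at run time, editing extract_event.txt changes what is tested and what ships in the same step.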
The Workflow
Instead of a human checking every output, we use a judge model. This process uses two roles:
• The model under test: The model providing the answer.
• The judge model: A faster, cheaper model that grades the answer.
How the Judge Decides
The judge uses a rubric. A rubric is a plain English rule. Instead of checking for a specific JSON string, you tell the judge what the answer should contain.
Example Rubric: "The answer should extract the event title, day, time, and location. It must not add details not mentioned in the message."
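Promptfoo handles rubric grading for you through its llm-rubric assertion, so you would not normally write this yourself. Purely to show the shape of the idea, here is a rough Python sketch of how a rubric, the original message, and the answer might be combined into a grading request. The prompt wording, the JSON verdict format, and the names build_judge_prompt, grade, and fake_judge are all my assumptions, not Promptfoo internals.

```python
import json
from typing import Callable, Tuple

RUBRIC = (
    "The answer should extract the event title, day, time, and location. "
    "It must not add details not mentioned in the message."
)


def build_judge_prompt(rubric: str, message: str, answer: str) -> str:
    """Assemble the text sent to the judge model (illustrative wording)."""
    return (
        "You are grading the output of another model.\n"
        f"Rubric: {rubric}\n"
        f"Original message: {message}\n"
        f"Answer to grade: {answer}\n"
        'Reply with JSON only: {"pass": true or false, "reason": "..."}'
    )


def grade(rubric: str, message: str, answer: str,
          judge: Callable[[str], str]) -> Tuple[bool, str]:
    """Ask the judge for a verdict and parse its JSON reply."""
    raw = judge(build_judge_prompt(rubric, message, answer))
    verdict = json.loads(raw)
    return bool(verdict["pass"]), str(verdict.get("reason", ""))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    failed = "Starbucks" in prompt
    return json.dumps({
        "pass": not failed,
        "reason": "Location not in message" if failed else "",
    })
```

In a real run the judge argument would wrap a call to the cheaper grading model, and the rubric string is the only part you maintain by hand.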
Testing for Errors
I intentionally broke the prompt by adding a bad instruction: "If the message mentions coffee, set the location to Starbucks."
When I ran the evaluation, the judge caught the error. The original message did not mention Starbucks, so the model had invented a location. An exact-match test cannot tell this kind of mistake apart from a harmless formatting difference such as "15:00" instead of "3 PM", but a judge model catches semantic errors.
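The kind of check the judge performs here can be approximated, very crudely, in code. The Python sketch below flags any extracted field whose value never appears in the message. The ungrounded_fields helper, the dictionary keys, and the sample message are hypothetical; the article does not show the exact input or output I used.

```python
from typing import Dict, List, Optional


def ungrounded_fields(message: str, event: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of fields whose values do not appear in the message.

    This is a naive substring check, not a semantic one.
    """
    text = message.lower()
    return [
        name
        for name, value in event.items()
        if value and str(value).lower() not in text
    ]


# Hypothetical output produced by the deliberately broken prompt.
message = "Coffee with Sam on Friday at 3 PM"
event = {
    "title": "Coffee with Sam",
    "day": "Friday",
    "time": "3 PM",
    "location": "Starbucks",  # never mentioned in the message
}
flagged = ungrounded_fields(message, event)
```

A substring check like this would also flag a perfectly valid "15:00", because that string is not in the message either. That limitation is exactly why the rubric is handed to a judge model rather than to string matching.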
Why this works:
• It matches reality: It accepts various correct formats like "3 PM" or "15:00".
• It uses readable rules: Plain English rubrics are easy to understand.
• It catches meaning bugs: It finds hallucinations and logic errors.
• It is cost effective: You can use a cheap model to grade a more expensive one.
This approach makes LLM testing feel like real software testing.
Source: https://dev.to/bigboybamo/evaluating-a-c-llm-eventparser-with-promptfoo-4b87
