Evaluating Agentic AI In The Age Of LLM Benchmarks
Most AI tests follow a simple pattern. You give a model a prompt. You compare the answer to a reference. You score the result.
This works for summaries. It works for classification. It fails when a model must act in a changing environment.
The Age of LLM paper introduces a better way to test this. It is a 1v1 game on a grid. Two models compete under a fog of war. They cannot see everything. They must scout or guess to find enemy units. They must use diplomacy to propose deals or ultimatums.
Every move must follow a strict JSON schema. If a move is illegal, the system discards it.
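To make the "discard illegal moves" rule concrete, here is a minimal sketch of a move validator. The field names (action, unit_id, target), the allowed actions, and the grid size are assumptions for illustration. They are not the paper's actual schema.

```python
import json

# Hypothetical move format. The paper's real schema is not reproduced here.
ALLOWED_ACTIONS = {"move", "attack", "scout"}
GRID_SIZE = 8  # assumed board size
REQUIRED_KEYS = {"action", "unit_id", "target"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def parse_move(raw):
    """Return the validated move as a dict, or None if it must be discarded."""
    try:
        move = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(move, dict) or set(move) != REQUIRED_KEYS:
        return None  # missing or extra fields
    if not isinstance(move["action"], str) or move["action"] not in ALLOWED_ACTIONS:
        return None  # unknown action
    if not _is_int(move["unit_id"]):
        return None
    target = move["target"]
    if not (isinstance(target, list) and len(target) == 2):
        return None
    if not all(_is_int(c) and 0 <= c < GRID_SIZE for c in target):
        return None  # target is off the grid
    return move
```

A harness built this way can also count how often each model's moves come back as None. That count is a direct measure of action validity.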
This test measures specific skills:
- State tracking: Does the model remember what it saw and what it lost?
- Belief management: Does it act sensibly with incomplete info?
- Action validity: Does it follow the rules of the environment?
- Long-horizon strategy: Can it pick a sequence of moves that leads to a goal?
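State tracking and belief management can be pictured as a small data structure. The sketch below is an assumption about what a well-behaved agent would need to keep, not something taken from the paper. The class names and the max_age threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Sighting:
    position: Tuple[int, int]
    turn_seen: int


@dataclass
class BeliefState:
    """What the agent believes about units it cannot currently see."""
    sightings: Dict[str, Sighting] = field(default_factory=dict)
    lost_units: Set[str] = field(default_factory=set)

    def observe(self, unit_id: str, position: Tuple[int, int], turn: int) -> None:
        # State tracking: remember the latest sighting of each enemy unit.
        self.sightings[unit_id] = Sighting(position, turn)

    def record_loss(self, unit_id: str) -> None:
        # State tracking: remember which of our own units are gone.
        self.lost_units.add(unit_id)

    def staleness(self, unit_id: str, current_turn: int) -> Optional[int]:
        # Number of turns since the unit was last seen, or None if never seen.
        sighting = self.sightings.get(unit_id)
        return None if sighting is None else current_turn - sighting.turn_seen

    def is_reliable(self, unit_id: str, current_turn: int, max_age: int = 3) -> bool:
        # Belief management: an old sighting should not be treated as fact.
        age = self.staleness(unit_id, current_turn)
        return age is not None and age <= max_age
```

The point of is_reliable is that a position seen many turns ago is a guess, not a fact. A model that acts on stale sightings as if they were current is failing at belief management.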
A model might sound fluent but fail in practice. It might forget its state or issue invalid tool calls.
The results show a pattern. Many models fall into simple traps under uncertainty. Most chose aggressive military moves. Models attempted diplomacy, but agreements were rarely completed. Many errors came from poor state tracking.
Standard benchmarks miss these failures. A model can write a great explanation but fail to track a hidden unit. You only see this when the environment forces the model to act.
Current AI work often focuses on tool use. Tool use is necessary, but it is not enough. A real agent must maintain context and recover when things change.
The industry is shifting from chat quality to outcomes. Useful systems are measured by whether they complete work, not by how much polished prose they produce.
If an agent cannot maintain a belief state, it is not strategic. If it cannot follow a schema, its tool use is brittle.
Real agentic capability requires two things:
- The ability to plan.
- The ability to execute under uncertainty.
In software, bad output is a bug. In AI agents, bad output is often a silent failure. A tool call does nothing. A hidden assumption is wrong. If you only score the final answer, you miss the problem.
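One way to surface silent failures is to score the whole trace, not just the final answer. This is a rough sketch. The step fields (valid, state_changed) are hypothetical and assume the harness logs them for every tool call.

```python
from typing import Dict, List


def score_trace(steps: List[Dict], final_correct: bool) -> Dict:
    """Report step-level problems alongside the final-answer score."""
    # Calls the environment rejected outright.
    invalid = [i for i, s in enumerate(steps) if not s["valid"]]
    # Calls that were accepted but changed nothing: the tool call did nothing.
    no_effect = [
        i for i, s in enumerate(steps)
        if s["valid"] and not s["state_changed"]
    ]
    return {
        "final_correct": final_correct,
        "invalid_steps": invalid,
        "no_effect_steps": no_effect,
        "silent_failures": len(invalid) + len(no_effect),
    }
```

A run can have final_correct set to True and still report silent failures. Final-answer scoring alone would never show that.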
We must test for:
- Partial observability
- Hidden state
- Long-horizon coordination
- Action validity
- Recovery from mistakes
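Partial observability and hidden state are simple to build into a test harness. The sketch below filters what an agent is allowed to see each turn. The vision radius and the square (Chebyshev) distance rule are assumptions, not the game's actual fog-of-war rules.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def visible_enemies(
    own_units: List[Position],
    enemy_units: Dict[str, Position],
    vision_radius: int = 2,  # assumed value
) -> Dict[str, Position]:
    """Return only the enemy units within vision_radius of any own unit."""
    def in_sight(pos: Position) -> bool:
        # Chebyshev distance: a square window around each own unit.
        return any(
            max(abs(pos[0] - u[0]), abs(pos[1] - u[1])) <= vision_radius
            for u in own_units
        )

    return {uid: pos for uid, pos in enemy_units.items() if in_sight(pos)}
```

The harness keeps the full state. The agent only ever receives the filtered view, so anything it knows about the rest of the board has to come from its own memory.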
Evaluation must move closer to how these systems work in the real world.
