New Benchmark for Cloud Tasks

AI performance is uneven. A model might top coding benchmarks but fail at cloud work, often by inventing resources that do not exist.

Current benchmarks cover coding and reasoning. No benchmark exists for cloud management tasks.

We are building that benchmark.

We test tools like Codex and Claude Code. Our first test runs on AWS. The test template is designed to carry over to Azure and GCP later.

Our Methodology

We use Infrastructure as Code (IaC) as the answer key. Terraform builds the resources, and its output provides the ground truth: we know the exact resource IDs that should exist. This removes human error from labeling. Anyone can run the same stack to get the same result.
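As a minimal sketch of how the answer key could be read, the snippet below parses the JSON that `terraform output -json` prints and collects every ID into a set. The output names and resource IDs are hypothetical, and it assumes each output holds either a single ID string or a list of ID strings.

```python
import json


def load_answer_key(terraform_output_json: str) -> set[str]:
    """Collect every resource ID from `terraform output -json` text."""
    outputs = json.loads(terraform_output_json)
    ids: set[str] = set()
    for entry in outputs.values():
        value = entry["value"]
        # Assumption: an output is one ID string or a list of ID strings.
        if isinstance(value, str):
            ids.add(value)
        else:
            ids.update(value)
    return ids


# Hypothetical output names and IDs, for illustration only.
sample = json.dumps({
    "unattached_volume_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["vol-0aaa", "vol-0bbb"],
    },
    "unused_eip_id": {
        "sensitive": False,
        "type": "string",
        "value": "eipalloc-0ccc",
    },
})
answer_key = load_answer_key(sample)
```

Keeping the key as a set makes later comparison with the agent's report a matter of plain set operations.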

We test two variables:

  • Size: Small accounts, medium accounts, and large accounts with thousands of dependencies.
  • History: New accounts with pure IaC and old accounts with messy tags and manual changes.

A tool that only works on small, clean accounts fails in real production environments.
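Crossing the two variables gives a small grid of account profiles. The sketch below enumerates that grid; the labels and the `build_matrix` helper are illustrative names, not part of the benchmark itself.

```python
from itertools import product

# Labels are illustrative; the benchmark's own naming may differ.
SIZES = ["small", "medium", "large"]
HISTORIES = ["new", "old"]


def build_matrix(sizes, histories):
    """Return one account profile per (size, history) combination."""
    return [{"size": s, "history": h} for s, h in product(sizes, histories)]


matrix = build_matrix(SIZES, HISTORIES)
```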

We keep the agent contained. It runs in a single container with read-only credentials. We use CloudTrail to track every action. We repeat every test three times so that a transient network error is not mistaken for an agent failure.
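One way to confirm containment after a run is to scan the CloudTrail records for anything not marked read-only. This is a sketch that assumes the records are already parsed into dictionaries and carry CloudTrail's `readOnly` field; the sample events and the `find_write_events` helper are hypothetical.

```python
def find_write_events(events):
    """Return CloudTrail records that are not marked read-only."""
    # Accepts both boolean and string forms of readOnly. A missing
    # field is treated as suspicious, which is an assumption.
    return [e for e in events if str(e.get("readOnly")).lower() != "true"]


# Hypothetical records, trimmed to the fields used here.
events = [
    {"eventName": "DescribeVolumes", "eventSource": "ec2.amazonaws.com", "readOnly": True},
    {"eventName": "DescribeAddresses", "eventSource": "ec2.amazonaws.com", "readOnly": "true"},
    {"eventName": "DeleteVolume", "eventSource": "ec2.amazonaws.com", "readOnly": False},
]
violations = find_write_events(events)
```

An empty result would suggest the agent stayed within its read-only role for that run.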

We classify every outcome:

  • Found: The agent saw the resource.
  • Missed: The agent failed to see it.
  • Flagged: The agent reported a resource that is actually in use.
  • Fabricated: The agent invented a resource ID that does not exist.
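The four categories map onto set operations. The sketch below assumes three hypothetical inputs: the IDs the agent reported, the IDs planted as targets, and the IDs of resources in active use. It also assumes any reported ID in neither of the last two sets does not exist in the account.

```python
def classify(reported: set[str], planted: set[str], active: set[str]) -> dict[str, set[str]]:
    """Sort agent output into the four categories with set operations."""
    return {
        "found": reported & planted,          # planted and reported
        "missed": planted - reported,         # planted but not reported
        "flagged": reported & active,         # reported but actually in use
        # Assumption: an ID in neither set does not exist in the account.
        "fabricated": reported - planted - active,
    }


# Hypothetical IDs, for illustration only.
planted = {"vol-0aaa", "vol-0bbb", "eipalloc-0ccc"}
active = {"vol-0ddd"}
reported = {"vol-0aaa", "vol-0ddd", "vol-0zzz"}
result = classify(reported, planted, active)
```

In an old account with manual changes, the assumption behind "fabricated" would need a check against the real inventory before an ID is counted as invented.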

Our first task focuses on AWS waste discovery. We use Terraform to plant unattached volumes and unused IPs. We also add active resources to see whether the agent wrongly reports them as waste.

Waste discovery is the first test because it saves money and is easy to score. Future tests will cover security audits and architecture reconstruction.

We will publish our full process, including raw logs and prompts. We will share results even if they are bad.

We need your feedback.

Where is this method weak? What makes a test feel like a real account? What task should we test next?

Source: https://dev.to/rachcorp/new-benchmark-for-cloud-tasks-4o1