New AA-Briefcase Benchmark Reveals AI’s Struggle With Real Knowledge Work


AI-assisted draft.



Large Language Models (LLMs) look increasingly capable in standard evaluations, but new data suggests they remain unprepared for the complexity of professional environments. A new benchmark exposes a wide gap between pattern recognition and the actual execution of multi-step, information-dense knowledge work.

The AA-Briefcase Benchmark: Simulating the Real World

Traditional AI benchmarks often rely on isolated questions or static datasets that do not reflect the messy reality of a modern office. To bridge this gap, Artificial Analysis introduced the AA-Briefcase benchmark, a rigorous testing framework designed to simulate long-form, multi-week projects.

Instead of simple prompts, models are tasked with navigating thousands of fragmented source files, including Slack threads, email chains, meeting transcripts, and large-scale data exports. This requires the model to perform high-level reasoning, synthesize disparate data points, and maintain context across massive, unstructured datasets. These are skills essential for analysts, lawyers, and engineers.

Why Even Top Models Are Failing

The results are sobering for those expecting immediate AI autonomy in the workplace. Even the most advanced model tested, Anthropic’s Claude Fable 5, managed to fully solve only 3 percent of the tasks presented. The benchmark revealed that on 31 out of 91 specific tasks, not a single model could even clear a 50 percent pass rate.
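To make the "no model clears 50 percent" statistic concrete, here is a minimal sketch of how such a count could be computed. The task names, model names, and pass rates are invented for illustration and are not benchmark data; the sketch also assumes that "clearing" the bar means strictly exceeding it, which the article does not specify.

```python
# Hypothetical pass rates (fraction of runs passed) per task and model.
# These values are invented for illustration; they are NOT AA-Briefcase data.
pass_rates = {
    "task_a": {"model_x": 0.80, "model_y": 0.35},
    "task_b": {"model_x": 0.40, "model_y": 0.10},
    "task_c": {"model_x": 0.45, "model_y": 0.20},
}

THRESHOLD = 0.50  # the 50 percent bar mentioned in the article


def unsolved_tasks(rates, threshold):
    """Return the tasks on which no model's pass rate exceeds the threshold."""
    return sorted(
        task
        for task, by_model in rates.items()
        if all(rate <= threshold for rate in by_model.values())
    )


stuck = unsolved_tasks(pass_rates, THRESHOLD)
print(f"Toy data: {len(stuck)} of {len(pass_rates)} tasks unsolved: {stuck}")

# The article's reported figure, expressed as a share of all tasks.
reported_share = 31 / 91
print(f"Reported: 31 of 91 tasks, about {reported_share:.0%}")
```

On the reported figures, roughly a third of the benchmark's tasks fall into this category.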

The research highlights a fascinating shift in how AI fails as intelligence scales. "Weaker" models tend to suffer from "loud" failures: they choke on basic execution, miss relevant files entirely, or produce outputs that are fundamentally unusable. In contrast, "stronger" models like Claude Fable 5 fail more "quietly." These high-tier models hit the obvious requirements and maintain professional formatting, but they fail the deeper reasoning test by missing subtle details that can only be uncovered by piecing together information from multiple, disconnected sources.

The Economic Disparity of AI Performance

Beyond the technical shortcomings, the benchmark highlights a massive economic divide in the current LLM landscape. There is a staggering price gap between models when measured by the cost of task completion.

The efficiency gap is extreme: DeepSeek V4 Flash completed tasks at a cost of about $0.04 per task, while the high-performing Claude Fable 5 cost more than $31 per task. That is a price difference of roughly 800x, which poses a major challenge for founders and enterprises trying to scale AI agents without unsustainable operating costs.
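The size of that gap can be checked with a quick calculation. The sketch below uses only the two per-task costs quoted in the article, both of which are approximate ("about $0.04" and "more than $31"); the function and constant names are illustrative.

```python
# Per-task costs as quoted in the article; both figures are approximate.
DEEPSEEK_COST_PER_TASK = 0.04  # USD, DeepSeek V4 Flash ("about $0.04")
CLAUDE_COST_PER_TASK = 31.00   # USD, Claude Fable 5 ("more than $31", so a lower bound)


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive model costs per task."""
    if cheap <= 0:
        raise ValueError("cheap cost must be positive")
    return expensive / cheap


ratio = cost_ratio(CLAUDE_COST_PER_TASK, DEEPSEEK_COST_PER_TASK)
print(f"Cost ratio at the quoted figures: about {ratio:.0f}x")
```

At exactly $31 per task the ratio comes to about 775x. Because $31 is reported as a floor, the rounded 800x figure is consistent with an actual cost near $32 per task.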

Implications for the AI Landscape

The AA-Briefcase results serve as a reality check for the "AI Agent" hype cycle. For AI to move from a conversational assistant to a reliable knowledge worker, models will have to evolve beyond simple retrieval toward deep, cross-contextual synthesis. For developers and technical leaders, the goal is no longer just to increase parameter counts, but to improve the ability to handle fragmented, long-horizon reasoning tasks with greater precision and lower marginal costs.

Key Takeaways

Large performance gap: Even state-of-the-art models like Claude Fable 5 fully solve only 3% of complex, multi-source knowledge tasks.
Evolution of errors: Where lower-tier models fail at basic execution, advanced models fail through "quiet" errors, overlooking subtle details hidden across fragmented datasets.
Steep cost gap: There is a roughly 800x difference in per-task cost between budget-friendly models like DeepSeek V4 Flash and premium models like Claude Fable 5.
