Researchers have just launched JobBench, a benchmark that evaluates AI agents on their ability to execute workflows that humans actually want to delegate, rather than on their ability to replace the workforce.
Key Developments
JobBench consists of 130 agentic tasks across 35 professions. Unlike traditional benchmarks, each JobBench task is set up as a realistic workspace with noisy data, so the agent has to reason through chaotic streams of information. Results across 36 models show that a significant gap remains: the strongest model, Claude Opus 4.7 (running under Claude Code), scored only 45.9%.
Why It Matters
JobBench marks a shift in AI development philosophy, from "replacement" to "assistance". This matters for the Vietnamese labor market, where anxiety over AI replacing humans is palpable. The low scores of even the leading models show that AI agents still have a long way to go before they can understand and accurately carry out complex human intentions in real-world office environments. The benchmark can help developers in Vietnam focus on building genuinely useful agents instead of chasing theoretical metrics.