
JobBench: A New Benchmark Measuring AI's Ability to Work According to Human Intent

Rather than asking whether AI can replace humans, JobBench evaluates AI on 130 real-world tasks that experts want to delegate. The new Claude Opus 4.7 scored only 45.9%.

Sources arxiv.org

Researchers have released JobBench, a benchmark for AI agents that measures how well they execute workflows humans actually want to delegate, rather than how well they could replace the workforce.

Key Developments

JobBench consists of 130 agentic tasks across 35 professions. Unlike traditional benchmarks, each task is set in a realistic workspace with noisy data, so the AI has to reason through disorganized information to find what matters. Results across 36 models show that a significant gap remains: the strongest model, Claude Opus 4.7 (running under Claude Code), achieved a score of only 45.9%.

Why It Matters

JobBench marks a shift in AI development philosophy, from "replacement" to "assistance". This matters for the Vietnamese labor market, where anxiety over AI replacing humans is palpable. The low scores of leading models show that AI agents still have a long way to go before they can reliably understand and carry out complex human intentions in real-world office environments. The benchmark can help developers in Vietnam direct their efforts toward building genuinely useful agents instead of chasing theoretical metrics.