JobBench: A New Benchmark Measuring AI's Ability to Work According to Human Intent
Rather than measuring whether AI can replace humans, JobBench evaluates AI on 130 real-world tasks that experts want to delegate. The new Claude Opus 4.7 scored only 45.9%.
Source: arxiv.org