Turing has announced a new benchmark, Multimodal STEM HLE++, for evaluating the most advanced AI models at a time when traditional metrics such as MMLU are becoming saturated. According to information shared on the social media platform X, the benchmark focuses on PhD-level multimodal STEM problems and is meant to find the true limits of leading artificial intelligence systems.
Background
For years, MMLU has been the gold standard for evaluating the knowledge and reasoning capabilities of large language models (LLMs). However, with the rapid pace of technological advancement, current state-of-the-art (SOTA) models score close to the ceiling, so the metric no longer separates one AI system from another in a meaningful way. Newer models are gradually closing in on even the challenging HLE (Humanity's Last Exam) dataset, driving the need for a more rigorous and comprehensive evaluation standard.
Details
To address this, the Multimodal STEM HLE++ evaluation dataset contains 1,100 complex STEM questions at the PhD candidate level. The defining feature of HLE++ is that it is multimodal: a model must not only process text but also understand and reason over images, graphs, and complex scientific diagrams. According to the announcement, the questions have stumped even advanced models such as Opus 4.6, and the first-attempt correct answer rate (pass@1) of current top models reaches only about 20%. Several major AI labs worldwide are reportedly starting to integrate HLE++ into their testing pipelines.
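For readers unfamiliar with the metric, pass@1 with a single attempt per question is simply the share of questions answered correctly on the first try. The snippet below is a minimal sketch of that arithmetic, not the benchmark's actual grading code: the function name `pass_at_1` and the list of results are hypothetical, and the figure of 220 correct answers is chosen only to match the roughly 20% rate quoted in the announcement.

```python
from typing import Sequence


def pass_at_1(first_attempt_correct: Sequence[bool]) -> float:
    """Return the fraction of questions answered correctly on the first attempt."""
    if not first_attempt_correct:
        raise ValueError("need at least one graded question")
    return sum(first_attempt_correct) / len(first_attempt_correct)


# Hypothetical grading results for illustration only:
# 1,100 questions, 220 of them answered correctly on the first attempt.
results = [True] * 220 + [False] * 880

score = pass_at_1(results)
print(f"pass@1 = {score:.1%}")  # prints "pass@1 = 20.0%"
```

At that rate, a model would miss about 880 of the 1,100 questions on its first attempt.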
Why It Matters
For the research and technology development community in Vietnam, the arrival of HLE++ shows that the AI race is shifting from recalling general knowledge to solving in-depth scientific problems. The fact that today's leading models reach only about 20% accuracy suggests that a wide gap remains between AI and expert-level human reasoning on these problems. This pushes domestic developers to rethink how they evaluate the real-world capabilities of AI rather than relying solely on scores from saturated benchmarks.