MPMMine: A New Benchmark Suite for Constraint Acquisition in Mathematical Programming
MPMMine is introduced to provide a standardized evaluation framework for algorithms that discover and validate mathematical programming (MP) models.
Instead of focusing on replacing humans, JobBench evaluates AI across 130 real-world tasks that experts want to delegate. The new Claude Opus 4.7 scored only 45.9%.
The Gemini 3.2 Flash model is rumored to achieve 92% of GPT 5.5's performance in coding and reasoning tasks, with operating costs 15 to 20 times cheaper.
Google has just demonstrated the power of its Gemini Flash 3.5 model by achieving performance comparable to Sonnet 4.6 on prestigious leaderboards, marking a strong comeback for the search giant.
A new tool called physics-intern significantly boosts the performance of large language models like Gemini 3.1 Pro in solving physics and scientific problems, thanks to a specialized subagent mechanism.
PostTrainBench v1.0 provides a new standard to measure the capability of AI agents in performing post-training tasks for language models.
Aleph from Logic International has just topped formal reasoning benchmarks, validating Yann LeCun's vision that AI needs structural verification systems before outputting responses.
Bindu Reddy points out that the latest updates to Opus, Gemini, and Sonnet are showing poorer performance or more bugs compared to their predecessors.
The BEAMS initiative establishes standards for AI in modeling and simulation, aiming for responsibility and ethics. Experimental results show that current AI tools are strong at discussion and qualitative tasks, but still struggle with causal reasoning and quantitative debugging. The open-source sd-ai project helps increase transparency in evaluation.
Tech expert Bindu Reddy shared a curated list of the most effective AI models for tasks like coding, design, and chat.
Anthropic has announced Claude Opus 4.8, a powerful upgrade that helps the company reclaim the performance crown from OpenAI and Google while introducing the groundbreaking dynamic workflows feature.
Paris 2.0 marks a new milestone as the first video generation model to be trained in a decentralized manner. Tests show that it performs twice as well as traditional centralized models at the same budget.
A new study introduces DynaSchedBench, a standardized benchmark for the Dynamic Flexible Job-Shop Scheduling Problem (DFJSP), revealing the limitations of AI agents when they are given excessive data.
In collaboration with the UAE's Technology Innovation Institute (TII), Hugging Face has introduced QIMMA, a quality-focused leaderboard aimed at standardizing the evaluation of Arabic Large Language Models.
Hugging Face has introduced the "Benchmaxxer Repellant" tool, which uses hidden data to prevent score gaming on its Open ASR Leaderboard.
Apple has introduced SFI-Bench, a new video-based benchmark featuring over 1,700 questions designed to evaluate multimodal AI models' deep understanding of physical functionality.
Multiple new studies published on arXiv have simultaneously exposed significant flaws in the self-awareness, mathematical reasoning, and logical thinking of large language models.
A new study introduces AgingBench, a benchmark evaluating the long-term reliability of AI agents, showing that agents also "age" and experience performance degradation over time after deployment.
Abacus AI CEO Bindu Reddy has shared a list of today's leading AI models tailored for specific needs, including coding, image processing, and real-time voice.
The Hugging Face Dataset Leaderboard has added a feature to filter benchmark results by parameter range, making it easier for users to find the optimal model for their hardware capacity.
The new update for llama.cpp integrates Multi-Token Prediction (MTP), enabling the Qwen3.6-27B model to reach 45 tokens per second on an A10G GPU.