Tag

#Benchmark

21 English Kalera News articles tagged Benchmark — source-backed.

AI · tools-ai Jun 9, 2026

MPMMine: A New Benchmark Suite for Constraint Acquisition in Mathematical Programming

MPMMine is introduced to provide a standardized evaluation framework for algorithms that discover and validate mathematical programming (MP) models.

Sources arxiv.org

AI · tools-ai Jun 9, 2026

JobBench: A New Benchmark Measuring AI's Ability to Work According to Human Intent

Instead of focusing on replacing humans, JobBench evaluates AI on 130 real-world tasks that experts want to delegate. The new Claude Opus 4.7 scored only 45.9%.

Sources arxiv.org

AI Jun 8, 2026

Google Set to Launch Gemini 3.2 Flash: Near-GPT 5.5 Performance at 1/20th of the Cost

The Gemini 3.2 Flash model is rumored to reach 92% of GPT 5.5's performance on coding and reasoning tasks at 1/15 to 1/20 of the operating cost.

Sources x.com

AI · tools-ai Jun 8, 2026

Gemini Flash 3.5 goes head-to-head with Claude Sonnet 4.6

Google's Gemini Flash 3.5 model has reached performance comparable to Claude Sonnet 4.6 on prominent leaderboards, marking a strong comeback for the search giant.

Sources x.com

AI · tools-ai Jun 8, 2026

physics-intern: Framework Helps Gemini 3.1 Pro 'Outperform' GPT 5.5 Pro in Science

A new tool called physics-intern significantly boosts the performance of large language models like Gemini 3.1 Pro in solving physics and scientific problems, thanks to a specialized subagent mechanism.

Sources x.com

AI · tools-ai Jun 6, 2026

PostTrainBench v1.0 Released: A Benchmark for Evaluating AI Agents in the Post-Training Phase

PostTrainBench v1.0 provides a new standard to measure the capability of AI agents in performing post-training tasks for language models.

Sources x.com

AI · tools-ai Jun 6, 2026

The Return of Energy-Based Models: Aleph Leads Formal Reasoning Benchmarks

Aleph from Logic International has just topped formal reasoning benchmarks, validating Yann LeCun's vision that AI needs structural verification systems before outputting responses.

Sources x.com

AI · tools-ai Jun 6, 2026

Warning: Next-generation AI models show signs of "going in circles"

Bindu Reddy points out that the latest updates to Opus, Gemini, and Sonnet perform worse or have more bugs than their predecessors.

Sources x.com

AI · tools-ai Jun 5, 2026

BEAMS: A Framework for Evaluating AI in Modeling and Simulation

The BEAMS initiative establishes standards for the responsible and ethical use of AI in modeling and simulation. Experimental results show that current AI tools are strong at discussion and qualitative tasks but still struggle with causal reasoning and quantitative debugging. The open-source sd-ai project helps make evaluation more transparent.

Sources arxiv.org

AI Jun 2, 2026

Top AI Models for Specific Use Cases: Insights from Bindu Reddy

Tech expert Bindu Reddy shared a curated list of the most effective AI models for tasks like coding, design, and chat.

Sources x.com

AI May 29, 2026

Anthropic Unveils Claude Opus 4.8: Outperforming GPT-5.5 and Gemini 3.1 Pro

Anthropic has announced Claude Opus 4.8, a powerful upgrade that helps the company reclaim the performance crown from OpenAI and Google while introducing the groundbreaking dynamic workflows feature.

Sources the-decoder.com

AI May 29, 2026

Paris 2.0 — The World's First Decentrally Trained Video Generation Model

Paris 2.0 marks a new milestone as the first video generation model to be trained in a decentralized manner. Tests show that it performs twice as well as traditional centrally trained models at the same budget.

Sources x.com

AI May 28, 2026

DynaSchedBench: Decoding the LLM 'Observability Paradox' in Dynamic Scheduling

A new study introduces DynaSchedBench, a standardized benchmark for the Dynamic Flexible Job-Shop Scheduling Problem (DFJSP), revealing the limitations of AI agents when they are given excessive data.

Sources arxiv.org

AI May 27, 2026

Hugging Face and TII UAE Launch QIMMA — An Arabic LLM Quality Leaderboard ⛰️

In collaboration with the UAE's Technology Innovation Institute (TII), Hugging Face has introduced QIMMA, a quality-focused leaderboard aimed at standardizing the evaluation of Arabic Large Language Models.

Sources huggingface.co

AI May 27, 2026

Hugging Face updates ASR leaderboard to prevent score gaming

Hugging Face has introduced the "Benchmaxxer Repellant" tool, which uses hidden data to prevent score gaming on its Open ASR Leaderboard.

Sources huggingface.co

AI May 27, 2026

Apple Announces SFI-Bench: Evaluating Spatial-Functional Intelligence in AI 🧠

Apple has introduced SFI-Bench, a new video-based benchmark featuring over 1,700 questions designed to evaluate multimodal AI models' deep understanding of physical functionality.

Sources machinelearning.apple.com

AI May 27, 2026

New Studies Reveal the True Cognitive Limits of LLMs

Multiple new studies published on arXiv have simultaneously exposed significant flaws in the self-awareness, mathematical reasoning, and logical thinking of large language models.

Sources arxiv.org arxiv.org arxiv.org

AI May 27, 2026

AgingBench: A Benchmark for Measuring AI Agent 'Aging' in Real-World Deployments

A new study introduces AgingBench, a benchmark that evaluates the long-term reliability of AI agents and shows that agents also "age": their performance degrades over time after deployment.

Sources arxiv.org

AI May 23, 2026

The Best AI Models by Task: 2026 Rankings 🏆

Abacus AI CEO Bindu Reddy has shared a list of today's leading AI models tailored for specific needs, including coding, image processing, and real-time voice.

Sources x.com

AI May 20, 2026

Hugging Face Updates Leaderboard to Enable Model Filtering by Parameter Count

The Hugging Face Dataset Leaderboard has added a feature to filter benchmark results by parameter range, making it easier for users to find the optimal model for their hardware capacity.

Sources x.com

AI May 20, 2026

llama.cpp adds MTP support, boosting local AI speed by 78%

The new llama.cpp update integrates Multi-Token Prediction (MTP), enabling the Qwen3.6-27B model to reach 45 tokens per second on an A10G GPU.

Sources x.com