MPMMine: A New Benchmark Suite for Constraint Acquisition in Mathematical Programming
MPMMine is introduced to provide a standardized evaluation framework for algorithms that discover and validate mathematical programming (MP) models.
A new study proposes a framework to mitigate errors and uncertainty when using LLMs to automate experimental procedures in virtual environments.
The new COLAGUARD model addresses the safety-speed trade-off in guardrailing large language models. Instead of requiring explicit reasoning, which causes high latency, COLAGUARD shifts the multi-step reasoning process into the latent space during inference. Results show that the model significantly improves F1 scores compared to Llama Guard 3 while being 12.9x faster and consuming 22.4x fewer tokens.
An arXiv study finds that LLMs readily abandon correct answers under user pressure; COLAGUARD is proposed as a highly effective guardrailing solution.
Studies on arXiv propose solutions for sim-to-real transfer, off-policy optimization, and opponent behavior shaping in multi-agent environments.
A new series of studies on AI agents focuses on physical feasibility (BrickAnything) and maintaining long-term system performance.
Redpanda introduces the Agentic Data Plane (ADP), an architecture that utilizes out-of-band metadata channels to manage security for autonomous AI agents. Instead of relying on agents to handle access policies directly, ADP pushes security contexts and audit trails out of their control. This helps prevent risks from agent hallucinations or manipulation, ensuring compliance with data rights and execution policies even in complex tasks like financial portfolio management.
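The out-of-band idea can be illustrated with a minimal sketch. This is not Redpanda's implementation or API: the `Gateway` and `SecurityContext` classes, the session IDs, and the action names are all hypothetical, chosen only to show how a security context and an audit trail can be kept where the agent cannot edit them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SecurityContext:
    """Hypothetical security context; held by the gateway, never by the agent."""
    principal: str
    allowed_actions: frozenset


@dataclass
class Gateway:
    """Hypothetical policy enforcement point sitting between agent and data."""
    _contexts: dict = field(default_factory=dict)
    _audit: list = field(default_factory=list)

    def register(self, session_id, ctx):
        # Called by the platform over a separate channel, not by the agent.
        self._contexts[session_id] = ctx

    def execute(self, session_id, action, payload):
        # Authorization depends only on the stored context. Nothing the agent
        # puts in `payload` can widen its own permissions.
        ctx = self._contexts.get(session_id)
        allowed = ctx is not None and action in ctx.allowed_actions
        # Every attempt is recorded, whether it was allowed or denied.
        self._audit.append((session_id, action, allowed))
        if not allowed:
            raise PermissionError(f"{action!r} denied for session {session_id!r}")
        return f"{action} ok"

    def audit_trail(self):
        # Return an immutable copy so callers cannot rewrite history.
        return tuple(self._audit)


gateway = Gateway()
gateway.register("s1", SecurityContext("alice", frozenset({"read_portfolio"})))
print(gateway.execute("s1", "read_portfolio", {}))
```

In this sketch, a hallucinating or manipulated agent that requests an action outside its registered context gets a `PermissionError`, and the denied attempt still appears in the audit trail.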
New research shows that LLM-based AI agents (built on Anthropic and OpenAI models) can annotate biological phenotype data with accuracy comparable to human experts. Phenotype annotation has traditionally been a highly specialized and time-consuming process, creating a bottleneck in evolutionary biology research. Agents equipped with a self-contained workspace (research PDFs, annotation guidelines, ontologies) far exceeded the performance of traditional NLP tools.
UniScale is an online framework that unifies model routing and test-time scaling into a single optimization space, achieving a better balance between quality and cost.
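To make the notion of a single optimization space concrete, here is a toy sketch that scores every (model, sample count) pair with one utility function. It is not UniScale's algorithm: the model statistics, the best-of-n quality estimate, and the trade-off weight `lam` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical per-model statistics (illustrative values, not from the paper):
# "p" is the single-sample success probability, "cost" the cost per sample.
MODELS = {
    "small": {"p": 0.5, "cost": 1.0},
    "large": {"p": 0.8, "cost": 4.0},
}
SAMPLE_BUDGETS = (1, 2, 4)  # candidate test-time scaling levels


def expected_quality(p, n):
    """Chance that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n


def choose_config(lam):
    """Pick the (model, n) pair maximizing quality minus lam times cost."""
    best = None
    for name, n in product(MODELS, SAMPLE_BUDGETS):
        stats = MODELS[name]
        utility = expected_quality(stats["p"], n) - lam * n * stats["cost"]
        if best is None or utility > best[2]:
            best = (name, n, utility)
    return best


for lam in (0.0, 0.05, 1.0):
    model, n, utility = choose_config(lam)
    print(f"lam={lam}: model={model}, samples={n}, utility={utility:.4f}")
```

Under these assumed numbers, a moderate cost weight favors sampling the small model several times over calling the large model once, a choice that treating routing and scaling as two separate decisions could miss.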
An arXiv paper (2605.30621) finds that an agent's ability to update its own "harness" does not guarantee that the agent benefits from doing so. Mid-tier models typically benefit the most from self-evolution.
A new study proposes an uncertainty-aware framework to guide exploration in reinforcement learning for autonomous vehicles, helping to avoid collisions during training.
AdaCoM trains an external LLM to manage context for a "frozen" agent, mitigating the degradation of reasoning capabilities in overextended contexts.
COMPASS uses MCTS for the safety alignment of search agents, detecting malicious intents disguised as seemingly harmless sub-queries.
New research proposes the Calibrated Interactive RL framework to mitigate distribution shift and behavioral bias in conversational LLMs.
Multiple new studies published on arXiv have simultaneously exposed significant flaws in the self-awareness, mathematical reasoning, and logical thinking of large language models.