llama.cpp b9235: Accelerating Inference with Speculative N-gram Tuning
The llama.cpp b9235 release introduces Speculative N-gram Tuning, significantly speeding up decoding for large models such as Qwen3.6 27B.
llama.cpp now officially supports Multi-Token Prediction (MTP) for the Qwen3.6 series, a milestone for local AI that dramatically boosts processing speeds on consumer hardware.
Pinterest has achieved a major breakthrough in operational efficiency, slashing AI infrastructure costs by 90% and boosting accuracy by 30% by restructuring the vision processing layer of the Qwen3-VL model.
The latest llama.cpp update supporting Multi-Token Prediction (MTP) enables the Qwen3.6-27B model to reach 45 tokens/second on mid-range hardware, accelerating the trend of self-hosting AI.
Alibaba Cloud has introduced Qwen3.7-Max, featuring a 1M-token context window and outstanding performance in coding, reasoning, and long-horizon autonomy.
The new llama.cpp update integrates Multi-Token Prediction (MTP), enabling the Qwen3.6-27B model to reach 45 tokens per second on an A10G GPU.
The Qwen3.6-27B model can now run entirely on WebGPU, allowing AI to run directly in the browser without server dependency. Although its speed is still limited, this is a major step forward for decentralized AI.