llama.cpp b9235: Accelerating Inference with Speculative N-gram Tuning
The llama.cpp b9235 release introduces Speculative N-gram Tuning, which speeds up decoding for large models such as Qwen3.6-27B.
Sources x.com
The update also integrates Multi-Token Prediction (MTP), enabling the Qwen3.6-27B model to reach 45 tokens per second on an A10G GPU.
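For readers unfamiliar with n-gram speculation, the general idea can be sketched in a few lines of Python. This is an illustrative toy and not llama.cpp's code: the function names `ngram_draft` and `accept_prefix`, the parameters `n` and `max_draft`, and the integer token IDs are all assumptions made for the example. The drafter looks for an earlier occurrence of the last `n` tokens and proposes whatever followed it; the target model then verifies the proposal and keeps only the matching prefix.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by reusing what followed the most recent
    earlier occurrence of the last n tokens (hypothetical helper)."""
    if len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan backwards through earlier positions, excluding the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            follow = start + n
            return list(tokens[follow:follow + max_draft])
    return []


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep draft tokens up to the first disagreement with the tokens
    the target model would have produced (hypothetical helper)."""
    accepted: List[int] = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = ngram_draft(history, n=3, max_draft=4)
# Stand-in for the target model's own predictions at the drafted positions.
verified = [4, 5, 9, 1]
accepted = accept_prefix(draft, verified)
```

Each accepted draft token is one the target model did not need a separate decode step for, which is where the speedup in this family of techniques comes from. Repetitive text (code, structured output, quoted context) tends to give longer accepted prefixes.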