AI May 20, 2026

llama.cpp adds MTP support, boosting local AI speed by 78%

A new llama.cpp update integrates Multi-Token Prediction (MTP), letting the Qwen3.6-27B model reach 45 tokens per second on an A10G GPU.

Sources x.com

llama.cpp now officially supports MTP (Multi-Token Prediction), exposed as a speculative-decoding mode, which significantly increases the inference speed of large language models run locally.

Key Developments

According to tests shared on X, the Qwen3.6-27B model running dense generation on an A10G GPU went from 25 tokens/second to 45 tokens/second, a reported 78% speedup (the rounded figures work out to about 80%). Users can enable the feature on llama-server with two new command-line flags: --spec-type draft-mtp and --spec-draft-n-max 2.
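
As a rough illustration of the figures and flags above, the following Python sketch recomputes the speedup from the rounded throughput numbers and assembles a llama-server command line. The two --spec-* flags are taken from the report; the -m model option is the usual llama.cpp way to pass a model, and the model path and function names are hypothetical placeholders, not values from the original test.

```python
import shlex


def speedup_percent(baseline_tps, new_tps):
    """Relative throughput gain in percent."""
    return (new_tps - baseline_tps) / baseline_tps * 100.0


def build_server_command(model_path, draft_n_max=2):
    """Assemble a llama-server invocation with MTP drafting enabled.

    The --spec-type and --spec-draft-n-max flags are quoted from the report;
    model_path is a hypothetical placeholder supplied by the caller.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


if __name__ == "__main__":
    # Rounded figures from the report: 25 -> 45 tokens/second.
    print(f"{speedup_percent(25, 45):.0f}% faster")
    # Hypothetical model file name, for illustration only.
    print(shlex.join(build_server_command("models/qwen3.6-27b.gguf")))
```

The rounded numbers give 80%, so the reported 78% presumably comes from unrounded measurements.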

Why It Matters

A speed of 45 tokens/second on a 27B model crosses the "daily driver" threshold: fast enough for everyday practical work rather than just testing. For Vietnamese developers and businesses concerned about data privacy, the speedup makes local AI deployment more practical and reduces reliance on expensive cloud APIs.
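
To make the "daily driver" claim concrete, here is a minimal sketch of how long a reply takes to generate at each throughput. The 500-token reply length is an assumed example, not a figure from the tests, and prompt-processing time is ignored.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 500  # assumed reply length, for illustration only
    for tps in (25, 45):
        seconds = generation_seconds(reply_tokens, tps)
        print(f"{tps} tok/s: {seconds:.1f} s for a {reply_tokens}-token reply")
```

Under these assumptions the wait drops from 20 seconds to roughly 11 seconds per reply.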