The open-source AI community has received welcome news: llama.cpp now officially supports MTP (Multi-Token Prediction), a speculative decoding technique that significantly boosts the inference speed of locally run large language models.
Key Developments
According to tests shared on X, the dense Qwen3.6-27B model running on an A10G GPU saw its generation speed rise from 25 tokens/second to 45 tokens/second. The tests report this as 78% faster; the rounded figures quoted here work out to 80%. Users can enable the feature in llama-server with two new command-line flags: --spec-type draft-mtp and --spec-draft-n-max 2.
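As a rough illustration of how those two flags might be passed, the Python sketch below assembles a llama-server command line and prints it. The flag names and values are taken as reported above; the helper name build_llama_server_command and the model path are hypothetical placeholders, and the exact flag spelling should be checked against the installed llama.cpp build.

```python
import shlex


def build_llama_server_command(model_path, draft_n_max=2):
    """Assemble a llama-server argv using the MTP flags quoted in the article.

    Hypothetical helper for illustration; it only builds the command,
    it does not launch the server.
    """
    return [
        "llama-server",
        "-m", model_path,                        # GGUF model file (placeholder path)
        "--spec-type", "draft-mtp",              # flag as reported in the article
        "--spec-draft-n-max", str(draft_n_max),  # value 2 as reported in the article
    ]


# Placeholder path: substitute the location of a real GGUF file.
command = build_llama_server_command("models/your-model.gguf")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be handed to subprocess.run to start the server from a script.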
Why It Matters
A speed of 45 tokens/second on a 27B model crosses the "daily driver" threshold: fast enough for everyday practical work rather than just testing. For Vietnamese developers and businesses concerned about data privacy, this makes local AI deployment more practical and reduces reliance on expensive cloud APIs.
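To put the throughput figures in practical terms, the arithmetic can be sketched as follows. The 25 and 45 tokens/second values are the rounded figures from the reported test; the 500-token response length is an illustrative assumption, and prompt processing time is ignored.

```python
def speedup_percent(before_tps, after_tps):
    """Relative throughput gain, in percent."""
    return (after_tps / before_tps - 1.0) * 100.0


def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


BEFORE_TPS = 25.0      # reported baseline (rounded)
AFTER_TPS = 45.0       # reported speed with MTP enabled (rounded)
RESPONSE_TOKENS = 500  # assumption: a medium-length answer

gain = speedup_percent(BEFORE_TPS, AFTER_TPS)
wait_before = generation_seconds(RESPONSE_TOKENS, BEFORE_TPS)
wait_after = generation_seconds(RESPONSE_TOKENS, AFTER_TPS)

print(f"throughput gain: {gain:.0f}%")
print(f"{RESPONSE_TOKENS} tokens: {wait_before:.1f}s before, {wait_after:.1f}s after")
```

With these rounded inputs the gain comes out at 80%, and the wait for a 500-token answer drops from 20 seconds to roughly 11.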