LocalLLaMA@sh.itjust.works · English · 1 month ago

Llama.cpp MTP Support merged - up to 2.5x speed increase

github.com

TheCornCollector@piefed.zip to LocalLLaMA@sh.itjust.works

llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp

Overview This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B but in principle it should work for any MTP model. I've posted the detaile...

Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s depending on how predictable the task is. That works out to roughly a 1.5x speedup on creative tasks and up to about 2.2x on math.
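For anyone who wants to check the ratios, here is a minimal sketch of the arithmetic. It only uses the t/s figures quoted above; the task labels and the `speedup` helper are illustrative names, not part of llama.cpp.

```python
# Throughput figures reported in the post (tokens per second).
BASELINE_TPS = 32.0
MTP_TPS = {"creative (low end)": 50.0, "math (high end)": 72.0}


def speedup(new_tps: float, base_tps: float) -> float:
    """Return the throughput ratio new/base."""
    if base_tps <= 0:
        raise ValueError("base_tps must be positive")
    return new_tps / base_tps


for task, tps in MTP_TPS.items():
    print(f"{task}: {speedup(tps, BASELINE_TPS):.2f}x")
```

This prints 1.56x and 2.25x, which round to the 1.5x and 2.2x mentioned above.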

MTP does not change output quality; the only cost is a few hundred MB of extra VRAM. You will need to download a GGUF model with MTP support to use it.
My parameters:

; Context memory usage  
ctx-size = 65536  
ctk = q8_0  
ctv = q8_0  

; Prompt processing speed  
batch-size = 1024  
ubatch-size = 1024  

; Speculative decoding  
np = 1  
spec-type = draft-mtp  
spec-draft-n-max = 3
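If you launch from the command line instead of a preset file, the settings above can be turned into an argument list. This is only a sketch under assumptions: the `[preset]` section name is made up so that `configparser` accepts the text, and the `--spec-type` and `--spec-draft-n-max` flag names are assumed to mirror the keys in the post. Check `llama-server --help` on your build before relying on them.

```python
import configparser

# The parameters from the post, wrapped in a hypothetical [preset] section.
PRESET = """
[preset]
; Context memory usage
ctx-size = 65536
ctk = q8_0
ctv = q8_0

; Prompt processing speed
batch-size = 1024
ubatch-size = 1024

; Speculative decoding
np = 1
spec-type = draft-mtp
spec-draft-n-max = 3
"""

# Long-form flag for each key. The last two are ASSUMED to match the key
# names used in the post; verify them against your llama.cpp build.
FLAG_NAMES = {
    "ctx-size": "--ctx-size",
    "ctk": "--cache-type-k",
    "ctv": "--cache-type-v",
    "batch-size": "--batch-size",
    "ubatch-size": "--ubatch-size",
    "np": "--parallel",
    "spec-type": "--spec-type",
    "spec-draft-n-max": "--spec-draft-n-max",
}


def preset_to_args(text: str) -> list[str]:
    """Parse the preset text and return a flat [flag, value, ...] list."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    args: list[str] = []
    for key, value in parser["preset"].items():
        args += [FLAG_NAMES[key], value]
    return args


# The model path is left out on purpose; add it for a real launch.
print(" ".join(["llama-server", *preset_to_args(PRESET)]))
```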

Edit: I did some more testing using Unsloth's parameters, and with spec-draft-n-max = 6 I can get up to 82 t/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task, which now falls below 40 t/s.
It seems like this should be tweaked depending on the prompt, similar to the sampling parameters.
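A rough way to see why a longer draft helps predictable prompts and hurts unpredictable ones is the usual back-of-the-envelope model for speculative decoding. This is a sketch, not how llama.cpp computes anything: it assumes each drafted token is accepted independently with probability `p`, that verifying the whole draft costs about one normal forward pass, and that each drafted token costs a fraction `draft_cost` of a forward pass. The values of `p` and `draft_cost` below are hypothetical, not measurements.

```python
def expected_tokens(p: float, n: int) -> float:
    """Expected tokens produced per verify step with draft length n.

    Accepted prefix plus the one token the main model always contributes:
    1 + p + p^2 + ... + p^n = (1 - p^(n+1)) / (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 1.0:
        return float(n + 1)
    return (1.0 - p ** (n + 1)) / (1.0 - p)


def estimated_speedup(p: float, n: int, draft_cost: float) -> float:
    """Tokens per step divided by the relative cost of one step."""
    return expected_tokens(p, n) / (1.0 + n * draft_cost)


# Hypothetical acceptance rates: low for creative text, high for math.
for label, p in [("creative", 0.5), ("math", 0.9)]:
    for n in (3, 6):
        s = estimated_speedup(p, n, draft_cost=0.1)
        print(f"{label:8s} p={p} n={n}: ~{s:.2f}x")
```

With these made-up numbers, going from 3 to 6 draft tokens raises the estimate at high acceptance rates and lowers it at low ones, which matches the direction of the results above.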

robber@lemmy.ml · 28 days ago
Using MTP combined with tensor parallelism, I was able to go from running Qwen3.6 27B at ~7 t/s to ~30 t/s on 3x RTX 2000E Ada, which I think is an insane boost.
