Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s depending on the predictability of the task. So, a 1.5x increase on creative tasks up to a 2.2x increase on math.

MTP does not change the quality with the only cost being a few hundred MB extra VRAM usage. You will need to download a gguf model with MTP support to use it.
My parameters:

; Context memory usage  
ctx-size = 65536  
ctk = q8_0  
ctv = q8_0  

; Prompt processing speed  
batch-size = 1024  
ubatch-size = 1024  

; Speculative decoding  
np = 1  
spec-type = draft-mtp  
spec-draft-n-max = 3  

Edit: did some more testing using Unsloth’s parameters and with spec-draft-n-max = 6 I can get up to 82 tk/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task that now falls below 40 tk/s.
It seems like this should be tweaked depending on the prompt similar to the sampling parameters.

  • robber@lemmy.ml
    link
    fedilink
    English
    arrow-up
    0
    ·
    28 days ago

    Using MTP combined with tensor parallelism, I was able to go from running Qwen3.6 27b at ~7t/s to ~30t/s which I think is an insane boost (3x RTX 2000e Ada).