Maybe it was just me, but in case others have done the same this post might help someone else too.

I have a workstation with plenty of CPU and system RAM, but I’m “GPU poor” in that I only have a 5060Ti with its 16GB of VRAM. Additionally, I need to use the GPU for regular system activities too which means I only have around ~14GB of VRAM available for the LLM.

I’m exclusively using this setup for development and system management tasks, and I’ve found Qwen 3.6 35B A3B to excel compared to other models. I don’t have the VRAM to run the 27GB dense model, so I’ve spent time on getting the best usage out of the MoE.

Or so I thought. Since “everyone” says to use Unsloth UD-Q4_K_XL that’s the quant I’ve been using, and I’ve gone a bit back’n’forth with MTP/no MTP, UB increase, mmproj since I’ve also started using a browser MCP etc.

Today I took another look at their quant chart and thought that since it’s MoE maybe I could run Q5_K_S which would be a step up?

Well. Now I’m using Q6_K because it turns out I could run that with the exact same settings as I’ve optimized my Q4_K_XL setup for which means there are no drawbacks - just a better performing model. I’ve already noticed how it’s able to get out of loops while before I had to interrupt it sometimes.

This is my setup. I get >1000 t/s prefill and >20 t/s inference. I’m not chasing faster inference since I actively read the thought process when working the LLM - but I’ve increased ub to get faster prefill since that’s just waiting time otherwise.

./llama-server
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
    -c 160000 \
    -n 32768 \
    -fa on \
    -ub 2048 \
    -ctk q8_0 \
    -ctv q8_0 \
    --no-mmap \
    --mlock \
    --no-warmup \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0

I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference both to context handling as well as planning. I have no need for a larger context - I even compact early quite often since the tasks get done before reaching the limit.

  • brockhold@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    4 days ago

    This is exactly why I am so sad that I didn’t buy more DDR4 back when it was reasonable. I run Unsloth’s Qwen 3.5 122B A10B UD_Q4_k_XL and while it works great, I really wish I had enough ram for Q6 or even Q8. The speed difference won’t be wildly worse, but the quality of output is noticeable. I’m just glad that it works as well as it does in Q4. It’s mostly limited by my main ram bandwidth, the GPU helps but I only barely hit 15t/s decode with MTP hitting >80%.

    • troed@fedia.ioOP
      link
      fedilink
      arrow-up
      0
      ·
      4 days ago

      15t/s is workable IMHO. What’s your system specs? I have 96GB DDR5 but never thought about going to an ever higher MoE.