DeepSeek V4 Support (WIP) #22376

nisparks · 2026-04-26T00:14:24Z

nisparks
Apr 26, 2026

So I went pretty deep into optimizing DeepSeek V4 on my experimental branch before I realized it wasn't based on the upstream version. I went back and ported it to the upstream base, but it performs a little slower than my experimental branch.

I will share both.

Here is the Work in Progress, based on the upstream version: https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support

The experimental branch is still a work in progress; I will share it later.

Uploaded the GGUF here: https://huggingface.co/nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF

nisparks · 2026-04-26T00:55:41Z

nisparks
Apr 26, 2026
Author

I created a draft PR with no intention of merging it: #22378

Just so it is easy to see the diff.

0 replies

wuwenthink · 2026-04-26T10:12:33Z

wuwenthink
Apr 26, 2026

tecaprovn/deepseek-v4-flash-gguf does not work for me. Do you have a GGUF model that can be used? Thanks.

18 replies

wuwenthink Apr 26, 2026

@wuwenthink, ah, I renamed it, "native" was a bad name for it, so I had that changed.

I could not find a suitable branch, so I can't convert the model to GGUF locally.

wuwenthink Apr 26, 2026

@nisparks I made a Q2-Q3 mixed GGUF to fit almost exactly the VRAM size of an RTX Pro 6000. When offloading the MoE layers to CPU, I also got 15-17 t/s.

How did you manage to convert it successfully?
I switched branches on my side and it still throws errors = =

nisparks Apr 26, 2026
Author

@wuwenthink try https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support

nisparks Apr 26, 2026
Author

Or https://github.com/nisparks/llama.cpp/tree/experiment/deepseek-v4-gguf-convert

wuwenthink Apr 26, 2026

Or https://github.com/nisparks/llama.cpp/tree/experiment/deepseek-v4-gguf-convert

ok

lovedheart · 2026-04-26T20:00:24Z

lovedheart
Apr 26, 2026

@wuwenthink I can upload a bf16 gguf ...

1 reply

wuwenthink Apr 26, 2026

@wuwenthink I can upload a bf16 gguf ...

But I don't have enough memory = =

Fringe210 · 2026-04-28T12:48:37Z

Fringe210
Apr 28, 2026

https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda <-- it works (18 tokens/sec on a single RTX 6000 96 GB). It is 100% Minimax+clina+vstudio, so it needs a lot of love. Maybe it helps.

3 replies

ColumbusAI Apr 30, 2026

https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda <-- it works (18 tokens/sec on a single RTX 6000 96 GB). It is 100% Minimax+clina+vstudio, so it needs a lot of love. Maybe it helps.

Do you have a link to the model GGUF that I can download and test with this fork?

Fringe210 May 2, 2026

https://huggingface.co/antirez/deepseek-v4-gguf/tree/main

emcalv May 6, 2026

With DeepSeek V4, the only output I get is "<<<<<<<<<<<<< ... ". What params are you using?

cdome94 · 2026-05-04T14:54:47Z

cdome94
May 4, 2026

I got CUDA working on NVIDIA GB10 (128 GB unified memory) with antirez's fork and GGUF.

The crash (GGML_ASSERT(src0->type == GGML_TYPE_F32) failed in concat.cu) was caused by hardcoded F32 assertions in ggml_cuda_op_concat that block any quantized input. Fixed it with a byte-level cudaMemcpy path for contiguous quantized tensors.

PR on antirez's fork: antirez/llama.cpp-deepseek-v4-flash#4
Fork with fix: https://github.com/cdome94/llama.cpp-deepseek-v4-flash
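
For anyone wondering why a byte-level copy is enough here: when both inputs are contiguous and have the same quantized type, concatenation never needs to decode the blocks, it only moves whole blocks around. Below is a minimal host-side sketch of that idea, not the code from the fork. `QuantLayout`, `row_bytes`, and `concat_bytes` are hypothetical names made up for illustration, and `std::memcpy` stands in for the `cudaMemcpy` calls a device-side version would use. The example layout (32 elements per 34-byte block) is how I understand ggml's Q8_0 to be laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of a block-quantized type: how many elements
// one block covers and how many bytes that block occupies.
struct QuantLayout {
    size_t block_elems;
    size_t block_bytes;
};

// Bytes needed for one row of ne0 elements. Rows must be a whole number
// of blocks, otherwise a byte-level copy would split a block.
size_t row_bytes(const QuantLayout & q, size_t ne0) {
    assert(ne0 % q.block_elems == 0);
    return ne0 / q.block_elems * q.block_bytes;
}

// Concatenate two contiguous buffers without looking at element values.
// Each source is viewed as n_outer slices of chunk_a / chunk_b bytes;
// the result holds, per slice, the bytes of a followed by the bytes of b.
// For a concat along the outermost dimension, n_outer is 1.
std::vector<uint8_t> concat_bytes(const std::vector<uint8_t> & a, size_t chunk_a,
                                  const std::vector<uint8_t> & b, size_t chunk_b,
                                  size_t n_outer) {
    assert(a.size() == chunk_a * n_outer);
    assert(b.size() == chunk_b * n_outer);
    std::vector<uint8_t> dst(a.size() + b.size());
    for (size_t i = 0; i < n_outer; ++i) {
        uint8_t * out = dst.data() + i * (chunk_a + chunk_b);
        std::memcpy(out,           a.data() + i * chunk_a, chunk_a);  // slice i of a
        std::memcpy(out + chunk_a, b.data() + i * chunk_b, chunk_b);  // slice i of b
    }
    return dst;
}
```

With the assumed Q8_0 layout, a 64-element row is `row_bytes({32, 34}, 64)` = 68 bytes, so joining a 1-row tensor and a 2-row tensor along the row dimension is one 68-byte copy followed by one 136-byte copy. A real CUDA path would also have to check type equality and contiguity before taking this shortcut, and fall back to the existing F32 kernel otherwise.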

1 reply

AlphaMo99 May 10, 2026

Interesting, I get 6.5 tok/s on a ThinkPad P16 with 128 GB + A5500.

Replies: 5 comments · 23 replies
