Releases · ggml-org/llama.cpp

20 Jun 15:29

github-actions

b9739

8452824

b9739 Latest


release: add missing link for win opencl adreno arm64 (#24809)

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

| Asset | SHA-256 | Size | Uploaded (UTC) |
|:------|:--------|-----:|:---------------|
| cudart-llama-bin-win-cuda-12.4-x64.zip | 8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6 | 373 MB | 2026-06-20T15:29:13Z |
| cudart-llama-bin-win-cuda-13.3-x64.zip | 1462a050eb4c684921ba51dcc4cc488a036674c3e73e9945ee705b854808d03e | 373 MB | 2026-06-20T15:29:27Z |
| llama-b9739-bin-android-arm64.tar.gz | ee285b3edbda414083e831a9a4216700c1b02d1c2b47480f58197004a758f938 | 73.2 MB | 2026-06-20T15:29:38Z |
| llama-b9739-bin-macos-arm64.tar.gz | 1e6f9da7d58e31a579c1e6ed24d4d598b79f8cf178fc34c4c8ab80176651b8be | 10.4 MB | 2026-06-20T15:29:41Z |
| llama-b9739-bin-macos-x64.tar.gz | 9e02d95e915d9cdaa0134474cff42dec1330c41e271c772d59929ac37bc1905a | 10.7 MB | 2026-06-20T15:29:43Z |
| llama-b9739-bin-ubuntu-arm64.tar.gz | b10b276bff7e281b04fa6bb913f3ecfc6bc2236462d629ae60446ec5aecc889d | 12 MB | 2026-06-20T15:29:44Z |
| llama-b9739-bin-ubuntu-openvino-2026.2-x64.tar.gz | 27ebc4a6a376354c50bec278a8178dde26052034d280315bc57022e6923445a3 | 13.5 MB | 2026-06-20T15:29:45Z |
| llama-b9739-bin-ubuntu-rocm-7.2-x64.tar.gz | f236560c0d27380bd4056aba6b796391e1cad9dfb6af8d960b29815f5e4a9499 | 125 MB | 2026-06-20T15:29:46Z |
| llama-b9739-bin-ubuntu-s390x.tar.gz | 920b071a3b7ff1b5ed0dbeb9c08cae752ba02d2058ec0c4f1615ead3ecdfe6c4 | 14 MB | 2026-06-20T15:29:50Z |
| llama-b9739-bin-ubuntu-sycl-fp16-x64.tar.gz | a8d2690e437955c3b4cc2f79fb64747d6e5cf59151336fa97a2088f5e97686cb | 45.5 MB | 2026-06-20T15:29:52Z |
| Source code (zip) | | | 2026-06-20T15:08:59Z |
| Source code (tar.gz) | | | 2026-06-20T15:08:59Z |

20 Jun 14:06

github-actions

b9738

e27f308

b9738

server: avoid forwarding auth headers in CORS proxy (#24373)

* server: avoid forwarding auth headers in CORS proxy
* format
* fix test
* fix e2e test

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
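
The idea behind this change can be illustrated with a small header filter. This is only a sketch under assumptions, not the server's code: the names `header_list` and `strip_auth_headers` are hypothetical, and the choice of `Authorization` and `Proxy-Authorization` as the headers to drop is an assumption about what "auth headers" covers.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation of request headers as (name, value) pairs.
using header_list = std::vector<std::pair<std::string, std::string>>;

// HTTP header names are case-insensitive, so compare in lowercase.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Copy the headers that may be forwarded upstream, dropping credentials.
// Assumption: only these two header names are treated as "auth headers".
header_list strip_auth_headers(const header_list & in) {
    header_list out;
    for (const auto & h : in) {
        const std::string name = to_lower(h.first);
        if (name == "authorization" || name == "proxy-authorization") {
            continue; // never forward the client's credentials to a third party
        }
        out.push_back(h);
    }
    return out;
}
```

A proxy built this way would call `strip_auth_headers` on the incoming request headers before constructing the outgoing request, so a key meant for the local server is not sent to the proxied host.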


20 Jun 11:59

github-actions

b9736

796f41b

b9736

model : glm-dsa load DSA indexer tensors as optional (#24770)

GLM-5.2 ships the DSA "lightning indexer" on only a subset of layers (the
"full" layers; others omit it), but the GLM_DSA loader created the five
indexer tensors on every layer as required, so loading any GLM-5.2 GGUF
failed with e.g. missing tensor 'blk.3.indexer.k_norm.weight'.

GLM_DSA's graph is llama_model_deepseek2::graph (plain MLA) and does not use
the indexer tensors (indexer runtime not yet implemented), so they are
loaded-but-unused. Marking them TENSOR_NOT_REQUIRED lets layers without an
indexer load as nullptr and the model runs as full MLA attention.

DeepSeek-V3.2 (uniform indexer on all layers) is unaffected.
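
The required versus optional distinction described above can be sketched with a mock loader. Everything here is a stand-in for illustration: `tensor`, `mock_loader`, and the flag's value are assumptions, and only the name `TENSOR_NOT_REQUIRED` and the return-`nullptr` behavior come from the commit message.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for a loaded tensor; the real type is not modeled here.
struct tensor {
    std::string name;
};

// Flag name taken from the commit message; the value is an assumption for this sketch.
constexpr int TENSOR_NOT_REQUIRED = 1 << 0;

// Hypothetical loader holding the tensors that are present in the model file.
struct mock_loader {
    std::map<std::string, tensor> available;

    // Returns the tensor, or nullptr if it is absent and marked optional.
    // An absent tensor without the flag is a hard load error.
    tensor * create_tensor(const std::string & name, int flags = 0) {
        auto it = available.find(name);
        if (it == available.end()) {
            if (flags & TENSOR_NOT_REQUIRED) {
                return nullptr;
            }
            throw std::runtime_error("missing tensor '" + name + "'");
        }
        return &it->second;
    }
};
```

With a file that carries indexer tensors only on some layers, requesting `blk.3.indexer.k_norm.weight` without the flag throws, while requesting it with `TENSOR_NOT_REQUIRED` yields `nullptr` and lets loading continue.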


20 Jun 11:33

github-actions

b9735

37a77fb

b9735

ggml : optimize AMX (#24806)

Flatten the partition over n_batch * M so every thread participates in
the quantization.

| CPU                             | Model                         | Test   |   t/s OLD |   t/s NEW |   Speedup |
|:--------------------------------|:------------------------------|:-------|----------:|----------:|----------:|
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_NL - 4.5 bpw  | pp512  |    730.71 |    779.86 |      1.07 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_NL - 4.5 bpw  | tg128  |     87.88 |     86.79 |      0.99 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_XS - 4.25 bpw | pp512  |    725.09 |   1023.31 |      1.41 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_XS - 4.25 bpw | tg128  |     83.64 |     83.62 |      1.00 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_0              | pp512  |    820.51 |    924.05 |      1.13 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_0              | tg128  |     90.59 |     92.46 |      1.02 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_1              | pp512  |    776.88 |    872.79 |      1.12 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_1              | tg128  |     89.39 |     90.94 |      1.02 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_M            | pp512  |    719.28 |   1009.27 |      1.40 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_M            | tg128  |     80.62 |     80.86 |      1.00 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_S            | pp512  |    732.29 |   1077.29 |      1.47 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_S            | tg128  |     86.42 |     83.53 |      0.97 |

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
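
Why flattening helps can be shown with a generic work-splitting sketch. It is not the AMX code: `thread_range` and `busy_threads` are hypothetical helpers, and `M` and `n_batch` are treated simply as two loop extents, which is an assumption about the commit message's notation.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Half-open range [begin, end) of work items assigned to thread `ith` of `nth`.
std::pair<int64_t, int64_t> thread_range(int64_t n_items, int ith, int nth) {
    const int64_t per_thread = (n_items + nth - 1) / nth; // ceiling division
    const int64_t begin = std::min<int64_t>(n_items, per_thread * ith);
    const int64_t end   = std::min<int64_t>(n_items, begin + per_thread);
    return { begin, end };
}

// Number of threads that receive at least one work item.
int busy_threads(int64_t n_items, int nth) {
    int busy = 0;
    for (int ith = 0; ith < nth; ++ith) {
        const auto r = thread_range(n_items, ith, nth);
        if (r.second > r.first) {
            ++busy;
        }
    }
    return busy;
}
```

With `M = 4`, `n_batch = 8`, and 16 threads, splitting over `M` alone gives `busy_threads(4, 16) == 4`, while splitting over the flattened `n_batch * M = 32` items gives `busy_threads(32, 16) == 16`. A flat index `i` can be mapped back with `i / M` and `i % M`.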


20 Jun 00:16

github-actions

b9733

f449e05

b9733

ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA


19 Jun 23:44

github-actions

b9732

2b686a9

b9732

server: refactor child --> router communication (#24821)

* server: refactor child --> router communication
* fix wakeup case
* add docs
* improve update_status()
* nits


19 Jun 21:54

github-actions

b9731

4b48a53

b9731

server : optimize get_token_probabilities (#24796)

Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary.

logprobs sort: vocab=128000 n_top=0 iters=100
full    sort:   8555.6 us/op
partial sort:    704.3 us/op

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
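
The technique named above can be sketched as follows. This is a minimal illustration rather than the server's implementation: `token_prob` and `top_n_probs` are hypothetical names, and the input is assumed to be one probability per vocabulary entry.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical (token id, probability) pair used only in this sketch.
struct token_prob {
    int32_t id;
    float   p;
};

// Return the n_top most probable tokens, highest probability first.
// std::partial_sort orders only the first n_top positions, which costs
// roughly O(V log n_top) instead of O(V log V) for a full sort.
std::vector<token_prob> top_n_probs(const std::vector<float> & probs, size_t n_top) {
    std::vector<token_prob> cur(probs.size());
    for (size_t i = 0; i < probs.size(); ++i) {
        cur[i] = { static_cast<int32_t>(i), probs[i] };
    }
    n_top = std::min(n_top, cur.size());
    std::partial_sort(cur.begin(),
                      cur.begin() + static_cast<std::ptrdiff_t>(n_top),
                      cur.end(),
                      [](const token_prob & a, const token_prob & b) { return a.p > b.p; });
    cur.resize(n_top); // drop the unsorted tail
    return cur;
}
```

The saving grows with the gap between the vocabulary size and the number of requested tokens, which is the common case when a client asks for only a handful of top log-probabilities.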


19 Jun 21:18

github-actions

b9730

e475fa2

b9730

mtmd, arg: fix utf8 handling on windows (#24779)

* mtmd, arg: fix utf8 handling on windows
* also fix ggml_fopen
* fix build fail
* also fix CLI


19 Jun 20:47

github-actions

b9729

175147e

b9729

server: remove all internal mentions about "webui" (#24817)


19 Jun 16:19

github-actions

b9728

fabde3b

b9728

arg: Add comment line support to --api-key-file (#23168)

