Releases · microsoft/onnxruntime-genai

29 May 18:06

baijumeswani

v0.14.0

b7a6ec3

v0.14.0 Latest

What's Changed

Fix WhisperProcessor divide-by-zero when single prompt is provided by @Copilot in #2068
Fix lm_head tensor loading order dependency in quantized model builder by @thpereir in #2061
Fail to build Whisper model by @xiaofeihan1 in #2075
Rename NemotronCacheConfig to NemotronConfig and add blank penalty to the decoder by @nenad1002 in #2042
Fix YaRN RoPE bugs in model builder and add parity tests by @titaiwangms in #2076
Add Transformers v5 Support by @sayanshaw24 in #2089
macOS ARM64 ADO pipeline by @Copilot in #2091
Reduce CPU-side per-token overhead in GenerateNextToken and SampleTopP by @hanbitmyths in #2085
Add onStageComplete by @apsonawane in #2074
[WebGPU] Support continuous decoding (RewindTo) with graph capture by @qjia7 in #2083
[Mistral3] Add VLM support with multi-image inference by @titaiwangms in #2077
Add k_quant_linear mixed-precision quantization for hybrid attention … by @apsonawane in #2100
Removes QNN packaging from onnxruntime-genai pipelines by @baijumeswani in #2109
Add Gemma4 multimodal support (vision + audio) by @apsonawane in #2103
Update GUIDs during az login by @kunal-vaishnavi in #2122
Add CODEOWNERS file for repository ownership by @kunal-vaishnavi in #2119
Qwen3.5: drop fp32 cast around RMSNorm in builder by @xiaofeihan1 in #2101
Add support for LFM2 in ORT GenAI by @xenova in #1979
Enable CUDA graph capture for CUDA EP to improve decode throughput by @apsonawane in #2070
[Qwen3.5] dedup position ids by @daijh in #2102
Address win-cuda pipeline errors by @baijumeswani in #2154
Update Extensions Commit to Fix Id2Token Bugs by @sayanshaw24 in #2159
Limit the CUDA cmake architectures to 86 for CI builds by @baijumeswani in #2161
Gate leaked-object error reporting in Shutdown() to debug builds or when logging is enabled by @baijumeswani in #2162
Update Copilot instructions for reviewing model builder by @kunal-vaishnavi in #2164
Fix DecoderState input_ids check regression introduced in #2103 by @titaiwangms in #2148
Fix memory leaks by @skottmckay in #2153
[Qwen3.5] Use LpNormalization for L2-norm in linear-attention Q/K by @xiaofeihan1 in #2127
Fix: Win32 build failure when paths contain spaces by @nsubaru in #2053
Fix CUDA build with MSVC by enabling /Zc:preprocessor for nvcc host compilation on VS 16.5 or greater by @nsubaru in #2054
Apply linear rope_scaling in model builder for Neutts/nano by @VishalX in #2142
Fix Quark/AWQ weight loading for Qwen3-VL-4B text model by @anilmartha in #2143
Fix WebGPU inference crash in embedding and multi-modal feature allocation by @feich-ms in #2163
Support Visual Studio 18 2026 build by @Copilot in #2017
Add QNN EP documentation to OGA including Genie note by @qti-kromero in #2158
Use windowsml package and make winml usage simpler by @baijumeswani in #2155
Cleanup TensorObject created by OrtxTensorResultGetAt by @skottmckay in #2168
Fix nemotron leaks by @skottmckay in #2169
[RyzenAI] make speech sub-model optional in PhiMultiModalProcessor by @manasablrm in #2167
Enable graph capture for WebGPU models and DML continuous decoding tests by @qjia7 in #2099
[Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by @xiaofeihan1 in #2137
Enable Linux ARM64 builds and packaging by @baijumeswani in #2107
Add gemma4 unit tests by @apsonawane in #2151
Auto-detect fixed kv-cache shape in DefaultKeyValueCache by @akholodnamdcom in #2166
Add text-only mode support for Qwen 3.5 model builder by @apsonawane in #2157
Fix heap overflow issue by @apsonawane in #2110
[Benchmark] Add --use_random_tokens flag to C benchmark by @VishalX in #2170
Add HunYuan Dense V1 (hunyuan_v1_dense) model support by @anilmartha in #2144
Nvidia Parakeet Tdt ASR support by @nenad1002 in #2150
Multilingual Streaming Nemotron ASR + CUDA support by @nenad1002 in #2171
Add Csharp binding for Multi-lingual ASR by @rui-ren in #2176
Add VideoChat-Flash (OpenGVLab) language model support by @anilmartha in #2147
Update Nemotron ASR docs by @rui-ren in #2178
Validate sliding window size before creating KV cache by @baijumeswani in #2181
Fix external weights loading for in-memory models without changing cwd by @baijumeswani in #2180
Enable Qwen3.5 TRT-RTX EP path with CUDA graph by @yen-shi in #2139
Add Qwen3.5-MoE (35B-A3B) model support by @tanzeel-amd in #2146
Update ort-extensions commit by @baijumeswani in #2182

New Contributors

@titaiwangms made their first contribution in #2076
@nsubaru made their first contribution in #2053
@VishalX made their first contribution in #2142
@anilmartha made their first contribution in #2143
@feich-ms made their first contribution in #2163
@qti-kromero made their first contribution in #2158
@manasablrm made their first contribution in #2167
@yen-shi made their first contribution in #2139
@tanzeel-amd made their first contribution in #2146

Full Changelog: v0.13.1...v0.14.0

Contributors

skottmckay, qjia7, and 21 other contributors

Assets 15

onnxruntime-genai-0.14.0-linux-arm64.tar.gz
sha256:7d4d1ad8f0f956968f95a1344d49443f8172cab5b0f69f28ffd833e82e89044b
52.8 MB 2026-05-29T18:04:28Z

onnxruntime-genai-0.14.0-linux-x64-cuda.tar.gz
sha256:0c6e693b4e89486082559761dd24ffa451b80f1fb98c4dde6245cd53d52b3f1d
82.3 MB 2026-05-29T18:04:34Z

onnxruntime-genai-0.14.0-linux-x64.tar.gz
sha256:7b37f13619ee01263278fb1c24a950e219d75c9fa90586b1623d3e8bab9076b0
53.3 MB 2026-05-29T18:04:31Z

onnxruntime-genai-0.14.0-osx-arm64.tar.gz
sha256:56583c98e3939d2cfd5a3812471be44017ce2752776d389015ff583a8d758312
3.36 MB 2026-05-29T18:04:42Z

onnxruntime-genai-0.14.0-win-arm64-dml-winml.zip
sha256:461f60cecec6a2e221a19ea1a034242a9c834bd458c29e65a485033b43ea41e5
16.3 MB 2026-05-29T18:04:45Z

onnxruntime-genai-0.14.0-win-arm64-dml.zip
sha256:95bc062ca46f6313b061ac076cdc4f830639764f58847ec58e71cb144c20fc40
16.3 MB 2026-05-29T18:04:49Z

onnxruntime-genai-0.14.0-win-arm64.zip
sha256:b6daeedb6395406e4cefbd6577a0d2196611e360086f7767c153b1d4b3cb3f1b
15.5 MB 2026-05-29T18:04:43Z

onnxruntime-genai-0.14.0-win-x64-cuda.zip
sha256:24509dcf8fdd9cff5e4be98aa9d9a9d13c5c06a772faeaca68dc594cf7cb912f
41.3 MB 2026-05-29T18:05:12Z

onnxruntime-genai-0.14.0-win-x64-dml.zip
sha256:3ad50e0b978e9095909dadd4c8fcc7c2431f0d3be035ea2c7d142b71b403ea99
16.8 MB 2026-05-29T18:05:24Z

onnxruntime-genai-0.14.0-win-x64-winml.zip
sha256:d40ea0345ce4478d7ef442b812d425edd7fd665b2d0ed71f97c4047d57495416
42.2 MB 2026-05-29T18:04:58Z

Source code (zip)
2026-05-25T18:29:42Z

Source code (tar.gz)
2026-05-25T18:29:42Z
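
The sha256 values listed with each asset can be used to check a download before unpacking it. What follows is a minimal sketch in Python using only the standard library. It assumes the archive has already been downloaded into the working directory. The helper names `sha256_of` and `matches_release_digest` are illustrative and are not part of onnxruntime-genai.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_digest(path, expected):
    """Compare a file against a digest written as 'sha256:<hex>' or bare hex."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("sha256:"):
        expected_hex = expected_hex[len("sha256:"):]
    return sha256_of(path) == expected_hex


if __name__ == "__main__":
    # File name and digest copied from the v0.14.0 asset listing above.
    archive = "onnxruntime-genai-0.14.0-linux-x64.tar.gz"
    listed = "sha256:7b37f13619ee01263278fb1c24a950e219d75c9fa90586b1623d3e8bab9076b0"
    if not os.path.exists(archive):
        print(f"{archive}: not found, download it first")
    elif matches_release_digest(archive, listed):
        print(f"{archive}: digest matches")
    else:
        print(f"{archive}: digest MISMATCH, do not use this file")
```

The same check applies to any other asset in the list by swapping in its file name and listed digest.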

29 May 18:01

baijumeswani

v0.13.2

53deb89

v0.13.2

Version 0.13.2

Assets 13

15 Apr 20:04

baijumeswani

v0.13.1

db2baa9

v0.13.1

Version 0.13.1

Assets 14

15 Apr 19:56

baijumeswani

v0.13.0

2d30e49

v0.13.0

What's Changed

update WebGPU buffer memory info name by @fs-eire in #1957
Add enable_profiling in Runtime Options by @xiaofeihan1 in #1949
Fix uninitialized tools variable and improve exception debug messages by @sheller-ms in #1971
Add common download to Phi-3 tutorial by @kunal-vaishnavi in #1973
Add support for InternLM2 model architecture by @amdrajeevp1 in #1958
Update cmake cuda architecture and use win-arm64 pool workaround by @baijumeswani in #1976
Update examples after 0.12.0 release by @kunal-vaishnavi in #1980
Add CI pipeline for WebGPU EP model testing by @qjia7 in #1956
Fix Python nightly build by @kunal-vaishnavi in #1981
Add missing Quark 0.11 weight patterns for ChatGLM3 output layer by @poganesh in #1983
Support Qwen2.5-VL pre-quantized models in qwen.py by @poganesh in #1985
[VitisAI] external_ep_libray support fix for WinML by @akholodnamdcom in #1984
Fix guidance bug by @baijumeswani in #1988
Fix incorrect batch responses when using multiple prompts by @lnigam in #1986
Enable webgpu graph capture in base.py by @qjia7 in #1991
Harden CUDA error checking across the codebase by @Copilot in #1994
allow pruned models for prefill by @fs-eire in #1995
Fix WinML Packaging Pipeline by @baijumeswani in #1998
Add small changes after pruning prefill by @kunal-vaishnavi in #2000
webgpu: Optimize Copyfrom by @qjia7 in #1992
Add support for CUDA 13 by @baijumeswani in #2001
add webgpu to qmoe path by @guschmue in #2005
Fix ERNIE 4.5 model builder: rope_attrs and config architecture name by @xiaoyao9184 in #2007
Bug fix in Continuous Decoding by @chilukam-qti in #2008
Update Phi-4 mm README links by @kunal-vaishnavi in #2014
Add Qwen3-VL model support + multi-image input support in Qwen VL family by @hanbitmyths in #2003
Add Qwen3.5 model support and optimize multi-image handling by @apsonawane in #2019
Reuse a single generator via RewindTo(0) in benchmark instead of creating multiple generators by @qjia7 in #2002
[RyzenAI] WinML compatibility fix by @akholodnamdcom in #2026
Nemotron ASR Support for Streaming by @nenad1002 in #1997
[WebGPU] Fix the prefill regression when graph capture is ON by @qjia7 in #2021
Support 4 inputs for nemotron model by @jiafatom in #2036
Updated java packaging based on python packaging logic by @EPNW-Eric in #2029
Fix android packaging pipeline by @baijumeswani in #2039
Add OpenAI's Whisper to model builder by @kunal-vaishnavi in #2018
[Java] Add a dependency on onnxruntime (#2030) by @EPNW-Eric in #2040
Fix mutually exclusive inputs for language models by @kunal-vaishnavi in #2046
Decouple plugin execution providers (EPs) from the USE_WINML pre-processor macro by @baijumeswani in #2038
Route pipeline model RunOptions through SetRunOption for proper special key handling by @Copilot in #2044
Add ort_build_version and ort_build_source parameters to nuget and python packaging pipelines, remove ROCm support by @Copilot in #2049
Add batched multi-image vision path and window_size config for Qwen VL by @hanbitmyths in #2050
docs: fix formatting and syntax highlighting in documentation by @riddles-the-one in #2051
Add Silero VAD Support to Nemotron Streaming ASR by @sayanshaw24 in #2035
Add Qwen3.5 hybrid decoder export support (GatedDeltaNet + Attention) by @apsonawane in #2043
Add support for QNN stateful models by @qti-ashimaj in #2012
Allocate recurrent state via device allocator to enable CUDA graph capture by @apsonawane in #2057
Speed up CI pipelines by @Copilot in #2052
Fix tool calling for TRT-RTX models by @kunal-vaishnavi in #2048
Fix vision pipeline EP hardcoding and pixel_values rank mismatch for Qwen VL models by @apsonawane in #2060

New Contributors

@sheller-ms made their first contribution in #1971
@amdrajeevp1 made their first contribution in #1958
@poganesh made their first contribution in #1983
@xiaoyao9184 made their first contribution in #2007
@chilukam-qti made their first contribution in #2008
@EPNW-Eric made their first contribution in #2029

Full Changelog: v0.12.0...v0.13.0

Contributors

qjia7, xiaoyao9184, and 18 other contributors

Assets 14

27 Mar 17:49

baijumeswani

v0.12.2

935d426

v0.12.2

Assets 14

02 Mar 23:22

baijumeswani

v0.12.1

ab6e204

v0.12.1

#1988
#1984

Assets 14

13 Feb 17:38

baijumeswani

v0.12.0

f3a57ba

v0.12.0

What's Changed

Update versions after making 0.11.0 branch by @kunal-vaishnavi in #1867
Fix guidance usage in continuous decoding by @kunal-vaishnavi in #1870
Fix HelloPhi C# example by @kunal-vaishnavi in #1871
Fix regex by @apsonawane in #1875
Update extensions commit by @apsonawane in #1874
Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
Fix condition for NPU by @apsonawane in #1880
Model builder refactoring by @tianleiwu in #1862
Add lintrunner to format code by @tianleiwu in #1884
Remove empty submodule leftover. by @xkszltl in #1883
Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
Enable graph capture for webgpu by @qjia7 in #1848
Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
Fix mac pipeline by @apsonawane in #1904
whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
Integrate FARA-7B model by @apsonawane in #1902
Fix gpt-oss model export by @apsonawane in #1861
OpenVINO: Add support for model caching via 'cache_dir' provider option by @RyanMetcalfeInt8 in #1900
WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
Run the model in text mode by @apsonawane in #1908
Update extensions commit by @apsonawane in #1914
Fix gpt-oss export by @apsonawane in #1915
Support Olive new uint8 quantization format by @xiaoyu-work in #1916
Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX by @anujj in #1921
Add support for CUDA and CPU arch for Qwen-2.5-VL and Fara-7B by @apsonawane in #1919
Add Gemma-3 vision tutorial to ONNX Runtime GenAI by @kunal-vaishnavi in #1793
Quark GPT-OSS support by @thpereir in #1903
Fix sliding window alignment regression in QNN models by @apsonawane in #1938
AMD RyzenAI EP Support by @akholodnamdcom in #1935
Update README by @natke in #1934
[RyzenAI] Non-pruned models backward compatibility by @akholodnamdcom in #1942
[VitisAI] EP loader by @akholodnamdcom in #1918
Set default top_k and top_p if it is None by @xiaoyu-work in #1944
Ensure dlls are signed in the c and nuget packages. by @baijumeswani in #1947
Bump torch from 2.7.1 to 2.7.1+cpu in /test/python/directml/torch by @dependabot[bot] in #1868
Add linker flags for 16 KB page size on Android by @sheetalarkadam in #1860
Only manually load DLLs if onnxruntime.dll is not already loaded. by @chemwolf6922 in #1800
Add a doc showing how to run GPT OSS 20B with WebGPU by @natke in #1945
Add C#, Java, and Objective-C APIs for Config by @kunal-vaishnavi in #1946
Fix GatherBlockQuantized node to support symmetric quantized LM_HEAD by @sushraja-msft in #1951
Fix QMoE blockwise quantization support for TRT-RTX execution provider by @anujj in #1926
Revert "Add a doc showing how to run GPT OSS 20B with WebGPU" by @kunal-vaishnavi in #1950
Add custom model path support for unit tests by @mpasumarthi-git in #1917
fix: patch llguidance to remove reference to ring crate by @sanaa-hamel-microsoft in #1948
Implement graph models for EPs by @qjia7 in #1895
Update handling EOS token id detection by @kunal-vaishnavi in #1925
Remove onnxruntime-genai-cuda from the foundry package by @baijumeswani in #1954
Include linux builds in the foundry ort-genai package by @baijumeswani in #1955
Support pre-registered plug-in NvTensorRtRtx execution provider library by @anujj in #1889
[RyzenAI] Linux compatibility fixes by @akholodnamdcom in #1959
Use cuda 12.8 to build ort-genai by @baijumeswani in #1960
Bump protobuf from 5.29.5 to 6.33.5 in /test/python by @dependabot[bot] in #1961
Add RAII wrappers for ORT Model Editor API types by @qjia7 in #1953
Rewrite all examples using standardization by @kunal-vaishnavi in #1939
Add versioning to the onnxruntime-genai-cuda.dll by @baijumeswani in #1965
[Build][Packaging] macOS packaging to skip building x86_64 by @baijumeswani in #1966
Sync packaging changes with ONNX Runtime by @baijumeswani in #1967
Release 0.12.0 cherry-pick PR by @baijumeswani in #1978

New Contributors

@xkszltl made their first contribution in #1883
@jaeyoonjung made their first contribution in #1888
@jixiongdeng made their first contribution in #1885
@Honry made their first contribution in #1886
@thpereir made their first contribution in #1903
@akholodnamdcom made their first contribution in #1935
@sheetalarkadam made their first contribution in #1860
@sanaa-hamel-microsoft made their first contribution in #1948

Full Changelog: v0.11.4...v0.12.0

Contributors

anujj, Honry, and 21 other contributors

Assets 14

12 Dec 05:23

kunal-vaishnavi

v0.11.4

a8a6136

v0.11.4

What's Changed

WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
Run the model in text mode by @apsonawane in #1908

Full Changelog: v0.11.3...v0.11.4

Contributors

chrisdMSFT and apsonawane

Assets 13

08 Dec 20:23

kunal-vaishnavi

v0.11.3

4915b02

v0.11.3

What's Changed

Model builder refactoring by @tianleiwu in #1862
Add lintrunner to format code by @tianleiwu in #1884
Remove empty submodule leftover. by @xkszltl in #1883
Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
Enable graph capture for webgpu by @qjia7 in #1848
Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
Fix mac pipeline by @apsonawane in #1904
whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
Integrate FARA-7B model by @apsonawane in #1902
Set version as 0.11.3 by @kunal-vaishnavi in #1905

New Contributors

@xkszltl made their first contribution in #1883
@jaeyoonjung made their first contribution in #1888
@jixiongdeng made their first contribution in #1885
@Honry made their first contribution in #1886

Full Changelog: v0.11.2...v0.11.3

Contributors

Honry, qjia7, and 7 other contributors

Assets 13

18 Nov 12:53

kunal-vaishnavi

v0.11.2

25962b0

v0.11.2

What's Changed

Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
Fix condition for NPU by @apsonawane in #1880
Set version as 0.11.2 by @kunal-vaishnavi in #1881

Full Changelog: v0.11.1...v0.11.2

Contributors

xiaofeihan1, apsonawane, and kunal-vaishnavi

Assets 13
