Releases: microsoft/onnxruntime-genai
v0.14.0
What's Changed
- Fix WhisperProcessor divide-by-zero when single prompt is provided by @Copilot in #2068
- Fix lm_head tensor loading order dependency in quantized model builder by @thpereir in #2061
- Fail to build Whisper model by @xiaofeihan1 in #2075
- Rename NemotronCacheConfig to NemotronConfig and add blank penalty to the decoder by @nenad1002 in #2042
- Fix YaRN RoPE bugs in model builder and add parity tests by @titaiwangms in #2076
- Add Transformers v5 Support by @sayanshaw24 in #2089
- macOS ARM64 ADO pipeline by @Copilot in #2091
- Reduce CPU-side per-token overhead in GenerateNextToken and SampleTopP by @hanbitmyths in #2085
- Add onStageComplete by @apsonawane in #2074
- [WebGPU] Support continuous decoding (RewindTo) with graph capture by @qjia7 in #2083
- [Mistral3] Add VLM support with multi-image inference by @titaiwangms in #2077
- Add k_quant_linear mixed-precision quantization for hybrid attention … by @apsonawane in #2100
- Removes QNN packaging from onnxruntime-genai pipelines by @baijumeswani in #2109
- Add Gemma4 multimodal support (vision + audio) by @apsonawane in #2103
- Update GUIDs during az login by @kunal-vaishnavi in #2122
- Add CODEOWNERS file for repository ownership by @kunal-vaishnavi in #2119
- Qwen3.5: drop fp32 cast around RMSNorm in builder by @xiaofeihan1 in #2101
- Add support for LFM2 in ORT GenAI by @xenova in #1979
- Enable CUDA graph capture for CUDA EP to improve decode throughput by @apsonawane in #2070
- [Qwen3.5] dedup position ids by @daijh in #2102
- Address win-cuda pipeline errors by @baijumeswani in #2154
- Update Extensions Commit to Fix Id2Token Bugs by @sayanshaw24 in #2159
- Limit the CUDA cmake architectures to 86 for CI builds by @baijumeswani in #2161
- Gate leaked-object error reporting in Shutdown() to debug builds or when logging is enabled by @baijumeswani in #2162
- Update Copilot instructions for reviewing model builder by @kunal-vaishnavi in #2164
- Fix DecoderState input_ids check regression introduced in #2103 by @titaiwangms in #2148
- Fix memory leaks by @skottmckay in #2153
- [Qwen3.5] Use LpNormalization for L2-norm in linear-attention Q/K by @xiaofeihan1 in #2127
- Fix: Win32 build failure when paths contain spaces by @nsubaru in #2053
- Fix CUDA build with MSVC by enabling /Zc:preprocessor for nvcc host compilation on VS 16.5 or greater by @nsubaru in #2054
- Apply linear rope_scaling in model builder for Neutts/nano by @VishalX in #2142
- Fix Quark/AWQ weight loading for Qwen3-VL-4B text model by @anilmartha in #2143
- Fix WebGPU inference crash in embedding and multi-modal feature allocation by @feich-ms in #2163
- Support Visual Studio 18 2026 build by @Copilot in #2017
- Add QNN EP documentation to OGA including Genie note by @qti-kromero in #2158
- Use windowsml package and make winml usage simpler by @baijumeswani in #2155
- Cleanup TensorObject created by OrtxTensorResultGetAt by @skottmckay in #2168
- Fix nemotron leaks by @skottmckay in #2169
- [RyzenAI] make speech sub-model optional in PhiMultiModalProcessor by @manasablrm in #2167
- Enable graph capture for WebGPU models and DML continuous decoding tests by @qjia7 in #2099
- [Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by @xiaofeihan1 in #2137
- Enable Linux ARM64 builds and packaging by @baijumeswani in #2107
- Add gemma4 unit tests by @apsonawane in #2151
- Auto-detect fixed kv-cache shape in DefaultKeyValueCache by @akholodnamdcom in #2166
- Add text-only mode support for Qwen 3.5 model builder by @apsonawane in #2157
- Fix heap overflow issue by @apsonawane in #2110
- [Benchmark] Add --use_random_tokens flag to C benchmark by @VishalX in #2170
- Add HunYuan Dense V1 (hunyuan_v1_dense) model support by @anilmartha in #2144
- Nvidia Parakeet Tdt ASR support by @nenad1002 in #2150
- Multilingual Streaming Nemotron ASR + CUDA support by @nenad1002 in #2171
- Add Csharp binding for Multi-lingual ASR by @rui-ren in #2176
- Add VideoChat-Flash (OpenGVLab) language model support by @anilmartha in #2147
- Update Nemotron ASR docs by @rui-ren in #2178
- Validate sliding window size before creating KV cache by @baijumeswani in #2181
- Fix external weights loading for in-memory models without changing cwd by @baijumeswani in #2180
- Enable Qwen3.5 TRT-RTX EP path with CUDA graph by @yen-shi in #2139
- Add Qwen3.5-MoE (35B-A3B) model support by @tanzeel-amd in #2146
- Update ort-extensions commit by @baijumeswani in #2182
New Contributors
- @titaiwangms made their first contribution in #2076
- @nsubaru made their first contribution in #2053
- @VishalX made their first contribution in #2142
- @anilmartha made their first contribution in #2143
- @feich-ms made their first contribution in #2163
- @qti-kromero made their first contribution in #2158
- @manasablrm made their first contribution in #2167
- @yen-shi made their first contribution in #2139
- @tanzeel-amd made their first contribution in #2146
Full Changelog: v0.13.1...v0.14.0
v0.13.2
Version 0.13.2
v0.13.1
Version 0.13.1
v0.13.0
What's Changed
- update WebGPU buffer memory info name by @fs-eire in #1957
- Add enable_profiling in Runtime Options by @xiaofeihan1 in #1949
- Fix uninitialized tools variable and improve exception debug messages by @sheller-ms in #1971
- Add common download to Phi-3 tutorial by @kunal-vaishnavi in #1973
- Add support for InternLM2 model architecture by @amdrajeevp1 in #1958
- Update cmake cuda architecture and use win-arm64 pool workaround by @baijumeswani in #1976
- Update examples after 0.12.0 release by @kunal-vaishnavi in #1980
- Add CI pipeline for WebGPU EP model testing by @qjia7 in #1956
- Fix Python nightly build by @kunal-vaishnavi in #1981
- Add missing Quark 0.11 weight patterns for ChatGLM3 output layer by @poganesh in #1983
- Support Qwen2.5-VL pre-quantized models in qwen.py by @poganesh in #1985
- [VitisAI] external_ep_libray support fix for WinML by @akholodnamdcom in #1984
- Fix guidance bug by @baijumeswani in #1988
- Fix incorrect batch responses when using multiple prompts by @lnigam in #1986
- Enable webgpu graph capture in base.py by @qjia7 in #1991
- Harden CUDA error checking across the codebase by @Copilot in #1994
- allow pruned models for prefill by @fs-eire in #1995
- Fix WinML Packaging Pipeline by @baijumeswani in #1998
- Add small changes after pruning prefill by @kunal-vaishnavi in #2000
- webgpu: Optimize Copyfrom by @qjia7 in #1992
- Add support for CUDA 13 by @baijumeswani in #2001
- add webgpu to qmoe path by @guschmue in #2005
- Fix ERNIE 4.5 model builder: rope_attrs and config architecture name by @xiaoyao9184 in #2007
- Bug fix in Continuous Decoding by @chilukam-qti in #2008
- Update Phi-4 mm README links by @kunal-vaishnavi in #2014
- Add Qwen3-VL model support + multi-image input support in Qwen VL family by @hanbitmyths in #2003
- Add Qwen3.5 model support and optimize multi-image handling by @apsonawane in #2019
- Reuse a single generator via RewindTo(0) in benchmark instead of creating multiple generators by @qjia7 in #2002
- [RyzenAI] WinML compatibility fix by @akholodnamdcom in #2026
- Nemotron ASR Support for Streaming by @nenad1002 in #1997
- [WebGPU] Fix the prefill regression when graph capture is ON by @qjia7 in #2021
- Support 4 inputs for nemotron model by @jiafatom in #2036
- Updated java packaging based on python packaging logic by @EPNW-Eric in #2029
- Fix android packaging pipeline by @baijumeswani in #2039
- Add OpenAI's Whisper to model builder by @kunal-vaishnavi in #2018
- [Java] Add a dependency on onnxruntime (#2030) by @EPNW-Eric in #2040
- Fix mutually exclusive inputs for language models by @kunal-vaishnavi in #2046
- Decouple plugin execution providers (EPs) from the USE_WINML pre-processor macro by @baijumeswani in #2038
- Route pipeline model RunOptions through SetRunOption for proper special key handling by @Copilot in #2044
- Add ort_build_version and ort_build_source parameters to nuget and python packaging pipelines, remove ROCm support by @Copilot in #2049
- Add batched multi-image vision path and window_size config for Qwen VL by @hanbitmyths in #2050
- docs: fix formatting and syntax highlighting in documentation by @riddles-the-one in #2051
- Add Silero VAD Support to Nemotron Streaming ASR by @sayanshaw24 in #2035
- Add Qwen3.5 hybrid decoder export support (GatedDeltaNet + Attention) by @apsonawane in #2043
- Add support for QNN stateful models by @qti-ashimaj in #2012
- Allocate recurrent state via device allocator to enable CUDA graph capture by @apsonawane in #2057
- Speed up CI pipelines by @Copilot in #2052
- Fix tool calling for TRT-RTX models by @kunal-vaishnavi in #2048
- Fix vision pipeline EP hardcoding and pixel_values rank mismatch for Qwen VL models by @apsonawane in #2060
New Contributors
- @sheller-ms made their first contribution in #1971
- @amdrajeevp1 made their first contribution in #1958
- @poganesh made their first contribution in #1983
- @xiaoyao9184 made their first contribution in #2007
- @chilukam-qti made their first contribution in #2008
- @EPNW-Eric made their first contribution in #2029
Full Changelog: v0.12.0...v0.13.0
v0.12.2
- Update examples after 0.12.0 release
- Add missing Quark 0.11 weight patterns for ChatGLM3 output layer
- Support Qwen2.5-VL pre-quantized models in qwen.py
- Fix incorrect batch responses when using multiple prompts
- Harden CUDA error checking across the codebase
- allow pruned models for prefill
- Add small changes after pruning prefill
v0.12.1
v0.12.0
What's Changed
- Update versions after making 0.11.0 branch by @kunal-vaishnavi in #1867
- Fix guidance usage in continuous decoding by @kunal-vaishnavi in #1870
- Fix HelloPhi C# example by @kunal-vaishnavi in #1871
- Fix regex by @apsonawane in #1875
- Update extensions commit by @apsonawane in #1874
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Model builder refactoring by @tianleiwu in #1862
- Add lintrunner to format code by @tianleiwu in #1884
- Remove empty submodule leftover. by @xkszltl in #1883
- Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
- Enable graph capture for webgpu by @qjia7 in #1848
- Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
- Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
- Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
- Fix mac pipeline by @apsonawane in #1904
- whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
- Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
- Integrate FARA-7B model by @apsonawane in #1902
- Fix gpt-oss model export by @apsonawane in #1861
- OpenVINO: Add support for model caching via 'cache_dir' provider option by @RyanMetcalfeInt8 in #1900
- WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
- Run the model in text mode by @apsonawane in #1908
- Update extensions commit by @apsonawane in #1914
- Fix gpt-oss export by @apsonawane in #1915
- Support Olive new uint8 quantization format by @xiaoyu-work in #1916
- Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX by @anujj in #1921
- Add support for CUDA and CPU arch for Qwen-2.5-VL and Fara-7B by @apsonawane in #1919
- Add Gemma-3 vision tutorial to ONNX Runtime GenAI by @kunal-vaishnavi in #1793
- Quark GPT-OSS support by @thpereir in #1903
- Fix sliding window alignment regression in QNN models by @apsonawane in #1938
- AMD RyzenAI EP Support by @akholodnamdcom in #1935
- Update README by @natke in #1934
- [RyzenAI] Non-pruned models backward compatibility by @akholodnamdcom in #1942
- [VitisAI] EP loader by @akholodnamdcom in #1918
- Set default top_k and top_p if it is None by @xiaoyu-work in #1944
- Ensure dlls are signed in the c and nuget packages. by @baijumeswani in #1947
- Bump torch from 2.7.1 to 2.7.1+cpu in /test/python/directml/torch by @dependabot[bot] in #1868
- Add linker flags for 16 KB page size on Android by @sheetalarkadam in #1860
- Only manually load DLLs if onnxruntime.dll is not already loaded. by @chemwolf6922 in #1800
- Add a doc showing how to run GPT OSS 20B with WebGPU by @natke in #1945
- Add C#, Java, and Objective-C APIs for Config by @kunal-vaishnavi in #1946
- Fix GatherBlockQuantized node to support symmetric quantized LM_HEAD by @sushraja-msft in #1951
- Fix QMoE blockwise quantization support for TRT-RTX execution provider by @anujj in #1926
- Revert "Add a doc showing how to run GPT OSS 20B with WebGPU" by @kunal-vaishnavi in #1950
- Add custom model path support for unit tests by @mpasumarthi-git in #1917
- fix: patch llguidance to remove reference to ring crate by @sanaa-hamel-microsoft in #1948
- Implement graph models for EPs by @qjia7 in #1895
- Update handling EOS token id detection by @kunal-vaishnavi in #1925
- Remove onnxruntime-genai-cuda from the foundry package by @baijumeswani in #1954
- Include linux builds in the foundry ort-genai package by @baijumeswani in #1955
- Support pre-registered plug-in NvTensorRtRtx execution provider library by @anujj in #1889
- [RyzenAI] Linux compatibility fixes by @akholodnamdcom in #1959
- Use cuda 12.8 to build ort-genai by @baijumeswani in #1960
- Bump protobuf from 5.29.5 to 6.33.5 in /test/python by @dependabot[bot] in #1961
- Add RAII wrappers for ORT Model Editor API types by @qjia7 in #1953
- Rewrite all examples using standardization by @kunal-vaishnavi in #1939
- Add versioning to the onnxruntime-genai-cuda.dll by @baijumeswani in #1965
- [Build][Packaging] macOS packaging to skip building x86_64 by @baijumeswani in #1966
- Sync packaging changes with ONNX Runtime by @baijumeswani in #1967
- Release 0.12.0 cherry-pick PR by @baijumeswani in #1978
New Contributors
- @xkszltl made their first contribution in #1883
- @jaeyoonjung made their first contribution in #1888
- @jixiongdeng made their first contribution in #1885
- @Honry made their first contribution in #1886
- @thpereir made their first contribution in #1903
- @akholodnamdcom made their first contribution in #1935
- @sheetalarkadam made their first contribution in #1860
- @sanaa-hamel-microsoft made their first contribution in #1948
Full Changelog: v0.11.4...v0.12.0
v0.11.4
What's Changed
- WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
- Run the model in text mode by @apsonawane in #1908
Full Changelog: v0.11.3...v0.11.4
v0.11.3
What's Changed
- Model builder refactoring by @tianleiwu in #1862
- Add lintrunner to format code by @tianleiwu in #1884
- Remove empty submodule leftover. by @xkszltl in #1883
- Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
- Enable graph capture for webgpu by @qjia7 in #1848
- Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
- Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
- Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
- Fix mac pipeline by @apsonawane in #1904
- whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
- Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
- Integrate FARA-7B model by @apsonawane in #1902
- Set version as 0.11.3 by @kunal-vaishnavi in #1905
New Contributors
- @xkszltl made their first contribution in #1883
- @jaeyoonjung made their first contribution in #1888
- @jixiongdeng made their first contribution in #1885
- @Honry made their first contribution in #1886
Full Changelog: v0.11.2...v0.11.3
v0.11.2
What's Changed
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Set version as 0.11.2 by @kunal-vaishnavi in #1881
Full Changelog: v0.11.1...v0.11.2