Releases: AutoGPTQ/AutoGPTQ
v0.7.1: patch release
Support loading sharded quantized checkpoints
Sharded checkpoints can now be loaded in the from_quantized method.
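For context, sharded checkpoints in the Hugging Face format usually ship an index file whose weight_map entry maps each tensor name to the shard file that holds it. The sketch below is not AutoGPTQ's loader. It only illustrates, with hypothetical tensor and file names, how such an index can be grouped by shard so that each file needs to be opened once.

```python
import json
from collections import defaultdict

# Hypothetical index content; tensor names and shard file names are made up
# for illustration and do not come from a real checkpoint.
INDEX_JSON = """
{
  "metadata": {"total_size": 123456},
  "weight_map": {
    "model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.scales": "model-00001-of-00002.safetensors",
    "model.layers.1.mlp.down_proj.qweight": "model-00002-of-00002.safetensors"
  }
}
"""


def group_tensors_by_shard(index_text):
    """Return {shard_file: [tensor names]} from an index JSON string."""
    weight_map = json.loads(index_text)["weight_map"]
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


shards = group_tensors_by_shard(INDEX_JSON)
for shard_file in sorted(shards):
    # Print each shard with the number of tensors it contains.
    print(shard_file, len(shards[shard_file]))
```

Grouping by shard first avoids reopening the same file for every tensor, which matters when a model is split across many files.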
Gemma GPTQ quantization
Gemma models can now be quantized with AutoGPTQ.
Other changes and fixes
- Add back missing import by @fxmarty in #553
- Fix bias materialization for Marlin by @fxmarty in #554
- Fix shape check marlin by @fxmarty in #557
- Explicitly check compute capability in marlin's QLinear by @fxmarty in #567
- Compatibility with latest transformers by @fxmarty in #573
Full Changelog: v0.7.0...v0.7.1
v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading
Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading
@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.
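The constraints listed above (int4 weights, symmetric quantization, no act-order, Ampere GPUs) can be expressed as a small pre-flight check. This is a sketch only: QuantSettings and marlin_applicable are hypothetical names, not AutoGPTQ APIs, and treating "Ampere" as compute capability 8.x or newer is an assumption.

```python
from dataclasses import dataclass


@dataclass
class QuantSettings:
    """Hypothetical container for the settings a checkpoint was quantized with."""
    bits: int
    sym: bool
    desc_act: bool


def marlin_applicable(settings, compute_capability):
    """Return (ok, reason) using only the constraints stated in the release note.

    compute_capability is a (major, minor) tuple, e.g. (8, 0).
    """
    major, _minor = compute_capability
    if settings.bits != 4:
        return False, "Marlin is an int4*fp16 kernel"
    if not settings.sym:
        return False, "only symmetric quantization is supported"
    if settings.desc_act:
        return False, "act-order (desc_act=True) is not supported"
    if major < 8:
        # Assumption: Ampere corresponds to compute capability 8.x or newer.
        return False, "kernel targets Ampere GPUs"
    return True, "ok"


print(marlin_applicable(QuantSettings(bits=4, sym=True, desc_act=False), (8, 0)))
print(marlin_applicable(QuantSettings(bits=4, sym=True, desc_act=True), (8, 6)))
```

Returning a reason string alongside the boolean makes it easier to tell a user why a checkpoint falls back to another kernel.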
This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag will repack the quantized weights as the Marlin kernel expects a different layout. The repacked weight is then saved locally so as to avoid the need to repack again. Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
- add marlin kernel by @qwopqwop200 in #514
- updated marlin serialization by @rib-2 in #522
- Marlin repacking CUDA kernel by @fxmarty in #539
- Marlin kernel can be built against any compute capability by @fxmarty in #540
Ability to load AWQ checkpoints in AutoGPTQ
Note: The AWQ checkpoint repacking step is currently slow; a faster implementation is possible.
AWQ's original implementation adopted a serialization format different from the one expected by current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. We allow loading AWQ checkpoints in AutoGPTQ to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
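To illustrate what "same computation, different serialization" can mean, the sketch below packs eight 4-bit values into one 32-bit word under two different nibble orders and converts between them. The two orders are illustrative assumptions, not a statement of the actual AWQ or GPTQ layouts, and the helper names are hypothetical.

```python
def pack_int4(values, order):
    """Pack eight 4-bit values into one 32-bit word.

    order[i] is the index of the value stored in nibble i (nibble 0 is lowest).
    """
    assert len(values) == 8 and len(order) == 8
    word = 0
    for nibble, src in enumerate(order):
        assert 0 <= values[src] < 16
        word |= values[src] << (4 * nibble)
    return word


def unpack_int4(word, order):
    """Inverse of pack_int4 for the same order."""
    values = [0] * 8
    for nibble, src in enumerate(order):
        values[src] = (word >> (4 * nibble)) & 0xF
    return values


def repack(word, src_order, dst_order):
    """Convert a packed word from one nibble order to another."""
    return pack_int4(unpack_int4(word, src_order), dst_order)


SEQUENTIAL = [0, 1, 2, 3, 4, 5, 6, 7]
INTERLEAVED = [0, 2, 4, 6, 1, 3, 5, 7]  # illustrative order only

weights = [3, 15, 0, 7, 9, 1, 12, 4]
packed_interleaved = pack_int4(weights, INTERLEAVED)
packed_sequential = repack(packed_interleaved, INTERLEAVED, SEQUENTIAL)

# Both words decode to the same weights; only the storage layout differs.
print(unpack_int4(packed_sequential, SEQUENTIAL) == weights)
```

Because the decoded values are identical, a kernel that expects one layout can consume a checkpoint stored in the other after a one-time repacking pass.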
Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
Qwen2, LongLLaMA, Deci_lm models support
These models can be quantized with AutoGPTQ.
- Add qwen2 by @JustinLin610 in #519
- Change deci_lm model type to deci by @LaaZa in #491
- Support for LongLLaMA models. by @LaaZa in #442
Other changes and bugfixes
- Update version & install instructions by @fxmarty in #485
- fix the support of Qwen by @hzhwcmhf in #495
- rocm6.0 compatible exllama by @seungrokj in #515
- Untie weights for safetensors serialization by @fxmarty in #536
- marlin update version 0.1.1 and fix marlin bug by @qwopqwop200 in #524
- Use ruff for linting by @fxmarty in #537
- Fix wheels build for torch==2.2.0 by @fxmarty in #541
- Fix repo owners in workflows by @fxmarty in #542
- Disable peft compatibility by @fxmarty in #543
- Improve README by @fxmarty in #544
- Add ROCm dockerfile by @fxmarty in #545
- Make all tests pass by @fxmarty in #546
- Fix cuda wheel build workflows by @fxmarty in #547
- Use bash in workflows by @fxmarty in #548
- Dissociate Windows & Linux CUDA build by @fxmarty in #549
- Add more guards on compute capability in Marlin kernel by @fxmarty in #550
New Contributors
- @hzhwcmhf made their first contribution in #495
- @rib-2 made their first contribution in #522
- @seungrokj made their first contribution in #515
Full Changelog: v0.6.0...v0.7.0
v0.6.0: Mixtral, StableLM, DeciLM, Yi support, Transformers 4.36 compatibility
What's Changed
- Precise PyTorch version by @fxmarty in #421
- Fix triton unexpected keyword by @LaaZa in #423
- Add support for Yi models. by @LaaZa in #413
- Add support for Xverse models. by @LaaZa in #417
- Allow fp32 input to GPTQ linear by @fxmarty in #437
- Fix typos in tests by @fxmarty in #438
- Update _base.py - Remote (.bin) model load fix by @Shades-en in #465
- make build successful on Jetson device(L4T) by @mikeshi80 in #470
- Add option to disable qigen at build by @fxmarty in #471
- Stop trying to convert a list to int in setup.py when trying to retrieve cores_info by @wemoveon2 in #474
- Only make_quant on inside_layer_modules. by @LaaZa in #479
- Add support for DeciLM models. by @LaaZa in #481
- Support for StableLM Epoch models. by @LaaZa in #444
- Add support for Mixtral models. by @LaaZa in #480
- Fix compatibility with transformers 4.36 by @fxmarty in #483
New Contributors
- @Shades-en made their first contribution in #465
- @mikeshi80 made their first contribution in #470
- @wemoveon2 made their first contribution in #474
Full Changelog: v0.5.1...v0.6.0
v0.5.1: Patch release
Mainly fixes Windows support.
What's Changed
- Update README and version following 0.5.0 release by @fxmarty in #397
- Fix windows support by @fxmarty in #407
- Fix quantize method with None mask by @fxmarty in #408
- Improve message about buffer size in exllama v1 backend by @fxmarty in #410
- Fix windows (no triton) and cpu-only support by @fxmarty in #411
- Fix workflows to use pip instead of conda by @fxmarty in #419
Full Changelog: v0.5.0...v0.5.1
v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes
Exllama v2 GPTQ kernel support
The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.
A comprehensive benchmark is available here.
CPU inference support
This is experimental.
- Add AutoGPTQ's cpu kernel. by @qwopqwop200 in #245
Loading from safetensors is now the default
Falcon, Mistral support
- Add support for Falcon as part of Transformers 4.33.0, including new Falcon 180B by @TheBloke in #326
- Add support for Mistral models. by @LaaZa in #362
Other changes and bugfixes
- Fix setuptools classifier by @fxmarty in #285
- Update install instructions by @fxmarty in #286
- Install skip qigen(windows) by @qwopqwop200 in #309
- fix model type changed after calling .to() method by @PanQiWei in #310
- Update qwen.py for Qwen-VL by @JustinLin610 in #303
- fix typo in max_input_length by @SunMarc in #311
- Use adapter_name for get_gptq_peft_model with train_mode=True by @alex4321 in #347
- Ignore unknown parameters in quantize_config.json by @z80maniac in #335
- fix bug(breaking change) remove (zeors -= 1) by @qwopqwop200 in #325
- Revert "fix bug(breaking change) remove (zeors -= 1)" by @PanQiWei in #354
- import exllama QuantLinear instead of exllamav2's in pack_model by @PanQiWei in #355
- Modify qlinear_cuda for tracing the GPTQ model by @vivekkhandelwal1 in #367
- Fix QiGen kernel generation by @fxmarty in #379
- Improve RoCm support by @fxmarty in #382
- PEFT initialization fix by @alex4321 in #361
- Pin to accelerate>=0.22 by @fxmarty in #384
- Fix overflow in exllama with act-order by @fxmarty in #386
- Default to exllama kernel when exllama v2 is disabled by @fxmarty in #387
- Error out on exllama_set_max_input_length call without exllama backend by @fxmarty in #389
- Add fix for CPU Inference by @vivekkhandelwal1 in #385
- Fix dtype issues and add relevant tests by @fxmarty in #393
- Patch accelerate to use correct dtype by @fxmarty in #394
- Fixed missing cstdint include by @kodai2199 in #388
- Update RoCm workflow to build for RoCm 5.7 by @fxmarty in #395
- Fix Windows build by @fxmarty in #396
New Contributors
- @JustinLin610 made their first contribution in #303
- @SunMarc made their first contribution in #311
- @alex4321 made their first contribution in #347
- @vivekkhandelwal1 made their first contribution in #367
- @kodai2199 made their first contribution in #388
Full Changelog: v0.4.2...v0.5.0
v0.4.2: Patch release
Major bugfix: exllama backend with arbitrary input length
This patch release includes a major bugfix to have the exllama backend work with input length > 2048 through a reconfigurable buffer size:
from auto_gptq import exllama_set_max_input_length
...
model = exllama_set_max_input_length(model, 4096)
Exllama kernels support in Windows wheels
This patch tentatively includes the exllama kernels in the wheels for Windows.
What's Changed
- Build wheels on ubuntu 20.04 by @fxmarty in #272
- Free disk space for rocm build by @fxmarty in #273
- Use focal for RoCm build by @fxmarty in #274
- Use conda incubator for rocm build by @fxmarty in #276
- Update install instructions by @fxmarty in #275
- Use --extra-index-url to resolve dependencies by @fxmarty in #277
- Fix python version for rocm build by @fxmarty in #278
- Fix powershell in workflow by @fxmarty in #284
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Patch Fix
Overview
- Fix typo so that not only pytorch==2.0.0 but also pytorch>=2.0.0 can be used for llama fused attention.
- Patch exllama QuantLinear to avoid modifying the state dict, making the integration with transformers smoother.
Change Log
What's Changed
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Overview
- New platform: support for the ROCm platform (5.4.2 for now, extending to 5.5 and 5.6 as soon as PyTorch officially releases 2.1.0).
- New kernels: support for exllama q4 kernels, giving at least a 1.3x inference speedup.
- New quantization strategy: static_groups=True can be specified at quantization time, which can further improve the quantized model's performance and narrow the PPL gap against the un-quantized model.
- New model: qwen
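A quantization configuration with the new option enabled might look like the sketch below, written as a plain dictionary in the shape of a quantize_config.json file. Only static_groups comes from the note above; the other field names (bits, group_size, desc_act, sym), all values, and the validate helper are assumptions for illustration and should be checked against the library version in use.

```python
import json

# Assumed field names and values; only static_groups is taken from the release note.
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,
    "static_groups": True,
    "sym": True,
}


def validate(config):
    """Basic sanity checks for this sketch (not AutoGPTQ's own validation)."""
    if config["bits"] not in (2, 3, 4, 8):
        raise ValueError("unsupported bit width")
    if config["group_size"] != -1 and config["group_size"] <= 0:
        raise ValueError("group_size must be -1 or a positive integer")
    return config


serialized = json.dumps(validate(quantize_config), indent=2, sort_keys=True)
print(serialized)
```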
Full Change Log
What's Changed
- Add RoCm support by @fxmarty in #214
- Fix revision used to load the quantization config by @fxmarty in #220
- [General Quant Linear] Register quant params of general quant linear for friendly post process. by @LeiWang1999 in #226
- Add exllama q4 kernel by @fxmarty in #219
- Support static groups and fix bug by @qwopqwop200 in #236
- support qwen by @qwopqwop200 in #240
New Contributors
- @fxmarty made their first contribution in #214
- @LeiWang1999 made their first contribution in #226
Full Changelog: v0.3.2...v0.4.0
v0.3.2: Patch Fix
Overview
- Fix a CUDA kernel bug that prevented desc_act and group_size from being used together
- Improve the user experience of manual installation
- Improve the user experience of loading quantized models
- Add perplexity_utils.py to calculate PPL so that results can be compared fairly with other libraries
- Remove the save_dir argument from the from_quantized method; only the model_name_or_path argument is now supported
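As a reminder of what the PPL metric measures, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on a list of per-token natural-log probabilities. It is not the code in perplexity_utils.py, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

Comparing libraries fairly requires the same tokenization, context length, and evaluation text, since each of these changes the per-token likelihoods being averaged.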
Full Change Log
What's Changed
- Fix cuda bug by @qwopqwop200 in #202
- Fix revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in #205
- Change the install script so it attempts to build the CUDA extension in all cases by @TheBloke in #206
- Add a central version number by @TheBloke in #207
- Add Safetensors metadata saving, with some values saved to each .safetensor file by @TheBloke in #208
- [FEATURE] Implement perplexity metric to compare against llama.cpp by @casperbh96 in #166
- Fix error raised when CUDA kernels are not installed by @PanQiWei in #209
- Fix build on non-CUDA machines after #206 by @casperbh96 in #212
New Contributors
- @casperbh96 made their first contribution in #166
Full Changelog: v0.3.0...v0.3.2
v0.3.0
Overview
- CUDA kernels improvement: support models whose hidden_size is only divisible by 32/64 instead of 256.
- Peft integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc.
- New models: BaiChuan, InternLM.
- Other updates: see 'Full Change Log' below for details.
Full Change Log
What's Changed
- Pytorch qlinear by @qwopqwop200 in #116
- Specify UTF-8 encoding for README.md in setup.py by @EliEron in #132
- Support cuda 64dim by @qwopqwop200 in #126
- Support 32dim by @qwopqwop200 in #125
- Peft integration by @PanQiWei in #102
- Support setting inject_fused_attention and inject_fused_mlp to False by @TheBloke in #134
- Add transpose operator when replace Conv1d with qlinear_cuda_old by @geekinglcq in #140
- Add support for BaiChuan model by @LaaZa in #164
- Fix error message by @AngainorDev in #141
- Add support for InternLM by @cczhong11 in #189
- Fix stale documentation by @MarisaKirisame in #158
New Contributors
- @EliEron made their first contribution in #132
- @geekinglcq made their first contribution in #140
- @AngainorDev made their first contribution in #141
- @cczhong11 made their first contribution in #189
- @MarisaKirisame made their first contribution in #158
Full Changelog: v0.2.1...v0.3.0