Releases · Dao-AILab/flash-attention

17 Jun 09:45

github-actions

fa4-v4.0.0.beta18

d16e381

fa4-v4.0.0.beta18 Pre-release

What's Changed

Fix SM100 FP8 fwd with cutlass-dsl >=4.5.2 (MmaF8F6F4Op) by @Johnsonms in #2640
[cute] Fix int32 overflow in SM100 LPT tile scheduler for long context by @sryap in #2662
[Fwd,Sm100] Tune FP8 causal hd128 ex2_emu_freq (8 vs inherited 16) by @Johnsonms in #2642
Make q_subtile_factor default to identity by @drisspg in #2660

Full Changelog: fa4-v4.0.0.beta17...fa4-v4.0.0.beta18
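
To illustrate the class of bug addressed by #2662: a count or index that fits in 32 bits at ordinary sequence lengths can exceed 2^31 - 1 once the context grows. The release note does not say which quantity overflowed, so the tile-pair count, the 128-wide tiles, and every helper name below are hypothetical stand-ins. The sketch only simulates int32 wraparound in Python; it is not the scheduler's code.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce a Python int to what a two's-complement int32 would hold."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def fits_int32(value: int) -> bool:
    return INT32_MIN <= value <= INT32_MAX


def num_tiles(seqlen: int, tile: int) -> int:
    """Ceiling division: number of tiles needed to cover seqlen."""
    return -(-seqlen // tile)


def total_tile_pairs(batch, heads, seqlen_q, seqlen_k, tile_m=128, tile_n=128):
    """Hypothetical work count: (query tile, key tile) pairs over all heads."""
    return batch * heads * num_tiles(seqlen_q, tile_m) * num_tiles(seqlen_k, tile_n)


short = total_tile_pairs(1, 32, 8_192, 8_192)
long_ctx = total_tile_pairs(1, 32, 1_048_576, 1_048_576)

print(short, fits_int32(short))
print(long_ctx, fits_int32(long_ctx), wrap_int32(long_ctx))
```

With these assumed sizes the long-context count is exactly 2^31, one past the largest int32, so a 32-bit accumulator would wrap to a negative value. Widening the intermediate to 64 bits, or restructuring the arithmetic so no intermediate exceeds the range, are the usual remedies.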

Contributors

Johnsonms, sryap, and drisspg

Assets 4

10 Jun 13:32

github-actions

v2.8.3.post1

a8aa52b

v2.8.3.post1 Latest

Assets 52

flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_aarch64.whl
sha256:6cfb6cc0b224355363a060d1e34288d14f25311ca0a8e0e9003347722ee43b5b
233 MB, 2026-06-11T07:12:25Z

flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
sha256:14609b65ebddd28d5087434b01019c9be093c92d44fe95def97b9f28905081dc
233 MB, 2026-06-11T13:36:32Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
sha256:63d73b795e3f9a07a7cfd5039d41cf02b2f72877191d6ac331c8729b6976d082
244 MB, 2026-06-10T18:33:45Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
sha256:d6683965bc0b230eff22c5c154bb39639c1bb39a18488943ef9fccde200e020c
244 MB, 2026-06-10T23:56:56Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
sha256:72f0fbf4671f1e9b496450d29ac0cab92c1c93c0abd88b819444021e87afa78d
244 MB, 2026-06-10T23:36:08Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
sha256:83abb9f78260ebdc5b059623ccbae4928394acd805f0ffb3e5d9bf9045751f00
244 MB, 2026-06-11T00:20:39Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
sha256:4015d80ec6a0cd9f14d2140ae536d25adf567872b845a1507c5b9ac2b26b7ab5
244 MB, 2026-06-10T23:49:08Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
sha256:ce059795d96a9b145c9a4e499899fb5a6109127d0639677d05204839caa04251
244 MB, 2026-06-11T04:27:16Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
sha256:4198fccfddfb25b22538f593e556e4915705c18e36133cc235df4b9e9d48763e
244 MB, 2026-06-10T18:22:22Z

flash_attn-2.8.3.post1+cu12torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
sha256:f9cf8a1a779ca468ca11ef7d2eb90c72db20400fb5072558f76086cb8326a8fa
244 MB, 2026-06-10T16:51:39Z

Source code (zip), 2026-06-10T13:31:58Z
Source code (tar.gz), 2026-06-10T13:31:58Z
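
The asset names above encode the package version, CUDA major version, PyTorch version, C++11 ABI flag, Python tag, and platform, and each asset is listed with a SHA-256 digest. A minimal sketch for picking a wheel and checking a download follows. The regular expression is inferred from the names on this page rather than from an official naming specification, and `parse_wheel_name` and `sha256_matches` are hypothetical helper names.

```python
import hashlib
import re

# Pattern inferred from the asset names listed on this page (an assumption,
# not an official specification of the naming scheme).
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[^+]+)"
    r"\+cu(?P<cuda>[\d.]+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<cxx11abi>TRUE|FALSE)"
    r"-(?P<python>cp\d+)-(?P<abi>cp\d+)-(?P<platform>.+)\.whl"
)


def parse_wheel_name(name: str) -> dict:
    """Split a release asset name into its build components."""
    match = WHEEL_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {name}")
    return match.groupdict()


def sha256_matches(path: str, expected: str) -> bool:
    """Compare a file's SHA-256 digest with a 'sha256:<hex>' or bare hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so a wheel of a few hundred MB is never held in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.removeprefix("sha256:").lower()


info = parse_wheel_name(
    "flash_attn-2.8.3.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
)
print(info)
```

After downloading, `sha256_matches(path, "sha256:63d7...")` with the digest copied from the listing should return `True`; a mismatch suggests a truncated or altered file.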

10 Jun 09:23

github-actions

fa4-v4.0.0.beta17

fb02fc8

fa4-v4.0.0.beta17 Pre-release

What's Changed

[Triton] Fix graph capture issues and env var by @micmelesse in #2620
[CuTe,Bwd,Sm100] allow 2cta with score mod and mask mod in bwd by @reubenconducts in #2557
[CuTe] Fix lint failures by @drisspg in #2625
[CuTe] Fix lint failure in flash_bwd_sm100.py by @Johnsonms in #2627
fix: add weights_only=True to all torch.load call sites by @aryanputta in #2622
[Cute,Sm100,Fwd] use correction warps if not tma store; remove outdated packgqa guard by @jayhshah in #2629
Add aux-scalars to interface to enable dynamic ints and floats in expressions by @drisspg in #2616
fix: build and select cu13.2 prebuilt wheels by @ko3n1g in #2618
ci(fa4): enforce cutlass-dsl/quack dep floors and rebake cu130 image by @Johnsonms in #2636

New Contributors

@aryanputta made their first contribution in #2622

Full Changelog: fa4-v4.0.0.beta16...fa4-v4.0.0.beta17
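
Some background on #2622: `torch.load` is built on Python's pickle format, and `weights_only=True` restricts the unpickler to tensors, primitive types, dictionaries, and explicitly allow-listed types, so a checkpoint cannot trigger arbitrary callables while loading. The standard-library sketch below shows the same idea in miniature. It is not PyTorch's implementation, and the allow-list and the names `RestrictedUnpickler` and `restricted_loads` are assumptions made for illustration.

```python
import collections
import io
import pickle

# Hypothetical allow-list for illustration; PyTorch maintains its own.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global (class or function).
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes):
    """Unpickle data, refusing any global outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stands in for a malicious object: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("side effect during load",))


state = collections.OrderedDict(weight=[1.0, 2.0], bias=[0.5])
print(restricted_loads(pickle.dumps(state)))

try:
    restricted_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

Plain containers and numbers load without consulting `find_class` at all, which is why only the referenced globals need to be policed.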

Contributors

Johnsonms, micmelesse, and 5 other contributors

Assets 4

03 Jun 09:45

github-actions

fa4-v4.0.0.beta16

b02b07e

fa4-v4.0.0.beta16 Pre-release

What's Changed

Bump AITER submodule to commit 3b2e6f4 by @sstamenk in #2540
Clamp kv_stage to avoid SMEM overflow for small head_dims on SM100 by @Johnsonms in #2594
[CuTe,Sm100] fix: decode/prefill exp2 emulation consistency by @Luosuu in #2595
NFC: replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #2602
Bump nvidia-cutlass-dsl to >=4.5.2 and quack-kernels to >=0.5.0 by @Johnsonms in #2605
[CuTe,Fwd,Sm100] refactor mla sm100 forward and add page table by @jayhshah in #2558
ci: bump Jimver/cuda-toolkit to v0.2.35 for CUDA 13.2 support by @ko3n1g in #2617
[ROCm] Bump Triton to >=3.6.0 and update aiter submodule by @micmelesse in #2614

New Contributors

@sstamenk made their first contribution in #2540

Full Changelog: fa4-v4.0.0.beta15...fa4-v4.0.0.beta16
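
Since #2605 raises the dependency floors to `nvidia-cutlass-dsl >= 4.5.2` and `quack-kernels >= 0.5.0`, a quick environment check can be useful before upgrading. The sketch below uses only the standard library; `release_tuple`, `meets_floor`, and `check_installed` are hypothetical helpers, and the comparison deliberately ignores pre-release and post-release suffixes, so a full version parser is the better choice when those matter.

```python
from importlib import metadata

# Floors taken from the release note above.
FLOORS = {"nvidia-cutlass-dsl": "4.5.2", "quack-kernels": "0.5.0"}


def release_tuple(version: str) -> tuple:
    """Leading numeric components only; suffixes such as '.post1' are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """Numeric comparison, so '4.10.0' correctly ranks above '4.5.2'."""
    a, b = release_tuple(installed), release_tuple(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(floors=FLOORS) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, floor in floors.items():
        try:
            report[name] = meets_floor(metadata.version(name), floor)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_installed())
```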

Contributors

Johnsonms, micmelesse, and 5 other contributors

Assets 4

27 May 09:18

github-actions

fa4-v4.0.0.beta15

6c4f74f

fa4-v4.0.0.beta15 Pre-release

What's Changed

Wrap mask construction in a function for mask subclassing by @sryap in #2584
Build Fix: Update abi3 tag to cp310 and minimum python version to 3.10 by @aw920h in #2532
[Cute,Flex,Sm100] vectorized mask_mod by @reubenconducts in #2261
[CuTe, SM103] Update architecture assertion for SM 10.x and 11.x by @ocss884 in #2572
Include sm_110 in Blackwell-family arch gating (follow-up to #2572) by @Johnsonms in #2590
Use is_family_of for sm_90 and sm_103 arch checks by @Johnsonms in #2589

New Contributors

@sryap made their first contribution in #2584
@aw920h made their first contribution in #2532
@ocss884 made their first contribution in #2572

Full Changelog: fa4-v4.0.0.beta14...fa4-v4.0.0.beta15

Contributors

Johnsonms, sryap, and 3 other contributors

Assets 4

20 May 09:16

github-actions

fa4-v4.0.0.beta14

4178915

fa4-v4.0.0.beta14 Pre-release

What's Changed

Fix ZeroDivisionError in num_splits_heuristic for empty Q workloads by @shivam2199 in #2515
[Cute, flex, sm90] fix sm90 flex by @geruome in #2563
split out varlen batch search into utils by @reubenconducts in #2556
[Cute,Sm100] allow for zero length sequences in hdim 256 kernels by @jayhshah in #2568
Enable split-kv for blocksparse tensors by @drisspg in #2536

New Contributors

@shivam2199 made their first contribution in #2515

Full Changelog: fa4-v4.0.0.beta13...fa4-v4.0.0.beta14

Contributors

jayhshah, drisspg, and 3 other contributors

Assets 4

13 May 09:10

github-actions

fa4-v4.0.0.beta13

9bad4be

fa4-v4.0.0.beta13 Pre-release

What's Changed

[ROCm Windows] fix build failed by @Apophis3158 in #2519
[CuTe,Bwd,Sm100] don't disable 2cta due to cuda 12 in bwd by @reubenconducts in #2543
[CuTe,Bwd] guard softcap for varlen backward by @reubenconducts in #2544
[CuTe,Flex] varlen blocksparsity by @reubenconducts in #2224
[FA4][hd256] Fix layout of non-contiguous qkv in backward kernel by @wangsiyu in #2545
[Cute,Bwd,Sm100] fix incorrect calculation of n_block global max for bwd deterministic by @jayhshah in #2549
fix varlen w/ paging split kv bug by @liangel-02 in #2550

New Contributors

@Apophis3158 made their first contribution in #2519

Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13

Contributors

wangsiyu, jayhshah, and 3 other contributors

Assets 4

06 May 08:57

github-actions

fa4-v4.0.0.beta12

2e53092

fa4-v4.0.0.beta12 Pre-release

What's Changed

Fix long MSVC linker commands on Windows by @jammm in #2517
Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
[Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach in by @v0i0 in #2510
benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
[FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
[hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
Deterministic backward for blocksparse impl by @drisspg in #2253

New Contributors

@jammm made their first contribution in #2517

Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12

Contributors

jammm, v0i0, and 3 other contributors

Assets 4

29 Apr 08:53

github-actions

fa4-v4.0.0.beta11

ba59def

fa4-v4.0.0.beta11 Pre-release

What's Changed

Feat([FA4][CUTE DSL]) Add head_dim=256 support (forward + backward) by @wangsiyu in #2412
[Cute,hd256] Post-merge cleanup: dead code, duplicate imports by @Johnsonms in #2487
[CuTe,Flex] Wire up interface for flex autograd support by @reubenconducts in #2485
[CuTe,Flex] Add score_mod_bwd param to flash_attn_varlen_func by @reubenconducts in #2496
fix: typos and missing comments in FA4 cute kernel files by @dxasu in #2502
[SM100] Guard gO None in empty-tile correction by @geruome in #2504
[CuTe, Flex] simplify blocksparse interface in flash_attn_func by @reubenconducts in #2506
Fix: pass stream to SM100 MLA kernel by @MatthewBonanni in #2505
Fix clc scheduling request bug by @drisspg in #2508
[Tests,MLA] Close coverage gaps in test_flash_attn_mla_absorbed by @Johnsonms in #2483
Add cache utils logging test by @drisspg in #2509
[hd256] Improve forward kernel with exp2 FMA emulation (3% to 9% performance gain) by @Johnsonms in #2488
SM90 FA4 QuACK 0.4 Compatibility by @EduardDurech in #2513
ci: use /tmp for apptainer tmpdir to fix xattrerror on VAST by @Johnsonms in #2511

New Contributors

@wangsiyu made their first contribution in #2412
@dxasu made their first contribution in #2502
@EduardDurech made their first contribution in #2513

Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11
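
Several entries in these notes refer to exp2 emulation (#2488 here, and the `ex2_emu_freq` tuning in later betas). The notes do not describe the scheme, so the following is only a generic sketch of how 2^x can be evaluated with multiply-adds instead of a dedicated exponential unit: split x into an integer and a fractional part, approximate 2^f with a polynomial in Horner form, then scale by the integer power of two. The degree-5 Taylor coefficients and the name `exp2_emulated` are assumptions chosen for readability, not the kernel's actual constants.

```python
import math

LN2 = math.log(2.0)

# Taylor coefficients of 2**f = exp(f * ln 2) up to degree 5 (an assumption;
# a tuned kernel would more likely use minimax coefficients).
COEFFS = [LN2**k / math.factorial(k) for k in range(6)]


def exp2_emulated(x: float) -> float:
    """Approximate 2**x using range reduction and a Horner polynomial."""
    n = math.floor(x)      # integer part, applied exactly via the exponent
    f = x - n              # fractional part in [0, 1)
    acc = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        acc = acc * f + c  # each step maps to one multiply-add
    return math.ldexp(acc, n)


for x in (-3.25, 0.0, 0.5, 7.9):
    approx, exact = exp2_emulated(x), 2.0**x
    print(f"x={x:6.2f}  approx={approx:.6f}  rel_err={abs(approx - exact) / exact:.2e}")
```

The trade-off in any such scheme is polynomial degree against accuracy: a higher degree costs more multiply-adds per element but shrinks the error over the reduced range.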

Contributors

wangsiyu, Johnsonms, and 6 other contributors

Assets 4

22 Apr 08:43

github-actions

fa4-v4.0.0.beta10

3a7694c

fa4-v4.0.0.beta10 Pre-release

What's Changed

Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
Add CLC scheduler heuristic by @drisspg in #2455
expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
[Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
[Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
fix causal calcs by @drisspg in #2463
[cute,bwd] fix PDL race in bwd_preprocess, which was corrupting dpsum on SM90+ by @geruome in #2481

New Contributors

@dcw02 made their first contribution in #2109
@hsyysy made their first contribution in #2476
@geruome made their first contribution in #2481

Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10

Contributors

Johnsonms, hsyysy, and 4 other contributors

Assets 4
