Releases: Dao-AILab/flash-attention
fa4-v4.0.0.beta18
What's Changed
- Fix SM100 FP8 fwd with cutlass-dsl >=4.5.2 (MmaF8F6F4Op) by @Johnsonms in #2640
- [cute] Fix int32 overflow in SM100 LPT tile scheduler for long context by @sryap in #2662
- [Fwd,Sm100] Tune FP8 causal hd128 ex2_emu_freq (8 vs inherited 16) by @Johnsonms in #2642
- Make q_subtile_factor default to identity by @drisspg in #2660
Full Changelog: fa4-v4.0.0.beta17...fa4-v4.0.0.beta18
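As a rough illustration of the kind of bug behind the int32 overflow fix in #2662 (not the scheduler's actual code), the sketch below simulates 32-bit signed arithmetic in Python. The helper `wrap_int32` and the row and stride values are hypothetical, chosen only so that the product exceeds the int32 range, as can happen with long-context index math.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(value: int) -> int:
    """Reinterpret an unbounded Python int as a two's-complement int32."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value

# Hypothetical sizes: a row index deep into a long sequence and a row stride
# of heads * head_dim elements. Neither value comes from the PR.
row = 600_000
row_stride = 32 * 128

exact_offset = row * row_stride            # what 64-bit math would give
int32_offset = wrap_int32(exact_offset)    # what 32-bit math would give

print(exact_offset > INT32_MAX)  # the exact offset does not fit in int32
print(int32_offset)              # the wrapped value is negative
```

The usual remedy for this class of bug is to widen the intermediate to 64 bits or to restructure the computation so that no intermediate exceeds the 32-bit range; which of these the PR chose is not stated here.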
v2.8.3.post1
fa4-v4.0.0.beta17
What's Changed
- [Triton] Fix graph capture issues and env var by @micmelesse in #2620
- [CuTe,Bwd,Sm100] allow 2cta with score mod and mask mod in bwd by @reubenconducts in #2557
- [CuTe] Fix lint failures by @drisspg in #2625
- [CuTe] Fix lint failure in flash_bwd_sm100.py by @Johnsonms in #2627
- fix: add weights_only=True to all torch.load call sites by @aryanputta in #2622
- [Cute,Sm100,Fwd] use correction warps if not tma store; remove outdated packgqa guard by @jayhshah in #2629
- Add aux-scalars to interface to enable dynamic ints and floats in expressions by @drisspg in #2616
- fix: build and select cu13.2 prebuilt wheels by @ko3n1g in #2618
- ci(fa4): enforce cutlass-dsl/quack dep floors and rebake cu130 image by @Johnsonms in #2636
New Contributors
- @aryanputta made their first contribution in #2622
Full Changelog: fa4-v4.0.0.beta16...fa4-v4.0.0.beta17
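For context on the `weights_only=True` change in #2622: `torch.load` is pickle-based, and `weights_only=True` restricts what the unpickler may construct. The sketch below shows the general idea with only the standard library, using the "restricted globals" pattern from the Python `pickle` documentation. `SafeUnpickler`, `safe_loads`, and `Exploit` are hypothetical names for this illustration and are not part of PyTorch or flash-attention.

```python
import io
import pickle

class SafeUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (class or function)."""

    def find_class(self, module, name):
        # Plain containers, numbers, and strings never reach this hook;
        # only references to importable objects do.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()

class Exploit:
    """Stand-in for a malicious payload: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("arbitrary code ran",))

benign = pickle.dumps({"weight": [1.0, 2.0]})
malicious = pickle.dumps(Exploit())

print(safe_loads(benign))        # plain data loads normally
try:
    safe_loads(malicious)
except pickle.UnpicklingError as err:
    print("blocked:", err)       # the callable is never resolved or invoked
```

A real checkpoint loader has to allow a small set of tensor-related types rather than rejecting every global, so this is deliberately stricter than what a framework would ship.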
fa4-v4.0.0.beta16
What's Changed
- Bump AITER submodule to commit 3b2e6f4 by @sstamenk in #2540
- Clamp kv_stage to avoid SMEM overflow for small head_dims on SM100 by @Johnsonms in #2594
- [CuTe,Sm100] fix: decode/prefill exp2 emulation consistency by @Luosuu in #2595
- NFC: replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #2602
- Bump nvidia-cutlass-dsl to >=4.5.2 and quack-kernels to >=0.5.0 by @Johnsonms in #2605
- [CuTe,Fwd,Sm100] refactor mla sm100 forward and add page table by @jayhshah in #2558
- ci: bump Jimver/cuda-toolkit to v0.2.35 for CUDA 13.2 support by @ko3n1g in #2617
- [ROCm] Bump Triton to >=3.6.0 and update aiter submodule by @micmelesse in #2614
Full Changelog: fa4-v4.0.0.beta15...fa4-v4.0.0.beta16
fa4-v4.0.0.beta15
What's Changed
- Wrap mask construction in a function for mask subclassing by @sryap in #2584
- Build Fix: Update abi3 tag to cp310 and minimum python version to 3.10 by @aw920h in #2532
- [Cute,Flex,Sm100] vectorized mask_mod by @reubenconducts in #2261
- [CuTe, SM103] Update architecture assertion for SM 10.x and 11.x by @ocss884 in #2572
- Include sm_110 in Blackwell-family arch gating (follow-up to #2572) by @Johnsonms in #2590
- Use is_family_of for sm_90 and sm_103 arch checks by @Johnsonms in #2589
New Contributors
- @sryap made their first contribution in #2584
- @aw920h made their first contribution in #2532
- @ocss884 made their first contribution in #2572
Full Changelog: fa4-v4.0.0.beta14...fa4-v4.0.0.beta15
fa4-v4.0.0.beta14
What's Changed
- Fix ZeroDivisionError in num_splits_heuristic for empty Q workloads by @shivam2199 in #2515
- [Cute, flex, sm90] fix sm90 flex by @geruome in #2563
- split out varlen batch search into utils by @reubenconducts in #2556
- [Cute,Sm100] allow for zero length sequences in hdim 256 kernels by @jayhshah in #2568
- Enable split-kv for blocksparse tensors by @drisspg in #2536
New Contributors
- @shivam2199 made their first contribution in #2515
Full Changelog: fa4-v4.0.0.beta13...fa4-v4.0.0.beta14
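The ZeroDivisionError fix in #2515 concerns a split-count heuristic receiving an empty Q workload. The following is a minimal sketch of that failure mode and one possible guard. `num_splits_sketch`, its parameters, and its formula are hypothetical and do not reproduce the library's `num_splits_heuristic`.

```python
def num_splits_sketch(total_q_blocks: int, num_sms: int, max_splits: int = 8) -> int:
    """Toy heuristic: split KV more when few Q blocks are available to fill the SMs."""
    if total_q_blocks <= 0:
        # Empty Q workload: nothing to schedule, so skip the division entirely.
        return 1
    # Without the guard above, this line raises ZeroDivisionError for 0 blocks.
    splits = num_sms // total_q_blocks
    return max(1, min(max_splits, splits))

print(num_splits_sketch(0, 132))    # empty workload falls back to one split
print(num_splits_sketch(50, 132))   # a few blocks: some splitting
print(num_splits_sketch(500, 132))  # plenty of blocks: no splitting needed
```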
fa4-v4.0.0.beta13
What's Changed
- [ROCm Windows] fix build failure by @Apophis3158 in #2519
- [CuTe,Bwd,Sm100] don't disable 2cta due to cuda 12 in bwd by @reubenconducts in #2543
- [CuTe,Bwd] guard softcap for varlen backward by @reubenconducts in #2544
- [CuTe,Flex] varlen blocksparsity by @reubenconducts in #2224
- [FA4][hd256] Fix layout of non-contiguous qkv in backward kernel by @wangsiyu in #2545
- [Cute,Bwd,Sm100] fix incorrect calculation of n_block global max for bwd deterministic by @jayhshah in #2549
- fix varlen w/ paging split kv bug by @liangel-02 in #2550
New Contributors
- @Apophis3158 made their first contribution in #2519
Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13
fa4-v4.0.0.beta12
What's Changed
- Fix long MSVC linker commands on Windows by @jammm in #2517
- Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
- [Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach in by @v0i0 in #2510
- benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
- [FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
- [hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
- Deterministic backward for blocksparse impl by @drisspg in #2253
Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12
fa4-v4.0.0.beta11
What's Changed
- Feat([FA4][CUTE DSL]) Add head_dim=256 support (forward + backward) by @wangsiyu in #2412
- [Cute,hd256] Post-merge cleanup: dead code, duplicate imports by @Johnsonms in #2487
- [CuTe,Flex] Wire up interface for flex autograd support by @reubenconducts in #2485
- [CuTe,Flex] Add score_mod_bwd param to flash_attn_varlen_func by @reubenconducts in #2496
- fix: typos and missing comments in FA4 cute kernel files by @dxasu in #2502
- [SM100] Guard gO None in empty-tile correction by @geruome in #2504
- [CuTe, Flex] simplify blocksparse interface in flash_attn_func by @reubenconducts in #2506
- Fix: pass `stream` to SM100 MLA kernel by @MatthewBonanni in #2505
- Fix clc scheduling request bug by @drisspg in #2508
- [Tests,MLA] Close coverage gaps in test_flash_attn_mla_absorbed by @Johnsonms in #2483
- Add cache utils logging test by @drisspg in #2509
- [hd256] Improve forward kernel with exp2 FMA emulation (3% to 9% performance gain) by @Johnsonms in #2488
- SM90 FA4 QuACK 0.4 Compatibility by @EduardDurech in #2513
- ci: use /tmp for apptainer tmpdir to fix xattrerror on VAST by @Johnsonms in #2511
New Contributors
- @wangsiyu made their first contribution in #2412
- @dxasu made their first contribution in #2502
- @EduardDurech made their first contribution in #2513
Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11
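Several entries in these notes mention exp2 emulation (for example #2488). As a rough sketch of the general technique, 2^x can be computed without a hardware exponential unit by splitting x into an integer part and a small fractional part, evaluating a polynomial on the fraction with Horner steps (each of which maps to one fused multiply-add), and scaling by the integer power of two. The degree-5 Taylor coefficients below are an assumption made for illustration; the polynomial used by the actual kernels is not stated here.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients of 2**f = exp(f * ln 2), highest degree first.
# Degree 5 is an assumed choice for this sketch.
COEFFS = [LN2**k / math.factorial(k) for k in range(5, -1, -1)]

def exp2_emulated(x: float) -> float:
    n = round(x)              # integer part, handled by exponent scaling
    f = x - n                 # fractional part in [-0.5, 0.5]
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * f + c     # one Horner step, i.e. one FMA on hardware
    return math.ldexp(acc, n) # acc * 2**n

for x in (-10.3, 0.25, 3.7):
    print(x, exp2_emulated(x), 2.0**x)
```

Reducing the argument to [-0.5, 0.5] keeps the polynomial short for a given accuracy. A tuned kernel would more likely use minimax coefficients and bit manipulation of the exponent field instead of `ldexp`.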
fa4-v4.0.0.beta10
What's Changed
- Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
- Add CLC scheduler heuristic by @drisspg in #2455
- expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
- [Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
- Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
- Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
- [Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
- fix causal calcs by @drisspg in #2463
- [cute,bwd] fix PDL race in bwd_preprocess, which was corrupting dpsum on SM90+ by @geruome in #2481
New Contributors
- @dcw02 made their first contribution in #2109
- @hsyysy made their first contribution in #2476
- @geruome made their first contribution in #2481
Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10