Releases: Dao-AILab/flash-attention

fa4-v4.0.0.beta18

17 Jun 09:45
d16e381

Pre-release

What's Changed

  • Fix SM100 FP8 fwd with cutlass-dsl >=4.5.2 (MmaF8F6F4Op) by @Johnsonms in #2640
  • [cute] Fix int32 overflow in SM100 LPT tile scheduler for long context by @sryap in #2662
  • [Fwd,Sm100] Tune FP8 causal hd128 ex2_emu_freq (8 vs inherited 16) by @Johnsonms in #2642
  • Make q_subtile_factor default to identity by @drisspg in #2660

Full Changelog: fa4-v4.0.0.beta17...fa4-v4.0.0.beta18

v2.8.3.post1

10 Jun 13:32
v2.8.3.post1
a8aa52b


fa4-v4.0.0.beta17

10 Jun 09:23
fb02fc8

Pre-release

What's Changed

  • [Triton] Fix graph capture issues and env var by @micmelesse in #2620
  • [CuTe,Bwd,Sm100] allow 2cta with score mod and mask mod in bwd by @reubenconducts in #2557
  • [CuTe] Fix lint failures by @drisspg in #2625
  • [CuTe] Fix lint failure in flash_bwd_sm100.py by @Johnsonms in #2627
  • fix: add weights_only=True to all torch.load call sites by @aryanputta in #2622
  • [Cute,Sm100,Fwd] use correction warps if not tma store; remove outdated packgqa guard by @jayhshah in #2629
  • Add aux-scalars to interface to enable dynamic ints and floats in expressions by @drisspg in #2616
  • fix: build and select cu13.2 prebuilt wheels by @ko3n1g in #2618
  • ci(fa4): enforce cutlass-dsl/quack dep floors and rebake cu130 image by @Johnsonms in #2636

Full Changelog: fa4-v4.0.0.beta16...fa4-v4.0.0.beta17

fa4-v4.0.0.beta16

03 Jun 09:45
b02b07e

Pre-release

What's Changed

  • Bump AITER submodule to commit 3b2e6f4 by @sstamenk in #2540
  • Clamp kv_stage to avoid SMEM overflow for small head_dims on SM100 by @Johnsonms in #2594
  • [CuTe,Sm100] fix: decode/prefill exp2 emulation consistency by @Luosuu in #2595
  • NFC: replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #2602
  • Bump nvidia-cutlass-dsl to >=4.5.2 and quack-kernels to >=0.5.0 by @Johnsonms in #2605
  • [CuTe,Fwd,Sm100] refactor mla sm100 forward and add page table by @jayhshah in #2558
  • ci: bump Jimver/cuda-toolkit to v0.2.35 for CUDA 13.2 support by @ko3n1g in #2617
  • [ROCm] Bump Triton to >=3.6.0 and update aiter submodule by @micmelesse in #2614

Full Changelog: fa4-v4.0.0.beta15...fa4-v4.0.0.beta16

fa4-v4.0.0.beta15

27 May 09:18
6c4f74f

Pre-release

What's Changed

  • Wrap mask construction in a function for mask subclassing by @sryap in #2584
  • Build Fix: Update abi3 tag to cp310 and minimum python version to 3.10 by @aw920h in #2532
  • [Cute,Flex,Sm100] vectorized mask_mod by @reubenconducts in #2261
  • [CuTe, SM103] Update architecture assertion for SM 10.x and 11.x by @ocss884 in #2572
  • Include sm_110 in Blackwell-family arch gating (follow-up to #2572) by @Johnsonms in #2590
  • Use is_family_of for sm_90 and sm_103 arch checks by @Johnsonms in #2589

Full Changelog: fa4-v4.0.0.beta14...fa4-v4.0.0.beta15

fa4-v4.0.0.beta14

20 May 09:16
4178915

Pre-release

Full Changelog: fa4-v4.0.0.beta13...fa4-v4.0.0.beta14

fa4-v4.0.0.beta13

13 May 09:10
9bad4be

Pre-release

Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13

fa4-v4.0.0.beta12

06 May 08:57
2e53092

Pre-release

What's Changed

  • Fix long MSVC linker commands on Windows by @jammm in #2517
  • Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
  • [Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach in by @v0i0 in #2510
  • benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
  • [FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
  • [hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
  • Deterministic backward for blocksparse impl by @drisspg in #2253

Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12

fa4-v4.0.0.beta11

29 Apr 08:53
ba59def

Pre-release

Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11

fa4-v4.0.0.beta10

22 Apr 08:43
3a7694c

Pre-release

What's Changed

  • Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
  • Add CLC scheduler heuristic by @drisspg in #2455
  • expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
  • [Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
  • Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
  • Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
  • [Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
  • fix causal calcs by @drisspg in #2463
  • [cute,bwd] fix PDL race in bwd_preprocess, which was corrupting dpsum on SM90+ by @geruome in #2481

Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10