CANN/catlass: Copy Gm To L1 Module Overview

catlass is the operator template library of CANN. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project URL: https://gitcode.com/cann/catlass

Code location

Overview

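The copy_gm_to_l1 module provides template classes that move a tile from GM (Global Memory) to L1 (Local Memory), with support for conversion between several data layouts. The implementation comes in two sets, selected by architecture: one for Atlas A2 (Arch::AtlasA2) and one for Ascend 950 (Arch::Ascend950).

The module also exposes two API styles: a non-TLA style that operates directly on LocalTensor / GlobalTensor, and a TLA style that wraps tensors in tla::Tensor.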

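API list

| Component | Style | Supported hardware | Description |
|---|---|---|---|
| CopyGmToL1 | non-TLA | AtlasA2 / Ascend950 | Basic GM→L1 copy template; supports several layout conversions |
| CopyGmToL1IntervalDataCopy | non-TLA | AtlasA2 | Row-by-row / column-by-column copy built on strided DataCopy; suited to short-wide / tall-narrow blocks |
| CopyGmToL1GMMPTD | non-TLA | AtlasA2 / Ascend950 | Copy dedicated to the GMM PTD scenario; includes a single-row optimization and a manual-stride interface |
| CopyGmToL1DynamicOptimized | non-TLA | AtlasA2 / Ascend950 | Picks the copy strategy at runtime (strided DataCopy for small matrices, Nd2Nz for large ones) |
| TileCopyTla | TLA | AtlasA2 / Ascend950 | TLA-style GM→L1 copy; the tla::Tensor wrapper simplifies the call |
| TileCopyTlaExt | TLA | AtlasA2 | Extended TLA copy; supports partial copies via ActualShape and Padding layouts |
| TileCopySparseTla | TLA | AtlasA2 | GM→L1 copy for Sparse GEMM; supports RowMajor/ColumnMajor/zN/nZ→zN/nZ |
| TileCopyFAQTla | TLA | AtlasA2 | FlashAttention LoadQ copy; supports 3D multi-matrix GM→L1 zN conversion |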

Supported hardware

| Hardware | Arch tag | ARCH macro | Supported non-TLA templates | Supported TLA templates |
|---|---|---|---|---|
| Atlas A2 | Arch::AtlasA2 | CATLASS_ARCH == 2201 | CopyGmToL1 / CopyGmToL1IntervalDataCopy / CopyGmToL1GMMPTD / CopyGmToL1DynamicOptimized | TileCopyTla / TileCopyTlaExt |
| Ascend 950 | Arch::Ascend950 | CATLASS_ARCH == 3510 | CopyGmToL1 / CopyGmToL1GMMPTD / CopyGmToL1DynamicOptimized | TileCopyTla |
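
The hardware table can be read as a compile-time capability map keyed on the CATLASS_ARCH value. The following is a minimal host-side sketch of that idea. The predicate names (IsAtlasA2, SupportsCommonTemplates, and so on) and the fallback #define are hypothetical and exist only so the snippet compiles outside the catlass build; they are not catlass APIs.

```cpp
#include <cassert>

// Host-side fallback so the sketch compiles outside the catlass build.
// Assumption: a real build supplies CATLASS_ARCH itself.
#ifndef CATLASS_ARCH
#define CATLASS_ARCH 2201
#endif

// Hypothetical predicates mirroring the two table rows.
constexpr bool IsAtlasA2(int arch) { return arch == 2201; }
constexpr bool IsAscend950(int arch) { return arch == 3510; }

// Listed for both architectures: CopyGmToL1, CopyGmToL1GMMPTD,
// CopyGmToL1DynamicOptimized, TileCopyTla.
constexpr bool SupportsCommonTemplates(int arch) {
    return IsAtlasA2(arch) || IsAscend950(arch);
}

// Listed for Atlas A2 only: CopyGmToL1IntervalDataCopy, TileCopyTlaExt.
constexpr bool SupportsA2OnlyTemplates(int arch) { return IsAtlasA2(arch); }

// Returns the Arch tag named in the table for a CATLASS_ARCH value.
inline const char* ArchTagName(int arch) {
    assert(SupportsCommonTemplates(arch));  // only the two listed values are handled
    return IsAtlasA2(arch) ? "Arch::AtlasA2" : "Arch::Ascend950";
}

static_assert(SupportsCommonTemplates(CATLASS_ARCH),
              "CATLASS_ARCH not covered by this sketch");
```

A guard of this kind fails at compile time when a template that is listed for Atlas A2 only is requested on another architecture, which is cheaper to diagnose than a missing specialization deep inside a kernel.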

API usage examples

Non-TLA style (CopyGmToL1)

#include "catlass/gemm/tile/copy_gm_to_l1.hpp"

using namespace Catlass::Gemm::Tile;
// Note: layout::, Gemm:: and Arch:: below are assumed to resolve inside namespace Catlass.

using LayoutTagSrc = layout::RowMajor;
using LayoutTagDst = layout::zN;
using ElementSrc = half;
using ElementDst = half;

// RowMajor data in GM (matrix A)
using GmType = Gemm::GemmType<ElementSrc, LayoutTagSrc>;
// zN data in L1
using L1Type = Gemm::GemmType<ElementDst, LayoutTagDst, AscendC::TPosition::A1>;

uint32_t row = 256;
uint32_t col = 256;

// Build the RowMajor layout in GM
auto layoutSrc = LayoutTagSrc::MakeLayout<ElementSrc>(row, col);
// Build the zN layout in L1
auto layoutDst = LayoutTagDst::MakeLayout<ElementDst>(row, col);

// Tensors are only declared here; binding them to GM / L1 buffers is omitted.
AscendC::GlobalTensor<ElementSrc> srcTensor;
AscendC::LocalTensor<ElementDst> dstTensor;

// Instantiate and call
using CopyOp = CopyGmToL1<Arch::AtlasA2, GmType, L1Type>;
CopyOp copyOp;
copyOp(dstTensor, srcTensor, layoutDst, layoutSrc);
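
To make the RowMajor→zN conversion concrete, here is a host-side, element-by-element reference of what such a copy is expected to produce. It is a sketch under stated assumptions, not the catlass implementation: it assumes the usual fractal convention (row-major inside a fractal, column-major across fractals), a fractal of 16 rows by 16 columns for 16-bit elements, and uses uint16_t as a stand-in for half. The names ZnOffset and RowMajorToZn are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed fractal geometry for 16-bit elements: 16 rows x 16 columns.
constexpr std::size_t kFractalRows = 16;
constexpr std::size_t kC0 = 16;

constexpr std::size_t RoundUp(std::size_t v, std::size_t align) {
    return (v + align - 1) / align * align;
}

// Offset of logical element (r, c) in a zN buffer holding a matrix with `rows` rows:
// row-major inside a fractal, column-major across fractals.
constexpr std::size_t ZnOffset(std::size_t r, std::size_t c, std::size_t rows) {
    return (c / kC0) * (RoundUp(rows, kFractalRows) * kC0) + r * kC0 + (c % kC0);
}

// Element-wise host reference for a RowMajor -> zN copy; padding stays zero.
inline std::vector<std::uint16_t> RowMajorToZn(const std::vector<std::uint16_t>& src,
                                               std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<std::uint16_t> dst(RoundUp(rows, kFractalRows) * RoundUp(cols, kC0), 0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[ZnOffset(r, c, rows)] = src[r * cols + c];
        }
    }
    return dst;
}
```

A reference like this is mainly useful for checking device results in a unit test: copy a known pattern through the kernel, read L1 back, and compare against the host buffer.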

TLA style (TileCopyTla)

#include "catlass/gemm/tile/tile_copy_tla.hpp"
#include "tla/tensor.hpp"

using namespace Catlass::Gemm::Tile;

const uint32_t M = 256;
const uint32_t K = 256;

// Create the layouts with tla::MakeLayout (Shape/Stride are deduced from LayoutTag + Element + dimensions)
auto layoutSrc = tla::MakeLayout<half, layout::RowMajor>(M, K);
auto layoutDst = tla::MakeLayout<half, layout::zN>(M, K);

// Build the TLA tensors with tla::MakeTensor
AscendC::GlobalTensor<half> srcGmTensor;
AscendC::LocalTensor<half> dstL1Tensor;
auto srcTensor = tla::MakeTensor(srcGmTensor, layoutSrc, Arch::PositionGM{});
auto dstTensor = tla::MakeTensor(dstL1Tensor, layoutDst, Arch::PositionL1{});

// Instantiate and call
TileCopyTla<Arch::AtlasA2, decltype(srcTensor), decltype(dstTensor)> copyOp;
copyOp(dstTensor, srcTensor);
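
The comment about deducing Shape/Stride from a layout tag, an element type and the dimensions can be illustrated with a small host-side model. This is only a sketch of the idea: LayoutSketch, RowMajorTag, ZnTag and MakeLayoutSketch are hypothetical stand-ins, not TLA types, and the fractal geometry (16 rows, 32-byte fractal rows) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct RowMajorTag {};  // stand-in for layout::RowMajor
struct ZnTag {};        // stand-in for layout::zN

// Each logical dimension is split into (inner, outer) extents with matching strides.
struct LayoutSketch {
    std::size_t rowShape[2], rowStride[2];
    std::size_t colShape[2], colStride[2];

    std::size_t Offset(std::size_t r, std::size_t c) const {
        return (r % rowShape[0]) * rowStride[0] + (r / rowShape[0]) * rowStride[1] +
               (c % colShape[0]) * colStride[0] + (c / colShape[0]) * colStride[1];
    }
};

constexpr std::size_t CeilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

template <class Element>
LayoutSketch MakeLayoutImpl(RowMajorTag, std::size_t rows, std::size_t cols) {
    // Inner extents of 1 mean the dimension is not split.
    return LayoutSketch{{1, rows}, {0, cols}, {1, cols}, {0, 1}};
}

template <class Element>
LayoutSketch MakeLayoutImpl(ZnTag, std::size_t rows, std::size_t cols) {
    constexpr std::size_t fractalRows = 16;           // assumed fractal height
    constexpr std::size_t c0 = 32 / sizeof(Element);  // assumed 32-byte fractal row
    const std::size_t rowFractals = CeilDiv(rows, fractalRows);
    return LayoutSketch{{fractalRows, rowFractals},
                        {c0, fractalRows * c0},
                        {c0, CeilDiv(cols, c0)},
                        {1, rowFractals * fractalRows * c0}};
}

// Tag + element type + dimensions in, shape and stride out.
template <class Element, class LayoutTag>
LayoutSketch MakeLayoutSketch(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    return MakeLayoutImpl<Element>(LayoutTag{}, rows, cols);
}
```

Because the element type determines the inner column extent, the same tag yields different strides for 16-bit and 32-bit elements, which is why the element type appears as a template argument next to the tag.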

TLA style (TileCopyTlaExt)

#include "catlass/gemm/tile/tile_copy_tla.hpp"
#include "tla/tensor.hpp"

using namespace Catlass::Gemm::Tile;

const uint32_t M = 256;
const uint32_t K = 256;
const uint32_t actualM = 128;
const uint32_t actualK = 128;

// Create the layouts with tla::MakeLayout
auto layoutSrc = tla::MakeLayout<half, layout::RowMajor>(M, K);
auto layoutDst = tla::MakeLayout<half, layout::zN>(M, K);

// Build the TLA tensors with tla::MakeTensor
AscendC::GlobalTensor<half> srcGmTensor;
AscendC::LocalTensor<half> dstL1Tensor;
auto srcTensor = tla::MakeTensor(srcGmTensor, layoutSrc, Arch::PositionGM{});
auto dstTensor = tla::MakeTensor(dstL1Tensor, layoutDst, Arch::PositionL1{});

// Instantiate TileCopyTlaExt (LayoutTagSrc/LayoutTagDst select the copy strategy, independently of the tensors' layouts)
TileCopyTlaExt<Arch::AtlasA2,
    decltype(srcTensor), decltype(dstTensor),
    layout::RowMajor, layout::zN> copyOp;

// Specify the shape of the block that is actually copied (may be smaller than the tensor's full shape)
tla::Shape<uint32_t, uint32_t> actualShape(actualM, actualK);
copyOp(dstTensor, srcTensor, actualShape);
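
One common reason to pass an actual shape smaller than the tile is tail handling: when the matrix size is not a multiple of the tile size, the last tile in each dimension is only partly filled. That usage is an assumption on our part rather than something the text above states. The sketch below computes such a clamped extent on the host; Extent2D and ActualTileExtent are hypothetical helpers, not catlass APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Extent2D {
    std::uint32_t rows;
    std::uint32_t cols;
};

// Extent of the tile at grid position (tileRow, tileCol), clamped at the matrix edge.
inline Extent2D ActualTileExtent(Extent2D matrix, Extent2D tile,
                                 std::uint32_t tileRow, std::uint32_t tileCol) {
    assert(tile.rows > 0 && tile.cols > 0);
    const std::uint32_t rowStart = tileRow * tile.rows;
    const std::uint32_t colStart = tileCol * tile.cols;
    assert(rowStart < matrix.rows && colStart < matrix.cols);  // tile must overlap the matrix
    return {std::min(tile.rows, matrix.rows - rowStart),
            std::min(tile.cols, matrix.cols - colStart)};
}
```

For a 300 x 200 matrix split into 128 x 128 tiles, interior tiles keep the full extent while the tile at grid position (2, 1) shrinks to 44 x 72; that pair is what would be passed as the actual shape.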

Dynamically optimized style (CopyGmToL1DynamicOptimized)

#include "catlass/gemm/tile/copy_gm_to_l1.hpp"

using namespace Catlass::Gemm::Tile;

using LayoutTagSrc = layout::RowMajor;
using LayoutTagDst = layout::zN;
using ElementDst = half;

// Gemm type in GM
using GmType = Gemm::GemmType<ElementDst, LayoutTagSrc>;
// Gemm type in L1
using L1Type = Gemm::GemmType<ElementDst, LayoutTagDst, AscendC::TPosition::A1>;

uint32_t row = 256;
uint32_t col = 256;

// Build the layouts
auto layoutSrc = LayoutTagSrc::MakeLayout<ElementDst>(row, col);
auto layoutDst = LayoutTagDst::MakeLayout<ElementDst>(row, col);

AscendC::GlobalTensor<ElementDst> srcTensor;
AscendC::LocalTensor<ElementDst> dstTensor;

// Instantiate CopyGmToL1DynamicOptimized
// Internally it picks Nd2Nz or strided DataCopy based on row/col
using CopyOp = CopyGmToL1DynamicOptimized<Arch::AtlasA2, GmType, L1Type>;
CopyOp copyOp;
copyOp(dstTensor, srcTensor, layoutDst, layoutSrc);
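
The runtime choice between the two strategies can be pictured as a small decision function. The sketch below is purely illustrative: the source only says that small matrices use strided DataCopy and large ones use Nd2Nz, so the rule used here (compare the shorter side against a threshold) and the threshold parameter itself are assumptions, and ChooseStrategy is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class CopyStrategy { StridedDataCopy, Nd2Nz };

// Hypothetical rule: a block counts as "small" when its shorter side is below the threshold.
// The real criterion inside CopyGmToL1DynamicOptimized is not documented here.
inline CopyStrategy ChooseStrategy(std::uint32_t rows, std::uint32_t cols,
                                   std::uint32_t smallSideThreshold) {
    assert(rows > 0 && cols > 0);
    return std::min(rows, cols) < smallSideThreshold ? CopyStrategy::StridedDataCopy
                                                     : CopyStrategy::Nd2Nz;
}
```

Keeping the decision in one function with an explicit threshold makes it easy to benchmark different cut-over points before settling on a value for a given shape distribution.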

GMM PTD style (CopyGmToL1GMMPTD)

#include "catlass/gemm/tile/copy_gm_to_l1.hpp"

using namespace Catlass::Gemm::Tile;

using LayoutTagSrc = layout::RowMajor;
using LayoutTagDst = layout::zN;
using ElementDst = half;

// In the GMM PTD scenario only GmType needs to be specified (L1Type defaults to void and is deduced by the partial specialization)
using GmType = Gemm::GemmType<ElementDst, LayoutTagSrc>;

uint32_t row = 256;
uint32_t col = 256;

// Build the layouts
auto layoutSrc = LayoutTagSrc::MakeLayout<ElementDst>(row, col);
auto layoutDst = LayoutTagDst::MakeLayout<ElementDst>(row, col);

AscendC::GlobalTensor<ElementDst> srcTensor;
AscendC::LocalTensor<ElementDst> dstTensor;

// Basic call
using CopyOp = CopyGmToL1GMMPTD<Arch::AtlasA2, GmType>;
CopyOp copyOp;
copyOp(dstTensor, srcTensor, layoutDst, layoutSrc);

// Extended call: specify the strides manually (multi-matrix copy scenario)
// copyOp(dstTensor, srcTensor, layoutDst, layoutSrc,
//        ndNum, srcNdMatrixStride, dstNzNStride, dstNzMatrixStride, dstNzC0Stride);
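
To show how manual strides could drive a multi-matrix copy, the sketch below emulates the address arithmetic on the host. The parameter names mirror the commented-out call above, but their exact semantics and units in catlass are not given in this document, so everything here is an assumption: all strides are measured in elements, srcRowStride is an extra parameter introduced for the sketch, and Nd2NzSketchParams / Nd2NzSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All strides are in elements (a choice made for this sketch only).
struct Nd2NzSketchParams {
    std::size_t ndNum;              // number of source matrices
    std::size_t rows, cols;         // shape of each source matrix
    std::size_t srcRowStride;       // between consecutive source rows
    std::size_t srcNdMatrixStride;  // between consecutive source matrices
    std::size_t dstNzC0Stride;      // between consecutive C0-wide column blocks
    std::size_t dstNzNStride;       // between consecutive rows inside a column block
    std::size_t dstNzMatrixStride;  // between consecutive destination matrices
};

// Element-wise host emulation of a multi-matrix ND -> NZ copy driven by explicit strides.
template <class T>
void Nd2NzSketch(std::vector<T>& dst, const std::vector<T>& src,
                 const Nd2NzSketchParams& p, std::size_t c0) {
    for (std::size_t m = 0; m < p.ndNum; ++m) {
        for (std::size_t r = 0; r < p.rows; ++r) {
            for (std::size_t c = 0; c < p.cols; ++c) {
                const std::size_t srcOff = m * p.srcNdMatrixStride + r * p.srcRowStride + c;
                const std::size_t dstOff = m * p.dstNzMatrixStride + (c / c0) * p.dstNzC0Stride +
                                           r * p.dstNzNStride + (c % c0);
                assert(srcOff < src.size() && dstOff < dst.size());
                dst[dstOff] = src[srcOff];
            }
        }
    }
}
```

With two 2 x 32 matrices, C0 = 16, and each destination matrix padded to 16 rows, the strides would be dstNzNStride = 16, dstNzC0Stride = 256 and dstNzMatrixStride = 512 in this sketch's units.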

Interval copy style (CopyGmToL1IntervalDataCopy)

#include "catlass/gemm/tile/copy_gm_to_l1.hpp"

using namespace Catlass::Gemm::Tile;

using LayoutTagSrc = layout::RowMajor;
using LayoutTagDst = layout::zN;

// CopyGmToL1IntervalDataCopy currently supports only the half type and the AtlasA2 architecture
using GmType = Gemm::GemmType<half, LayoutTagSrc>;

uint32_t row = 256;
uint32_t col = 256;

// Build the layouts
auto layoutSrc = LayoutTagSrc::MakeLayout<half>(row, col);
auto layoutDst = LayoutTagDst::MakeLayout<half>(row, col);

AscendC::GlobalTensor<half> srcTensor;
AscendC::LocalTensor<half> dstTensor;

// Copy row by row with strided DataCopy; suited to short-wide / tall-narrow blocks
using CopyOp = CopyGmToL1IntervalDataCopy<Arch::AtlasA2, GmType>;
CopyOp copyOp;
copyOp(dstTensor, srcTensor, layoutDst, layoutSrc);
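
One plausible reading of "row by row with strided DataCopy" is that each source row is issued as a single strided copy whose blocks are one fractal row wide. The host emulation below follows that reading for a short, wide RowMajor block written into a zN buffer. It is a sketch under assumptions (16-row fractals, columns a multiple of C0) and says nothing definitive about how CopyGmToL1IntervalDataCopy is implemented; StridedBlockCopy and CopyRowsToZn are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host emulation of one strided copy: blockCount blocks of blockLen elements each.
template <class T>
void StridedBlockCopy(T* dst, const T* src, std::size_t blockCount, std::size_t blockLen,
                      std::size_t srcStride, std::size_t dstStride) {
    for (std::size_t b = 0; b < blockCount; ++b) {
        std::copy(src + b * srcStride, src + b * srcStride + blockLen, dst + b * dstStride);
    }
}

// RowMajor -> zN for a short, wide block: one strided copy per source row.
template <class T>
void CopyRowsToZn(std::vector<T>& dst, const std::vector<T>& src,
                  std::size_t rows, std::size_t cols, std::size_t c0) {
    constexpr std::size_t fractalRows = 16;  // assumed fractal height
    assert(cols % c0 == 0);                  // the sketch skips column padding
    const std::size_t rowsRounded = (rows + fractalRows - 1) / fractalRows * fractalRows;
    assert(src.size() >= rows * cols && dst.size() >= rowsRounded * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        // Blocks are contiguous in the source row and rowsRounded * c0 apart in zN.
        StridedBlockCopy(dst.data() + r * c0, src.data() + r * cols,
                         cols / c0, c0, c0, rowsRounded * c0);
    }
}
```

The number of strided copies equals the number of rows, so the approach is cheapest when the block has few rows; a tall, narrow block would loop over column blocks instead for the same reason.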

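Template selection guide

| Scenario | Recommended template |
|---|---|
| General matrix-multiplication tile copy | CopyGmToL1 (non-TLA) or TileCopyTla (TLA) |
| Block shape not known in advance; runtime adaptation needed | CopyGmToL1DynamicOptimized |
| GMM PTD scenario that needs manual stride control | CopyGmToL1GMMPTD |
| Short-wide / tall-narrow blocks (half type only) | CopyGmToL1IntervalDataCopy |
| Partial copies or Padding scenarios | TileCopyTlaExt |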
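
The guide can also be written down as a small decision helper, which is convenient when the choice is made in host-side dispatch code. The sketch below is illustrative only: CopyScenario, CopyTemplate and RecommendTemplate are hypothetical, and the order in which the conditions are checked is our own assumption, since the guide does not rank the scenarios against each other.

```cpp
#include <cassert>

enum class CopyTemplate {
    CopyGmToL1,
    TileCopyTla,
    CopyGmToL1DynamicOptimized,
    CopyGmToL1GMMPTD,
    CopyGmToL1IntervalDataCopy,
    TileCopyTlaExt
};

// Hypothetical description of the caller's situation.
struct CopyScenario {
    bool useTla = false;                     // caller works with tla::Tensor
    bool partialOrPadding = false;           // partial copy or Padding layout needed
    bool gmmPtdManualStride = false;         // GMM PTD with manual stride control
    bool shapeKnownOnlyAtRuntime = false;    // block shape varies at runtime
    bool shortWideOrTallNarrowHalf = false;  // extreme aspect ratio, half elements
};

// Maps a scenario to the template suggested by the guide.
// The priority between overlapping scenarios is an assumption of this sketch.
inline CopyTemplate RecommendTemplate(const CopyScenario& s) {
    if (s.partialOrPadding) return CopyTemplate::TileCopyTlaExt;
    if (s.gmmPtdManualStride) return CopyTemplate::CopyGmToL1GMMPTD;
    if (s.shapeKnownOnlyAtRuntime) return CopyTemplate::CopyGmToL1DynamicOptimized;
    if (s.shortWideOrTallNarrowHalf) return CopyTemplate::CopyGmToL1IntervalDataCopy;
    return s.useTla ? CopyTemplate::TileCopyTla : CopyTemplate::CopyGmToL1;
}
```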

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
