单机 8 卡 16Die, 每个 Die 16 个专家为例训练 ” I have a cat”

共有 8 个箭头不同情况。最好的是 local permutation. 最差的是经过 mate 转发 P2P + D2D + P2P (OCS 目前方案, preferred）或者 P2P + P2P + D2D （可能中间的 P2P 会成为阻碍后面 D2D 的瓶颈）。

dispatch

token cat 需要跨 group, 并且被对方 group 卡全部用到，但是只在 inter-group 转发一次。

相比 all gather 无论对方要不要都塞过去，节省数据量

UBBv2 与 OCS TOPO 相比要简化不少。

尤其是机内要简化

1， mate 角度： UBBv2 每个 die 都是 inter-group 的 mate ，而 OCS 则是挑 die。 UBBv2 少一次 mate 跳转

2， peer 角度： UBBv2 全卡全 die 互联。再少一次 peer 的跳转。

少了两次跳转。复杂度大大降低。

说明：本文档用于论证在 ubbv2+rdma 条件下，是否有开发的价值，以及面临的问题瓶颈。

一、瓶颈

1、reorder 时间

56MB，耗时 2.25 ms
112MB，耗时 4.5 ms
224MB，耗时 9ms

2、机间传输带宽瓶颈 10 - 12 GB/s

（VS. 单机 EP16 allgather + permute: 16.5 ms）

16 机，TP1DP16PP8EP32，MB1，机间需要传输的数据量 1 * 4096 * 7169 * sizeof(BF16) = 56 MB，耗时 56 MB / 11 GB/s = 4.9 ms。
32 机，TP1DP16PP16EP32, MB2，机间需要传输的数据量 2 * 4096 * 7168 * sizeof(BF16) = 112 MB，耗时 9.8 ms。
32 机，TP1DP16PP16EP32, MB4，机间需要传输的数据量 4 * 4096 * 7168 * sizeof(BF16) = 224 MB，耗时 19.6 ms。

3、机内转发

DMA：如果以 token 为最小转发单位，调用次数上千，此方案不可行。
SPC：采用 DeepEP 的方式，需要转发的数据量 4 * (112 MB * 2) = 896 MB，耗时 896 MB / 100 GB/s = 8.96 ms。

4、如果采用，机间直接 send/recv colmajor，通信后，在采用 SPC DeepEP 的方式转发数据，则总耗时：9.8 + 4.4 * 2 = 18.6，如果机内一半可以和机间 send/recv overlap，则总耗时 14.2 ms。

二、可能的收益

1、通信并行度提高，减少通信耗时，达到通信计算比 1:1，再通过 dualpipe，进行通信计算 overlap。

2、shape 变大，算子计算效率提升。

3、显存降低，减少 recompute。

三、EP32 策略

下表中第#1 行为 base case，显存使用量约为 max reserved = 61264MB，对应下表中计算公式为（A + M）。

|1|16|1|16|8|16|2|4|2048|16|A(tt) + M(lp)|t1|t2|t3|||

|2|16|1|32|16|8|1|8|2048|8|A/2 + M|t1/2|t2||||

|3|16|1|32|16|8|2|8|4096|8|A + 2M|t1|2t2||||

|||||||||||||||||

|4|32|1|32|16|16|1|4|2048|8|A/2 + M/2|t1/2|t2/2|t3|||

|5|32|1|32|16|16|2|4|4096|8|A + M|t1|t2||||

|6|32|1|32|16|16|4|4|8192|8|2A + 2M|2t1|2t2||||

|||||||||||||||||

|7|64|1|64|32|16|1|4|4096|4|A/2 + M/2|t1/2|t2/2||||

- #1：base case

- #2：略

- #3：显存太具挑战，先不考虑。

- #4：

- - 显存占用只有 "#1 base case" 一半，根据“[1.7offload with recompute]([https://conf01.birentech.com/display/solution/1.7offload+with+recompute](https://conf01.birentech.com/display/solution/1.7offload+with+recompute))”计算，当 activation 占用仅为一半，可考虑关闭 recompute。单层能减少 13ms 的计算时长。

- EP32 相对 EP16 的“#1 base case”会增加机间同号卡之间 56MB 的 send/recv 耗时，约为 4.9ms。这一部分可通过机间 send/recv 与机内的 allgather 进行重叠来消除，可参考“[3.UBBv2+RDMA网络上的DeepEP]([https://conf01.birentech.com/pages/viewpage.action?pageId=220667883](https://conf01.birentech.com/pages/viewpage.action?pageId=220667883)) 第五节”。

- 相对 "#1 base case"，permute+unpermute 的输入、输出大小不变，不会增加额外的耗时。

- #5：可以将 TCore 利用率从 42% 提高到 53%，能减少 1/4 的 Grouped-GEMM 计算时间。

- #6：显存太具体挑战，先不考虑。

- #7：明显有点在于一个 DP 组就有 512 卡，在测试环境中，无须再测 DP。暂时看到的性能瓶颈点在于，机间和机内通信量相较于 "#4 case" 加倍。

1.1、初步优化

如下图所示的 topo，all2all 算法流程如下（如下步骤仅示意如何将 GPU0 上的数据发送到其他所有 GPU，所有 GPU 操作的次序完全一样）：

step-1：并行执行 GPU0→ GPU1,2,3,5，发送到 GPU5 上数据包含了需要转发到 GPU4, 6,7 的数据。

因为 GPU0→ GPU5 的数据为到其他 GPU 的 4 倍，

step-2：将缓存在 GPU5 上的数据，转发到 GPU4, 6, 7。

带宽计算：S / t，其中，S 为均匀情况下每个 rank 上发送或接收的数据大小。step-1 耗时 t1 = (S / 2) / B，step-2 耗时 t2 = (S / 8) / B。B = 64 GB/s 为直连 GPU 间 PCIe X16 的单向带宽。

那么，综合带宽 B’ = S / (t1 + t2)，带入 t1、t2，可得 B’ = S/ (5 / 8B) = 8B / 5 = 102.4 GB/s。

1.2、进一步优化

如上图中下部份的表所示，如果将数据 d0_5 最后发送，那么就可以将 d0_4、d0_6、d0_7 的传输与 4GPU Group 组内的传输 overlap 起来，从而消掉 step-2 的耗时。

那么，耗时 t = (S / 2) / B，综合带宽 B’ = S / t = 2B = 128 GB/s。

数据量估算（下面仅考虑单机情况）：

单机上每个 DP 需要发送的数据 2 * 4096 * 7168 * sizeof(BF16) * top(8 - 1)= 784 MB
假设均分到每个专家，则需要发送到每个专家的数据 896 MB / 256 = 3.5 MB

需要提前发送的元数据：

router: gating + topk
cal_metadata(input: indices[2*4096, 8], output: all_expert[256])
non-fused reducescatter(input: all_expert[256], output: local_expert[16])

发送 token 数据：

alltoallv(input: token[2 * 4096 * 7168], indices[2*4096, 8]; output: )
发送到每个 EP Rank 上的数据 3.5 MB * 16 = 56 MB
因为发送到同一个 EP Rank 上的 token 数据可以合并，但这里假设 top-8 个专家是均匀分布在 16 个 EP Rank 上，这里先可不考虑这个优化。
- 如需考虑这个优化在进行 spc all2allv(…, indices[], …) 拷贝操作时，若发现 indices 中 token_i: [1,2, …] 存在发送到同一个 EP Rank 的专家时，不要从 remote pull，而直接进行 local copy。

六、succl all2all 原始算法流程

succl 原来的 topo 实现流程如

第一步（send）：rank 0 给 4 5 6 7 准备数据，将对应 4 5 6 7 的 offset 的数据 copy 到 staging buffer；同时，rank 8 9 10 11 给 4 5 6 7 准备数据，4 个 D2D 和 4 个 Die2Die 并行；切流水，假设 4spc 每次只拉 1M 数据，(0123)4 个 spc 拉完第一次流水之后，(4567)4 个 spc 开始 p2p 并行拉数据；

第二步（recv）：rank 0 从 1 2 3 拉数据，拉的是 8 9 10 11 给 4 5 6 7 准备的数据，存在 Die2Die 的 staging buffer 上，同时从 rank 4 拉 1 份数据存到 output buffer；4 个 P2P 并行；（此时 output buffer 有了 4）；

rank 4 上还有 5 6 7 三分数据，需要三个 P2P 串行，同时，rank 12 13 14 15 在做四次 Die2Die，给 rank 0 1 2 3 准备数据，四次 DieDie 要拆开来和三个串行 P2P 并行，不然会影响性能。（三个 P2P 串行结束之后，outputbuffer 有了 5 6 7）；（拉 0 1 2 3 的数据也可以插进来这个三个 P2P 可以和某一个 P2P 并行）

第三步（recv）：rank 0 从 1 2 3 拉数据，同时从 rank 4 拉 1 份数据存到 output buffer；（此时 output buffer 有了 12，同时 13 14 15 的数据也已经到达 rank 4）；4 个 P2P；

rank 4 上还有三份数据，需要三个 P2P 串行，同时 rank 8 9 10 11 再做四次 Die2Die, 给 rank 0 1 2 3 准备数据，四次 DieDie 要拆开来和三个串行 P2P 并行，不然会影响性能。（三个 P2P 串行结束之后，outputbuffer 有了 13 14 15）；

第四步（recv）: 在三个 P2P 串行的过程中，先把 0 直接 D2D 到 outputbuffer，再做一个 Die2Die, 把 8 的数据 ready；

切流水，假设 4spc 每次只拉 1M 数据，(0123)4 个 spc 拉完第一次流水之后，(4567)4 个 spc 开始 p2p 并行拉数据；

下面仅以 2 机 all2all dispatch 为例进行说明。

1、数据量

NV：2 * 4096 * 7168 * sizeof(FP8) = 56 MB。
BR：2 * 4096 * 7168 * sizeof(BF16) = 112 MB。

2、带宽

SPC send recv性能数据对比

2 机 2 卡：每机 1 卡

SUCCL_BUFF_SIZE=67108864

|1m|0.18|0.10|

|32m|3.88|2.16|

|64m|8.07|3.79|

|128m|7.55|6.37|

|256m|11.04|8.87|

|1G|12.19|11.00|

3、耗时估计

在 NV 上，由于机内通过 NVLink 转发的数据量大约是机间 IB 发送数据量的 3 倍♣，但机内 NVLink 的带宽是机间带宽的 (160 GB/s / 50 GB/s = )3.2 倍，所以，瓶颈主要在机间发送的 56 MB 数据的带宽。
在 BR 上，机内机间的数据倍率关系同上，机间、机内带宽分别为 12 GB/s、24 GB/s（机内为 DMA 带宽，SPC 版本可以更高，上限为 32 GB/s），机内带宽变为机间带宽的 2 倍，故瓶颈同样是机间发送的 112 MB 数据的带宽。
BR 上耗时：112 MB / 12 GB/s = 9.11 ms。

（3 倍♣：每个 token 转发的 top8 个专家最少可能位于 1 个 EP Rank，最多可能位于 8 个 EP Rank，简单取其均值为 4.5，即须在机内转发 4.5 次，但 DeepSeek V3 限定最多发给 3 个节点，取最少 0 次跨机转发和最多 3 次跨机转发的均值为 1.5，以此算出机内数据为机间数据的 3 倍。）

目的：快捷实现，以打通模型完整的跑通过程，来消除对模型往下开发和优化的影响。

一、拓扑与邻接关系简介

1、拓扑与数据结构

2、邻接关系

peer1 peer2 peer3 mate1(blink) mate2(ib) mate3(port-10 or stranger)

GPU Rank0 GR1 GR2 GR3 GR5 GR8 GR12

GPU Rank1 GR0 GR2 GR3 GR4 GR9 GR13

GPU Rank2 GR0 GR1 GR3 GR7 GR10 GR14

GPU Rank3 GR0 GR1 GR2 GR6 GR11 GR15

GPU Rank4 GR5 GR6 GR7 GR1 GR12 GR8

GPU Rank5 GR4 GR6 GR7 GR0 GR13 GR9

GPU Rank6 GR4 GR5 GR7 GR3 GR14 GR10

GPU Rank7 GR4 GR5 GR6 GR2 GR15 GR11

GPU Rank8 GR9 GR10 GR11 GR13 GR0 GR4

GPU Rank9 GR8 GR10 GR11 GR12 GR1 GR5

GPU Rank10 GR8 GR9 GR11 GR15 GR2 GR6

GPU Rank11 GR8 GR9 GR10 GR14 GR3 GR7

GPU Rank12 GR13 GR14 GR15 GR9 GR4 GR0

GPU Rank13 GR12 GR14 GR15 GR8 GR5 GR1

GPU Rank14 GR12 GR13 GR15 GR11 GR6 GR2

GPU Rank15 GR12 GR13 GR14 GR10 GR7 GR3

二、dispatch Layout 结构体说明和计算

1、without Port-10

dispatch 操作步骤如下：

第一步，permute：从本卡和 peers 上 pull 数据进行 permute。如GR0从 GR1、GR2、GR3 拉取数据做 permute。
第二步，与第一步同步进行，本卡从自己的 mates 拉取数据放到 stagingbuffer。如 GR0 从其 GR5、GR8拉取数据，GR1 从 GR4、GR9拉取数据，GR2 从 GR7、GR10，GR3 从 GR6、GR11拉取数据。
2.1 pull（from mate1_innode）：GR0 从其 GR5 拉取数据，GR1 从 GR4 拉取数据，GR2 从 GR7，GR3 从 GR6 拉取数据。（为简化 layout 计算，第一版采用 56 MB 全量数据拉取）
2.2 pull（from mate2_internode）：GR0 从其GR8拉取数据，GR1 从GR9拉取数据，GR2 从GR10，GR3 从GR11拉取数据。（56 MB 全量数据拉取）
第三步，等待第二步完成，GR0 将从 GR0、GR1、GR2、GR3 的 stagingbuffer 拉取数据，由于从机内 BLink 和 RDMA 拉取数据的速度并不一样，两次 permute 在开始时间上有所差异：
3.1 permute：GR0、GR1、GR2、GR3 从 GR4、GR5、GR6、GR7 速度更快（50 GB/s），数据达到后，GR0、GR1、GR2、GR3 首先做一次 permute。
3.2 permute：GR0、GR1、GR2、GR3 从 GR8、GR9、GR10、GR11 速度更慢（13 GB/s），数据达到后，GR0、GR1、GR2、GR3 再做一次 permute。
第四步，pull：与第 3.2 步同步进行，GR0、GR1、GR2、GR3 分别从 GR4、GR5、GR6、GR7 的 stagingbuffer 中拉取来自 GR12、GR13、GR14、GR15 的数据。（为简化 layout 计算，第一版采用 56 MB 全量数据拉取）
第五步，permute：等待第四步完成，GR0 将从 GR0、GR1、GR2、GR3 的 stagingbuffer 拉取数据并做 permute。
第六步，reorder。

dispatch_layout 示意图：

dispatch_layout 计算步骤如下：

首先采用 allgather 所有的 indices 信息，后续操作步骤如下，

allgather 输入：indices [4096, 8]
allgather 输出：indices_list = 16 * indices [4096, 8]
对 allgather 的结果 indices_list 调用函数 get_layout() 以计算最终的 layout 信息：
permutation layout 计算：计算逻辑为依次遍历收到的每个大小为 [4096, 8] indices，并将属于当前 GPU Rank 上的 expert 的 token 数量进行累加，同时计算出其对应的 src、dest 地址。
pull layout 计算：由于采用了全量数据 pull，每次直接把需要 pull 的数据全量 pull 到对应的 stagingbuffer 即可。

折叠源码

| | |

|---|---|

|1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33|from collections import namedtuple

# 结构体定义

permuteinfo = namedtuple(``'token_num'``, 'base_addr'``)

permutelayout = namedtuple(``'src'``, 'dest'``, 'prob'``)

# base_addr中四个地址分别存储本卡、inner_mate、inter_mate、stranger缓存在本机上stagingbuffer的起始地址

base_addr = [localcard_addr, stagingbuffer1_addr, stagingbuffer2_addr, stagingbuffer3_addr]

# 16个expert buffer的起始地址

expertbuffer_base_addr = [addr1, addr2, addr3, addr4, addr5, ..., addr16]

def get_layout(indices_list, gpu_rank, layout_permute1, layout_permute2, layout_permute3, layout_permute4):

start_exp, end_exp = 16 * gpu_rank, 16 * (gpu_rank + 1``)

def compute_layout(indices, base_addr, layout):

token_idx_in_expert = [``0``] * 16

for i in range``(``4096``):

for exp in indices[i]:

if exp >``= start_exp and exp < end_exp:

src = base_addr + i * 7168

dest = expertbuffer_base_addr[exp % 16``] + token_idx_in_expert[exp] * 7168

layout.push(permutelayout(src, dest, prob))

token_idx_in_expert[exp] +``= 1

return permuteinfo(``sum``(token_idx_in_expert), base_addr), layout

# 计算gpu_rank所在的VNode上的四个gpu_rank

peer_list = [(gpu_rank % 4``) + i for i in range``(``4``)]

compute_layout(indices_list[peer_list[``0``]], localcard_addr, layout_permute1)

compute_layout(indices_list[peer_list[``1``]], stagingbuffer1_addr, layout_permute2)

compute_layout(indices_list[peer_list[``2``]], stagingbuffer2_addr, layout_permute3)

compute_layout(indices_list[peer_list[``3``]], stagingbuffer3_addr, layout_permute4)|

2、with Port-10

传输数据量变为 1 - (3/4)^8 - (3/4)^8 = 0.8

dispatch 操作步骤如下：

第一步，permute：从本卡和 peers 上 pull 数据进行 permute。如GR0从 GR1、GR2、GR3 拉取数据做 permute。
第二步，pull + pull + pull：与第一步同步进行，本卡从自己的 mates 拉取数据放到 stagingbuffer。如 GR0 从其 GR5、GR12、GR8拉取数据，GR1 从 GR4、GR13、GR9拉取数据，GR2 从 GR7、GR14、GR10，GR3 从 GR6、GR15、GR11拉取数据。
第三步，等待第二步完成，GR0 将从 GR0、GR1、GR2、GR3 的 stagingbuffer 拉取数据。由于从机内 BLink、Port-10、RDMA 拉取数据的速度并不一样，两次 permute 时间上有所差异
3.1 permute：GR0、GR1、GR2、GR3 从 GR4、GR5、GR6、GR7 速度更快（50 GB/s），数据达到后，GR0、GR1、GR2、GR3 首先做一次 permute。
3.2 permute：GR0、GR1、GR2、GR3 从 GR12、GR13、GR14、GR15 速度更快（25 GB/s），数据达到后，GR0、GR1、GR2、GR3 首先做一次 permute。
3.3 permute：GR0、GR1、GR2、GR3 从 GR8、GR9、GR10、GR11 速度更慢（13 GB/s），数据达到后，GR0、GR1、GR2、GR3 再做一次 permute。
第四步，reorder。

dispatch_layout 计算步骤如下：

首先采用 allgather 所有的 indices 信息，后续操作步骤如下，

allgather 输入：indices [4096, 8]
allgather 输出：indices_list = 16 * indices [4096, 8]
对 allgather 的结果 indices_list 调用函数 get_layout() 以计算最终的 layout 信息
pull layout 计算：以 GR0、GR1、GR2、GR3 所在的 VNode 为例，每个 GR(with mate1, mate2, mate3) 都需要为其他三个 GRs 拉取并缓存来自 (mate1、mate2、mate3) 的 token 信息。
stagingbuffer 设计：由于 VNode 中的每个 GR gr 都需要为其三个 peers 缓存 gr 的三个 mate 的数据，那么需要总共 9 个 stagingbuffer。
token 在 stagingbuffer 中的 layout 设计：stagingbuffer 中 layout 与最终 permute 访问其的 layout 一致。
permutation layout 计算：同样以 GR0、GR1、GR2、GR3 所在的 VNode 为例，除第一次 permute 需要的数据在本 VNode 外，剩下三次 permute 需要的数据都是通过 pull 从其他三个 VNode 拉取并缓存在对应的 stagingbuffer 中。

折叠源码

| |

|---|

|from collections import namedtuple

# 结构体定义

permuteinfo = namedtuple(``'token_num'``, 'base_addr'``)

permutelayout = namedtuple(``'src'``, 'dest'``, 'prob'``)

# base_addr中四个地址分别存储本卡、inner_mate、inter_mate1、inter_mate2的起始地址

# 需要通过通信得到，当下GR的三个peer，三个mate存储4096个token的起始地址

base_addr = [localcard_addr, peer1_addr, peer2_addr, peer3_addr, inner_mate1_addr, inter_mate2_addr, inter_mate3_addr]

缓存在本机上stagingbuffer的起始地址

stagingbuffer_addr = [stagingbuffer_formate1_addr, stagingbuffer_formate2_addr, stagingbuffer_formate3_addr, ]

# 16个expert buffer的起始地址

expertbuffer_base_addr = [addr1, addr2, addr3, addr4, addr5, ..., addr16]

# forward token layout 计算

# 为避免多传数据，在每个VNode中，当前GPU Rank g需要为它的三个peers(p1, p2, p3)存储g的mate(m1, m2) forward 到g上的token信息

# 这样需要操作9个stagingbuffer及其起始地址

# layout_pull1：以GR0为例，分别用来存储为peer1, peer2, peer3需要从GR4拉取的token信息

# layout_pull2：以GR0为例，分别用来存储为peer1, peer2, peer3需要从GR8拉取的token信息

# layout_pull3：以GR0为例，分别用来存储为peer1, peer2, peer3需要从GR12拉取的token信息

def get_pull_layout(indices_list, gpu_rank, layout_pull1, layout_pull2, layout_pull3):

peer_list = [(gpu_rank % 4``) + i for i in range``(``4``) if (gpu_rank % 4``) + i !``= gpu_rank]

peer1_start_exp, peer1_end_exp = peer_list[``0``] * 16``, (peer_list[``0``] + 1``) * 16

peer2_start_exp, peer2_end_exp = peer_list[``1``] * 16``, (peer_list[``1``] + 1``) * 16

peer3_start_exp, peer3_end_exp = peer_list[``2``] * 16``, (peer_list[``2``] + 1``) * 16

# 计算传入的mate_indices中，需要forward到当前GR的peer 在做permute的token信息

def compute_layout(mate_indices, src_base_addr, dest_stagingbuffer_addr, layout):

token_count = 0

for i in range``(``4096``):

for exp in mate_indices[i]:

if exp < peer1_start_exp or exp >``= peer3_end_exp:

continue

token_count +``= 1

src = src_base_addr + i * 7168

# 由于这里的数据需要先forward一次，所以dest为当前GR的stagingbuffer

dest = dest_stagingbuffer_addr + token_count * 7168

layout.push(permutelayout(src, dest, prob))

# 调用compute_layout()分别处理当前GR的三个mate传过来的indices mate_rank1, mate_rank2, mate_rank3 = calc_mate_ranks() # 这里需要计算当前GR的三个mate的rank

compute_layout(indices_list[mate_rank1], base_addr[``4``], stagingbuffer_addr[``1``], layout_pull1)

compute_layout(indices_list[mate_rank2], base_addr[``5``], stagingbuffer_addr[``2``], layout_pull2)

compute_layout(indices_list[mate_rank3], base_addr[``6``], stagingbuffer_addr[``3``], layout_pull3)

# permute layout 计算, 需要计算出4次permute的layout信息：其中layout_permute1/2/3/4为第一次需要访问地址, layout_permute5/6/7为接下来三次的

# layout_permute1: 以GR0为例，需要local_card上属于expert0-expert16的token信息。

# layout_permute2: 以GR0为例，需要peer1上属于expert0-expert16的token信息。

# layout_permute3: 以GR0为例，需要peer2上属于expert0-expert16的token信息。

# layout_permute4: 以GR0为例，需要peer3上属于expert0-expert16的token信息。

# layout_permute5: 以GR0为例，需要mate1(GR5)上属于expert0-expert16的token信息。

# layout_permute6: 以GR0为例，需要mate2(GR8)上属于expert0-expert16的token信息。

# layout_permute7: 以GR0为例，需要mate3(GR12)上属于expert0-expert16的token信息。

def get_permute_layout(indices_list, gpu_rank, layout_permute1, layout_permute2, layout_permute3, layout_permute4，

layout_permute5, layout_permute6, layout_permute7):

start_exp, end_exp = 16 * gpu_rank, 16 * (gpu_rank + 1``)

def compute_layout(indices, base_addr, layout):

token_idx_in_expert = [``0``] * 16

for i in range``(``4096``):

for exp in indices[i]:

if exp >``= start_exp and exp < end_exp:

src = base_addr + i * 7168

dest = expertbuffer_base_addr[exp % 16``] + token_idx_in_expert[exp] * 7168

layout.push(permutelayout(src, dest, prob))

token_idx_in_expert[exp] +``= 1

return permuteinfo(``sum``(token_idx_in_expert), base_addr), layout

# 计算gpu_rank所在的VNode上的四个gpu_rank

peer_list = [(gpu_rank % 4``) + i for i in range``(``4``)]

compute_layout(indices_list[peer_list[``0``]], localcard_addr, layout_permute1)

compute_layout(indices_list[peer_list[``1``]], stagingbuffer1_addr, layout_permute2)

compute_layout(indices_list[peer_list[``2``]], stagingbuffer2_addr, layout_permute3)

compute_layout(indices_list[peer_list[``3``]], stagingbuffer3_addr, layout_permute4)

# 计算当下GPU Rank对应的三个mate的gpu_rank

mate_list = [...]

compute_layout(indices_list[ mate_list[``0``]], localcard_addr, layout_permute5)

compute_layout(indices_list[ mate_list[``1``]], localcard_addr, layout_permute6)

compute_layout(indices_list[ mate_list[``2``]], localcard_addr, layout_permute7)

# 最终需要的layout, 需要同时将pull_layout再进行通信，将pull_layout和permute_layout进行合并|

三、combine Layout 结构体说明和计算

1、without Port-10

Combine 的操作步骤：

第一步：先在单卡内，将来自不同 expert_buffer 中同一个 token 进行 reduce 操作，结果放到二维数组 local_card_reduce_buffer[16][n]。（local_card_reduce_buffer 第一维为 16，用于存放来自不同的 gpu rank 上的 token 在本卡的 reduce 结果）

举例：以 GR1 为例，依次读取 16 个 expert_buffer 中的 token 进行 reduce，并输出到 local_card_reduce_buffer[16] 中。读取规则为：判断记录 token 在 16 个 expert 上分发的 16 个二进制书中的对应位，如果某个位为 1，则从对应 expert buffer 上读取一个 token，将从多个 expert 上读取的 token 进行 reduce，并输出到 local_card_reduce_bffer[][] 中。

第二步：如下示意图所示，任何一个 VNode 中的卡上的 token，需要通过 4 次卡间 combine，下面以 GR12 为例，

2.1 peer 之间进行 reduce，并将当前 gpu rank 及其 mate、stranger 上 token 的 reduce 结果写入到 local_vnode_reduce_buffer[4][n]。（local_vnode_reduce_buffer 第一维为 4，local_vnode_reduce_buffer[0, 1, 2, 3] 分别用于存放当下 gpu rank、mate_innode、mate_internode、stranger 上的 token 的 reduce 结果）

举例：以 GR1 为例，GR1 会从它的 Peer（GR0、GR1、GR2、GR3）的 local_card_reduce_buffer[1]、local_card_reduce_buffer[4]、local_card_reduce_buffer[9]、local_card_reduce_buffer[12] 上依次进行读取。读取的规则为：判断包含四个二进制位 0000-1111 的数中，某个对应位为是否为 1，为 1 则从对应 peer 上拉取一个 token，否则跳过此 peer。那么，需要计算 layout 时，只需构造出所有 token 的这个二进制数即可。

2.2 Node 内的 mate 之间进行 reduce，并将 local_card、stranger 上 token 的 reduce 结果缓存在跨物理节点的同号卡上。

举例：如 GR4、GR12 上 token 在 VNode1 上的 reduce 结果与 GR4 进行 reduce 后，缓存在 GR4。GR4 会从 GR4 的 local_vnode_reduce_buffer[0] 和 GR1(mate_innode) 的 local_vnode_reduce_buffer[1] 依次拉取数据进行 reduce。拉取规则为：判断包含两个二进制位 00-11 的数中，某个对应位是否为 1，为 1 则从对应卡拉取一个 token，否则跳过此 peer。那么，需要计算 layout 时，只需构造出所有 token 的这个二进制数即可。

2.3 Node 间的 mate 之间进行 reduce。

举例：如 GR12 上的 token 在 GR4 和 GR12 之间进行 reduce，并存到 GR12。

第三步：reorder

layout 的计算过程 Python 代码：

首先采用 allgather 所有的 indices 信息，后续操作步骤如下，

allgather 输入：indices [4096, 8]
allgather 输出：indices_list = 16 * indices [4096, 8]
对 allgather 的结果 indices_list 调用函数 get_layout() 以计算最终的 layout 信息

1、计算卡内 reduce layout

遍历 indices_list 中的每个 indices，根据每个 token 中记录的 expert 信息，计算对应的 layout 数值。如 indices_list[0] 中第一个 token 的 top-8 expert 为 [0, 1, 200, 201, 224, 225, 253, 255] 则 GR1 中可计算出 local_card_layout[0][1] = 0…0001116。
输出：local_card_layout[16][n] 的第一维为 16，每一行 local_card_layout[gpu_rank] 用于记录来 gpu_rank 的 token 在当前卡中 16 个 expert 上的分布，这个分布用包含 16 个二进制位的数表示，如 0…0001 表示这个 token 仅被 dispatch 到了它的第一个 expert buffer 中。

2、计算 VNode 内 reduce layout

- 分别遍历当前 gpu rank 及其 mate_innernode、mate_internode、stranger 的 indices，判断其中每个 token 是否被 dispatch 到当前 VNode 的 4 个卡上。
- 输出 1：local_vnode_layout[n] 表示一个一维数组，其中每个元素用来存储本卡 token 在 peer1、peer2、peer3、peer4 上 token 上的分布，这个分布用包含 4 个二进制位的数表示，如 0011 表示当下 token 被 dispatch 到了当前 gpu rank 和它的第一个 peer 上。
输出 2：local_vnode_layout_for_innermate[n]，解释如输出 1。
输出 3：local_vnode_layout_for_intermate[n]，解释如输出 1。
输出 4：local_vnode_layout_for_stranger[n]，解释如输出 1。

3、机内 mate_innode 间的 reduce layout

- 遍历当前 gpu rank 的 mate_intermte 的 indices，计算其在当前 Node 中的两个 VNode 上的分布。
输出 1：inner_node_layout[4096] 为长度是 4096 的一维数组，其中每个元素用来存储当前 gpu rank 上 token 在在当前 VNode 和邻接机内 VNode 上分布，这个分布用包含 2 个二进制位的数表示，如 11 表示当下 token 被 dispatch 到了当前 VNode 和它的邻接机内 VNode 上了。
输出 2：inner_node_layout[4096] 为长度是 4096 的一维数组，其中每个元素用来存储当前 gpu rank 的 mate_internode 上 token 在在当前 VNode 和邻接机内 VNode 上分布，这个分布用包含 2 个二进制位的数表示，如 11 表示当下 token 被 dispatch 到了当前 VNode 和它的邻接机内 VNode 上了。

4、机间 mate_internode 间的 reduce layout

- 遍历当前 gpu rank 和 mate_internode 的 indices[gpu_rank]、indices[mate_internode‘ gpu rank]，判断其中每个 token 是否被 dispatch 到当前 Node 的 8 个卡和邻接 Node 的 8 个卡上；
输出 1：inner_node_layout[4096] 为长度是 4096 的一维数组，其中每个元素用来存储本卡上 token 在在当前 Node 和邻接 Node 上分布，这个分布用包含 2 个二进制位的数表示，如 11 表示当下 token 被 dispatch 到了当前 node 和它的邻接 Node 上了。

折叠源码

| |

|---|

|from collections import namedtuple

# 结构体定义

permuteinfo = namedtuple(``'token_num'``, 'base_addr'``)

permutelayout = namedtuple(``'src'``, 'dest'``, 'prob'``)

# 本卡内做reduce，并将结果缓存在local_card_reduce_buffer中，大小为16，将来自不同gpu rank的token经过reduce后的结果放在不同的缓存中

local_card_expert_buffer = [expert1_buffer_addr, ..., expert15_buffer_addr] # 卡内做reduce，src的首地址

local_card_reduce_buffer = [gr1_buffer_addr, ..., gr15_buffer_addr] # gr为gpu rank缩写。卡内做reduce的dest首地址

# 本VNode内做reduce，会将当前gpu rank、innermate、intermate及stranger的token进行reduce，src为上一步的dest，dest为local_vnode_reduce_buffer

local_vnode_reduce_buffer = [local_vnode_addr, local_vnode_innermate_addr, local_vnode_intermate_addr, local_vnode_stranger_addr]

# 本Node内做reduce，会将当前gpu rank和其intermate上的token在当前gpu rank和其innermate上进行reduce，src为上一步的dest（local_vnode_intermate_addr + local_vnode_stranger_addr），dest为local_node_reduce_buffer

local_node_reduce_buffer = [local_node_card_reduce_buffer, local_node_intermate_addr]

# Node间做reduce，会将当前gpu rank的token在当前gpu rank和其intermate上进行reduce，src为上一步的dest，dest为 inter_node_reduce_buffer

inter_node_reduce_buffer = []

# 首先计算卡内reduce的layout

# indices_list为allgather之后的输出，包含每个gpu_rank的全量indices

def get_local_card_reduce_layout(indices_list, gpu_rank):

local_card_reduce_layout = [[] for i in range``(``16``)]

for gpu_rank, indices in enumerate``(indices_list):

for i in range``(``4096``):

layout_code = 0

for exp in indices[i]:

if exp >``= start_exp_idx and exp < end_exp_idx:

layout_code +``= (``1 << exp)

local_card_reduce_layout[gpu_rank].append(layout_code)

# 计算本VNode内做reduce的layout

def get_local_vnode_reduce_layout(indices_list, gpu_rank):

peer_list = [(gpu_rank % 4``) + i for i in range``(``4``) if (gpu_rank % 4``) + i !``= gpu_rank]

peer1_start_exp, peer1_end_exp = peer_list[``0``] * 16``, (peer_list[``0``] + 1``) * 16

peer2_start_exp, peer2_end_exp = peer_list[``1``] * 16``, (peer_list[``1``] + 1``) * 16

peer3_start_exp, peer3_end_exp = peer_list[``2``] * 16``, (peer_list[``2``] + 1``) * 16

self_start_exp, self_end_exp = 16 * gpu_rank, 16 * (gpu_rank + 1``)

local_card_reduce_layout = [[] for i in range``(``4``)]

for gpu_rank, indices in enumerate``(indices_list):

for i in range``(``4096``):

layout_code = 0

for exp in indices[i]:

if exp >``= start_exp_idx and exp < end_exp_idx:

layout_code +``= (``1 << exp)

local_card_reduce_layout[gpu_rank].append(layout_code)

def compute_layout(indices, reduce_layout):

for i in range``(``4096``):

layout_code = 0

for exp in indices[i]:

if exp >``= self_start_expand exp < self_end_exp :

layout_code +``= 1

if exp >``= peer1_start_exp < exp < peer1_end_exp:

layout_code +``= (``1 << 1``)

if exp >``= peer2_start_exp < exp < peer2_end_exp:

layout_code +``= (``1 << 2``)

if exp >``= peer3_start_exp < exp < peer3_end_exp:

layout_code +``= (``1 << 3``)

reduce_layout[gpu_rank].append(layout_code)

# 计算gpu rank、innermate、intermate、stranger上的token在本VNode的四张卡上分布

compute_layout(indices_list[gpu_rank], local_card_reduce_layout[``0``])

compute_layout(indices_list[gpu_rank], local_card_reduce_layout[``1``])

compute_layout(indices_list[gpu_rank], local_card_reduce_layout[``2``])

compute_layout(indices_list[gpu_rank], local_card_reduce_layout[``3``])

return local_card_reduce_layout

# 计算Node内做reduce的layout：遍历当前gpu rank的mate_intermte的indices，计算其在当前Node中的两个VNode上的分布。

def get_local_node_reduce_layout(indices_list, gpu_rank):

intermte_rank = get_intermate_rank() # 假设get_intermate_rank()可以获得其intermate的gpu rank编号

current_vnode = gpu_rank % 4

neighbor_vnode = current_vnode + (``1 if current_vnode % 2 =``= 0 else -``1``)

current_vnode_start_exp, current_vnode_end_exp = current_vnode * 64``, (current_vnode + 1``) * 64

neighbor_vnode_start_exp, neigbor_vnode_end_exp = neighbor_vnode * 64``, ( neighbor_vnode + 1``) * 64

inner_node_layout = []

for i in range``(``4096``):

for exp in indices_list[intermate_rank]:

if exp >``= current_vnode_start_exp and exp < current_vnode_end_exp:

layout_code +``= 1

if exp >``= neighbor_vnode_start_exp and exp < neigbor_vnode_end_exp:

layout_code +``= (``1 << 1``)

inner_node_layout.append(layout_code)

return inner_node_layout

# 计算跨Node做reduce的layout

def get_inter_node_reduce_layout():

pass|

combine layout 示意图：

2、with Port-10

操作步骤：

第一步：本卡上的 reduce。

第二步：本 VNode 上的 reduce。

第三步：跨 VNode 的 reduce。

操作 Python 代码：

包含在 without-10 的操作中。

示意图：

包含在 without-10 的示意图中。

106 集群规模及超参设定：16 机，PP16DP1

|TP4DP2|EP8|16|1|cpu optimizer|True|True|

|kernel size|kernel size|||mem size|||

||(mbseqlen/expertstopkDP)hiddenffn_hidden

(1ACEHOLDER}(14096/256hidden2)82048

2567168*2048||||||

166 集群训练规模及超参设定：

()

|8 机| | | | | | |16 机| | | | | | |

|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

|TP4DP2(→DP8 4*)|EP8|16|1|cpu optimizer|True|True|TP4DP2(→DP8)|EP8|16|2|cpu optimizer|True|True|

|kernel size|kernel size|||mem size|||||||mem size|||

||(4*)25671682048|||||||(47168)(256)12048||||||

||||||||non-moe layer|moe layer(Die 间 EP)|global batchsize|micro batchsize|optimizer|recomputer|offload|

||||||||DP8|EP16|16|2/4||||

|||||||||kernel size||||||

|||||||||2048/409671682048||||||

||||||||non-moe layer|moe layer(Die 间 TP**)**|global batchsize|micro batchsize|optimizer|recomputer|offload|

||||||||TP4DP2(→DP8)|EP16|16|2|cpu optimizer|True|True|

|||||||||kernel size||||||

|||||||||204871682048||||||

需求算子：

1、moe layer(Die 间 DP)，succl 需要支持双 Die all2all 时不同的切分。

2、router 支持 166。

3、MLA 中 mma 算子。

NV 集群规模：256 机，2048 卡

|TP1|EP64|16|2|3072-⇒15360|未知|/|True|/|

|1|top|8||

|2|transformer layer|61||

|3|hidden dimension to|7168||

|4|parameters are randomly initialized with a standard deviation|0.006||

|5|n heads 7168 to|128||

|6|per-head dimension 𝑛__ℎ to .|128||

|7|KV compression dimension 𝑑__ℎ is set to|512||

|8|query compression dimension 𝑑__𝑐 ′ is set to|1536||

|9|decoupled queries and key, we set the per-head dimension 𝑑__𝑐 to|64||

|10|substitute all FFNs except for the first three layers with MoE layers|||

|11|Each MoE layer consists of 1 shared expert and 256 routed experts|||

|12|intermediate hidden dimension of each expert is|2048||

|13|Among the routed experts, 8 experts will be activated for each token|||

|14|each token will be ensured to be sent to at most 4 nodes|||

|15|multi-token prediction depth 𝑑__ℎ 𝑅 is set to 1|||

|16|37B are activated for each token|||

|17|AdamW optimizer _𝛽_1 = 0.9, _𝛽_2 = 0.95|||

|18|maximum sequence length to 4K during pre-training|||

|19|14.8T tokens|||

此前讨论：

one kernel groupGemm 对比 suBLAS for loop API （16x) 目前存在两大难点：

1, shape 不均衡， cpu costmodel 难以 fit into inside kernel

2, 如果右矩阵 size per die 足够大，塞满 16SPC。则等同于 for loop 16x. one kernel 只收获 15x kernel launch 收益. 但损失 costmodel ， DTG 带宽，失去意义.

DS MoE 中发生的新变化

1） size per die 变小 N=1024.

2）后期 Shape 可能可以均衡， indicate costmodel 可以策略固化到 kernel 里。

故此有可能做出一个实现固定好 outPartion/loadOnce/quadbuffer 策略的，不通用的不产品化不 fit suBLAS/cuBLAS interface 的一个 groupGemm for 后期。

此 one kernel groupGemm 需同事满足上述 1) & 2) 条件：即左右矩阵 size 小，且均衡。

A seq1024 预期收益：

N = 1024

for loop 方案： size1024,每次 kernel 大约有 4 个 SPC really 工作。目测测试= 51T. (1024, 1024, 7168)

使用 AiB 模拟= 236T. 但 group gemm 中 B 不再共享无 DTG 带宽。考虑 DTG 衰减，假设使用 NUMA 可以达到 200T。假设 B 使用 UMA。预计仍有 150T，good enough. (16, 1024, 1024, 7168)

N=2048

for loop 方案：测试约有 89TFLOPS，16 个 spc 都工作了 (1024, 2048, 7168)

的预期收益，通过 AiB 模拟=265TFLOPS，用 UMA 估计有 165 TFLOPS (16, 1024, 2048, 7168)

A seq2048 预期收益：

N=1024

for loop 方案：测试约有 90TFLOPS，16 个 spc 都工作了 (2048, 1024, 7168)

的预期收益，通过 AiB 模拟=244TFLOPS，用 UMA 估计有 155 TFLOPS (16, 2048, 1024, 7168)

N=2048

for loop 方案：测试约有 108TFLOPS，16 个 spc 都工作了 (2048, 2048, 7168)

的预期收益，通过 AiB 模拟=272TFLOPS，用 UMA 估计有 170 TFLOPS (16, 2048, 2048, 7168)

可以先用 3UMA 实现

情景限定：

deepseek 中专家的维度都比较小，hidden size 只有 2048， per die 则变为 1024，下同

MMA 参数，MNK（1024，2048，7168）,总共 16 个专家。

(1024, 7168) @ (7168, 2048), 一共做 32 次。(up/gate 各一次，总共 32 次), fwd, 后期可以使用 one groupGemm 方案。bpa, bpw 因为输入矩阵变换非 reduce 维度至少一个仍然不小 7168，仍使用 for loop 方案。

(1024, 2048) @ (2048, 7168), 一共做 16 次。（down 一次）bpa, 后期可以使用 one groupGemm 方案，fwd, bpw, 仍然使用 for loop 方案。原因上同

对比 naive loop 和 numa weight 的方案的性能。

deepseek 中一个专家的总参数如下：

7168 * 2048 * 2 * 3 / 1024 / 1024 / 1024 = 0.082G

16 个专家是 0.082 * 16 = 1.31G

deepseek 总共 256 个专家，如果 EP 通信只在机内，则每张卡 32 个专家，每个 moe mlp 消耗显存 2.62G。

而且每个专家的权重维度都是一样的，可以考虑把专家的权重换成 NUMA，这样在 load weight 的时候，带宽会有很大提升 (50%，数据来自 supa 指南)。可能在小专家下取得比 naive loop 更好的性能。

上层框架需要对应的调整 EP 策略，保证每张卡至少有 16 个专家（BR166 由 die 间并行策略决定）。注：如果 weight 使用 UMA4、UMA8，可以对应的减少专家数的约束。

1、Background

双 DIE EP 上限较高（tcore 利用率），本文档为适配双 DIE EP 设计了一个 non-duplicate kernel launch 方案。

2、Introduction

目前 BR166 上的 kernel 都是 duplicate mode，两个 DIE 处理相同 shape 的 tensor，引入双 DIE EP 会导致两个 DIE 上 suBLAS 的左矩阵不一样（每个专家分配到的 token 数会存在不一致）。

3、Solution

分布式策略：

单机 EP16（EP8+DIE 间 EP），每张卡 16 个专家，die0、die1 有 16 个 DIE 间专家组（同属于一个 UMA16 的叫一个 DIE 间专家组）

框架层：

需要确定同一个 DIE 间专家组的最大输入 token 数，两者之间取最大值来申请 UMA16 空间。

算子层 (在 sublas 算子之前调用，确保清零，可以保证 blas api 不变)：

按照同一个专家组的最大输入 token 数来选择 kernel；

需使用同一个 UMA16 指针构造两个 shape 不一致的 usharp，传入计算 kernel；

kernel 根据当前 die index 选择不同的 input usharp

伪代码：

UMA16 device_pointer: d_p(input), d_p_w(weight)

weight shape: KxN

input tokens: M0, M1

host: (构建 M0xK, M1xK 两个 usharp, 使用同一个 UMA16 device pointer)

A0 = matrix2d(M0, K, nullptr, d_p);

A1 = matrix2d(M1, K, nullptr, d_p);

B = matrix2d(K, N, nullptr, d_p_w);

device: (根据 die index 选择不同的 usharp id，die_0 use M0xK， die_1 use M1xK)

device mma(int uid_0, int uid_1, int uid_2, int M0, int M1, int N, int K) {

int input_uid;

int input_m;

// get die index

if (die_idx == 0) {

input_uid = uid_0;

input_m = M0;

} else {

input_uid = uid_1;

input_m = M1

}

input_tensor = gen_matrix(input_uid, input_m, K);

weight_tensor = gen_matrix(uid_2, K, N);

// call mma as usual

}

4、实验数据

考虑到 die 间 TP，还有 up/gate 合并的可能

N 取 1024，2048，4096

M 从 32 到 6144 遍历，间隔 32

K=7168

以下是实验结果，纵坐标是 tcore 利用率

开启 bf16cc，fullstack version:20250212-804.lkg

结论：

N=4096 上限最高，此时并行策略是双 die EP，up/gate 合并
M=2048 和 M=4096 性能差别不是特别大，特别是在 N=4096 的情况下；开启双 DieEP 可以节省显存，只需要一半的 activation 空间（尽管 moe 部分会存在部分 padding，但是非 moe 部分显存需求减半），就可以获取 90% 的性能

5、8SPC 适配方案

A0/A1 是同一个 UMA16 的不同 region，B0/B1 是另一组 UMA16，代表 4 个不同的输入。A0 和 A1(B0 和 B1) 的逻辑 shape 一致，但是实际是 4 个不同的专家的输入（存在 padding 0），A/B 的逻辑 shape 不一致（和 106 现在的两个专家类似）。

通过设置 duplicate mode，可以只创建两个 kernelNode，保持现在的 GroupMMA 的 api 以及整体逻辑和 106 一致。

1、功能排期与方案列表

2、以下蓝色、绿色带宽数据分别归属：ubb v1，ubb v2（尽量采用 ubb v2，缺失则用 ubb v1。[Multiple GPU][BR166][v2]dma non-fused单机性能数据，alltoall eagermode 性能数据）

3、DeepEP 的基本原理：因为 NV 机内带宽是机间带宽的 3.4 倍（153 / 45），所以将机间异号卡之间的 all2all 流量转化到机内进行，进而实现加速。BR 上两机间和机内同样近似 3（SPC 27.9 / 12）倍 send/recv 带宽的关系。

一、算子性能评估与并行策略预估

1、算子性能评估

目前测试了前向的数据，考虑了 N=1024，2048，4096（双 dieEP，合并 up,gate），从前向的性能数据来看，还是双 dieEP 上限更高，而且可以在 M=2000 的时候就取得不错的利用率。

2、策略与显存理论占用

| | | | | | | |

|---|---|---|---|---|---|---|

|106 十六机| | | | | | |

|TP4DP2|EP8|16|1|cpu optimizer|True|True|

|kernel size|kernel size|||mem size|||

||(mb_𝐷_topkseqlen/expertsACEHOLDER}(14096/256DP)hiddenffn_hidden

(12)hidden2048

2568*2048||||||

|显存理论预估|| | | | | |

|166 八机| | | | | | |

|TP4DP2(→DP8 4*)|EP8|16|1|cpu optimizer|True|True|

|kernel size|kernel size|||mem size|||

||(4*)25671682048||||||

|166 十六机| | | | | | |

|DP8|EP16|16|2/4|chunk adam 8bit|False|False|

||kernel size||||||

||2048/409671682048||||||

|显存理论预估|mb2: stage1 weight with state 24.47 GB, activation 260.92 GB, total 285.39 GB

mb4: stage1 weight with state 24.47 GB, activation 521.84 GB, total 546.31 GB| | | | | |

二、单机通信方案

1、单机方案：单机内 all2all 数据量重发 3.5_ 倍Œ_，allgather 重发 8 倍Œ。

数据量：若采用 non-moe layer 采用 TP1DP8, moe layer 采用 EP8（加上 Die 间 EP，则变为 EP16）
allgather： 2 * 4096 * 7168 * sizeof(BF16) = 112 MB，112 MB * 8 = 896 MB。(乘以 8 是计算通信后每个 Rank 通信的总的数据量)
all2allv：112 MB * 3.5 = 392 MB。（乘以 3.5 是采用 DeepEP 的方式，每个 token 在机内需要重复发送的次数）

（歧义说明：无论采用 all2allv，亦或是采用 DeepEP 的 NVLink send/recv，在 SCCL 中的 all2all 实现方式采用 ” 循环 send/recv” 能提供的 all2allv 或并行 send/recv 性能应该一致。）

带宽及耗时：

allgather：SPC 版本，896 MB / s = 6.8 ms；DMA 版本：896 MB / 88 GB/s = 9.9 ms。

all2all：SPC 版本，392 MB / 54.8 GB/s = 7.15 ms；DMA 版本：392 MB / 16.9 GB/s = 23.2 ms。

专家计算耗时：实测 M/K/N=2048/7168/4096 时，单个 expert 的耗时为 0.6 ms，9(=256 / 16 + 1 share) 个专家总耗时 11.9 ms。
传算并行：DMA 通信和专家计算时间较接近 max (9.9, 11.9）= 11.9 ms；若采用 SPC 通信，总耗时为 6.8 + 11.9 = 18.7。存在较高的并行化价值。

7168：每个 token 转发的 top-8 个专家最少需要转发 0 次（全部 expert 位于本 EP Rank），最多需要转发 7 次，简单取其均值为 (0 + 7) / 2 = 3.5，即每个 token 平均需要转发 3.5 次；

7168：每个 token 都转发给了其他 EP Rank，即数据量扩张为了原来的 8 倍。

三、两机通信方案

1、两机方案：机间同号卡 send/recv、机内 allgather

约束
- DeepEP 机内转发数据是机间的 3.5 倍Œ。
机内 send/recv 是机间的（24 GB/s / 12 GB/s = ）2 倍。
通信数据量及带宽
策略：若采用 non-moe layer 采用 TP1DP16, moe layer 采用 EP16（加上 Die 间 EP，则变为 EP32）
数据量：机间 send/recv 数据量 2 * 4096 * 7168 * sizeof(BF16) = 112 MB；机内 allgather 数据量 112 MB * 16 = 1.75 GB。
SPC 版本：机间 send/recv 带宽 s（耗时 112 MB / 12 GB/s = 9.11 ms），机内 fuse allgather s （耗时 1.75 GB / 128.6 GB/s = 13.6 ms）。
DMA 版本：机间 send/recv 带宽 12 GB/s（耗时 112 MB / 12 GB/s = 9.11 ms），机内 fuse allgather 88 GB/s （耗时 1.75 GB / 88 GB/s = 19.8 ms）。
原理：机内从 DeepEP 的 send/recv 改为 allgather 的原因是机内 send/recv 带宽并没有机间的 3 倍以上，虽然 allgather 会加 16 / 3.5= 4.6 倍多发送数据，但 allgather 的带宽也是前者的 4.8 - 5.9 倍（SPC 128.6 / 27 =4.8 倍、DMA 88 / 15 = 5.9 倍）。同时，留下了”传算并行“的可能性，并能重用为单机 allgather 开发的融合算子。
专家计算耗时：若 TCore 利用率与单机相近，且 token 数据变多一倍而专家数减少一倍，则耗时与单机持平仍保持 11.9 ms。
DeepEP 方案在两机下的性能评估：链接。简要信息如下表：

|数据量|- NV：2 * 4096 * 7168 * sizeof(FP8) = 56 MB。（FP8 dispatching and BF16 combining）
- BR：2 * 4096 * 7168 * sizeof(BF16) = 112 MB。|

|瓶颈|- 在 NV 上，由于机内通过 NVLink 转发的数据量大约是机间 IB 发送数据量的 2.33 倍♣，但机内 NVLink 的带宽是机间带宽的 (160 GB/s / 50 GB/s = )3.2 倍，所以，瓶颈主要在机间发送的 56 MB 数据的带宽。
- 在 BR 上，机内机间的数据倍率关系同上，机间、机内带宽分别为 12 GB/s、24 GB/s（机内为 DMA 带宽，SPC 版本可以更高，上限为 32 GB/s），机内带宽变为机间带宽的 2 倍，故瓶颈同样是机间发送的 112 MB 数据的带宽。|

|通信耗时|BR：112 MB / 12 GB/s = 9.11 ms。（假设）|

传算并行：DMA 通信和专家计算的 max(9.1 + 9.9，11.9）= 19 ms；采用 SPC 版本，总耗时则为 9.1 + 6.8 + 11.9 = 27.8 ms；采用 DeepEP 方案，总耗时 9.11 + 11.9 = 21.01 ms。存在少量的并行化价值。

2.33 倍♣：每个 token 转发的 top-8 个专家最少需要转发 0 次（全部 expert 位于本 EP Rank），最多需要转发 7 次，简单取其均值为 (0 + 7) / 2 = 3.5，即须在机内转发 3.5 次，但 DeepSeek V3 限定最多发给 3 个节点，取最少 0 次跨机转发和最多 3 次跨机转发的均值为 1.5，以此算出机内数据为机间数据的 2.33 倍。

四、四机方案 EP64

暂不考虑大于 2 机以上方案，等待 BR DeepEP 实现。原因如下：

数据量：首先 mb2 可以增大到 mb4，若采用 non-moe layer 采用 TP1DP32, moe layer 采用 EP32（加上 Die 间 EP，则变为 EP64），单卡数据量 4 * 4096 * 7168 * sizeof(BF16) = 224 MB。
allgather 方案：数据量 4 机 * 8 卡 * 224 MB = 7 GB。即便只考虑机内 allgather，机内通信时长高达 7 GB / s = 54.4ms（行不通）
采用 DeepEP 方案：机间数据量 224 MB * 3/4 * 3 = 504 MB。(3/4 是对发向其他节点的 token 的预估) 耗时 504 MB / 12 GB/s = 42 ms。
专家计算耗时：若 TCore 利用率与单机相近，且 token 数据变多一倍而专家数减少一倍，则耗时与单机持平仍保持 11.9 ms。
大 EP 时，all2all dispatch 方案数据量明显低于 allgather dispatch 方案，allgather 方案需要归拢的数据太多，
通信数据 128 MB，1 机→ 2 机→ 4 机（卡间◊）all2all 性能衰减趋势：15.03 GB/s→ 3.3 GB/s→ ?。（8 倍Œ）

五、不同 EP 规模性能对比

|单机 EP16|allgather|88||2.5|||||||||||

|两机 EP32|机间 s/r 机内 ag|10 , 88|||||||||||||

|四机 EP64|DeepEP|机间 10|||||||||||||

五、功能列表及排期

|allgather 版本性能对齐

- 算子性能对齐
- 通信性能对齐|3.22 - 3.29||||1、单机内 all2all 发送数据量是最佳状态♥的 3.5 倍，allgather 通信数据量是最佳状态的 8 倍。（3.5 倍Œ）

2、目前，单机 allgather 带宽 71 GB/s，all2all 带宽 15 GB/s。既 allgather 数据量即便 all2all 两倍，采用 allgather 通信时间 4.766 ms 仍低于 all2all 4.766 ms。

3、基于上面 “2”，若采用 SPC all2all 带宽为 58 GB/s。这时通信时间变为 4.766 ms，但前面会增加 permute 操作，耗时为 4.766 ms。

3、03-05跑通版性能profiling

4、性能对比|

|Die 间 EP, EP16

- Die 间 EP 适配开发
- 接入 8 SPC 方案|3.29 - 4.11||||目前测试了前向的数据，考虑了 N=1024，2048，4096（双 dieEP，合并 up,gate），从前向的性能数据来看，还是双 dieEP 上限更高，而且可以在 M=2000 的时候就取得不错的利用率。|

|non-moe TP4DP2→ DP8 显存优化

- DP8 显存优化|3.22 - 4.11|||||

|offload、recompute 调优

- 适配
- 调优|4.11 - 4.18|||||

|INT8 优化|4.18 - 4.30|||||

|interleave 1f1b

- 适配
- 性能调优|4.18 - 4.30|||||

|精度对齐|5.1||||依赖 EP16 A2A 通信库|

|两机适配 EP32|5.1 -||||依赖跨机 A2A 通信性能|

|||||||

reference：

1、带宽计算公式 [https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md](https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md)

目的：两机与单机的差异功能项。

一、分布式策略

单机 EP16 策略→两机 EP32 策略：TP1DP8EP16PP16 → TP1DP16EP32PP8

| | | | | | | |

|---|---|---|---|---|---|---|

|166 十六机| | | | | | |

|TP1DP16/TP2DP8|EP32|960|2/4|chunk adam 8bit|True|True|

||kernel size||||||

||4096_◊：最终方案为Die间EP，暂缺故暂时引用卡间数据_2048||||||

|显存理论预估|TP1DP16 + mb2: stage1 weight with state27.94GB, activation 251.45 GB, total 279.40 GB

TP2DP8 + mb4: stage1 weight with state 25.38 GB, activation 251.45 GB, total276.83 GB| | | | | |

二、计算算子与性能

expert + swigelu fwd/bwd 的 TCore 利用率：

1、EP32 时，两种策略和对应 shape：

MB2→ MB4, TP2DP8+EP32，4096 * 7168 * 4096，对应利用率为 52%。前反向耗时→（基于单机降低 25%，每个 EP Rank 处理的 expert 数量减半 16→ 8，但每个 expert 计算的 token 加倍）。
MB2, TP1DP8→ TP1DP16+EP32，4096 * 7168 * 4096, 对应利用率为 52%。(最佳状态♥表示每个 token 派发的专家属于同一个 EP Rank)

2、EP16 时对应分布式策略和 shape： MB2, TP1DP8+EP16, 2046 * 7168 * 4096, 对应利用率为 41%+。

non-moe layer 前向耗时：49.682 ms
mma + rope + KV_Transform：14 ms
kernel0_single_mha_fwd_0：8.962 ms
mma：2.901 ms
router：2.4 ms
_all_gather_base_fusion：11.307 ms

moe layer 前向耗时：41.435 ms
dynamic_ep_tp_permutation_fwd_stage1：0.388 ms
dynamic_ep_tp_permutation_fwd_stage2：4.824 ms
Grouped Gemm：13.155 ms
dynamic_unpermutation_fwd：5.410 ms
_reduce_scatter_base_fusion：17.897 ms (7168)

shared_expert 前反向耗时：11.494 ms
前向：3.255 ms
反向：8.167 ms

moe layer 反向耗时：
_all_gather_base_fusion：11.365 ms
dynamic_ep_tp_unpermutation_bwd_general：7.691 ms
Grouped Gemm：36.732 ms
dynamic_permutation_bwd：5.150 ms

non-moe layer 反向耗时：62.786 ms
_reduce_scatter_base_fusion：17.505 ms
router_bwd：2.68 ms
mma(bpa/bpw) + shape_transform：8.302 ms
kernel0_single_mha_bwd_0：
KV_Transform_Bwd + Rope_Bwd_Yarn_Tp + mma(bpa/bpw)：13.151 ms

3、单机 EP16 与两机 EP32 各操作环节耗时对比

部分算子性能会变好：如果采用 TP1DP16 的话，mha 的输入 shape 不变，性能不变（蓝色表示）；permute/unpermute 输入 shape 变大，性能可能会变差（黄色表示）；Grouped Gemm 的 shape 变为 4096_preferred_4096，性能可提升 25%（绿色表示）。
通信性能会变差：分为两阶段，机间同 send/recv + 机内 allgather（黄色表示），且切换为 non-fuse succl api 之后，通信后需要接一个 Die 间 reorder。

|EP16 (ms)|14|9|2.9|2.4|11.3 (fuse succl api)|5.2|13.155|5.410|17.9 (fuse succl api)|3.3|84.565|

|EP32 (ms)|14 (shape 不变)|9 (shape 不变)|2.9 (shape 不变)|2.4 (←)|9.11 + 9.6 + 8.8 = 27.51|⬆_这个已经变成non-fused succl api了，因为没有32 SPC机器，没有比较准确的耗时trace_3/4=9.9|⬆*2=10.82 (数据加倍)|9.11 + 10.4 + 8.8 = 28.31|3.3|118.54|

||反向操作|KV_Transform_Bwd + Rope_Bwd_Tp + mma(bpa/bpw)|kernel0_single_mha_bwd_0|mma(bpa/bpw) + shape_transform|router_bwd|_reduce_scatter_base_fusion|permutation_bwd|Grouped Gemm|unpermutation_bwd|_all_gather_base_fusion|shared_expert||

|EP16 (ms)|13.151|15.527|8.302|2.68|17.505 (fuse succl api)|5.150|36.732|7.691|11.365|8.167|126.27|

|EP32 (ms)|13.151|15.527 (shape 不变)|8.302 (shape 不变)|2.68 (←)|9.11 + 9.6 + 8.8 = 27.51|⬆71683/4=27.5|⬆*2=15.4 (数据加倍)|9.11 + 10.4 + 8.8 = 28.31|8.167|156.847|

左边是 ep16，右边是 ep32。结论是send/recv+allgather 的方式 EP32 相对 EP16 不会有性能收益，需要适配 DeepEP 方式，才是终解。

三、通信方案与性能

1、两机方案：机间同号卡 send/recv、机内 allgather

约束
- DeepEP 机内转发数据是机间的 2=10.4(数据加倍)|⬆。
机内 send/recv 是机间的（24 GB/s / 12 GB/s = ）2 倍。
通信数据量及带宽
策略：若采用 non-moe layer 采用 TP1DP16, moe layer 采用 EP16（加上 Die 间 EP，则变为 EP32）
数据量：机间 send/recv 数据量 2 * 4096 * 7168 * sizeof(BF16) = 112 MB；机内 allgather 数据量 112 MB * 16 = 1.75 GB。
SPC 版本：机间 send/recv 带宽 s（耗时 112 MB / 12 GB/s = 9.11 ms），机内 fuse allgather s （耗时 1.75 GB / 128.6 GB/s = 13.6 ms）。
DMA 版本：机间 send/recv 带宽 12 GB/s（耗时 112 MB / 12 GB/s = 9.11 ms），机内 fuse allgather 182 GB/s （耗时 1.75 GB / 182 GB/s = 9.6 ms）。机内 fuse reducescatter 169 GB/s （耗时 1.75 GB / 169 GB/s = 10.4 ms）。如果调用 non-fuse succl api 则后面需要再介绍 Die 间 reorder 1.75 GB / 200 GB/s = 8.8 ms。
原理：机内从 DeepEP 的 send/recv 改为 allgather 的原因是机内 send/recv 带宽并没有机间的 3 倍以上，虽然 allgather 会加 16 / 3.5= 4.6 倍多发送数据，但 allgather 的带宽也是前者的 4.8 - 5.9 倍（SPC 128.6 / 27 =4.8 倍、DMA 88 / 15 = 5.9 倍）。同时，留下了”传算并行“的可能性，并能重用为单机 allgather 开发的融合算子。
专家计算耗时：若 TCore 利用率与单机相近，且 token 数据变多一倍而专家数减少一倍，则耗时与单机持平仍保持 11.9 ms。
DeepEP 方案在两机下的性能评估：链接。简要信息如下表：

|数据量|- NV：2 * 4096 * 7168 * sizeof(FP8) = 56 MB。（FP8 dispatching and BF16 combining）
- BR：2 * 4096 * 7168 * sizeof(BF16) = 112 MB。|

|通信耗时|BR：112 MB / 12 GB/s = 9.11 ms。（假设）|

传算并行：DMA 通信和专家计算的 max(9.1 + 9.9，11.9）= 19 ms；采用 SPC 版本，总耗时则为 9.1 + 6.8 + 11.9 = 27.8 ms；采用 DeepEP 方案，总耗时 9.11 + 11.9 = 21.01 ms。存在少量的并行化价值。

2.33 倍♣：每个 token 转发的 top-8 个专家最少需要转发 0 次（全部 expert 位于本 EP Rank），最多需要转发 7 次，简单取其均值为 (0 + 7) / 2 = 3.5，即须在机内转发 3.5 次，但 DeepSeek V3 限定最多发给 3 个节点，取最少 0 次跨机转发和最多 3 次跨机转发的均值为 1.5，以此算出机内数据为机间数据的 2.33 倍。

2=10.3 (数据加倍)|⬆：每个 token 转发的 top-8 个专家最少需要转发 0 次（全部 expert 位于本 EP Rank），最多需要转发 7 次，简单取其均值为 (0 + 7) / 2 = 3.5，即每个 token 平均需要转发 3.5 次；

3.5 倍Œ：每个 token 都转发给了其他 EP Rank，即数据量扩张为了原来的 8 倍。

四、其他

基于UBBv2+RDMA的DeepEP方案。

Reference

1、1.5功能开发列表

2、[Multiple GPU][BR166][v2]dma non-fused单机性能数据

目的：DeepEP 代码梳理，为实现 BR 上 DeepEP 做参考。相交 NV 上实现，BR 上实现可针对特定的单机 EP16、两机 EP32 和四机 EP64 规模进行开发。

1、目录结构

|▾ csrc/
▸ kernels/
CMakeLists.txt
config.hpp
deep_ep.cpp
deep_ep.hpp
event.hpp||▾ csrc/
▾ kernels/
api.cuh
buffer.cuh
CMakeLists.txt
configs.cuh
exception.cuh
ibgda_device.cuh
[internode.cu](http://internode.cu/)
internode_ll.cu
[intranode.cu](http://intranode.cu/)
launch.cuh
[runtime.cu](http://runtime.cu/)
utils.cuh||▾ deep_ep/
init.py
buffer.py
utils.py||▾ third-party/
nvshmem.patch|

2、python 代码

class Buffer

const num_sms
func init：通过对 self.runtime 变量赋值为 C++ 版本的 deep_ep_cpp.Buffer 来实现对底层缓存空间的通信操作。
func set_num_sms
func capture
func get_low_latency_rdma_size_hint
func get_local_buffer_tensor
func get_dispatch_config
func get_combine_config
func get_dispatch_layout
func dispatch：通过成员变量 self.runtime 调用 C++ 实现的 self.runtime.intranode_dispatch() 或通过调用 self.internode_dispatch() 实现对 self.runtime.internode_dispatch() 的调用。
func combine：与上面 dispatch 的调用逻辑相似。
func internode_dispatch：对 self.runtime.internode_dispatch() 进行调用，实现机间的 dispatch()。
func internode_combine：与上面 disptch 的调用逻辑相似。
func clean_low_latency_buffer：
func low_latency_dispatch：通过调用 self.runtime.low_latency_combine() 实现的低延迟版本 dispatch。
func low_latency_combine：与上面 disptch 的调用逻辑相似。
func get_next_low_latency_combine_buffer：

使用示例如下，先通过 get_dispatch_layout() 获取 dispath 的每个 ep rank、rdma rank、expert 通讯后的大小，以便提前申请空间。

折叠源码

| |

|---|

|def dispatch_forward(x: Union[torch.Tensor, Tuple``[torch.Tensor, torch.Tensor]],

topk_idx: torch.Tensor, topk_weights: torch.Tensor,

num_experts: int``, previous_event: Optional[EventOverlap] = None``) -``> \

Tuple``[Union[torch.Tensor, Tuple``[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List``, Tuple``, EventOverlap]:

# NOTES: an optional `previous_event` means a CUDA event captured that you want to make it as a dependency

# of the dispatch kernel, it may be useful with communication-computation overlap. For more information, please

# refer to the docs of `Buffer.dispatch`

global _buffer

# Calculate layout before actual dispatch

num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \

_buffer.get_dispatch_layout(topk_idx, num_experts,

previous_event``=``previous_event, async_finish``=``True``,

allocate_on_comm_stream``=``previous_event is not None``)

# Do MoE dispatch

# NOTES: the CPU will wait for GPU's signal to arrive, so this is not compatible with CUDA graph

# For more advanced usages, please refer to the docs of the `dispatch` function

recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \

_buffer.dispatch(x, topk_idx``=``topk_idx, topk_weights``=``topk_weights,

num_tokens_per_rank``=``num_tokens_per_rank, num_tokens_per_rdma_rank``=``num_tokens_per_rdma_rank,

is_token_in_rank``=``is_token_in_rank, num_tokens_per_expert``=``num_tokens_per_expert,

previous_event``=``previous_event, async_finish``=``True``,

allocate_on_comm_stream``=``True``)

# For event management, please refer to the docs of the `EventOverlap` class

return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event|

3、c++ 代码

几个主要接口介绍如下：

get_dispatch_layout：token 数据通信前，先将
internode_combine
internode_dispatch
intranode_combine
intranode_dispatch
sync

4、cuda 代码

初始化相关的文件：

buffer.cuh

[runtime.cu](http://runtime.cu/)

机间通信：

[internode.cu](http://internode.cu/)

internode_ll.cu

机内通信：

csrc/kernels/[intranode.cu](http://intranode.cu/)：主要用于在多个 GPU 之间进行数据通信和聚合。DeepEP 是一个用于分布式深度学习训练的框架，它通过优化通信和计算重叠来提高训练效率。这段代码主要涉及的是数据聚合的部分，即在多个 GPU 之间收集和合并数据。

notify_dispatch 核函数：主要目的是通知其他 GPU 有数据可以接收，并等待数据准备好。代码中使用了 CUDA 的共享内存和原子操作来同步不同 GPU 之间的数据状态。

折叠源码

| |

|---|

|template``<``int kNumRanks>

__global__ void notify_dispatch(``const int numtokensperrank, int moerecvcountermapped, ...) {

// 获取当前线程的rank和lane id

int rank = ...;

int laneid = ...;

// 获取channel tail idx

int cachedchanneltailidx = ld_acquire_sys_global(channeltailidx.buffer());

// Ready to copy

if (cachedchannelheadidx != cachedchanneltailidx) {

sharedchanneltailidx[responsiblerank] = cachedchanneltailidx;

}

// Timeout check

if (clock64() - starttime > NUMTIMEOUTCYCLES) {

printf``(``"DeepEP timeout for dispatch receivers, rank %d, responsiblechannel %d, tokens remained: %dn"``, rank, responsiblechannel, numtokenstorecv);

trap();

}

// Synchronize queue tail

asm volatile``(``"bar.sync %0, %1;" :: "r"``(responsiblerank), "r"``(numthreadsperrank));

cachedchanneltailidx = sharedchanneltailidx[responsiblerank];

// Copy data

int numrecvtokens = cachedchanneltailidx - cachedchannelheadidx;

for (``int chunkidx = recvwarpidinrank; chunkidx < numrecvtokens; chunkidx += numrecvwarpsperrank) {

int tokenidxinbuff...k64();

while (channeltailidx[recvlaneid] < expectedhead and expectedhead > 0) {

// Timeout check

if (clock64() - starttime > NUMTIMEOUTCYCLES) {

printf``(``"DeepEP timeout for combine receivers, rank %d, responsiblechannel %d, expect %dn"``, rank, responsiblechannel, expectedhead);

trap();

}

}

}

}|

数据聚合：在多个 GPU 之间进行数据聚合的逻辑。它使用了 CUDA 的 warp 同步和共享内存来高效地进行数据聚合。

| // Broadcast current heads

int num_topk_ranks = 0, topk_ranks[kNumRanks], slot_indices[kNumRanks];

#pragma unroll

for (``int i = 0; i < kNumRanks; ++ i) {

auto expected_head_i = __shfl_sync(0xffffffff, expected_head, i);

if (expected_head_i >= 0) {

slot_indices[num_topk_ranks] = expected_head_i % num_recv_buffer_tokens;

topk_ranks[num_topk_ranks ++] = i;

}

}

// Reduce data

#pragma unroll

for (``int i = recv_lane_id; i < hidden_int4; i += 32) {

// Read buffers

int4 recv_value_int4[kNumRanks];

#pragma unroll

for (``int j = 0; j < num_topk_ranks; ++ j)

recv_value_int4[j] = ld_nc_global(channel_x_buffers[topk_ranks[j]].buffer() + slot_indices[j] * hidden_int4 + i);

// Reduce all-to-all results

float values[kDtypePerInt4] = {0};

#pragma unroll

for (``int j = 0; j < num_topk_ranks; ++ j) {

auto recv_value_dtypes = reinterpret_cast``<``const dtype_t*>(&recv_value_int4[j]);

#pragma unroll

for (``int k = 0; k < kDtypePerInt4; ++ k)

values[k] += static_cast``<``float``>(recv_value_dtypes[k]);

}

// Cast back to `dtype_t` and write

int4 out_int4;

auto out_dtypes = reinterpret_cast``<dtype_t*>(&out_int4);

#pragma unroll

for (``int j = 0; j < kDtypePerInt4; ++ j)

out_dtypes[j] = static_cast``<dtype_t>(values[j]);

recv_int4[token_idx * hidden_int4 + i] = out_int4;

}

// Reduce `topk_weights`

if (recv_lane_id < num_topk) {

float value = 0;

#pragma unroll

for (``int i = 0; i < num_topk_ranks; ++ i)

value += ld_nc_global(channel_topk_weights_buffers[topk_ranks[i]].buffer() + slot_indices[i] * num_topk + recv_lane_id);

recv_topk_weights[token_idx * num_topk + recv_lane_id] = value;

}

// Update head

if (recv_lane_id < kNumRanks)

warp_channel_head_idx[recv_warp_id][recv_lane_id] = (expected_head < 0) ? -expected_head - 1 : expected_head + 1;

}

// Retired

__syncwarp();

if (recv_lane_id == 0)

warp_retired[recv_warp_id] = true``;

} |

asdf

5、调用第三方库 nvshmem 实现的内容

调用的文件 csrc/kernels/ibgda_device.cuh，该文件中实现的主要接口包括：

用于将主机字节序转换为网络字节序（大端序）

device static forceinline uint64_t HtoBE64(uint64_t x) { … }

device static forceinline uint32_t HtoBE32(uint32_t x) { … }

device static forceinline uint16_t HtoBE16(uint16_t x) { … }

队列对管理

device static forceinline nvshmemi_ibgda_device_state_t* ibgda_get_state() { … }

device static forceinline nvshmemi_ibgda_device_qp_t* ibgda_get_rc(int pe, int id) { … }

device static forceinline void ibgda_lock_acquire(int *lock) { … }

device static forceinline void ibgda_lock_release(int *lock) { … }

device static forceinline void ibgda_update_dbr(nvshmemi_ibgda_device_qp_t *qp, uint32_t dbrec_head) { … }

device static forceinline void ibgda_ring_db(nvshmemi_ibgda_device_qp_t *qp, uint16_t prod_idx) { … }

device static forceinline void ibgda_post_send(nvshmemi_ibgda_device_qp_t *qp, uint64_t new_prod_idx) { … }

远程内存写操作

device static forceinline void ibgda_write_rdma_write_inl_wqe(nvshmemi_ibgda_device_qp_t *qp, const uint32_t *val, uint64_t raddr, __be32 rkey, uint16_t wqe_idx, void **out_wqes, uint32_t imm) { … }

device static forceinline void ibgda_write_rdma_write_wqe(nvshmemi_ibgda_device_qp_t *qp, uint64_t laddr, __be32 lkey, uint64_t raddr, __be32 rkey, uint32_t bytes, uint16_t wqe_idx, void **out_wqes) { … }

远程键获取

device static forceinline uint64_t ibgda_get_lkey_and_rkey(uint64_t laddr, __be32 *lkey, uint64_t raddr, int dst_pe, uint64_t *out_raddr, __be32 *out_rkey) { … }

device static forceinline void ibgda_get_rkey(uint64_t addr, int dst_pe, uint64_t *out_raddr, __be32 *out_rkey) { … }

WQE 管理

device static forceinline uint64_t ibgda_reserve_wqe_slots(nvshmemi_ibgda_device_qp_t *qp, uint32_t num_wqes) { … }

device static forceinline void* ibgda_get_wqe_ptr(nvshmemi_ibgda_device_qp_t* qp, uint16_t wqe_idx) { … }

device static forceinline void ibgda_write_empty_recv_wqe(void *out_wqe) { … }

RDMA 操作

device static forceinline void nvshmemi_ibgda_rma_p(int *rptr, const int value, int dst_pe, int qp_id, uint32_t imm = std::numeric_limits<uint32_t>::max()) { … }

template device static forceinline void ibgda_submit_requests(nvshmemi_ibgda_device_qp_t *qp, uint64_t base_wqe_idx, uint32_t num_wqes, int message_idx = 0) { … }

template device static forceinline void nvshmemi_ibgda_put_nbi_warp(uint64_t req_rptr, uint64_t req_lptr, size_t bytes, int dst_pe, int qp_id, int lane_id, int message_idx) { … }

device static forceinline void ibgda_write_amo_add_wqe(nvshmemi_ibgda_device_qp_t *qp, const int &value, uint64_t laddr, __be32 lkey, uint64_t raddr, __be32 rkey, uint16_t wqe_idx, void **out_wqes) { … }

device forceinline void nvshmemi_ibgda_amo_nonfetch_add(void *rptr, const int& value, int pe, int qp_id, bool is_local_copy = false) { … }

nvshmemi_ibgda_rma_p：远程内存写操作。
ibgda_submit_requests：提交请求。
nvshmemi_ibgda_put_nbi_warp：批量提交请求。
ibgda_write_amo_add_wqe：原子加法操作。
nvshmemi_ibgda_amo_nonfetch_add：非内联原子加法操作

Reference

1、[https://github.com/deepseek-ai/DeepEP](https://github.com/deepseek-ai/DeepEP)

万卡包括 UBBv2 + OCS TOPO

此文档描述了方案原型，具体还需要与 SUCCL 团队进一步讨论。

有两点需要在此提前声明：

1、单机 EP16 的情形下，假设通过 top-8 选择的专家是均匀分布在 EP0(expert-0, expert-1) ～ EP15(expert-254, expert-255) 共计 16 个 EP Rank 上，这 16 个 EP Rank 对应 8 个 gpu rank，故每个 token 需要发送到每个 gpu rank 上的概率为 Combination(32, 1)^8 / Combination(256, 8)，即采用 allgather 是一种较低效的方式。通过 DeepEP 的方式，相交 allgather 不仅可以减少数据传输，还能进一步消除 permutation 操作，若将 reorder 进行融合，则需要采用 SPC 进行 reorder 和通信，这时会失去了“传算并行”的机会。当“传算并行”方案可以将通信全部 overlap 掉时，只有在 DeepEP 方案的总耗时小于 permutation 操作时，才会有收益。

2、跨两机 EP32 情形下，对机器数加倍（16 机→ 32 机），需要保证单个 transformer layer 执行时间保持不变，才能使TGS不降低，当采用 mb2 时，由于需要传输的数据量加倍，会在通信部分形成瓶颈，单个 transformer layer 执行时间会变长，且显存保持不变，此时性能相较单机 EP16 会有所降低；若采用 mb1 时，虽然增加了机间同号卡间的数据传输延时 (56 MB / 11 GB/s = 5 ms)，但机内传输数据量与单机时持平，moe layer 的 dipatch 和 combine 操作 shape 和单机一致，此外，显存使用率降低一半，可能通过降低重计算比例，以找补回跨机通信的耗时。

一、整体流程

整体思路是在前向时，将 hidden state 在进行 dispatch 前转换为 plain buffer (pb) rowmajor，之后进行 dispatch、grouped-gemm 和 combine 操作，combine 完成后再将 pb rowmajor 转换回 colmajor。反向操作与之相反。

虚线中采用 sublas groupgemm token, weights, output 全部的是 row major 方式执行，前反向都是。

二、单机中 Dispatch 和 Combine 操作流程

1、dispatch_fwd

名称定义：

EP Rank：单机 EP16 时，八个 GPU 卡共同承载 16 个 EP Rank，每个 EP Rank 中有 16 个 expert。

GPU Rank（对应 DeepEP 中的 rdma rank）：单机时，一个 GPU Rank 中包含两个 EP Rank，共 32 个 expert。

DP Rank：对应 MLA 部分的 DP。

整体流程如上图所示，每个时间步进行的操作如下：

t0：reorder hidden state with colmajor → send staging buffer with rowmajor，将派发到不同 expert 的 token 拷贝到 expert 所在的 GPU Rank 的 staging buffer 中（采用 bucket 机制）
- 方式一：将 112MB colmajor 直接转换为 112 MB rowmajor，再采用 SPC 进行DeepEP方式进行转发。
方式 2，3 不再考虑复杂且性能不一定有收益。
方式二：将 112MB colmajor 转换为 rowmajor，同时将发送到不同 gpu rank 的数据拷贝到 gpu rank buffer，此时发送缓冲区大小会变成 7 * 112 MB，再调用all2all进行通信，通信结束后，把接收到的数据拷贝到 expert buffer 中。
方式三：在进行 reorder 过程中，采用 staging buffer with bucket，根据 indices 信息，将阶段性 reorder 输出同步采用DMA 进行同步转发。（下面说明以“方式三”进行）
t1：在 reorder+copy 过程中，bucket 填满后，则将数据转发到相应的 gpu_rank 中。由于填满发送到不同 gpu rank 的 bucket 的时机并不确定，所以在进行点到点 send/recv 时，是乱序发送的，基于此可进行如下几种改进：
- 1、采用更大的 bucket size，极限为 112M，由于选择的专家可能均匀分布在不同的 EP Rank 上，此时，需要的总缓存大小为 1 * 112MB ≤ Size ≤ 8 * 112 MB。等所有 reorder+copy_to_stagingbuffer 操作完，再进行传输。缺点是不能在 reorder 过程中，进行同步通信；优点是能对通信进行排流水（如下图所示）。
2、/
t2：接收端，将 gpu_rank staging buffer 中的 token 根据 indices 拷贝到相应 expert buffer 中，为下一步 Grouped GEMM 计算做准备。

2、dispatch_bwd

与 combine_fwd 操作一致

3、combine_fwd

4、combine_bwd

与 dispatch_fwd 操作一致

三、EP32 两机通信方案与性能预估

（32 机 TP1DP16PP16EP32, mb1 策略下）

internode : 需要北向互联负责 inter-node 节点间，走 NIC 网卡，通过 switch 互联。远端访问，使用 RDMA

1、机间 send/recv + 机内 Allgather

（优点在于可以完全重用单机时，通信计算并行的方案，在 microbatch 之间 overlap DMA 通信和计算，缺点在于引入 permuation/unpermuation）

由于每个 token 选择的 expert 同时位于两台机器上的概率非常接近于 1，故机间 send/recv 数据量 56 MB，耗时 5 毫秒
使用 Node Limited Routing 理论可以人为让 expert 选 1 个机器的 top8.
机内采用 fused succl api allgather 112MB 数据需要 11.3 ms，将其分为两部分且分两步进行操作：
第一步：在进行机间 send/recv 同时，先将已有的 56MB 进行 allgather，耗时 5.6ms。
第二步：再对 send/recv 过来的 56MB 数据，进行 allgather，耗时 5.6ms
permutation：如上图示，将 permutation 分为两部分，各自耗时 3 ms。

总耗时：14.2 ms （VS. 单机 EP16 allgather + permute: 16.5 ms）

2、机间 send/recv + 机内 DeepEP

（优点在于采用 SPC 进行通信后，机间 send/recv 可以从 11-12 GB/s 提升到 16-17 GB/s，当然可以采用 DMA 进行机间 send/recv。同时，机内的可采用 DeepEP 的实现方式，以消除 permuation/unpermuation）

机间进行 send/recv 56 MB，耗时 5ms
机内采用 DeepEP 的方式，需要转发的数据量 4 * 112 MB = 448 MB，耗时 448 MB / 100 GB/s = 4.48 ms（这里假设 DeepEP Dispatch 带宽为 100 GB/s）。将其分为两步进行操作：
- 第一步：在进行机间 send/recv 同时，先将已有的 56MB 进行 DeepEP，耗时 4.9 ms。
第二步：再对 send/recv 过来的 56MB 数据，进行 DeepEP，耗时 2.24 ms。

若采用此方案，即便机间 send/recv 可以被机内的 DeepEP overlap，仍需耗时 7.14 ms。

3、SCCL 实现方式及性能预估

3.1 (基于 EP32（16 卡 32DIE），每个 DIE 2K token，28M per die)

实现方式主要分为 4 步，预计实际串行总时间预估为 10.25ms

性能极致优化情况下，可将第三步和第二步中的 nic forward pipeline 并行，加上较为理想的网络带宽，此时理论上总的最优性能预估为 8.5ms 左右

~~原有 Weidong, guodong 方案（疑似 AllGather 方案，此方案由于有冗余被各方放弃不再分析）：~~

~~reorder + send (OCS 复用代码， 8 SPC, send 是 allgather 吗？）~~
~~通信量：28M per die, 带宽：200GB/s 左右, 预计耗时：0.15ms~~
nic forward （inter-node， OCS 没这 part， port1 column 需要 1 个 SPC for RDMA) + p2p forward (OCS topo 不同，2.b 跟 3.a 是同一个 device function, 8 SPC 4+4 )
nic forward（通信量：56M，通信概率：1 - (1/2)^8 = 1，理论带宽：13GB/s(with RoCEv2, IB 15-16GB/s)，预计实测为理论的 80%，预计理论耗时：4.3ms，预计实际耗时：5.4ms）
~~NLR : 跨级几率降低 50%。Base 方案不采用。~~
~~p2p forward（直连转发，通信量 1：56M，通信 1 概率：1 - (12/16)^8 = 0.9，通信量 2：56M，通信 2 概率：0.4，带宽：20GB/s * 2，预计耗时：1.75ms）~~
~~p2p forward （intra-node 单机内，跟 OCS topo 不同单代码可能复用，9 SPC ： 4+4 )~~
~~与第二步中的 p2p forward 近似，预计耗时：1.75ms~~
~~reorder （OCS 复用代码，8SPC）~~
~~通信量：28M * topk per die, 带宽：200GB/s 左右 ,, 预计耗时：0.15ms * topk = 1.2ms~~

~~两机总耗时：~~

~~理论耗时 (红色操作可并行)：0.15 + 4.3 + 1.75 + 1.75 + 1.2 = 9.15 ms，第 2、3 步并行后，7.4 ms。~~
预计实际耗时 (红色操作可并行)：0.15 + 5.4 + 1.75 + 1.75 + 1.2 = 10.25 ms，第 2.a、3 步并行后，8.5 ms。理论上 2a.2b 不同的 token 也可以 pipeline 接口设计已经预留。

每 die 独立 virtual rank 方案

每个 die 处理 2048 个 token，一个 token 时 7168 个 BF16 = 28MB 数据。

reorder (OCS 复用代码， 8 SPC）
通信量：28M per die, 带宽：600GB/s 左右（3.5 倍Œ）, 预计耗时：0.05ms。 28MB = batch1 _8倍Œ_7168*sizeof(BF16)
nic forward （inter-node， OCS 没这 part， port1 column 需要 1 个 SPC for RDMA) + p2p forward (OCS topo 不同，2.b 跟 3.a 是同一个 device function, 8 SPC 4+4 )
nic forward（通信量：56M，通信概率：1 - (1/2)^8 = 1，理论带宽：13GB/s(with RoCEv2, IB 15-16GB/s)，预计实测为理论的 80%，预计理论耗时：4.3ms，预计实际耗时：5.4ms）
NLR : 跨级几率降低 50%。Base 方案不采用。
~~p2p forward（直连转发，通信量 1：56M，通信 1 概率：1 - (12/16)^8 = 0.9，通信量 2：56M，通信 2 概率：0.4，带宽：20GB/s * 2，预计耗时：1.75ms）~~
　 1. 机内数据通信量为: 28 MB per die，带宽为 $20 GB / s$ ，预计耗时 1.4 ms
p2p forward （intra-node 单机内，跟 OCS topo 不同单代码可能复用，9 SPC ： 4+4 )
~~与第二步中的 p2p forward 近似，预计耗时：1.75ms~~
2. 机间数据通信量为: 28 MB per die，带宽为 $20 GB / s$ ，预计耗时 1.4 ms
reorder （OCS 复用代码，8SPC）
通信量：28M * 8 专家 per per die, 带宽：600GB/s 左右（_ 每 die 独立 HBM 的 600GB/s 带宽 _）,预计耗时：0.05ms * topk = 0.4ms

VirtualNode 方案里。 intranode 最长的路径 2xP2P + 1D2D. 用保守的方式来预计是 2.8ms。但 UBBv2 的瓶颈不在 intranode ,

而是 inter-node 的 RDMA. RDMA 的通信在 4.3ms(实际 80% 利用率 5.4ms). intranode 无论是 2.8ms 还是 3.0ms 能被 RDMA 隐藏，对大局影响不大。

inter-node 很不幸每个 token 去对方 node 走 RDMA 的几率几乎是 100%（ 1 - (1/2)^8 ~= 1）。所以就无脑发给对方 node 的 mate 就好了。

既然全发给对方，那接下来就是要需要设计合适的 RDMA 的 pipe 粒度（eg. 128. 256 token 每次) 更好的利用 RDMA 的带宽比较关键。

3.2 (基于 EP16（8 卡 16DIE），每个 DIE 4K token，56M per die)

实现方式主要分为 4 步:

reorder
通信量：56M per die, 带宽：200GB/s 左右, 预计耗时：0.3ms
2.a p2p forward1 (inter-group via mate) + 2.b (intra-group) p2p forward2
p2p forward1（通信量：112M，通信概率：1-(4/8)^8=1，理论带宽：20GB/s * 2，预计耗时：2.8ms）
p2p forward2（通信量：112M，通信概率：1-(7/8)^8=0.66，理论带宽：20GB/s * 2，预计耗时：1.9ms）
~~permutation~~
~~与第二步中的 2b p2p forward2 近似，预计耗时：1.9ms.~~
~~2.3 其实在一起的~~
reorder
通信量：56M * topk per die, 带宽：200GB/s 左右, 预计耗时：0.3ms * topk = 2.4ms

单机总耗时（单机时 token num 变为 4k per die）：

0.3 + 2.8 + 1.9 + 1.9 + 2.4 = 9.3 ms，第 2、3 步并行后 7.4 ms。相比 AllGather + permutation 方案的 15-16ms 仍然有优势。

mate 需要有一块 buffer 用来转发.

3.3 人力预估&开发计划

[RDMA + GPU][Develop Plan]

四、64 机 EP64

（对应配置 TP1EP64DP32PP16MB1）

单卡上的数据量 56 MB，最坏情况下，每个卡上的所有的数据都需要发送到其他 3 台服务器上，增加的机间通信量变成两机的三倍。
机内从数据量最差情况下，从 112 MB 变成 224 MB。
但使用 Deepseek 的 Node Limited Routing(Node-Limited Routing (NLR) - SLO(Libraries) - Confluence): 理论上走减少 inter-node 通信量增大 intra-node 通信量

五、必要性与性能收益

注 1：下表中第#1 行为 base case，显存使用量约为 max reserved = 61264MB，对应下表中计算公式为（A + M）。

注 2：红色标注为现有方案改为本 DeepEP 方案后，通信部分耗时的变化。

|1|16|1|16|8|16|2|4|2048|16|A(tt) + M(lp)|t1|t2|16.5 → 6.2|||

|2|16|1|32|16|8|1|8|2048|8|A/2 + M|t1/2|t2||||

|3|16|1|32|16|8|2|8|4096|8|A + 2M|t1|2t2||||

|||||||||||||||||

|4|32|1|32|16|16|1|4|2048|8|A/2 + M/2|t1/2|t2/2|17.6 → 12|||

|5|32|1|32|16|16|2|4|4096|8|A + M|t1|t2||||

|6|32|1|32|16|16|4|4|8192|8|2A + 2M|2t1|2t2||||

|||||||||||||||||

|7|64|1|64|32|16|1|4|4096|4|A/2 + M/2|t1/2|t2/2||||

1、必要性

显存：
- 在“策略 1”时，将 token 分发调到强制均衡模式，单卡显存用量 60-63 GB。
在“策略 4”时，预估显存消耗将变为“策略 1”的一半。以应对不均衡时的极端显存用量。
在“策略 7”时，显存用量同“策略 4”。
算子性能保证：Grouped GEMM 的输出 shape(m/k/n)
- m=1024 时，TCore 利用率为 34%。
m=2048 时，TCore 利用率为 42%。
m=4096 时，TCore 利用率为 53%，能减少 1/4 的 Grouped-GEMM 计算时间。

2、性能收益

单机 EP16 场景
- 当下单机中采用 dispatch=allgather+permutation 和 combine=unpermutation+reducescatter 的方案，通信耗时占 e2e 的 21%。
当切换为 DeepEP 实现后，dispatch 通信耗时从 16.5 ms 降至 6.2 ms，通信耗时降低 62.4%，对 e2e 的提升 62.4% * 21% = 13%。
两机 EP32 场景
- 性能收益来源 1：通信耗时从单机 16.5ms 下降至 10.5 ms，通信耗时降低 36.4%，相对单机对 e2e 的提升 36.4% * 21% = 7.6%。
性能收益来源 2：显存从 60GB 下降至 30GB 后，可以适当减少部分算子的从计算，以提升性能。stage1 的显存相对占用较高，单层可释放 2 GB 显存占用（具体减少重计算操作可参见 1.7offload with recompute），e2e 提升约5%-10%。

六、UBBV2-Port10 利用

互联方式为将同一个机架上的两台服务器的所有同号卡的 GPU Port-10 采用铜缆进行互联。

1、带宽收益

PP：仅使用到两机 send/recv，10 - 17 GB/s → 29 GB/s（此带宽为不跨 Die 场景中带宽，跨 Die 时，性能会适当降低）
DP：使用到的 allgather/reducescatter 接口的带宽为聚合带宽，
- 两机时，两倍 send/recv 增益。
四机时，
更多机场景中，可通过多级策略，分级进行通信。
EP：跨机采用 DeepEP 场景下，跨机带宽，
- 两机 send/recv，10 - 17 GB/s → 29 GB/s。
四机 send/recv 带宽，13 GB/s → 26 - 28 GB/s。

2、端到端收益

dense 模型：3%
moe 567B：5 - 10%
DeepSeek V3:
- EP32 场景下，e2e 收益 3.5%。
EP32 场景下，e2e 收益？？。

3、成本增加

ubbv2 接光口：2k+

铜缆 (每两机共用 8 根，1k/根)：4k

总计：6k+/server

Quartz 4

Explorer

训练测试

dispatch

1.2、进一步优化

六、succl all2all 原始算法流程

2 机 2 卡：每机 1 卡

一、拓扑与邻接关系简介

1、拓扑与数据结构

2、邻接关系

二、dispatch Layout 结构体说明和计算

1、without Port-10

2、with Port-10

三、combine Layout 结构体说明和计算

1、without Port-10

2、with Port-10

1、Background

2、Introduction

3、Solution

4、实验数据

5、8SPC 适配方案

一、算子性能评估与并行策略预估

1、算子性能评估

2、策略与显存理论占用

二、单机通信方案

三、两机通信方案

四、四机方案 EP64

一、分布式策略

二、计算算子与性能

三、通信方案与性能

四、其他

Reference

一、整体流程

二、单机中 Dispatch 和 Combine 操作流程

1、dispatch_fwd

2、dispatch_bwd

3、combine_fwd

4、combine_bwd

三、EP32 两机通信方案与性能预估

1、机间 send/recv + 机内 Allgather

2、机间 send/recv + 机内 DeepEP

3、SCCL 实现方式及性能预估

3.1 (基于 EP32（16 卡 32DIE），每个 DIE 2K token，28M per die)

3.2 (基于 EP16（8 卡 16DIE），每个 DIE 4K token，56M per die)

3.3 人力预估&开发计划

四、64 机 EP64

五、必要性与性能收益

1、必要性

2、性能收益

六、UBBV2-Port10 利用

1、带宽收益

2、端到端收益

3、成本增加

Graph View

Table of Contents

Backlinks