TP/PP. Tensor Parallelism: partition each linear layer's weights along the row or column dimension; with a column split, each GPU stores only its slice of the weights, and an All-reduce synchronizes the final result. Pipeline Parallelism: partition the model between layers; for example, to pipeline a 36-layer model across 4 GPUs, each GPU stores 9 consecutive layers in order.
TP
- Communication volume: T = B * S * H * 2 * Layers * 2 * (tp - 1) / tp
- Makes full use of GPU compute resources
- Heavily dependent on GPU bandwidth; each individual GEMM's dimensions shrink
- Small batches cannot saturate the bandwidth; large batches become bandwidth-bound
- Performance upper bound is 1/tp
PP
[Figure: PP Prefill Latency and PP Decode Latency vs. batch size (1 to 128), for pp1/pp2/pp4/pp8]
- Communication volume: T = B * S * H * Layers / PP
- Low dependence on memory bandwidth, but GPU utilization is low; requires splitting into micro-batches
- No backward pass to wait for in inference, so the pipeline bubble is easier to control
- Achieves good performance at larger batch sizes
- Performance upper bound: close to bs=1 performance
TP+PP
- TP8 performs best at small batch sizes; at large batch sizes it becomes bandwidth-bound
- PP8 performs worst, but depends little on bandwidth and may overtake TP8 at very large batch sizes
- TP4+PP2: at large batch sizes, PP reduces TP8's bandwidth dependence while approaching TP4's performance gains
- May be effective on particular topologies
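The TP and PP communication-volume formulas above can be compared numerically. A minimal sketch (Python); the formulas are taken directly from these notes, and the example sizes are made up:

```python
def tp_comm_volume(B, S, H, layers, tp):
    # From the notes: T = B * S * H * 2 * Layers * 2 * (tp - 1) / tp
    # (two all-reduces per layer; each moves 2 * (tp - 1) / tp of the tensor)
    return B * S * H * 2 * layers * 2 * (tp - 1) / tp

def pp_comm_volume(B, S, H, layers, pp):
    # From the notes: T = B * S * H * Layers / PP
    return B * S * H * layers / pp

# Made-up example: B=8, S=1024, H=4096, 36 layers, 8-way parallel.
# TP's volume dwarfs PP's, which is why PP tolerates slow interconnects.
print(tp_comm_volume(8, 1024, 4096, 36, 8))
print(pp_comm_volume(8, 1024, 4096, 36, 8))
```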
Table of Contents
- RDMA Protocol Stack
- RDMA Connection Manager (CM)
- OpenMPI RDMA Usage
- NCCL RDMA Usage
- RDMA access GPU memory directly (aka GPUDirect, GDR)
- Reference
- Appendix: RDMA Software Stack
- Appendix: RDMA Feature List
- Appendix: RDMA Call Flow
- Appendix: RDMA NIC List
- Appendix: RDMA NIC Driver (Mellanox)
- Appendix: Infiniband Network Topology
RDMA Protocol Stack
- IB Stack Layer
- ULP
- Transport
- Network
- Link
- Physical
- IB Packet Structure
RDMA Connection Manager (CM)
- Connection Type
- Connected – 1 Local QP : 1 Remote QP
- Reliable – RC
- Eg. FTP, NCCL
- Unreliable – UC
- Eg. Video
- Datagram – 1 Local QP : N Remote QP (scalable)
- Reliable – RD
- Eg. ??
- Unreliable – UD
- Eg. Audio
- Illustrate
- Connection Handshake
- Connection Call Flow
- UCX RDMA UCT: Passive server port listens for client connects
- UMD librdmacm: Accepts client connections based on network events and an internal state machine
- KMD kCM: User connection manager access layer, responsible for parsing IB driver API commands
- KMD kMAD: Management datagram layer, responsible for encapsulating handshake REP packets
- KMD kVERB: Verbs layer, responsible for programming the HCA to send MAD packets
- KMD IB NIC: Mellanox driver layer, responsible for actually putting IB packets on the wire
- Illustrate
OpenMPI RDMA Usage
- OpenMPI RMA (Remote Memory Access) operations are mapped to RDMA operations
- OpenMPI uses UCX as its PML backend
- UCX uses Infiniband as a UCT
- The Infiniband UCT initiates RDMA operations through two modes
- Normal mode – invoke libibverbs
- Accelerated mode – invoke the NIC UMD
- An RMA PUT msg is split into multiple RDMA WRITE packets
- RDMA packet content
- Transport Layer Header
- BTH: Basic Transport Header
- ETH: Extended Transport Header
- RETH: RDMA ETH
- Virtual Address: 64b. Correspond to Mellanox RADDR seg rdma_raddr
- Remote Key: 32b. Correspond to Mellanox RADDR seg rdma_rkey
- RDMA Transaction
- There are two RDMA transaction protocols in UCX
- Eager Protocol
- Use pre-registered buffers to do the RDMA transfer
- For small/medium size messages, low latency
- Rendezvous Protocol (RNDZ)
- Use dynamically registered buffers to do the RDMA transfer
- For large size messages, high bandwidth
- Eager Protocol
- Use RDMA WRITE to transfer data from sender to receiver
- Illustrate
- There are two data transfer types based on message size
- Short Copy (Short)
- For small size messages
- Sender inlines data directly into the RDMA WRITE command header
- Buffered Copy (Bcopy)
- For medium size messages
- Sender copies data from the user buffer to a pre-registered RDMA buffer
- Sender RDMA WRITEs its RDMA buffer to the receiver’s pre-registered RDMA buffer
- Receiver copies the RDMA buffer to the user buffer
- RNDZ Protocol
- There are two RNDZ protocols in UCX (Zero copy: Zcopy)
- PUT Protocol (PUT Zcopy)
- GET Protocol (GET Zcopy)
- PUT Protocol
- Use RDMA WRITE to transfer data from sender to receiver
- Sender send RNDZ_START to start transfer
- RNDZ_START –In UCX code: RTS(Ready To Send)
- Receiver register MR after receive RNDZ_START
- Receiver send RNDZ_REPLY to acknowledge transfer
- RNDZ_REPLY –In UCX code: RTR(Ready To Receive)
- Sender issue RDMA WRITE to receiver’s MR after receive RNDZ_REPLY
- Sender send FIN to receiver after RDMA WRITE
- FIN –In UCX code: ATP(??)
- Receiver deregister MR after receive FIN
- Illustrate
- GET Protocol
- Use RDMA READ to transfer data from sender to receiver
- Sender register MR
- Sender send RNDZ_START to start transfer
- Receiver issue RDMA READ from sender’s MR after receive RNDZ_START
- Receiver send FIN to sender after RDMA READ
- FIN –In UCX code: ATS(??)
- Sender deregister MR after receive FIN
- Illustrate
- RDMA GET Protocol latency is lower than RDMA PUT Protocol latency
- UCX defaults to the GET protocol
- TBD: when to switch to the PUT protocol
- Pipeline PUT/GET Protocol
- Sometimes GPU data must be copied to host memory before the RDMA transfer
- GDR not supported
- Non-contiguous GPU data needs pack/unpack for the RDMA transfer
- Under such conditions UCX will pipeline the PUT/GET protocol to increase throughput
- Split data into fragments
- Transfer per fragment use PUT/GET protocol
- At sender side, overlap Next frag’s D2H copy with Current frag’s RDMA PUT
- At receiver side, overlap Next frag’s RDMA GET with Current frag’s H2D copy
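The three transfer modes above (Short, Bcopy, and the RNDZ Zcopy path) amount to a size-based dispatch. A sketch in Python; the threshold values are invented placeholders, not UCX's actual tuned defaults:

```python
def select_protocol(msg_size, short_max=64, bcopy_max=8192):
    """Pick a UCX-style transfer mode by message size (thresholds illustrative)."""
    if msg_size <= short_max:
        return "short"   # inline the data directly in the RDMA WRITE header
    if msg_size <= bcopy_max:
        return "bcopy"   # stage through pre-registered bounce buffers
    return "zcopy"       # rendezvous: register the user buffer, zero-copy

print(select_protocol(32), select_protocol(4096), select_protocol(1 << 20))
```

In real UCX these cutoffs are per-transport tunables rather than fixed constants; the point is only that small messages trade copies for latency, while large ones pay registration cost for bandwidth.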
NCCL RDMA Usage
- Background info
- NCCL kernel: refer to NCCL Device Kernel
- We’ll compare NCCL inter-node communication modes below
- Normal mode
- RDMA mode
- Normal Mode
- For each channel/peer/direction, allocate multiple sockets
- For each socket, allocate a data thread to handle data transfer on that socket
- For each data block, split the block into multiple chunks by number of sockets, assign a chunk to each socket
- Thus multiple CPU threads transfer different data chunks on different sockets in parallel
- Which consumes CPU!! (as compared to RDMA mode described in the next chapter)
- Illustrate
- RDMA Mode
- For each channel/peer/direction, allocate one NIC QP (Queue Pair)
- For each data block, RDMA WRITE the whole block to the remote MR (Memory Region)
- Data transfer is handled by NIC HW
- No CPU consumption!! (RDMA advantage)
- NIC HW will chunk the data block into multiple RDMA WRITE packets on the wire
- As the “Remote Memory Access” chapter describes
- Illustrate
- RDMA Transaction
- [1] Sender registers its meta mem to NIC as MR (Memory Region)
- Mm – Memory region of send meta
- Am – Address of send meta
- Km – Remote key of send meta
- [2] Receiver registers its data mem to NIC as MR
- Md – Memory region of recv data
- Ad – Address of recv data
- Kd – Remote key of recv data
- [3] Sender announces its meta MR (Am, Km) on the bootstrap ring
- The bootstrap ring is pre-built when NCCL is initialized
- [4] Receiver RDMA writes to the send meta MR (Am, Km) with recv data MR slot info (Ad, Kd)
- [5] Sender RDMA writes to the recv data MR slot (Ad, Kd) with the data block
- Illustrate
- RDMA Sequence Diagram
- RDMA Top Level Sequence
- RDMA Initialization
- RDMA Connection
- RDMA I/O
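Steps [1]–[5] of the RDMA transaction above can be modeled as a toy in-memory simulation. `ToyNic` is an invented stand-in (memory regions as dicts, an "RDMA write" as a remote update keyed by address and rkey), not the NCCL or verbs API:

```python
class ToyNic:
    """Toy model of a NIC: registered MRs are (addr, rkey)-keyed slots."""
    def __init__(self):
        self.regions = {}

    def register_mr(self, addr, rkey, size):
        self.regions[(addr, rkey)] = [None] * size        # steps [1]/[2]

    def rdma_write(self, addr, rkey, offset, payload):    # steps [4]/[5]
        self.regions[(addr, rkey)][offset:offset + len(payload)] = payload

sender_nic, recv_nic = ToyNic(), ToyNic()
sender_nic.register_mr(0x1000, 0xAA, 4)   # [1] sender meta MR (Am, Km)
recv_nic.register_mr(0x2000, 0xBB, 8)     # [2] receiver data MR (Ad, Kd)
# [3] (Am, Km) is announced out of band on the bootstrap ring;
# [4] receiver pushes its data-slot info into the sender's meta MR:
sender_nic.rdma_write(0x1000, 0xAA, 0, [0x2000, 0xBB])
ad, kd = sender_nic.regions[(0x1000, 0xAA)][0:2]
# [5] sender writes the data block straight into the receiver's slot:
recv_nic.rdma_write(ad, kd, 0, list(b"payload!"))
print(bytes(recv_nic.regions[(0x2000, 0xBB)]))
```

The asymmetry in the real protocol is the same as here: only step [3] uses the slow bootstrap path; steady-state flow control ([4]) and data ([5]) are both plain RDMA writes.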
RDMA access GPU memory directly (aka GPUDirect, GDR)
- NIC can directly access GPU memory
- Instead of using host memory, avoiding an extra GPU-to-host memory copy
- NCCL will enable GDR when
- Detect system supports PeerDirect driver
- PeerDirect Core – [https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed)
- Nvidia GPU PeerDirect– [https://github.com/Mellanox/nv_peer_memory](https://github.com/Mellanox/nv_peer_memory)
- AMD GPU PeerDirect– [https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver)
- Register GPU mem to NIC as MR
- PeerDirect Driver Architecture
- PeerDirect GPU Driver: Client driver (Eg. Nvidia GPU / AMD GPU…) which provides device DMA page management hooks
- The PeerDirect GPU Driver is vendor specific
- PeerDirect Core Driver: Bridge driver which connects a producer (GPU/Storage…) with a consumer (NIC)
- The PeerDirect Core Driver is common to all vendor GPUs
- The detailed GDR flow works as below
- Top Level Flow
- Initialization Flow
- Memory Region Registration Flow
- Sync I/O Flow
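One way to approximate the "detect system supports PeerDirect driver" step is to check whether the PeerDirect kernel module is loaded. A sketch; the module name `nv_peer_mem` comes from the nv_peer_memory project above, but treating /proc/modules as the source of truth is this sketch's own assumption, not NCCL's documented detection logic:

```python
def peerdirect_loaded(proc_modules_text, module="nv_peer_mem"):
    """Return True if `module` appears in /proc/modules-style text."""
    return any(line.split()[0] == module
               for line in proc_modules_text.splitlines() if line.strip())

# On a live host you would pass: open("/proc/modules").read()
sample = ("nv_peer_mem 16384 0 - Live 0xffffffffc0a00000\n"
          "mlx5_ib 397312 0 - Live 0xffffffffc0900000\n")
print(peerdirect_loaded(sample))
```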
Reference
- IB Specification Vol 1-Release-1.4
- OpenUCX
- librdmacm
- Linux Kernel
- NCCL Device Kernel
- Mellanox RDMA Aware Programming User Manual
Appendix: RDMA Software Stack
- RDMA Software Stack
- Mellanox OFED
- [https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed)
- Latest version is 5.2-2.2.0.0 (2021/03)
2. For conv fwd stride1 k3x3, given activation = (2, 128, 16, 16) with no activation reload, under NUMA/FP32, loading from L2 into the B buffer takes () cycles: D
A. 2*16*16*128*4/44 = 5958
B. 2*24*16*128*4/44 = 8937
C. 2*16*16*128*4/256 = 1024
D. 2*24*16*128*4/256 = 1536
Explanation: with k3x3, padding_factor = 1.5, so 50% more data must be loaded
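The arithmetic behind answer D can be replayed directly. The 256 B/cycle L2-to-B-buffer width is read off option D's denominator, not independently known:

```python
N, C, H, W = 2, 128, 16, 16     # activation shape from the question
fp32_bytes = 4
padding_factor = 1.5             # k3x3: 50% extra rows loaded (16 -> 24)
bytes_loaded = N * int(H * padding_factor) * W * C * fp32_bytes
cycles = bytes_loaded // 256     # assumed 256 B/cycle, per option D
print(bytes_loaded, cycles)      # 393216 1536 -> answer D
```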
6. Which variables must be considered when computing Vcore theoretical performance?: ABC
A: compute time
B: instruction time
C: memory access time
Explanation:
1. Operational intensity (Imax) is computed as: C
A. Imax = Maximum FLOPs Per Second
B. Imax = Maximum Memory Access Per Second
C. Imax = Maximum FLOPs Per Second / Maximum Memory Access Per Second
D. Imax = Maximum Memory Access Per Second / Maximum FLOPs Per Second
2. Which statements about the Ridge point are correct: AC
A. For the same HW compute capability, the larger the operational intensity at the Ridge point's X coordinate, the harder the optimization;
B. For the same HW compute capability, the larger the operational intensity at the Ridge point's X coordinate, the easier the optimization;
C. The Ridge point's X coordinate is the minimum operational intensity required to reach peak performance;
D. Best performance is achieved only at the Ridge point;
Explanation: "Ridge point" translates as 脊点
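Questions 1 and 2 combine into the standard roofline model; a sketch with made-up peak numbers:

```python
def ridge_point(peak_flops, peak_bytes_per_s):
    """Minimum operational intensity (FLOP/byte) that reaches peak compute."""
    return peak_flops / peak_bytes_per_s

def attainable(intensity, peak_flops, peak_bytes_per_s):
    """Roofline: capped either by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * peak_bytes_per_s)

# Example: 100 GFLOP/s peak compute, 25 GB/s memory -> ridge point at 4 FLOP/byte
print(ridge_point(100e9, 25e9))            # 4.0
print(attainable(2, 100e9, 25e9) / 1e9)    # 50.0  (memory bound)
print(attainable(8, 100e9, 25e9) / 1e9)    # 100.0 (compute bound)
```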
3. Ways to eliminate compute bound in hardware: ABD
A: parallelize fully, utilize all compute units
B: instruction unrolling and scheduling, hide SFU latency
C: increase the kernel's workload
D: use fmad instead of fmul + fadd
Explanation:
4. Ways to eliminate memory bound in hardware: ABD
A: ldmma instructions prefer col-major matrix access
B: prefer accessing local memory
C: issue more ld instructions
D: data prefetching
Explanation:
5. Advantages of operator fusion: ACDE
A: reduces HBM/Cache traffic;
B: reduces the kernel's compute;
C: reduces device memory footprint;
D: improves the compute pipeline and increases parallelism;
E: reduces kernel launch overhead;
Explanation:
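Options A and C of question 5 can be illustrated with a byte count for two chained elementwise ops; a toy model, not tied to any particular kernel:

```python
def unfused_traffic(n, dtype_bytes=4):
    # y = f(x), then z = g(y): two kernels, each reads and writes n elements
    return 2 * (2 * n * dtype_bytes)

def fused_traffic(n, dtype_bytes=4):
    # z = g(f(x)) in one kernel: read x once, write z once; y stays on-chip,
    # so the intermediate never occupies HBM (option C) and traffic halves (A)
    return 2 * n * dtype_bytes

n = 1024
print(unfused_traffic(n), fused_traffic(n))  # 16384 8192
```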
**1.** The correspondence between an instruction address in the disassembly file and the PC value in the trace is: (A)
A. HEX(PC_value) = HEX(dis_addr) // 8
B. HEX(PC_value) = HEX(dis_addr)
C. HEX(PC_value) = HEX(dis_addr) // 16
D. HEX(PC_value) = HEX(dis_addr) // 32
Explanation:
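Answer A as code; the address value is an arbitrary example:

```python
def pc_from_dis_addr(dis_addr):
    # Per answer A: trace PC = disassembly address // 8
    return dis_addr // 8

print(hex(pc_from_dis_addr(0x140)))  # 0x28
```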
2. The data in the inst trace tlr is 03db5004f3; the corresponding fp32 value is: (B)
A. 3db5004f = 0.0883794948459
B. 3db504f3 = 0.0883883461356
C. 04f33db5 = 5.71856944083e-36
D. b53df304 = -7.07616209183e-07
Explanation:
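Answer B's value can be reproduced with a standard IEEE-754 decode (how the 5-byte trace word 03db5004f3 is narrowed to the 4-byte pattern 3db504f3 is hardware-specific and not modeled here):

```python
import struct

def fp32_from_hex(h):
    """Interpret 8 hex digits as a big-endian IEEE-754 float32."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

print(fp32_from_hex("3db504f3"))  # ~0.0883883461356, matching answer B
```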
3. For a simt one- or two-source inst vs. a simt three-source inst, with all sources in tlr, the issue cycle counts are respectively: (B)
A. 1/2
B. 2/4
C. 1/3
D. 2/3
Explanation:
4. The stages of the instruction cycle are: (ACDEF)
A. fetch
B. encode
C. decode
D. execute
E. memory access
F. write back
Explanation:
5. How many kinds of instruction hazards are there: (ABC)
A. Structure Hazard – compute resource conflict
B. Data Hazard – data dependency conflict
C. Control Hazard – branch control conflict
Explanation:
6. How many groups of sync channels are available in br10x hardware for memory-access ordering control: (B)
A. 1
B. 2
C. 3
D. 4
Explanation: the question intends to ask how many sync channels the current pipeline strategy uses for memory-access management; in fact both B and C are reasonable answers
Performance Programming
- The bandwidths of GEMM Write and LSC Read are, respectively ()? [single choice] *
A. 4kb; 2kb (correct answer)
B. 4kb; 4kb
C. 2kb; 2kb
D. 2kb; 4kb
**Answer explanation:** GEMM write 4kb/cycle; lsc read 2kb/cycle
- In Tmode, which of the following statements are correct ()? *
A. Every CWARP can access all TLRs (correct answer)
B. Every CWARP can access all WSRs
C. Within an EU, at most 8 CWARPs run in parallel (correct answer)
D. Each CWARP contains 32 Threads (correct answer)
**Answer explanation:** each CWARP has its own set of WSRs
- 3. The L1P5 Buffer size is ()? [single choice] *
A. 3MB
B. 4MB (correct answer)
C. 7MB
D. 8MB
**Answer explanation:** the L1P5 buffer size is 4MB
- In Gmode, which of the following statements are incorrect ()? *
A. Each WARP has its own set of TLRs
B. Each WARP has its own set of WSRs
C. Each Thread Group has at most 8 WARPs (correct answer)
D. Each CU has at most 32 Thread Groups (correct answer)
**Answer explanation:** each Thread Group has at most 32 WARPs, and each CU has at most 8 Thread Groups
- Among the following fence instructions, those that can flush the l1p75 cache are ()? *
A. fll1 (correct answer)
B. ackl1
C. ackgmb (correct answer)
D. flmask (correct answer)
**Answer explanation:** ackl1 can only flush L1 data
- The correct ordering of barrier costs is ()? [single choice] *
A. bar.dtg/bar.ntg > bar.wtg > bar.batch > bar.tg
B. bar.dtg/bar.ntg > bar.batch > bar.wtg > bar.tg (correct answer)
C. bar.dtg/bar.wtg > bar.batch > bar.ntg > bar.tg
D. bar.tg > bar.wtg > bar.batch > bar.dtg/bar.ntg
**Answer explanation:** bar.dtg/bar.ntg > bar.batch > bar.wtg > bar.tg
- For a write-then-read between two EUs in different CUs, the required fence/bar sequence is ()? [single choice] *
A. PRD → ackl1 → bar.tg → CSM
B. PRD → ackgmb (or invl1) → bar.tg → CSM
C. PRD → ackgmb (or invl1) → bar.wtg → CSM (correct answer)
D. PRD → ackgmb → CSM
**Answer explanation:** PRD → ackgmb (or invl1) → bar.wtg → CSM
- Programming methods that can improve hardware utilization include ()? *
A. Ping-pong buffer pipelining (correct answer)
B. Inserting unrelated instructions between hazard-related instructions to reduce NOP/HOLD (correct answer)
C. Multi-CWARP parallelism (correct answer)
D. Loading ALU operands as early as possible (correct answer)
**Answer explanation:** all correct
Further reading (NVIDIA documentation):
- CUDA Programming Model for Hopper
- NVCC / Runtime / NVRTC
- Parallel Thread Execution ISA
- cuDNN Best Practices
- Getting Started with Deep Learning Performance
- GPU Performance Background
- Optimizing Convolutional Layers
- Kernel Profiling Guide
- Optimizing Parallel Reduction in CUDA