TP/PP. Tensor Parallelism: partition each linear layer's weights along the row or column dimension; with a column split, each GPU stores only its slice of the weights, and an All-reduce synchronizes the final result. Pipeline Parallelism: partition the model between layers; for example, to pipeline a 36-layer model across 4 GPUs, each GPU stores 9 consecutive layers in order.
TP
- Communication volume: T = B * S * H * 2 * Layers * 2 * (tp - 1) / tp
- Makes full use of GPU compute resources
- Heavily dependent on GPU bandwidth; each individual GEMM's dimensions shrink
- Small batches cannot saturate the bandwidth; large batches become bandwidth-bound
- Performance upper bound is 1/tp
PP
[Figure: PP Prefill Latency and PP Decode Latency vs. batch size (1 to 128), for pp1/pp2/pp4/pp8]
- Communication volume: T = B * S * H * Layers / PP
- Low dependence on memory bandwidth, but GPU utilization is low; requires splitting into micro-batches
- No backward pass to wait for in inference, so the pipeline bubble is easier to control
- Achieves good performance at larger batch sizes
- Performance upper bound: close to bs=1 performance
TP+PP
- TP8 performs best at small batch sizes; at large batch sizes it becomes bandwidth-bound
- PP8 performs worst, but depends little on bandwidth and may overtake TP8 at very large batch sizes
- TP4+PP2: at large batch sizes, PP reduces TP8's bandwidth dependence while approaching TP4's performance gains
- May be effective on particular topologies
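The TP and PP communication-volume formulas above can be compared numerically. A minimal sketch (Python); the formulas are taken directly from these notes, and the example sizes are made up:

```python
def tp_comm_volume(B, S, H, layers, tp):
    # From the notes: T = B * S * H * 2 * Layers * 2 * (tp - 1) / tp
    # (two all-reduces per layer; each moves 2 * (tp - 1) / tp of the tensor)
    return B * S * H * 2 * layers * 2 * (tp - 1) / tp

def pp_comm_volume(B, S, H, layers, pp):
    # From the notes: T = B * S * H * Layers / PP
    return B * S * H * layers / pp

# Made-up example: B=8, S=1024, H=4096, 36 layers, 8-way parallel.
# TP's volume dwarfs PP's, which is why PP tolerates slow interconnects.
print(tp_comm_volume(8, 1024, 4096, 36, 8))
print(pp_comm_volume(8, 1024, 4096, 36, 8))
```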
Table of Contents
- RDMA Protocol Stack
- RDMA Connection Manager (CM)
- OpenMPI RDMA Usage
- NCCL RDMA Usage
- RDMA access GPU memory directly (aka GPUDirect, GDR)
- Reference
- Appendix: RDMA Software Stack
- Appendix: RDMA Feature List
- Appendix: RDMA Call Flow
- Appendix: RDMA NIC List
- Appendix: RDMA NIC Driver (Mellanox)
- Appendix: Infiniband Network Topology
RDMA Protocol Stack
- IB Stack Layer
- ULP
- Transport
- Network
- Link
- Physical
- IB Packet Structure
RDMA Connection Manager (CM)
- Connection Type
- Connected – 1 Local QP : 1 Remote QP
- Reliable – RC
- Eg. FTP, NCCL
- Unreliable – UC
- Eg. Video
- Datagram – 1 Local QP : N Remote QP (scalable)
- Reliable – RD
- Eg. ??
- Unreliable – UD
- Eg. Audio
- Illustrate
- Connection Handshake
- Connection Call Flow
- UCX RDMA UCT: Passive server port listens for client connects
- UMD librdmacm: Accepts client connections based on network events and an internal state machine
- KMD kCM: User connection manager access layer, responsible for parsing IB driver API commands
- KMD kMAD: Management datagram layer, responsible for encapsulating handshake REP packets
- KMD kVERB: Verbs layer, responsible for programming the HCA to send MAD packets
- KMD IB NIC: Mellanox driver layer, responsible for actually putting IB packets on the wire
- Illustrate
OpenMPI RDMA Usage
- OpenMPI RMA (Remote Memory Access) operations are mapped to RDMA operations
- OpenMPI uses UCX as its PML backend
- UCX uses Infiniband as a UCT
- The Infiniband UCT initiates RDMA operations through two modes
- Normal mode – invoke libibverbs
- Accelerated mode – invoke the NIC UMD
- An RMA PUT msg is split into multiple RDMA WRITE packets
- RDMA packet content
- Transport Layer Header
- BTH: Basic Transport Header
- ETH: Extended Transport Header
- RETH: RDMA ETH
- Virtual Address: 64b. Correspond to Mellanox RADDR seg rdma_raddr
- Remote Key: 32b. Correspond to Mellanox RADDR seg rdma_rkey
- RDMA Transaction
- There are two RDMA transaction protocols in UCX
- Eager Protocol
- Use pre-registered buffers to do the RDMA transfer
- For small/medium size messages, low latency
- Rendezvous Protocol (RNDZ)
- Use dynamically registered buffers to do the RDMA transfer
- For large size messages, high bandwidth
- Eager Protocol
- Use RDMA WRITE to transfer data from sender to receiver
- Illustrate
- There are two data transfer types based on message size
- Short Copy (Short)
- For small size messages
- Sender inlines data directly into the RDMA WRITE command header
- Buffered Copy (Bcopy)
- For medium size messages
- Sender copies data from the user buffer to a pre-registered RDMA buffer
- Sender RDMA WRITEs its RDMA buffer to the receiver’s pre-registered RDMA buffer
- Receiver copies the RDMA buffer to the user buffer
- RNDZ Protocol
- There are two RNDZ protocols in UCX (Zero copy: Zcopy)
- PUT Protocol (PUT Zcopy)
- GET Protocol (GET Zcopy)
- PUT Protocol
- Use RDMA WRITE to transfer data from sender to receiver
- Sender send RNDZ_START to start transfer
- RNDZ_START –In UCX code: RTS(Ready To Send)
- Receiver register MR after receive RNDZ_START
- Receiver send RNDZ_REPLY to acknowledge transfer
- RNDZ_REPLY –In UCX code: RTR(Ready To Receive)
- Sender issue RDMA WRITE to receiver’s MR after receive RNDZ_REPLY
- Sender send FIN to receiver after RDMA WRITE
- FIN –In UCX code: ATP(??)
- Receiver deregister MR after receive FIN
- Illustrate
- GET Protocol
- Use RDMA READ to transfer data from sender to receiver
- Sender register MR
- Sender send RNDZ_START to start transfer
- Receiver issue RDMA READ from sender’s MR after receive RNDZ_START
- Receiver send FIN to sender after RDMA READ
- FIN –In UCX code: ATS(??)
- Sender deregister MR after receive FIN
- Illustrate
- RDMA GET Protocol latency is lower than RDMA PUT Protocol latency
- UCX defaults to the GET protocol
- TBD: when to switch to the PUT protocol
- Pipeline PUT/GET Protocol
- Sometimes GPU data must be copied to host memory before the RDMA transfer
- GDR not supported
- Non-contiguous GPU data needs pack/unpack for the RDMA transfer
- Under such conditions UCX will pipeline the PUT/GET protocol to increase throughput
- Split data into fragments
- Transfer per fragment use PUT/GET protocol
- At sender side, overlap Next frag’s D2H copy with Current frag’s RDMA PUT
- At receiver side, overlap Next frag’s RDMA GET with Current frag’s H2D copy
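The three transfer modes above (Short, Bcopy, and the RNDZ Zcopy path) amount to a size-based dispatch. A sketch in Python; the threshold values are invented placeholders, not UCX's actual tuned defaults:

```python
def select_protocol(msg_size, short_max=64, bcopy_max=8192):
    """Pick a UCX-style transfer mode by message size (thresholds illustrative)."""
    if msg_size <= short_max:
        return "short"   # inline the data directly in the RDMA WRITE header
    if msg_size <= bcopy_max:
        return "bcopy"   # stage through pre-registered bounce buffers
    return "zcopy"       # rendezvous: register the user buffer, zero-copy

print(select_protocol(32), select_protocol(4096), select_protocol(1 << 20))
```

In real UCX these cutoffs are per-transport tunables rather than fixed constants; the point is only that small messages trade copies for latency, while large ones pay registration cost for bandwidth.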
NCCL RDMA Usage
- Background info
- NCCL kernel: refer to NCCL Device Kernel
- We’ll compare NCCL inter-node communication modes below
- Normal mode
- RDMA mode
- Normal Mode
- For each channel/peer/direction, allocate multiple sockets
- For each socket, allocate a data thread to handle data transfer on that socket
- For each data block, split the block into multiple chunks by number of sockets, assign a chunk to each socket
- Thus multiple CPU threads transfer different data chunks on different sockets in parallel
- Which consumes CPU!! (as compared to RDMA mode described in the next chapter)
- Illustrate
- RDMA Mode
- For each channel/peer/direction, allocate one NIC QP (Queue Pair)
- For each data block, RDMA WRITE the whole block to the remote MR (Memory Region)
- Data transfer is handled by NIC HW
- No CPU consumption!! (RDMA advantage)
- NIC HW will chunk the data block into multiple RDMA WRITE packets on the wire
- As the “Remote Memory Access” chapter describes
- Illustrate
- RDMA Transaction
- [1] Sender registers its meta mem to NIC as MR (Memory Region)
- Mm – Memory region of send meta
- Am – Address of send meta
- Km – Remote key of send meta
- [2] Receiver registers its data mem to NIC as MR
- Md – Memory region of recv data
- Ad – Address of recv data
- Kd – Remote key of recv data
- [3] Sender announces its meta MR (Am, Km) on the bootstrap ring
- The bootstrap ring is pre-built when NCCL is initialized
- [4] Receiver RDMA writes to the send meta MR (Am, Km) with recv data MR slot info (Ad, Kd)
- [5] Sender RDMA writes to the recv data MR slot (Ad, Kd) with the data block
- Illustrate
- RDMA Sequence Diagram
- RDMA Top Level Sequence
- RDMA Initialization
- RDMA Connection
- RDMA I/O
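Steps [1]–[5] of the RDMA transaction above can be modeled as a toy in-memory simulation. `ToyNic` is an invented stand-in (memory regions as dicts, an "RDMA write" as a remote update keyed by address and rkey), not the NCCL or verbs API:

```python
class ToyNic:
    """Toy model of a NIC: registered MRs are (addr, rkey)-keyed slots."""
    def __init__(self):
        self.regions = {}

    def register_mr(self, addr, rkey, size):
        self.regions[(addr, rkey)] = [None] * size        # steps [1]/[2]

    def rdma_write(self, addr, rkey, offset, payload):    # steps [4]/[5]
        self.regions[(addr, rkey)][offset:offset + len(payload)] = payload

sender_nic, recv_nic = ToyNic(), ToyNic()
sender_nic.register_mr(0x1000, 0xAA, 4)   # [1] sender meta MR (Am, Km)
recv_nic.register_mr(0x2000, 0xBB, 8)     # [2] receiver data MR (Ad, Kd)
# [3] (Am, Km) is announced out of band on the bootstrap ring;
# [4] receiver pushes its data-slot info into the sender's meta MR:
sender_nic.rdma_write(0x1000, 0xAA, 0, [0x2000, 0xBB])
ad, kd = sender_nic.regions[(0x1000, 0xAA)][0:2]
# [5] sender writes the data block straight into the receiver's slot:
recv_nic.rdma_write(ad, kd, 0, list(b"payload!"))
print(bytes(recv_nic.regions[(0x2000, 0xBB)]))
```

The asymmetry in the real protocol is the same as here: only step [3] uses the slow bootstrap path; steady-state flow control ([4]) and data ([5]) are both plain RDMA writes.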
RDMA access GPU memory directly (aka GPUDirect, GDR)
- NIC can directly access GPU memory
- Instead of using host memory, avoiding an extra GPU-to-host memory copy
- NCCL will enable GDR when
- Detect system supports PeerDirect driver
- PeerDirect Core – [https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed)
- Nvidia GPU PeerDirect– [https://github.com/Mellanox/nv_peer_memory](https://github.com/Mellanox/nv_peer_memory)
- AMD GPU PeerDirect– [https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver)
- Register GPU mem to NIC as MR
- PeerDirect Driver Architecture
- PeerDirect GPU Driver: Client driver (Eg. Nvidia GPU / AMD GPU…) which provides device DMA page management hooks
- The PeerDirect GPU Driver is vendor specific
- PeerDirect Core Driver: Bridge driver which connects a producer (GPU/Storage…) with a consumer (NIC)
- The PeerDirect Core Driver is common to all vendor GPUs
- The detailed GDR flow works as below
- Top Level Flow
- Initialization Flow
- Memory Region Registration Flow
- Sync I/O Flow
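One way to approximate the "detect system supports PeerDirect driver" step is to check whether the PeerDirect kernel module is loaded. A sketch; the module name `nv_peer_mem` comes from the nv_peer_memory project above, but treating /proc/modules as the source of truth is this sketch's own assumption, not NCCL's documented detection logic:

```python
def peerdirect_loaded(proc_modules_text, module="nv_peer_mem"):
    """Return True if `module` appears in /proc/modules-style text."""
    return any(line.split()[0] == module
               for line in proc_modules_text.splitlines() if line.strip())

# On a live host you would pass: open("/proc/modules").read()
sample = ("nv_peer_mem 16384 0 - Live 0xffffffffc0a00000\n"
          "mlx5_ib 397312 0 - Live 0xffffffffc0900000\n")
print(peerdirect_loaded(sample))
```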
Reference
- IB Specification Vol 1-Release-1.4
- OpenUCX
- librdmacm
- Linux Kernel
- NCCL Device Kernel
- Mellanox RDMA Aware Programming User Manual
Appendix: RDMA Software Stack
- RDMA Software Stack
- Mellanox OFED
- [https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed)
- Latest version is 5.2-2.2.0.0 (2021/03)
2. For conv fwd stride1 k3x3, given activation = (2, 128, 16, 16) with no activation reload, under NUMA/FP32, loading from L2 into the B buffer takes () cycles: D
A. 2*16*16*128*4/44 = 5958
B. 2*24*16*128*4/44 = 8937
C. 2*16*16*128*4/256 = 1024
D. 2*24*16*128*4/256 = 1536
Explanation: with k3x3, padding_factor = 1.5, so 50% more data must be loaded
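The arithmetic behind answer D can be replayed directly. The 256 B/cycle L2-to-B-buffer width is read off option D's denominator, not independently known:

```python
N, C, H, W = 2, 128, 16, 16     # activation shape from the question
fp32_bytes = 4
padding_factor = 1.5             # k3x3: 50% extra rows loaded (16 -> 24)
bytes_loaded = N * int(H * padding_factor) * W * C * fp32_bytes
cycles = bytes_loaded // 256     # assumed 256 B/cycle, per option D
print(bytes_loaded, cycles)      # 393216 1536 -> answer D
```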
6. Which variables must be considered when computing Vcore theoretical performance?: ABC
A: compute time
B: instruction time
C: memory access time
Explanation:
1. Operational intensity (Imax) is computed as: C
A. Imax = Maximum FLOPs Per Second
B. Imax = Maximum Memory Access Per Second
C. Imax = Maximum FLOPs Per Second / Maximum Memory Access Per Second
D. Imax = Maximum Memory Access Per Second / Maximum FLOPs Per Second
2. Which statements about the Ridge point are correct: AC
A. For the same HW compute capability, the larger the operational intensity at the Ridge point's X coordinate, the harder the optimization;
B. For the same HW compute capability, the larger the operational intensity at the Ridge point's X coordinate, the easier the optimization;
C. The Ridge point's X coordinate is the minimum operational intensity required to reach peak performance;
D. Best performance is achieved only at the Ridge point;
Explanation: "Ridge point" translates as 脊点
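Questions 1 and 2 combine into the standard roofline model; a sketch with made-up peak numbers:

```python
def ridge_point(peak_flops, peak_bytes_per_s):
    """Minimum operational intensity (FLOP/byte) that reaches peak compute."""
    return peak_flops / peak_bytes_per_s

def attainable(intensity, peak_flops, peak_bytes_per_s):
    """Roofline: capped either by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * peak_bytes_per_s)

# Example: 100 GFLOP/s peak compute, 25 GB/s memory -> ridge point at 4 FLOP/byte
print(ridge_point(100e9, 25e9))            # 4.0
print(attainable(2, 100e9, 25e9) / 1e9)    # 50.0  (memory bound)
print(attainable(8, 100e9, 25e9) / 1e9)    # 100.0 (compute bound)
```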
3. Ways to eliminate compute bound in hardware: ABD
A: parallelize fully, utilize all compute units
B: instruction unrolling and scheduling, hide SFU latency
C: increase the kernel's workload
D: use fmad instead of fmul + fadd
Explanation:
4. Ways to eliminate memory bound in hardware: ABD
A: ldmma instructions prefer col-major matrix access
B: prefer accessing local memory
C: issue more ld instructions
D: data prefetching
Explanation:
5. Advantages of operator fusion: ACDE
A: reduces HBM/Cache traffic;
B: reduces the kernel's compute;
C: reduces device memory footprint;
D: improves the compute pipeline and increases parallelism;
E: reduces kernel launch overhead;
Explanation:
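Options A and C of question 5 can be illustrated with a byte count for two chained elementwise ops; a toy model, not tied to any particular kernel:

```python
def unfused_traffic(n, dtype_bytes=4):
    # y = f(x), then z = g(y): two kernels, each reads and writes n elements
    return 2 * (2 * n * dtype_bytes)

def fused_traffic(n, dtype_bytes=4):
    # z = g(f(x)) in one kernel: read x once, write z once; y stays on-chip,
    # so the intermediate never occupies HBM (option C) and traffic halves (A)
    return 2 * n * dtype_bytes

n = 1024
print(unfused_traffic(n), fused_traffic(n))  # 16384 8192
```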
**1.** The correspondence between an instruction address in the disassembly file and the PC value in the trace is: (A)
A. HEX(PC_value) = HEX(dis_addr) // 8
B. HEX(PC_value) = HEX(dis_addr)
C. HEX(PC_value) = HEX(dis_addr) // 16
D. HEX(PC_value) = HEX(dis_addr) // 32
Explanation:
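Answer A as code; the address value is an arbitrary example:

```python
def pc_from_dis_addr(dis_addr):
    # Per answer A: trace PC = disassembly address // 8
    return dis_addr // 8

print(hex(pc_from_dis_addr(0x140)))  # 0x28
```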
2. The data in the inst trace tlr is 03db5004f3; the corresponding fp32 value is: (B)
A. 3db5004f = 0.0883794948459
B. 3db504f3 = 0.0883883461356
C. 04f33db5 = 5.71856944083e-36
D. b53df304 = -7.07616209183e-07
Explanation:
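Answer B's value can be reproduced with a standard IEEE-754 decode (how the 5-byte trace word 03db5004f3 is narrowed to the 4-byte pattern 3db504f3 is hardware-specific and not modeled here):

```python
import struct

def fp32_from_hex(h):
    """Interpret 8 hex digits as a big-endian IEEE-754 float32."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

print(fp32_from_hex("3db504f3"))  # ~0.0883883461356, matching answer B
```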
3. For a simt one- or two-source inst vs. a simt three-source inst, with all sources in tlr, the issue cycle counts are respectively: (B)
A. 1/2
B. 2/4
C. 1/3
D. 2/3
Explanation:
4. The stages of the instruction cycle are: (ACDEF)
A. fetch
B. encode
C. decode
D. execute
E. memory access
F. write back
Explanation:
5. How many kinds of instruction hazards are there: (ABC)
A. Structure Hazard – compute resource conflict
B. Data Hazard – data dependency conflict
C. Control Hazard – branch control conflict
Explanation:
6. How many groups of sync channels are available in br10x hardware for memory-access ordering control: (B)
A. 1
B. 2
C. 3
D. 4
Explanation: the question intends to ask how many sync channels the current pipeline strategy uses for memory-access management; in fact both B and C are reasonable answers
Performance Programming
- The bandwidths of GEMM Write and LSC Read are, respectively ()? [single choice] *
A. 4kb; 2kb (correct answer)
B. 4kb; 4kb
C. 2kb; 2kb
D. 2kb; 4kb
**Answer explanation:** GEMM write 4kb/cycle; lsc read 2kb/cycle
- In Tmode, which of the following statements are correct ()? *
A. Every CWARP can access all TLRs (correct answer)
B. Every CWARP can access all WSRs
C. Within an EU, at most 8 CWARPs run in parallel (correct answer)
D. Each CWARP contains 32 Threads (correct answer)
**Answer explanation:** each CWARP has its own set of WSRs
- 3. The L1P5 Buffer size is ()? [single choice] *
A. 3MB
B. 4MB (correct answer)
C. 7MB
D. 8MB
**Answer explanation:** the L1P5 buffer size is 4MB
- In Gmode, which of the following statements are incorrect ()? *
A. Each WARP has its own set of TLRs
B. Each WARP has its own set of WSRs
C. Each Thread Group has at most 8 WARPs (correct answer)
D. Each CU has at most 32 Thread Groups (correct answer)
**Answer explanation:** each Thread Group has at most 32 WARPs, and each CU has at most 8 Thread Groups
- Among the following fence instructions, those that can flush the l1p75 cache are ()? *
A. fll1 (correct answer)
B. ackl1
C. ackgmb (correct answer)
D. flmask (correct answer)
**Answer explanation:** ackl1 can only flush L1 data
- The correct ordering of barrier costs is ()? [single choice] *
A. bar.dtg/bar.ntg > bar.wtg > bar.batch > bar.tg
B. bar.dtg/bar.ntg > bar.batch > bar.wtg > bar.tg (correct answer)
C. bar.dtg/bar.wtg > bar.batch > bar.ntg > bar.tg
D. bar.tg > bar.wtg > bar.batch > bar.dtg/bar.ntg
**Answer explanation:** bar.dtg/bar.ntg > bar.batch > bar.wtg > bar.tg
- For a write-then-read between two EUs in different CUs, the required fence/bar sequence is ()? [single choice] *
A. PRD → ackl1 → bar.tg → CSM
B. PRD → ackgmb (or invl1) → bar.tg → CSM
C. PRD → ackgmb (or invl1) → bar.wtg → CSM (correct answer)
D. PRD → ackgmb → CSM
**Answer explanation:** PRD → ackgmb (or invl1) → bar.wtg → CSM
- Programming methods that can improve hardware utilization include ()? *
A. Ping-pong buffer pipelining (correct answer)
B. Inserting unrelated instructions between hazard-related instructions to reduce NOP/HOLD (correct answer)
C. Multi-CWARP parallelism (correct answer)
D. Loading ALU operands as early as possible (correct answer)
**Answer explanation:** all correct
Further reading (NVIDIA documentation):
- CUDA Programming Model for Hopper
- NVCC / Runtime / NVRTC
- Parallel Thread Execution ISA
- cuDNN Best Practices
- Getting Started with Deep Learning Performance
- GPU Performance Background
- Optimizing Convolutional Layers
- Kernel Profiling Guide
- Optimizing Parallel Reduction in CUDA