NVSHMEM

Oct 12, 2025, 5 min read

Which Problem to Solve?
Strong Scaling
MPI vs NVSHMEM
MPI Communication (CPU-Initiated Communication)
NVSHMEM Communication (GPU-Initiated Communication)
Features
Symmetric Data Object
One-sided Communication
NVSHMEM Example
NVSHMEM vs NCCL
NVSHMEM Based on NCCL
cuBLASMp Using NVSHMEM
Example
Conclusion
Reference

Which Problem to Solve?

Strong Scaling

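With the problem size fixed, adding more parallel processors yields only a bounded speedup → Amdahl's law:

$$S(N) = \frac{1}{s + \frac{p}{N}}, \qquad s + p = 1$$

s: fraction of the execution time spent in the serial part

p: fraction of the execution time that can be parallelized

N: number of processors

As N grows, S(N) approaches 1/s, so the serial fraction sets the ceiling on speedup.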
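To make the bound concrete, here is a minimal Python sketch that evaluates Amdahl's law for a few processor counts. The serial fraction `s = 0.1` is an assumed value chosen only for illustration; it does not come from any measurement in this note.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup on n processors when a fraction s of the runtime is serial."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    p = 1.0 - s  # parallelizable fraction
    return 1.0 / (s + p / n)


if __name__ == "__main__":
    s = 0.1  # assumed: 10% of the runtime is serial
    for n in (1, 8, 64, 1024):
        print(f"N={n:5d}  speedup={amdahl_speedup(s, n):6.2f}")
    print(f"upper bound 1/s = {1.0 / s:.1f}")
```

Under this assumption the speedup never reaches 10, no matter how many processors are added, which is why shrinking the serial part (here, communication) matters more than adding GPUs.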

MPI vs NVSHMEM

MPI Communication (CPU-Initiated Communication)

Data path: GPU → CPU → interconnect → CPU → GPU. Side effects:

the overhead of repeated kernel launches;
CPU-GPU synchronization;
underutilization of the GPU during communication phases;
underutilization of the network during compute phases.

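→ CPU communication and GPU compute are very likely to run serially

→ As the number of processors grows, Amdahl's law severely limits the speedup

→ Multiple streams can address this, but the program logic becomes more complex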

NVSHMEM Communication (GPU-Initiated Communication)

GPU → interconnect → GPU

→ Reduces the overhead of communication between GPUs

→ Communication and compute can run in parallel

→ Less time is spent in the serial part
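The difference between the two flows can be sketched with a back-of-the-envelope timing model. This is a simplification under stated assumptions, not a measurement: each step has a fixed compute time and a fixed communication time (both hypothetical numbers), kernel-launch and synchronization overheads are ignored, and in the overlapped case the communication of one step runs concurrently with the compute of the next.

```python
def serialized_time(steps: int, compute: float, comm: float) -> float:
    """Compute and communication alternate and never overlap."""
    return steps * (compute + comm)


def overlapped_time(steps: int, compute: float, comm: float) -> float:
    """Communication of step i overlaps with compute of step i + 1."""
    if steps == 0:
        return 0.0
    # First compute and last communication cannot be hidden;
    # in between, the slower of the two stages dominates each step.
    return compute + (steps - 1) * max(compute, comm) + comm


if __name__ == "__main__":
    steps, compute, comm = 10, 1.0, 0.5  # hypothetical units of time
    print("serialized:", serialized_time(steps, compute, comm))
    print("overlapped:", overlapped_time(steps, compute, comm))
```

In this model the overlapped schedule hides communication entirely whenever it is shorter than compute, which corresponds to shrinking the serial fraction in Amdahl's law.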

Features

Symmetric Data Object

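NVSHMEM operates on PEs (processing elements); a single GPU can host one or more PEs.
In the accompanying figure, the green region can be accessed by all PEs, while the red region is private to each PE and cannot be accessed by other PEs.
Every symmetric data object has a counterpart on every PE with the same name, the same type, and the same size.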
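The symmetric-object idea can be illustrated with a toy model in plain Python. This is not the NVSHMEM API: the class `ToySymmetricHeap` and its methods are hypothetical names invented for this sketch, and the "PEs" are just entries in a list within one process. It only shows that one collective allocation creates a same-named, same-typed, same-sized object on every PE, and that a remote copy is addressed by the symmetric name plus a PE id.

```python
from array import array


class ToySymmetricHeap:
    """Toy stand-in for a symmetric heap: one separate store per PE."""

    def __init__(self, n_pes: int) -> None:
        self.n_pes = n_pes
        self._heaps = [dict() for _ in range(n_pes)]

    def malloc(self, name: str, typecode: str, count: int) -> str:
        """Collective allocation: every PE gets an object of equal name, type, size."""
        for heap in self._heaps:
            heap[name] = array(typecode, [0] * count)
        return name  # the name plays the role of the symmetric address

    def local(self, pe: int, name: str) -> array:
        """Return the copy of a symmetric object that lives on the given PE."""
        return self._heaps[pe][name]

    def put(self, name: str, values, target_pe: int) -> None:
        """One-sided write into the target PE's copy; the target takes no action."""
        dest = self._heaps[target_pe][name]
        if len(values) > len(dest):
            raise ValueError("source is larger than the symmetric object")
        for i, v in enumerate(values):
            dest[i] = v


if __name__ == "__main__":
    heap = ToySymmetricHeap(n_pes=4)
    buf = heap.malloc("buf", "i", 2)
    heap.put(buf, [7, 9], target_pe=3)
    for pe in range(heap.n_pes):
        print(pe, heap.local(pe, buf).tolist())
```

Because every PE holds an object with the same layout, the writer needs no information from the target beyond its PE id, which is what makes one-sided communication possible.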

One-sided Communication

Difference between two-sided and one-sided communication:

| Model | Operations | Addressing |
|---|---|---|
| two-sided | paired send + receive | local address + target address |
| one-sided | send only | symmetric address |

NVSHMEM Example

ring_broadcast (source code omitted)
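As a stand-in for the pattern a ring broadcast follows, here is a toy Python model. It is not NVSHMEM code and not the original listing: each PE is simulated by a thread, the symmetric data buffer and the symmetric signal flag are plain Python lists, and the function name `ring_broadcast` and its parameters are assumptions made for this sketch. Each PE waits for its flag, writes the data into the next PE's buffer with a one-sided put, and then sets that PE's flag.

```python
import threading


def ring_broadcast(values, n_pes, root=0):
    """Toy ring broadcast: each PE forwards the root's data to PE (pe + 1) % n_pes."""
    data = [[0] * len(values) for _ in range(n_pes)]    # symmetric buffer, one copy per PE
    signal = [threading.Event() for _ in range(n_pes)]  # symmetric flag, one per PE
    data[root][:] = values
    last = (root - 1) % n_pes  # final PE in the ring; it has nobody to forward to

    def pe_main(pe):
        if pe == root:
            signal[pe].set()      # the root already holds the data
        signal[pe].wait()         # block until the upstream put has landed
        if pe == last:
            return
        peer = (pe + 1) % n_pes
        data[peer][:] = data[pe]  # one-sided put into the peer's copy
        signal[peer].set()        # only then tell the peer the data is ready

    threads = [threading.Thread(target=pe_main, args=(pe,)) for pe in range(n_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data


if __name__ == "__main__":
    for pe, buf in enumerate(ring_broadcast([1, 2, 3], n_pes=4)):
        print(f"PE {pe}: {buf}")
```

The ordering matters: the data must be written before the flag is set, otherwise the peer could forward stale contents. On a real GPU interconnect that ordering has to be enforced explicitly between the put and the signal.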

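NVSHMEM vs NCCL

GPU interconnects in DGX systems (A100, H100, H200):

NVLink: a proprietary technology developed by NVIDIA. It is a high-speed point-to-point interconnect that links two devices (GPU-GPU or GPU-CPU) with higher bandwidth and lower latency than a conventional PCIe bus.
NVSwitch: a hardware switch built on NVLink that provides all-to-all connectivity among multiple GPUs with very high bandwidth and low latency, suited to large GPU clusters.
InfiniBand: an open-standard networking technology built on multi-lane, high-speed serial links, supporting point-to-point and multicast communication. It connects multiple subnets through routers and switches to form large network topologies.

Overall, NVSHMEM is relatively lightweight and suits small-scale, irregular, complex communication; NCCL is heavier-weight and suits large-scale, regular communication.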

NVSHMEM Based on NCCL

[https://developer.nvidia.com/blog/accelerating-nvshmem-2-0-team-based-collectives-using-nccl/](https://developer.nvidia.com/blog/accelerating-nvshmem-2-0-team-based-collectives-using-nccl/)

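NCCL can deliver up to a 10x speedup. The reason is that the algorithms used by NVSHMEM are optimized only for small data sizes, whereas the algorithms used by NCCL are optimized for both small and large data sizes.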

cuBLASMp Using NVSHMEM

[https://docs.nvidia.com/cuda/cublasmp/getting_started/index.html](https://docs.nvidia.com/cuda/cublasmp/getting_started/index.html)

Further: fusing tensor-parallel (TP) communication and computation with cuBLASMp

Example

[https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASMp/pmatmul.cu](https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASMp/pmatmul.cu)

The workspace is memory allocated with NVSHMEM malloc.
Inference (not confirmed): the kernels inside cublasMpMatmul use the NVSHMEM device API to exchange data through the workspace.

Conclusion

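NVSHMEM provides device and host APIs, so GPUs can communicate directly without going through the CPU.
NVSHMEM relies on the GPU interconnect technologies of DGX systems, such as NVLink and InfiniBand.
NVSHMEM suits small-scale, irregular, complex communication; NCCL suits large-scale, regular communication.
For large data sizes, NVSHMEM can rely on NCCL to improve performance.
cuBLASMp can use NVSHMEM to malloc device memory, which serves as the workspace that caches intermediate results during MMA computation.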

Reference

[https://www.cisl.ucar.edu/sites/default/files/2022-07/Multi%20Node%20Multi%20GPU%20Programming.pdf](https://www.cisl.ucar.edu/sites/default/files/2022-07/Multi%20Node%20Multi%20GPU%20Programming.pdf)

[https://developer.nvidia.com/nvshmem](https://developer.nvidia.com/nvshmem)

[https://developer.nvidia.com/blog/scaling-scientific-computing-with-nvshmem/](https://developer.nvidia.com/blog/scaling-scientific-computing-with-nvshmem/)

[https://docs.nvidia.com/nvshmem/api/index.html](https://docs.nvidia.com/nvshmem/api/index.html)

[https://developer.nvidia.com/zh-cn/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/](https://developer.nvidia.com/zh-cn/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/)

NVSHMEM安装及使用 (NVSHMEM Installation and Usage), CSDN blog
