This document covers the CUDA kernel implementations that form the core computational engine of DeepEP’s communication system. These kernels handle token routing, data movement, and synchronization operations across different hardware communication paths.

Kernel Architecture Overview

DeepEP’s CUDA kernels are organized into specialized source files that handle different aspects of MoE communication:

Kernel Module Architecture

Sources: layout.cu:1-136 intranode.cu:1-806 internode.cu:1-1495 internode_ll.cu:1-737 runtime.cu:1-94

Layout Kernels

The layout.cu file provides token distribution calculations that determine how tokens are routed across ranks and experts:

|Function|Purpose|Template Parameters|
|---|---|---|
|get_dispatch_layout|Calculate token routing statistics|kNumThreads=256, kNumExpertsPerSM=4, kNumRanksPerSM=8|

Layout Kernel Processing

The kernel processes topk_idx arrays to calculate token distribution statistics using shared memory for efficient per-thread counting, then reduces results across threads.
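The counting step can be modeled on the CPU as follows. This is an illustrative simplification, not DeepEP's implementation: a serial loop stands in for the per-thread shared-memory counters and the cross-thread reduction, and the `get_dispatch_layout_cpu` name and the once-per-rank deduplication are assumptions about the kernel's semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU model (hypothetical simplification) of the counting step in
// get_dispatch_layout: walk each token's top-k expert indices and
// accumulate per-expert and per-rank token counts. On the GPU this is
// done with per-thread counters in shared memory followed by a
// reduction across threads; a serial loop yields the same totals.
struct DispatchLayout {
    std::vector<int> tokens_per_expert;
    std::vector<int> tokens_per_rank;
};

DispatchLayout get_dispatch_layout_cpu(const std::vector<int64_t>& topk_idx,
                                       int num_tokens, int num_topk,
                                       int num_experts, int num_ranks) {
    DispatchLayout out{std::vector<int>(num_experts, 0),
                       std::vector<int>(num_ranks, 0)};
    const int experts_per_rank = num_experts / num_ranks;
    for (int t = 0; t < num_tokens; ++t) {
        std::vector<bool> rank_seen(num_ranks, false);
        for (int k = 0; k < num_topk; ++k) {
            const int64_t e = topk_idx[t * num_topk + k];
            if (e < 0)
                continue;                       // negative index: unused slot
            out.tokens_per_expert[e]++;
            const int r = static_cast<int>(e) / experts_per_rank;
            if (!rank_seen[r]) {                // count each token once per rank
                rank_seen[r] = true;
                out.tokens_per_rank[r]++;
            }
        }
    }
    return out;
}
```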

Sources: layout.cu9-116 layout.cu118-131

Intranode Kernels

The intranode.cu file handles NVLink-based communication within a single node:

|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kNumRanks|Setup rank-to-rank token counts and barriers|
|cached_notify_dispatch|kNumRanks|Optimized notify using cached data|
|dispatch|kNumRanks, kNumThreads=768, kNumTMABytesPerWarp=8192|Token scattering via NVLink|
|cached_notify_combine|kNumRanks|Setup combine metadata with send_head|
|combine|dtype_t, kNumRanks, kNumThreads, kNumTMABytesPerWarp|Result aggregation|

Intranode Dispatch Architecture

The dispatch kernel uses even-numbered SMs for sending and odd-numbered SMs for receiving, with TMA acceleration for efficient memory transfers on SM90 hardware.
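The even/odd SM pairing can be sketched as a simple mapping from block index to a role and a channel. The names below are illustrative, not DeepEP's; the point is that consecutive block pairs form one send/receive channel.

```cpp
#include <cassert>
#include <utility>

// Sketch of the send/receive SM pairing: even-numbered blocks send,
// odd-numbered blocks receive, and each adjacent pair shares a channel.
// Role and function names are illustrative assumptions.
enum class Role { Sender, Receiver };

std::pair<Role, int> sm_role(int block_id) {
    const Role role = (block_id % 2 == 0) ? Role::Sender : Role::Receiver;
    const int channel = block_id / 2;  // sender/receiver pair shares one channel
    return {role, channel};
}
```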

Sources: intranode.cu:11-130 intranode.cu:166-509 intranode.cu:579-805

Internode Kernels

The internode.cu file manages RDMA/NVSHMEM communication across multiple nodes using a dual-path architecture:

|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kLowLatencyMode, kNumRDMARanks|Cross-node metadata exchange|
|dispatch|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumDispatchRDMASenderWarps|Cross-node token routing|
|combine|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumCombineRDMASenderWarps|Cross-node result aggregation|

Internode Buffer Architecture

Key Features:

  • Dual-Path Communication: Combines RDMA and NVLink for optimal bandwidth
  • Low-Latency Mode: Direct GPU-to-GPU RDMA using IBGDA
  • Symmetric/Asymmetric Buffers: Different buffer patterns for RDMA vs NVLink
  • Warp Roles: Specialized warp roles (kRDMASender, kRDMAAndNVLForwarder, kNVLReceivers)
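A warp-role scheme like the one above can be sketched as a partition of a block's warps by index. The role names come from the source; the split points and function name below are illustrative assumptions, not DeepEP's actual partition.

```cpp
#include <cassert>

// Hypothetical warp-role partition for the internode dispatch kernel:
// the first warps issue RDMA sends, the next group forwards received
// RDMA data onto NVLink, and the remainder drain NVLink buffers.
// Boundaries are assumptions for illustration.
enum WarpRole { kRDMASender, kRDMAAndNVLForwarder, kNVLReceivers };

WarpRole warp_role(int warp_id, int num_rdma_sender_warps,
                   int num_forwarder_warps) {
    if (warp_id < num_rdma_sender_warps)
        return kRDMASender;               // issue RDMA sends
    if (warp_id < num_rdma_sender_warps + num_forwarder_warps)
        return kRDMAAndNVLForwarder;      // move RDMA data onto NVLink
    return kNVLReceivers;                 // drain NVLink buffers
}
```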

Sources: internode.cu:17-58 internode.cu:83-303 internode.cu:355-1495

Low-Latency Kernels

The internode_ll.cu file provides specialized kernels optimized for low-latency inference workloads:

|Function|Template Parameters|Purpose|
|---|---|---|
|clean_low_latency_buffer|kNumThreads|Buffer cleanup with NVSHMEM barriers|
|dispatch|kUseFP8, kUseUE8M0, kHidden|Low-latency token dispatch with FP8|
|combine|kUseLogFMT, kHidden, kNumMaxTopk|Result combination with LogFMT compression|

Low-Latency Execution Phases

Key Optimizations:

  • Phased Execution: Separate send/receive phases for pipeline optimization
  • FP8 Precision: Dynamic FP8 casting with local amax calculation
  • LogFMT Compression: Logarithmic format for ultra-low precision
  • Zero-Copy Mode: Direct memory access without intermediate buffers
  • TMA Acceleration: Tensor Memory Accelerator for SM90 hardware
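The dynamic FP8 cast with local amax can be sketched on the CPU as follows. This assumes the standard FP8 E4M3 maximum of 448; the function name is illustrative, and the final rounding to actual E4M3 bits is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// CPU sketch of dynamic FP8 scaling: find the local absolute maximum
// (amax) of a group of values, derive a scale so the largest value maps
// to the E4M3 maximum (448), and return the scaled values plus the
// inverse scale needed for dequantization. Names are illustrative.
constexpr float kFP8E4M3Max = 448.0f;

std::pair<std::vector<float>, float>
fp8_scale_group(const std::vector<float>& x) {
    float amax = 1e-4f;                       // floor avoids division by zero
    for (float v : x) amax = std::max(amax, std::fabs(v));
    const float scale = kFP8E4M3Max / amax;   // map amax onto the E4M3 max
    std::vector<float> scaled(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        scaled[i] = x[i] * scale;             // would then be rounded to E4M3
    return {scaled, 1.0f / scale};            // inverse scale for the decoder
}
```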

Sources: internode_ll.cu:10-37 internode_ll.cu:39-392 internode_ll.cu:394-732

Runtime Utilities

The runtime.cu file provides synchronization and NVSHMEM management utilities:

|Function|Template Parameters|Purpose|
|---|---|---|
|barrier|kNumRanks|Intranode GPU barrier synchronization|
|get_unique_id|-|NVSHMEM unique ID generation|
|init|-|NVSHMEM initialization with team creation|
|alloc/free|-|NVSHMEM memory management|
|barrier|-|Global NVSHMEM barrier|
|finalize|-|NVSHMEM cleanup|

NVSHMEM Team Management

The runtime manages NVSHMEM teams for low-latency mode, where GPU ranks are grouped by RDMA connectivity for optimized communication patterns.
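The grouping can be sketched as simple index arithmetic: ranks that share the same local (NVLink) index across nodes are placed in one team, so each team maps onto one RDMA connection. The constant and function names below are illustrative assumptions, as is the choice of 8 GPUs per node.

```cpp
#include <cassert>

// Sketch of the rank grouping behind the low-latency NVSHMEM teams:
// with kNumNVLRanks GPUs per node, a global rank decomposes into a node
// (RDMA) index and a local (NVL) index, and ranks with the same local
// index form one team. Names and the constant are assumptions.
constexpr int kNumNVLRanks = 8;  // GPUs per node (assumed)

int rdma_rank(int global_rank) { return global_rank / kNumNVLRanks; }  // node index
int nvl_rank(int global_rank)  { return global_rank % kNumNVLRanks; }  // local GPU
int team_id(int global_rank)   { return nvl_rank(global_rank); }       // one team per local index
```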

Sources: runtime.cu:18-31 runtime.cu:37-89

Implementation Details

Memory Access Patterns

CUDA kernels use memory access patterns tailored to each communication mode (NVLink versus RDMA).

Sources: layout.cu:18-24 api.cuh:45-50 api.cuh:142-153

Synchronization Mechanisms

The kernel system uses multiple synchronization primitives:

|Mechanism|Scope|Implementation|
|---|---|---|
|__syncthreads()|Thread block|CUDA built-in|
|barrier()|Intranode GPUs|Signal-based|
|notify_dispatch|Cross-node|RDMA coordination|
|cached_notify|Optimized notify|Cached metadata|
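The signal-based intranode barrier can be modeled on the host with an arrival counter: each rank signals its arrival, then spins until all ranks have arrived. The real kernel operates on flags in peer-mapped GPU memory with system-scope atomics; this CPU sketch only mirrors the shape of the protocol.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host model of a signal-based barrier: increment a shared arrival
// counter, then spin until all participants have incremented it.
// The GPU version uses device-side flags instead of std::atomic.
void barrier_wait(std::atomic<int>& arrived, int num_ranks) {
    arrived.fetch_add(1, std::memory_order_acq_rel);   // signal arrival
    while (arrived.load(std::memory_order_acquire) < num_ranks)
        std::this_thread::yield();                     // spin until all arrive
}
```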

Sources: api.cuh:10 api.cuh:85-96 layout.cu:35-96

Hardware Optimization Features

SM90 Architecture Support

The kernel system leverages SM90 features for enhanced performance:

  • Cooperative Kernels: Enable cross-SM synchronization
  • Cluster Dimensions: Group SMs for coordinated execution
  • TMA (Tensor Memory Accelerator): Optimized memory transfers
  • Dynamic Shared Memory: Configurable shared memory allocation

Multi-Precision Support

Kernels support multiple data types with specialized code paths:

  • BF16: Primary data type for training workloads
  • FP8: Experimental support for inference optimization
  • Scale tensors: Separate scaling factors for quantized data

Rank Scalability

The system supports various distributed configurations:

  • Intranode: 2-8 GPUs per node via NVLink
  • Internode: 2-16 RDMA ranks across nodes
  • Hybrid: Combined NVLink + RDMA communication

Sources: launch.cuh:7-18 launch.cuh:79-83 launch.cuh:62-69