This document covers the CUDA kernel implementations that form the core computational engine of DeepEP’s communication system. These kernels handle token routing, data movement, and synchronization operations across different hardware communication paths.
Kernel Architecture Overview
DeepEP’s CUDA kernels are organized into specialized source files that handle different aspects of MoE communication:
*(Diagram: Kernel Module Architecture)*
Sources: layout.cu:1-136 intranode.cu:1-806 internode.cu:1-1495 internode_ll.cu:1-737 runtime.cu:1-94
Layout Kernels
The layout.cu file provides token distribution calculations that determine how tokens are routed across ranks and experts:
|Function|Purpose|Template Parameters|
|---|---|---|
|get_dispatch_layout|Calculate token routing statistics|kNumThreads=256, kNumExpertsPerSM=4, kNumRanksPerSM=8|
*(Diagram: Layout Kernel Processing)*

The kernel scans the topk_idx array and accumulates per-expert and per-rank token counts in shared memory, giving each thread its own counters, then reduces the per-thread counts across the block to produce the final distribution statistics.
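The count-then-reduce pattern can be illustrated with a host-side C++ sketch. The function name `count_tokens_per_expert` and its signature are illustrative, not DeepEP's API; the kernel performs the same counting per thread in shared memory before reducing.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the layout computation: topk_idx holds, for each token, the
// num_topk expert indices it is routed to. The result is the number of
// tokens targeting each expert (per-rank counts follow by dividing the
// expert range among ranks).
std::vector<int> count_tokens_per_expert(const std::vector<int64_t>& topk_idx,
                                         int num_tokens, int num_topk,
                                         int num_experts) {
    std::vector<int> num_tokens_per_expert(num_experts, 0);
    for (int t = 0; t < num_tokens; ++t)
        for (int k = 0; k < num_topk; ++k) {
            int64_t e = topk_idx[t * num_topk + k];
            if (e >= 0)                      // a negative index marks "no expert"
                ++num_tokens_per_expert[e];
        }
    return num_tokens_per_expert;
}
```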
Sources: layout.cu:9-116 layout.cu:118-131
Intranode Kernels
The intranode.cu file handles NVLink-based communication within a single node:
|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kNumRanks|Setup rank-to-rank token counts and barriers|
|cached_notify_dispatch|kNumRanks|Optimized notify using cached data|
|dispatch|kNumRanks, kNumThreads=768, kNumTMABytesPerWarp=8192|Token scattering via NVLink|
|cached_notify_combine|kNumRanks|Setup combine metadata with send_head|
|combine|dtype_t, kNumRanks, kNumThreads, kNumTMABytesPerWarp|Result aggregation|
*(Diagram: Intranode Dispatch Architecture)*

The dispatch kernel uses even-numbered SMs for sending and odd-numbered SMs for receiving, with TMA acceleration for efficient memory transfers on SM90 hardware.
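The even/odd SM split can be sketched as follows; `SmRole` and `classify_sm` are illustrative names for the role assignment, assuming two SMs per communication channel as described above:

```cpp
// Sketch of the SM role split in the intranode dispatch kernel: each pair
// of SMs forms one channel, with the even-numbered SM of the pair sending
// over NVLink and the odd-numbered SM receiving.
struct SmRole {
    int channel;      // which communication channel this SM serves
    bool is_sender;   // true: sends tokens; false: receives tokens
};

SmRole classify_sm(int sm_id) {
    return SmRole{ sm_id / 2,          // two SMs per channel
                   sm_id % 2 == 0 };   // even SM sends, odd SM receives
}
```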
Sources: intranode.cu:11-130 intranode.cu:166-509 intranode.cu:579-805
Internode Kernels
The internode.cu file manages RDMA/NVSHMEM communication across multiple nodes using a sophisticated dual-path architecture:
|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kLowLatencyMode, kNumRDMARanks|Cross-node metadata exchange|
|dispatch|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumDispatchRDMASenderWarps|Cross-node token routing|
|combine|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumCombineRDMASenderWarps|Cross-node result aggregation|
*(Diagram: Internode Buffer Architecture)*

Key Features:
- Dual-Path Communication: Combines RDMA and NVLink for optimal bandwidth
- Low-Latency Mode: Direct GPU-to-GPU RDMA using IBGDA
- Symmetric/Asymmetric Buffers: Different buffer patterns for RDMA vs NVLink
- Warp Roles: Specialized warp roles (kRDMASender, kRDMAAndNVLForwarder, kNVLReceivers)
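The warp-role specialization can be sketched as a mapping from warp index to role. The exact layout below (sender warps first, then a coordinator, then forwarders and receivers) is a simplified assumption for illustration, not DeepEP's precise partitioning:

```cpp
// Sketch of warp-role assignment in the internode dispatch kernel: the
// first kNumDispatchRDMASenderWarps warps push tokens over RDMA, one warp
// coordinates the senders, and the remaining warps forward RDMA traffic
// onto NVLink or receive from NVLink peers.
enum class WarpRole { kRDMASender, kRDMASenderCoordinator,
                      kRDMAAndNVLForwarder, kNVLReceivers };

WarpRole warp_role(int warp_id, int num_rdma_sender_warps) {
    if (warp_id < num_rdma_sender_warps)
        return WarpRole::kRDMASender;
    if (warp_id == num_rdma_sender_warps)
        return WarpRole::kRDMASenderCoordinator;
    // Remaining warps split between forwarding and receiving (simplified).
    return warp_id % 2 == 0 ? WarpRole::kRDMAAndNVLForwarder
                            : WarpRole::kNVLReceivers;
}
```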
Sources: internode.cu:17-58 internode.cu:83-303 internode.cu:355-1495
Low-Latency Kernels
The internode_ll.cu file provides specialized kernels optimized for low-latency inference workloads:
|Function|Template Parameters|Purpose|
|---|---|---|
|clean_low_latency_buffer|kNumThreads|Buffer cleanup with NVSHMEM barriers|
|dispatch|kUseFP8, kUseUE8M0, kHidden|Low-latency token dispatch with FP8|
|combine|kUseLogFMT, kHidden, kNumMaxTopk|Result combination with LogFMT compression|
*(Diagram: Low-Latency Execution Phases)*

Key Optimizations:
- Phased Execution: Separate send/receive phases for pipeline optimization
- FP8 Precision: Dynamic FP8 casting with local amax calculation
- LogFMT Compression: Logarithmic format for ultra-low precision
- Zero-Copy Mode: Direct memory access without intermediate buffers
- TMA Acceleration: Tensor Memory Accelerator for SM90 hardware
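The dynamic FP8 casting with a local amax can be modelled on the host as below. This is a sketch, assuming the FP8-E4M3 maximum of 448 and a per-group scale stored separately; the kernel applies the same idea per 128-element group, and the final rounding to the 8-bit format is omitted here:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Sketch of dynamic FP8 quantization: find the local absolute maximum
// (amax) of a group, derive a scale so the largest magnitude maps to the
// FP8-E4M3 maximum, then scale and round each value. The scale is kept in
// a separate tensor so the receiver can dequantize.
constexpr float kFP8E4M3Max = 448.0f;

std::pair<std::vector<float>, float> fp8_quantize(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = amax > 0.0f ? amax / kFP8E4M3Max : 1.0f;
    std::vector<float> q(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        q[i] = std::round(x[i] / scale);   // value to be stored as FP8
    return {q, scale};
}
```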
Sources: internode_ll.cu:10-37 internode_ll.cu:39-392 internode_ll.cu:394-732
Runtime Utilities
The runtime.cu file provides synchronization and NVSHMEM management utilities:
|Function|Template Parameters|Purpose|
|---|---|---|
|barrier|kNumRanks|Intranode GPU barrier synchronization|
|get_unique_id|-|NVSHMEM unique ID generation|
|init|-|NVSHMEM initialization with team creation|
|alloc/free|-|NVSHMEM memory management|
|barrier|-|Global NVSHMEM barrier|
|finalize|-|NVSHMEM cleanup|
*(Diagram: NVSHMEM Team Management)*

The runtime manages NVSHMEM teams for low-latency mode, where GPU ranks are grouped by RDMA connectivity for optimized communication patterns.
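The grouping can be sketched as follows, assuming the common layout `rank = node_id * num_local_ranks + local_rank` (an assumption for illustration): ranks sharing a local index across nodes communicate over RDMA and therefore land in the same sub-team.

```cpp
#include <vector>

// Sketch of RDMA team membership for low-latency mode: one rank per node,
// all with the same local (NVLink) index, forms an NVSHMEM sub-team.
std::vector<int> rdma_team_members(int my_rank, int num_ranks,
                                   int num_local_ranks) {
    int local_rank = my_rank % num_local_ranks;   // position within the node
    std::vector<int> team;
    for (int r = local_rank; r < num_ranks; r += num_local_ranks)
        team.push_back(r);                        // same local index, next node
    return team;
}
```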
Sources: runtime.cu:18-31 runtime.cu:37-89
Implementation Details
Memory Access Patterns
The CUDA kernels use memory access patterns tuned to each communication mode, with distinct layouts for NVLink peer buffers and RDMA-registered buffers.
Sources: layout.cu:18-24 api.cuh:45-50 api.cuh:142-153
Synchronization Mechanisms
The kernel system uses multiple synchronization primitives:
|Mechanism|Scope|Implementation|
|---|---|---|
|__syncthreads()|Thread block|CUDA built-in|
|barrier()|Intranode GPUs|Signal-based|
|notify_dispatch|Cross-node|RDMA coordination|
|cached_notify|Cross-node (cached)|Reuses cached metadata to skip a full exchange|
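The signal-based intranode barrier can be modelled with an arrive-and-spin pattern. This sketch uses `std::atomic` on the host purely for illustration; the real kernels use flags in peer-mapped GPU memory with acquire/release semantics:

```cpp
#include <atomic>

// Sketch of a signal-based barrier across intranode GPUs: each rank
// signals arrival by incrementing a shared counter, then spins until all
// kNumRanks ranks have arrived.
void barrier_arrive_and_wait(std::atomic<int>& arrived, int num_ranks) {
    arrived.fetch_add(1, std::memory_order_acq_rel);   // signal arrival
    while (arrived.load(std::memory_order_acquire) < num_ranks)
        ;                                              // spin until everyone arrives
}
```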
Sources: api.cuh:10 api.cuh:85-96 layout.cu:35-96
Hardware Optimization Features
SM90 Architecture Support
The kernel system leverages SM90 features for enhanced performance:
- Cooperative Kernels: Enable cross-SM synchronization
- Cluster Dimensions: Group SMs for coordinated execution
- TMA (Tensor Memory Accelerator): Optimized memory transfers
- Dynamic Shared Memory: Configurable shared memory allocation
Multi-Precision Support
Kernels support multiple data types with specialized code paths:
- BF16: Primary data type for training workloads
- FP8: Experimental support for inference optimization
- Scale tensors: Separate scaling factors for quantized data
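The BF16 path relies on BF16 being the upper 16 bits of an FP32 value (same exponent range, 8 mantissa bits), so a minimal conversion is a truncation. A host-side sketch (round-to-nearest-even omitted for brevity):

```cpp
#include <cstdint>
#include <cstring>

// BF16 <-> FP32 conversion sketch: BF16 keeps the sign, the full 8-bit
// exponent, and the top 7 mantissa bits of an FP32 value.
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);       // drop low mantissa bits
}

float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16; // restore bit position
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```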
Rank Scalability
The system supports various distributed configurations:
- Intranode: 2-8 GPUs per node via NVLink
- Internode: 2-16 RDMA ranks across nodes
- Hybrid: Combined NVLink + RDMA communication
Sources: launch.cuh:7-18 launch.cuh:79-83 launch.cuh:62-69