This document covers the CUDA kernel implementations that form the core computational engine of DeepEP’s communication system. These kernels handle token routing, data movement, and synchronization operations across different hardware communication paths.

Kernel Architecture Overview

DeepEP’s CUDA kernels are organized into specialized source files that handle different aspects of MoE communication:

Kernel Module Architecture

Sources: layout.cu:1-136 intranode.cu:1-806 internode.cu:1-1495 internode_ll.cu:1-737 runtime.cu:1-94

Layout Kernels

The layout.cu file provides token distribution calculations that determine how tokens are routed across ranks and experts:

|Function|Purpose|Template Parameters|
|---|---|---|
|get_dispatch_layout|Calculate token routing statistics|kNumThreads=256, kNumExpertsPerSM=4, kNumRanksPerSM=8|

Layout Kernel Processing

The kernel processes topk_idx arrays to calculate token distribution statistics using shared memory for efficient per-thread counting, then reduces results across threads.
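The counting step can be modeled on the CPU as follows. This is an illustrative simplification, not DeepEP's implementation: a serial loop stands in for the per-thread shared-memory counters and the cross-thread reduction, and the `get_dispatch_layout_cpu` name and the once-per-rank deduplication are assumptions about the kernel's semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU model (hypothetical simplification) of the counting step in
// get_dispatch_layout: walk each token's top-k expert indices and
// accumulate per-expert and per-rank token counts. On the GPU this is
// done with per-thread counters in shared memory followed by a
// reduction across threads; a serial loop yields the same totals.
struct DispatchLayout {
    std::vector<int> tokens_per_expert;
    std::vector<int> tokens_per_rank;
};

DispatchLayout get_dispatch_layout_cpu(const std::vector<int64_t>& topk_idx,
                                       int num_tokens, int num_topk,
                                       int num_experts, int num_ranks) {
    DispatchLayout out{std::vector<int>(num_experts, 0),
                       std::vector<int>(num_ranks, 0)};
    const int experts_per_rank = num_experts / num_ranks;
    for (int t = 0; t < num_tokens; ++t) {
        std::vector<bool> rank_seen(num_ranks, false);
        for (int k = 0; k < num_topk; ++k) {
            const int64_t e = topk_idx[t * num_topk + k];
            if (e < 0)
                continue;                       // negative index: unused slot
            out.tokens_per_expert[e]++;
            const int r = static_cast<int>(e) / experts_per_rank;
            if (!rank_seen[r]) {                // count each token once per rank
                rank_seen[r] = true;
                out.tokens_per_rank[r]++;
            }
        }
    }
    return out;
}
```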

Sources: layout.cu9-116 layout.cu118-131

Intranode Kernels

The intranode.cu file handles NVLink-based communication within a single node:

|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kNumRanks|Setup rank-to-rank token counts and barriers|
|cached_notify_dispatch|kNumRanks|Optimized notify using cached data|
|dispatch|kNumRanks, kNumThreads=768, kNumTMABytesPerWarp=8192|Token scattering via NVLink|
|cached_notify_combine|kNumRanks|Setup combine metadata with send_head|
|combine|dtype_t, kNumRanks, kNumThreads, kNumTMABytesPerWarp|Result aggregation|

Intranode Dispatch Architecture

The dispatch kernel uses even-numbered SMs for sending and odd-numbered SMs for receiving, with TMA acceleration for efficient memory transfers on SM90 hardware.
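The even/odd SM pairing can be sketched as a simple mapping from block index to a role and a channel. The names below are illustrative, not DeepEP's; the point is that consecutive block pairs form one send/receive channel.

```cpp
#include <cassert>
#include <utility>

// Sketch of the send/receive SM pairing: even-numbered blocks send,
// odd-numbered blocks receive, and each adjacent pair shares a channel.
// Role and function names are illustrative assumptions.
enum class Role { Sender, Receiver };

std::pair<Role, int> sm_role(int block_id) {
    const Role role = (block_id % 2 == 0) ? Role::Sender : Role::Receiver;
    const int channel = block_id / 2;  // sender/receiver pair shares one channel
    return {role, channel};
}
```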

Sources: intranode.cu:11-130 intranode.cu:166-509 intranode.cu:579-805

Internode Kernels

The internode.cu file manages RDMA/NVSHMEM communication across multiple nodes using a dual-path architecture:

|Function|Template Parameters|Purpose|
|---|---|---|
|notify_dispatch|kLowLatencyMode, kNumRDMARanks|Cross-node metadata exchange|
|dispatch|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumDispatchRDMASenderWarps|Cross-node token routing|
|combine|kLowLatencyMode, kNumRDMARanks, kCachedMode, kNumTMABytesPerWarp, kNumCombineRDMASenderWarps|Cross-node result aggregation|

Internode Buffer Architecture

Key Features:

  • Dual-Path Communication: Combines RDMA and NVLink for optimal bandwidth
  • Low-Latency Mode: Direct GPU-to-GPU RDMA using IBGDA
  • Symmetric/Asymmetric Buffers: Different buffer patterns for RDMA vs NVLink
  • Warp Roles: Specialized warp roles (kRDMASender, kRDMAAndNVLForwarder, kNVLReceivers)
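A warp-role scheme like the one above can be sketched as a partition of a block's warps by index. The role names come from the source; the split points and function name below are illustrative assumptions, not DeepEP's actual partition.

```cpp
#include <cassert>

// Hypothetical warp-role partition for the internode dispatch kernel:
// the first warps issue RDMA sends, the next group forwards received
// RDMA data onto NVLink, and the remainder drain NVLink buffers.
// Boundaries are assumptions for illustration.
enum WarpRole { kRDMASender, kRDMAAndNVLForwarder, kNVLReceivers };

WarpRole warp_role(int warp_id, int num_rdma_sender_warps,
                   int num_forwarder_warps) {
    if (warp_id < num_rdma_sender_warps)
        return kRDMASender;               // issue RDMA sends
    if (warp_id < num_rdma_sender_warps + num_forwarder_warps)
        return kRDMAAndNVLForwarder;      // move RDMA data onto NVLink
    return kNVLReceivers;                 // drain NVLink buffers
}
```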

Sources: internode.cu:17-58 internode.cu:83-303 internode.cu:355-1495

Low-Latency Kernels

The internode_ll.cu file provides specialized kernels optimized for low-latency inference workloads:

|Function|Template Parameters|Purpose|
|---|---|---|
|clean_low_latency_buffer|kNumThreads|Buffer cleanup with NVSHMEM barriers|
|dispatch|kUseFP8, kUseUE8M0, kHidden|Low-latency token dispatch with FP8|
|combine|kUseLogFMT, kHidden, kNumMaxTopk|Result combination with LogFMT compression|

Low-Latency Execution Phases

Key Optimizations:

  • Phased Execution: Separate send/receive phases for pipeline optimization
  • FP8 Precision: Dynamic FP8 casting with local amax calculation
  • LogFMT Compression: Logarithmic format for ultra-low precision
  • Zero-Copy Mode: Direct memory access without intermediate buffers
  • TMA Acceleration: Tensor Memory Accelerator for SM90 hardware
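The dynamic FP8 cast with local amax can be sketched on the CPU as follows. This assumes the standard FP8 E4M3 maximum of 448; the function name is illustrative, and the final rounding to actual E4M3 bits is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// CPU sketch of dynamic FP8 scaling: find the local absolute maximum
// (amax) of a group of values, derive a scale so the largest value maps
// to the E4M3 maximum (448), and return the scaled values plus the
// inverse scale needed for dequantization. Names are illustrative.
constexpr float kFP8E4M3Max = 448.0f;

std::pair<std::vector<float>, float>
fp8_scale_group(const std::vector<float>& x) {
    float amax = 1e-4f;                       // floor avoids division by zero
    for (float v : x) amax = std::max(amax, std::fabs(v));
    const float scale = kFP8E4M3Max / amax;   // map amax onto the E4M3 max
    std::vector<float> scaled(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        scaled[i] = x[i] * scale;             // would then be rounded to E4M3
    return {scaled, 1.0f / scale};            // inverse scale for the decoder
}
```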

Sources: internode_ll.cu:10-37 internode_ll.cu:39-392 internode_ll.cu:394-732

Runtime Utilities

The runtime.cu file provides synchronization and NVSHMEM management utilities:

|Function|Template Parameters|Purpose|
|---|---|---|
|barrier|kNumRanks|Intranode GPU barrier synchronization|
|get_unique_id|-|NVSHMEM unique ID generation|
|init|-|NVSHMEM initialization with team creation|
|alloc/free|-|NVSHMEM memory management|
|barrier|-|Global NVSHMEM barrier|
|finalize|-|NVSHMEM cleanup|

NVSHMEM Team Management

The runtime manages NVSHMEM teams for low-latency mode, where GPU ranks are grouped by RDMA connectivity for optimized communication patterns.
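The grouping can be sketched as simple index arithmetic: ranks that share the same local (NVLink) index across nodes are placed in one team, so each team maps onto one RDMA connection. The constant and function names below are illustrative assumptions, as is the choice of 8 GPUs per node.

```cpp
#include <cassert>

// Sketch of the rank grouping behind the low-latency NVSHMEM teams:
// with kNumNVLRanks GPUs per node, a global rank decomposes into a node
// (RDMA) index and a local (NVL) index, and ranks with the same local
// index form one team. Names and the constant are assumptions.
constexpr int kNumNVLRanks = 8;  // GPUs per node (assumed)

int rdma_rank(int global_rank) { return global_rank / kNumNVLRanks; }  // node index
int nvl_rank(int global_rank)  { return global_rank % kNumNVLRanks; }  // local GPU
int team_id(int global_rank)   { return nvl_rank(global_rank); }       // one team per local index
```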

Sources: runtime.cu:18-31 runtime.cu:37-89

Implementation Details

Memory Access Patterns

CUDA kernels use memory access patterns tailored to each communication mode (NVLink versus RDMA).

Sources: layout.cu:18-24 api.cuh:45-50 api.cuh:142-153

Synchronization Mechanisms

The kernel system uses multiple synchronization primitives:

|Mechanism|Scope|Implementation|
|---|---|---|
|__syncthreads()|Thread block|CUDA built-in|
|barrier()|Intranode GPUs|Signal-based|
|notify_dispatch|Cross-node|RDMA coordination|
|cached_notify|Optimized notify|Cached metadata|
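The signal-based intranode barrier can be modeled on the host with an arrival counter: each rank signals its arrival, then spins until all ranks have arrived. The real kernel operates on flags in peer-mapped GPU memory with system-scope atomics; this CPU sketch only mirrors the shape of the protocol.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host model of a signal-based barrier: increment a shared arrival
// counter, then spin until all participants have incremented it.
// The GPU version uses device-side flags instead of std::atomic.
void barrier_wait(std::atomic<int>& arrived, int num_ranks) {
    arrived.fetch_add(1, std::memory_order_acq_rel);   // signal arrival
    while (arrived.load(std::memory_order_acquire) < num_ranks)
        std::this_thread::yield();                     // spin until all arrive
}
```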

Sources: api.cuh:10 api.cuh:85-96 layout.cu:35-96

Hardware Optimization Features

SM90 Architecture Support

The kernel system leverages SM90 features for enhanced performance:

  • Cooperative Kernels: Enable cross-SM synchronization
  • Cluster Dimensions: Group SMs for coordinated execution
  • TMA (Tensor Memory Accelerator): Optimized memory transfers
  • Dynamic Shared Memory: Configurable shared memory allocation

Multi-Precision Support

Kernels support multiple data types with specialized code paths:

  • BF16: Primary data type for training workloads
  • FP8: Experimental support for inference optimization
  • Scale tensors: Separate scaling factors for quantized data

Rank Scalability

The system supports various distributed configurations:

  • Intranode: 2-8 GPUs per node via NVLink
  • Internode: 2-16 RDMA ranks across nodes
  • Hybrid: Combined NVLink + RDMA communication

Sources: launch.cuh:7-18 launch.cuh:79-83 launch.cuh:62-69