This document covers the core communication implementation behind DeepEP’s internode operations: the RDMA- and NVLink-based data dispatch and combine mechanisms. The implementation orchestrates token communication across multiple RDMA ranks and NVLink peers in distributed expert-parallel workloads.
For intranode-only communication within a single node, see Intranode Communication. For low-latency optimized communication modes, see Low-Latency Mode. For higher-level communication patterns and the dispatch-combine model, see Communication Model.
Communication Architecture Overview
The internode communication system implements a two-phase dispatch-combine pattern that coordinates data movement across both RDMA ranks (inter-node) and NVLink peers (intra-node). The implementation uses a hierarchical approach where data flows through multiple communication layers.

Sources: internode.cu84-303 internode.cu357-992 internode.cu1368-1814
Core Data Structures and Metadata
The communication system relies on several key data structures for tracking token sources, managing buffer layouts, and coordinating multi-rank operations.
SourceMeta Structure
The SourceMeta structure encodes critical routing information for each token, including source RDMA rank and NVLink peer membership.

Sources: internode.cu17-36 internode.cu530-541 internode.cu932-934
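A minimal host-side sketch of what a SourceMeta-style record can look like: the token's source RDMA rank plus a bitmask recording which NVLink peers hold a copy. The field widths and names here are illustrative, not necessarily the exact layout in internode.cu.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative SourceMeta-style record: routing metadata carried with each
// token so receivers know where it came from and which NVLink peers have it.
struct SourceMeta {
    int32_t src_rdma_rank;               // RDMA rank that produced this token
    uint32_t is_token_in_nvl_rank_bits;  // bit i set => NVLink peer i holds a copy

    bool is_token_in_nvl_rank(int nvl_rank) const {
        return (is_token_in_nvl_rank_bits >> nvl_rank) & 1u;
    }
};
```

Packing the NVLink membership into a bitmask keeps the per-token metadata small enough to travel alongside the token payload in a single transfer.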
Buffer Layout Management
The system uses symmetric and asymmetric buffer layouts to optimize data placement and access patterns across different communication scenarios.
|Buffer Type|Structure|Purpose|Key Methods|
|---|---|---|---|
|SymBuffer<T>|Symmetric layout|RDMA rank communication|send_buffer(), recv_buffer()|
|AsymBuffer<T>|Asymmetric layout|NVLink peer communication|buffer(), advance()|
|Channel-based|Multi-channel partitioning|Parallel processing|get_channel_task_range()|
Sources: internode.cu415-418 internode.cu430-434 internode.cu1434-1436
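The channel-based partitioning row above can be sketched as a simple range split: divide the token stream evenly across channels, with earlier channels absorbing the remainder. The exact policy of `get_channel_task_range` in internode.cu may differ; this is a hedged reconstruction of the idea.

```cpp
#include <cassert>
#include <utility>

// Illustrative get_channel_task_range-style helper: returns the half-open
// token range [start, end) that a given channel is responsible for.
std::pair<int, int> get_channel_task_range(int num_tokens, int num_channels,
                                           int channel_id) {
    int base = num_tokens / num_channels;
    int rem = num_tokens % num_channels;  // first `rem` channels get one extra token
    int start = channel_id * base + (channel_id < rem ? channel_id : rem);
    int count = base + (channel_id < rem ? 1 : 0);
    return {start, start + count};
}
```

Disjoint per-channel ranges let the channels proceed in parallel without locking over shared token indices.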
Dispatch Phase Implementation
The dispatch phase coordinates the forwarding of tokens from source ranks to destination ranks through a multi-stage process involving notification, data movement, and synchronization.
Notification and Coordination
The notify_dispatch kernel establishes communication channels and coordinates token count information across all participating ranks.

Sources: internode.cu84-303 internode.cu305-348
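A hedged host-side sketch of the bookkeeping a notify_dispatch-style step performs: counting how many tokens each destination RDMA rank will receive, so receivers can lay out their buffers before any data arrives. The function and variable names here are illustrative, not DeepEP's.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Count tokens destined for each RDMA rank. A token routed to several experts
// on the same rank is still sent to that rank only once, hence the dedup.
std::vector<int> count_tokens_per_rdma_rank(
        const std::vector<std::vector<int>>& topk_ranks,  // per-token destination ranks
        int num_rdma_ranks) {
    std::vector<int> counts(num_rdma_ranks, 0);
    std::vector<char> seen(num_rdma_ranks);
    for (const auto& dests : topk_ranks) {
        std::fill(seen.begin(), seen.end(), 0);
        for (int r : dests)
            if (!seen[r]) { seen[r] = 1; ++counts[r]; }
    }
    return counts;
}
```

Exchanging these counts up front is what allows the subsequent data movement to write directly into correctly sized receive buffers.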
Token Dispatch Kernel
The main dispatch kernel implements a complex multi-warp coordination system with specialized roles for different aspects of the communication pipeline.
Warp Role Assignments

Sources: internode.cu370-405 internode.cu462-614 internode.cu690-834
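One common way to implement such role specialization is to switch on the warp index within the block, dedicating contiguous warp ranges to sending, forwarding, and receiving. The role names and split points below are a hedged sketch and do not match internode.cu's exact assignment.

```cpp
#include <cassert>

// Illustrative warp-role assignment: the first half of the warps send over
// RDMA (with one coordinator), the second half forwards/receives over NVLink.
enum class WarpRole { kRDMASender, kRDMASenderCoordinator, kNVLForwarder, kNVLReceiver };

WarpRole get_warp_role(int warp_id, int num_warps) {
    if (warp_id < num_warps / 2 - 1) return WarpRole::kRDMASender;
    if (warp_id == num_warps / 2 - 1) return WarpRole::kRDMASenderCoordinator;
    if (warp_id < num_warps - 1) return WarpRole::kNVLForwarder;
    return WarpRole::kNVLReceiver;
}
```

Assigning roles by warp rather than by thread keeps each stage's 32 lanes free to cooperate on wide copies and warp-level RDMA puts.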
TMA Integration and Memory Management

The dispatch kernel uses Tensor Memory Accelerator (TMA) operations for efficient data copying, particularly in the NVL forwarding and receiving paths.
Sources: internode.cu445-455 internode.cu802-818 internode.cu940-984
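The copy pattern TMA enables can be sketched, with plain `memcpy` standing in for the hardware bulk-copy engine: data is staged through alternating buffers so one chunk can be consumed while the next is loaded. This is a structural sketch only; the real kernels issue asynchronous TMA operations with barriers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Double-buffered staged copy: chunk i lands in stage[i % 2], modeling the
// ping-pong between in-flight TMA loads and the warp consuming the prior chunk.
void pipelined_copy(const int* src, int* dst, int n, int chunk) {
    std::vector<int> stage[2] = {std::vector<int>(chunk), std::vector<int>(chunk)};
    for (int base = 0, buf = 0; base < n; base += chunk, buf ^= 1) {
        int len = std::min(chunk, n - base);
        std::memcpy(stage[buf].data(), src + base, len * sizeof(int));  // "TMA load"
        std::memcpy(dst + base, stage[buf].data(), len * sizeof(int));  // consume stage
    }
}
```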
Combine Phase Implementation
The combine phase aggregates results from expert processing back to the originating ranks, using a coordinated multi-stage approach that mirrors the dispatch phase structure.
Cached Notification System
The cached_notify kernel prepares metadata for the combine phase, handling both buffer cleanup and head pointer management for cached execution scenarios.

Sources: internode.cu1043-1185 internode.cu1187-1222
Combine Kernel Architecture
The combine kernel implements the final aggregation phase with specialized warp roles for handling different aspects of the reduction process.
Multi-Stage Token Combination

Sources: internode.cu1224-1357 internode.cu1668-1677 internode.cu1747-1760
Buffer Management and Synchronization
The communication implementation relies on sophisticated buffer management and synchronization mechanisms to coordinate multi-rank operations safely and efficiently.
Queue Management and Flow Control

Sources: internode.cu588-613 internode.cu735-758 internode.cu1629-1642
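The head/tail flow control underlying such queues can be sketched as a ring buffer over monotonically increasing counters: the sender produces only while `tail - head < capacity`, and the receiver advances `head` after consuming a slot. In the real kernels these counters are updated with GPU memory fences and IBGDA atomics; plain fields stand in for them here.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative single-producer/single-consumer channel queue. Monotonic
// counters never wrap the comparison: tail - head is the number of filled slots.
struct ChannelQueue {
    uint64_t head, tail, capacity;

    bool try_push() {  // sender side: claim a free slot if one exists
        if (tail - head >= capacity) return false;
        ++tail;
        return true;
    }
    bool try_pop() {   // receiver side: consume a filled slot if one exists
        if (head == tail) return false;
        ++head;
        return true;
    }
};
```

Keeping the counters monotonic (rather than wrapping indices) makes the full/empty distinction unambiguous without a separate count field.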
IBGDA Integration
The implementation integrates closely with InfiniBand GPU Direct Async (IBGDA) for efficient RDMA operations.
|IBGDA Function|Purpose|Usage Context|
|---|---|---|
|nvshmemi_ibgda_put_nbi_warp|Non-blocking warp-level PUT|Data dispatch, result forwarding|
|nvshmemi_ibgda_amo_nonfetch_add|Atomic memory operation|Tail pointer updates|
|nvshmemi_ibgda_quiet|Wait for operation completion|Synchronization barriers|
|translate_dst_rdma_rank|Low-latency rank translation|Address space mapping|
Sources: internode.cu149-153 internode.cu672-674 internode.cu684-686
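The rank-translation row above can be sketched as follows, under the assumption that in low-latency mode every GPU is its own NVSHMEM PE, so an RDMA rank maps to the PE sharing the caller's local NVLink index. The formula and mode handling are illustrative and may differ from internode.cu.

```cpp
#include <cassert>

// Illustrative translate_dst_rdma_rank-style mapping. In normal mode one PE
// corresponds to one node, so the RDMA rank is already the destination PE.
int translate_dst_rdma_rank(int dst_rdma_rank, int nvl_rank,
                            int num_nvl_peers, bool low_latency_mode) {
    return low_latency_mode ? dst_rdma_rank * num_nvl_peers + nvl_rank
                            : dst_rdma_rank;
}
```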
Performance Optimizations
The implementation includes several performance optimizations designed to maximize throughput and minimize latency in distributed communication scenarios.
Memory Access Patterns

Sources: internode.cu551-554 internode.cu1262-1294 internode.cu1588-1592
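One representative access-pattern optimization is coalesced, vectorized copying: each lane of a warp moves 16-byte `int4`-style chunks at a warp-size stride, so adjacent lanes touch adjacent chunks in the same memory transaction. The serial host loop below only models the lane ordering; on the GPU all 32 lanes execute concurrently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct int4_t { int32_t x, y, z, w; };  // stand-in for CUDA's 16-byte int4

// Lane-strided copy: lane L handles chunks L, L+32, L+64, ... so that the
// warp's accesses in each iteration are contiguous (coalesced) in memory.
void warp_copy(const int4_t* src, int4_t* dst, int num_chunks) {
    const int kWarpSize = 32;
    for (int lane = 0; lane < kWarpSize; ++lane)
        for (int i = lane; i < num_chunks; i += kWarpSize)
            dst[i] = src[i];
}
```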