This document covers the core communication implementation for DeepEP's internode operations, focusing on the RDMA- and NVLink-based data dispatch and combine mechanisms. The implementation orchestrates token communication across multiple RDMA ranks and NVLink peers in distributed expert-parallel workloads.

For intranode-only communication within a single node, see Intranode Communication. For low-latency optimized communication modes, see Low-Latency Mode. For higher-level communication patterns and the dispatch-combine model, see Communication Model.

Communication Architecture Overview

The internode communication system implements a two-phase dispatch-combine pattern that coordinates data movement across both RDMA ranks (inter-node) and NVLink peers (intra-node). The implementation uses a hierarchical approach where data flows through multiple communication layers.

Sources: internode.cu:84-303, internode.cu:357-992, internode.cu:1368-1814

Core Data Structures and Metadata

The communication system relies on several key data structures for tracking token sources, managing buffer layouts, and coordinating multi-rank operations.

SourceMeta Structure

The SourceMeta structure encodes critical routing information for each token, including source RDMA rank and NVLink peer membership.

Sources: internode.cu:17-36, internode.cu:530-541, internode.cu:932-934
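The shape of such a routing record can be sketched on the host as follows. This is an illustrative model, assuming the metadata packs the source RDMA rank together with a bitmask of NVLink peers that should receive the token; the field names and the peer count are assumptions, not the exact definitions in internode.cu.

```cpp
#include <cassert>

// Hypothetical per-token routing record: which RDMA rank produced the token,
// and a bitmask of NVLink peers that need a copy of it.
constexpr int kNumMaxNvlPeers = 8;  // assumed NVLink group size

struct SourceMeta {
    int src_rdma_rank;              // source RDMA rank of the token
    int is_token_in_nvl_rank_bits;  // bit i set => NVLink peer i receives the token

    SourceMeta() = default;
    SourceMeta(int rdma_rank, const bool* is_in_nvl_rank)
        : src_rdma_rank(rdma_rank), is_token_in_nvl_rank_bits(0) {
        for (int i = 0; i < kNumMaxNvlPeers; ++i)
            is_token_in_nvl_rank_bits |= (is_in_nvl_rank[i] ? 1 : 0) << i;
    }

    // Query NVLink peer membership for one peer index.
    bool is_token_in_nvl_rank(int nvl_rank) const {
        return (is_token_in_nvl_rank_bits >> nvl_rank) & 1;
    }
};
```

Packing the peer set into a bitmask keeps the per-token metadata small enough to travel inline with the payload over RDMA.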

Buffer Layout Management

The system uses symmetric and asymmetric buffer layouts to optimize data placement and access patterns across different communication scenarios.

|Buffer Type|Structure|Purpose|Key Methods|
|---|---|---|---|
|SymBuffer&lt;T&gt;|Symmetric layout|RDMA rank communication|send_buffer(), recv_buffer()|
|AsymBuffer&lt;T&gt;|Asymmetric layout|NVLink peer communication|buffer(), advance()|
|Channel-based|Multi-channel partitioning|Parallel processing|get_channel_task_range()|

Sources: internode.cu:415-418, internode.cu:430-434, internode.cu:1434-1436
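A host-side model of the two layouts is sketched below, assuming a symmetric buffer splits one region into per-rank send and receive halves, while an asymmetric buffer is a cursor that advances through a shared arena. The method names mirror the table above, but the offset arithmetic is illustrative rather than the exact kernel layout.

```cpp
#include <cassert>
#include <cstdint>

// Symmetric layout model: first half of the region holds per-rank send slots,
// the second half per-rank recv slots, both indexed by rank.
template <typename T>
struct SymBuffer {
    T* base;
    int num_elems_per_rank;
    int num_ranks;

    T* send_buffer(int rank) { return base + rank * num_elems_per_rank; }
    T* recv_buffer(int rank) { return base + (num_ranks + rank) * num_elems_per_rank; }
};

// Asymmetric layout model: a byte cursor over a shared arena; advance() claims
// a typed sub-buffer and moves the cursor past it.
template <typename T>
struct AsymBuffer {
    uint8_t* ptr;

    T* buffer() { return reinterpret_cast<T*>(ptr); }
    T* advance(int num_elems) {
        T* out = buffer();
        ptr += num_elems * sizeof(T);
        return out;
    }
};
```

The symmetric form suits RDMA, where every rank must know its peers' offsets statically; the asymmetric form suits NVLink peers carving variable-sized regions out of one mapped buffer.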

Dispatch Phase Implementation

The dispatch phase coordinates the forwarding of tokens from source ranks to destination ranks through a multi-stage process involving notification, data movement, and synchronization.

Notification and Coordination

The notify_dispatch kernel establishes communication channels and exchanges per-destination token counts across all participating ranks before any payload data moves.

Sources: internode.cu:84-303, internode.cu:305-348

Token Dispatch Kernel

The main dispatch kernel implements a multi-warp coordination scheme in which groups of warps take on specialized roles for different stages of the communication pipeline.

Warp Role Assignments

Sources: internode.cu:370-405, internode.cu:462-614, internode.cu:690-834
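The assignment pattern can be sketched as a mapping from warp index to role. The role names below follow the pipeline stages described in this section (RDMA sending, RDMA-to-NVL forwarding, NVL receiving, plus a coordinator), but the exact partition of warps per block is an assumption for illustration, not the kernel's real layout.

```cpp
#include <cassert>

// Illustrative warp-role partition: the first warps send over RDMA, one warp
// coordinates them, the next group forwards RDMA arrivals to NVLink peers,
// and the remaining warps receive over NVLink.
enum class WarpRole { kRDMASender, kRDMASenderCoordinator, kRDMAAndNVLForwarder, kNVLReceiver };

WarpRole warp_role(int warp_id, int num_send_warps, int num_forward_warps) {
    if (warp_id < num_send_warps) return WarpRole::kRDMASender;
    if (warp_id == num_send_warps) return WarpRole::kRDMASenderCoordinator;
    if (warp_id <= num_send_warps + num_forward_warps) return WarpRole::kRDMAAndNVLForwarder;
    return WarpRole::kNVLReceiver;
}
```

Deriving the role purely from the warp index keeps the assignment branch-free per warp and lets each role run its own loop without cross-role divergence inside a warp.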

TMA Integration and Memory Management

The dispatch kernel uses Tensor Memory Accelerator (TMA) operations for efficient data copying, particularly in the NVL forwarding and receiving paths.

Sources: internode.cu:445-455, internode.cu:802-818, internode.cu:940-984
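The staging pattern behind such bulk copies can be modeled on the host as data moving through a fixed-size staging area (playing the role of shared memory) in large chunks rather than element by element. The chunk size and staging shape here are assumptions for illustration; on the device, the copies would be issued as hardware bulk-copy (TMA) operations rather than memcpy calls.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side stand-in for a TMA-style staged copy: global -> staging -> global,
// one bulk chunk at a time.
void staged_copy(const uint8_t* src, uint8_t* dst, size_t num_bytes, size_t chunk_bytes) {
    uint8_t stage[1024];  // stand-in for a shared-memory staging buffer
    for (size_t off = 0; off < num_bytes; off += chunk_bytes) {
        size_t n = num_bytes - off < chunk_bytes ? num_bytes - off : chunk_bytes;
        std::memcpy(stage, src + off, n);   // models the bulk load into shared memory
        std::memcpy(dst + off, stage, n);   // models the bulk store back to global memory
    }
}
```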

Combine Phase Implementation

The combine phase aggregates results from expert processing back to the originating ranks, using a coordinated multi-stage approach that mirrors the dispatch phase structure.

Cached Notification System

The cached_notify kernel prepares metadata for the combine phase, handling both buffer cleanup and head pointer management for cached execution scenarios.

Sources: internode.cu:1043-1185, internode.cu:1187-1222

Combine Kernel Architecture

The combine kernel implements the final aggregation phase with specialized warp roles for handling different aspects of the reduction process.

Multi-Stage Token Combination

Sources: internode.cu:1224-1357, internode.cu:1668-1677, internode.cu:1747-1760

Buffer Management and Synchronization

The communication implementation relies on careful buffer management and synchronization to coordinate multi-rank operations safely and efficiently.

Queue Management and Flow Control

Sources: internode.cu:588-613, internode.cu:735-758, internode.cu:1629-1642
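The flow-control invariant for the channel queues can be sketched as a classic head/tail ring buffer: the producer may only write while the occupied span fits in the queue's capacity, and the consumer retires slots by advancing the head. In the kernels these counters live in device or remote memory and are updated with atomics and polling; the sketch below captures only the invariant, and the struct name is hypothetical.

```cpp
#include <cassert>

// Minimal head/tail flow-control model for one channel queue.
struct ChannelQueue {
    long head = 0;   // advanced by the consumer after reading slots
    long tail = 0;   // advanced by the producer after writing slots
    int capacity = 0;

    // Producer may push n slots only if they fit between head and capacity.
    bool can_push(int n) const { return tail + n - head <= capacity; }
    void push(int n) { tail += n; }
    void pop(int n) { head += n; }
    long used() const { return tail - head; }
};
```

Using monotonically increasing head/tail counters (with the slot index taken modulo capacity when addressing the buffer) avoids the empty-vs-full ambiguity of wrapped indices.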

IBGDA Integration

The implementation integrates closely with InfiniBand GPU Direct Async (IBGDA) for efficient RDMA operations.

|IBGDA Function|Purpose|Usage Context|
|---|---|---|
|nvshmemi_ibgda_put_nbi_warp|Non-blocking warp-level PUT|Data dispatch, result forwarding|
|nvshmemi_ibgda_amo_nonfetch_add|Atomic memory operation|Tail pointer updates|
|nvshmemi_ibgda_quiet|Wait for operation completion|Synchronization barriers|
|translate_dst_rdma_rank|Low-latency rank translation|Address space mapping|

Sources: internode.cu:149-153, internode.cu:672-674, internode.cu:684-686
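The send sequence built from these primitives can be modeled on the host as: write the payload with a non-blocking PUT, then publish it by atomically bumping the receiver's tail pointer. The stub functions below stand in for the nvshmemi_ibgda_* calls (their real signatures differ); only the ordering, data first and tail second, is the point, since the receiver detects arrival by polling the tail.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side model of a remote rank's receive region: a data buffer plus a
// tail counter that the receiver polls.
struct RemoteRank {
    uint8_t data[256];
    long tail = 0;
};

// Models nvshmemi_ibgda_put_nbi_warp: non-blocking remote write of the payload.
void put_nbi(RemoteRank& dst, size_t offset, const uint8_t* src, size_t n) {
    std::memcpy(dst.data + offset, src, n);
}

// Models nvshmemi_ibgda_amo_nonfetch_add: remote atomic add, no fetched result.
void amo_nonfetch_add(long& target, long value) {
    target += value;
}

// Data first, then the tail bump that makes it visible to the receiver.
void send_tokens(RemoteRank& dst, const uint8_t* payload, size_t token_bytes, int num_tokens) {
    put_nbi(dst, dst.tail * token_bytes, payload, token_bytes * num_tokens);
    amo_nonfetch_add(dst.tail, num_tokens);
}
```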

Performance Optimizations

The implementation includes several performance optimizations designed to maximize throughput and minimize latency in distributed communication scenarios.

Memory Access Patterns

Sources: internode.cu:551-554, internode.cu:1262-1294, internode.cu:1588-1592