DeepEP is a high-performance communication library designed for expert-parallel workloads in Mixture-of-Experts (MoE) models. The system provides a unified interface for efficient token dispatch and combine operations across different hardware topologies, from single-node NVLink configurations to multi-node RDMA clusters.
This page provides a high-level architectural overview of the DeepEP system. For detailed implementation information, see the subsections on System Overview, Communication Model, Buffer System, and Configuration System.
System Layers
DeepEP is organized into four distinct architectural layers that provide efficient expert-parallel communication across different hardware configurations:
DeepEP System Architecture

The system supports three communication modes:
- Intranode: high-throughput NVLink-based communication within a single node
- Internode: RDMA + NVLink communication across multiple nodes
- Low-latency: IBGDA-optimized communication for inference workloads
Sources: buffer.py13-28 buffer.py6-9 utils.py33-51
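The choice among these three modes follows directly from the hardware topology and the workload. A minimal sketch of that decision rule in plain Python (the helper name and exact rule here are illustrative assumptions, not DeepEP's actual code):

```python
def select_mode(num_ranks: int, ranks_per_node: int, low_latency: bool) -> str:
    """Pick a communication mode from the topology (illustrative only)."""
    if low_latency:
        return "low_latency"  # IBGDA path, tuned for inference decoding
    if num_ranks <= ranks_per_node:
        return "intranode"    # all ranks share one NVLink domain
    return "internode"        # RDMA between nodes, NVLink within each node

# Example: 16 ranks across two 8-GPU nodes, throughput (training) mode
select_mode(16, 8, low_latency=False)  # -> "internode"
```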
Core Components
DeepEP’s architecture centers on several key components that work together to provide expert-parallel communication:
Buffer Class
The deep_ep.Buffer class serves as the primary interface for all communication operations:
|Component|Purpose|Key Methods|
|---|---|---|
|Buffer.__init__|Initialize communication infrastructure|Setup IPC handles, NVSHMEM coordination|
|Buffer.dispatch|Route tokens to experts|intranode_dispatch, internode_dispatch, low_latency_dispatch|
|Buffer.combine|Aggregate expert outputs|intranode_combine, internode_combine, low_latency_combine|
|Buffer.get_dispatch_layout|Calculate token routing|Layout computation for communication patterns|
Buffer System Integration

Sources: buffer.py32-67 buffer.py177-194 buffer.py261-288
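The layout step counts, for every rank and every expert, how many tokens must be sent before any data moves. A pure-Python model of that computation (real DeepEP performs this on the GPU; the function below is an illustrative sketch, not the library's implementation):

```python
def get_dispatch_layout(topk_idx, num_experts, num_ranks):
    """Model of the layout step: from each token's top-k expert indices,
    derive per-rank and per-expert token counts (illustrative only)."""
    experts_per_rank = num_experts // num_ranks
    num_tokens_per_expert = [0] * num_experts
    num_tokens_per_rank = [0] * num_ranks
    is_token_in_rank = [[False] * num_ranks for _ in topk_idx]
    for t, experts in enumerate(topk_idx):
        for e in experts:
            num_tokens_per_expert[e] += 1
            r = e // experts_per_rank          # rank hosting expert e
            if not is_token_in_rank[t][r]:     # send each token to a rank once
                is_token_in_rank[t][r] = True
                num_tokens_per_rank[r] += 1
    return num_tokens_per_rank, num_tokens_per_expert, is_token_in_rank

# 3 tokens, top-2 routing over 4 experts hosted on 2 ranks
per_rank, per_expert, in_rank = get_dispatch_layout(
    [[0, 3], [1, 1], [2, 3]], num_experts=4, num_ranks=2)
# per_rank == [2, 2], per_expert == [1, 2, 1, 2]
```

Note that a token routed to two experts on the same rank is transmitted to that rank only once; deduplication at the rank level is what makes the layout computation worthwhile.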
Communication Pattern
DeepEP implements a dispatch-combine communication pattern that efficiently handles token routing for MoE workloads:
Dispatch-Combine Flow

The system automatically selects the appropriate communication mode based on hardware topology:
- Single-node setups use intranode_dispatch/intranode_combine with NVLink
- Multi-node setups use internode_dispatch/internode_combine with RDMA + NVLink
- Low-latency workloads use low_latency_dispatch/low_latency_combine with IBGDA
Sources: buffer.py290-417 buffer.py515-630
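The dispatch-combine round trip can be modeled in a few lines of single-process Python: each token is sent to its top-k experts, processed there, and the expert outputs are reduced as a weighted sum. This is an illustrative toy, not DeepEP's API:

```python
def dispatch_combine(tokens, topk_idx, topk_weights, experts):
    """Toy dispatch-combine round trip (single-process sketch):
    dispatch each token to its top-k experts, then combine the
    expert outputs as a weighted sum."""
    combined = []
    for x, idxs, ws in zip(tokens, topk_idx, topk_weights):
        combined.append(sum(w * experts[e](x) for e, w in zip(idxs, ws)))
    return combined

# Two toy "experts" that scale their input; one token routed to both
out = dispatch_combine([1.0], [[0, 1]], [[0.75, 0.25]],
                       experts=[lambda x: 2 * x, lambda x: 10 * x])
# out[0] == 0.75 * 2.0 + 0.25 * 10.0 == 4.0
```

In the real library the dispatch and combine halves run as separate communication kernels, with the expert computation (and optionally other work, via async events) in between.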
Hardware Abstraction
DeepEP provides a unified interface across different hardware communication mechanisms through a layered abstraction:
Hardware Communication Stack

Configuration System
The system uses performance-tuned configurations based on the number of ranks:
|Ranks|Dispatch Config|Combine Config|Use Case|
|---|---|---|---|
|2-8|Config(20, 6-24, 256, 6, 128)|Config(20, 4-10, 256, 6, 128)|Small scale|
|16-32|Config(20, 32-36, 288, 20-32, 128)|Config(20, 1-4, 288, 8-12, 128)|Medium scale|
|64-160|Config(20, 20-32, 560-720, 12-32, 128)|Config(20, 1-2, 560-720, 8-20, 128)|Large scale|
The deep_ep_cpp.Config class contains performance parameters:
- num_sms: Number of streaming multiprocessors to use
- num_blocks_per_sm: CUDA blocks per SM
- num_threads_per_block: Threads per CUDA block
- chunk_size: Data chunk size for processing
- num_buffers: Number of communication buffers
Sources: buffer.py206-231 buffer.py234-259 buffer.py81-100
Event Management and Synchronization
DeepEP implements event-based management of asynchronous operations and stream synchronization, so communication kernels can overlap with computation on other streams:
Event System Architecture

Sources: buffer.py126-133 deep_ep.cpp266-281 utils.py1-50
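The pattern behind the event system is: a communication op records an event and returns immediately; the consumer waits on that event only when it actually needs the data, leaving the gap free for other work. A minimal thread-based model of this pattern (real DeepEP wraps CUDA events; the class and helper below are illustrative, with the method name `current_stream_wait` borrowed from the library's event handle):

```python
import threading

class EventOverlap:
    """Minimal model of an event handle: the producer records completion,
    the consumer waits before reading. (Sketch only; not CUDA events.)"""
    def __init__(self):
        self._done = threading.Event()

    def record(self):
        self._done.set()

    def current_stream_wait(self, timeout=None):
        self._done.wait(timeout)

def async_dispatch(payload, out):
    """Pretend communication op: fills `out` in the background and
    returns an event the caller can synchronize on."""
    event = EventOverlap()
    def work():
        out.extend(payload)   # the "communication"
        event.record()
    threading.Thread(target=work).start()
    return event

recv = []
ev = async_dispatch([1, 2, 3], recv)
# ...overlap independent compute here...
ev.current_stream_wait()
assert recv == [1, 2, 3]
```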
This architecture enables DeepEP to efficiently handle the complex communication patterns required for MoE workloads while maintaining high performance across different hardware configurations and scales.