DeepEP is a high-performance communication library designed for expert-parallel workloads in Mixture-of-Experts (MoE) models. The system provides a unified interface for efficient token dispatch and combine operations across different hardware topologies, from single-node NVLink configurations to multi-node RDMA clusters.

This page provides a high-level architectural overview of the DeepEP system. For detailed implementation information, see the subsections on System Overview, Communication Model, Buffer System, and Configuration System.

System Layers

DeepEP is organized into four distinct architectural layers that provide efficient expert-parallel communication across different hardware configurations:

DeepEP System Architecture

The system supports three communication modes:

  • Intranode: High-throughput NVLink-based communication within a single node

  • Internode: RDMA + NVLink communication across multiple nodes

  • Low-latency: IBGDA-optimized communication for inference workloads

Sources: buffer.py:13-28, buffer.py:6-9, utils.py:33-51
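The choice among the three modes reduces to a simple decision rule. The helper below is purely illustrative (`select_mode` is not part of DeepEP's API; the real library picks kernels inside deep_ep.Buffer based on how the buffer was constructed):

```python
def select_mode(num_nodes: int, low_latency: bool) -> str:
    """Illustrative decision rule for DeepEP's three communication modes.

    Hypothetical helper, not part of DeepEP: shown only to make the
    mode list above concrete.
    """
    if low_latency:
        return "low_latency"   # IBGDA-optimized path for inference decoding
    if num_nodes == 1:
        return "intranode"     # NVLink-only path within a single node
    return "internode"         # RDMA + NVLink path across nodes
```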

Core Components

DeepEP’s architecture centers on several key components that work together to provide expert-parallel communication:

Buffer Class

The deep_ep.Buffer class serves as the primary interface for all communication operations:

|Component|Purpose|Key Methods|
|---|---|---|
|Buffer.__init__|Initialize communication infrastructure|Setup IPC handles, NVSHMEM coordination|
|Buffer.dispatch|Route tokens to experts|intranode_dispatch, internode_dispatch, low_latency_dispatch|
|Buffer.combine|Aggregate expert outputs|intranode_combine, internode_combine, low_latency_combine|
|Buffer.get_dispatch_layout|Calculate token routing|Layout computation for communication patterns|

Buffer System Integration

Sources: buffer.py:32-67, buffer.py:177-194, buffer.py:261-288
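The layout computation behind Buffer.get_dispatch_layout amounts to counting, per rank and per expert, how many tokens each will receive. The single-process sketch below illustrates that counting with plain Python lists; the real method operates on CUDA tensors and additionally returns an is_token_in_rank mask and an optional event handle, so treat names and return shapes here as assumptions:

```python
def dispatch_layout(topk_idx, num_experts, num_ranks):
    """Simplified stand-in for Buffer.get_dispatch_layout.

    topk_idx: one list of chosen expert ids per token.
    Returns (num_tokens_per_rank, num_tokens_per_expert).
    """
    experts_per_rank = num_experts // num_ranks
    num_tokens_per_expert = [0] * num_experts
    num_tokens_per_rank = [0] * num_ranks
    for token_experts in topk_idx:
        for e in token_experts:
            num_tokens_per_expert[e] += 1
        # a token is sent to a rank once, even if it hits several experts there
        for r in {e // experts_per_rank for e in token_experts}:
            num_tokens_per_rank[r] += 1
    return num_tokens_per_rank, num_tokens_per_expert
```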

Communication Pattern

DeepEP implements a dispatch-combine communication pattern that efficiently handles token routing for MoE workloads:

Dispatch-Combine Flow

The system automatically selects the appropriate communication mode based on hardware topology:

  • Single-node setups use intranode_dispatch/intranode_combine with NVLink

  • Multi-node setups use internode_dispatch/internode_combine with RDMA+NVLink

  • Low-latency workloads use low_latency_dispatch/low_latency_combine with IBGDA

Sources: buffer.py:290-417, buffer.py:515-630
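Conceptually, dispatch permutes tokens by destination expert and combine reverses that permutation while reducing the per-expert outputs. The sketch below illustrates that round trip in a single process, using plain Python lists in place of multi-GPU buffers and a toy doubling function in place of real expert FFNs; it mirrors the data movement only, not DeepEP's actual kernels or signatures:

```python
def dispatch(topk_idx):
    """Group token indices by destination expert (mock dispatch)."""
    per_expert = {}
    for t, experts in enumerate(topk_idx):
        for e in experts:
            per_expert.setdefault(e, []).append(t)
    return per_expert

def combine(per_expert_out, num_tokens):
    """Sum each token's expert outputs back into its original slot (mock combine)."""
    out = [0.0] * num_tokens
    for outputs in per_expert_out.values():
        for t, y in outputs:
            out[t] += y
    return out

tokens = [1.0, 2.0, 3.0]
topk_idx = [[0], [0, 1], [1]]             # token -> chosen experts
routed = dispatch(topk_idx)
# each mock "expert" just doubles its inputs
expert_out = {e: [(t, tokens[t] * 2.0) for t in ts] for e, ts in routed.items()}
combined = combine(expert_out, len(tokens))
```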

Hardware Abstraction

DeepEP provides a unified interface across different hardware communication mechanisms through a layered abstraction:

Hardware Communication Stack

Configuration System

The system uses performance-tuned configurations based on the number of ranks:

|Ranks|Dispatch Config|Combine Config|Use Case|
|---|---|---|---|
|2-8|Config(20, 6-24, 256, 6, 128)|Config(20, 4-10, 256, 6, 128)|Small scale|
|16-32|Config(20, 32-36, 288, 20-32, 128)|Config(20, 1-4, 288, 8-12, 128)|Medium scale|
|64-160|Config(20, 20-32, 560-720, 12-32, 128)|Config(20, 1-2, 560-720, 8-20, 128)|Large scale|

The deep_ep_cpp.Config class bundles the performance-tuning parameters, in the positional order shown in the table above:

  • num_sms: Number of streaming multiprocessors used by the communication kernels

  • num_max_nvl_chunked_send_tokens: Maximum tokens per NVLink send chunk

  • num_max_nvl_chunked_recv_tokens: Maximum tokens in the NVLink receive buffer

  • num_max_rdma_chunked_send_tokens: Maximum tokens per RDMA send chunk

  • num_max_rdma_chunked_recv_tokens: Maximum tokens in the RDMA receive buffer

Sources: buffer.py:206-231, buffer.py:234-259, buffer.py:81-100
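The rank-based selection amounts to a table lookup. The sketch below mirrors it with representative tuples taken from the table above; the real mapping is hard-coded per rank count in buffer.py, so both the tier boundaries and the exact values here are illustrative assumptions:

```python
# Illustrative dispatch-config lookup. Each tuple holds representative
# parameter values from the table above, not the exact per-rank tuples
# hard-coded in buffer.py.
_DISPATCH_CONFIGS = {
    8:   (20, 6, 256, 6, 128),    # small scale
    32:  (20, 32, 288, 20, 128),  # medium scale
    128: (20, 20, 560, 12, 128),  # large scale
}

def get_dispatch_config(num_ranks: int):
    """Pick the smallest tier whose rank capacity covers num_ranks."""
    for cap in sorted(_DISPATCH_CONFIGS):
        if num_ranks <= cap:
            return _DISPATCH_CONFIGS[cap]
    raise ValueError(f"unsupported rank count: {num_ranks}")
```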

Event Management and Synchronization

DeepEP manages asynchronous operations and stream synchronization through event handles: dispatch and combine can return a CUDA-event wrapper instead of blocking, letting callers overlap communication with computation and synchronize streams on demand:

Event System Architecture

Sources: buffer.py:126-133, deep_ep.cpp:266-281, utils.py:1-50
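The handle pattern can be sketched on the CPU with a background thread and threading.Event, purely as an analogy for the CUDA-event handles DeepEP returns; the class and function names below are hypothetical, and the real handles make a CUDA stream wait rather than blocking the host:

```python
import threading

class EventHandle:
    """CPU analogy of a CUDA-event handle returned by an async operation."""
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def wait(self):
        # analogous to making the current stream wait on the recorded event
        self._done.wait()
        return self.result

def async_dispatch(tokens):
    """Launch a mock 'communication' in the background and return a handle."""
    handle = EventHandle()
    def run():
        handle.result = [t * 2 for t in tokens]  # stand-in for the transfer
        handle._done.set()
    threading.Thread(target=run).start()
    return handle

handle = async_dispatch([1, 2, 3])
# ... overlap other computation here ...
received = handle.wait()
```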

This architecture enables DeepEP to efficiently handle the complex communication patterns required for MoE workloads while maintaining high performance across different hardware configurations and scales.