DeepEP is a high-performance communication library designed for expert-parallel workloads in Mixture-of-Experts (MoE) models. The system provides a unified interface for efficient token dispatch and combine operations across different hardware topologies, from single-node NVLink configurations to multi-node RDMA clusters.

This page provides a high-level architectural overview of the DeepEP system. For detailed implementation information, see the subsections on System Overview, Communication Model, Buffer System, and Configuration System.

System Layers

DeepEP is organized into four distinct architectural layers that provide efficient expert-parallel communication across different hardware configurations:

DeepEP System Architecture

The system supports three communication modes:

  • Intranode: High-throughput NVLink-based communication within a single node

  • Internode: RDMA + NVLink communication across multiple nodes

  • Low-latency: IBGDA-optimized communication for inference workloads

Sources: buffer.py:13-28, buffer.py:6-9, utils.py:33-51
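The choice among the three modes reduces to a simple decision rule. The helper below is purely illustrative (`select_mode` is not part of DeepEP's API; the real library picks kernels inside deep_ep.Buffer based on how the buffer was constructed):

```python
def select_mode(num_nodes: int, low_latency: bool) -> str:
    """Illustrative decision rule for DeepEP's three communication modes.

    Hypothetical helper, not part of DeepEP: shown only to make the
    mode list above concrete.
    """
    if low_latency:
        return "low_latency"   # IBGDA-optimized path for inference decoding
    if num_nodes == 1:
        return "intranode"     # NVLink-only path within a single node
    return "internode"         # RDMA + NVLink path across nodes
```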

Core Components

DeepEP’s architecture centers on several key components that work together to provide expert-parallel communication:

Buffer Class

The deep_ep.Buffer class serves as the primary interface for all communication operations:

|Component|Purpose|Key Methods|
|---|---|---|
|Buffer.__init__|Initialize communication infrastructure|Setup IPC handles, NVSHMEM coordination|
|Buffer.dispatch|Route tokens to experts|intranode_dispatch, internode_dispatch, low_latency_dispatch|
|Buffer.combine|Aggregate expert outputs|intranode_combine, internode_combine, low_latency_combine|
|Buffer.get_dispatch_layout|Calculate token routing|Layout computation for communication patterns|

Buffer System Integration

Sources: buffer.py:32-67, buffer.py:177-194, buffer.py:261-288
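The layout computation behind Buffer.get_dispatch_layout amounts to counting, per rank and per expert, how many tokens each will receive. The single-process sketch below illustrates that counting with plain Python lists; the real method operates on CUDA tensors and additionally returns an is_token_in_rank mask and an optional event handle, so treat names and return shapes here as assumptions:

```python
def dispatch_layout(topk_idx, num_experts, num_ranks):
    """Simplified stand-in for Buffer.get_dispatch_layout.

    topk_idx: one list of chosen expert ids per token.
    Returns (num_tokens_per_rank, num_tokens_per_expert).
    """
    experts_per_rank = num_experts // num_ranks
    num_tokens_per_expert = [0] * num_experts
    num_tokens_per_rank = [0] * num_ranks
    for token_experts in topk_idx:
        for e in token_experts:
            num_tokens_per_expert[e] += 1
        # a token is sent to a rank once, even if it hits several experts there
        for r in {e // experts_per_rank for e in token_experts}:
            num_tokens_per_rank[r] += 1
    return num_tokens_per_rank, num_tokens_per_expert
```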

Communication Pattern

DeepEP implements a dispatch-combine communication pattern that efficiently handles token routing for MoE workloads:

Dispatch-Combine Flow

The system automatically selects the appropriate communication mode based on hardware topology:

  • Single-node setups use intranode_dispatch/intranode_combine with NVLink

  • Multi-node setups use internode_dispatch/internode_combine with RDMA+NVLink

  • Low-latency workloads use low_latency_dispatch/low_latency_combine with IBGDA

Sources: buffer.py:290-417, buffer.py:515-630
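Conceptually, dispatch permutes tokens by destination expert and combine reverses that permutation while reducing the per-expert outputs. The sketch below illustrates that round trip in a single process, using plain Python lists in place of multi-GPU buffers and a toy doubling function in place of real expert FFNs; it mirrors the data movement only, not DeepEP's actual kernels or signatures:

```python
def dispatch(topk_idx):
    """Group token indices by destination expert (mock dispatch)."""
    per_expert = {}
    for t, experts in enumerate(topk_idx):
        for e in experts:
            per_expert.setdefault(e, []).append(t)
    return per_expert

def combine(per_expert_out, num_tokens):
    """Sum each token's expert outputs back into its original slot (mock combine)."""
    out = [0.0] * num_tokens
    for outputs in per_expert_out.values():
        for t, y in outputs:
            out[t] += y
    return out

tokens = [1.0, 2.0, 3.0]
topk_idx = [[0], [0, 1], [1]]             # token -> chosen experts
routed = dispatch(topk_idx)
# each mock "expert" just doubles its inputs
expert_out = {e: [(t, tokens[t] * 2.0) for t in ts] for e, ts in routed.items()}
combined = combine(expert_out, len(tokens))
```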

Hardware Abstraction

DeepEP provides a unified interface across different hardware communication mechanisms through a layered abstraction:

Hardware Communication Stack

Configuration System

The system uses performance-tuned configurations based on the number of ranks:

|Ranks|Dispatch Config|Combine Config|Use Case|
|---|---|---|---|
|2-8|Config(20, 6-24, 256, 6, 128)|Config(20, 4-10, 256, 6, 128)|Small scale|
|16-32|Config(20, 32-36, 288, 20-32, 128)|Config(20, 1-4, 288, 8-12, 128)|Medium scale|
|64-160|Config(20, 20-32, 560-720, 12-32, 128)|Config(20, 1-2, 560-720, 8-20, 128)|Large scale|

The deep_ep_cpp.Config class bundles the performance-tuning parameters, in the positional order shown in the table above:

  • num_sms: Number of streaming multiprocessors used by the communication kernels

  • num_max_nvl_chunked_send_tokens: Maximum tokens per NVLink send chunk

  • num_max_nvl_chunked_recv_tokens: Maximum tokens in the NVLink receive buffer

  • num_max_rdma_chunked_send_tokens: Maximum tokens per RDMA send chunk

  • num_max_rdma_chunked_recv_tokens: Maximum tokens in the RDMA receive buffer

Sources: buffer.py:206-231, buffer.py:234-259, buffer.py:81-100
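The rank-based selection amounts to a table lookup. The sketch below mirrors it with representative tuples taken from the table above; the real mapping is hard-coded per rank count in buffer.py, so both the tier boundaries and the exact values here are illustrative assumptions:

```python
# Illustrative dispatch-config lookup. Each tuple holds representative
# parameter values from the table above, not the exact per-rank tuples
# hard-coded in buffer.py.
_DISPATCH_CONFIGS = {
    8:   (20, 6, 256, 6, 128),    # small scale
    32:  (20, 32, 288, 20, 128),  # medium scale
    128: (20, 20, 560, 12, 128),  # large scale
}

def get_dispatch_config(num_ranks: int):
    """Pick the smallest tier whose rank capacity covers num_ranks."""
    for cap in sorted(_DISPATCH_CONFIGS):
        if num_ranks <= cap:
            return _DISPATCH_CONFIGS[cap]
    raise ValueError(f"unsupported rank count: {num_ranks}")
```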

Event Management and Synchronization

DeepEP manages asynchronous operations and stream synchronization through event handles: dispatch and combine can return a CUDA-event wrapper instead of blocking, letting callers overlap communication with computation and synchronize streams on demand:

Event System Architecture

Sources: buffer.py:126-133, deep_ep.cpp:266-281, utils.py:1-50
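The handle pattern can be sketched on the CPU with a background thread and threading.Event, purely as an analogy for the CUDA-event handles DeepEP returns; the class and function names below are hypothetical, and the real handles make a CUDA stream wait rather than blocking the host:

```python
import threading

class EventHandle:
    """CPU analogy of a CUDA-event handle returned by an async operation."""
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def wait(self):
        # analogous to making the current stream wait on the recorded event
        self._done.wait()
        return self.result

def async_dispatch(tokens):
    """Launch a mock 'communication' in the background and return a handle."""
    handle = EventHandle()
    def run():
        handle.result = [t * 2 for t in tokens]  # stand-in for the transfer
        handle._done.set()
    threading.Thread(target=run).start()
    return handle

handle = async_dispatch([1, 2, 3])
# ... overlap other computation here ...
received = handle.wait()
```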

This architecture enables DeepEP to efficiently handle the complex communication patterns required for MoE workloads while maintaining high performance across different hardware configurations and scales.