LIST "DeepEP Github Repository"
FROM ""
WHERE file.folder = this.file.folder OR startswith(file.folder, this.file.folder + "/")
SORT file.path

## Purpose and Scope

DeepEP is a high-performance communication library designed for Mixture-of-Experts (MoE) models and expert-parallelism (EP) workloads. It provides optimized GPU kernels for the two fundamental MoE operations: dispatch (routing tokens to experts) and combine (aggregating expert outputs). The library supports both high-throughput operation for training and inference prefilling and low-latency operation for inference decoding.

This document provides a high-level overview of the DeepEP system architecture, core concepts, and main components. For detailed installation instructions, see the Installation page; for comprehensive API documentation, see the Python API page; and for implementation details of specific communication modes, see the Communication Implementation page.

## Core MoE Communication Pattern

DeepEP implements the dispatch-combine communication pattern that is fundamental to MoE architectures. The pattern involves two phases:

The dispatch phase takes input tokens and routing information (`topk_idx`, `topk_weights`) and redistributes the tokens to the appropriate expert ranks. The combine phase performs the inverse operation, aggregating expert outputs back into the original token layout using the routing metadata.
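As a sketch of the full pattern, adapted from the README's usage example; `buffer` is assumed to be an already-allocated `deep_ep.Buffer`, and `expert_fn` is a placeholder for the local expert computation:

```python
import torch
from deep_ep import Buffer

def moe_forward(buffer: Buffer, x: torch.Tensor, topk_idx: torch.Tensor,
                topk_weights: torch.Tensor, num_experts: int, expert_fn):
    # Compute per-rank/per-expert token counts from the routing indices.
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, \
        is_token_in_rank, _ = buffer.get_dispatch_layout(topk_idx, num_experts)

    # Dispatch: route each token to the ranks hosting its selected experts.
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, \
        handle, _ = buffer.dispatch(
            x, topk_idx=topk_idx, topk_weights=topk_weights,
            num_tokens_per_rank=num_tokens_per_rank,
            num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
            is_token_in_rank=is_token_in_rank,
            num_tokens_per_expert=num_tokens_per_expert)

    # Run the local experts on the received tokens.
    expert_out = expert_fn(recv_x, recv_topk_idx, recv_topk_weights,
                           num_recv_tokens_per_expert_list)

    # Combine: the handle carries the routing metadata needed to return
    # expert outputs to the original token layout.
    combined_x, _, _ = buffer.combine(expert_out, handle)
    return combined_x
```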

Sources: README.md:1-10, README.md:158-184

## System Architecture

DeepEP is structured in multiple layers, from the high-level Python API down to hardware-specific CUDA kernels: the `deep_ep.Buffer` interface sits on top, the `deep_ep_cpp` C++ bindings and configuration layer sit beneath it, and the `intranode.cu`, `internode.cu`, and `internode_ll.cu` CUDA kernels at the bottom drive NVLink, RDMA, and the NVSHMEM runtime.

These abstraction layers let users access optimized communication primitives through a simple Python interface while still leveraging low-level hardware capabilities.

Sources: README.md:122-151, README.md:235-288

## Communication Modes

DeepEP supports three distinct communication modes, each optimized for different scenarios and hardware topologies:

### Intranode Communication

Uses NVLink for high-bandwidth communication within a single node (typically 8 GPUs). Implemented in the `intranode.cu` kernels, with achieved bandwidth approaching the ~160 GB/s NVLink limit. Suitable for single-node MoE deployments.

### Internode Communication

Combines NVLink and RDMA for multi-node scaling. Uses asymmetric bandwidth forwarding: tokens are first aggregated via NVLink within the source node, then sent across nodes via RDMA (~50 GB/s), and finally distributed via NVLink on the destination node. Implemented in the `internode.cu` kernels.

### Low-Latency Mode

Optimized for inference decoding with pure RDMA communication to minimize latency. Uses specialized kernels in `internode_ll.cu` and supports communication-computation overlap through a hook-based mechanism that does not occupy Streaming Multiprocessors (SMs) while data is in flight.
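A minimal sketch of the hook mechanism, adapted from the README's low-latency example; it assumes `buffer` was created with `low_latency_mode=True` (sized via `Buffer.get_low_latency_rdma_size_hint`), and `other_work` is a placeholder for overlapped computation:

```python
from deep_ep import Buffer

def decode_dispatch(buffer: Buffer, hidden_states, topk_idx,
                    num_max_dispatch_tokens_per_rank, num_experts, other_work):
    # Send-only dispatch: with return_recv_hook=True the kernel issues the RDMA
    # sends and returns immediately; no SMs are occupied while data is in flight.
    recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
        async_finish=False, return_recv_hook=True)

    other_work()  # overlap attention or shared-expert compute here

    hook()        # complete the receive; recv_x is valid only after this call
    return recv_x, recv_count, handle
```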

Sources: README.md:13-23, README.md:26-38, README.md:232-288

## Key Components

### Buffer System

The `deep_ep.Buffer` class serves as the main interface for MoE communication operations. It manages memory allocation for both NVLink (`num_nvl_bytes`) and RDMA (`num_rdma_bytes`) buffers and provides the `dispatch`, `combine`, and `get_dispatch_layout` methods.
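For illustration, a minimal allocation; the byte sizes are placeholder values, and the `Config` size hints described next are the recommended way to choose them:

```python
import torch.distributed as dist
from deep_ep import Buffer

# `group` is the expert-parallel process group (8 ranks as an example).
group = dist.new_group(ranks=list(range(8)))
buffer = Buffer(group,
                num_nvl_bytes=256 << 20,  # NVLink staging memory (placeholder size)
                num_rdma_bytes=0)         # 0 is fine for a single-node, NVLink-only setup
```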

### Configuration System

The `deep_ep_cpp.Config` class manages communication parameters and provides auto-tuned configurations for different cluster sizes. It includes methods such as `get_nvl_buffer_size_hint()` and `get_rdma_buffer_size_hint()` for sizing buffer allocations.
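Putting the two components together, a sizing pattern adapted from the README's buffer-allocation example (the helper name `sized_buffer` is ours):

```python
import torch.distributed as dist
from deep_ep import Buffer

def sized_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    # Take the max of the dispatch and combine hints so one buffer serves both phases.
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()),
                   Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()),
                            num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()),
                             num_rdma_bytes)
    return Buffer(group, num_nvl_bytes, num_rdma_bytes)
```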

### Event Management

The `EventOverlap` class provides CUDA-event-based synchronization for communication-computation overlap, enabling asynchronous operations and pipeline optimization.
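A sketch of the overlap pattern, assuming the layout tensors from `get_dispatch_layout` are already computed and `shared_expert` is a placeholder for independent work:

```python
from deep_ep import Buffer

def overlapped_dispatch(buffer: Buffer, x, topk_idx, topk_weights, layout, shared_expert):
    num_tokens_per_rank, num_tokens_per_rdma_rank, \
        num_tokens_per_expert, is_token_in_rank = layout

    # async_finish=True launches communication without blocking the current stream
    # and returns an EventOverlap recorded after the communication kernels.
    recv_x, recv_idx, recv_weights, num_recv, handle, event = buffer.dispatch(
        x, topk_idx=topk_idx, topk_weights=topk_weights,
        num_tokens_per_rank=num_tokens_per_rank,
        num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
        is_token_in_rank=is_token_in_rank,
        num_tokens_per_expert=num_tokens_per_expert,
        async_finish=True)

    shared_out = shared_expert(x)   # independent work overlapped with communication

    event.current_stream_wait()     # current stream waits on the comm event
    return recv_x, handle, shared_out
```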

Sources: README.md:138-150, README.md:158-224, README.md:247-259

## Performance Characteristics

DeepEP delivers high performance across different hardware configurations:

| Communication Type | Expert Count | Bandwidth | Use Case |
|---|---|---|---|
| Intranode (NVLink) | 8 | ~155 GB/s | Single-node training |
| Internode (RDMA) | 16-64 | 43-58 GB/s | Multi-node training |
| Low-Latency (Pure RDMA) | 8-256 | 39-127 GB/s | Inference decoding |

The library supports precision optimizations including FP8 for dispatch operations and BF16 for combine operations, balancing memory bandwidth with numerical precision requirements.
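As an illustration of the FP8 dispatch path, a per-128-channel cast in the style of DeepEP's test utilities; the exact scale layout is an assumption here, so treat the repository's tests as authoritative:

```python
import torch

def per_token_cast_to_fp8(x: torch.Tensor):
    # Quantize each 128-channel group to FP8 E4M3 with its own float32 scale.
    assert x.dim() == 2 and x.size(1) % 128 == 0
    m, n = x.shape
    x_view = x.view(m, -1, 128)
    amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(1e-4)
    x_fp8 = (x_view * (448.0 / amax.unsqueeze(2))).to(torch.float8_e4m3fn)
    return x_fp8.view(m, n), amax / 448.0
```

The resulting `(tensor, scales)` pair can then be passed as the `x` argument to `dispatch` in place of a BF16 tensor.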

Sources: README.md:11-38

## Hardware Requirements

DeepEP requires specific hardware capabilities (a quick runtime check is sketched after this list):

- GPU Architecture: Ampere (SM80) or Hopper (SM90) GPUs, or other architectures supporting the SM90 PTX ISA
- CUDA Version: CUDA 11.0+ for SM80, CUDA 12.3+ for SM90
- Intranode: NVLink connectivity for high-bandwidth GPU-to-GPU communication
- Internode: RDMA-capable network (InfiniBand recommended) for multi-node scaling
- Dependencies: NVSHMEM runtime for multi-GPU memory programming, PyTorch 2.1+
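A minimal check of the GPU-side requirements (a convenience sketch, not part of DeepEP):

```python
import torch

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), "DeepEP requires SM80 (Ampere) or newer"
print(f"GPU: {torch.cuda.get_device_name()} (SM{major}{minor})")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
```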

The system is optimized for DeepSeek-V3 model configurations but supports general MoE architectures with configurable expert counts and token routing strategies.

Sources: README.md:43-56, README.md:89-115