This page provides an introduction to DeepEP and guides you through the basic setup and usage concepts. DeepEP is a high-performance communication library designed for expert-parallel workloads and Mixture of Experts (MoE) models, supporting both intranode and internode GPU communication.
For detailed installation instructions, see the Installation page. For information about the build system and compilation options, see the Build System page.
## Overview
DeepEP provides a Python interface to optimized CUDA kernels that implement efficient dispatch-combine communication patterns across GPU clusters. The library is designed to handle the complex communication requirements of expert-parallel training and inference workloads.
## Core Components
The system consists of four main layers:
|Layer|Components|Purpose|
|---|---|---|
|Python Interface|`deep_ep.Buffer`, `deep_ep.utils`|High-level API and utilities|
|C++ Runtime|`deep_ep_cpp.Buffer`, `deep_ep_cpp.Config`, `deep_ep_cpp.EventHandle`|Core implementation and configuration|
|CUDA Kernels|`intranode.cu`, `internode.cu`, `internode_ll.cu`, `layout.cu`|Communication primitives|
|Hardware Layer|NVLink, RDMA/InfiniBand, NVSHMEM|Physical communication infrastructure|
## System Architecture
The following diagram illustrates how the main code entities relate to the system architecture:

Sources: setup.py 36-47, setup.py 112-121
## Communication Workflow
DeepEP implements a dispatch-combine pattern optimized for different hardware topologies:
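The core of the pattern can be sketched in plain Python, without any DeepEP APIs: the *dispatch* step routes each token to its assigned expert's buffer, the experts process their buffers, and the *combine* step gathers the outputs back into the original token order. (In DeepEP these steps are implemented as communication kernels across GPUs; this single-process sketch only illustrates the data movement.)

```python
def dispatch(tokens, expert_ids, num_experts):
    """Group tokens by their assigned expert (the 'dispatch' step),
    remembering each token's original position."""
    buckets = [[] for _ in range(num_experts)]
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((idx, tok))
    return buckets

def combine(buckets, num_tokens):
    """Restore expert outputs to the original token order (the 'combine' step)."""
    out = [None] * num_tokens
    for bucket in buckets:
        for idx, tok in bucket:
            out[idx] = tok
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [1, 0, 1, 0]  # router decisions: which expert handles each token
buckets = dispatch(tokens, expert_ids, num_experts=2)
# Each "expert" processes its bucket; here it just scales its tokens by 10.
processed = [[(i, t * 10) for i, t in b] for b in buckets]
result = combine(processed, len(tokens))
# result == [10.0, 20.0, 30.0, 40.0]
```

In the real library the buckets live on different GPUs (or nodes), so dispatch and combine are all-to-all communication operations rather than local list manipulation.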

Sources: setup.py 36, setup.py 47
## Build Process Overview
The build system automatically detects your environment and configures the appropriate features:

Sources: setup.py 15-29, setup.py 42-52, setup.py 53-66
## Quick Start
### Prerequisites
Before installation, ensure you have:
- CUDA Toolkit (version 12+ recommended)
- PyTorch with CUDA support
- Python development headers
- C++ compiler supporting C++17
### Optional Dependencies
- NVSHMEM: Required for internode and low-latency communication modes
- InfiniBand/RDMA: Required for multi-node deployments
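Before building, it can help to verify the prerequisites programmatically. The following is an illustrative best-effort check (not part of DeepEP; the function name and checks are this sketch's own): it looks for `nvcc` on the `PATH`, a CUDA-enabled PyTorch, and a C++ compiler.

```python
import shutil

def check_prerequisites():
    """Best-effort pre-build environment check (illustrative sketch)."""
    results = {}
    # CUDA Toolkit: look for the nvcc compiler driver on PATH
    results["nvcc"] = shutil.which("nvcc") is not None
    # PyTorch with CUDA support
    try:
        import torch
        results["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        results["torch_cuda"] = False
    # C++17-capable compiler: look for g++ or clang++
    results["cxx"] = any(shutil.which(c) is not None for c in ("g++", "clang++"))
    return results

print(check_prerequisites())
```

A `False` entry points at the prerequisite that needs attention before running the build.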
### Basic Installation
```bash
# Clone the repository
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP

# Install with default configuration
python setup.py install
```
The build system will automatically:
- Detect available NVSHMEM installation
- Configure appropriate CUDA architecture targets
- Enable/disable features based on your hardware
- Compile and link the necessary components
### Environment Variables
Key environment variables that control the build:
|Variable|Purpose|Default|
|---|---|---|
|NVSHMEM_DIR|Path to NVSHMEM installation|Auto-detected|
|TORCH_CUDA_ARCH_LIST|Target GPU architectures|9.0 or 8.0|
|DISABLE_SM90_FEATURES|Disable H100-specific features|0|
|DISABLE_AGGRESSIVE_PTX_INSTRS|Disable advanced PTX instructions|1|
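A build script typically consumes such variables by reading the process environment at configure time. The sketch below is not DeepEP's actual setup.py; the variable names come from the table above, while the function name and the exact fallback values are illustrative assumptions.

```python
import os

def read_build_config(env=None):
    """Illustrative sketch of reading build-controlling environment
    variables (names from the table; defaults are assumptions)."""
    if env is None:
        env = os.environ
    nvshmem_dir = env.get("NVSHMEM_DIR")  # None -> auto-detect later
    arch_list = env.get("TORCH_CUDA_ARCH_LIST", "9.0")
    disable_sm90 = env.get("DISABLE_SM90_FEATURES", "0") == "1"
    disable_ptx = env.get("DISABLE_AGGRESSIVE_PTX_INSTRS", "1") == "1"
    return {
        "nvshmem_dir": nvshmem_dir,
        "arch_list": arch_list,
        "sm90_features": not disable_sm90,
        "aggressive_ptx": not disable_ptx,
    }

# Example: build for A100-class GPUs with SM90 features off
cfg = read_build_config({"TORCH_CUDA_ARCH_LIST": "8.0",
                         "DISABLE_SM90_FEATURES": "1"})
# cfg["arch_list"] == "8.0"; cfg["sm90_features"] is False
```

Exporting the variables before invoking `python setup.py install` changes the resulting build in the same way.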
Sources: setup.py 17-18, setup.py 53-66, setup.py 70-78
## Next Steps
After installation:
- Detailed Installation: See the Installation page for comprehensive setup instructions
- Build Configuration: See Build System for advanced build options
- System Architecture: See Architecture for deep technical details
- Python API: See Python API for usage examples
- Testing: See Testing and Validation for running tests
The following sections provide increasingly detailed information about specific aspects of the system, from high-level architecture to low-level CUDA kernel implementations.
Sources: setup.py 1-126