This document covers the C++ runtime system that forms the core implementation of DeepEP. The runtime provides the foundational classes, memory management, and communication orchestration that backs the Python API. For detailed information about CUDA kernel implementations, see CUDA Kernels. For device-side utilities and low-level primitives, see Device Utilities.
Core Classes
The C++ runtime is centered around three primary classes that manage the complete lifecycle of expert-parallel communication.

Sources: deep_ep.cpp:15-82, deep_ep.cpp:1344-1350, deep_ep.cpp:1353-1355
Buffer Class
The Buffer class serves as the primary runtime object, managing all communication resources and orchestrating data movement between GPUs. It encapsulates:
- Rank Management: global rank (`rank`), RDMA rank (`rdma_rank`), and NVLink rank (`nvl_rank`) for hierarchical communication
- Memory Allocation: NVLink buffers (`buffer_ptrs`) and RDMA buffers (`rdma_buffer_ptr`) with alignment guarantees
- Stream Management: a dedicated communication stream (`comm_stream`) separate from compute streams
- Synchronization: host-mapped counters (`moe_recv_counter`, `moe_recv_expert_counter`) for CPU-GPU coordination
Sources: deep_ep.cpp:15-82
Config Class
The Config class provides tuning parameters for communication performance, controlling chunk sizes and buffer utilization across different communication modes.
Sources: deep_ep.cpp:1344-1350
EventHandle Class
The EventHandle class wraps CUDA events for precise stream synchronization, enabling asynchronous communication patterns without blocking compute streams.
Sources: deep_ep.cpp:1353-1355
Runtime Lifecycle
The Buffer runtime follows a strict three-phase lifecycle with explicit resource management and distributed synchronization.

Sources: deep_ep.cpp:15-82, deep_ep.cpp:185-240, deep_ep.cpp:143-183
Construction Phase
During construction, the Buffer allocates local memory resources and prepares for distributed coordination:
```cpp
// Memory layout: [data buffer | barrier signals | peer buffer pointers]
int64_t barrier_signal_bytes = NUM_MAX_NVL_PEERS * sizeof(int);
int64_t buffer_ptr_bytes = NUM_MAX_NVL_PEERS * sizeof(void*);

// Allocate one contiguous block so the metadata lives alongside the data
// buffer and a single IPC handle covers everything
CUDA_CHECK(cudaMalloc(&buffer_ptrs[nvl_rank],
                      num_nvl_bytes + barrier_signal_bytes + buffer_ptr_bytes));
```
Sources: deep_ep.cpp:21-82
Synchronization Phase
The sync() method coordinates distributed initialization by exchanging IPC handles and establishing NVSHMEM connections. This phase transforms the buffer from unavailable to ready for communication.
Sources: deep_ep.cpp:185-240
Destruction Phase
The destroy() method performs careful cleanup with distributed barriers to ensure all ranks complete operations before releasing shared resources.
Sources: deep_ep.cpp:143-183
Communication Operations
The runtime provides three distinct communication modes, each optimized for different hardware topologies and performance requirements.

Sources: deep_ep.cpp:242-303, deep_ep.cpp:305-540, deep_ep.cpp:653-931, deep_ep.cpp:1090-1206
Dispatch-Combine Pattern
All communication operations follow a two-phase pattern:
- Dispatch Phase: Route tokens to appropriate expert ranks based on expert assignments
- Combine Phase: Aggregate processed results back to original token locations
Each phase involves complex memory management, stream coordination, and distributed synchronization handled by the runtime.
Sources: deep_ep.cpp:305-540, deep_ep.cpp:542-651
Memory Management
The runtime manages several distinct buffer types (NVLink IPC buffers, RDMA buffers, and host-mapped counters), each with strict alignment requirements.

Sources: deep_ep.cpp:47-82
Buffer Alignment
All buffers maintain strict alignment requirements (NUM_BUFFER_ALIGNMENT_BYTES) to ensure optimal memory access patterns and hardware compatibility.
Sources: deep_ep.cpp:27-28
Host-Mapped Counters
The runtime uses host-mapped memory for CPU-GPU coordination, allowing the CPU to poll completion status without expensive device synchronization.
Sources: deep_ep.cpp:65-81
Stream Management
The runtime maintains careful separation between compute and communication streams to enable overlapping computation with data movement.

Sources: deep_ep.cpp:20, deep_ep.cpp:254-262, deep_ep.cpp:394-400
Asynchronous Operation Support
Methods support both synchronous and asynchronous execution modes through the async parameter, with optional tensor allocation on communication streams for zero-copy operations.
Sources: deep_ep.cpp:282-300, deep_ep.cpp:517-536
Python Bindings
The runtime exposes its functionality to Python through pybind11 bindings that preserve the full C++ interface while providing Pythonic tensor integration.

Sources: deep_ep.cpp:1341-1381
The bindings preserve the C++ method signatures while automatically handling PyTorch tensor integration, CUDA stream management, and Python object lifetimes. Communication methods return tuples containing the result tensors plus an optional EventHandle for asynchronous coordination.
Sources: deep_ep.cpp:1357-1378