This document covers the C++ runtime system that forms the core implementation of DeepEP. The runtime provides the foundational classes, memory management, and communication orchestration that backs the Python API. For detailed information about CUDA kernel implementations, see CUDA Kernels. For device-side utilities and low-level primitives, see Device Utilities.

Core Classes

The C++ runtime is centered around three primary classes that manage the complete lifecycle of expert-parallel communication.

Sources: deep_ep.cpp:15-82 deep_ep.cpp:1344-1350 deep_ep.cpp:1353-1355

Buffer Class

The Buffer class serves as the primary runtime object, managing all communication resources and orchestrating data movement between GPUs. It encapsulates:

  • Rank Management: Global rank (rank), RDMA rank (rdma_rank), and NVLink rank (nvl_rank) for hierarchical communication
  • Memory Allocation: NVLink buffers (buffer_ptrs) and RDMA buffers (rdma_buffer_ptr) with alignment guarantees
  • Stream Management: Dedicated communication stream (comm_stream) separate from compute streams
  • Synchronization: Host-mapped counters (moe_recv_counter, moe_recv_expert_counter) for CPU-GPU coordination
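The hierarchical rank layout can be illustrated with a small sketch: a global rank decomposes into an RDMA (inter-node) rank and an NVLink (intra-node) rank. The constant value and helper names below are illustrative, not taken from the source; they assume ranks are laid out node-major with `NUM_MAX_NVL_PEERS` GPUs per node.

```cpp
// Illustrative only: assumes a node-major rank layout with
// NUM_MAX_NVL_PEERS GPUs per node, mirroring the members listed above.
constexpr int NUM_MAX_NVL_PEERS = 8;

struct RankIds {
    int rdma_rank;  // which node (RDMA domain) this rank belongs to
    int nvl_rank;   // which GPU within the node (NVLink domain)
};

inline RankIds decompose_rank(int rank) {
    return {rank / NUM_MAX_NVL_PEERS, rank % NUM_MAX_NVL_PEERS};
}
```

Under this layout, global rank 13 would be GPU 5 on node 1, and dispatch can choose NVLink or RDMA transport by comparing `rdma_rank` values.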

Sources: deep_ep.cpp:15-82

Config Class

The Config class provides tuning parameters for communication performance, controlling chunk sizes and buffer utilization across different communication modes.
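As a rough sketch of what such a tuning structure involves, consider the trade-off it encodes: larger chunks amortize per-chunk overhead but require more buffer space. The field names and sizing rule below are hypothetical stand-ins for the actual `Config` members.

```cpp
#include <cstdint>

// Hypothetical sketch of a communication-tuning config; field names are
// illustrative and chosen to mirror the concepts described above.
struct TuningConfig {
    int num_sms;              // SMs dedicated to communication kernels
    int chunked_send_tokens;  // tokens sent per chunk
    int max_buffered_tokens;  // upper bound on in-flight tokens per buffer
};

// Buffer requirements scale with the number of tokens that may be
// buffered at once, so larger chunk settings raise memory pressure.
inline int64_t required_buffer_bytes(const TuningConfig& c,
                                     int64_t bytes_per_token) {
    return static_cast<int64_t>(c.max_buffered_tokens) * bytes_per_token;
}
```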

Sources: deep_ep.cpp:1344-1350

EventHandle Class

The EventHandle class wraps CUDA events for precise stream synchronization, enabling asynchronous communication patterns without blocking compute streams.

Sources: deep_ep.cpp:1353-1355

Runtime Lifecycle

The Buffer runtime follows a strict three-phase lifecycle with explicit resource management and distributed synchronization.

Sources: deep_ep.cpp:15-82 deep_ep.cpp:185-240 deep_ep.cpp:143-183

Construction Phase

During construction, the Buffer allocates local memory resources and prepares for distributed coordination:


```cpp
// Memory layout calculation and allocation
int64_t barrier_signal_bytes = NUM_MAX_NVL_PEERS * sizeof(int);
int64_t buffer_ptr_bytes = NUM_MAX_NVL_PEERS * sizeof(void*);

// Allocate one contiguous block: the data buffer followed by metadata
cudaMalloc(&buffer_ptrs[nvl_rank],
           num_nvl_bytes + barrier_signal_bytes + buffer_ptr_bytes);
```

Sources: deep_ep.cpp:21-82

Synchronization Phase

The sync() method coordinates distributed initialization by exchanging IPC handles and establishing NVSHMEM connections. This phase transforms the buffer from unavailable to ready for communication.

Sources: deep_ep.cpp:185-240

Destruction Phase

The destroy() method performs careful cleanup with distributed barriers to ensure all ranks complete operations before releasing shared resources.

Sources: deep_ep.cpp:143-183

Communication Operations

The runtime provides three distinct communication modes, each optimized for different hardware topologies and performance requirements.

Sources: deep_ep.cpp:242-303 deep_ep.cpp:305-540 deep_ep.cpp:653-931 deep_ep.cpp:1090-1206

Dispatch-Combine Pattern

All communication operations follow a two-phase pattern:

  1. Dispatch Phase: Route tokens to appropriate expert ranks based on expert assignments
  2. Combine Phase: Aggregate processed results back to original token locations

Each phase involves complex memory management, stream coordination, and distributed synchronization handled by the runtime.

Sources: deep_ep.cpp:305-540 deep_ep.cpp:542-651

Memory Management

The runtime manages several distinct buffer types (NVLink peer buffers, RDMA buffers, and host-mapped counters), each allocated under strict alignment requirements.

Sources: deep_ep.cpp:47-82

Buffer Alignment

All buffers maintain strict alignment requirements (NUM_BUFFER_ALIGNMENT_BYTES) to ensure optimal memory access patterns and hardware compatibility.
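Alignment of this kind is typically enforced by rounding sizes and offsets up to the next multiple of the alignment constant. A minimal sketch, using the standard power-of-two rounding trick (the constant's value here is illustrative, not the actual `NUM_BUFFER_ALIGNMENT_BYTES`):

```cpp
#include <cstdint>

// Illustrative value; the real constant is defined in the DeepEP source.
constexpr int64_t kAlignment = 128;

// Round n up to the next multiple of kAlignment. Requires kAlignment
// to be a power of two, which buffer alignments invariably are.
inline int64_t align_up(int64_t n) {
    return (n + kAlignment - 1) & ~(kAlignment - 1);
}
```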

Sources: deep_ep.cpp:27-28

Host-Mapped Counters

The runtime uses host-mapped memory for CPU-GPU coordination, allowing the CPU to poll completion status without expensive device synchronization.
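The pattern can be sketched in plain C++ with an atomic standing in for host-mapped memory: the device (modeled here as a worker thread) publishes the received-token count, and the CPU spins on the flag instead of issuing a heavyweight device synchronize. The sentinel value and names are illustrative.

```cpp
#include <atomic>
#include <thread>

// Stand-in for a host-mapped counter: a kernel writes the received-token
// count here; -1 is an illustrative sentinel meaning "not ready yet".
std::atomic<int> moe_recv_counter{-1};

// CPU side: poll until the device has published the count. In the real
// runtime this reads pinned, host-mapped memory written by a kernel,
// avoiding a full cudaDeviceSynchronize on the compute stream.
inline int wait_for_recv_count() {
    int v;
    while ((v = moe_recv_counter.load(std::memory_order_acquire)) < 0)
        std::this_thread::yield();  // cheap spin on the host
    return v;
}
```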

Sources: deep_ep.cpp:65-81

Stream Management

The runtime maintains careful separation between compute and communication streams to enable overlapping computation with data movement.

Sources: deep_ep.cpp:20 deep_ep.cpp:254-262 deep_ep.cpp:394-400

Asynchronous Operation Support

Methods support both synchronous and asynchronous execution modes through the async parameter, with optional tensor allocation on communication streams for zero-copy operations.
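The shape of this dual-mode API can be sketched as follows: in asynchronous mode the method records an event on the communication stream and hands it back for later coordination; in synchronous mode it blocks until communication completes. The types and names here are simplified stand-ins for the CUDA-based originals.

```cpp
#include <optional>

// Simplified stand-ins: in the runtime these are CUDA streams and events.
struct Event { int stream_id; };
inline Event record_event(int stream) { return Event{stream}; }
inline void synchronize(int stream) { (void)stream; /* block until done */ }

// async == true: return an event the caller waits on later, keeping the
// compute stream unblocked. async == false: block here, return nothing.
inline std::optional<Event> finish_comm(int comm_stream, bool async) {
    if (async)
        return record_event(comm_stream);
    synchronize(comm_stream);
    return std::nullopt;
}
```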

Sources: deep_ep.cpp:282-300 deep_ep.cpp:517-536

Python Bindings

The runtime exposes its functionality to Python through pybind11 bindings that preserve the full C++ interface while providing Pythonic tensor integration.

Sources: deep_ep.cpp:1341-1381

The bindings preserve method signatures while automatically handling PyTorch tensor integration, CUDA stream management, and Python object lifetime management. All communication methods return tuples containing result tensors and optional EventHandle objects for asynchronous coordination.

Sources: deep_ep.cpp:1357-1378