https://blog.csdn.net/xx_nm98/article/details/142422761 https://www.cnblogs.com/sunstrikes/p/18310517 https://zhuanlan.zhihu.com/p/2682700946

AlltoAll Dispatcher

This section describes the AlltoAll dispatcher workflow in the MoE (Mixture of Experts) system, implemented by the MoEAlltoAllTokenDispatcher class in megatron/core/transformer/moe/token_dispatcher.py.

AlltoAll Dispatcher Workflow Overview

The MoEAlltoAllTokenDispatcher implements a communication-efficient token dispatching strategy for MoE models using AlltoAll collective operations. The workflow consists of seven main phases, as documented in the class docstring.

Input/Output And Function of Dispatching

Input:

hidden_states: Token embeddings with shape [S/TP, B, H], where S = sequence length, B = batch size, H = hidden size, and TP = tensor parallel size
routing_map: Boolean tensor indicating which experts each token is routed to
probs: Routing probabilities for each token-expert pair

Output:

Processed tokens after expert computation, restored to original shape and order
Combined expert outputs with proper probability weighting

Function of Dispatching:

The dispatcher’s core function is to route tokens to their assigned experts across different GPU devices, enable expert computation, and then combine the results back.

Multi-GPU Process Workflow

Phase 1: Dispatch Preprocessing

The dispatch_preprocess method reshapes input tensors and computes communication metadata.

Key operations:

Reshapes hidden_states from [S/TP, B, H] to [S*B/TP, H]
Calls preprocess() to calculate input_splits and output_splits for AlltoAll communication
Performs token permutation using the permute() function
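
The split-size bookkeeping in this phase can be sketched in plain Python rather than PyTorch. The helper names compute_splits and output_splits_for, the contiguous expert-to-rank layout, and the toy routing maps are assumptions for illustration only. The real preprocess() has to exchange counts between ranks, which this single-process sketch sidesteps by reading both ranks' maps directly.

```python
from typing import List, Tuple


def compute_splits(routing_map: List[List[bool]], ep_size: int) -> Tuple[List[int], List[int]]:
    """Return (tokens_per_expert, input_splits) for one rank.

    routing_map[t][e] is True when local token t is routed to global expert e.
    Assumption: rank r hosts experts [r * n_local, (r + 1) * n_local).
    """
    num_experts = len(routing_map[0])
    assert num_experts % ep_size == 0, "experts must divide evenly across EP ranks"
    n_local = num_experts // ep_size

    # Number of local token copies destined for each global expert.
    tokens_per_expert = [sum(row[e] for row in routing_map) for e in range(num_experts)]

    # input_splits[r]: number of token copies this rank sends to EP rank r.
    input_splits = [
        sum(tokens_per_expert[r * n_local:(r + 1) * n_local]) for r in range(ep_size)
    ]
    return tokens_per_expert, input_splits


def output_splits_for(rank: int, all_input_splits: List[List[int]]) -> List[int]:
    """What `rank` receives from each source equals what each source sends to `rank`."""
    return [splits[rank] for splits in all_input_splits]


# Two EP ranks, four experts (two per rank), three local tokens each, top-2 routing.
rank0_map = [
    [True, False, True, False],
    [False, True, False, True],
    [True, True, False, False],
]
rank1_map = [
    [False, False, True, True],
    [True, False, False, True],
    [False, False, True, True],
]

per_expert0, in_splits0 = compute_splits(rank0_map, ep_size=2)
per_expert1, in_splits1 = compute_splits(rank1_map, ep_size=2)
out_splits0 = output_splits_for(0, [in_splits0, in_splits1])
out_splits1 = output_splits_for(1, [in_splits0, in_splits1])
```

With top-2 routing each rank sends six token copies in total, so every input_splits list here sums to 6, while the output_splits differ per rank because routing is not balanced.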

Phase 2: Token Dispatch (AlltoAll Communication)

The token_dispatch method performs the main AlltoAll communication across expert parallel (EP) ranks.

This uses the all_to_all function from megatron/core/tensor_parallel/mappings.py.
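
To make the data movement concrete, the following single-process sketch imitates an AlltoAll with uneven split sizes using Python lists. The function simulate_all_to_all and the string "tokens" are hypothetical; this is not the signature of Megatron's all_to_all, only an illustration of which slice of each send buffer ends up on which rank.

```python
from typing import List


def simulate_all_to_all(send_buffers: List[list], input_splits: List[List[int]]) -> List[list]:
    """Imitate an uneven AlltoAll in one process.

    send_buffers[s] is rank s's buffer, already grouped by destination rank.
    input_splits[s][d] is how many items rank s sends to rank d.
    Returns recv_buffers, where recv_buffers[d] is ordered by source rank.
    """
    world = len(send_buffers)
    recv_buffers = [[] for _ in range(world)]
    for dst in range(world):
        for src in range(world):
            start = sum(input_splits[src][:dst])
            stop = start + input_splits[src][dst]
            recv_buffers[dst].extend(send_buffers[src][start:stop])
    return recv_buffers


send = [["a0", "a1", "a2"], ["b0", "b1", "b2", "b3"]]
input_splits = [[1, 2], [3, 1]]
recv = simulate_all_to_all(send, input_splits)

# The reverse direction swaps the roles of the split sizes:
# what rank d received from rank s is what it now sends back to s.
output_splits = [[input_splits[s][d] for s in range(2)] for d in range(2)]
restored = simulate_all_to_all(recv, output_splits)
```

Running the exchange a second time with the transposed split sizes returns every item to its source rank, which is the idea behind the reverse AlltoAll used in the combine step.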

Phase 3: Dispatch Postprocessing

The dispatch_postprocess method handles post-communication operations.

Key operations:

AllGather across tensor parallel (TP) ranks if tp_size > 1
Sorts tokens by local expert using sort_chunks_by_idxs() when multiple local experts exist
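
The chunk sort can be illustrated with a small list-based sketch. After the AlltoAll, received tokens are grouped by source rank first and by local expert second, while the experts want them grouped by local expert first. The helper sort_chunks below is a hypothetical plain-Python stand-in, not the real sort_chunks_by_idxs() signature, and the token labels are made up.

```python
from typing import List


def sort_chunks(tokens: list, split_sizes: List[int], sorted_idxs: List[int]) -> list:
    """Split `tokens` into consecutive chunks, then concatenate them in `sorted_idxs` order."""
    chunks, start = [], 0
    for size in split_sizes:
        chunks.append(tokens[start:start + size])
        start += size
    return [tok for i in sorted_idxs for tok in chunks[i]]


ep_size, n_local = 2, 2

# Received layout is rank-major: (rank 0, expert 0), (rank 0, expert 1), (rank 1, expert 0), ...
received = ["r0e0", "r0e1a", "r0e1b", "r1e0a", "r1e0b", "r1e1"]
split_sizes = [1, 2, 2, 1]

# Expert-major order: for each local expert, visit every source rank.
sorted_idxs = [src * n_local + e for e in range(n_local) for src in range(ep_size)]
grouped = sort_chunks(received, split_sizes, sorted_idxs)

# The inverse (used when combining) applies the opposite index order
# to the already reordered chunk sizes.
grouped_sizes = [split_sizes[i] for i in sorted_idxs]
restore_idxs = [e * ep_size + src for src in range(ep_size) for e in range(n_local)]
restored = sort_chunks(grouped, grouped_sizes, restore_idxs)
```

After the sort, all tokens for local expert 0 are contiguous regardless of which rank sent them, so each expert can consume one slice of the buffer.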

Phase 4: Expert Computation

Experts process the dispatched tokens (handled by the MoELayer).

Phase 5: Combine Preprocessing

The combine_preprocess method prepares expert outputs for combination.

Operations include:

Un-sorting tokens by local expert (reverse of phase 3)
ReduceScatter across TP ranks if tp_size > 1

Phase 6: Token Combine (Reverse AlltoAll)

The token_combine method performs the reverse AlltoAll communication.

Phase 7: Combine Postprocessing

The combine_postprocess method finalizes token reconstruction.

Final operations:

Un-permutes tokens back to original order using unpermute()
Reshapes tensor to original shape
Adds shared expert output if enabled
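
The way the seven phases chain together can be shown with a toy driver. The six method names come from the phases above; the TracingDispatcher class, the moe_forward function, and the single-argument signatures are hypothetical simplifications, since the real methods also carry routing metadata and probabilities.

```python
class TracingDispatcher:
    """Hypothetical stand-in: each phase records its name and passes the data through."""

    def __init__(self):
        self.trace = []

    def _phase(self, name, x):
        self.trace.append(name)
        return x

    def dispatch_preprocess(self, x):
        return self._phase("dispatch_preprocess", x)

    def token_dispatch(self, x):
        return self._phase("token_dispatch", x)

    def dispatch_postprocess(self, x):
        return self._phase("dispatch_postprocess", x)

    def combine_preprocess(self, x):
        return self._phase("combine_preprocess", x)

    def token_combine(self, x):
        return self._phase("token_combine", x)

    def combine_postprocess(self, x):
        return self._phase("combine_postprocess", x)


def moe_forward(dispatcher, experts, hidden_states):
    """Run the phases in order; `experts` is any callable applied in Phase 4."""
    x = dispatcher.dispatch_preprocess(hidden_states)  # Phase 1
    x = dispatcher.token_dispatch(x)                   # Phase 2
    x = dispatcher.dispatch_postprocess(x)             # Phase 3
    x = experts(x)                                     # Phase 4
    x = dispatcher.combine_preprocess(x)               # Phase 5
    x = dispatcher.token_combine(x)                    # Phase 6
    return dispatcher.combine_postprocess(x)           # Phase 7


dispatcher = TracingDispatcher()
output = moe_forward(dispatcher, lambda xs: [2 * v for v in xs], [1, 2, 3])
```

The symmetry is the point of the sketch: the three combine phases undo the three dispatch phases in reverse order around the expert computation.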

Multi-GPU Communication Pattern

The AlltoAll dispatcher uses a two-level communication hierarchy:

Expert Parallel (EP) Communication: AlltoAll operations distribute tokens across ranks hosting different experts
Tensor Parallel (TP) Communication: AllGather/ReduceScatter operations handle tensor parallelism within expert computation

The process groups are managed through model_comm_pgs.

This workflow enables efficient scaling of MoE models across multiple GPUs by minimizing communication overhead compared to the AllGather-based dispatcher, especially when expert parallelism is large.

Notes

The AlltoAll dispatcher is recommended when expert parallelism is applied, as mentioned in the configuration documentation. It achieves better memory efficiency by storing only the tokens assigned to local experts, rather than gathering all tokens as the AllGather dispatcher does.
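
A rough back-of-the-envelope comparison illustrates the memory argument. Both helper functions are hypothetical, and the AlltoAll estimate assumes perfectly balanced routing and a tensor parallel size of 1, neither of which the text guarantees; treat the numbers as an order-of-magnitude sketch only.

```python
def rows_held_allgather(local_tokens: int, tp_size: int, ep_size: int) -> int:
    """Token rows per rank after gathering every token from all TP x EP ranks."""
    return local_tokens * tp_size * ep_size


def rows_held_alltoall(local_tokens: int, top_k: int) -> int:
    """Token rows per rank when only copies routed to local experts arrive.

    Assumption: balanced routing and tp_size == 1, so the global number of
    token copies (local_tokens * ep_size * top_k) is spread evenly over
    ep_size ranks.
    """
    return local_tokens * top_k


allgather_rows = rows_held_allgather(local_tokens=4096, tp_size=1, ep_size=8)
alltoall_rows = rows_held_alltoall(local_tokens=4096, top_k=2)
ratio = allgather_rows / alltoall_rows
```

Under these assumptions the gap grows linearly with the expert parallel size, which matches the claim that AlltoAll pays off when expert parallelism is large.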

DeepEP Dispatcher

This section covers the relationship between the three token dispatchers in Megatron’s MoE system (allgather, alltoall, and flex) and how DeepEP relates to the AlltoAll dispatch/combine strategy.

Relationship Between the Three Dispatchers

The three dispatchers represent different communication strategies for routing tokens to experts:

AllGather Dispatcher (MoEAllGatherTokenDispatcher): Uses AllGather communication to collect all tokens across ranks, then processes locally
AlltoAll Dispatcher (MoEAlltoAllTokenDispatcher): Uses AlltoAll communication for more efficient token distribution when expert parallelism is large
Flex Dispatcher (MoEFlexTokenDispatcher): A flexible dispatcher that abstracts underlying parallelism and currently supports DeepEP backend

The configuration determines which dispatcher is used.
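
A minimal sketch of such a selection step is shown here. The three type strings and class names are taken from the text above; the select_dispatcher function and the DISPATCHERS table are hypothetical, and the sketch returns class names as strings rather than constructing real dispatcher objects.

```python
# Mapping from dispatcher type string to the class named in the text.
DISPATCHERS = {
    "allgather": "MoEAllGatherTokenDispatcher",
    "alltoall": "MoEAlltoAllTokenDispatcher",
    "flex": "MoEFlexTokenDispatcher",
}


def select_dispatcher(dispatcher_type: str) -> str:
    """Return the dispatcher class name for a configured type, or fail loudly."""
    try:
        return DISPATCHERS[dispatcher_type]
    except KeyError:
        valid = ", ".join(sorted(DISPATCHERS))
        raise ValueError(
            f"Unsupported token dispatcher type {dispatcher_type!r}; expected one of: {valid}"
        ) from None


chosen = select_dispatcher("alltoall")
```

Failing on an unknown string, rather than silently falling back to a default, keeps a mistyped configuration from quietly changing the communication pattern.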

DeepEP Integration in Token Dispatching

DeepEP is only available through the Flex dispatcher, not the AlltoAll dispatcher. The Flex dispatcher uses DeepEP’s optimized kernels for fused communication operations.

Key Classes and Functions for DeepEP

Primary DeepEP manager class: _DeepepManager

DeepEP integration in the Flex dispatcher: MoEFlexTokenDispatcher

Core DeepEP functions: fused_dispatch and fused_combine

DeepEP Token Dispatch Workflow

1. Setup Metadata:

The DeepEP manager processes routing information.

2. Fused Dispatch Operation:

DeepEP performs fused permutation and AlltoAll communication in a single kernel.

3. Expert Processing:

Tokens are permuted for expert computation.

4. Fused Combine Operation:

The reverse process uses fused kernels to unpermute and communicate.

Input Token Flow with DeepEP

The input tokens flow through the Flex dispatcher as follows:

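Preprocessing: The DeepEP manager sets up routing metadata
Fused Dispatch: Permutation and AlltoAll communication run as one fused operation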
Expert Computation: Tokens are processed by experts after DeepEP’s fused dispatch
Fused Combine: Results are combined using DeepEP’s optimized kernels

Notes

DeepEP is not used in the traditional AlltoAll dispatcher (MoEAlltoAllTokenDispatcher). It is available only through the Flex dispatcher (MoEFlexTokenDispatcher) when --moe-enable-deepep is set. The DeepEP integration provides optimized cross-node communication for large-scale MoE training scenarios.

Fused Permutation and AlltoAll

This section covers the purpose and implementation of token permutation in MoE systems, and how permutation is fused with AlltoAll communication.

Purpose of Token Permutation

Token permutation reorganizes tokens so that those assigned to the same expert are grouped contiguously in memory. This grouping is essential because:

Enables efficient expert computation: Experts can process their assigned tokens as contiguous chunks
Optimizes communication patterns: AlltoAll operations work more efficiently with properly organized data
Facilitates memory management: Reduces fragmentation and improves cache locality

Permutation Implementation

Core Permutation Function

The main permutation logic is implemented in the permute() function.

The function supports both fused and non-fused implementations:

Fused Implementation (when fused=True):

Uses Transformer Engine’s optimized kernels: fused_permute() and fused_permute_with_probs()
Requires Transformer Engine >= 2.1.0

Non-fused Implementation:

Creates expert-to-token mapping from sparse token-to-expert routing
Uses torch.index_select() to reorder tokens based on computed indices
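
The non-fused path can be sketched with plain Python lists standing in for tensors. The function below is a simplified illustration, not the real permute() signature; the list comprehension that gathers rows plays the role that index_select plays on tensors, and the token labels are made up.

```python
from typing import List, Tuple


def permute(tokens: list, routing_map: List[List[bool]]) -> Tuple[list, List[int]]:
    """Group token copies by expert.

    routing_map[t][e] is True when token t is routed to expert e.
    Returns (permuted_tokens, sorted_indices), where sorted_indices[i] is the
    source token index of permuted row i.
    """
    num_experts = len(routing_map[0])
    # Expert-to-token view: walk experts in order and list the tokens each one receives.
    sorted_indices = [
        t
        for e in range(num_experts)
        for t, row in enumerate(routing_map)
        if row[e]
    ]
    # Gather rows in that order (the list-based analogue of index_select).
    permuted = [tokens[t] for t in sorted_indices]
    return permuted, sorted_indices


tokens = ["t0", "t1", "t2"]
routing_map = [
    [True, False, True],   # t0 -> experts 0 and 2
    [False, True, False],  # t1 -> expert 1
    [True, True, False],   # t2 -> experts 0 and 1
]
permuted, sorted_indices = permute(tokens, routing_map)
```

A token routed to k experts appears k times in the permuted buffer, so the buffer length equals the number of True entries in the routing map rather than the number of tokens.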

Permutation in AlltoAll Dispatcher

In the MoEAlltoAllTokenDispatcher, permutation happens in the dispatch_preprocess() method.

The permutation process:

Computes routing metadata via preprocess()
Performs token permutation using the permute() function with routing map and probabilities
Returns permuted tokens and probabilities ready for AlltoAll communication

Unpermutation Process

The reverse operation uses unpermute() to restore the original token ordering.

This happens in combine_postprocess().
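
As an illustration of the restore step, the sketch below scatters expert outputs back to their source token positions and sums the copies that belong to the same token. It is a plain-Python simplification with a made-up signature. Applying the routing probability during the scatter is an assumption of this sketch; where the real code applies the weight is not shown on this page.

```python
from typing import List


def unpermute(
    permuted_rows: List[List[float]],
    sorted_indices: List[int],
    permuted_probs: List[float],
    num_tokens: int,
) -> List[List[float]]:
    """Scatter-add weighted expert outputs back into original token order.

    sorted_indices[i] is the original token index of permuted row i, and
    permuted_probs[i] is the routing weight of that (token, expert) pair.
    """
    hidden = len(permuted_rows[0])
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, token_idx, prob in zip(permuted_rows, sorted_indices, permuted_probs):
        for h in range(hidden):
            out[token_idx][h] += prob * row[h]
    return out


# Five expert outputs for three tokens (tokens 0 and 2 were each sent to two experts).
expert_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
sorted_indices = [0, 2, 1, 2, 0]
probs = [0.5, 0.25, 1.0, 0.75, 0.5]
restored = unpermute(expert_out, sorted_indices, probs, num_tokens=3)
```

Each output row is the probability-weighted sum of the outputs from the experts that token visited, which is the "combined expert outputs with proper probability weighting" described earlier.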

Permutation and AlltoAll Fusion

Traditional Approach (Separate Operations)

In the standard MoEAlltoAllTokenDispatcher, permutation and AlltoAll are separate sequential operations:

Permutation: permute() reorganizes tokens locally
AlltoAll Communication: all_to_all() distributes permuted tokens across ranks

Fused Approach (DeepEP Integration)

DeepEP provides true fusion of permutation and AlltoAll through the Flex dispatcher, not the traditional AlltoAll dispatcher.

In MoEFlexTokenDispatcher with DeepEP enabled:

Fused Dispatch Operation:

The _DeepepManager.dispatch() method calls fused_dispatch(), which combines permutation and AlltoAll in a single kernel.

Fused Combine Operation:

Similarly, the combine operation uses fused kernels.

DeepEP Fusion Implementation

The actual fusion happens in the FusedDispatch autograd function.

Key aspects of the fusion:

Single kernel execution: Permutation and communication happen in one fused call rather than as two separate steps
Optimized memory access: Reduces intermediate buffer allocations
Improved performance: Eliminates separate permutation overhead

Configuration and Usage

Permutation fusion is controlled by the --moe-permute-fusion flag.

This enables fused permutation operations when Transformer Engine >= 2.1.0 is available.

Notes

The key distinction is that true permutation-AlltoAll fusion only exists in DeepEP through the Flex dispatcher, not in the traditional AlltoAll dispatcher. The traditional AlltoAll dispatcher can use fused permutation kernels locally, but the permutation and AlltoAll communication remain separate operations. DeepEP’s innovation is combining both into single optimized kernels for better performance in large-scale deployments.

3 MoE

The Three MoE Dispatcher Classes

The three dispatcher classes all inherit from MoETokenDispatcher and implement both dispatch and combine operations (token_dispatcher.py:135-188):

MoEAllGatherTokenDispatcher

Implementation Method: Uses AllGather communication pattern token_dispatcher.py:196-200

Key Operations:

Dispatch: Gathers tokens from all TP×EP ranks using AllGather token_dispatcher.py:240-262
Combine: Uses Reduce-Scatter to aggregate expert outputs back to original ranks token_dispatcher.py:312-325

Best Use Case: Small expert parallelism scenarios, but has higher memory overhead as it stores all tokens

MoEAlltoAllTokenDispatcher

Implementation Method: Uses AlltoAll communication pattern with a 7-step workflow token_dispatcher.py:332-344

Key Operations:

Dispatch: AlltoAll communication for expert parallelism, followed by AllGather for tensor parallelism token_dispatcher.py:602-628
Combine: Reverse AlltoAll to redistribute tokens back to original ranks token_dispatcher.py:742-768

Best Use Case: Large expert parallelism with lower memory overhead as it only stores assigned tokens

MoEFlexTokenDispatcher

Implementation Method: Uses fused DeepEP kernels that combine permutation and communication token_dispatcher.py:1133-1137

Key Operations:

Dispatch: Fused permutation and AlltoAll in a single optimized operation token_dispatcher.py:1232-1259
Combine: Fused un-permutation and communication using DeepEP kernels token_dispatcher.py:1289-1307

Best Use Case: Large-scale training with optimized cross-node communication

Backend Implementation Differences

MoEFlexTokenDispatcher - DeepEP with CUDA Kernels

The Flex dispatcher uses DeepEP’s fused CUDA kernels for optimized communication (token_dispatcher.py:1161-1162). It leverages the _DeepepManager, which provides fused dispatch and combine operations (token_dispatcher.py:902-922).

Key Backend Features:

Fused CUDA kernels: Uses fused_dispatch and fused_combine from DeepEP token_dispatcher.py:989-999
Single-step operations: Combines permutation and AlltoAll communication in one kernel call token_dispatcher.py:1240-1245
Cross-node optimization: Specifically designed for efficient cross-node token communication README.md:460-462

The fused kernels are implemented in the FusedDispatch and FusedCombine autograd functions (fused_a2a.py:68-82).

MoEAlltoAllTokenDispatcher - Standard PyTorch AlltoAll

This dispatcher uses PyTorch’s native distributed communication primitives without custom CUDA kernels (token_dispatcher.py:621-627).

Key Backend Features:

Standard AlltoAll: Uses all_to_all from megatron.core.tensor_parallel token_dispatcher.py:765-767
Separate permutation: Token permutation and communication are separate operations using permute and unpermute functions token_dispatcher.py:588-599
Multi-step workflow: 7-step process with explicit preprocess/postprocess phases token_dispatcher.py:336-344

MoEAllGatherTokenDispatcher - Standard PyTorch AllGather/ReduceScatter

This dispatcher uses the simplest backend with standard PyTorch collective operations (token_dispatcher.py:258-260).

Key Backend Features:

AllGather/ReduceScatter: Uses gather_from_sequence_parallel_region and reduce_scatter_to_sequence_parallel_region token_dispatcher.py:322-324
No custom kernels: Relies entirely on PyTorch’s native distributed primitives
Higher memory usage: Stores all tokens across TP×EP ranks README.md:441

Performance and Optimization Hierarchy

The three dispatchers represent different levels of optimization:

AllGather (simplest): Standard PyTorch collectives, highest memory usage
AlltoAll (intermediate): Standard PyTorch with optimized communication pattern, lower memory usage
Flex (most advanced): Custom CUDA kernels with fused operations, optimized for large-scale cross-node scenarios

The Flex dispatcher requires a DeepEP installation and is enabled via --moe-enable-deepep (arguments.py:2806-2807), while the other two use standard PyTorch distributed communication without additional dependencies.
