LIST "Megatron Repository"
FROM ""
WHERE file.folder = this.file.folder OR startswith(file.folder, this.file.folder + "/")
SORT file.path

reference: Megatron-LM | DeepWiki

Overview

Purpose and Scope

This document provides a comprehensive overview of the Megatron-LM framework, a GPU-optimized system for training large-scale transformer models. Megatron-LM combines cutting-edge research implementations with production-ready training infrastructure, enabling efficient training of models with hundreds of billions of parameters across thousands of GPUs.

The framework consists of two main components:

  • Megatron-LM: research-oriented training scripts and model implementations
  • Megatron-Core: a production-ready library of GPU-optimized training techniques

This documentation covers the entire system architecture, from core transformer implementations to distributed training orchestration and inference deployment.

For detailed information about specific subsystems, see: Core Architecture, Parallelism Strategies, Training System, Data Processing, Inference System, Fine-tuning and Evaluation, and CI/CD and Testing.

System Architecture Overview

Megatron-LM implements a layered architecture centered around Megatron-Core as the foundational library, with training orchestration, model implementations, and infrastructure components built on top. The system separates core GPU-optimized techniques from research implementations and production workflows.

Overall System Architecture

Sources: README.md:71-86 arguments.py:45-82 transformer_config.py:32-38 training.py:1-4

Core Components

Megatron-LM vs. Megatron-Core

The framework separates research implementations from production-ready infrastructure:

  • Megatron-LM: Training scripts (pretrain_gpt.py, pretrain_bert.py), tools, examples, and experimental features
  • Megatron-Core: Production library under megatron.core with versioned APIs, optimized implementations, and formal support

Training Pipeline and Model Architecture Flow

Sources: README.md:75-85 arguments.py:84-119 pretrain_gpt.py:95-108 gpt_model.py:34-75 training.py:1-4

Supported Model Architectures

Megatron supports multiple transformer-based architectures through the TransformerConfig system:

| Architecture | Implementation | Key Classes | Use Case |
|---|---|---|---|
| GPT | megatron.core.models.gpt.GPTModel | GPTModel, LanguageModule | Autoregressive language modeling |
| BERT | megatron.legacy.model.BertModel | BertModel, MegatronModule | Bidirectional language understanding |
| T5 | megatron.core.models.t5.T5Model | T5Model, encoder-decoder | Text-to-text generation |
| Retro | megatron.core.models.retro | RetroModel | Retrieval-augmented generation |
| MoE | megatron.core.transformer.moe | MoELayer, expert routing | Mixture of experts scaling |
| LLaVA | megatron.core.models.multimodal | Multimodal fusion | Vision-language models |
| Mamba | examples/mamba/ | State space models | Sequence modeling with linear complexity |

Sources: README.md:210-380 gpt_model.py:34-75 transformer_config.py:32-38
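
As a concrete illustration of the TransformerConfig pattern, the sketch below defines a standalone stand-in (not the real megatron.core.transformer.TransformerConfig class, which carries many more fields) with the same style of derived defaults; the 24-layer / 1024-hidden / 16-head values correspond to a 345M-class GPT configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransformerConfigSketch:
    """Illustrative stand-in for megatron.core.transformer.TransformerConfig.
    The real class has many more knobs; only the derived-default pattern
    is shown here."""
    num_layers: int
    hidden_size: int
    num_attention_heads: int
    ffn_hidden_size: Optional[int] = None
    kv_channels: Optional[int] = None

    def __post_init__(self):
        # Mirror the config's derived defaults: a 4x MLP expansion
        # and a per-head key/value width.
        if self.ffn_hidden_size is None:
            self.ffn_hidden_size = 4 * self.hidden_size
        if self.kv_channels is None:
            self.kv_channels = self.hidden_size // self.num_attention_heads

# A 345M-class GPT configuration (24 layers, 1024 hidden, 16 heads).
cfg = TransformerConfigSketch(num_layers=24, hidden_size=1024,
                              num_attention_heads=16)
print(cfg.ffn_hidden_size, cfg.kv_channels)  # 4096 64
```

All model families in the table above consume one such config object, which is what lets GPT, T5, Retro, and MoE variants share the same transformer building blocks.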

Parallelism Strategies

The framework implements comprehensive parallelism support through megatron.core.parallel_state and specialized modules:

  • Tensor Parallelism (--tensor-model-parallel-size): megatron.core.tensor_parallel splits layers across GPUs
  • Pipeline Parallelism (--pipeline-model-parallel-size): megatron.core.pipeline_parallel.schedules distributes transformer blocks
  • Data Parallelism: megatron.core.distributed.DistributedDataParallel replicates models with gradient sync
  • Sequence Parallelism (--sequence-parallel): Partitions activations along the sequence dimension for the layer-norm and dropout regions that tensor parallelism leaves replicated
  • Expert Parallelism (--expert-model-parallel-size): megatron.core.transformer.moe distributes MoE experts
  • Context Parallelism (--context-parallel-size): Handles long sequences via _CONTEXT_PARALLEL_GROUP

Sources: README.md:292-306 layers.py:1-4 schedules.py:1-4 parallel_state.py:22-106
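
These parallelism dimensions must multiply out to the launched world size. The helper below is a hypothetical sketch of that bookkeeping, not the real API; in Megatron the corresponding checks happen when megatron.core.parallel_state.initialize_model_parallel builds the process groups.

```python
def data_parallel_size(world_size: int, tensor_mp: int,
                       pipeline_mp: int, context_p: int = 1) -> int:
    """Hypothetical helper: derive the data-parallel size left over once
    the tensor-, pipeline-, and context-parallel dimensions are fixed,
    as Megatron does when building its process groups."""
    model_parallel = tensor_mp * pipeline_mp * context_p
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP x PP x CP = {model_parallel}")
    return world_size // model_parallel

# 1024 GPUs with --tensor-model-parallel-size 8 and
# --pipeline-model-parallel-size 16 leave 8-way data parallelism.
print(data_parallel_size(1024, tensor_mp=8, pipeline_mp=16))  # 8
```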

Performance and Scalability

Training Performance

Megatron-LM demonstrates exceptional scaling characteristics across model sizes and hardware configurations:

  • Model Scale: Supports models from 345M to 462B parameters
  • Hardware Scale: Tested up to 6,144 H100 GPUs
  • Efficiency: Achieves 41-48% Model FLOPs Utilization (MFU)
  • Scaling Pattern: Shows superlinear weak scaling due to improved arithmetic intensity
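
MFU relates achieved model FLOPs to the aggregate hardware peak. The back-of-the-envelope sketch below uses the common ~6 × parameters FLOPs-per-token approximation for dense transformer training; the default peak value and the example throughput are illustrative assumptions, not Megatron benchmark numbers.

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float = 989e12) -> float:
    """Estimate MFU = achieved model FLOPs / aggregate hardware peak.
    6 * params FLOPs per token is the standard dense-transformer
    approximation (forward + backward); the default peak is an H100
    BF16 dense figure, used here only as an illustrative assumption."""
    achieved = 6 * params * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)

# Hypothetical example: a 175B-parameter model sustaining
# 450k tokens/s across 1024 GPUs.
mfu = model_flops_utilization(175e9, tokens_per_second=450_000, num_gpus=1024)
print(f"{mfu:.0%}")  # 47%
```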

Benchmark Results

| Model Size | GPUs | Global Batch Size | MFU | Throughput |
|---|---|---|---|---|
| 2B | 96 | 1152 | 41% | - |
| 175B | 1024 | 1536 | 47% | 138 TFLOP/s per GPU |
| 462B | 6144 | - | 48% | - |

Sources: README.md:87-100

Getting Started

Entry Points

The framework provides multiple entry points for different use cases:

Training Scripts:

  • pretrain_gpt.py: GPT model pretraining
  • pretrain_bert.py: BERT model pretraining
  • pretrain_t5.py: T5 model pretraining

Data Processing:

  • tools/preprocess_data.py: Convert raw text to training format

Inference:

  • tools/run_text_generation_server.py: REST API server for text generation
  • tools/text_generation_cli.py: Command-line interface for inference

Evaluation:

  • tasks/main.py: Downstream task evaluation

Basic Usage Pattern
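
The training scripts all follow the same callback structure: a model provider and a forward-step function handed to megatron.training.pretrain, which is launched under torchrun. The sketch below mimics that shape with illustrative stubs so it runs without Megatron installed; the real pretrain also takes a dataset provider and a model type, and the real callbacks build a GPTModel and return a loss.

```python
# Illustrative stubs mimicking the pretrain_gpt.py callback pattern.
# None of this is the real Megatron API surface; it only shows the shape
# of the callbacks that megatron.training.pretrain wires together.

def model_provider(pre_process=True, post_process=True):
    # Real version: construct a megatron.core.models.gpt.GPTModel from
    # the parsed TransformerConfig.
    return {"model": "GPTModel", "pre": pre_process, "post": post_process}

def forward_step(data_iterator, model):
    # Real version: fetch a batch, run the model, and return the output
    # tensor together with a loss function.
    batch = next(data_iterator)
    return model, batch

def pretrain(dataset_provider, model_provider, forward_step):
    # Real version: megatron.training.pretrain sets up distributed state,
    # builds the model and optimizer, and drives the training loop.
    model = model_provider()
    iterator = dataset_provider()
    return forward_step(iterator, model)

out = pretrain(lambda: iter(range(3)), model_provider, forward_step)
print(out[1])  # 0
```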

Sources: README.md:199-212 README.md:492-509 pretrain_gpt.py:1-10 checkpointing.py:1-4

Installation and Prerequisites

Dependencies

  • PyTorch: Latest stable version
  • CUDA/cuDNN/NCCL: Latest stable versions
  • Hardware: NVIDIA Turing generation GPUs or later for best performance
  • FP8 Support: Available on Hopper, Ada, and Blackwell architectures

Installation Options

  • PyPI: pip install megatron-core[dev] for latest features
  • Docker: NVIDIA PyTorch NGC Container (recommended)
  • Source: Git clone with environment setup via docker/common/install.sh

Sources: README.md:101-184