BLAS Library Selection

PyTorch provides mechanisms to select the preferred BLAS library at runtime, allowing users to choose between different implementations based on their workload characteristics.

BLAS Backend Selection

BLAS Backend Selection Process

The BLAS backend selection process allows users to choose between different BLAS libraries at runtime. The preferred_blas_library() function can be used to set the preferred backend, and the blas_library_context() context manager can be used to temporarily change the backend for a specific code block.

Sources:

test/test_matmul_cuda.py65-72 aten/src/ATen/Context.cpp1000-1050

torch/backends/cuda/init.py1-100

Performance Considerations

When working with PyTorch’s linear algebra backends, several factors can significantly impact performance:

  1. Data Type Selection:

    • FP16/BF16 operations can be significantly faster on modern GPUs with tensor cores
    • TF32 mode provides a good balance between precision and performance for FP32 operations
  2. Memory Layout:

    • Contiguous tensors generally perform better than strided ones
    • Column-major layout may be more efficient for certain operations due to BLAS conventions
  3. Batch Size Optimization:

    • Using batched operations (bmm, baddbmm) is more efficient than loops of individual operations
    • Very large batch sizes may require workspace size adjustments
  4. Backend Selection:

    • cuBLASLt/hipBLASLt generally perform better for operations supported by tensor cores
    • Traditional cuBLAS/rocBLAS may be faster for small matrices or unusual shapes
  5. Tuning System:

    • Enable the tunable operations system for workloads with consistent matrix shapes
    • Pre-tune operations during initialization to avoid runtime tuning overhead

Sources:

test/test_matmul_cuda.py82-143 aten/src/ATen/cuda/CUDABlas.cpp387-437 torch/cuda/tunable.py1-298

Linear Algebra Backends

This page documents the linear algebra backend system in PyTorch, which provides high-performance implementations of matrix and vector operations (such as GEMM, GEMV, and batched GEMM) across supported hardware and BLAS libraries. The system includes backend selection, data type and precision management, and a tunable operations system for automatic performance optimization.

For details on how these backends are invoked from the compilation pipeline, see TorchInductor. For device-specific execution details, see Device Backends.

System Architecture

PyTorch’s linear algebra backend system is structured to provide a unified interface for tensor operations, while abstracting over hardware-specific and library-specific optimizations. The system routes high-level operations to the appropriate backend implementation, with support for runtime backend selection and autotuning.

Diagram: High-Level Operation Routing to Backend Code Entities

Key Code Entities:

  • at::cuda::blas::gemm, at::cuda::blas::bgemm, at::cuda::blas::gemm_and_bias: Core dispatch points for matrix operations (

aten/src/ATen/cuda/CUDABlas.h- aten/src/ATen/native/cuda/Blas.cpp).

Sources:

aten/src/ATen/native/cuda/Blas.cpp323-448 aten/src/ATen/cuda/CUDABlas.cpp367-563 aten/src/ATen/cuda/CUDABlas.h40-231 aten/src/ATen/cuda/tunable/TunableGemm.h29-239

Backend Implementation Layer

The core linear algebra operations are implemented through a template-based system that supports multiple data types and backend libraries.

Backend Implementation: CUDA and ROCm

The CUDA backend supports both cuBLAS and cuBLASLt libraries, with backend selection logic based on operation parameters, data types, and hardware capabilities. ROCm backends (rocBLAS, hipBLASLt) are supported on AMD GPUs.

Diagram: Backend Selection and Code Entities

Key Code Entities:

Sources:

aten/src/ATen/cuda/CUDABlas.cpp136-189 aten/src/ATen/cuda/CUDABlas.cpp367-563

aten/src/ATen/native/cuda/Blas.cpp142-216

Data Type and Precision Support

The linear algebra backends support a range of data types and precision modes, including mixed precision and hardware-specific formats.

Data TypecuBLAScuBLASLtROCm BLAShipBLASLtMixed Precision / Special Modes
floatTF32 (Ampere+), FP32
doubleLimited-
at::HalfAccumulate in FP32
at::BFloat16Accumulate in FP32
c10::complex<float>Limited-
c10::complex<double>Limited-
float8 (experimental)----
  • TF32 is available on NVIDIA Ampere and later GPUs for FP32 inputs.
  • Mixed precision: Half/BFloat16 inputs can use FP32 accumulation and output.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp387-437 aten/src/ATen/native/cuda/Blas.cpp360-402 aten/src/ATen/Context.cpp1000-1050

aten/src/ATen/cuda/tunable/GemmCommon.h106-215

Tunable Operations System

PyTorch provides a tunable operations system for linear algebra, which benchmarks and selects the fastest backend implementation for a given operation signature and input parameters. This system is especially relevant for GEMM and batched GEMM operations on ROCm and CUDA platforms.

Diagram: TunableOp System and Code Entities

Key Code Entities:

Sources:

aten/src/ATen/cuda/tunable/TunableGemm.h206-239 aten/src/ATen/cuda/tunable/GemmCommon.h31-377

torch/cuda/tunable.py1-196

Backend Registration and Environment Controls

The tunable operations system supports registration of multiple backend implementations, with selection and tuning controlled by environment variables and runtime APIs.

Diagram: Backend Registration and Environment Controls

  • Each backend provides a unique name and callable implementation.
  • Environment variables control which backends are enabled and whether tuning is active.

Sources:

aten/src/ATen/cuda/tunable/TunableGemm.h210-230 torch/cuda/tunable.py1-196

aten/src/ATen/cuda/tunable/GemmCommon.h31-375

Backend Selection Logic

Backend selection is performed dynamically based on hardware capabilities, operation parameters, and user or environment configuration.

Diagram: Backend Selection Decision Flow

  • The selection logic is implemented in blasPreferredBackend() and related helpers.
  • Platform checks and parameter evaluation determine the optimal backend for each operation.

Sources:

aten/src/ATen/native/cuda/Blas.cpp341-402 test/test_linalg.py55-70

torch/testing/_internal/common_cuda.py28-112

ROCm Backend Support

PyTorch supports AMD ROCm platforms with both rocBLAS and hipBLASLt libraries, including architecture-specific optimizations and mixed precision support.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp19-65 aten/src/ATen/cuda/CUDABlas.cpp193-227 aten/src/ATen/cuda/CUDABlas.cpp647-660

test/test_linalg.py55-70

Mixed Precision and Data Type Handling

The linear algebra backends support mixed precision computation, allowing lower-precision inputs (Half, BFloat16) to be accumulated in higher precision, and enabling special hardware modes such as TF32.

Diagram: Data Type and Precision Routing

  • Precision control flags: allowFP16AccumulationCuBLAS(), allowBF16ReductionCuBLAS(), allowTF32CuBLAS(), setFloat32MatmulPrecision().
  • TF32 is available for FP32 inputs on Ampere+ GPUs.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp387-437 aten/src/ATen/Context.cpp1000-1050

test/test_matmul_cuda.py82-143

Performance Optimization Features

Workspace and Memory Layout Management

  • CublasLtWorkspace: Manages temporary workspace for cuBLASLt (
  • aten/src/ATen/cuda/CUDABlas.cpp246-253).
  • getCUDABlasLtWorkspaceSize(): Workspace size is configurable via environment variables.
  • Memory layout optimization: Automatic detection of optimal layouts, stride analysis, and transpose logic.

Diagram: Memory Layout and Transpose Handling

Sources: aten/src/ATen/native/cuda/Blas.cpp142-216

aten/src/ATen/cuda/CUDABlas.cpp246-253

Fused Activation Epilogues

Modern BLAS libraries support fused activation functions in GEMM operations:

  • Supported activations: None, ReLU, GELU (see GEMMAndBiasActivationEpilogue).
  • Eliminates the need for separate activation kernel launches.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp235-266 aten/src/ATen/native/cuda/Blas.cpp136-189

aten/src/ATen/native/cuda/Blas.cpp246-258

Configuration and Environment Controls

The linear algebra backend system provides extensive configuration options through environment variables and runtime APIs.

Environment Variable Controls

VariableDefaultDescription
PYTORCH_TUNABLEOP_ENABLED0Enable tunable operations
PYTORCH_TUNABLEOP_TUNING1Enable performance tuning
PYTORCH_TUNABLEOP_FILENAMEtunableop_results.csvTuning results file
CUBLASLT_WORKSPACE_SIZE1024 (KB)cuBLASLt workspace size
DISABLE_ADDMM_CUDA_LT0Disable cuBLASLt for addmm
PYTORCH_TUNABLEOP_ROCBLAS_ENABLED1Enable ROCm BLAS tuning
PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED1Enable hipBLASLt tuning

Runtime Configuration APIs

Python Interface:

import torch.cuda.tunable as tunable
tunable.enable(True)
tunable.set_max_tuning_duration(30)
tunable.set_filename("custom_results.csv")

C++ Interface:

auto tuning_ctx = at::cuda::tunable::getTuningContext();
tuning_ctx->EnableTunableOp(true);
tuning_ctx->SetMaxTuningDuration(30.0);

Sources:

torch/cuda/tunable.py177-298 aten/src/ATen/cuda/tunable/README.md135-150 aten/src/ATen/cuda/CUDABlas.cpp188-223