BLAS Library Selection

PyTorch provides mechanisms to select the preferred BLAS library at runtime, allowing users to choose between different implementations based on their workload characteristics.

BLAS Backend Selection

BLAS Backend Selection Process

The BLAS backend selection process allows users to choose between different BLAS libraries at runtime. The preferred_blas_library() function can be used to set the preferred backend, and the blas_library_context() context manager can be used to temporarily change the backend for a specific code block.

Sources:

test/test_matmul_cuda.py65-72 aten/src/ATen/Context.cpp1000-1050

torch/backends/cuda/init.py1-100

Performance Considerations

When working with PyTorch’s linear algebra backends, several factors can significantly impact performance:

Data Type Selection:
- FP16/BF16 operations can be significantly faster on modern GPUs with tensor cores
- TF32 mode provides a good balance between precision and performance for FP32 operations
Memory Layout:
- Contiguous tensors generally perform better than strided ones
- Column-major layout may be more efficient for certain operations due to BLAS conventions
Batch Size Optimization:
- Using batched operations (bmm, baddbmm) is more efficient than loops of individual operations
- Very large batch sizes may require workspace size adjustments
Backend Selection:
- cuBLASLt/hipBLASLt generally perform better for operations supported by tensor cores
- Traditional cuBLAS/rocBLAS may be faster for small matrices or unusual shapes
Tuning System:
- Enable the tunable operations system for workloads with consistent matrix shapes
- Pre-tune operations during initialization to avoid runtime tuning overhead

Sources:

test/test_matmul_cuda.py82-143 aten/src/ATen/cuda/CUDABlas.cpp387-437 torch/cuda/tunable.py1-298

Linear Algebra Backends

This page documents the linear algebra backend system in PyTorch, which provides high-performance implementations of matrix and vector operations (such as GEMM, GEMV, and batched GEMM) across supported hardware and BLAS libraries. The system includes backend selection, data type and precision management, and a tunable operations system for automatic performance optimization.

For details on how these backends are invoked from the compilation pipeline, see TorchInductor. For device-specific execution details, see Device Backends.

System Architecture

PyTorch’s linear algebra backend system is structured to provide a unified interface for tensor operations, while abstracting over hardware-specific and library-specific optimizations. The system routes high-level operations to the appropriate backend implementation, with support for runtime backend selection and autotuning.

Diagram: High-Level Operation Routing to Backend Code Entities

Key Code Entities:

at::cuda::blas::gemm, at::cuda::blas::bgemm, at::cuda::blas::gemm_and_bias: Core dispatch points for matrix operations (

aten/src/ATen/cuda/CUDABlas.h - aten/src/ATen/native/cuda/Blas.cpp).

BlasBackend enum: Represents backend choices (
aten/src/ATen/Context.cpp).
TunableOp System, GemmTunableOp: Tunable operation infrastructure (
aten/src/ATen/cuda/tunable/TunableGemm.h).

Sources:

aten/src/ATen/native/cuda/Blas.cpp323-448 aten/src/ATen/cuda/CUDABlas.cpp367-563 aten/src/ATen/cuda/CUDABlas.h40-231 aten/src/ATen/cuda/tunable/TunableGemm.h29-239

Backend Implementation Layer

The core linear algebra operations are implemented through a template-based system that supports multiple data types and backend libraries.

Backend Implementation: CUDA and ROCm

The CUDA backend supports both cuBLAS and cuBLASLt libraries, with backend selection logic based on operation parameters, data types, and hardware capabilities. ROCm backends (rocBLAS, hipBLASLt) are supported on AMD GPUs.

Diagram: Backend Selection and Code Entities

Key Code Entities:

cublasCommonArgs: Matrix preparation and transpose logic (
aten/src/ATen/native/cuda/Blas.cpp142-216).
bgemm_internal_cublaslt(), bgemm_internal_cublas(): Batched GEMM implementations (
aten/src/ATen/cuda/CUDABlas.cpp367-563).
CuBlasLtWorkspace: Workspace management for cuBLASLt (
aten/src/ATen/cuda/CUDABlas.cpp246-253).

Sources:

aten/src/ATen/cuda/CUDABlas.cpp136-189 aten/src/ATen/cuda/CUDABlas.cpp367-563

aten/src/ATen/native/cuda/Blas.cpp142-216

Data Type and Precision Support

The linear algebra backends support a range of data types and precision modes, including mixed precision and hardware-specific formats.

Data Type	cuBLAS	cuBLASLt	ROCm BLAS	hipBLASLt	Mixed Precision / Special Modes
`float`	✓	✓	✓	✓	TF32 (Ampere+), FP32
`double`	✓	✓	✓	Limited	-
`at::Half`	✓	✓	✓	✓	Accumulate in FP32
`at::BFloat16`	✓	✓	✓	✓	Accumulate in FP32
`c10::complex<float>`	✓	✓	✓	Limited	-
`c10::complex<double>`	✓	✓	✓	Limited	-
`float8` (experimental)	-	-	-	✓	-

TF32 is available on NVIDIA Ampere and later GPUs for FP32 inputs.
Mixed precision: Half/BFloat16 inputs can use FP32 accumulation and output.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp387-437 aten/src/ATen/native/cuda/Blas.cpp360-402 aten/src/ATen/Context.cpp1000-1050

aten/src/ATen/cuda/tunable/GemmCommon.h106-215

Tunable Operations System

PyTorch provides a tunable operations system for linear algebra, which benchmarks and selects the fastest backend implementation for a given operation signature and input parameters. This system is especially relevant for GEMM and batched GEMM operations on ROCm and CUDA platforms.

Diagram: TunableOp System and Code Entities

Key Code Entities:

GemmParams<T>: Encapsulates operation parameters (
aten/src/ATen/cuda/tunable/GemmCommon.h283-377).
TuningContext: Manages tuning and caching (
aten/src/ATen/cuda/tunable/TunableGemm.h210-239).
Callable<GemmParams<T>>: Interface for backend implementations (
aten/src/ATen/cuda/tunable/TunableGemm.h29-43).

Sources:

aten/src/ATen/cuda/tunable/TunableGemm.h206-239 aten/src/ATen/cuda/tunable/GemmCommon.h31-377

torch/cuda/tunable.py1-196

Backend Registration and Environment Controls

The tunable operations system supports registration of multiple backend implementations, with selection and tuning controlled by environment variables and runtime APIs.

Diagram: Backend Registration and Environment Controls

Each backend provides a unique name and callable implementation.
Environment variables control which backends are enabled and whether tuning is active.

Sources:

aten/src/ATen/cuda/tunable/TunableGemm.h210-230 torch/cuda/tunable.py1-196

aten/src/ATen/cuda/tunable/GemmCommon.h31-375

Backend Selection Logic

Backend selection is performed dynamically based on hardware capabilities, operation parameters, and user or environment configuration.

Diagram: Backend Selection Decision Flow

The selection logic is implemented in blasPreferredBackend() and related helpers.
Platform checks and parameter evaluation determine the optimal backend for each operation.

Sources:

aten/src/ATen/native/cuda/Blas.cpp341-402 test/test_linalg.py55-70

torch/testing/_internal/common_cuda.py28-112

ROCm Backend Support

PyTorch supports AMD ROCm platforms with both rocBLAS and hipBLASLt libraries, including architecture-specific optimizations and mixed precision support.

rocblas_gemm_strided_batched_ex(): Used for advanced batched GEMM (
aten/src/ATen/cuda/CUDABlas.cpp650-660).
Architecture detection and selection logic for supported GCN architectures (
aten/src/ATen/cuda/CUDABlas.cpp19-65).
Mixed precision and TF32-like modes for MI300 and newer.
Integration with Composable Kernel (CK) for BFloat16 (
aten/src/ATen/native/hip/ck_bgemm.h).

Sources:

aten/src/ATen/cuda/CUDABlas.cpp19-65 aten/src/ATen/cuda/CUDABlas.cpp193-227 aten/src/ATen/cuda/CUDABlas.cpp647-660

test/test_linalg.py55-70

Mixed Precision and Data Type Handling

The linear algebra backends support mixed precision computation, allowing lower-precision inputs (Half, BFloat16) to be accumulated in higher precision, and enabling special hardware modes such as TF32.

Diagram: Data Type and Precision Routing

Precision control flags: allowFP16AccumulationCuBLAS(), allowBF16ReductionCuBLAS(), allowTF32CuBLAS(), setFloat32MatmulPrecision().
TF32 is available for FP32 inputs on Ampere+ GPUs.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp387-437 aten/src/ATen/Context.cpp1000-1050

test/test_matmul_cuda.py82-143

Performance Optimization Features

Workspace and Memory Layout Management

CublasLtWorkspace: Manages temporary workspace for cuBLASLt (
aten/src/ATen/cuda/CUDABlas.cpp246-253).
getCUDABlasLtWorkspaceSize(): Workspace size is configurable via environment variables.
Memory layout optimization: Automatic detection of optimal layouts, stride analysis, and transpose logic.

Diagram: Memory Layout and Transpose Handling

Sources: aten/src/ATen/native/cuda/Blas.cpp142-216

aten/src/ATen/cuda/CUDABlas.cpp246-253

Fused Activation Epilogues

Modern BLAS libraries support fused activation functions in GEMM operations:

Supported activations: None, ReLU, GELU (see GEMMAndBiasActivationEpilogue).
Eliminates the need for separate activation kernel launches.

Sources:

aten/src/ATen/cuda/CUDABlas.cpp235-266 aten/src/ATen/native/cuda/Blas.cpp136-189

aten/src/ATen/native/cuda/Blas.cpp246-258

Configuration and Environment Controls

The linear algebra backend system provides extensive configuration options through environment variables and runtime APIs.

Environment Variable Controls

Variable	Default	Description
`PYTORCH_TUNABLEOP_ENABLED`	0	Enable tunable operations
`PYTORCH_TUNABLEOP_TUNING`	1	Enable performance tuning
`PYTORCH_TUNABLEOP_FILENAME`	tunableop_results.csv	Tuning results file
`CUBLASLT_WORKSPACE_SIZE`	1024 (KB)	cuBLASLt workspace size
`DISABLE_ADDMM_CUDA_LT`	0	Disable cuBLASLt for addmm
`PYTORCH_TUNABLEOP_ROCBLAS_ENABLED`	1	Enable ROCm BLAS tuning
`PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED`	1	Enable hipBLASLt tuning

Runtime Configuration APIs

Python Interface:

import torch.cuda.tunable as tunable
tunable.enable(True)
tunable.set_max_tuning_duration(30)
tunable.set_filename("custom_results.csv")

C++ Interface:

auto tuning_ctx = at::cuda::tunable::getTuningContext();
tuning_ctx->EnableTunableOp(true);
tuning_ctx->SetMaxTuningDuration(30.0);

Sources:

torch/cuda/tunable.py177-298 aten/src/ATen/cuda/tunable/README.md135-150 aten/src/ATen/cuda/CUDABlas.cpp188-223

Quartz 4

Explorer

Linear_Algebra_Backends

BLAS Library Selection

BLAS Backend Selection

Performance Considerations

Linear Algebra Backends

System Architecture

Diagram: High-Level Operation Routing to Backend Code Entities

Backend Implementation Layer

Backend Implementation: CUDA and ROCm

Diagram: Backend Selection and Code Entities

Data Type and Precision Support

Tunable Operations System

Diagram: TunableOp System and Code Entities

Backend Registration and Environment Controls

Diagram: Backend Registration and Environment Controls

Backend Selection Logic

Diagram: Backend Selection Decision Flow

ROCm Backend Support

Mixed Precision and Data Type Handling

Diagram: Data Type and Precision Routing

Performance Optimization Features

Workspace and Memory Layout Management

Diagram: Memory Layout and Transpose Handling

Fused Activation Epilogues

Configuration and Environment Controls

Environment Variable Controls

Runtime Configuration APIs

Graph View

Table of Contents

Backlinks