Using AIAK-Training PyTorch Edition
Prerequisites
- The CCE cloud-native AI service has been successfully activated.
Supported version list
The PyTorch edition of AIAK-Training currently supports the versions listed below. If other versions are needed, submit a support ticket.
| CUDA version | PyTorch version | Python version |
|---|---|---|
| 11.7 | 1.12.0 | 3.8 |
Acceleration features
- For low-bandwidth network environments, communication optimization has been enhanced by adding a hierarchical AllReduce algorithm and enabling communication compression methods such as PowerSGD and FP16.
- The NVIDIA Apex AMP O2 mixed precision mode has been added, offering compatibility with the native Torch AMP workflow to facilitate transitioning more computations to FP16 for faster training.
- A fused optimizer is supported, integrating computation kernels to minimize overheads like memory access and kernel launches, improving performance during parameter updates.
- The LAMB optimizer algorithm is supported, addressing convergence challenges in ultra-large-batch training scenarios.
Practical workflow
1. Obtain the training image
In the “Baidu AI Cloud AI Images” section of CCR public images, select the “aiak-training” accelerated image as the base training image. This image comes pre-installed with CUDA, Python, PyTorch, AIAK-Training acceleration software, and other necessary components.

2. Task submission
Submit the task. For detailed steps, refer to the Create a New PyTorch Task document. During the process, set the image address to the address of the aforementioned aiak-training image.
3. Apply acceleration capabilities
3.1 Accelerate distributed training under low-bandwidth network conditions
In environments with limited network bandwidth, cross-machine gradient synchronization becomes a significant bottleneck for distributed training speed. To address this issue, AIAK-Training has implemented a hierarchical AllReduce algorithm to minimize the impact of cross-machine communication. Additionally, we have streamlined the use of communication compression hooks (such as PowerSGD and FP16) provided by the official PyTorch. Users can now activate these hooks directly via environment variables without modifying their code. Note: In scenarios where communication is not a bottleneck (e.g., single-machine multi-GPU setups or RDMA environments), these features may not provide acceleration benefits.
| Function | Environment variable | Default | Value description | Recommendation |
|---|---|---|---|---|
| Hierarchical AllReduce | AIAK_HIERARCHICAL_ALLREDUCE | 0 | | |
| Communication compression algorithm | AIAK_FP16_COMPRESSION | 0 | | |
| | AIAK_POWERSGD_COMPRESSION | 0 | | |
| | AIAK_MATRIX_APPROXIMATION_RANK | 1 | | |
| | AIAK_START_COMPRESS_ITER | 1000 | | |
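These variables are normally exported in the training job's environment or launch command. The Python sketch below is only illustrative: it assumes that setting a variable to 1 enables the corresponding feature and that the variables are set before distributed training is initialized; the variable names and defaults are taken from the table above.

```python
import os

# Illustrative only: enable hierarchical AllReduce together with FP16 gradient
# compression for a low-bandwidth multi-machine job. The same variables can be
# exported in the job's environment or launch command instead of being set here.
os.environ["AIAK_HIERARCHICAL_ALLREDUCE"] = "1"
os.environ["AIAK_FP16_COMPRESSION"] = "1"

# Alternatively, PowerSGD compression with a low approximation rank,
# starting compression after a warm-up period (values shown are the table defaults):
# os.environ["AIAK_POWERSGD_COMPRESSION"] = "1"
# os.environ["AIAK_MATRIX_APPROXIMATION_RANK"] = "1"
# os.environ["AIAK_START_COMPRESS_ITER"] = "1000"
```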
3.2 AMP O2 mixed precision mode for further computation acceleration
Native PyTorch offers AMP (automatic mixed precision), which users can leverage to improve model computation efficiency. Building on this, AIAK-Training incorporates the O2 mode from NVIDIA Apex AMP and makes it API-compatible with Torch AMP. Users only need to add a single line of code to enable this more aggressive pure-FP16 training mode. Follow these steps:
- After preparing the model and optimizer, add a single line of code that calls the aiak_amp_decorate function. When using DDP, make sure aiak_amp_decorate is called before the DDP model is constructed. (See line 5 in the example below.)
- When applying gradient clipping, replace model.parameters() in torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) with torch.cuda.amp.aiak_amp_parameters(optimizer). (See line 22 in the example below.)
1  # Define the model and optimizer
2  model = Model().cuda()
3  optimizer = optim.SGD(model.parameters(), ...)
4  # Add the aiak_amp_decorate wrapper call to enable Apex AMP O2 mode
5  model, optimizer = torch.cuda.amp.aiak_amp_decorate(model, optimizer)
6  # Construct the DDP model after O2 optimization has been initialized
7  model = DDP(model)
8  # Relevant code for the native torch AMP workflow (no modifications required)
9  scaler = GradScaler()
10
11 for input, target in data:
12     optimizer.zero_grad()
13     # Run the forward pass in the autocast-enabled region
14     with autocast():
15         output = model(input)
16         loss = loss_fn(output, target)
17     # Use the scaler to scale the loss (FP16) and run backward to get scaled gradients (FP16)
18     scaler.scale(loss).backward()
19     # When using gradient clipping, follow the native torch AMP calling rules and call unscale_ to restore the gradients
20     scaler.unscale_(optimizer)
21     # For gradient clipping, replace model.parameters() with torch.cuda.amp.aiak_amp_parameters(optimizer)
22     torch.nn.utils.clip_grad_norm_(torch.cuda.amp.aiak_amp_parameters(optimizer), max_norm)
23     # The scaler updates the parameters: it automatically unscales the gradients first; if NaN or Inf values exist, the update is skipped
24     scaler.step(optimizer)
25     # Update the scale factor
26     scaler.update()
Detailed parameters of the aiak_amp_decorate API:
| Parameter item | Required or not | Default value | Description |
|---|---|---|---|
| model | Yes | None | Model |
| optimizer | Yes | None | Optimizer |
| keep_batchnorm_fp32 | No | True | Whether to keep BatchNorm computation in FP32; generally no change is required. |
| loss_scale | No | None | |
| num_losses | No | 1 | |
| min_loss_scale | No | None | |
| max_loss_scale | No | 2.**24 | |
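For reference, a minimal sketch that passes the optional arguments explicitly; the keyword names and values are taken from the table above (they are the documented defaults), so in most cases this is unnecessary.

```python
# Sketch only: aiak_amp_decorate with the optional arguments spelled out,
# using the documented default values from the table above.
model, optimizer = torch.cuda.amp.aiak_amp_decorate(
    model,
    optimizer,
    keep_batchnorm_fp32=True,   # keep BatchNorm computation in FP32
    loss_scale=None,
    num_losses=1,
    min_loss_scale=None,
    max_loss_scale=2.**24,
)
```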
Additional information: when O2 mode is in use, the parameters of PyTorch’s native GradScaler class (such as backoff_factor, growth_factor, and growth_interval) are adjusted automatically for O2 mode. Users do not need to change them manually; manual changes will not take effect.
3.3 FusedOptimizer for faster parameter updates
To enable the optimizer fusion feature, users only need to set the following environment variable:
| Function | Environment variable | Default | Value description | Recommendation |
|---|---|---|---|---|
| Optimizer operator fusion | AIAK_FUSED_OPTIMIZER | 0 | | Accelerates the optimizer step phase; recommended to enable |
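As with the variables in section 3.1, this switch is normally exported in the job's environment or launch command; the Python form below is only an illustrative sketch and assumes the value 1 turns the feature on.

```python
import os

# Illustrative only: enable optimizer operator fusion (default is 0).
os.environ["AIAK_FUSED_OPTIMIZER"] = "1"
```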
3.4 Ultra-large BatchSize training: Enable the LAMB optimizer
The LAMB optimizer is used similarly to other optimizers. A specific example is as follows:
import torch.optim as optim
...
optimizer = optim.LAMB(model.parameters(), lr=args.lr, weight_decay=args.weight_decay, eps=args.eps, betas=args.betas)
Detailed parameters of the LAMB optimizer:
| Parameter item | Required or not | Default value | Description |
|---|---|---|---|
| params | Yes | None | Model parameters |
| lr | No | 1e-3 | Initial learning rate; users can configure it as needed |
| weight_decay | No | 0 | Weight decay coefficient; users can configure it as needed |
| betas | No | (0.9, 0.999) | |
| eps | No | 1e-8 | |
| adam_w_mode | No | True | |
| grad_averaging | No | True | |
| set_grad_none | No | True | |
| max_grad_norm | No | 1 | |
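For reference, a sketch that spells out the optional parameters from the table, assuming they are accepted as constructor keyword arguments under the names shown; the values are the documented defaults and usually do not need to be changed.

```python
import torch.optim as optim

# Sketch only: LAMB with the optional parameters from the table made explicit.
# Values are the documented defaults; adam_w_mode is assumed to select the
# AdamW-style (decoupled) weight decay.
optimizer = optim.LAMB(
    model.parameters(),
    lr=1e-3,
    weight_decay=0,
    betas=(0.9, 0.999),
    eps=1e-8,
    adam_w_mode=True,
    grad_averaging=True,
    set_grad_none=True,
    max_grad_norm=1,
)
```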
