Deploying Distributed Training Tasks Using AIAK-Training
Prerequisites
- The CCE cloud-native AI service has been successfully activated.
- CUDA 11.2, PyTorch 1.8.1, TensorFlow 2.5.0, and MXNet 1.8.0 are available. If your AI application requires other versions, please submit a support ticket.
Operation process
1. Select a training image
In the “Baidu AI Cloud AI Images” section of CCR public images, select the “Halo” accelerated image as the base training image. This image comes pre-installed with CUDA, Python 3, PyTorch, TensorFlow, MXNet, and AIAK-Training acceleration software.

2. Component installation
Click on the cluster name to enter the cluster page. Navigate to Component Management - Cloud-Native AI. Click to install the CCE Deep Learning Frameworks Operator component and CCE AI Job Scheduler, then click OK.

3. Task submission
Go to Cloud-Native AI - Task Management - Create New Task.
- Select the framework: AITrainingJob or Horovod.
- Choose the training method: Distributed.
- Select the role: Choose "Launcher" and define the elastic scaling range for the pod.

- Set up the necessary configurations under Pod Group Configuration - Lifecycle - Startup Command as follows.
Enable optimized hierarchical Allreduce
Activate this feature by adding the --nccl-hierarchical-allreduce parameter to the horovodrun command line:
```shell
horovodrun --nccl-hierarchical-allreduce [other-horovod-args] python [executable] [app-args]
```
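For reference, a fully written-out launch command might look like the sketch below. The process count, host list, script name (train.py), and application arguments are illustrative placeholders; in a CCE Launcher pod the host list may be injected by the operator, so adjust these to your own job.
```shell
# Hypothetical example: 2 workers x 4 GPUs with hierarchical Allreduce enabled
horovodrun -np 8 -H worker-0:4,worker-1:4 \
    --nccl-hierarchical-allreduce \
    python train.py --epochs 90 --batch-size 256
```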
Enable DGC sparse communication compression algorithm
```shell
horovodrun --dgc \
    --compress-ratio 0.001 \
    --sample-ratio 0.01 \
    --clip-grad-func norm/value \
    --clip-grad-func-params 0.01 \
    ... \
    [other-horovod-args] python [executable] [app-args]
```
Configurable parameters related to DGC are as follows:
| Parameter | Description | Required |
|---|---|---|
| compress_ratio | Compression rate. Float type, default value is 0.001. A smaller value means less gradient communication volume (manual configuration is recommended). | Yes |
| sample_ratio | Sample collection rate. Float type, default value is 0.01. | No |
| strided_sample | Bool type, default value is True. Indicates whether to collect samples by stride. | No |
| compress_upper_bound | Compression upper limit. Default value is 1.3. | No |
| compress_lower_bound | Compression lower limit. Default value is 0.8. | No |
| max_adaptation_iters | Maximum number of adaptive iterations. | No |
| resample | Bool type, default value is True. Indicates whether to re-sample. | No |
| fp16_values | Bool type, default value is False. Indicates whether to convert gradient element values to FP16. | No |
| int32_indices | Bool type, default value is False. Indicates whether to convert gradient element indices to INT32. | No |
| warmup_epochs | Number of warm-up (pre-training) epochs. | No |
| warmup_coeff | List of compression rates for the model warm-up phase. If there are multiple values, separate them with spaces on the command line. For example, warmup_epochs=3 and warmup_coeff=[0.5, 0.6, 0.7] mean that during the warm-up phase, 50% of gradients are used for communication in the first step, 60% in the second step, and 70% in the third step. | No |
| clip_grad_func | Gradient clipping method. Optional values are norm or value, corresponding to clip_grad_norm_ and clip_grad_value_ in PyTorch, respectively. | No |
| clip_grad_func_params | Parameters for the gradient clipping method. | No |
| momentum_masking | Bool type, default value is True. Avoids the problem of outdated gradient updates. | No |
Note: At present, the DGC compression algorithm only supports the SGD optimizer.
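As a point of reference, a filled-in DGC launch command could look like the sketch below. The process count, script name, and the specific compression and clipping values are illustrative only; the training script is assumed to use the SGD optimizer, as required above.
```shell
# Hypothetical example: DGC with 0.1% compression and norm-based gradient clipping
horovodrun -np 8 --dgc \
    --compress-ratio 0.001 \
    --sample-ratio 0.01 \
    --clip-grad-func norm \
    --clip-grad-func-params 1.0 \
    python train.py --epochs 90
```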
Enable overlap of parameter update and gradient communication
To overlap parameter updates with gradient communication, simply configure the horovodrun command line as follows.
```shell
horovodrun --overlap-backward-and-step \
    --clip-grad-func norm/value \
    --clip-grad-func-params 0.01 \
    ... \
    [other-horovod-args] python [executable] [app-args]
```
- If gradient clipping is needed before parameter updates, set the clip_grad_func and clip_grad_func_params parameters. In this scenario, remove the gradient clipping calls from your code (see the example command after this list).
- If both DGC and parameter update/gradient communication overlap are enabled, and a gradient clipping method is specified, the gradient clipping will be handled within DGC, and no clipping will occur during the overlap process.
- If parameter update and gradient communication overlap is not enabled, you do not need to set gradient clipping-related parameters (setting them will have no effect).
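A possible launch command for this mode is sketched below; the clipping threshold and script name are placeholders, and the training script is assumed to have had its explicit clip_grad_norm_ / clip_grad_value_ calls removed.
```shell
# Hypothetical example: overlap enabled, gradient clipping delegated to the launcher flags
horovodrun -np 8 --overlap-backward-and-step \
    --clip-grad-func norm \
    --clip-grad-func-params 5.0 \
    python train.py --epochs 90
# train.py is assumed not to call torch.nn.utils.clip_grad_norm_ itself.
```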
Enable FusedOptimizer
To enable the optimizer fusion feature, simply configure the horovodrun command line as follows.
```shell
horovodrun --fuse-optimizer \
    ... \
    [other-horovod-args] python [executable] [app-args]
```
Note: The fused optimizer feature is currently incompatible with the "parameter update and gradient communication overlap" feature.
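For completeness, a concrete command might look like the sketch below; the process count and script name are placeholders, and --overlap-backward-and-step is intentionally omitted because of the incompatibility noted above.
```shell
# Hypothetical example: fused optimizer enabled (not combined with --overlap-backward-and-step)
horovodrun -np 8 --fuse-optimizer \
    python train.py --epochs 90
```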
