CCE Deep Learning Frameworks Operator Description
Updated at: 2025-10-27
Component introduction
The CCE Deep Learning Frameworks Operator integrates with the CCE AI Job Scheduler, allowing you to train deep learning models directly on CCE.
Component function
The component integrates mainstream deep learning frameworks and provides out-of-the-box support for submitting deep learning tasks (a submission sketch follows the list below). The following frameworks are currently supported:
1. TensorFlow (TFJob)
2. PyTorch (PyTorchJob)
3. MXNet (MXJob)
4. PaddlePaddle (PaddleJob)
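For illustration, the sketch below submits a small PyTorchJob with the official Kubernetes Python client. It assumes the component exposes the standard Kubeflow `kubeflow.org/v1` PyTorchJob CRD; the job name, namespace, image, and training command are placeholders to replace with your own.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig of the target CCE cluster.
config.load_kube_config()
api = client.CustomObjectsApi()

# A 1-master / 2-worker PyTorchJob; the image and command are placeholders.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo-pytorchjob", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "registry.example.com/pytorch-train:latest",
                    "command": ["python", "/workspace/train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "registry.example.com/pytorch-train:latest",
                    "command": ["python", "/workspace/train.py"],
                }]}},
            },
        }
    },
}

# Submit the custom resource; the operator creates and manages the pods.
api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    body=pytorch_job,
)
```

The same pattern applies to TFJob, MXJob, and PaddleJob by changing the `kind`, the `plural`, and the replica spec key.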
Application scenarios
Run deep learning tasks directly on CCE clusters to boost AI engineering efficiency.
Limitations
- Only Kubernetes clusters of version v1.18 or later are supported (a quick version check is sketched below).
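To confirm the cluster version, a minimal check with the Kubernetes Python client (assuming the same kubeconfig as in the submission sketch above) looks like this:

```python
from kubernetes import client, config

# Query the API server's /version endpoint and print major.minor.
config.load_kube_config()
version_info = client.VersionApi().get_code()
print(f"Kubernetes server version: {version_info.major}.{version_info.minor}")
```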
Install component
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation bar.
- Click the target cluster name on the Cluster List page to go to the cluster management page.
- On the Cluster Management page, click O&M & Management - Component Management.
- From the component management list, select the CCE Deep Learning Frameworks Operator component and click Install.

- Frameworks: four deep learning frameworks are currently supported: TensorFlow, PyTorch, MXNet, and PaddlePaddle. After installation, you can verify that the framework CRDs are registered, as sketched below.
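A minimal post-install verification sketch, assuming the component registers the upstream Kubeflow CRD names under the `kubeflow.org` group (the exact CRD names installed by the component may differ):

```python
from kubernetes import client, config

config.load_kube_config()

# List all CRDs on the cluster and compare against the training-job kinds
# the component is expected to provide.
crds = client.ApiextensionsV1Api().list_custom_resource_definition()
installed = {crd.metadata.name for crd in crds.items}

expected = {
    "tfjobs.kubeflow.org",
    "pytorchjobs.kubeflow.org",
    "mxjobs.kubeflow.org",
    "paddlejobs.kubeflow.org",
}

missing = expected - installed
if missing:
    print(f"Missing CRDs: {sorted(missing)}")
else:
    print("All framework CRDs are registered")
```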
Version records
| Version No. | Cluster version compatibility | Update time | Update content | Impact |
|---|---|---|---|---|
| 1.6.23 | CCE/v1.18+ | 09/11/2024 | New functions: support training hang detection and alerting; automatically inject SSH password-free login configuration for MPIJob. Optimizations: optimize the PyTorchJob task timeline field; improve fault tolerance when a NIC up/down failure occurs on the node where the task runs | This upgrade will not affect services |
| 1.6.22 | CCE/v1.18+ | 08/28/2024 | New functions: support stopping tasks in the queuing/starting status. Optimizations: display more accurate task statuses; fix the issue where the MPIJob launcher retries several times when workers have not started | This upgrade will not affect services |
| 1.6.21 | CCE/v1.18+ | 07/22/2024 | MPIJob supports configuring RDMA affinity policies via task labels; ftagent-exporter performance metrics adapt to P800 clusters; optimized job barrier with improved master and worker exit timeout mechanisms; add a switch to forcibly delete pods that remain in terminating status for a long time after fault tolerance is triggered; internal fault tolerance interaction logic with K8S is modified to use informer | |
| 1.6.20 | CCE/v1.18+ | 05/29/2024 | Fix the issue where ftagent occupied port 8080 | |
| 1.6.19 | CCE/v1.18+ | 05/29/2024 | training-operator enables job-barrier by default, and job-barrier supports fault tolerance preemption; training-operator supports asynchronous cleanup of services for tasks in terminal status; training-operator supports adding a stop condition and a stop init container; fix repeated creation/deletion of pod services under special circumstances | |
| 1.6.18 | CCE/v1.18+ | 05/17/2024 | The ftagent exporter is compatible with AIAK 2.0 images and supports loss metrics; the backtracking time is updated to 60 s; it is designed for fault tolerance and optimized for NIC up/down scenarios | |
| 1.6.17 | CCE/v1.18+ | 04/12/2024 | The ftagent exporter metric has added tags for pod_name and job_name | |
| 1.6.16 | CCE/v1.18+ | 03/11/2024 | Support collection and reporting of training task performance metrics: throughput performance metrics and phase-specific time-consuming metrics | |
| 1.6.15 | CCE/v1.18+ | 02/26/2024 | When tasks are preempted and trigger fault-tolerant rescheduling, PyTorchJob is set to the restarting status; fix the bug where PyTorchJob lacks the created status and the bug where PyTorchJob remains in the running status when pods fail; add the ftagent exporter to expose collective communication bandwidth metrics | |
| 1.6.14 | CCE/v1.18+ | 02/06/2024 | Add task event timeline, fault tolerance events, and TensorBoard GC | |
| 1.6.13 | CCE/v1.18+ | 01/17/2024 | Add fault tolerance optimizations for master/worker node not ready scenarios | |
| 1.6.12 | CCE/v1.18+ | 12/18/2023 | Add support for priority preemption in Training-Operator and Mpi-Operator; bugfix: Mpi-Operator fixes pod creation freezing during frequent creation/deletion of tasks with the same name | |
| 1.6.11 | CCE/v1.18+ | 12/04/2023 | Add automatic fault tolerance to cover node not ready scenarios; | |
| 1.6.10 | CCE/v1.18+ | 11/22/2023 | Add fault tolerance to support master-worker mode | |
| 1.6.9 | CCE/v1.18+ | 11/03/2023 | Add decoupling of hang detection from etcd; add validation for invalid task names; add validation for task name length (not exceeding 50 characters); bugfix: ft-agent supports Pods with restart policy OnFailure; fix occasional creation failures of c10d jobs; add fault tolerance switch; support fault tolerance for task hangs; v1.6.9 does not deploy etcd; the new version of fault tolerance is connected to the console; training-operator disables job barrier by default; | |
| 1.6.8 | CCE/v1.18+ | 10/10/2023 | The fault tolerance function is reconstructed to resolve the issue where resources cannot be released after task failures | |
| 1.6.6 | CCE/v1.18+ | 08/25/2023 | PyTorchJob supports hang detection; MPIJob supports hostfile injection for worker nodes and task stopping; fix failure to create Pods when training-operator deletes and creates tasks quickly; add maximum retry count for PyTorchJob initContainer; training-operator cleans up training processes when ftagent exits due to task failure | |
| 1.6.5 | CCE/v1.18+ | 07/07/2023 | Upgrade PyTorchJob to support fault tolerance for downtime | |
| 1.6.4 | CCE/v1.18+ | 07/05/2023 | The MPI Operator specifies gang scheduling, with the PodGroup maintained by the operator instead | |
| 1.6.3 | CCE/v1.18+ | 06/27/2023 | Add MPI Operator and Paddle Operator; support job stopping; training operator can expose job status via exporter | |
| 1.6.1 | CCE/v1.18+ | 05/30/2023 | Upgrade PyTorchJob to support fault tolerance for hardware failures (GPU and NIC) | |
| 0.3.0 | CCE/v1.18+ | 05/12/2022 | Upgrade to the training operator and merge the PyTorch/TensorFlow/MXNet operators | One-click upgrade is not supported; uninstall the old version of the plugin and then reinstall it |
| 0.2.1 | CCE/v1.18+ | 03/02/2022 | Add: AiTrainingJob Webhook | |
| 0.2.0 | CCE/v1.18+ | 01/21/2022 | Add: AI Training Operator | |
| 0.1.0 | CCE/v1.18+ | 05/31/2021 | First release | - |
