CCE Deep Learning Frameworks Operator Description
Updated at: 2025-10-27
Component introduction
The CCE Deep Learning Frameworks Operator integrates with the CCE AI Job Scheduler, allowing you to train deep learning models directly on CCE.
Component function
The component integrates mainstream deep learning frameworks and provides out-of-the-box support for submitting deep learning tasks (a submission sketch follows the list below). The following frameworks are currently supported:
1. TensorFlow (TFJob)
2. PyTorch (PyTorchJob)
3. MXNet (MXJob)
4. PaddlePaddle (PaddleJob)
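For illustration, the sketch below submits a small PyTorchJob with the official Kubernetes Python client. It assumes the component exposes the standard Kubeflow `kubeflow.org/v1` PyTorchJob CRD; the job name, namespace, image, and training command are placeholders to replace with your own.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig of the target CCE cluster.
config.load_kube_config()
api = client.CustomObjectsApi()

# A 1-master / 2-worker PyTorchJob; the image and command are placeholders.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo-pytorchjob", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "registry.example.com/pytorch-train:latest",
                    "command": ["python", "/workspace/train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "registry.example.com/pytorch-train:latest",
                    "command": ["python", "/workspace/train.py"],
                }]}},
            },
        }
    },
}

# Submit the custom resource; the operator creates and manages the pods.
api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    body=pytorch_job,
)
```

The same pattern applies to TFJob, MXJob, and PaddleJob by changing the `kind`, the `plural`, and the replica spec key.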
Application scenarios
Run deep learning tasks directly on CCE clusters to boost AI engineering efficiency.
Limitations
- Only Kubernetes clusters of version v1.18 or later are supported (a quick version check is sketched below).
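To confirm the cluster version, a minimal check with the Kubernetes Python client (assuming the same kubeconfig as in the submission sketch above) looks like this:

```python
from kubernetes import client, config

# Query the API server's /version endpoint and print major.minor.
config.load_kube_config()
version_info = client.VersionApi().get_code()
print(f"Kubernetes server version: {version_info.major}.{version_info.minor}")
```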
Install component
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation bar.
- Click the target cluster name on the Cluster List page to go to the cluster management page.
- On the Cluster Management page, click O&M & Management - Component Management.
- From the component management list, select the CCE Deep Learning Frameworks Operator component and click Install.

- Frameworks: four deep learning frameworks are currently supported: TensorFlow, PyTorch, MXNet, and PaddlePaddle. After installation, you can verify that the framework CRDs are registered, as sketched below.
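A minimal post-install verification sketch, assuming the component registers the upstream Kubeflow CRD names under the `kubeflow.org` group (the exact CRD names installed by the component may differ):

```python
from kubernetes import client, config

config.load_kube_config()

# List all CRDs on the cluster and compare against the training-job kinds
# the component is expected to provide.
crds = client.ApiextensionsV1Api().list_custom_resource_definition()
installed = {crd.metadata.name for crd in crds.items}

expected = {
    "tfjobs.kubeflow.org",
    "pytorchjobs.kubeflow.org",
    "mxjobs.kubeflow.org",
    "paddlejobs.kubeflow.org",
}

missing = expected - installed
if missing:
    print(f"Missing CRDs: {sorted(missing)}")
else:
    print("All framework CRDs are registered")
```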
Version records
| Version No. | Cluster version compatibility | Update time | Update content | Impact |
|---|---|---|---|---|
| 1.6.23 | CCE/v1.18+ | 09/11/2024 | New functions: support training hang detection and alerting; automatically inject SSH password-free login configuration for MPIJob. Optimizations: optimize the PyTorchJob task timeline field; improve fault tolerance when a NIC up/down failure occurs on the node where the task runs | This upgrade will not affect services |
| 1.6.22 | CCE/v1.18+ | 08/28/2024 | New functions: support stopping tasks in the queuing/starting status. Optimizations: display more accurate task statuses; fix the issue where the MPIJob launcher retries several times when workers have not started | This upgrade will not affect services |
| 1.6.21 | CCE/v1.18+ | 07/22/2024 | MPIJob supports configuring RDMA affinity policies via task labels; ftagent-exporter performance metrics adapt to P800 clusters; optimized job barrier with improved master and worker exit timeout mechanisms; add a switch to forcibly delete pods that remain in terminating status for a long time after fault tolerance is triggered; internal fault tolerance interaction logic with K8S is modified to use informer | |
| 1.6.20 | CCE/v1.18+ | 05/29/2024 | Fix the issue where ftagent occupied port 8080 | |
| 1.6.19 | CCE/v1.18+ | 05/29/2024 | training-operator enables job-barrier by default, and job-barrier supports fault tolerance preemption; training-operator supports asynchronous cleanup of services for tasks in terminal status; training-operator supports adding a stop condition and a stop init container; fix repeated creation/deletion of pod services under special circumstances | |
| 1.6.18 | CCE/v1.18+ | 05/17/2024 | The ftagent exporter is compatible with AIAK 2.0 images and supports loss metrics; the backtracking time is updated to 60 s; it is designed for fault tolerance and optimized for NIC up/down scenarios | |
| 1.6.17 | CCE/v1.18+ | 04/12/2024 | The ftagent exporter metric has added tags for pod_name and job_name | |
| 1.6.16 | CCE/v1.18+ | 03/11/2024 | Support collection and reporting of training task performance metrics: throughput performance metrics and phase-specific time-consuming metrics | |
| 1.6.15 | CCE/v1.18+ | 02/26/2024 | When tasks are preempted and trigger fault-tolerant rescheduling, PyTorchJob is set to the restarting status; fix the bug where PyTorchJob lacks the created status and the bug where PyTorchJob remains in the running status when pods fail; add the ftagent exporter to expose collective communication bandwidth metrics | |
| 1.6.14 | CCE/v1.18+ | 02/06/2024 | Add task event timeline, fault tolerance events, and TensorBoard GC | |
| 1.6.13 | CCE/v1.18+ | 01/17/2024 | Add fault tolerance optimizations for master/worker node not ready scenarios | |
| 1.6.12 | CCE/v1.18+ | 12/18/2023 | Add support for priority preemption in Training-Operator and Mpi-Operator; bugfix: Mpi-Operator fixes pod creation freezing during frequent creation/deletion of tasks with the same name | |
| 1.6.11 | CCE/v1.18+ | 12/04/2023 | Add automatic fault tolerance to cover node not ready scenarios; | |
| 1.6.10 | CCE/v1.18+ | 11/22/2023 | Add fault tolerance to support master-worker mode | |
| 1.6.9 | CCE/v1.18+ | 11/03/2023 | Add decoupling of hang detection from etcd; add validation for invalid task names; add validation for task name length (not exceeding 50 characters); bugfix: ft-agent supports Pods with restart policy OnFailure; fix occasional creation failures of c10d jobs; add fault tolerance switch; support fault tolerance for task hangs; v1.6.9 does not deploy etcd; the new version of fault tolerance is connected to the console; training-operator disables job barrier by default; | |
| 1.6.8 | CCE/v1.18+ | 10/10/2023 | The fault tolerance function is reconstructed to resolve the issue where resources cannot be released after task failures | |
| 1.6.6 | CCE/v1.18+ | 08/25/2023 | PyTorchJob supports hang detection; MPIJob supports hostfile injection for worker nodes and task stopping; fix failure to create Pods when training-operator deletes and creates tasks quickly; add maximum retry count for PyTorchJob initContainer; training-operator cleans up training processes when ftagent exits due to task failure | |
| 1.6.5 | CCE/v1.18+ | 07/07/2023 | Upgrade PyTorchJob to support fault tolerance for downtime | |
| 1.6.4 | CCE/v1.18+ | 07/05/2023 | The MPI Operator specifies gang scheduling, with the PodGroup maintained by the operator instead | |
| 1.6.3 | CCE/v1.18+ | 06/27/2023 | Add MPI Operator and Paddle Operator; support job stopping; training operator can expose job status via exporter | |
| 1.6.1 | CCE/v1.18+ | 05/30/2023 | Upgrade PyTorchJob to support fault tolerance for hardware failures (GPU and NIC) | |
| 0.3.0 | CCE/v1.18+ | 05/12/2022 | Upgrade to the training operator and merge the PyTorch/TensorFlow/MXNet operators | One-click upgrade is not supported; uninstall the old version of the plugin and then reinstall it |
| 0.2.1 | CCE/v1.18+ | 03/02/2022 | Add: AiTrainingJob Webhook | |
| 0.2.0 | CCE/v1.18+ | 01/21/2022 | Add: AI Training Operator | |
| 0.1.0 | CCE/v1.18+ | 05/31/2021 | First release | - |
