Cloud-Native AI Overview
Overview
Cloud-native AI is built on Baidu AI Cloud Container Engine (CCE) and provides sharing and isolation of GPU memory and compute resources. It integrates popular deep learning frameworks such as PaddlePaddle, TensorFlow, and PyTorch and, through efficient task orchestration and management, gives enterprises a low-barrier deep learning training service. The result is higher GPU utilization, faster AI training, and lower cost with better performance.
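As a minimal sketch of what the shared-GPU model looks like from the workload side, the snippet below creates a Pod that requests a slice of a shared GPU through the Kubernetes Python client. The extended-resource names `baidu.com/cgpu` and `baidu.com/cgpu_memory`, the image, and the quantities are placeholders rather than confirmed CCE identifiers; the actual resource names are defined by the GPU-sharing component installed in step 2 below.

```python
# Sketch: a Pod requesting a slice of a shared GPU (placeholder resource names).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="shared-gpu-demo"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={
                        # Placeholder extended resources; substitute the names
                        # exposed by the CCE GPU-sharing device plugin.
                        "baidu.com/cgpu": "1",
                        "baidu.com/cgpu_memory": "8",  # GiB of GPU memory
                    }
                ),
            )
        ]
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```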
Usage process
Step 1 (mandatory): Create a cluster of v1.18 or later and add nodes with GPU devices;
Step 2 (mandatory): Install the cloud-native AI components. For details, see [Component Overview](CCE/Operation guide/Component Management/Component Overview.md);
Step 3 (optional): Enable GPU memory sharing on GPU nodes;
Step 4 (mandatory): Create a queue, specify its resource quota, and associate users. For details, see [Create a New Queue](CCE/Operation guide/Cloud-native AI/Queue Management/Create Queue.md);
Step 5 (mandatory): Create and submit an AI training task. For details, see [Create a New Task](CCE/Operation guide/Cloud-native AI/Task Management/Create TensorFlow Task.md). A programmatic sketch of steps 4 and 5 follows this list.
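Steps 4 and 5 are performed in the console, but the underlying objects can also be created programmatically. The sketch below assumes a Volcano-style cluster-scoped `Queue` custom resource and a Kubeflow `TFJob` for the training task; CCE's actual CRD groups, schemas, and queue-binding fields may differ, so treat the names here as assumptions rather than the documented API.

```python
# Sketch: create a resource queue, then submit a TensorFlow training task to it.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Step 4 sketch: a queue with a resource quota (assumes a Volcano-style,
# cluster-scoped Queue CRD).
queue = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "Queue",
    "metadata": {"name": "team-a"},
    "spec": {
        "weight": 1,
        "capability": {"cpu": "32", "memory": "128Gi", "nvidia.com/gpu": "4"},
    },
}
api.create_cluster_custom_object(
    group="scheduling.volcano.sh", version="v1beta1", plural="queues", body=queue
)

# Step 5 sketch: a TensorFlow training task (assumes the Kubeflow TFJob CRD;
# how a job is bound to a queue depends on the installed scheduler integration).
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-demo", "namespace": "default"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",  # TFJob requires this name
                                "image": "tensorflow/tensorflow:2.11.0-gpu",
                                "command": ["python", "/train.py"],  # hypothetical script
                                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                            }
                        ]
                    }
                },
            }
        }
    },
}
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="tfjobs", body=tfjob,
)
```

Whether a submitted task is admitted depends on the queue's remaining quota; the two-worker job above would consume two of the four GPUs granted to `team-a` in this sketch.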
GPU/NPU support list
Sharing and isolation of GPU memory and compute are currently supported on the following GPU/NPU models (the list is not exhaustive). You can submit a ticket to ask about other models:
| GPU/NPU card model |
|---|
| NVIDIA V100 16GB SXM2 |
| NVIDIA V100 32GB SXM2 |
| NVIDIA T4 |
| NVIDIA A100 80GB SXM |
| NVIDIA A100 40GB SXM |
| NVIDIA A800 80GB |
| NVIDIA A30 |
| NVIDIA A10 |
| Kunlun Chip R200 |
