Intelligent Diagnostics
Updated at: 2025-11-03
Function overview
This feature is intended for users who train large models themselves. It provides intelligent diagnostics based on large model training logs. For each training task, you can view the list of issues that intelligent diagnostics identified, including the identified type of each issue and its suggested solution.
Preparations
- Intelligent diagnostics requires training logs, so the logs must be collected and transmitted to a logstore. For specific operations, refer to Logstore, Collector, and Transmission Task.
- This feature currently supports mainly PyTorchJob. In the large model training logs, write the job ID and worker ID of the training task into the pod_name field. For example, a pod_name can include the job ID (llama-rdma) and the worker ID (worker-0).

- In the transmission task, make sure pod_name is selected in the container metadata collection settings of the source.

- Locate the logstore in the logstore list, click Edit in the operation column, then select Advanced Configuration - Log Content - Large Model Training Logs and save.
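
The pod_name requirement above can be illustrated with a minimal sketch. It assumes the job ID and worker ID are joined with a hyphen, as in the common PyTorchJob pod naming pattern `<job-id>-worker-<index>`; the source does not specify the exact layout, so the pattern, the `message` field, and the helper names `make_log_record` and `split_pod_name` are hypothetical and only for illustration.

```python
import json
import re

# Assumed layout (not confirmed by the product docs):
#   "<job-id>-<role>-<index>", for example "llama-rdma-worker-0".
POD_NAME_RE = re.compile(r"^(?P<job_id>.+)-(?P<worker_id>(?:master|worker)-\d+)$")


def make_log_record(job_id, worker_id, message):
    """Build a log record whose pod_name carries both the job ID and worker ID."""
    return {"pod_name": f"{job_id}-{worker_id}", "message": message}


def split_pod_name(pod_name):
    """Recover (job_id, worker_id) from a pod_name that follows the assumed layout."""
    match = POD_NAME_RE.match(pod_name)
    if match is None:
        raise ValueError(f"pod_name does not follow the assumed layout: {pod_name!r}")
    return match.group("job_id"), match.group("worker_id")


if __name__ == "__main__":
    record = make_log_record("llama-rdma", "worker-0", "training step finished")
    print(json.dumps(record))
    print(split_pod_name(record["pod_name"]))
```

A check like `split_pod_name` can be run against a few sample log lines before creating a diagnosis, to confirm that both identifiers are actually present in pod_name.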

Create intelligent diagnostics
- In the menu, select Log Service - Log Applications - Intelligent Diagnostics.
- Click the Create Intelligent Diagnostics button in the Training Tasks section to open the pop-up window.
- Fill in the ID of the training task to diagnose, the logstore that holds the logs of the training task, and the log time range to diagnose. Note:
  - The training task ID can be the job ID from the logs, which uniquely identifies the training task.
  - If the logstore cannot be found, refer to "Preparations" above and confirm that the "Log Content" configuration has been completed.
  - If the training task is still running, select logs from the last 1 or 2 hours for diagnosis; if the training task has stopped, select logs from several hours around the approximate stop time.

- Click OK to create the intelligent diagnostics record for the training task.
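
The time-range guidance in the note above can be sketched as a small helper for working out which range to enter in the pop-up window. This is only an illustration: the function name `diagnosis_window` and the default values (a 2-hour lookback for running tasks, a 3-hour margin around the stop time) are assumptions, since the source only says "the last 1 or 2 hours" and "several hours".

```python
from datetime import datetime, timedelta, timezone


def diagnosis_window(now, stopped_at=None, lookback_hours=2, margin_hours=3):
    """Return a (start, end) log time range for a diagnosis.

    Running task (stopped_at is None): the last `lookback_hours` hours.
    Stopped task: `margin_hours` hours on each side of the approximate
    stop time, with the end clipped so it never lies in the future.
    """
    if stopped_at is None:
        return now - timedelta(hours=lookback_hours), now
    start = stopped_at - timedelta(hours=margin_hours)
    end = min(stopped_at + timedelta(hours=margin_hours), now)
    return start, end


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("running task:", diagnosis_window(now))
    print("stopped task:", diagnosis_window(now, stopped_at=now - timedelta(hours=1)))
```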
View intelligent diagnosis results
- After you create an intelligent diagnostics record, the record list shows the status of the diagnosis and the total number of issues found. Click View Details in the row to see the details of each identified issue (if any), including the fault type, risk level, affected worker node, and recommended solution.

- After addressing the issue according to the recommended solution, you can mark this issue as resolved.
Other operations
- To view past diagnosis results, search by diagnosis time range or training task ID in the training task diagnosis list.
- If you no longer need to keep a diagnosis record, click Delete in the corresponding row. Once deleted, the diagnosis information cannot be viewed again, but the related log data will not be cleared.
