Precautions for Disabling Node Video Memory Sharing
1. Instructions for disabling the node memory-sharing function
Before memory sharing can be disabled on a node where it is enabled, the system checks whether any memory-sharing tasks are still running on that node. The function can only be disabled after all memory-sharing tasks on the node have completed. Disabling it earlier can interrupt running memory-sharing tasks or prevent resources from being reclaimed properly when tasks finish, which affects subsequent tasks on the node.
Use the commands in Steps 2 and 3 to locate nodes with memory sharing enabled and verify the status of memory-sharing tasks on each node.

2. Query nodes with memory sharing enabled
kubectl get nodes -l cce.baidubce.com/gpu-share-device-plugin=enable
This command lists all nodes with memory sharing enabled, that is, every node labeled with cce.baidubce.com/gpu-share-device-plugin:enable.
3. Query pods running memory-sharing tasks
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.phase=="Running") | select(.spec.containers[].resources.limits // empty | keys[] // empty | test("baidu.com/.*(_core|_memory|_memory_percent)$")) | "\(.metadata.name) \(.spec.nodeName)"' | sort | uniq
This command lists all running Pods that utilize memory sharing along with their host nodes. The memory-sharing function cannot be immediately disabled on these nodes; it can only be disabled once these Pods have finished running.
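When many Pods are listed, a per-node summary makes it easier to see which nodes are still busy. The following is a minimal sketch, not part of the product tooling: the function name count_sharing_pods_by_node is hypothetical, and it assumes its input is the "pod-name node-name" lines that the command above prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pod node" lines on stdin (the output format of
# the query above) and print one line per node as "<node> <pod count>".
count_sharing_pods_by_node() {
  # Count occurrences of the second field (node name), skipping malformed lines.
  awk 'NF >= 2 { count[$2]++ } END { for (n in count) print n, count[n] }' | sort
}
```

To use it, pipe the output of the query above into count_sharing_pods_by_node. A node that does not appear in the summary has no running memory-sharing Pods at the time of the query.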
4. Risks of modifying node labels via commands
Besides using the console, you can disable a node’s memory sharing function by running the kubectl label nodes command to change the node label to cce.baidubce.com/gpu-share-device-plugin:disable. Review the following risks before making this change:
- Service interruption: Modifying node labels triggers the installation of a non-shared environment, which interrupts running memory-sharing tasks or prevents resources from being reclaimed properly when tasks end. This impacts subsequent tasks on the node.
- Scheduling failure: After modifying labels, the scheduler might assign non-shared tasks to the node. Running both task types on the same machine could lead to issues like incorrect GPU detection or abnormal shared memory behavior.
To mitigate these risks, ensure the node labels in the cluster correctly represent the node's capabilities, and only disable memory sharing on a node after confirming there are no active memory-sharing tasks running on it.
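The check-then-disable sequence described above can be sketched as a small guard around kubectl label nodes. This is an illustrative sketch under stated assumptions, not an official procedure: the function name disable_sharing_if_idle is hypothetical, and it expects the "pod-name node-name" lines from the Step 3 query on stdin. The --overwrite flag is needed because the label already exists with the value enable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: relabel NODE to disable memory sharing only if NODE
# does not appear in the "pod node" list read from stdin (Step 3 output).
disable_sharing_if_idle() {
  local node="$1"
  local busy
  # Collect the names of Pods whose node column matches the target node.
  busy=$(awk -v n="$node" '$2 == n { print $1 }')
  if [ -n "$busy" ]; then
    echo "Refusing to relabel $node; memory-sharing Pods still running:" >&2
    echo "$busy" >&2
    return 1
  fi
  # No running memory-sharing Pods were listed for this node: switch the label.
  kubectl label nodes "$node" cce.baidubce.com/gpu-share-device-plugin=disable --overwrite
}
```

To use it, pipe the Step 3 output into disable_sharing_if_idle followed by the node name. The check and the relabel are not atomic, so a new memory-sharing Pod could still be scheduled to the node between the two steps; rerun the Step 3 query immediately before relabeling.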
