Version 6.5.RC2
Question 1: In a New Deployment Environment, Only One P4 Card Can Be Allocated to a Task Even Though Multiple P4 Cards Are Available
1.1 Problem Description:
The GPU server shows multiple P4 cards, but only one P4 card can be allocated to a task. If more than one P4 card is allocated, the task fails and an error message is displayed.
1.2 Possible Causes:
1. GPU resources are occupied by other processes, so the GPU resources available for allocation are insufficient.
2. GPU configuration parameters of cluster nodes are incorrect.
1.3 Fault Locating:
1. On the GPU server, check the background processes to see whether any process is running on the GPUs.
2. Check GPU configuration parameters of nodes in the cluster.
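The first check can be scripted. The following is a minimal Python sketch, assuming the GPU server has the NVIDIA driver installed and nvidia-smi is on the PATH. The helper names parse_compute_apps and list_gpu_processes are illustrative and are not part of the platform.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY_FIELDS = "gpu_uuid,pid,process_name,used_memory"


def parse_compute_apps(csv_text):
    """Parse CSV rows produced by nvidia-smi --query-compute-apps."""
    processes = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        gpu_uuid, pid, name, used_mib = (field.strip() for field in row)
        processes.append({
            "gpu_uuid": gpu_uuid,
            "pid": int(pid),
            "name": name,
            # Kept as text because the driver may report a non-numeric value.
            "used_memory_mib": used_mib,
        })
    return processes


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes currently on the GPUs."""
    output = subprocess.run(
        ["nvidia-smi",
         f"--query-compute-apps={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(output)
```

Calling list_gpu_processes() on the GPU server returns one entry per compute process. An empty list suggests that no other process is holding the GPUs, in which case the configuration parameters in step 2 are the more likely cause.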
1.4 Solution:
1. Log in to FusionInsight Manager and choose Service Management > Yarn > Service Configuration > All Configuration.
2. Enter gpu in the search box and modify the yarn.scheduler.maximum-allocation-gpus parameter. The default value is 1. Change it to a value greater than the number of P4 cards configured on the platform.
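To illustrate why the default value produces the failure described in 1.1, here is a minimal Python sketch. It assumes that the scheduler rejects any task whose GPU request exceeds yarn.scheduler.maximum-allocation-gpus. The function name gpu_request_allowed is hypothetical, and the code does not call any Yarn API.

```python
def gpu_request_allowed(requested_gpus, max_allocation_gpus=1):
    """Return True if a task's GPU request fits within the per-task limit.

    max_allocation_gpus stands in for yarn.scheduler.maximum-allocation-gpus
    (default 1). The rejection rule is an assumption made for illustration.
    """
    if requested_gpus < 0:
        raise ValueError("requested_gpus must not be negative")
    return requested_gpus <= max_allocation_gpus


print(gpu_request_allowed(1))     # True: one card fits the default limit
print(gpu_request_allowed(2))     # False: two cards exceed the default of 1
print(gpu_request_allowed(2, 4))  # True: the limit was raised to 4
```

Under this assumption, a task requesting two P4 cards can only be scheduled after the parameter has been raised.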
Question 2: The Task Takes Too Long to Start
2.1 Problem Description:
The algorithm is deployed on the FusionMind platform, but the task takes more than 10 minutes to start. In addition, logs cannot be viewed during startup, and the system displays a message indicating that the operation failed.
2.2 Possible Causes:
1. The FusionMind platform service is abnormal.
2. When a task is created, the allocated resources exceed the remaining resources in the resource pool.
2.3 Fault Locating:
1. Log in to the Manager platform and check the status of each service. Ensure that all services are running properly and that the bash test algorithm runs successfully.
2. Stop the task and compare the resources allocated to it with the remaining resources in the resource pool. In this case, the memory allocated to the task exceeded the remaining memory in the resource pool.
2.4 Solution:
Create the task again. Ensure that the CPU, memory, and GPU resources allocated to the task do not exceed the remaining resources in the resource pool.
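The comparison can be expressed as a short check before the task is created. The following is a minimal Python sketch. The function name find_overallocated, the resource keys, and the numbers are all hypothetical, and the values would have to be read manually from the resource pool page because no platform API is assumed.

```python
def find_overallocated(requested, remaining):
    """Return resources whose requested amount exceeds what remains in the pool.

    Both arguments map a resource name to an amount. The result maps each
    over-allocated resource to the size of the shortfall.
    """
    shortfalls = {}
    for resource, amount in requested.items():
        available = remaining.get(resource, 0)  # unknown resource counts as 0
        if amount > available:
            shortfalls[resource] = amount - available
    return shortfalls


# Hypothetical values: the task asks for more memory than the pool has left.
requested = {"cpu_cores": 4, "memory_gb": 64, "gpus": 1}
remaining = {"cpu_cores": 16, "memory_gb": 32, "gpus": 2}
print(find_overallocated(requested, remaining))  # {'memory_gb': 32}
```

An empty result means that every requested amount fits within the remaining pool. Any entry in the result names a resource whose allocation should be reduced before the task is created.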