[Symptom] A monitoring alarm indicates that container memory usage is too high, and the service has restarted abnormally multiple times.

View the cpuUsage (CPU usage) metric of the monitored service. The following figure shows the monitoring information.

[Troubleshooting]
1. Log in to the node where the service is deployed and run the docker stats command to check the CPU usage of each container.
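As a sketch, a one-shot snapshot can be taken and sorted so the hottest containers appear first (the --format fields are standard docker stats template fields; the exact columns you pick are a matter of taste):

```shell
# One-shot snapshot of container resource usage (no live refresh),
# sorted by CPU% descending so the hottest containers come first.
docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemPerc}}" \
  | sort -k2 -hr \
  | head -5
```

sort -h copes well enough with values like "85.0%" because it reads the leading numeric part of the key.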

2. Run the docker exec -it <container ID> /bin/bash command to enter the container with high CPU usage.
Run the top command to check the CPU usage of each process in the container. If the top command fails to start, run the export TERM=dumb command first and then run top again.
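The step above can be sketched as follows (the container ID is a placeholder); top's batch mode (-b) needs no terminal at all, and the -o sort flag is available in procps-ng top:

```shell
# Enter the container flagged by `docker stats` (ID is a placeholder).
docker exec -it <container-id> /bin/bash

# Inside the container: if `top` aborts with a terminal error,
# fall back to a dumb terminal, or use batch mode, which needs no terminal.
export TERM=dumb
top -b -n 1 -o %CPU | head -n 15   # one iteration, sorted by CPU (procps-ng)
```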


3. Run the top -H -p <process ID> command to check the per-thread CPU usage of the process.

4. The preceding checks alone cannot determine why the CPU and memory usage of the Java process is high.
5. Locate the fault based on logs. Go to the log directory and check the service logs. The error message "GC overhead limit exceeded" appears.
This error means the JVM is spending almost all of its time on garbage collection (GC) while recovering very little memory. By default, the JVM throws it when GC takes more than 98% of total time and reclaims less than 2% of the heap.
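The 98%/2% thresholds correspond to the HotSpot flags GCTimeLimit and GCHeapFreeLimit, which can be inspected on the affected JVM as shown below. Raising them only hides the symptom; the real fix is still to reduce heap pressure:

```shell
# Print the effective GC-overhead thresholds of the local HotSpot JVM.
java -XX:+PrintFlagsFinal -version 2>/dev/null \
  | grep -E 'GCTimeLimit|GCHeapFreeLimit'
```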

6. The sorting directory contains 7.3 GB of preprocessed data, and a large backlog of preprocessing tasks has accumulated. As a result, the memory and CPU usage of the container is high.

[Solution]
Temporary solution: Delete the BDI shared data stored in the /user/gtsai_manas_gz_w/AI/foundation/other/sorting directory of the Hadoop cluster and clear data in the container.
Long-term solution: Process the backlogged tasks in batches instead of all at once.
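A minimal sketch of the batching idea, assuming the backlog is a directory of preprocessed files; BATCH, the BACKLOG_DIR variable, and process_one are all illustrative names, not part of the actual service:

```shell
# Drain the backlog in fixed-size batches instead of loading everything
# into memory at once; rerun (e.g. from cron) until the directory is empty.
BATCH=100
find "$BACKLOG_DIR" -type f | head -n "$BATCH" | while IFS= read -r task; do
  process_one "$task"   # hypothetical handler for one preprocessed file
done
```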

