1 Basic Information
| Field | Value |
| --- | --- |
| SR ticket No. | |
| Accident Description | During an inspection of the OceanStor 9000, the customer found a coredump file generated by the snas_rep process on one of the cluster nodes. |
| Accident Time | 2019/03/20 |
| Product | OceanStor 9000 V300R006C20SPC200 |
2 Problem Description
During an inspection of the OceanStor 9000, the customer found a coredump file generated by the snas_rep process on one of the cluster nodes. The root cause needs to be identified.
3 Problem Analysis
The inspection report shows that the coredump file is on the node with back-end IP 10.68.10.17.
Log in to the node with PuTTY and decompress the coredump file. The file is small before decompression (330 MB) but very large afterwards (more than 100 GB). This suggests that the coredump was caused by a thread leak that is a known issue in this version.
Run the "rnm shownodeinfo" command to query the remote replication master node of the cluster. The result shows that the master node ID is 10. Log in to that node (the back-end IP corresponding to node 10 can be obtained with cat /proc/monc_nodemap). On the master node, query the number of threads used by the snas_rep process with ps -eLF | grep snas_rep | wc -l. The result is more than 3000, which confirms the thread leak.
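The thread count check above can also be sketched as a small shell helper. This is only an illustration, assuming a Linux node where /proc/<pid>/task lists one entry per thread and where pgrep is available; the function names count_threads and threads_by_name are hypothetical and not part of the product. Unlike ps -eLF | grep snas_rep | wc -l, this approach does not count the grep process itself.

```shell
#!/bin/sh
# Hypothetical helper: print the number of threads of one process.
# Each thread appears as one entry under /proc/<pid>/task on Linux.
count_threads() {
    ls "/proc/$1/task" | wc -l
}

# Hypothetical helper: print "<pid> <thread count>" for every process
# whose name is exactly $1 (for example snas_rep).
threads_by_name() {
    for pid in $(pgrep -x "$1"); do
        echo "$pid $(count_threads "$pid")"
    done
}
```

For example, threads_by_name snas_rep would print one line per snas_rep process, which can then be compared with the value of more than 3000 seen in this case.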
4 Root Cause
During each remote replication pair synchronization, a thread pool is created to query the internal replication task database. The thread pool has a mechanism that automatically releases its resources after 15 minutes without use (the last-use time is updated on every use).
In the current live-network version, some threads are not handled correctly when the remote replication database thread pool is released automatically, and 10 threads leak each time. After the system runs for a long time, the leaked threads cause the remote replication process to occupy too much virtual memory, so virtual memory allocation fails when a new thread is created.
As a result, thread creation fails when the remote replication database thread pool is initialized. The code then enters an error branch that handles the thread pool abnormally, which causes the snas_rep process to generate a coredump.
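The growth in threads and virtual memory described above can be watched on a running node. The following is a minimal sketch, assuming a Linux node that exposes the Threads and VmSize fields in /proc/<pid>/status and that has pgrep; the function names proc_status_field and report_process are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one field (for example Threads or VmSize)
# from /proc/<pid>/status. VmSize is reported by the kernel in kB.
proc_status_field() {
    awk -v key="$2:" '$1 == key { print $2 }' "/proc/$1/status"
}

# Hypothetical helper: print thread count and virtual memory size
# for every process whose name is exactly $1.
report_process() {
    for pid in $(pgrep -x "$1"); do
        echo "$pid threads=$(proc_status_field "$pid" Threads) vmsize_kb=$(proc_status_field "$pid" VmSize)"
    done
}
```

Running report_process snas_rep at intervals would show whether both values keep rising over time, which is the pattern expected from the leak.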
5 Solutions
The problem is resolved in patch version V300R006C20SPC300.
Hot patch download link:
https://support.huawei.com/enterprise/en/software/23643152-SW2000090781
Summary: The problem is fixed by installing the hot patch, which takes about 30 minutes and has no impact on production services.