1. Collect the storage logs, then analyze why controller B became abnormal and was isolated. Unzip the logs of controller B as shown below. Analyze only the files with the prefixes "log_debug_" and "log_reset_"; each file name includes the time at which the file was generated. Files with the prefix "log_reset_" record the reset reason at that time, and files with the prefix "log_debug_" record the debug logs before and after the reset. In this case, we therefore only need to analyze "log_reset_20160223221139_poweron.txt" and "log_debug_20160223221139_poweron.txt".
2. First, "log_reset_20160223221139_poweron.txt" shows that the reset reason is an out of memory reset. This means that system memory was used up and the storage system reset controller B to self-heal.
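As a rough sketch of the file selection in step 1, the snippet below keeps only the "log_reset_" and "log_debug_" files and pairs them by the timestamp in their names. The name pattern is inferred from the two file names quoted in this case and may not cover every log the storage system produces; `pair_logs` and `other_file.txt` are hypothetical names used for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern inferred from the two file names in this case (an assumption):
# log_<reset|debug>_<YYYYMMDDhhmmss>_<suffix>.txt
LOG_NAME = re.compile(r"^log_(reset|debug)_(\d{14})_\w+\.txt$")


def pair_logs(filenames):
    """Group reset/debug log files by the generation time in their names."""
    groups = defaultdict(dict)
    for name in filenames:
        match = LOG_NAME.match(name)
        if match is None:
            continue  # not a reset/debug log, ignore it
        kind, stamp = match.groups()
        generated = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        groups[generated][kind] = name
    # Sort by generation time so the most recent pair comes last.
    return dict(sorted(groups.items()))


names = [
    "log_reset_20160223221139_poweron.txt",
    "log_debug_20160223221139_poweron.txt",
    "other_file.txt",  # hypothetical unrelated file, skipped by the filter
]
pairs = pair_logs(names)
for generated, files in pairs.items():
    print(generated, files.get("reset"), files.get("debug"))
```

Pairing by timestamp makes it easy to open the reset reason and the matching debug log side by side when a controller has reset more than once.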
The latest NO.1 reset: localorcmostime=1456261538, ji=5817415460, reason=out of memory reset
Desktime=2016-02-23-22:05:38
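To cross-check the reset record, the key=value fields can be pulled out of the line and the timestamp converted to a readable time. This is a minimal sketch that assumes "localorcmostime" is a Unix epoch value in seconds; the log format does not state this explicitly, and `parse_reset_line` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

line = ("The latest NO.1 reset: localorcmostime=1456261538, "
        "ji=5817415460, reason=out of memory reset")


def parse_reset_line(text):
    """Extract key=value fields from a reset record line."""
    fields = {k: v.strip() for k, v in re.findall(r"(\w+)=([^,]+)", text)}
    # Assumption: localorcmostime is seconds since the Unix epoch.
    epoch = int(fields["localorcmostime"])
    return {
        "reason": fields["reason"],
        "time_utc": datetime.fromtimestamp(epoch, tz=timezone.utc),
    }


reset = parse_reset_line(line)
print(reset["reason"], reset["time_utc"].isoformat())
```

Under that assumption the value corresponds to 2016-02-23 21:05:38 UTC, exactly one hour behind the Desktime of 22:05:38, which would be consistent with the system clock being set to UTC+1.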
3. Second, search for the keyword "BIOS-p" to find the first log entry after the reset, then look back from there to analyze why memory was used up.
Then we can see the list of files stored in memory, along with the size of each file. In this case, the files that occupied extra memory space before the out of memory reset are shown below.
Next, we can see the output of the bash command "ps -aux", shown below, and find the threads that occupied extra memory space. The RSS column represents the physical memory used by the corresponding thread.
We can also find the memory trace information of the kernel modules, as shown below, which gives the memory usage of each kernel module.
Adding these up, they occupied about 300 MB of memory in total, which evidently caused the out of memory reset.
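When the "ps -aux" output in the debug log is long, ranking the entries by RSS can help. The sketch below parses ps-style text and returns the largest entries. The sample rows are invented purely for illustration and are not taken from this case; `top_rss` is a hypothetical helper, and the column layout is assumed to be the standard `ps aux` header, where RSS is reported in KiB.

```python
def top_rss(ps_output, n=5):
    """Return the n entries with the largest RSS as (rss_kib, command)."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    rss_col = header.index("RSS")
    cmd_col = header.index("COMMAND")
    rows = []
    for line in lines[1:]:
        # Limit the split so a command with arguments stays in one field.
        parts = line.split(None, cmd_col)
        if len(parts) <= cmd_col:
            continue  # malformed or truncated row
        rows.append((int(parts[rss_col]), parts[cmd_col]))
    rows.sort(reverse=True)
    return rows[:n]


# Invented sample data, NOT the output recorded in this case.
sample = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 10368 1024 ? Ss Feb20 0:02 init
root 2301 1.2 9.8 512000 204800 ? Sl Feb20 12:40 example_daemon --flag
root 2302 0.3 2.4 128000 51200 ? S Feb20 1:10 another_proc
"""

largest = top_rss(sample, n=2)
for rss_kib, command in largest:
    print(f"{rss_kib / 1024:.1f} MiB  {command}")
```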
Root Cause
1. The 10GE TOE card was hard reset many times, which caused the performance statistics thread to core dump, since it needs to query the performance data of the 10GE cards.
2. Normally, the performance statistics thread first records data files in the memory directory (/OSM/script), then copies them to the coffer disk and deletes the old ones. Because the thread had core dumped, it did not delete the performance data files after they were copied to the coffer disk. As a result, more and more historical performance files accumulated in memory until they used up the OS memory and caused the controller reset.
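The accumulation described above can be confirmed by counting the files in a directory and summing their sizes. The following is a minimal, read-only sketch: `dir_usage` is a hypothetical helper, and the demo runs against a temporary directory with made-up file names rather than /OSM/script. It only reports usage and deletes nothing.

```python
import os
import tempfile


def dir_usage(path):
    """Return (file_count, total_bytes) for regular files directly in path."""
    count = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
                total += entry.stat(follow_symlinks=False).st_size
    return count, total


# Demo on a throwaway directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as demo:
    for i in range(3):
        with open(os.path.join(demo, f"perf_{i}.dat"), "wb") as f:
            f.write(b"\0" * 1024)
    demo_count, demo_bytes = dir_usage(demo)

print(demo_count, "files,", demo_bytes, "bytes")
```

A file count or total size that keeps growing between checks would match the leftover-file behavior described in this root cause.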
Solution
1. Power off the faulty 10GE TOE card through ISM or the CLI, then replace it with a spare part.
2. Contact Huawei support to clear the remaining performance statistics files in the memory directory.