
Memory utilization rate reaches 100% on S5500TV2 when a 10GE TOE card fails

Latest reply: Mar 28, 2016 13:34:39
Issue Description

The storage system is an S5500T running system version V200R002C00SPC400.

It has a 10GE TOE card installed; the BOM code is 0302G367.

Alarm Information
The historical events below show that the 10GE TOE card was initially powered on successfully.


2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
2016-02-22 13:26    0x200f0d1000f    Infor    None    System succeeded in powering on the interface module (engine ENG0, 2 interface module B0).
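
When an exported event list is long, repeated events such as the power-on records above are easier to spot by counting them. The following is a minimal sketch, assuming the events are available as plain text with columns separated by two or more spaces, as they appear in this post; the function name `count_events` is hypothetical.

```python
import re
from collections import Counter

def count_events(log_text):
    """Count events by (event ID, description); columns are split on runs of 2+ spaces."""
    counts = Counter()
    for line in log_text.splitlines():
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 5:
            continue  # not an event row
        event_id, description = cols[1], cols[-1]
        counts[(event_id, description)] += 1
    return counts

# One event row copied from this post, repeated as often as it appears above.
row = ("2016-02-22 13:26    0x200f0d1000f    Infor    None    "
       "System succeeded in powering on the interface module "
       "(engine ENG0, 2 interface module B0).")
sample = "\n".join([row] * 8)

for (event_id, description), n in count_events(sample).most_common():
    print(n, event_id, description)
```

A high count for the power-on event of a single interface module is a hint that the module is restarting repeatedly.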



A few hours later, the system reported that memory usage was too high:

2016-02-22 13:26    0xf0c90005    Warning    2016-02-22 13:26    The memory utilization rate threshold (100)of the controller B in the enclosure Engine 0 has been reached.
2016-02-22 13:26    0xf0c90005    Warning    2016-02-22 13:26    The memory utilization rate threshold (100)of the controller B in the enclosure Engine 0 has been reached.

Later, the controller was reset, with the following alarms:

2016-02-22 13:26    0xf00cf0014    Critical    2016-02-22 13:26    The controller module (Engine ENG0,id B) can not be monitored. The error code is --.
2016-02-22 13:26    0x100f00cf0034    Critical    None    The controller (2) {0: controller enclosure; 1: disk enclosure; 2: engine} ENG0, controller B) is isolated.The error code is 0x40401703.
2016-02-22 13:26    0xf0cf0005    Major    2016-02-22 13:26    The communication between the controller (A) and the other controller (B) is abnormal in the enclosure (Engine ENG0), but the system can continue to work with error code 0x4000cf12.

Handling Process
1. Collect the storage logs, then analyze why controller B became abnormal and was isolated. Unzip the logs of controller B as shown below. Only the files with the prefixes "log_debug_" and "log_reset_" need to be analyzed; each file name includes the time at which the file was generated. Files with the prefix "log_reset_" record the reset reason at that time, and files with the prefix "log_debug_" record debug logs from before and after the reset. In this case, we only need to analyze "log_reset_20160223221139_poweron.txt" and "log_debug_20160223221139_poweron.txt".
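
The file-selection rule described above can be sketched in a few lines of Python. This is only an illustration: the name pattern is inferred from the two file names quoted in this post, and `select_logs` is a hypothetical helper, not part of any Huawei tool.

```python
import re
from datetime import datetime

# Pattern inferred from the two file names quoted in this post:
# log_<kind>_<YYYYmmddHHMMSS>_<tag>.txt, where kind is "reset" or "debug".
LOG_NAME = re.compile(r"^log_(reset|debug)_(\d{14})_(\w+)\.txt$")

def select_logs(names):
    """Return (timestamp, kind, name) tuples for reset/debug logs, oldest first."""
    picked = []
    for name in names:
        m = LOG_NAME.match(name)
        if m is None:
            continue  # skip files that are not reset/debug logs
        stamp = datetime.strptime(m.group(2), "%Y%m%d%H%M%S")
        picked.append((stamp, m.group(1), name))
    return sorted(picked)

files = [
    "log_reset_20160223221139_poweron.txt",
    "log_debug_20160223221139_poweron.txt",
    "other_file.txt",  # hypothetical name, ignored by the filter
]
for stamp, kind, name in select_logs(files):
    print(stamp.isoformat(), kind, name)
```

Sorting by the timestamp in the file name makes it easy to pick the reset and debug logs that belong to the same reset.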

[Figure 1: unzipped logs of controller B]

2. First, "log_reset_20160223221139_poweron.txt" shows that the reset reason is an out-of-memory reset. This means that system memory was used up and the storage system reset controller B to self-heal.

The latest NO.1 reset: localorcmostime=1456261538, ji=5817415460, reason=out of memory reset
Desktime=2016-02-23-22:05:38
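
The reset record above can also be read programmatically. The sketch below splits the key=value fields and converts the first field to a date. It assumes that `localorcmostime` is a count of seconds since the Unix epoch, which the post does not state; `parse_reset_line` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

line = ("The latest NO.1 reset: localorcmostime=1456261538, "
        "ji=5817415460, reason=out of memory reset")

def parse_reset_line(text):
    """Extract key=value fields from a reset record into a dict."""
    fields = dict(re.findall(r"(\w+)=([^,]+)", text))
    return {key: value.strip() for key, value in fields.items()}

record = parse_reset_line(line)
# Assumption: localorcmostime is seconds since the Unix epoch.
when_utc = datetime.fromtimestamp(int(record["localorcmostime"]), tz=timezone.utc)
print(record["reason"])
print(when_utc.isoformat())
```

Under that assumption the value converts to 2016-02-23 21:05:38 UTC, one hour behind the Desktime in the record, which would be consistent with a local time zone of UTC+1.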

3. Second, search for the keyword "BIOS-p" to find the startup log after the reset, then look backward from there to analyze why memory was used up.

[Figure 2: startup log found by searching for "BIOS-p"]

The log then lists the files stored in memory, together with the size of each file in memory. In this case, it shows the files below, which occupied extra memory space before the out-of-memory reset.

[Figure 3: files that occupied extra memory space before the out-of-memory reset]

Next, the log contains the output of the bash command "ps -aux", shown below, from which we can find the threads that occupied extra memory space. The RSS column is the physical memory (resident set size) used by the corresponding thread.

[Figure 4: output of "ps -aux"]
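
To rank the entries in such output by RSS, a short script can help. The following is a minimal sketch that assumes the standard `ps aux` column layout with a header row; the sample rows and process names are made up for illustration and are not taken from the logs in this case.

```python
def top_rss(ps_output, limit=5):
    """Return (rss_kib, pid, command) tuples for the largest resident sizes."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    rss_col = header.index("RSS")
    pid_col = header.index("PID")
    cmd_col = header.index("COMMAND")
    rows = []
    for line in lines[1:]:
        # COMMAND is the last column and may itself contain spaces.
        parts = line.split(None, cmd_col)
        if len(parts) <= cmd_col:
            continue
        rows.append((int(parts[rss_col]), int(parts[pid_col]), parts[cmd_col]))
    return sorted(rows, reverse=True)[:limit]

# Hypothetical sample, not from the case logs.
sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1   1000   400 ?        Ss   10:00   0:01 init
root       200  1.0 20.0 900000 80000 ?        Sl   10:01   5:00 demo_stats --interval 5
root       300  0.5  5.0 300000 20000 ?        Sl   10:01   1:00 demo_io
"""

for rss_kib, pid, command in top_rss(sample, limit=2):
    print(f"{rss_kib / 1024:.1f} MiB  pid={pid}  {command}")
```

`ps` reports RSS in KiB, so the sketch divides by 1024 to print MiB.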

The log also contains memory trace information for the kernel modules, shown below, which gives the memory usage of each kernel module.

[Figure 5: memory trace information for kernel modules]

4. In this case, we finally found many unexpected performance statistics files in the "/OSM/script" directory, as shown below.

[Figures 6 and 7: performance statistics files in "/OSM/script"]

Adding up their sizes shows that they occupied about 300 MB of memory in total, which caused the out-of-memory reset.
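
The total can be added up from a directory listing instead of by hand. The sketch below assumes the listing is in `ls -l` style, which may differ from the format in the actual debug log; the file names and sizes in the sample are invented for illustration.

```python
def total_size_bytes(listing):
    """Sum the size column of regular-file rows in `ls -l` style text."""
    total = 0
    for line in listing.splitlines():
        parts = line.split()
        # Regular files start with "-"; the size is the fifth column.
        if len(parts) >= 9 and parts[0].startswith("-") and parts[4].isdigit():
            total += int(parts[4])
    return total

# Hypothetical listing, not from the case logs.
sample = """\
total 30720
-rw-r--r-- 1 root root 10485760 Feb 22 13:26 perf_demo_0001.dat
-rw-r--r-- 1 root root 10485760 Feb 22 13:31 perf_demo_0002.dat
-rw-r--r-- 1 root root 10485760 Feb 22 13:36 perf_demo_0003.dat
drwxr-xr-x 2 root root     4096 Feb 22 13:00 subdir
"""

size = total_size_bytes(sample)
print(f"{size / 2**20:.1f} MiB")
```

Directories and the "total" line are skipped, so only regular files count toward the sum.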

Root Cause
1. The 10GE TOE card hard reset many times. This caused the performance statistics thread to crash with a core dump, because it needs to query the performance data of the 10GE cards.

2. Normally, the performance statistics thread first writes data files to the in-memory directory (/OSM/script), then copies them to the coffer disk and deletes the old ones. Because the thread had crashed, the performance data files were not deleted after being copied to the coffer disk. Historical performance files therefore kept accumulating in memory until they used up the OS memory and caused the controller reset.
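
The copy-then-delete flow described above can be pictured with a small sketch. This is not Huawei's implementation; it only illustrates the pattern, using temporary directories as stand-ins for /OSM/script and the coffer disk. The function name `flush_stats` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def flush_stats(mem_dir, persist_dir):
    """Copy each file in mem_dir to persist_dir, then delete the in-memory copy."""
    flushed = 0
    for src in sorted(Path(mem_dir).iterdir()):
        if not src.is_file():
            continue
        shutil.copy2(src, Path(persist_dir) / src.name)
        src.unlink()  # the cleanup step that stopped happening in this case
        flushed += 1
    return flushed

# Demo: temporary directories stand in for /OSM/script and the coffer disk.
with tempfile.TemporaryDirectory() as mem, tempfile.TemporaryDirectory() as coffer:
    for i in range(3):
        (Path(mem) / f"perf_demo_{i}.dat").write_bytes(b"x" * 1024)
    flushed = flush_stats(mem, coffer)
    left_in_memory = len(list(Path(mem).iterdir()))
    persisted = len(list(Path(coffer).iterdir()))
    print(flushed, left_in_memory, persisted)
```

If the delete step never runs, every new statistics file stays in the in-memory directory, which matches the growth seen in this case.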
Solution
1. Power off the faulty 10GE TOE card through ISM or the CLI, then request a spare part to replace it.

2. Contact Huawei support to clear the remaining performance statistics files in the in-memory directory.


