Fault Symptoms
On a 2288H V5 server, the OS logs show that an uncorrected memory error (UCE) occurred.
However, no memory error is recorded in the BMC FDM log and no memory alarm is reported in the SEL, as shown in the following figure.

Cause Analysis and Handling Procedure
The key analysis process is as follows:
1. According to the OS error log, the memory reported the UCE at 09:03:48 on 2021-02-20. In the BMC SEL log, however, the latest event was recorded at 05:08:36 on 2021-02-10, and the SEL contains no exceptions, as shown in the following figure.

2. Check the BMC operation logs and confirm that the BMC logs were collected at 02:26 on 2021-02-25, which is after the time when the problem occurred, as shown in the following figure.

3. Check the BMC FDM logs. No memory error information was recorded at the time of the fault.
4. Analyze the block diagram of the fault handling system design. The OS processing module normally receives MCE errors from the following two sources:
Errors reported by the BIOS module.
AER NSI/NMI errors, which are reported directly.
AER (Advanced Error Reporting) is an error reporting mechanism supported by PCIe devices and is unrelated to memory faults. Therefore, the memory MCE error logged by the OS must have been reported by the BIOS after the BIOS detected the memory fault. The BIOS also reports such errors to the BMC processing module.
The preliminary conclusion is that the BIOS reported the memory fault information to the BMC, but the BMC parsing mechanism was abnormal.
5. According to BMC logs, the CpuMem directory is empty.

6. Analyze the dfm_debug_log file. The cpu_mem module frequently fails to open the /opt/pme/pram/per_power_off.ini file, as shown in the following figure.
This suggests that the cpu_mem module is abnormal. As a result, no CPU or memory logs were generated during log collection, and the diagnosis and processing of the memory fault information reported by the BIOS to the BMC were also affected.

7. If a memory fault has already occurred on a device, determine the slot of the faulty memory module so that it can be replaced.
You can use ipmitool to query the specific device address based on the error information in the OS logs.
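
The timeline reasoning in steps 1 and 2 can be sketched in Python as follows. This is only an illustration, not part of any BMC tooling: the function name is hypothetical, the timestamps are the ones quoted above, and 02:26 on 2021-02-25 is assumed to be the log collection time.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def fault_missing_from_sel(fault_time, last_sel_event, collected_at):
    """Return True if the fault happened before the logs were collected
    but is newer than the last SEL entry, meaning the SEL did not record it."""
    return last_sel_event < fault_time <= collected_at

uce = datetime.strptime("2021-02-20 09:03:48", FMT)        # UCE in the OS log
last_sel = datetime.strptime("2021-02-10 05:08:36", FMT)   # latest SEL event
collected = datetime.strptime("2021-02-25 02:26:00", FMT)  # assumed collection time

print(fault_missing_from_sel(uce, last_sel, collected))
```

If the check returns True, the collected logs cover the fault time, so the missing SEL entry points to the BMC rather than to log collection that happened too early.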

Analysis Conclusion
A handle leak occurs when the BMC accesses the PECI device. After the BMC runs for a long time, all available handles (1024 in total) are exhausted, and the cpu_mem process becomes abnormal. When a memory fault then occurs, the BIOS sends an interrupt message to the BMC, but the BMC cannot open new files while handling the alarm. As a result, no alarm is recorded in the FDM log or the SEL.
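
The effect of such a leak can be reproduced on an ordinary Linux or other POSIX host. The following is a minimal sketch, not the BMC code: the function name is hypothetical, /dev/null stands in for the PECI device, and the limit is lowered to 64 instead of 1024 only to keep the run short.

```python
import errno
import os
import resource

def leak_until_open_fails(limit=64):
    """Lower the per-process handle limit, leak descriptors, and report how open() fails."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    leaked = []
    failure = None
    try:
        # At most `limit` descriptors can exist, so limit + 1 attempts must fail.
        for _ in range(limit + 1):
            # Each descriptor is opened and never closed: this is the leak.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        failure = exc.errno
    finally:
        # Clean up and restore the original limit so the caller is unaffected.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(leaked), failure

count, err = leak_until_open_fails()
print(count, errno.errorcode.get(err))
```

Once the limit is reached, every further attempt to open a file fails in the same way, which matches the repeated failures to open /opt/pme/pram/per_power_off.ini seen in step 6.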
Triggering scenario (all of the following conditions are met):
The V5 server was delivered before April 2018.
The BMC has been running for more than half a year. (The problem has been observed only after the device has been running for one year or more.)
A memory fault occurs.
Solution:
Upgrade the BMC to 616 or later.

