Problem Description
The BMC of the CH121 V3 server reported uncorrectable memory errors on DIMM020 and DIMM030.
Problem Analysis
1. SEL Log Analysis
The SEL logs indicate that unrecoverable errors occurred on DIMM020 and DIMM030, resulting in a CPU CAT ERROR and a server crash.

2. FDM Log Analysis
The FDM logs indicate that a DDR4 Command and Address Parity (CAP) error occurred in memory channel 1 of CPU 0.

Cbo TOR_TIMEOUT and MLC watchdog timer (3-strike) errors were also reported for DIMM000, DIMM010, DIMM020, and DIMM030.




Among DIMM010, DIMM011, and DIMM012 in memory channel 1 of CPU 0, only DIMM010 was detected.
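The DIMM names in this case appear to follow a DIMMxyz pattern, where x is the CPU, y is the memory channel, and z is the slot within the channel. This pattern is an assumption inferred from the statement above (DIMM010, DIMM011, and DIMM012 all sit in channel 1 of CPU 0), not a documented naming rule. Under that assumption, a minimal sketch for decoding the names looks like this (decode_dimm is a hypothetical helper, not part of any BMC tool):

```python
import re


def decode_dimm(name):
    """Split a DIMMxyz name into CPU, channel, and slot numbers.

    Assumes the DIMMxyz layout inferred from this case; it is not
    taken from vendor documentation.
    """
    match = re.fullmatch(r"DIMM(\d)(\d)(\d)", name)
    if match is None:
        raise ValueError(f"unexpected DIMM name: {name!r}")
    cpu, channel, slot = (int(digit) for digit in match.groups())
    return {"cpu": cpu, "channel": channel, "slot": slot}


# DIMMs named in the FDM log of this case.
reported = ["DIMM000", "DIMM010", "DIMM020", "DIMM030"]
channels = [decode_dimm(name)["channel"] for name in reported]
```

Read this way, the four reported DIMMs are slot 0 of channels 0 to 3 on CPU 0, while the CAP error itself is confined to channel 1.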

When a CAP error occurs in DDR4 memory, the memory controller retries the affected transaction. During the retry, the controller blocks all of its memory operations for a period. For a single CAP error, the retry returns correct data and the blocking time is short, so the impact on the running system is small. However, when multiple CAP errors occur in succession, the memory controller must retry every affected transaction and block memory operations again for each retry. In this case, read/write requests at the back of the memory request queue time out because the requests ahead of them are blocked.
The LLC and MLC start a timer for each memory access request. When a large number of CAP errors occur, requests in the LLC and MLC time out, which triggers the TOR_TIMEOUT and MLC watchdog timer (3-strike) errors. In the current Intel RASM architecture, LLC TOR_TIMEOUT and MLC watchdog timer errors are uncorrectable and cause the system to crash.
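The queuing effect described above can be illustrated with a small toy model. This is only a sketch: the function name, the timing values, and the assumption that all requests are queued at time 0 are hypothetical and do not reflect real memory controller parameters.

```python
def simulate_queue(num_requests, service_time, cap_error_indices,
                   retry_block_time, timeout):
    """Return the indices of queued requests that finish after `timeout`.

    Toy model: all requests are queued at time 0 and served in order.
    A request hit by a CAP error first costs `retry_block_time`, during
    which nothing else in the queue makes progress.
    """
    errors = set(cap_error_indices)
    clock = 0
    timed_out = []
    for index in range(num_requests):
        if index in errors:
            clock += retry_block_time  # retry blocks the whole queue
        clock += service_time
        if clock > timeout:
            timed_out.append(index)
    return timed_out


# Hypothetical time units: 10 per request, 100 per retry, timeout at 200.
single_error = simulate_queue(8, 10, [0], 100, 200)
burst_errors = simulate_queue(8, 10, [0, 1, 2], 100, 200)
```

With these made-up values, a single CAP error leaves every request inside its timeout (single_error is empty), whereas three consecutive CAP errors push every request from the second one onward past the timeout, including requests that were not hit by an error themselves.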
For more information, see the following figure.

Solution Description
DIMM010 is faulty and needs to be replaced. The errors reported for DIMM000, DIMM020, and DIMM030 are side effects of the DIMM010 fault, so these DIMMs do not need to be replaced.