DIMM Errors on a CH121 V3 Server

Latest reply: Mar 17, 2021 13:21:31

Problem Description

The BMC of the CH121 V3 server reported DIMM020 and DIMM030 uncorrectable memory errors.


Problem Analysis

1. SEL Log Analysis

The SEL logs indicate that unrecoverable errors occurred on DIMM020 and DIMM030, resulting in a CPU CAT ERROR and a server crash.

110059be33w0lwmh58wzeh.png

2. FDM Log Analysis

The FDM logs indicate that a DDR4 Command and Address Parity (CAP) error occurred in memory channel 1 of CPU 0.

110317bwgvdozjrd5d155i.png


In addition, Cbo TOR_TIMEOUT and MLC watchdog timer (3-strike) errors also occurred on DIMM000, DIMM010, DIMM020, and DIMM030.

110448swobsjomxj77wbfy.png

110451zxsxfubu99l0xie8.png

110455vfm88rihxylninp4.png

110459nys7mni8t1llz8y7.png


Among DIMM010, DIMM011, and DIMM012 in memory channel 1 of CPU 0, only DIMM010 was detected.
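The DIMM labels in this case appear to encode the CPU, memory channel, and slot as three digits (for example, DIMM010, DIMM011, and DIMM012 all sit in channel 1 of CPU 0). The following is a minimal sketch of that mapping, assuming this digit layout holds for every label; `parse_dimm` and `DimmLocation` are hypothetical helper names, not part of any BMC tool.

```python
import re
from typing import NamedTuple


class DimmLocation(NamedTuple):
    cpu: int
    channel: int
    slot: int


def parse_dimm(label: str) -> DimmLocation:
    """Split a label such as 'DIMM010' into (cpu, channel, slot).

    Assumption: the three digits mean CPU, channel, slot, in that order,
    as inferred from the DIMM010/011/012 example in this post.
    """
    match = re.fullmatch(r"DIMM(\d)(\d)(\d)", label)
    if match is None:
        raise ValueError(f"unexpected DIMM label: {label!r}")
    cpu, channel, slot = (int(digit) for digit in match.groups())
    return DimmLocation(cpu, channel, slot)


def dimms_on_channel(labels, cpu, channel):
    """Return the labels that sit on the given CPU and memory channel."""
    return [
        label for label in labels
        if parse_dimm(label)[:2] == (cpu, channel)
    ]
```

With the four DIMMs named in the FDM logs, `dimms_on_channel(["DIMM000", "DIMM010", "DIMM020", "DIMM030"], cpu=0, channel=1)` returns only `DIMM010`, which is the one that shares a channel with the CAP error.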

110748uowzololulo5w2io.png


When a CAP error occurs in DDR4 memory, the memory controller retries the affected operation. During the retry, the controller blocks all of its memory operations for a period. For a single CAP error, the retry returns correct data and the blocking time is short, so the impact on the running system is small. However, when CAP errors occur continuously, the controller must retry every affected operation and blocks memory operations during each retry. As a result, read/write requests at the back of the memory request queue time out while waiting behind the blocked requests at the front.


The LLC and MLC start a timer for each memory access request. When a large number of CAP errors occur, the requests in the LLC and MLC time out, which triggers TOR_TIMEOUT and MLC watchdog timer (3-strike) errors. In the current Intel RASM architecture, LLC TOR_TIMEOUT and MLC watchdog timer errors are uncorrectable and cause the system to crash.
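The effect described above can be illustrated with a toy queue model. This is only a sketch: the service time, retry blocking time, and timeout are made-up values chosen for illustration, and the FIFO model is a simplification rather than how the memory controller is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Request:
    issued_at: float  # time the request entered the queue (microseconds)
    cap_errors: int   # CAP errors hit while serving this request


def simulate(requests, service_us, retry_block_us, timeout_us):
    """Serve requests in FIFO order and return the indices that time out.

    Each CAP error adds one retry, and each retry blocks the controller
    for retry_block_us, delaying every request queued behind it.
    """
    now = 0.0
    timed_out = []
    for index, request in enumerate(requests):
        now = max(now, request.issued_at)
        now += service_us + request.cap_errors * retry_block_us
        if now - request.issued_at > timeout_us:
            timed_out.append(index)
    return timed_out


# Hypothetical numbers: 1 us service, 5 us per retry, 50 us timer.
single = [Request(0.0, 1)] + [Request(0.0, 0) for _ in range(9)]
burst = [Request(0.0, 3) for _ in range(10)]

print(simulate(single, service_us=1, retry_block_us=5, timeout_us=50))
print(simulate(burst, service_us=1, retry_block_us=5, timeout_us=50))
```

In this model a single CAP error delays the queue only slightly and nothing times out, while continuous CAP errors push the requests at the back of the queue past the timer even though each individual retry is short.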


For more information, see the following figure.

110912fc8eujs2gd8zcgrd.png

Solution Description

DIMM010 is faulty and needs to be replaced. The errors reported on DIMM000, DIMM020, and DIMM030 are side effects of the DIMM010 fault, so these DIMMs do not need to be replaced.

DIMM Errors solution

VERY GOOD
