
Multiple Disks Are Faulty Due to SAS Chip Faults


Hello, everyone!

This post shares a case in which multiple disks became faulty because of SAS chip faults (0x54, xfer err internal crc err).

Symptoms

A large number of medium error alarms are generated on the storage device, multiple disks are faulty, and services are interrupted.


Cause

Q1. Why is the RAID group invalid and services interrupted?

Three disks in the same RAID group are removed.

 

Q2. Why are three disks in the same RAID group removed?

I/O timeouts occurred on both ends (controllers A and B) of each disk. The BDM diagnosed the disks, and I/O timeouts occurred again afterward. If the BDM diagnoses a disk twice within 1 hour, the disk is removed. If a disk is removed on both ends, it is kicked out and the RAID group becomes faulty.
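
The removal rule described above can be sketched as a small state model. This is only an illustration under stated assumptions, not the actual BDM code: the class name `DiskTracker`, its methods, and the state names are hypothetical, and each I/O timeout is assumed to trigger exactly one diagnosis on the controller that saw it.

```python
from collections import defaultdict

DIAGNOSIS_WINDOW_S = 3600   # "within 1 hour", taken from the post
MAX_DIAGNOSES = 2           # the second diagnosis inside the window removes the disk


class DiskTracker:
    """Hypothetical model of the removal rule; not the real BDM implementation."""

    def __init__(self, controllers=("A", "B")):
        self.controllers = controllers
        self.diagnoses = defaultdict(list)   # (disk, controller) -> diagnosis timestamps
        self.removed = defaultdict(set)      # disk -> controllers that removed it

    def on_io_timeout(self, disk, controller, now):
        """Record the diagnosis triggered by an I/O timeout and return the disk state."""
        key = (disk, controller)
        # Keep only diagnoses that are still inside the one-hour window.
        recent = [t for t in self.diagnoses[key] if now - t < DIAGNOSIS_WINDOW_S]
        recent.append(now)
        self.diagnoses[key] = recent
        if len(recent) >= MAX_DIAGNOSES:
            self.removed[disk].add(controller)
        return self.state(disk)

    def state(self, disk):
        ends = self.removed[disk]
        if len(ends) == len(self.controllers):
            return "out"          # removed on both ends: the disk leaves the RAID group
        if ends:
            return "single-link"  # removed on one end only
        return "online"
```

For example, two timeouts on controller A at 0 s and 1800 s leave the disk in the `single-link` state, and two more on controller B within the same hour move it to `out`. Two timeouts that are more than an hour apart do not remove the disk.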

 

Q3. Why did timeouts occur on both ends of the three disks, and why did they recur after the diagnosis?

The 8072 chip of controller A was faulty. As a result, the disks were blocked and I/Os piled up in each disk's single queue, so I/Os from controller B also timed out. The fault persisted even after the disks were power cycled in an attempt to recover them.

 

Q4. Why do DIF errors occur on both controllers A and B when the BDM reads data?

While controller A was writing data to the disks, data errors occurred in the 8072 chip. This version does not provide full-process DIF, so the write path has no DIF verification. As a result, incorrect data was written to the disks, and a large number of DIF errors occurred during BDM read verification.
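
To make the missing write-path check concrete, here is a minimal sketch. It assumes the DIF guard tag is the CRC-16 used by T10 protection information (polynomial 0x8BB7), which the post does not state. All function names are hypothetical, and `flip_bits` merely simulates the corruption inside the chip.

```python
def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(block: bytes):
    """Compute the guard tag when the data enters the controller."""
    return block, crc16_t10dif(block)


def verify(block: bytes, guard: int) -> bool:
    """Recompute the guard tag and compare it with the stored one."""
    return crc16_t10dif(block) == guard


def flip_bits(block: bytes, positions) -> bytes:
    """Simulate a multi-bit corruption of the data in the chip's TX FIFO."""
    buf = bytearray(block)
    for pos in positions:
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


def write_to_disk(block, guard, corrupt_positions=(), verify_on_write=False):
    """Return what lands on disk, or None if the write path rejects the block."""
    sent = flip_bits(block, corrupt_positions)
    if verify_on_write and not verify(sent, guard):
        return None   # a full-process DIF check would stop the bad block here
    return sent, guard


block, guard = protect(bytes(range(64)) * 8)                   # one 512-byte block
stored = write_to_disk(block, guard, corrupt_positions=(8, 11, 13))
read_ok = verify(*stored)                                      # read verification fails
```

Without verification on the write path, the corrupted block reaches the disk together with the original guard tag, and the mismatch only surfaces when the block is read back and verified.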

 

Q5. Why did data errors occur in the 8072 chip of controller A?

The RAM of the internal TX FIFO of the 8072 chip in controller A experienced multi-bit flips. As a result, data errors occurred and the chip reported events such as 0x54 to the driver.

 

Q6. The 8072 chip detected the multi-bit error and reported the 0x54 event to the driver. Why was the data on the disks still inconsistent?

The event reporting mechanism of the 8072 chip is asynchronous. The TX FIFO RAM is the last stage that data passes through before it is sent to the disk, so by the time the 0x54 event is reported, the data has already been written to the disk.

 

Q7. Why doesn’t the chip rectify data after detecting a multi-bit error?

The chip can correct only single-bit ECC errors. It cannot correct multi-bit data errors. (Single-bit ECC error correction is not enabled in chip firmware 3.8.36 and earlier versions, but is enabled in versions later than 3.8.36.)
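
The difference between a correctable single-bit error and a multi-bit error that can only be detected is easiest to see with a textbook SECDED code. The sketch below uses an extended Hamming (8,4) code as an illustration. The post does not describe the ECC scheme actually used inside the 8072 chip, so the code and all its names are assumptions.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry parity bits


def syndrome(word):
    """XOR of the indices (1..7) of all set bits; 0 for a valid codeword."""
    s = 0
    for i in range(1, 8):
        if word[i]:
            s ^= i
    return s


def encode(nibble):
    """Encode 4 data bits into 8 bits; index 0 holds the overall parity bit."""
    word = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> k) & 1
    s = syndrome(word)
    for p in PARITY_POSITIONS:
        word[p] = 1 if s & p else 0
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, data); data is None when the error cannot be corrected."""
    word = list(word)
    s = syndrome(word)
    overall = sum(word) % 2
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[s] ^= 1              # single-bit error; s == 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None   # two bits flipped: detected, not correctable
    nibble = sum(word[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return status, nibble
```

Flipping any one bit of `encode(0b1011)` decodes as `("corrected", 11)`, while flipping two bits decodes as `("uncorrectable", None)`. A code of this kind can report a multi-bit error but has no way to restore the original data.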


Analysis

1. Multiple disks became faulty within a short time.



2. Each disk was diagnosed twice within one hour, so the diagnosis failed and the BDM removed the disk.



3. A large number of I/O timeouts occurred on many disks.



4. The SAS chip of the controller was abnormal and reported a large number of 0x54 (xfer err internal crc err) errors.



Solution

1. Remove and reinsert the faulty controller. 

   (Replace the controller ASAP.)


2. Run the restore disk command to restore the isolated disks that failed diagnosis.

    The disk domain is restored.


3. If the system reports a disk single-link alarm after the disk is restored, power cycle the disk.

    (Disk restoration restores only the disk status. Because of the DIF errors, it does not clear the disk single-link alarm.)


4. Check whether disklog alignment exists. If it does, run the stripe repair command to repair the BST.

    (Disklog alignment occurs because the abnormal SAS chip causes the data written to the disk to be disordered.

     As a result, the data read from the disk fails the DIF verification. If two disks in a stripe have DIF errors, BST is generated.)

    

5. To ensure data consistency, perform disk domain verification and repair.
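
The note in step 4 says that BST is generated when two disks in the same stripe have DIF errors. A minimal sketch of that check follows. It assumes the DIF failures are available as (stripe, disk) pairs; the function name and the input format are hypothetical and do not correspond to any real command output.

```python
BST_THRESHOLD = 2   # from the post: two disks with DIF errors in one stripe


def stripes_needing_repair(dif_errors, threshold=BST_THRESHOLD):
    """dif_errors: iterable of (stripe_id, disk_id) pairs that failed DIF verification."""
    failed = {}
    for stripe_id, disk_id in dif_errors:
        # A set counts each disk once, even if several of its blocks failed.
        failed.setdefault(stripe_id, set()).add(disk_id)
    return sorted(s for s, disks in failed.items() if len(disks) >= threshold)


errors = [(0, "d1"), (0, "d1"), (1, "d1"), (1, "d3"), (2, "d2")]
print(stripes_needing_repair(errors))   # [1]
```

Stripe 0 has two failures on the same disk and stripe 2 has a single failure, so only stripe 1 reaches the threshold.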

That's all, thanks!
