Hello, everyone!
This post shares a case in which multiple disks became faulty because of SAS chip faults (0x54, xfer err internal crc err).
Symptoms
A large number of medium error alarms are generated on the storage device, multiple disks are faulty, and services are interrupted.


Cause
Q1. Why did the RAID group become invalid and services get interrupted?
Three disks in the same RAID group were removed.
Q2. Why were three disks in the same RAID group removed?
I/O timeouts occurred on both ends of each disk. BDM diagnosed the disks, and I/O timeouts then occurred again. If BDM diagnoses a disk twice within 1 hour, the disk is removed. If a disk is removed from both ends, the disk is out and the RAID group becomes faulty.
Q3. Why did timeouts occur on both ends of the three disks and then recur after the diagnosis?
The 8072 chip of controller A was faulty. As a result, the disks were blocked and I/Os piled up in each disk's single queue, so I/Os from controller B also timed out. The fault persisted after the disks were powered off and on.
Q4. Why did DIF errors occur during BDM data reads on both controllers A and B?
While controller A was writing data to disks, data errors occurred in the 8072 chip. This version does not provide full-process DIF, and the write path has no DIF verification. As a result, incorrect data was written to disks, and a large number of DIF errors occurred during BDM read verification.
Q5. Why was controller A's data corrupted in the 8072 chip?
The RAM of the internal TX FIFO of the 8072 chip in controller A encountered multi-bit flips. As a result, data errors occurred and the chip reported events such as 0x54 to the driver.
Q6. The 8072 chip detects the multi-bit error and reports the 0x54 event to the driver. Why is the data still inconsistent?
The event reporting mechanism of the 8072 chip is asynchronous. The TX FIFO RAM is the last stage before data is sent to the disk, so by the time the 0x54 event is reported, the data has already been written to the disk.
Q7. Why doesn't the chip rectify the data after detecting a multi-bit error?
The chip can rectify only single-bit ECC errors; it cannot rectify multi-bit data errors. (Single-bit ECC error correction is not enabled in chip firmware 3.8.36 and earlier versions, but is enabled in versions later than 3.8.36.)
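To illustrate Q7, here is a minimal sketch of why an ECC can repair a single flipped bit but only detect two flipped bits. It assumes an extended Hamming (SECDED) code over 4 data bits, chosen purely for illustration. The post does not say which ECC scheme the 8072 chip actually uses, and the function names encode and decode are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming positions (parity at 1, 2, 4 and data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the data cannot be trusted."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat it as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

single = list(word)
single[5] ^= 1                 # one flipped bit
print(decode(single))          # status is "corrected", data recovered

double = list(word)
double[5] ^= 1
double[6] ^= 1                 # two flipped bits
print(decode(double))          # status is "uncorrectable", no data returned
```

With two flipped bits the decoder can only raise an error, which matches the behavior described above: the chip reports 0x54 but cannot repair the data.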
Analysis
1. Multiple disks became faulty within a short time.

2. Each disk became faulty because it was diagnosed twice within one hour; the diagnosis failed and BDM removed the disk.

3. A large number of I/O timeouts occurred on many disks.

4. The SAS chip of the controller was abnormal and reported a large number of 0x54 (xfer err internal crc err) errors.
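The removal rule in item 2 can be sketched as a simple time-window check. This is only an illustration of the rule as described in this post, not the actual BDM logic: the function names, the inclusive one-hour boundary, and the sample timestamps are all assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)   # "within 1 hour", as stated in the post
LIMIT = 2                     # two diagnoses trigger removal


def removed_on_link(diagnosis_times, window=WINDOW, limit=LIMIT):
    """Return True if `limit` diagnoses fall inside any `window`-long span."""
    times = sorted(diagnosis_times)
    for i in range(len(times) - limit + 1):
        if times[i + limit - 1] - times[i] <= window:
            return True
    return False


def disk_is_out(times_on_a, times_on_b):
    """A disk is out only when it has been removed from both ends."""
    return removed_on_link(times_on_a) and removed_on_link(times_on_b)


# Hypothetical timestamps, not taken from the real logs.
t0 = datetime(2020, 1, 1, 10, 0)
link_a = [t0, t0 + timedelta(minutes=20)]
link_b = [t0 + timedelta(minutes=5), t0 + timedelta(minutes=40)]
print(disk_is_out(link_a, link_b))
```

Because a faulty chip on one controller can make I/Os time out on both ends, both links can hit the limit within the same hour, which is how several disks dropped out in a short time.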

Solution
1. Remove and reinsert the faulty controller.
(Replace the controller as soon as possible.)
2. Run the restore disk command to restore the isolated disks that failed diagnosis.
The disk domain is restored.
3. If the system reports a disk single-link alarm after the disk is restored, power the disk off and on.
(Disk restoration restores only the disk status; the disk single-link alarm is not cleared because of the DIF errors.)
4. Check whether disklog alignment exists. If it does, run the stripe repair command to repair the BST.
(Disklog alignment occurs because the abnormal SAS chip caused disordered data to be written to the disk.
As a result, the data read from the disk fails DIF verification. If two disks in a stripe have DIF errors, a BST is generated.)
5. To ensure data consistency, perform disk domain verification and repair.
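The BST condition mentioned in step 4 can be sketched as follows. This only models the rule stated above (two disks with DIF errors in the same stripe); the data layout, the function name, and the sample values are hypothetical and do not reflect the real log format or the stripe repair command.

```python
def stripes_with_bst(dif_errors_by_stripe, threshold=2):
    """Return the stripe IDs where at least `threshold` distinct disks
    reported DIF errors, i.e. the stripes where a BST would be generated."""
    return sorted(
        stripe
        for stripe, disks in dif_errors_by_stripe.items()
        if len(set(disks)) >= threshold
    )


# Hypothetical input: stripe ID -> disks whose reads failed DIF verification.
errors = {
    100: ["disk3"],
    101: ["disk3", "disk7"],
    102: [],
    103: ["disk7", "disk7"],   # the same disk reported twice counts once
}
print(stripes_with_bst(errors))  # [101]
```

Stripes with a DIF error on only one disk do not meet the condition in this model; the stripes it returns are the ones that step 4 targets with stripe repair.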
That's all, thanks!