
The system disk is about to fail and the disk fault indicator is incorrectly turned on


Hello, everyone!

This case describes a system disk that is about to fail and a disk fault indicator that is incorrectly turned on.

Symptoms

Scenario 1 (The storage version is earlier than V300R003C10):

A CTE0.1 disk fault alarm is generated on the storage system. 

The fault indicator of the CTE0.1 disk is on, but the disk SN in the alarm is inconsistent with the actual disk SN.

2020-04-14   02:36:03 DST    0xF00A0003    Major      None    The disk (Controller   Enclosure CTE0, slot 1, serial number A0116092008350000470) is failing.

 

Scenario 2 (The storage version is V300R003C10 or later, V300R006C20, Dorado V300R001C21SPH105, or earlier than V500R007C10):

1. A controller fault alarm is generated.

2019-08-15   17:21:32    0xF00CF005F    Fault      Major    Unrecovered    None      Controller (Controller Enclosure CTE0, controller B, item 03056450, SN   21030564509WJ2000083) is faulty. Error code: 0x4000cf4d.

2. The fault indicator of the hard disk in slot CTE0.1 is on or blinks. 

The disk status is normal.


Cause

1. In versions earlier than V300R003C10, an alarm is reported when the system detects that the system disk is about to fail.

The enclosure and slot number of the system disk (0, 1) is the same as that of the CTE0.1 disk in the disk enclosure. As a result, an alarm indicating that the CTE0.1 disk is about to fail is reported, and the alarm indicator of the CTE0.1 disk is turned on incorrectly.

2. If the storage version is V300R003C10 or later, V300R006C20, Dorado V300R001C21SPH105, or earlier than V500R007C10, a controller fault alarm is reported when the system detects that the system disk is about to fail. However, the enclosure and slot number of the system disk (0, 1) is the same as that of the CTE0.1 disk in the disk enclosure.

Therefore, the alarm indicator of the CTE0.1 disk is turned on incorrectly.

In V300R006C20, Dorado V300R001C21SPH105, and V500R007C10 and later versions, the disk alarm indicator is not turned on when the system disk is about to fail.


Analysis

Scenario 1 (The storage version is earlier than V300R003C10):

1. A CTE0.1 disk fault alarm is generated on the storage system. 

The fault indicator of the CTE0.1 disk is on, but the disk SN in the alarm is inconsistent with the actual disk SN.

2020-04-14   02:36:03 DST    0xF00A0003    Major      None    The disk (Controller   Enclosure CTE0, slot 1, serial number A0116092008350000470)   is failing.


2. Open \DataCollect\System_log\log_controller_0A\Other\bdm_info and search for SN A0116xxxxxxxx0000470.

The system disk with this SN is located in slot (0, 1) with ID 32, indicating that the system disk of controller A is about to fail.

 

3. If the following logs are recorded in the message log of controller A, the system's routine check has detected that the system disk is about to fail and has reported an alarm.

[2020-04-13 18:36][46609783.314054]   [11667302299][15000009a0248][ERR][Smart info of disk 32 is abnormal, smart id   231.smart curVal:9, smart worst:9, threshold:10.][BDM_HDM] [hdmRtDoComAtaDiskPreFail,477][CSD_4]

 

[2020-04-13 18:36][67511694.800021]   [16899370581][15000009e0012][INFO][Send disk impending halflife alarm , (user   farmed id CTE0, inner frameid id 0, slot 1, temperature 0, SN   A0116092008350000470).][BDM_BA][bdmSendDiskHalfLifeAlm,629][TP_BDM_THREAD_P]

 

[2020-04-13 18:36][46609783.314066]   [11667302299][15000009a0288][INFO][Disk(sdevId 32, frameId 0, slotId 1) pre   invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2716][CSD_4]
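When going through a long message log, it can help to pull the SMART fields out of the pre-fail entry automatically. The following is a minimal sketch, assuming the entry always looks like the single sample above; the regular expression, the function name `parse_smart_prefail`, and the "current value at or below threshold" check are illustrative assumptions, not a documented log format.

```python
import re

# Pattern modeled on the one sample line in this post; the field layout
# is an assumption, not a documented format.
SMART_RE = re.compile(
    r"Smart info of disk (?P<disk>\d+) is abnormal, "
    r"smart id\s+(?P<attr>\d+)\.\s*smart curVal:(?P<cur>\d+), "
    r"smart worst:(?P<worst>\d+), threshold:(?P<thr>\d+)"
)

def parse_smart_prefail(line):
    """Return the SMART fields from a message-log line, or None if absent."""
    # Collapse whitespace runs so padded or wrapped log text still matches.
    m = SMART_RE.search(" ".join(line.split()))
    if m is None:
        return None
    info = {k: int(v) for k, v in m.groupdict().items()}
    # Assumption: the attribute counts as failing when the current
    # normalized value is at or below the threshold.
    info["below_threshold"] = info["cur"] <= info["thr"]
    return info

sample = (
    "[2020-04-13 18:36][46609783.314054]   [11667302299][15000009a0248][ERR]"
    "[Smart info of disk 32 is abnormal, smart id   231.smart curVal:9, "
    "smart worst:9, threshold:10.][BDM_HDM] [hdmRtDoComAtaDiskPreFail,477][CSD_4]"
)
print(parse_smart_prefail(sample))
```

For the sample entry, the parsed disk ID is 32, which matches the system disk ID found in bdm_info in step 2.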

 

Scenario 2 (The storage version is V300R003C10 or later, V300R006C20, Dorado V300R001C21SPH105, or earlier than V500R007C10):

1. Run the show disk general disk_id=CTE0.1 command, or check on the DM, to confirm that the CTE0.1 disk whose indicator is steady red is in the normal state.


2. Check storage alarms. A controller fault alarm is generated. 

The error code is 0x4000cf4d, indicating that the system disk is about to fail.

2019-08-15   17:21:32    0xF00CF005F    Fault      Major    Unrecovered    None      Controller (Controller Enclosure CTE0, controller B, item 03056450, SN   21030564509WJ2000083) is faulty. Error code: 0x4000cf4d.

 

3. Check \DataCollect\System_log\log_controller_0B\Messages\SES_log_xxxx_mem.txt for SES log.

Search for the keyword disk led opt. 

The indicator is frequently lit up.

[1407][xx.xx.xx.xx.xx.xx.xx.xx][xxxxxxxx]SC:disk led opt 1, 0, 1

[1408][xx.xx.xx.xx.xx.xx.xx.xx][xxxxxxxx]SC:disk led opt 1, 0, 1

The 1st parameter is the slot number.

The 2nd parameter is the disk indicator type: 0 for the disk fault indicator, 1 for the disk location indicator.

The 3rd parameter is the indicator switch: 0 for off, 1 for on.

Therefore, disk led opt 1, 0, 1 indicates that the fault indicator of the hard disk in slot 1 is turned on.
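The three parameters can also be decoded with a small script when the SES log contains many such entries. This is only a sketch based on the parameter meanings described in this post; the function name `decode_led_opt` and the dictionary names are hypothetical, and values other than 0 and 1 are reported as "unknown" because the post does not describe them.

```python
import re

# Matches the "disk led opt <slot>, <type>, <switch>" keyword from the SES log.
LED_RE = re.compile(r"disk led opt\s+(\d+),\s*(\d+),\s*(\d+)")

# Meanings taken from the parameter description in this post.
INDICATOR_TYPES = {0: "fault", 1: "location"}
SWITCH_STATES = {0: "off", 1: "on"}

def decode_led_opt(line):
    """Decode one SES log line, or return None if the keyword is absent."""
    m = LED_RE.search(line)
    if m is None:
        return None
    slot, kind, state = (int(g) for g in m.groups())
    return {
        "slot": slot,
        "indicator": INDICATOR_TYPES.get(kind, "unknown"),
        "state": SWITCH_STATES.get(state, "unknown"),
    }

line = "[1407][xx.xx.xx.xx.xx.xx.xx.xx][xxxxxxxx]SC:disk led opt 1, 0, 1"
print(decode_led_opt(line))
# {'slot': 1, 'indicator': 'fault', 'state': 'on'}
```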

 

4. Check the message log of the faulty controller and search for hdmRtDealPreFail.

The indicator is frequently turned on.

[2020-04-13 18:36][40518884.293630]   [][15000009a0288][INFO][BDM_RT:Disk(SdevId(32), FrameId(0), SlotId(1)) pre   invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2581][CSD_10]

[2020-04-13 18:36][40522479.893657]   [][15000009a0288][INFO][BDM_RT:Disk(SdevId(32), FrameId(0), SlotId(1)) pre   invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2581][CSD_1]

[2020-04-13 18:36][40526075.438187]   [][15000009a0288][INFO][BDM_RT:Disk(SdevId(32), FrameId(0), SlotId(1)) pre   invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2581][CSD_6]
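To judge how often the indicator is being turned on, the hdmRtDealPreFail entries can be counted per disk. The sketch below assumes the two entry layouts that appear in this post (with and without the BDM_RT prefix); the regular expression and the function name `count_prefail_led_events` are illustrative, and the sample lines are abbreviated copies of the entries shown above.

```python
import re
from collections import Counter

# Accepts both layouts seen in this post:
#   Disk(sdevId 32, frameId 0, slotId 1)
#   Disk(SdevId(32), FrameId(0), SlotId(1))
PREFAIL_RE = re.compile(
    r"Disk\(\s*sdevId\W+(\d+)\W+frameId\W+(\d+)\W+slotId\W+(\d+)\)",
    re.IGNORECASE,
)

def count_prefail_led_events(lines):
    """Count hdmRtDealPreFail entries per (sdevId, frameId, slotId)."""
    counts = Counter()
    for line in lines:
        if "hdmRtDealPreFail" not in line:
            continue
        m = PREFAIL_RE.search(line)
        if m:
            counts[tuple(int(g) for g in m.groups())] += 1
    return counts

# Abbreviated sample entries (timestamps and sequence fields omitted).
sample_lines = [
    "[INFO][BDM_RT:Disk(SdevId(32), FrameId(0), SlotId(1)) pre invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2581][CSD_10]",
    "[INFO][BDM_RT:Disk(SdevId(32), FrameId(0), SlotId(1)) pre invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2581][CSD_1]",
    "[INFO][Disk(sdevId 32, frameId 0, slotId 1) pre invalid set alarm litght, result 0.][BDM_HDM][hdmRtDealPreFail,2716][CSD_4]",
    "[ERR][Smart info of disk 32 is abnormal][BDM_HDM][hdmRtDoComAtaDiskPreFail,477][CSD_4]",
]
print(count_prefail_led_events(sample_lines))
```

A high count for (32, 0, 1) within a short time window is consistent with the indicator of slot CTE0.1 being repeatedly turned on because of the system disk.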


Solution

1. If the system disk is about to fail, replace the faulty controller.

After the controller is replaced, the disk fault indicator is cleared.

 

2. If the disk fault indicator is turned on by mistake, upgrade the system to V300R006C20, Dorado V300R001C21SPH105, V500R007C10, or a later version.

 

3. Check the storage alarms and check whether the disk alarm indicator of slot CTE0.1 has recovered.

That's all, thanks!
