Hi team!
Here's a case in which alarms about controller faults are reported on OceanStor 2200 V3 and 2600 V3.
Fault Description
While the storage devices are running, alarms about controller faults are reported on DeviceManager.
Symptom
An alarm about a faulty controller is reported (alarm code: 0xF00CF005F).
The error code is 0x4000cf3b.
2020-01-25 10:32 | 0xF00CF005F | Major | None | Controller (Controller Enclosure CTE0, controller B, item 03057201, SN 21XXXXXXXXXXGB000XXX) is faulty. Error code: 0x4000cf3b.
Collect all related information and contact technical support engineers to replace the controller.
2020-01-25 10:32 | 0xF00CF005F | Major | None | Controller (Controller Enclosure CTE0, controller B, item 03057201, SN 21XXXXXXXXXXGC00XXX) is faulty. Error code: 0x4000cf3b.
Cause
When the system disk of the controller processes only a few I/Os, the slow disk policy makes an incorrect judgment for SAS disks, causing the system to falsely report a controller fault.
Identification Method
1. Confirm symptoms.
On DeviceManager, check alarm information. The alarm code is 0xF00CF005F, and the error code is 0x4000cf3b.
2. Confirm the version.
OceanStor V300R005C00 series: V300R005C00SPC300
3. Check logs.
Search for the keyword "SAL_ProcessSlowDisk" and compare the time after "ms in" with the time after "idle:". The time after "idle:" is in ns, so convert it to ms first (ns/1000000). If the converted "idle:" time is greater than the time after "ms in", the logs match this problem.
In the example logs, the time after "idle:" is 1800541571380 ns (1800541 ms), which is greater than 1800080 ms.
If the preceding three conditions are met, this problem can be identified.
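The log comparison in step 3 is simple arithmetic, and the following minimal Python sketch reproduces it with the values quoted above. The function names and the variable name window_ms (for the value after "ms in") are hypothetical, and the values are assumed to be copied by hand from the log, because the exact log line layout is not shown in this post.

```python
NS_PER_MS = 1_000_000  # 1 ms = 1,000,000 ns


def idle_ns_to_ms(idle_ns: int) -> int:
    """Convert the value after "idle:" from ns to whole ms (ns/1000000)."""
    return idle_ns // NS_PER_MS


def log_matches_problem(idle_ns: int, window_ms: int) -> bool:
    """Return True if the converted "idle:" time exceeds the "ms in" time.

    window_ms is a hypothetical name for the value read after "ms in".
    """
    return idle_ns_to_ms(idle_ns) > window_ms


if __name__ == "__main__":
    idle_ns = 1800541571380   # value after "idle:" in the example logs
    window_ms = 1800080       # value it is compared against, in ms
    print(idle_ns_to_ms(idle_ns))                   # converted idle time in ms
    print(log_matches_problem(idle_ns, window_ms))  # whether condition 3 holds
```

This only covers the third condition. The alarm code and the software version still have to be confirmed as described in steps 1 and 2.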
Solutions
For the V300R005C00SPC300 version of the OceanStor V300R005C00 series, install the V300R005C00SPH302 patch.
If DeviceManager reports alarms indicating that the write cache is disabled and the controller cache has entered write through mode, run the change system force_write_back switch=all_data_force_writeback command.
Alternatively, perform step 4 in Workarounds to reset all controllers in sequence to recover the write cache function.
Workarounds
Run the specified commands to disable the slow disk statistics function of system disks, and then reset controllers.
The procedure is as follows:
1. Run the change user_mode current_mode user_mode=developer command to enter the developer mode.
2. Run the change slow_system_disk monitor_switch switch=off command to disable the slow disk statistics function.
3. Run the show slow_system_disk monitor_switch command to check whether the function is disabled.
4. Run the reboot controller controller=<Controller ID> command to reset all controllers in sequence.
(Before resetting a controller, confirm that all controllers are online and their states are running.)
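Because the order of the workaround steps matters, it can help to write the full command sequence out before opening the CLI session. The sketch below only builds the command strings from the steps above and does not connect to or run anything on the storage system. The function name and the controller IDs in the example are hypothetical placeholders, so substitute the IDs that your system reports.

```python
def workaround_commands(controller_ids):
    """Build the ordered CLI command list for the workaround.

    controller_ids: controller IDs in the order they should be reset.
    The command text is copied from the steps above; nothing is executed.
    """
    commands = [
        # Step 1: enter the developer mode.
        "change user_mode current_mode user_mode=developer",
        # Step 2: disable the slow disk statistics function.
        "change slow_system_disk monitor_switch switch=off",
        # Step 3: check whether the function is disabled.
        "show slow_system_disk monitor_switch",
    ]
    # Step 4: reset the controllers one at a time, in the given order.
    for controller_id in controller_ids:
        commands.append(f"reboot controller controller={controller_id}")
    return commands


if __name__ == "__main__":
    # "CTRL_1" and "CTRL_2" are placeholders, not real controller IDs.
    for number, command in enumerate(workaround_commands(["CTRL_1", "CTRL_2"]), 1):
        print(f"{number}. {command}")
```

Run the commands one at a time. Before each reboot command, confirm that all controllers are online and running, as noted in step 4.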
Check After Recovery
After the controllers are reset, check whether the alarms are cleared. Then perform the following steps to check whether the slow disk statistics function is still disabled.
1. Run the change user_mode current_mode user_mode=developer command to enter the developer mode.
2. Run the show slow_system_disk monitor_switch command to confirm that the function is disabled.