Method used to isolate disks

1. Locating procedure:
This method is used to isolate disks that affect system services or the normal running of the system.
a. In most bilateral isolation cases, it is recommended that the pre-failure configuration be used.
b. For error code isolation and intermittent disconnection isolation, use unilateral isolation and access rather than slow disk isolation.
2. Solution
a. Bilateral isolation and access (simulating disk rejection and acceptance)
Note:
Disk rejection is mostly used for RAID recovery. In other cases, it is recommended that the pre-failure configuration be used for disk rejection.
Isolation: Log in to MML mode on the storage system and enter dev setdiskout enclosure ID slot ID, where enclosure ID and slot ID are the enclosure number and slot number of the disk.
Access: Log in to MML mode on the storage system and enter dev setdiskin enclosure ID slot ID, where enclosure ID and slot ID are the enclosure number and slot number of the disk.
b. Unilateral isolation and access
Note:
Unilateral isolation and access are mostly used for the error code isolation and intermittent disconnection isolation modes.
Error code isolation and intermittent disconnection isolation mostly result in unilateral isolation (that is, the disk is left with a single link). In this case, reconnect the disk to the system over the isolated link and continue to use the disk. If the disk drops to a single link again later, replace the disk.
Log in to the controller from which the disk is isolated.
Unilateral isolation: Log in to MML mode on the storage system and enter dev setdiskout enclosure ID slot ID 1.
Unilateral access: Log in to MML mode on the storage system and enter dev setdiskin enclosure ID slot ID 1.
After unilateral isolation is performed, the disk indicator on the OSM page turns yellow. After unilateral access is performed, the disk indicator returns to normal.
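
The command lines above can be assembled as in the following minimal Python sketch. The function name build_isolation_command and its parameters are hypothetical and exist only for illustration. The only parts taken from this procedure are the dev setdiskout and dev setdiskin commands and the trailing 1 of the unilateral form. The sketch only builds the text to type in MML mode; it does not connect to a storage system.

```python
def build_isolation_command(action, enclosure_id, slot_id, unilateral=False):
    """Return the MML command text for isolating or re-admitting a disk.

    action: "isolate" maps to dev setdiskout, "access" maps to dev setdiskin.
    unilateral: when True, append the trailing 1 used by the unilateral form.
    (Hypothetical helper; not part of any vendor tool.)
    """
    verbs = {"isolate": "setdiskout", "access": "setdiskin"}
    if action not in verbs:
        raise ValueError("action must be 'isolate' or 'access'")
    if enclosure_id < 0 or slot_id < 0:
        raise ValueError("enclosure ID and slot ID must be non-negative")

    parts = ["dev", verbs[action], str(enclosure_id), str(slot_id)]
    if unilateral:
        parts.append("1")
    return " ".join(parts)


if __name__ == "__main__":
    # Bilateral isolation of the disk in enclosure 0, slot 8.
    print(build_isolation_command("isolate", 0, 8))
    # Unilateral access for the same disk.
    print(build_isolation_command("access", 0, 8, unilateral=True))
```

Keeping the bilateral and unilateral forms behind one flag makes it harder to omit the trailing 1 when the unilateral form is intended.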

Other related questions:
Methods used to isolate networks between desktop service domains
Each desktop service domain can be physically isolated by cluster or logically isolated by using VLANs within the cluster. In addition, you can use an ACL policy to control user access to the core switches.

Method used to locate the disk slot before replacing a disk
You can locate the disk slot before replacing a disk as follows:
1. Fault location and rectification
a. The method is used before replacing a faulty disk.
b. The method is used before replacing a risky disk.
c. If you stay on the OSM until the session times out and you are logged out, the disk slot locating expires and the disk slot must be located again.
2. Solution
a. Log in to the OSM of the storage system.
b. Select Devices and then the enclosure where the disk to be replaced resides. Click the disk that you want to replace. In the dialog box that is displayed, click Locating hard disks.
c. A dialog box is displayed indicating that the operation is successful. Do not close the dialog box. Otherwise, the disk location indicator is turned off.
d. The slot whose indicator (alarm indicator) blinks red is where the disk to be replaced is located. Replace the disk.
Note:
The location indicator of a disk blinks red, whereas the alarm indicator of a disk is steady red. Distinguish the enclosure alarm indicator from the disk fault indicator. Engineers may mistake the enclosure alarm indicator for the disk alarm indicator and remove the disk in slot (0,8).
After the faulty disk is replaced, return to the OSM and turn off the location function (close the dialog box that is displayed in step c).

Method used to check a slow disk
You can check a slow disk as follows:
1. Checking the OSM alarm
Check on the OSM management interface whether there is a slow disk alarm whose ID is 5613. If the alarm exists, check whether the slow disk is isolated (the disk has completed reconstruction). If the slow disk is not isolated, refer to the relevant disk replacement guides to manually replace the disk.
2. Checking the SES log
Collect the SES log of the storage devices by obtaining SES_log.txt and the bak files under the /OSM/log_conf_local/log/cur_debug directory. Search for the keyword Disk IO Delay and check the slow I/O records and the I/O distribution.
--------------------------Disk IO Delay Count------2012-01-10 02:30:52--------------------
Disk IO Delay Count Threshold: [300ms] [500ms] [700ms] [1000ms]
[0][2][3LM4JYJJ00009844V79S][3, 5, 15, 1]
The above information shows that within five minutes, the disk in slot (0,2) has three I/Os of over 300 ms latency, five I/Os of over 500 ms latency, 15 I/Os of over 700 ms latency, and one I/O of over 1000 ms latency. A disk with long I/O latency may appear frequently in these records. Refer to the relevant disk replacement guides to manually replace the disk. If you have any questions, contact technical support engineers.
3. Checking the message log
Collect the message log of the storage devices by obtaining the message file and the bak files under the /OSM/log_conf_local/log/cur_debug directory. Search for the keyword long time.
Jun 20 14:45:25 OceanStor kernel: [21086119188]mptscsih SLOW IO INFO: cost long time (13135), host id(0), channel id(0), scsi id (14), lun id(0), io lenth (524288), io mode(1), io lba(0x215321088)
The I/O of the SCSI device with scsi id (14) is suspended. Log in to the debug mode of the storage devices, enter lsscsi, and obtain the drive letter corresponding to the SCSI ID. Log in to MML mode and enter dev disk enclosure ID to obtain the drive letters corresponding to the slot IDs.
4. Checking a slow disk
If slow I/O records appear frequently in the logs (SES log and message log) and the time of these records is close to the time when services are affected (such as video freeze), the disk is probably the slow disk that is affecting services.
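
The two log records quoted above can be picked apart with a short script, for example when many log files have to be scanned. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the field layout is inferred only from the two sample records shown above, the third bracketed field of the SES record is treated as an opaque disk identifier, and the unit of the cost long time value is not stated in the source, so it is kept as a plain number.

```python
import re

# Thresholds taken from the "Disk IO Delay Count Threshold" header line.
THRESHOLDS_MS = (300, 500, 700, 1000)

# Data row such as: [0][2][3LM4JYJJ00009844V79S][3, 5, 15, 1]
# The header cells like [300ms] do not match because of the "ms" suffix.
SES_ROW = re.compile(r"\[(\d+)\]\[(\d+)\]\[([^\]]+)\]\[([\d,\s]+)\]")

# Fields of interest in a "cost long time" message-log line.
SLOW_IO = re.compile(r"cost long time \((\d+)\).*?scsi id \((\d+)\)")


def parse_ses_delay_row(line):
    """Return enclosure, slot, disk identifier and per-threshold counts,
    or None if the line is not a Disk IO Delay data row."""
    m = SES_ROW.search(line)
    if m is None:
        return None
    counts = [int(c) for c in re.findall(r"\d+", m.group(4))]
    if len(counts) != len(THRESHOLDS_MS):
        return None
    return {
        "enclosure": int(m.group(1)),
        "slot": int(m.group(2)),
        "disk_id": m.group(3),  # assumed to be an opaque identifier
        "counts": dict(zip(THRESHOLDS_MS, counts)),
    }


def parse_slow_io_line(line):
    """Return the cost value and SCSI ID of a slow I/O record, or None."""
    m = SLOW_IO.search(line)
    if m is None:
        return None
    return {"cost": int(m.group(1)), "scsi_id": int(m.group(2))}


if __name__ == "__main__":
    print(parse_ses_delay_row("[0][2][3LM4JYJJ00009844V79S][3, 5, 15, 1]"))
    print(parse_slow_io_line(
        "Jun 20 14:45:25 OceanStor kernel: [21086119188]mptscsih SLOW IO INFO: "
        "cost long time (13135), host id(0), channel id(0), scsi id (14), lun id(0)"
    ))
```

The parsed SCSI ID still has to be mapped to a slot with lsscsi and dev disk as described in step 3; the sketch does not attempt that mapping.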

Method used to set hot-spare disks
Prerequisite:
The system has free disks.
Precautions:
You need to create hot spare disks to ensure the reliability of the storage device.
For controller enclosures with integrated disks and controllers, the number of hot spare disks must not be smaller than the total number of controller enclosures and disk enclosures.
For controller enclosures with separated disks and controllers, the number of hot spare disks in the storage system must not be smaller than the number of disk enclosures.
The type and capacity of a hot spare disk must be the same as those of the RAID group member disks.
Note:
If the storage system has additional free disks, you can set more hot spare disks for even higher system reliability.
Currently, storage systems support global hot spare disks only, and you cannot assign a hot spare disk to a specific RAID group.
Coffer disks cannot be used as hot spare disks. Only free disks can be set as hot spare disks.
The procedures are as follows:
1. In the ISM navigation tree, choose All Devices > SN_XX > Device Info > Storage Unit > Disks. (SN_XX is the name of the target storage device.)
2. In the function pane of ISM, select the disks that you want to set as hot spare disks.
3. On the upper function pane of ISM, choose Hot Spare Disk > Set. The system displays the result of setting hot spare disks. Upon successful creation of hot spare disks, Logical Type of each selected hard disk changes to Hot Spare Disk. If the operation fails, select another available disk as the hot spare disk based on the error message.
The results are as follows:
After a hot spare disk is created, on the right function pane, the Running Status of the disk is Free Hot Spare Disk.
When faulty member disks result in a RAID group failure, the created hot spare disk can take over the data stored on the faulty member disks to ensure normal operation of the RAID group. In this case, Running Status of the hot spare disk changes to Used Hot Spare Disk.
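
The minimum hot spare count stated in the precautions is simple arithmetic, sketched below in Python. The function names and parameters are hypothetical; the sketch encodes only the two counting rules given above and does not check disk type or capacity.

```python
def minimum_hot_spares(controller_enclosures, disk_enclosures, integrated):
    """Return the minimum number of hot spare disks required.

    integrated=True: controller enclosures with integrated disks and
    controllers, so every controller enclosure and disk enclosure counts.
    integrated=False: controller enclosures with separated disks and
    controllers, so only disk enclosures count.
    """
    if controller_enclosures < 0 or disk_enclosures < 0:
        raise ValueError("enclosure counts must be non-negative")
    if integrated:
        return controller_enclosures + disk_enclosures
    return disk_enclosures


def hot_spare_count_ok(configured, controller_enclosures, disk_enclosures, integrated):
    """Return True if the configured number of hot spares meets the minimum."""
    return configured >= minimum_hot_spares(
        controller_enclosures, disk_enclosures, integrated
    )


if __name__ == "__main__":
    # One integrated controller enclosure plus three disk enclosures.
    print(minimum_hot_spares(1, 3, integrated=True))
    # Same layout with separated disks and controllers.
    print(minimum_hot_spares(1, 3, integrated=False))
```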

Problem and solution when disk isolation occurs
You can perform the following operations when disk isolation occurs:
The following causes may result in disk isolation:
Bit error
Reinserting disks repeatedly
Disk power connection problem
1. Bit error
Check the bit error of back-end SAS disks. Search for the keywords err inc and disable disk phy in the SES log.
Note:
phy:9 phymon***disable disk phy in the log shows that disk phy 9 is isolated. That is, the disk in slot 9 is isolated (phy0 to phy23 correspond to disks 0 to 23).
Troubleshooting:
a. Before removing a faulty disk, collect S.M.A.R.T. information.
b. If conditions permit, insert the isolated disk into another slot to check whether the isolation is caused by the disk or by the slot. If the isolation is caused by the disk, apply for disk replacement. If the isolation is caused by the slot, check whether the slot contains any foreign objects.
Check the bit error on Fibre Channel disks. Search the SES log for the keyword lcv, which indicates Fibre Channel bit errors. If HD 0 and lcv ffff are displayed, a large quantity of bit errors are produced in slot 0 and cause disk isolation. Back-end Fibre Channel bit errors can spread from the port to the disk. If a Fibre Channel disk is isolated, check whether bit errors occur on the port by using either of the following methods:
Check on the ISM.
Enter fc allinfo in MML mode.
Note:
If any information displayed is not 0, bit errors exist.
If bit errors are detected on the port, verify whether bit errors are generated in the link. For details about how to verify, see the troubleshooting cases for a single link failure of the Fibre Channel enclosure disk caused by bit errors.
Troubleshooting:
If only one disk fails, verify the failure by using the above method. If a link fails, replace the optical module and optical cables, and then verify the failure again. If no link fails, use the same method as that used for SAS disks.
If multiple disks are faulty, refer to the troubleshooting cases for a single link failure of the Fibre Channel enclosure disk caused by bit errors.
2. Reinserting disks repeatedly
Note:
A drive can isolate a disk from the others if intermittent disconnections occur on that disk. Reinserting disks repeatedly may lead to disk isolation.
Verify whether the disk was reinserted many times within a short period. If so, the repeated reinsertion may have caused the disk isolation.
Troubleshooting:
Reinsert the disk.
3. Disk power connection problem
Note:
If the disk enclosure is subjected to violent shaking, the disk power connection may become loose and the disk is isolated.
Troubleshooting:
Contact R&D engineers for further analysis.
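
The SAS keyword search described under Bit error can be sketched as a small Python filter. The exact SES log layout is not shown in this document (the sample is abbreviated with ***), so the pattern below assumes only that phy:<number> appears before disable disk phy on the same line. The function name is hypothetical, and the phy-to-slot mapping follows the note that phy0 to phy23 correspond to disks 0 to 23.

```python
import re

# Assumption: "phy:<n>" precedes "disable disk phy" on the same log line.
ISOLATED_PHY = re.compile(r"phy:\s*(\d+).*disable disk phy")


def isolated_slots(log_lines):
    """Return the slot numbers (in first-seen order, without duplicates)
    of disks whose phy is reported as disabled in the given log lines."""
    slots = []
    for line in log_lines:
        m = ISOLATED_PHY.search(line)
        if m is None:
            continue
        phy = int(m.group(1))
        # phy0 to phy23 correspond to disks 0 to 23; ignore anything else.
        if 0 <= phy <= 23 and phy not in slots:
            slots.append(phy)
    return slots


if __name__ == "__main__":
    sample = [
        "phy:9 phymon***disable disk phy",
        "unrelated line",
        "phy:9 phymon***disable disk phy",
    ]
    print(isolated_slots(sample))
```

A match only narrows down which slot to inspect; whether the disk or the slot is at fault still has to be confirmed with the troubleshooting steps above.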
