Hi team, here's a new case.
Problem Symptom
An alarm indicating a faulty storage pool is generated.
The OS is damaged and cannot be logged in to.
The OS can be logged in to, but the configurations are lost.
Problem Diagnosis
Check whether an alarm indicating a faulty storage pool is generated.
If yes, go to 2.
If no, this document is not applicable.
Check whether you can log in to the server using SSH.
If yes, go to 4.
If no, go to 3.
Log in to the BMC management page of the server, and check whether the OS is running properly.
If yes, go to 4.
If no, reinstall the FSA node (system disks do not need to be replaced). For details, see Parts Replacement in FusionStorage Block Storage Product Documentation.
Log in to the active FSM node, switch to user root, and run the following command to query the roles assigned to the faulty node:
sh /opt/dsware/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh Management IP address of the faulty node
For example, sh /opt/dsware/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh 192.168.10.2
The operation is successful if information similar to the following is displayed:
ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1
MDC_ID:1,MDC_PORT:10530,STORAGE_IP1:192.168.10.2,STORAGE_IP2:192.168.10.2
VFS_ID:2,VFS_PORT:11901,VFS_DEV:Bond0
POOL_ID:0
In the preceding information, 1 in the first line indicates that the node has the corresponding role, and 0 indicates that the node does not have the corresponding role.
Log in to the server and check whether the configuration files of the processes obtained in 4 and the FSA configuration file on the faulty node are lost.
If yes, no further action is required.
If no, this document is not applicable.
Configuration file paths are as follows:
MDC: /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg
ZooKeeper: /opt/fusionstorage/persistence_layer/agent/zk/conf/zoo.cfg
OSD: /opt/fusionstorage/persistence_layer/osd/conf/osd_*_conf.cfg, in which * indicates the ID of the storage pool to which the OSD process belongs
FSA: /opt/fusionstorage/agent/conf/dsware_agent_conf and /opt/fusionstorage/agent/conf/dsware_cluster_info
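The two checks above (which roles the node holds, and which configuration files are lost) can be rehearsed with a small shell sketch. The role string is the sample output shown earlier; the paths are the ones listed above (the OSD path is omitted because its pool-ID wildcard varies per deployment):

```shell
# Parse the role flags from the get_server_role.sh output (sample values).
roles='ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1'
held=""
for pair in $(echo "$roles" | tr ',' ' '); do
  # A flag of 1 means the node holds that role.
  [ "${pair##*:}" = "1" ] && held="$held ${pair%%:*}"
done
echo "roles on node:$held"

# Check which of the listed configuration files are missing on this node.
missing=0
for f in \
  /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg \
  /opt/fusionstorage/persistence_layer/agent/zk/conf/zoo.cfg \
  /opt/fusionstorage/agent/conf/dsware_agent_conf \
  /opt/fusionstorage/agent/conf/dsware_cluster_info
do
  [ -e "$f" ] || { echo "MISSING: $f"; missing=$((missing + 1)); }
done
echo "missing files: $missing"
```

If any file is reported missing for a role the node holds, the restoration steps in the Solution apply.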
Causes
The storage pool becomes faulty due to FSA node configuration loss.
Restore the node configurations by performing the emergency handling operations.
Solution
Log in to the active FSM node, switch to user root, and run the following command to determine the roles assigned to the faulty node. Then perform the corresponding restoration operations based on those roles.
sh /opt/fusionstorage/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh Management IP address of the faulty node
For example, sh /opt/fusionstorage/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh 192.168.10.2
The operation is successful if information similar to the following is displayed:
ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1
MDC_ID:1,MDC_PORT:10530,STORAGE_IP1:192.168.10.2,STORAGE_IP2:192.168.10.2
VFS_ID:2,VFS_PORT:11901,VFS_DEV:Bond0
POOL_ID:0
In the preceding information, 1 in the first line indicates that the node has the corresponding role, and 0 indicates that the node does not have the corresponding role.
The second line contains the MDC information, which is required in 4.
The third line contains the VFS information, which is required in 6.
The fourth line contains the storage pool information, which is required in 3 and 6.
Restore the ZooKeeper process configuration.
Log in to the active FSM node as user dsware, run the su - root command to switch to user root, and run the following command:
vim /opt/dsware/manager/webapps/dsware/WEB-INF/FsmConstantConf.xml
Change the value of parameter checkOtherZkWhenRestoreControlNode to false.
Run the following command to restart Tomcat:
su - omm -c restart_tomcat
Switch to user dsware and run the following commands:
cd /opt/dsware/client/bin
./dswareTool.sh --op restoreControlNode -ip ip -zkDiskSlot(optional) slotNum -formatZkDiskFlag(optional) true(default)/false -partitionName(optional) partitionName(such as sda10)
Parameter descriptions:
./dswareTool.sh: The executable program of the tool.
--op restoreControlNode: Restores the management node.
-ip ip: Specifies the IP address of the management node.
-zkDiskSlot(optional) slotNum: Optional. Mandatory only when a SAS, SATA, or SSD disk serves as the ZooKeeper disk and the slot of the ZooKeeper disk has changed.
-formatZkDiskFlag(optional) true(default)/false: Specifies whether to format the ZooKeeper disk during the restoration. The default value is true. Set this parameter to false if all management nodes are faulty but the ZooKeeper disk data is not lost; the management node is then restored without formatting the ZooKeeper disk.
-partitionName(optional) diskPartition: Specifies the name of the ZooKeeper partition. This parameter is mandatory when the ZooKeeper process is deployed on a system partition and the partition cannot be formatted during management node restoration. For example, enter sda14 as the partition name.
For example:
dsware@FSM:/opt/dsware/client/bin> ./dswareTool.sh --op restoreControlNode -ip 192.168.36.28
Fri Jan 3 14:35:01 CST 2014 DswareTool operation start.
Operation finish successfully. Result Code:0
Fri Jan 3 14:35:31 CST 2014 DswareTool operation end.
If the parameter used for specifying whether to format the ZooKeeper disk is left blank or set to true and the ZooKeeper disk uses the system partition, ensure that the partition has been attached before running this command.
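As a further sketch, an invocation combining the optional flags described above would look as follows. The command is printed rather than executed; the IP address and partition name are illustrative values, not taken from a real deployment:

```shell
# Illustrative only: restoreControlNode without formatting the ZooKeeper disk,
# naming the system partition that hosts ZooKeeper (example values).
cmd="./dswareTool.sh --op restoreControlNode -ip 192.168.36.28 -formatZkDiskFlag false -partitionName sda14"
echo "$cmd"
```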
Run the su - root command to switch to user root and run the following command:
vim /opt/dsware/manager/webapps/dsware/WEB-INF/FsmConstantConf.xml
Change the value of parameter checkOtherZkWhenRestoreControlNode to true.
Run the following command to restart Tomcat:
su - omm -c restart_tomcat
Restore the OSD process configuration.
cd /opt/dsware/client/bin
./dswareTool.sh --op restoreStorageNode -ip Management IP address of the node to be restored -p poolId
If the node to be restored contains multiple storage pools, run the preceding restoration command for each storage pool.
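The per-pool repetition above can be sketched as a simple loop. This is a dry run that only prints the commands; the node IP and the pool IDs are example values that would come from the get_server_role.sh output:

```shell
# Dry run: print one restoreStorageNode command per storage pool on the node.
NODE_IP=192.168.10.2   # example management IP of the node to be restored
count=0
for pool in 0 1; do    # example pool IDs; use the POOL_ID values from step 1
  echo "./dswareTool.sh --op restoreStorageNode -ip $NODE_IP -p $pool"
  count=$((count + 1))
done
echo "commands printed: $count"
```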
If SSD cards are used as the storage media, log in to the faulty node and run the following command to check whether the ssd_disk_info file is lost:
cat /opt/fusionstorage/agent/conf/ssd_disk_info
If the file does not exist, obtain the value of POOL_ID from the command output in 1, log in to a properly running node, run the following command to query the ssd_disk_info file, and copy the file content:
cat /opt/fusionstorage/agent/conf/ssd_disk_info

Run the following commands on the faulty node:
cd /opt/fusionstorage/agent/conf
vi ssd_disk_info
Paste the content of the ssd_disk_info file copied from the properly running node into the file being edited.
Run the following commands to obtain the OSD ESN and DISK SLOT values based on the IP address of the faulty node. Then, using the mapping between DISK SLOT and slot_no in ssd_disk_info, replace the value of disk_sn in ssd_disk_info with the OSD ESN value:
cd /opt/fusionstorage/agent/tool
./dsware_insight 0 mdc_id mdc_ip mdc_port 8 101 poolId
In the command, mdc_id, mdc_ip, and mdc_port specify the ID, storage IP address, and port number of a properly running MDC node, respectively, and poolId specifies the ID of the specified storage pool.

Run the following command on the faulty node. Compare the prefix of the new disk_sn value (for example, 0a8cb00b01284ec70f9f589af560b in 0a8cb00b01284ec70f9f589af560b_64) with the Esn information in the command output to obtain the suffix of the corresponding Location information (for example, 76 in 0:76). Replace the value of phy_no in ssd_disk_info with the obtained value, then save and close ssd_disk_info:
cat /proc/smio_host

Run the following command to change the file permissions:
chmod 600 ssd_disk_info
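The edits above can be rehearsed on a sample file. Note that the ssd_disk_info field layout, the field names, and the replacement values below are assumptions made for illustration, not the documented file format:

```shell
# Rehearsal on a sample file; format, field names, and values are assumed.
cat > ssd_disk_info.sample <<'EOF'
slot_no=3
disk_sn=OLD_SERIAL
phy_no=0
EOF
# Replace disk_sn with the OSD ESN and phy_no with the Location suffix.
sed -i 's/^disk_sn=.*/disk_sn=0a8cb00b01284ec70f9f589af560b/' ssd_disk_info.sample
sed -i 's/^phy_no=.*/phy_no=76/' ssd_disk_info.sample
chmod 600 ssd_disk_info.sample
cat ssd_disk_info.sample
```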
Obtain POOL_ID from the command output in 1, log in to the active FSM node as user dsware, and run the restoreStorageNode command shown above to restore the OSD processes and configurations of the faulty node.
Restore the MDC process configuration of the faulty node based on that of other nodes.
Go to 5.
Copy configuration file /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg from a normal MDC node to the /opt/fusionstorage/persistence_layer/mdc/conf directory on the faulty node. Set permissions for the mdc_conf.cfg file and ensure that the file permissions are the same as those on the properly running nodes.
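The permission-matching part of the step above can be sketched as follows. The sample files and the 640 mode are assumptions; the point is to read the mode from the healthy node's copy and apply it to the restored file:

```shell
# Stand-in for the healthy node's mdc_conf.cfg with its existing permissions.
echo "reference config" > mdc_conf.cfg.ref
chmod 640 mdc_conf.cfg.ref        # assumed mode on the healthy node
# Stand-in for the file just copied to the faulty node.
echo "restored config" > mdc_conf.cfg.new
# Read the reference mode and apply it to the restored copy.
mode=$(stat -c '%a' mdc_conf.cfg.ref)
chmod "$mode" mdc_conf.cfg.new
stat -c '%a' mdc_conf.cfg.new
```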
Obtain the values of MDC_ID, STORAGE_IP1, and STORAGE_IP2 from the command output in 1.
Log in to the faulty node and run the following commands to restore the MDC process configuration of the faulty node:
cd /opt/fusionstorage/persistence_layer/agent/tool/emergency/fsa/recover_conf
sh update_mdc_cfg_based_mdc.sh mdc_id storage_ip1 storage_ip2
The operation is successful if the following information is displayed:
finished to update mdc config file and agent monitor file
Restore the VBS process configuration.
If the HyperMetro service is in use, restore the HyperMetro service first, and then restore the mount point information. For details, see Emergency Handling for the Configuration Loss of Disaster Recovery (DR) Node.
In FusionSphere scenarios, forcibly stop the VM, restart it, and perform the restoration operations. For details, see FusionCompute Product Documentation.
In other scenarios, log in to the FusionStorage web client, choose Monitor > Alarms and Events > Events, query the logs of attaching and detaching volumes, export the system operation logs, and attach volumes based on the logs.

Log in to the active FSM node as user dsware and run the dswareTool.sh --op queryUpgradeMode command to check whether the system is in upgrade mode. If it is, run the dswareTool.sh --op exitUpgradeMode command to take the system out of upgrade mode. An example is provided as follows.
[dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op queryUpgradeMode
[Fri Jan 13 08:08:18 UTC 2017] DswareTool operation start.
Enter User Name:cmdadmin
Enter Password :
Operation finish successfully. Result Code:0
isUpgradeMode: true
[Fri Jan 13 08:08:31 UTC 2017] DswareTool operation end.
[dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op exitUpgradeMode
This operation is high risk,please input y to continue:y
[Fri Jan 13 08:11:15 UTC 2017] DswareTool operation start.
Enter User Name:cmdadmin
Enter Password :
Operation finish successfully. Result Code:0
[Fri Jan 13 08:11:26 UTC 2017] DswareTool operation end.
If the values of MDC_ID, MDC_PORT, and STORAGE_IP1 can be obtained from 1, go to 5.d. Otherwise, go to 5.c.
Log in to the FusionStorage web client, obtain the management IP address of any MDC node. Go to 1 to obtain the values of MDC_ID, MDC_PORT, and STORAGE_IP1. (The management IP address is used as the input parameter for running the get_server_role.sh command.)
Log in to the node represented by STORAGE_IP1 and run the following command: /opt/fusionstorage/agent/tool/dsware_insight 0 mdc_id storage_ip1 mdc_port 8 123 | grep -w "vbs_storage_ip" | cut -d"|" -f6. If multiple VBS nodes are available in the system, run the command repeatedly. In the command, storage_ip1 specifies the storage IP address obtained in the preceding step, and vbs_storage_ip specifies the storage IP address of each VBS node.
For example:

If CAN BE MASTER is displayed, set typeflag to 0. Otherwise, set typeflag to 1.
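The rule above can be sketched in shell. The sample output string is an assumption for illustration; only the presence or absence of "CAN BE MASTER" matters:

```shell
# Map the dsware_insight output to the -nodetype value:
# "CAN BE MASTER" present means typeflag 0, otherwise 1.
output='vbs state: CAN BE MASTER'   # assumed sample output
case "$output" in
  *"CAN BE MASTER"*) typeflag=0 ;;
  *)                 typeflag=1 ;;
esac
echo "typeflag=$typeflag"
```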
Log in to the active FSM node as user dsware and run the following commands to restore VBS processes and configuration on the faulty node:
cd /opt/dsware/client/bin
./dswareTool.sh --op createDSwareClient -ip Management IP address of the node to be restored -nodetype typeflag
If an FSA of a version that is not in use exists in the cluster, perform the following operations before running the restoration command: log in to the active and standby FSM nodes as user dsware, switch to user root, run the vi /opt/dsware/manager/webapps/dsware/WEB-INF/DSwarePreCheckConfig.xml command, change the value of preCheckSwitch to 0, save the configuration, and run the su - omm -c restart_tomcat command to restart the Tomcat service. After restoring the VBS processes, restore the original value of preCheckSwitch on both FSM nodes and restart the Tomcat service.

Restore the mount point information.
Restore the KVS process configuration.
If an FSA of a version that is not in use exists in the cluster, perform the following operations before running the restoration command: log in to the active and standby FSM nodes as user dsware, switch to user root, run the vi /opt/dsware/manager/webapps/dsware/WEB-INF/DSwarePreCheckConfig.xml command, change the value of preCheckSwitch to 0, save the configuration, and run the su - omm -c restart_tomcat command to restart the Tomcat service. After the KVS processes are restored, restore the original value of preCheckSwitch on both FSM nodes and restart the Tomcat service.

Log in to the active FSM node as user dsware and run the dswareTool.sh --op queryUpgradeMode command to check whether the system is in upgrade mode. If it is, run the dswareTool.sh --op exitUpgradeMode command to take the system out of upgrade mode. An example is provided as follows.
[dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op queryUpgradeMode
[Fri Jan 13 08:08:18 UTC 2017] DswareTool operation start.
Enter User Name:cmdadmin
Enter Password :
Operation finish successfully. Result Code:0
isUpgradeMode: true
[Fri Jan 13 08:08:31 UTC 2017] DswareTool operation end.
[dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op exitUpgradeMode
This operation is high risk,please input y to continue:y
[Fri Jan 13 08:11:15 UTC 2017] DswareTool operation start.
Enter User Name:cmdadmin
Enter Password :
Operation finish successfully. Result Code:0
[Fri Jan 13 08:11:26 UTC 2017] DswareTool operation end.
Log in to the active FSM node as user dsware and run the following commands to restore KVS processes and configuration on the faulty node:
cd /opt/dsware/client/bin
./dswareTool.sh --op createKvsClient -ip Management IP address of the node to be restored
Restore the OpenSM process configuration.
If the OpenSM process is deployed on the faulty node, create a new OpenSM process on another node. For details, see Installation and Commissioning in the FusionStorage Block Product Documentation.
Clear the backup data of the restored node on the FSM node and manually back up the data.
To ensure the backup data security, log in to the active FSM node as user dsware and run the following command to manually back up the data:
/opt/dsware/client/bin/dswareTool.sh --op backupMetadata -label labelName
The following figure provides an example.

Log in to the active FSM node as user dsware, run the su - root command to switch to user root, and run the following commands to delete the automatic backup data of the restored FSA node:
cd /opt/dsware/tools/metadataBackup
rm -rf dsware_autoback_*_szxa.*.Management IP address of the restored node.tar.gz
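A more cautious variant of the cleanup above is to list the matching backup archives before removing anything. The node IP below is an example value; the filename pattern is the one from the rm command:

```shell
# List matching backup archives first; delete only after reviewing the list.
NODE_IP=192.168.10.2   # example management IP of the restored node
matches=$(ls dsware_autoback_*_szxa.*."$NODE_IP".tar.gz 2>/dev/null || true)
if [ -z "$matches" ]; then
  echo "no backup archives matched"
else
  printf '%s\n' "$matches"
fi
```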





