
Storage Pool Faults Caused by FSA Node Configuration Loss


Hi team, here's a new case.


Problem Symptom

  1. An alarm indicating a faulty storage pool is generated.

  2. The OS is damaged and cannot be logged in to.

  3. The OS can be logged in to, but the configurations are lost.


Problem Diagnosis

  1. Check whether an alarm indicating a faulty storage pool is generated.

    If yes, go to 2.

    If no, this document is not applicable.

  2. Check whether you can log in to the server using SSH.

    If yes, go to 4.

    If no, go to 3.

  3. Log in to the BMC management page of the server, and check whether the OS is running properly.

    If yes, go to 4.

    If no, reinstall the FSA node (system disks do not need to be replaced). For details, see Parts Replacement in FusionStorage Block Storage Product Documentation.

  4. Log in to the active FSM node, switch to user root, and run the following command to query the roles assigned to the faulty node:

    sh /opt/dsware/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh Management IP address of the faulty node

    For example, sh /opt/dsware/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh 192.168.10.2

    The operation is successful if information similar to the following is displayed:

    ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1
    MDC_ID:1,MDC_PORT:10530,STORAGE_IP1:192.168.10.2,STORAGE_IP2:192.168.10.2
    VFS_ID:2,VFS_PORT:11901,VFS_DEV:Bond0
    POOL_ID:0

    In the preceding information, 1 in the first line indicates that the node has the corresponding role, and 0 indicates that the node does not have the corresponding role.
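    As a quick cross-check, the role line can be parsed in the shell. This is a minimal sketch that operates on the sample output above; on a live system, substitute the actual first line of the get_server_role.sh output for the hard-coded string.

```shell
# Parse the first line of get_server_role.sh output (sample from this
# document) and list the roles whose flag is 1. The comma-separated
# "ROLE:flag" format follows the example output shown above.
roles_line='ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1'
assigned=""
for pair in $(printf '%s' "$roles_line" | tr ',' ' '); do
  role=${pair%%:*}    # text before the colon, e.g. ZK
  flag=${pair#*:}     # text after the colon, 1 or 0
  if [ "$flag" = "1" ]; then
    assigned="$assigned $role"
  fi
done
echo "Roles assigned to the node:$assigned"
```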
  5. Log in to the faulty node and check whether the configuration files of the processes obtained in 4 or the FSA configuration files are lost.

    If yes, no further action is required.

    If no, this document is not applicable.


    Configuration file paths are as follows:

    1. MDC: /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg

    2. ZooKeeper: /opt/fusionstorage/persistence_layer/agent/zk/conf/zoo.cfg

    3. OSD: /opt/fusionstorage/persistence_layer/osd/conf/osd_*_conf.cfg, in which * indicates the ID of the storage pool to which the OSD process belongs

    4. FSA: /opt/fusionstorage/agent/conf/dsware_agent_conf and /opt/fusionstorage/agent/conf/dsware_cluster_info
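    The existence check in step 5 can be scripted. A minimal sketch, using the paths listed above; the pool ID 0 in the OSD file name is an illustrative assumption, so substitute the real pool IDs on your node.

```shell
# Check whether the configuration files listed above exist on this node.
# Paths are taken from this document; osd_0_conf.cfg assumes pool ID 0.
missing=0
for f in \
    /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg \
    /opt/fusionstorage/persistence_layer/agent/zk/conf/zoo.cfg \
    /opt/fusionstorage/persistence_layer/osd/conf/osd_0_conf.cfg \
    /opt/fusionstorage/agent/conf/dsware_agent_conf \
    /opt/fusionstorage/agent/conf/dsware_cluster_info
do
  if [ ! -e "$f" ]; then
    echo "MISSING: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing of 5 configuration files missing"
```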


Causes

The storage pool becomes faulty because the FSA node configurations are lost. Restore the node configurations by performing the following emergency handling operations.



Solution

  1. Log in to the active FSM node, switch to user root, and run the following command to determine the roles assigned to the faulty node. Then perform the corresponding restoration operations based on those roles:

    sh /opt/fusionstorage/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh Management IP address of the faulty node

    For example, sh /opt/fusionstorage/tools/ops_tool/emergency/fsa/recover_conf/get_server_role.sh 192.168.10.2

    The operation is successful if information similar to the following is displayed:

    ZK:1,MDC:1,VBS:1,VFS:1,OSD:1,KVS:1
    MDC_ID:1,MDC_PORT:10530,STORAGE_IP1:192.168.10.2,STORAGE_IP2:192.168.10.2
    VFS_ID:2,VFS_PORT:11901,VFS_DEV:Bond0
    POOL_ID:0

    In the preceding information, 1 in the first line indicates that the node has the corresponding role, and 0 indicates that the node does not have the corresponding role.

    The second line contains the MDC information, which is required in 4.

    The third line contains the VFS information, which is required in 6.

    The fourth line contains the storage pool information, which is required in 3 and 6.
  2. Restore the ZooKeeper process configuration.

    1. Log in to the active FSM node as user dsware, run the su - root command to switch to user root, and run the following command:

      vim /opt/dsware/manager/webapps/dsware/WEB-INF/FsmConstantConf.xml

      Change the value of parameter checkOtherZkWhenRestoreControlNode to false.

    2. Run the following command to restart Tomcat:

      su - omm -c restart_tomcat

    3. Switch to user dsware and run the following commands:

      cd /opt/dsware/client/bin

      ./dswareTool.sh --op restoreControlNode -ip ip -zkDiskSlot(optional) slotNum -formatZkDiskFlag(optional) true(default)/false -partitionName(optional) partitionName(such as sda10)

      The parameters are described as follows:

      • ./dswareTool.sh: Specifies the executable program for the tool.

      • --op restoreControlNode: Restores the management node.

      • -ip ip: Specifies the IP address of the management node.

      • -zkDiskSlot(optional) slotNum: Optional. This parameter is mandatory only when a SAS, SATA, or SSD disk serves as the ZooKeeper disk and the slot of the ZooKeeper disk has changed.

      • -formatZkDiskFlag(optional) true(default)/false: Specifies whether to format the ZooKeeper disk during the restoration. The default value is true. Set this parameter to false if all management nodes are faulty but the ZooKeeper disk data is not lost; in that case, the management node is restored without formatting the ZooKeeper disk.

      • -partitionName(optional) diskPartition: Specifies the name of the ZooKeeper partition. This parameter is mandatory when the ZooKeeper process is deployed on a system partition and the ZooKeeper partition cannot be formatted during management node restoration. For example, enter sda14 as the partition name.


      For example:

      dsware@FSM:/opt/dsware/client/bin> ./dswareTool.sh --op restoreControlNode -ip 192.168.36.28
      Fri Jan 3 14:35:01 CST 2014 DswareTool operation start.
      Operation finish successfully. Result Code:0
      Fri Jan 3 14:35:31 CST 2014 DswareTool operation end.


      If the parameter used for specifying whether to format the ZooKeeper disk is left blank or set to true and the ZooKeeper disk uses the system partition, ensure that the partition has been attached before running this command.
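      The "partition has been attached" precondition above can be verified before running the command. A minimal sketch, assuming the ZooKeeper system partition is sda14 as in the parameter example; substitute your real partition name.

```shell
# Verify that the ZooKeeper system partition is attached before running
# restoreControlNode with formatting enabled. The partition name sda14 is
# taken from the example in the parameter description above.
part=sda14
mounted=no
grep -q "^/dev/$part " /proc/mounts 2>/dev/null && mounted=yes
echo "ZooKeeper partition /dev/$part mounted: $mounted"
```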

    4. Run the su - root command to switch to user root and run the following command:

      vim /opt/dsware/manager/webapps/dsware/WEB-INF/FsmConstantConf.xml

      Change the value of parameter checkOtherZkWhenRestoreControlNode to true.

    5. Run the following command to restart Tomcat:

      su - omm -c restart_tomcat

  3. Restore the OSD process configuration.

    cd /opt/dsware/client/bin

    ./dswareTool.sh --op restoreStorageNode -ip Management IP address of the node to be restored -p poolId


    If the node to be restored contains multiple storage pools, run the preceding restoration command for each storage pool.
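    The per-pool loop described above can be sketched as follows. The node IP and pool IDs are placeholders; the script only echoes the commands so they can be reviewed, then executed on the active FSM node with the real POOL_ID values from the get_server_role.sh output.

```shell
# Build one restoreStorageNode command per storage pool on the faulty node.
# node_ip and pool_ids are illustrative placeholders.
node_ip=192.168.10.2
pool_ids="0 1"
for pool_id in $pool_ids; do
  cmd="./dswareTool.sh --op restoreStorageNode -ip $node_ip -p $pool_id"
  echo "$cmd"    # review, then execute on the active FSM node
done
```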

    1. If SSD cards are used as the storage media, log in to the faulty node and run the following command to check whether the ssd_disk_info file is lost:

      cat /opt/fusionstorage/agent/conf/ssd_disk_info

      If the file does not exist, obtain the value of POOL_ID from the command output in 1, log in to a properly running node, run the following command to query the ssd_disk_info file, and copy the file content:

      cat /opt/fusionstorage/agent/conf/ssd_disk_info


      Run the following commands on the faulty node:

      cd /opt/fusionstorage/agent/conf

      vi ssd_disk_info

      Copy the ssd_disk_info file on the properly running node to the file to be edited.

      Run the following commands to obtain the OSD ESN and DISK SLOT values based on the IP address of the faulty node. Then, using the mapping between DISK SLOT and slot_no in ssd_disk_info, replace the value of disk_sn in ssd_disk_info with the OSD ESN value:

      cd /opt/fusionstorage/agent/tool

      ./dsware_insight 0 mdc_id mdc_ip mdc_port 8 101 poolId

      In the command, mdc_id, mdc_ip, and mdc_port specify the ID, storage IP address, and port number of a properly running MDC node, respectively, and poolId specifies the ID of the specified storage pool.


      Run the following command on the faulty node. Compare the prefix of the new disk_sn value (for example, 0a8cb00b01284ec70f9f589af560b in 0a8cb00b01284ec70f9f589af560b_64) with the Esn information in the command output to obtain the suffix of the corresponding Location information (for example, 76 in 0:76). Replace the value of phy_no in ssd_disk_info with the obtained value, and then save and close the ssd_disk_info file:

      cat /proc/smio_host


      Run the following command to change the file permissions:

      chmod 600 ssd_disk_info

    2. Obtain POOL_ID from the command output in 1, log in to the active FSM node as user dsware, and run the restoration commands provided at the beginning of this step to restore the OSD processes and configurations of the faulty node.

  4. Restore the MDC process configuration of the faulty node based on that of other nodes.

    1. Go to 5.

    2. Copy configuration file /opt/fusionstorage/persistence_layer/mdc/conf/mdc_conf.cfg from a normal MDC node to the /opt/fusionstorage/persistence_layer/mdc/conf directory on the faulty node. Set permissions for the mdc_conf.cfg file and ensure that the file permissions are the same as those on the properly running nodes.

    3. Obtain the values of MDC_ID, STORAGE_IP1, and STORAGE_IP2 from the command output in 1.

    4. Log in to the faulty node and run the following commands to restore the MDC process configuration of the faulty node:

      cd /opt/fusionstorage/persistence_layer/agent/tool/emergency/fsa/recover_conf

      sh update_mdc_cfg_based_mdc.sh mdc_id storage_ip1 storage_ip2

      The operation is successful if the following information is displayed:

      finished to update mdc config file and agent monitor file
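      The success check above can be scripted as a simple string match. A minimal sketch operating on the sample message quoted in this document; on a live node, capture the actual output of update_mdc_cfg_based_mdc.sh instead.

```shell
# Check the update_mdc_cfg_based_mdc.sh output for the success message
# quoted above. The sample string is copied from this document.
out='finished to update mdc config file and agent monitor file'
case "$out" in
  *'finished to update mdc config'*) status=ok ;;
  *)                                 status=failed ;;
esac
echo "MDC restoration status: $status"
```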
  5. Restore the VBS process configuration.

  • If the HyperMetro service is in use, restore the HyperMetro service first, and then restore the mount point information. For details, see Emergency Handling for the Configuration Loss of Disaster Recovery (DR) Node.

  • In FusionSphere scenarios, forcibly stop the VM, restart it, and perform the restoration operations. For details, see FusionCompute Product Documentation.

  • In other scenarios, log in to the FusionStorage web client, choose Monitor > Alarms and Events > Events, query the logs of attaching and detaching volumes, export the system operation logs, and attach volumes based on the logs.


  1. Log in to the active FSM node as user dsware and run the queryUpgradeMode command to check whether the system is in upgrade mode. If it is, run the exitUpgradeMode command to take the system out of upgrade mode. An example is provided as follows:

    [dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op queryUpgradeMode
    [Fri Jan 13 08:08:18 UTC 2017] DswareTool operation start.
    Enter User Name:cmdadmin
    Enter Password :
    Operation finish successfully. Result Code:0
    isUpgradeMode: true
    [Fri Jan 13 08:08:31 UTC 2017] DswareTool operation end.

    [dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op exitUpgradeMode
    This operation is high risk,please input y to continue:y
    [Fri Jan 13 08:11:15 UTC 2017] DswareTool operation start.
    Enter User Name:cmdadmin
    Enter Password :
    Operation finish successfully. Result Code:0
    [Fri Jan 13 08:11:26 UTC 2017] DswareTool operation end.
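    The query-then-exit decision can be automated by parsing the isUpgradeMode line. A sketch that operates on the sample line from the output above; on a live system, capture the real dswareTool.sh output instead of the hard-coded string.

```shell
# Decide whether exitUpgradeMode is needed by parsing the isUpgradeMode
# line. The sample line is copied from the example output above.
query_output='isUpgradeMode: true'
mode=$(printf '%s\n' "$query_output" | awk -F': ' '/isUpgradeMode/ {print $2}')
if [ "$mode" = "true" ]; then
  echo "System is in upgrade mode: run --op exitUpgradeMode first."
else
  echo "System is not in upgrade mode: continue with the restoration."
fi
```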
  2. If the values of MDC_ID, MDC_PORT, and STORAGE_IP1 can be obtained from 1, go to 5.d. Otherwise, go to 5.c.

  3. Log in to the FusionStorage web client and obtain the management IP address of any MDC node. Then go to 1 to obtain the values of MDC_ID, MDC_PORT, and STORAGE_IP1, using the management IP address as the input parameter for the get_server_role.sh command.

  4. Log in to the node represented by STORAGE_IP1 and run the following command: /opt/fusionstorage/agent/tool/dsware_insight 0 mdc_id storage_ip1 mdc_port 8 123 | grep -w "vbs_storage_ip" | cut -d"|" -f6. If multiple VBS nodes are available in the system, run the command once for each VBS node. In the command, storage_ip1 specifies the storage IP address obtained in the preceding step, and vbs_storage_ip specifies the storage IP address of each VBS node.

    For example:


    If CAN BE MASTER is displayed, set typeflag to 0. Otherwise, set typeflag to 1.
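    The typeflag decision above can be sketched as a simple string match on the dsware_insight output. The sample line below is an illustrative assumption, not real tool output; on a live node, match against the actual command output for each VBS.

```shell
# Derive the -nodetype value for createDSwareClient from the dsware_insight
# output. The sample line is an illustrative stand-in for the real output.
insight_line='vbs_id:0|vbs_storage_ip:192.168.10.5|CAN BE MASTER'
case "$insight_line" in
  *'CAN BE MASTER'*) typeflag=0 ;;
  *)                 typeflag=1 ;;
esac
echo "Use -nodetype $typeflag"
```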

  5. Log in to the active FSM node as user dsware and run the following commands to restore VBS processes and configuration on the faulty node:

    cd /opt/dsware/client/bin

    ./dswareTool.sh --op createDSwareClient -ip Management IP address of the node to be restored -nodetype typeflag

    NOTICE: If FSA of a version that is not used exists in the cluster, perform the following operations before running the restoration command: log in to the active and standby FSM nodes as user dsware, switch to user root, run the vi /opt/dsware/manager/webapps/dsware/WEB-INF/DSwarePreCheckConfig.xml command, change the value of preCheckSwitch to 0, save the configuration, and run the su - omm -c restart_tomcat command to restart the Tomcat service. After restoring the VBS processes, restore the original preCheckSwitch values on the active and standby FSM nodes and restart the Tomcat service.


  6. Restore the mount point information.

Restore the KVS process configuration.


NOTICE: If FSA of a version that is not used exists in the cluster, perform the following operations before running the restoration command: log in to the active and standby FSM nodes as user dsware, switch to user root, run the vi /opt/dsware/manager/webapps/dsware/WEB-INF/DSwarePreCheckConfig.xml command, change the value of preCheckSwitch to 0, save the configuration, and run the su - omm -c restart_tomcat command to restart the Tomcat service. After the KVS processes are restored, restore the original preCheckSwitch values on the active and standby FSM nodes and restart the Tomcat service.


  1. Log in to the active FSM node as user dsware and run the queryUpgradeMode command to check whether the system is in upgrade mode. If it is, run the exitUpgradeMode command to take the system out of upgrade mode. An example is provided as follows:

    [dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op queryUpgradeMode
    [Fri Jan 13 08:08:18 UTC 2017] DswareTool operation start.
    Enter User Name:cmdadmin
    Enter Password :
    Operation finish successfully. Result Code:0
    isUpgradeMode: true
    [Fri Jan 13 08:08:31 UTC 2017] DswareTool operation end.

    [dsware@euler bin]$ sh /opt/dsware/client/bin/dswareTool.sh --op exitUpgradeMode
    This operation is high risk,please input y to continue:y
    [Fri Jan 13 08:11:15 UTC 2017] DswareTool operation start.
    Enter User Name:cmdadmin
    Enter Password :
    Operation finish successfully. Result Code:0
    [Fri Jan 13 08:11:26 UTC 2017] DswareTool operation end.
  2. Log in to the active FSM node as user dsware and run the following commands to restore KVS processes and configuration on the faulty node:

    cd /opt/dsware/client/bin

    ./dswareTool.sh --op createKvsClient -ip Management IP address of the node to be restored

Restore the OpenSM process configuration.

If the OpenSM process is deployed on the faulty node, create a new OpenSM process on another node. For details, see Installation and Commissioning in the FusionStorage Block Product Documentation.

Clear the backup data of the restored node on the FSM node and manually back up the data.

  1. To ensure backup data security, log in to the active FSM node as user dsware and run the following command to manually back up the data:

    /opt/dsware/client/bin/dswareTool.sh  --op backupMetadata -label labelName

    The following figure provides an example.


  2. Log in to the active FSM node as user dsware, run the su - root command to switch to user root, and run the following commands to delete the automatic backup data of the restored FSA node:

    cd /opt/dsware/tools/metadataBackup

    rm -rf dsware_autoback_*_szxa.*.Management IP address of the restored node.tar.gz
