
The SNS2124-1 switch restarts after the switch license has been imported

Latest reply: Jun 19, 2019 07:21:40

1 Networking

191152goakzi3byi7yaxnb.png

 

2 Problem Analysis

2.1 Analysis on Data Store Exceptions and VM BSOD

• The symptoms and causes are the same on both nodes, so this report describes one of them (node CNA05) in detail. After the switch restarted, the host system log shows a data signature verification error (Invalid dinode signature = "\ '^") when the VIMS file system read metadata. As a result, the data store was placed in the read-only state.

Note

The VIMS file system is a virtual cluster file system used by FusionCompute. A shared data store can be attached to multiple hosts. Each file in the file system has an inode to describe the file, which is known as file metadata. When a host reads files from the storage, it reads the inode first and checks the inode size and digital signature to determine whether the inode is valid. If the inode is invalid, the data store associated with the host is placed in the read-only state.
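The check described in the note can be illustrated with a minimal sketch. The on-disk layout of VIMS is not given in this post, so the signature bytes, the inode size, and every name below are hypothetical and serve only to show the "validate, else go read-only" flow.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; the real VIMS layout is not documented here.
EXPECTED_SIGNATURE = b"DINO"
EXPECTED_INODE_SIZE = 512


@dataclass
class DataStore:
    name: str
    read_only: bool = False


def inode_is_valid(raw: bytes) -> bool:
    """Check the inode size first, then the signature bytes at its start."""
    if len(raw) != EXPECTED_INODE_SIZE:
        return False
    return raw[:len(EXPECTED_SIGNATURE)] == EXPECTED_SIGNATURE


def read_inode(store: DataStore, raw: bytes) -> bool:
    """Return True if the inode is usable; otherwise mark the store read-only."""
    if inode_is_valid(raw):
        return True
    store.read_only = True
    return False
```

In this sketch, an inode whose first bytes are something like `\ '^` instead of the expected signature fails the check, and the data store stays read-only from then on.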

191152qhh032p63zi05arb.jpg

 

• The metadata error is caused as follows: the log entry "Node x changed generation" indicates that the LUN mapping is incorrect. For each data store associated with a host, a heartbeat thread on the host reads the heartbeat area of the LUN every two seconds. Based on the heartbeat information (node ID + generation + timestamp), the system determines whether other hosts are alive. At the same time, the heartbeat thread writes the ID + generation + timestamp of the local node into its own heartbeat area to indicate that the node is alive. Under normal conditions, the generation value does not change.

191153zs6bs6nlnjnbhvcb.png

 

Take node 0 as an example. The log shows that at 21:47:03 the generation value changed from 0xce687a7ea22ed786 to 0xf0868c6efd644446 (the generation value of another LUN). The likely cause is incorrect LUN mapping: the heartbeat data was read from another LUN, so the generation value changed.
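The generation check can be sketched as follows. This is not the VIMS implementation; the `Heartbeat` and `HeartbeatMonitor` names are hypothetical, and the sketch only shows how a reader of the heartbeat area could notice that a node's generation value changed between two reads.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Heartbeat:
    node_id: int
    generation: int
    timestamp: float


class HeartbeatMonitor:
    """Remembers the last heartbeat per node and reports generation changes."""

    def __init__(self) -> None:
        self._last: Dict[int, Heartbeat] = {}

    def observe(self, hb: Heartbeat) -> Optional[str]:
        prev = self._last.get(hb.node_id)
        self._last[hb.node_id] = hb
        if prev is not None and prev.generation != hb.generation:
            return (f"Node {hb.node_id} changed generation: "
                    f"{prev.generation:#x} -> {hb.generation:#x}")
        return None


monitor = HeartbeatMonitor()
monitor.observe(Heartbeat(0, 0xce687a7ea22ed786, 0.0))
# A later read returns a heartbeat that belongs to another LUN.
print(monitor.observe(Heartbeat(0, 0xf0868c6efd644446, 2.0)))
```

A changed generation for a node that did not restart suggests that the read was served from a different LUN than the one the data store expects.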

• According to the output of the multipath -ll command, the LUN mapping is incorrect.

191153dbaeggnweazgqtie.png

 

dm-12 manages four SD devices (17:0:0-3:3) that actually belong to dm-11. As a result, when data is read from dm-12 (ending with 0000b), the data of dm-11 (ending with 0000a) is returned.

191153agkmq9ieg4oque0s.png

 

• Because the LUN mapping is incorrect, write I/O intended for one LUN is written to another LUN and damages the data on that LUN. As a result, the image data of some VMs is damaged and a BSOD occurs when those VMs are started.

• The causes of the data store exception and the BSOD are now clear. The next section analyzes what caused the LUN mapping disorder.

2.2 Cause Analysis on LUN Mapping Disorder

• On the storage side, the operation records show that a LUN group was modified frequently from August 13 to August 27. The log indicates that this LUN group (ID 6) is a VDI LUN group on the platform, and the operation log contains many records for it.

191153r4hnoicn6tzizq33.png

 

According to the preceding records, the slot number change for the corresponding LUNs in the VDI LUN group on the storage side is as follows:

191153xwzk5w8jd0xbdkjc.png

 

According to these operations, slot numbers 10, 11, and 14, which correspond to the three LUNs ending with 000a, 000b, and 000e, changed on August 18. These three LUNs are the ones whose data stores became abnormal due to the LUN mapping disorder.

Note

When a LUN group is empty and LUNs are added to the LUN group in sequence, the slot number increases in ascending order. When a LUN is removed from a LUN group, the corresponding slot is reclaimed. When a LUN is added, the previously reclaimed slot is reused in sequence.
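The slot behavior in the note can be simulated with a short sketch. The `LunGroup` class is hypothetical, slots are assumed to start at 0, and "reused in sequence" is assumed to mean that the lowest reclaimed slot is handed out first; the storage array's actual allocator may differ.

```python
import heapq
from typing import Dict, List


class LunGroup:
    """Toy model of slot allocation in a LUN group (assumptions in the lead-in)."""

    def __init__(self) -> None:
        self._slots: Dict[str, int] = {}
        self._free: List[int] = []  # min-heap of reclaimed slot numbers
        self._next = 0              # next never-used slot number

    def add(self, lun: str) -> int:
        if lun in self._slots:
            raise ValueError(f"{lun} is already in the group")
        if self._free:
            slot = heapq.heappop(self._free)
        else:
            slot = self._next
            self._next += 1
        self._slots[lun] = slot
        return slot

    def remove(self, lun: str) -> None:
        heapq.heappush(self._free, self._slots.pop(lun))

    def slot_of(self, lun: str) -> int:
        return self._slots[lun]


group = LunGroup()
for name in ("000a", "000b", "000c"):
    group.add(name)          # slots 0, 1, 2

group.remove("000a")
group.remove("000b")
group.add("000b")            # takes reclaimed slot 0
group.add("000a")            # takes reclaimed slot 1
print(group.slot_of("000a"), group.slot_of("000b"))
```

After the remove and re-add sequence, the two LUNs have swapped slots. A host that has not rescanned still associates the old slot numbers with the old LUNs.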

• The LUN group was modified frequently, which changed the slot numbers of the original LUNs. If storage is not rescanned on all hosts after such a change, some hosts do not detect the change on the storage device, and the LUN mapping may become incorrect once the switch restarts unexpectedly. Analysis of the FusionCompute operation logs shows that although multiple scans were performed, some of them were not performed on all hosts.

191153rouwgljo0g2i0wlb.png

 

• Based on the preceding analysis, the scenario was simulated: the LUN group is modified frequently, the slot numbers of the LUNs change, and the switch restarts. The same fault as the one at the site can be reproduced.

2.3 Fault Reproduction in the Lab

Key steps:

1.        On the storage side, delete and add LUNs frequently so that the slot numbers of the LUNs are misplaced. Repeat the operation multiple times.

2.        Scan for storage devices on some hosts (not all hosts), and then restart the optical switch. The following symptoms are reproduced:

191154walf63k3x55aff6v.png

 

Core characteristics of errors (these features also exist in the site environment):

a.        Devices with different LUN IDs are added to the same dm group (dm-4 contains two LUN IDs 1 and 2).

b.       All devices are in the active state.

c.        A device (13:00:0:2 sdw, 14:0:1:2 sdz) belongs to two different dm groups (dm-4, dm-8).

d.        The optical switch corresponding to the faulty device (13:00:0:2 sdw, 14:0:1:2 sdz) is not restarted. Instead, the other optical switch is restarted.
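Characteristics a and c can be checked mechanically. The sketch below parses text in the general layout printed by multipath -ll and flags dm maps that mix LUN IDs as well as SD devices that appear in more than one map. The sample text is hypothetical (made-up WWIDs, vendor strings, and sizes) and only models the symptoms listed above. The function names are also illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

MAP_RE = re.compile(r"\b(dm-\d+)\b")
# host:channel:target:lun followed by the SD device name
PATH_RE = re.compile(r"(\d+):(\d+):(\d+):(\d+)\s+(sd[a-z]+)\b")


def parse_multipath(text: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each dm device to its (sd device, LUN ID) path entries."""
    maps: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    current = None
    for line in text.splitlines():
        path = PATH_RE.search(line)
        if path and current is not None:
            maps[current].append((path.group(5), int(path.group(4))))
            continue
        header = MAP_RE.search(line)
        if header:
            current = header.group(1)
    return dict(maps)


def find_anomalies(maps: Dict[str, List[Tuple[str, int]]]):
    """Return (maps with mixed LUN IDs, devices owned by several maps)."""
    mixed = {}
    owners = defaultdict(set)
    for dm, paths in maps.items():
        lun_ids = sorted({lun for _, lun in paths})
        if len(lun_ids) > 1:
            mixed[dm] = lun_ids
        for dev, _ in paths:
            owners[dev].add(dm)
    shared = {dev: sorted(dms) for dev, dms in owners.items() if len(dms) > 1}
    return mixed, shared


# Hypothetical sample modelled on the symptoms above, not real site output.
SAMPLE = """\
mpathb (hypothetical-wwid-1) dm-4 VENDOR,PRODUCT
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 13:0:0:1 sdv 65:80 active ready running
  |- 13:0:0:2 sdw 65:96 active ready running
  `- 14:0:1:2 sdz 65:144 active ready running
mpathc (hypothetical-wwid-2) dm-8 VENDOR,PRODUCT
size=100G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 13:0:0:2 sdw 65:96 active ready running
  `- 14:0:1:2 sdz 65:144 active ready running
"""

mixed, shared = find_anomalies(parse_multipath(SAMPLE))
print("maps with mixed LUN IDs:", mixed)
print("devices in several maps:", shared)
```

Because all paths report the active state (characteristic b), the path status alone does not reveal the fault. Comparing LUN IDs and device membership across maps is what exposes it in this sketch.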

3 Preventive Measures


• After a LUN group is modified, scan for storage devices on all hosts, as the product requires, so that every host detects the changed LUNs.

• Before modifying a LUN group that has been added to a host, detach the LUN group from the platform and remove the data store, and only then perform the change on the underlying storage. Do not perform operations directly on the storage device.

• When you remove a LUN from a LUN group on a storage device and intend to add it back, do not add other LUNs to the LUN group in between. Otherwise, the slot number of the original LUN will change after it is added back to the LUN group.

• Before restarting the optical switch, run the multipath -ll command to check whether the paths on the host are normal.

If an exception is found, scan the storage device and restart the multipathing service first, and then restart the optical switch.


4 Recovery Methods

If the mapping is incorrect, scan the storage device and restart the multipathing service to restore the device. Run the following commands on the host:

rescan-scsi-bus-uvp

systemctl restart multipathd.service

After the mapping is restored, this problem does not occur even if the optical switch is restarted. However, you must not perform operations on LUNs directly, especially LUNs that have been allocated for use on the platform.

5 Optimization Solution

To address this issue, you are advised to perform operations as required by the product documentation. In addition, automatic detection and avoidance mechanisms are added on the product side to prevent misoperations from affecting user services. In the next version, UltraPath can help mitigate this issue.

 

