The datastore corresponding to the HyperMetro LUN on a VMware host is in the inaccessible state

Hi team, here's a new case.


Problem Description

The datastore corresponding to Huawei HyperMetro LUNs on the VMware host is in the inaccessible state. After the physical host is restarted, the datastore access status becomes normal.



Root Cause

Both sites lost connectivity at the same time. As a result, the HyperMetro LUNs entered the dual-slave state and the datastore was set to the PDL state. Because the VMware layer was configured incorrectly, the PDL state of the datastore was not cleared after the HyperMetro LUNs recovered. Only a restart of the physical host cleared the PDL state of the datastore.



Location Process

1. Confirm the fault symptom. On the VMware host, the datastore is in the inaccessible state, and services running on the datastore cannot be accessed. However, the paths to the datastore and the corresponding Huawei LUN are both in the normal state.


2. Check storage event records and find that link disconnection has been recorded, which may be related to common active-active fault scenarios.

For details about common HyperMetro fault scenarios, see Principle Introduction > Data Arbitration Principle in HyperMetro Product Documentation.


3. Collect storage and VMware logs and check the information about datastores in the inaccessible state in the vmkernel logs. PDL errors occur on these datastores.

[Screenshot: PDL errors for the inaccessible datastores in the vmkernel log]


A datastore generally enters the PDL state in one of two ways: the storage system returns a PDL error, or the multipathing software converts an all paths down (APD) condition into PDL. (Only some UltraPath versions perform this APD-to-PDL conversion. NMP multipathing does not.) The PDL error information above shows that error code 0x2500 was received. This indicates that the storage system returned the error to the host, causing the datastore to be set to the PDL state.
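The check described above can be sketched in Python. This is only an illustration: the original screenshot is not available, so the sample line is an assumed fragment in the usual vmkernel "Valid sense data" layout (sense key, ASC, ASCQ), and the function names are hypothetical. In the SCSI standard, ASC/ASCQ 0x25/0x00 means LOGICAL UNIT NOT SUPPORTED, which corresponds to the 0x2500 code discussed in this case.

```python
import re

# Matches the sense triplet that vmkernel prints as
# "Valid sense data: <key> <ASC> <ASCQ>".
SENSE_RE = re.compile(
    r"Valid sense data:\s*0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
)


def asc_ascq(line):
    """Return ASC/ASCQ packed as one 16-bit code (e.g. 0x2500), or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    _sense_key, asc, ascq = (int(group, 16) for group in match.groups())
    return (asc << 8) | ascq


def storage_reported_0x2500(line):
    """True if the line carries the 0x2500 code seen in this case."""
    return asc_ascq(line) == 0x2500


# Assumed sample fragment, not copied from the customer's log.
sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0"
print(hex(asc_ascq(sample)), storage_reported_0x2500(sample))
```

A line without a sense triplet returns None, so an APD-style entry with no sense data is not mistaken for a storage-returned PDL.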


4. Check storage logs. At about 08:40, the two storage systems returned error code 0x2500 to the host. (There may be a time difference between the two storage systems.)

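2102351LVK10K7000005

[Screenshot: storage log showing error code 0x2500 returned to the host]

2102351LVK10K7000006

[Screenshot: storage log showing error code 0x2500 returned to the host]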


The reason why the error code 0x2500 is returned is that the communication between the quorum server and the two storage systems is interrupted at this time.

2102351LVK10K7000005

2019-12-18 00:40 0xF3C030004 Fault Major Recovered 2019-12-18 00:40 Quorum server (server ID 0) is disconnected from the storage array.

2102351LVK10K7000006

2019-12-18 00:40 0xF3C030004 Fault Major Recovered 2019-12-18 00:40 Quorum server (server ID 0) is disconnected from the storage array.

The replication links are also disconnected at this time, as shown in the following event records.

2102351LVK10K7000005

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 5, local controller 0A, local port CTE0.A.H1, remote controller 0A, remote port CTE0.A.H1, remote device name Oceanstor5500-A, serial number 2102351LVK10K7000006) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 0, local controller 0A, local port CTE0.A.H0, remote controller 0A, remote port CTE0.A.H0, remote device name Oceanstor5500-A, serial number 2102351LVK10K7000006) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 256, local controller 0B, local port CTE0.B.H0, remote controller 0B, remote port CTE0.B.H0, remote device name Oceanstor5500-A, serial number 2102351LVK10K7000006) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 261, local controller 0B, local port CTE0.B.H1, remote controller 0B, remote port CTE0.B.H1, remote device name Oceanstor5500-A, serial number 2102351LVK10K7000006) is disconnected.

2102351LVK10K7000006

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 261, local controller 0B, local port CTE0.B.H1, remote controller 0B, remote port CTE0.B.H1, remote device name OceanStor5500V5-B, serial number 2102351LVK10K7000005) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 256, local controller 0B, local port CTE0.B.H0, remote controller 0B, remote port CTE0.B.H0, remote device name OceanStor5500V5-B, serial number 2102351LVK10K7000005) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 5, local controller 0A, local port CTE0.A.H1, remote controller 0A, remote port CTE0.A.H1, remote device name OceanStor5500V5-B, serial number 2102351LVK10K7000005) is disconnected.

2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 Replication link (link ID 0, local controller 0A, local port CTE0.A.H0, remote controller 0A, remote port CTE0.A.H0, remote device name OceanStor5500V5-B, serial number 2102351LVK10K7000005) is disconnected.

Because the quorum links and the replication links are down at the same time, arbitration cannot determine which site takes over services.
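The correlation above (quorum link and replication links failing in the same minute) can be checked mechanically on exported event records. The following Python sketch assumes the column layout seen in the records quoted in this post (occurrence time, event ID, type, level, status, clearance time, description). The column names and function names are assumptions made for illustration, not part of any Huawei tool.

```python
import re
from collections import defaultdict

# Column layout inferred from the event records quoted in this post.
EVENT_RE = re.compile(
    r"^(?P<occurred>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) "
    r"(?P<event_id>0x[0-9A-Fa-f]+) "
    r"(?P<type>\S+) (?P<level>\S+) (?P<status>\S+) "
    r"(?P<cleared>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) "
    r"(?P<description>.+)$"
)

QUORUM_DISCONNECTED = "0xF3C030004"
REPLICATION_LINK_DISCONNECTED = "0xF0E10001"


def parse_events(lines):
    """Parse event records into dictionaries, skipping lines that do not match."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def double_failure_times(events):
    """Return the times at which both kinds of disconnection were recorded."""
    ids_by_time = defaultdict(set)
    for event in events:
        ids_by_time[event["occurred"]].add(event["event_id"])
    needed = {QUORUM_DISCONNECTED, REPLICATION_LINK_DISCONNECTED}
    return sorted(t for t, ids in ids_by_time.items() if needed <= ids)


# Two records from storage system 2102351LVK10K7000005, as quoted above.
records = [
    "2019-12-18 00:40 0xF3C030004 Fault Major Recovered 2019-12-18 00:40 "
    "Quorum server (server ID 0) is disconnected from the storage array.",
    "2019-12-18 00:40 0xF0E10001 Fault Major Recovered 2019-12-18 00:40 "
    "Replication link (link ID 5, local controller 0A, local port CTE0.A.H1, "
    "remote controller 0A, remote port CTE0.A.H1, remote device name "
    "Oceanstor5500-A, serial number 2102351LVK10K7000006) is disconnected.",
]
print(double_failure_times(parse_events(records)))
```

Running the same check on the records of both storage systems shows whether each side lost the quorum link and the replication links in the same time window.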

2102351LVK10K7000005

[2019-12-18 00:40][6984219.745740] [][1500002780000][INFO][Set rephc pair(0) arb result(2).][REPHC][setRepHcArbResult,4521]

2102351LVK10K7000006

[2019-12-18 00:40][6699932.272513] [][1500002780000][INFO][Set rephc pair(0) arb result(2).][REPHC][setRepHcArbResult,4521]

As a result, the LUNs at the two sites are inaccessible at the time when the problem occurs, the HyperMetro pair enters the dual-slave state, and the two sets of storage systems return error code 0x2500 to the host.

The current storage version is V500R007C30SPC100. In this version, the dual-slave state is automatically cleared after the link fault is rectified. As shown in the following log, after the quorum link recovers, the arbitration takes effect and the HyperMetro LUNs can take over services.

2102351LVK10K7000006

[2019-12-18 00:40][6700107.047829] [][1500002780000][INFO][Set rephc pair(0) arb result(1).][REPHC][setRepHcArbResult,4521]
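To follow the arbitration result over time, the REPHC log lines quoted above can be extracted with a short script. This is a hedged Python sketch: the meaning of the result values is inferred from this case only (2 was logged when arbitration failed, 1 after the quorum link recovered), and treating the second bracketed field as a seconds counter is also an assumption.

```python
import re

# Matches lines such as:
# [2019-12-18 00:40][6699932.272513] [][...][INFO][Set rephc pair(0) arb result(2).]...
ARB_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\[(?P<ticks>\d+\.\d+)\].*"
    r"Set rephc pair\((?P<pair>\d+)\) arb result\((?P<result>\d+)\)"
)


def arb_history(lines):
    """Return (ticks, pair_id, result) tuples, ordered by the ticks field."""
    history = []
    for line in lines:
        match = ARB_RE.match(line.strip())
        if match:
            history.append(
                (float(match["ticks"]), int(match["pair"]), int(match["result"]))
            )
    return sorted(history)


# The two lines from storage system 2102351LVK10K7000006, as quoted above.
log_lines = [
    "[2019-12-18 00:40][6699932.272513] [][1500002780000][INFO]"
    "[Set rephc pair(0) arb result(2).][REPHC][setRepHcArbResult,4521]",
    "[2019-12-18 00:40][6700107.047829] [][1500002780000][INFO]"
    "[Set rephc pair(0) arb result(1).][REPHC][setRepHcArbResult,4521]",
]

history = arb_history(log_lines)
for ticks, pair_id, result in history:
    print(f"pair {pair_id}: arb result {result} at {ticks}")

# Gap between the two entries, in the units of the ticks field.
gap = history[-1][0] - history[0][0]
```

If the ticks field is indeed in seconds, the gap between the two entries indicates roughly how long the pair stayed in the dual-slave state on this storage system.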


5. On the storage side, the arbitration takes effect after the link recovers. The HyperMetro LUN is readable and writable. On the DeviceManager, the HyperMetro LUN is readable and writable at both ends. On the VMware host, the datastore path is normal. Why is the PDL status of the datastore not cleared?

Cause: After a PDL error occurs, VMware attempts a failover and switches the services on the datastore to another node in the VMware cluster.

The failover did not proceed as expected because the customer had not configured the hosts according to the host connectivity guide. For details, see the following link:

https://support.huawei.com/enterprise/zh/doc/EDOC1000144882?idPath=7919749|7941815|250389224|21462748|22462071



Solution

1. Restart the physical host to quickly restore the PDL status of the datastore. (Before restarting the host, check whether any VMs are running properly. If yes, stop services or migrate the VMs to other hosts.)


2. After the datastore is restored, correct the configuration by referring to the host connectivity guide to prevent the datastore from remaining in the PDL state in similar scenarios.


3. It is recommended that the quorum server be deployed at a third-party site to avoid disconnection between the quorum server and two sites.


4. Note that if the storage version is earlier than V500R007C30SPC100/V300R006C50SPC100, the dual-slave state cannot be automatically cleared in the current fault scenario. (After the link fault is rectified, you need to manually start HyperMetro.) For details, see Forcibly Starting HyperMetro Pair in the HyperMetro Product Documentation.
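The version check in step 4 can be expressed as a small helper. This Python sketch assumes that "earlier than" can be modeled by comparing the numeric V/R/C/SPC fields in order, and that each threshold applies only to its own V series. Both points are assumptions, and patch suffixes other than SPC are not handled. The function names are hypothetical.

```python
import re

VERSION_RE = re.compile(r"^V(\d+)R(\d+)C(\d+)(?:SPC(\d+))?$")

# Thresholds taken from step 4, keyed by V series (assumed pairing).
AUTO_RECOVERY_FROM = {
    500: (500, 7, 30, 100),  # V500R007C30SPC100
    300: (300, 6, 50, 100),  # V300R006C50SPC100
}


def parse_version(text):
    """Split a version such as V500R007C30SPC100 into a tuple of integers."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing SPC field is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def dual_slave_auto_recovers(version):
    """True if the version is at or above the threshold for its V series."""
    parsed = parse_version(version)
    threshold = AUTO_RECOVERY_FROM.get(parsed[0])
    if threshold is None:
        raise ValueError(f"no threshold listed for V{parsed[0]}")
    return parsed >= threshold


print(dual_slave_auto_recovers("V500R007C30SPC100"))
print(dual_slave_auto_recovers("V300R006C20SPC200"))
```

For versions below the threshold, HyperMetro has to be started manually after the link fault is rectified, as described in step 4.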
