
Keeping a Failed LUN Mapped for a Long Time After Disk Scanning Triggers a System Panic

Latest reply: Feb 26, 2021 08:53:45

Hi team!

Here's a case in which leaving a failed LUN mapped to the host for a long time after a disk scan causes the system to panic.

Symptom

1. UltraPath is successfully installed on the ESXi host.

2. A failed LUN is mapped to the host from the array.

3. The disk scanning command esxcfg-rescan -A is executed on the host.

4. If the failed LUN stays mapped for a long time, a purple screen of death (PSOD) can occur, with low probability.

The following figure shows the stack information.

[Figure: PSOD stack information (image not included)]

Fault Diagnosis

1. After parsing the dump information, VMware engineers replied that the PSOD is triggered by an inherent bug in ESXi. The reply is as follows:

This is a bug in ESX code. 
We have a similar bug#1365517 logged with ESXi-6.0. 
However, as the race condition is an extremely rare case (more than 40 000 times retry to reproduce in your case) and reproduced only with torture testing, so it will not be considered to get fixed for 2015 release. 
It is currently planned to get fixed in 2016 release.

2. VMware confirmed that the retry operation performed after scsiDev fails to register with UltraPath is not the cause of the problem.

The reply is as follows:

--SCSIDeviceIteratorNext() is a utility function which moves the iterator forward to the next ScsiDevice. 
The reference count of the previous current device (if any) is decremented, and the reference count of the new current device (if any) is incremented. 
Retrying of device register not an issue. 
Usually, any PSA device layer issued I/Os you need to have a handle open or a ref on the device and there are functions which get invoked periodically (like SCSIDeviceTimeoutHandlerFn()) and uses SCSIDeviceIteratorNext().

So what I am saying is the retry operation you are trying in your MPP is ok and the issue is in ESXi code. 

There is a bug reported on the same but as it is a rare case(as mentioned in my previous comment) it is marked to be taken up in future releases.
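
To make the reference-count behavior described in the reply easier to picture, here is a minimal toy model in Python. It is not ESXi code and does not reproduce the race condition. The names Device and DeviceIterator are hypothetical and only mirror the description of SCSIDeviceIteratorNext(): moving forward releases the reference on the previous current device and takes one on the new current device.

```python
class Device:
    """Toy stand-in for a SCSI device with a reference count (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0


class DeviceIterator:
    """Walks a device list and holds a reference on the current device only."""

    def __init__(self, devices):
        self._devices = devices
        self._index = -1
        self.current = None

    def next(self):
        # Drop the reference on the previous current device, if any.
        if self.current is not None:
            self.current.refcount -= 1
        self._index += 1
        if self._index < len(self._devices):
            self.current = self._devices[self._index]
            # Take a reference on the new current device.
            self.current.refcount += 1
        else:
            # Past the end of the list: there is no current device.
            self.current = None
        return self.current


devices = [Device("lun0"), Device("lun1")]
it = DeviceIterator(devices)
it.next()  # lun0 becomes current and holds one reference
it.next()  # lun0 is released, lun1 now holds the reference
it.next()  # end of list: lun1 is released, no current device
```

In this model every device ends with a reference count of zero once the iterator has passed it, which is the balanced behavior the reply describes for a correct walk.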

Solution

Delete the mapping of the failed LUN, recover the LUN, and then remap it to the host.

When a LUN fails, you can find the event information about the failed LUN on the array.

This is my solution, how about yours? Go ahead and share it with us!



