[Symptom]
The NVDIMM hardware fault alarm is reported on the DeviceManager home page of the cluster management interface. The fault must be located and rectified.
[Troubleshooting Roadmap]
This alarm is usually caused by a hardware fault or a software bug. The software bug has been fixed in V300R006C00SPH205.
1. Log in to the node that reports the NVDIMM fault alarm using SSH.
2. Run the /opt/huawei/snas/script/inspect_mml/nvdimm_status_record command to check whether the NVDIMM hardware status is normal. If the following information is displayed, contact hardware support to replace the NVDIMM.

3. If the hardware status is normal, perform the following operations to determine whether an NVDIMM software fault has occurred:
4. On the node that reported the NVDIMM fault alarm, search /var/log/snasmessages for the keyword testEnv around the alarm generation time and check whether any error is logged. If error logs for the /opt/huawei/snas/etc/monstore/*map/testEnv file exist, the NVDIMM software fault was reported by the mon.

If the alarm was generated more than one hour before you handle it, find the log package for that time in the alarm archive directory (/var/log/backup/snas/snas.*) and decompress it to obtain the corresponding snasmessages file. Alternatively, use the SmartKit information collection tool to collect the logs generated around the alarm generation time. Then confirm the alarm generation time.
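The log search in this step can be sketched as a small shell helper. This is a minimal sketch and not a product tool: the function name find_testenv_errors is hypothetical, and the pattern is derived only from the file path given above. A plain grep for testEnv is the broader search. The helper only narrows the result to lines that name the monstore testEnv file. To search older logs, pass the path of a snasmessages file extracted from an archived log package.

```shell
# Hypothetical helper: print (with line numbers) the log lines that mention
# the monstore testEnv file. $1 is the log file to search; it defaults to
# the live log named in the step above.
find_testenv_errors() {
    log="${1:-/var/log/snasmessages}"
    # [^/]*map matches the "*map" directory component of the path.
    grep -n -E '/opt/huawei/snas/etc/monstore/[^/]*map/testEnv' "$log"
}
```

The function returns a non-zero status when no matching line is found, so it can be used directly in a condition, for example: find_testenv_errors || echo "no testEnv errors logged".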
5. [Important] To prevent the monitor from continuing to read and write data on the faulty NVDIMM during rectification, you need to stop the snas_mon process. Before doing so, check the time at which the monc performs its hourly NVDIMM detection. Run the cd /var/log; BASELOG="snasmessages" n9grep `date -d "3 hour ago" +%Y%m%d%H%M%S` now MONC_EnvironmentAlarm command. The log shows the time at which the monc detects the NVDIMM each hour. (For example, if detection occurs at 10 minutes past each hour, minute 10 is the detection time point.) Do not perform the subsequent operations at this time point, and do not leave an interval of one hour or more between stopping the snas_mon process and formatting the virtual disk (the two steps that follow).
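The timing rule in this step can be checked with simple minute arithmetic before the snas_mon process is stopped. The following is a sketch under stated assumptions: both function names are hypothetical, the detection minute must be read manually from the log output of the command above, and the default safety margin of 15 minutes is an illustrative value that does not come from the product documentation.

```shell
# Print the number of minutes from now until the next hourly detection point.
# $1: minute of the hour at which the monc detects the NVDIMM (from the log).
# $2: current minute (optional; defaults to the system clock).
minutes_until_detection() {
    detect_min=$1
    now_min=${2:-$(date +%M)}
    # Drop a leading zero ("08" -> "8") so the shell does not read it as octal.
    case $detect_min in 0?) detect_min=${detect_min#0} ;; esac
    case $now_min in 0?) now_min=${now_min#0} ;; esac
    echo $(( (detect_min - now_min + 60) % 60 ))
}

# Succeed only if at least $3 minutes (assumed default: 15) remain before
# the next detection point. A result of 0 minutes means "now" is the
# detection time point, which is never treated as safe.
safe_to_proceed() {
    margin=${3:-15}
    left=$(minutes_until_detection "$1" "$2")
    [ "$left" -ge "$margin" ]
}
```

For example, with a detection time point at minute 10, safe_to_proceed 10 succeeds shortly after minute 10 and fails in the last 15 minutes before the next detection.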
6. Run the /opt/huawei/deploy/bin/daemon /opt/huawei/snas/bin/snas_mon -s command to manually stop the snas_mon process. Then run the ps ax | grep snas_mon command to check whether the snas_mon process has exited, as shown in the following figure.
7. Format the nvdimm_disk virtual disk.
1) Run the umount -l /opt/huawei/snas/etc/monstore command to unmount the nvdimm_disk virtual disk.
2) Run the mount command to check that the virtual disk has been unmounted, as shown in the following figure.
3) Run the mkfs.ext3 /proc/nvdimm/nvdimm_disk command to format the nvdimm_disk virtual disk.
Note
Format the disk with the mkfs command provided for ext3 (mkfs.ext3). Do not clear the NVDIMM space by accessing it directly.
8. Run the sudo /opt/huawei/deploy/package/snas-mon/snas_mon_start.sh command to start the snas_mon process again. Then run the ps ax | grep snas_mon command to check whether the snas_mon process has started.
9. Confirm the operation result.
1) Run the mount command to check whether the /opt/huawei/snas/etc/monstore directory is mounted to the virtual device /proc/nvdimm/nvdimm_disk.
2) Run the ll -R /opt/huawei/snas/etc/monstore/ | grep error command and view the command output. If no I/O error is displayed, as shown in the following figure, the operation is successful.
3) Run the MmlBatch 988 "monc formatnvmm 2" command to manually clear the alarm.
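The mount check and the I/O error check in the confirmation sub-steps can be combined into one script. This is a minimal sketch, assuming a Linux mount command that prints lines in the form "device on directory type fstype (options)". The function names mounted_device and verify_monstore are hypothetical, and ls -lR is used in place of the ll alias, which may not be defined in a non-interactive shell.

```shell
MONSTORE=/opt/huawei/snas/etc/monstore      # mount point from the steps above
NVDIMM_DISK=/proc/nvdimm/nvdimm_disk        # virtual device from the steps above

# Read `mount` output on stdin and print the device mounted on directory $1.
mounted_device() {
    awk -v m="$1" '$2 == "on" && $3 == m { print $1 }'
}

# Hypothetical wrapper: succeed only if monstore is mounted on the NVDIMM
# virtual disk and a recursive listing reports no line containing "error".
verify_monstore() {
    dev=$(mount | mounted_device "$MONSTORE")
    if [ "$dev" != "$NVDIMM_DISK" ]; then
        echo "monstore is not mounted on $NVDIMM_DISK" >&2
        return 1
    fi
    # ls writes I/O errors to stderr, so merge stderr into the pipe for grep.
    if ls -lR "$MONSTORE" 2>&1 | grep -q 'error'; then
        echo "errors reported while listing $MONSTORE" >&2
        return 1
    fi
    echo "monstore check passed"
}
```

As with the original grep error command, a file whose name contains "error" also counts as a match, so a failure should be confirmed by viewing the listing manually.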
[Cause]
The NVDIMM hardware is faulty, or an NVDIMM software fault is reported by the mon.
[Solution]
Replace the NVDIMM or rectify the software fault.
[Post-Recovery Check]
The alarm is cleared.
[Suggestion and Summary]
None.
[Applicability]
All.