E9000 troubleshooting documents (alarm handling and troubleshooting)

Latest reply: Mar 6, 2014 06:18:04

The troubleshooting manuals are described as follows:

Document Name | Main Content | Support-E Link
E9000 Alarm Handling | This document describes the alarms of the Tecal E9000 server in terms of their meaning, impact on the system, possible causes, and handling procedures. | Click
E9000 Troubleshooting | This document describes the E9000 server hardware information that you must be familiar with during troubleshooting, procedures for diagnosing E9000 system faults during routine maintenance, common faults, and parts replacement. | Click
Huawei Servers Troubleshooting | This document describes how to collect logs, diagnose faults, upgrade software, perform preventive maintenance and common operations, and obtain other resources for troubleshooting Huawei servers. | Click


 

Alarm Handling 

When a fault occurs, the system generates logs and an alarm based on the faulty module. When the universal server manager (USM) is configured, the alarm is reported to the USM over the Simple Network Management Protocol (SNMP). The sensors on the device monitor the operating environment and generate alarms if the environmental conditions do not meet device operating requirements.

Event Alarms and Fault Alarms

Based on the impact on the system, alarms are classified into the following types:

  • Event alarms

    Event alarms record the key events that occur during normal system operation. Event alarms do not affect system operation.

  • Fault alarms

    Fault alarms are generated for faults that affect normal system operation.

NOTE:

This document describes fault alarms only.

 

Viewing Alarms

You can use the following methods to view alarms:

  • Viewing alarms on the command-line interface (CLI).

    • MM910: On a terminal, access the MM910 CLI, run the unhealthylocations command to query the position of an alarm, and then run the healthevents command to view the details of the alarm.

    • CX910, CX911, and CX913 switch modules: On a terminal, access the switch module CLI and run the notification-log command to view the alarm logs.

    • CX310, CX311, CX312, and CX110 switch modules: On a terminal, access the switch module CLI and run the notification-log command to view the alarm logs.

  • Viewing alarms in the USM based on SNMP trap messages.
    NOTE:
    The CX310, CX311, CX312, and CX110 use SNMP v2 and SNMP v3. SNMP v1 does not support all types of trap messages.
  • Viewing baseboard management controller (BMC) alarms of the MM910, compute nodes, and switch modules on the MM910 web user interface (WebUI).

NOTE:

For details about how to run commands on the MM910 CLI, see the E9000 MM910 Management Module Command Reference.
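
As a quick reference, the CLI commands listed above can be kept in a small lookup table. This is only a sketch: the command names and module models come from this post, while the ALARM_COMMANDS table and the alarm_commands helper are hypothetical names introduced for illustration. It does not connect to any device.

```python
from typing import Dict, List

# Alarm-viewing commands per module model, as named in this post.
# ALARM_COMMANDS and alarm_commands() are hypothetical helper names.
ALARM_COMMANDS: Dict[str, List[str]] = {
    # MM910: locate the alarm first, then view its details.
    "MM910": ["unhealthylocations", "healthevents"],
}

# All listed switch modules use the same command to view alarm logs.
for _model in ("CX910", "CX911", "CX913", "CX310", "CX311", "CX312", "CX110"):
    ALARM_COMMANDS[_model] = ["notification-log"]


def alarm_commands(model: str) -> List[str]:
    """Return the CLI commands to run, in order, for the given module model."""
    key = model.strip().upper()
    if key not in ALARM_COMMANDS:
        raise KeyError(f"no alarm-viewing command listed for {model!r}")
    return list(ALARM_COMMANDS[key])
```

For example, alarm_commands("MM910") returns the two MM910 commands in the order they should be run.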

 

Alarm Severity

Alarms of the E9000 server system are classified into the following types by severity:

  • Minor

    Minor alarms are generated for faults that do not have a major impact on the system. Minor alarms require prompt corrective measures to prevent more serious faults.

  • Major

    Major alarms are generated for faults that may affect normal operation of the system or interrupt services.

  • Critical

    Critical alarms are generated for faults that may cause board power-off or even interrupt services. Critical alarms require immediate corrective measures.

  • Warning

    Indicates that an error may occur and affect the system performance. The measures to be taken vary with the situation or the error.

  • Indeterminate

    Indicates that the severity cannot be determined. This means that the severity is determined by the real-world situation.

  • Cleared

    The Cleared severity indicates the clearing of one or more previously reported alarms. This alarm clears all alarms for this managed object that have the same Alarm type, Probable cause and Specific problems. Multiple associated notifications may be cleared by using the Correlated notifications parameter.
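
The matching rule for Cleared can be sketched as follows. This is a minimal illustration, assuming an alarm is represented by the four fields named above plus a severity string; the Alarm class, its field names, and apply_cleared are hypothetical and are not part of any E9000 interface. The Correlated notifications parameter is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alarm:
    # Hypothetical record; field names mirror the attributes in the text.
    managed_object: str
    alarm_type: str
    probable_cause: str
    specific_problems: str
    severity: str


def _match_key(alarm: Alarm) -> Tuple[str, str, str, str]:
    """Fields that must be equal for a Cleared notification to apply."""
    return (
        alarm.managed_object,
        alarm.alarm_type,
        alarm.probable_cause,
        alarm.specific_problems,
    )


def apply_cleared(active: List[Alarm], cleared: Alarm) -> List[Alarm]:
    """Return the active alarms that remain after a Cleared notification."""
    if cleared.severity != "Cleared":
        raise ValueError("expected a notification with severity 'Cleared'")
    key = _match_key(cleared)
    # Drop every alarm on the same managed object with the same alarm type,
    # probable cause, and specific problems; keep all others.
    return [a for a in active if _match_key(a) != key]
```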

The alarms for the Tecal E9000 server system contain the alarms for all components in the system. This document describes the alarms for the BMC, MM910, and switch modules. After an alarm is generated, you can find the cause of the alarm by viewing the alarm information.
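
When several alarms are open at once, it can help to review them in order of urgency. The sketch below assumes a ranking of Critical, Major, Minor, Warning, with Indeterminate placed last among open alarms; that ranking, and the placement of Indeterminate in particular, is an assumption for illustration and is not stated in this post. The Severity enum and the triage function are hypothetical names.

```python
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    # Numeric ranks are an assumption made for sorting only.
    CLEARED = 0
    INDETERMINATE = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(alarms: List[Tuple[str, Severity]]) -> List[Tuple[str, Severity]]:
    """Drop Cleared entries and order the rest from most to least urgent."""
    open_alarms = [a for a in alarms if a[1] != Severity.CLEARED]
    # sorted() is stable, so alarms of equal severity keep their input order.
    return sorted(open_alarms, key=lambda a: a[1], reverse=True)
```

Adjust the ranks if your site treats Indeterminate alarms as needing earlier review.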

 

Troubleshooting 

Troubleshooting Process

Troubleshooting means using proper methods to find the causes of a fault and rectify the fault. The troubleshooting method is to narrow down the scope of the causes to reduce the complexity of rectifying the specific fault.

Figure 1 shows the troubleshooting process.
Figure 1  Troubleshooting process
NOTE:
The troubleshooting process shown in Figure 1 is recommended, but it is not the only way to troubleshoot.


1. Troubleshooting Preparations

 

 

2. Collecting Fault Information

Collect fault information on the WebUI of the MM910.

  1. Log in to the WebUI of the MM910.

    For details, see the Tecal E9000 MM910 Management Module V100R001 Installation Guide.

  2. In the navigation tree, choose Basic Information.

    The Basic Information UI is displayed, as shown in Figure 2.

    Figure 2  Basic Information
  3. Collect information such as alarms and versions of the modules in the shelf.
  4. In the navigation tree, choose System Management > Status Monitoring.

    The Status Monitoring UI is displayed, as shown in Figure 3.

    Figure 3  Status Monitoring
  5. Collect information such as sensor and indicator states of the modules in the shelf.
  6. In the navigation tree, choose System Management > SEL Information.

    The SEL Information UI is displayed, as shown in Figure 4.
    Figure 4  SEL Information
    NOTE:
    On the SEL Information UI of the MM910 or the BMC on the compute node, clicking Clear deletes all SELs of the MM910 or the BMC on the compute node at one time. Deleted SELs cannot be restored. Perform this operation with caution.
  7. Query logs of the modules in the shelf.
  8. On the SMM tab page, click One touch collect.

    The UI for collecting logs is displayed.

  9. Select Full Collect, and click Startup.

    The system starts collecting logs, which takes about 20 minutes. After logs are collected, the log file one_touch_info_all.tar.gz is displayed in the File Name area.

  10. Click the log file and download it to the local PC as prompted.

    NOTE:
    Only the latest 50 event logs can be collected by One touch collect.
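
After downloading the log file, you may want to check what it contains before sending it on. The following is a minimal sketch using Python's standard tarfile module; only the file name one_touch_info_all.tar.gz comes from the steps above. The function name list_collected_logs is hypothetical, and the layout of files inside the archive is not described in this post, so the sketch simply lists whatever regular files are present.

```python
import tarfile
from typing import List


def list_collected_logs(archive_path: str) -> List[str]:
    """Return the sorted names of regular files inside a .tar.gz archive."""
    # "r:gz" opens the archive for reading with gzip compression.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


if __name__ == "__main__":
    # Path is an assumption: the archive saved to the current directory.
    for name in list_collected_logs("one_touch_info_all.tar.gz"):
        print(name)
```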

     

Collect fault information on the WebUI of the BMC.

3. Clearing an Alarm

 

4. Identifying a Fault

 

5. Rectifying a Fault

 

Troubleshooting Cases and Common Operations

 

Tips: To access all server documents, select the documents from the attached Server_Documentation_Bookshelf_V2.0.xlsx.

server_info
Official Created Mar 6, 2014 06:18:04

THANKS for sharing~~~