CPU Maintenance Instruction

Latest reply: Jun 9, 2021 10:28:04

Hello, everyone!

In this post, I summarize the methods for locating CPU faults.

Log

General information

Server Model and Serial Number

dump_info\RTOSDump\versioninfo\fruinfo.txt

OS log\mainboard\dmidecode.txt

 

BMC Version

dump_info\RTOSDump\versioninfo\app_revision.txt

 

Current alarms:

dump_info\AppDump\sensor_alarm\current_event.txt

 

Historical alarms:

dump_info\AppDump\sensor_alarm\sel_current.csv or sel.tar

OS log\bmc\sel.csv

 

KVM screenshot:

dump_info\OSDump\img*.jpeg

 

Maintenance logs:

dump_info\LogDump\maintenance_log

 

Serial port logs (these show the errors reported before the system breaks down, such as MCE errors):

dump_info\OSDump\systemcom.tar


Key information log location   

FDM logs:

dump_info\LogDump\fdm.bin
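
As a quick check after extracting a log package, the paths above can be looked up in one pass. This is only a minimal sketch: the function name `locate_logs` and the labels are my own, and it assumes the package has been extracted into a directory that contains `dump_info`.

```python
from pathlib import Path, PureWindowsPath

# Paths copied from the list above, relative to the extracted log package root.
KEY_LOGS = {
    "server model and serial number": r"dump_info\RTOSDump\versioninfo\fruinfo.txt",
    "BMC version": r"dump_info\RTOSDump\versioninfo\app_revision.txt",
    "current alarms": r"dump_info\AppDump\sensor_alarm\current_event.txt",
    "historical alarms": r"dump_info\AppDump\sensor_alarm\sel_current.csv",
    "maintenance logs": r"dump_info\LogDump\maintenance_log",
    "serial port logs": r"dump_info\OSDump\systemcom.tar",
    "FDM logs": r"dump_info\LogDump\fdm.bin",
}


def locate_logs(root):
    """Return {label: Path or None} for each key log under root (hypothetical helper)."""
    root = Path(root)
    found = {}
    for label, rel in KEY_LOGS.items():
        # Split on backslashes so the lookup also works on non-Windows hosts.
        candidate = root.joinpath(*PureWindowsPath(rel).parts)
        found[label] = candidate if candidate.exists() else None
    return found
```

Entries that come back as `None` are files missing from the package, which is worth noting before starting the analysis.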

Fault Diagnosis Details

Fault classification

1. In a computer system, errors are classified into detected errors and undetected errors. Undetected errors are not processed.

2. Undetected errors are classified into benign errors and silent data corruption (SDC). Benign errors do not affect data, but SDC errors corrupt data. The DIF technology of disks is used to deal with SDC errors.

3. CE: Detected errors that the hardware can correct are correctable errors (CE). Hardware corrects them directly and the system logs them. They do not affect the normal running of the system, with one exception: a memory CE storm does affect the system.

4. UCE: Uncorrected errors (UCE) are classified into catastrophic, fatal, and recoverable errors. Catastrophic and fatal errors cause the system to restart.

5. UCR: The hardware detects a recoverable error (UCR) and passes it to the system software for processing. For UCNA errors, the operating system only records the error information in logs. For SRAO errors, the system decides whether to handle them based on the error information. For SRAR errors, the system must attempt recovery. If recovery fails, the system restarts.

  • DUE: Detected but Uncorrectable Error.

  • UCR: Uncorrected Recoverable.

  • UCNA: Uncorrected No Action required.

  • SRAO: Software Recoverable Action Optional.

  • SRAR: Software Recoverable Action Required.
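
The classification in items 3 to 5 can be sketched as a decode of a machine check status value. The bit positions below are those of IA32_MCi_STATUS as I understand them from the Intel SDM; they are not stated in this post, and the function name `classify` is my own. The sketch does not distinguish catastrophic from fatal, and the S and AR bits are only meaningful on CPUs that support software error recovery.

```python
# Bit positions in IA32_MCi_STATUS (assumption: taken from the Intel SDM, not from this post).
VAL = 1 << 63  # register holds valid error information
UC = 1 << 61   # error was not corrected by hardware
PCC = 1 << 57  # processor context corrupt
S = 1 << 56    # error was signaled to software
AR = 1 << 55   # software recovery action required


def classify(status):
    """Map a raw status value to the error classes named in this post."""
    if not status & VAL:
        return "no error"
    if not status & UC:
        return "CE"
    if status & PCC:
        return "UCE (fatal)"
    # From here on the error is uncorrected but recoverable (UCR).
    if not status & S:
        return "UCNA"
    return "SRAR" if status & AR else "SRAO"
```

For example, `classify(VAL | UC | S)` returns `"SRAO"`, while setting AR as well gives `"SRAR"`.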

MCA domain fault

Machine Check Architecture (MCA) divides the CPU into different modules. Each module corresponds to a bank (which can be simply understood as a register group), and each bank corresponds to a type of error.

Model-specific registers (MSRs) are a set of registers in x86 processors used for controlling CPU operation, switching functions on and off, debugging, tracing program execution, and monitoring CPU performance.

Through the MCA, the system can detect hardware errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. The MCA automatically detects hardware errors and records the error information to the corresponding bank MSR register.

| MCA bank | Module | Explanation |
|---|---|---|
| MC Bank 0-3 | Level-1 and level-2 caches | Bank 0: IFU (level-1 instruction cache). Banks 1 and 2: DCU (level-1 data cache). Bank 3: MLC (level-2 cache). |
| MC Bank 4-5 | QPI module | On the Brickland platform, QPI links in the 4P environment are interconnected. One CPU provides three QPI links: QPI0, QPI1, and QPI2. Generally, when a QPI fault occurs, it is reported in the MC bank registers of the two interconnected CPUs. QPI0 and QPI2 share a group of registers, and QPI0 and PCU share a group of registers. |
| MC Bank 6 | IIO module | IIO errors are recorded in this bank only after the IOMCA option in the menu is enabled. In normal cases, it is disabled. |
| MC Bank 7-16 | Memory module | Ivy Bridge has iMC0 and iMC1. On the RH5885 V3, iMC1 is not used. |
| MC Bank 17-31 | Level-3 cache (LLC module) | |
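
When a log reports only a bank number, the bank ranges above can be turned into a small lookup. This is a sketch under two assumptions: the ranges are exactly those listed in this post for this platform, and the status MSR address follows the architectural layout `0x401 + 4 * bank` (from the Intel SDM, not from this post). The names `bank_module` and `bank_status_msr` are my own.

```python
# Bank ranges as listed in this post; they are platform-specific.
BANK_MODULES = [
    (range(0, 4), "Level-1 and level-2 caches"),
    (range(4, 6), "QPI module"),
    (range(6, 7), "IIO module"),
    (range(7, 17), "Memory module"),
    (range(17, 32), "Level-3 cache (LLC module)"),
]


def bank_module(bank):
    """Return the module name for an MC bank number."""
    for banks, module in BANK_MODULES:
        if bank in banks:
            return module
    raise ValueError(f"MC bank {bank} is outside the ranges listed in the post")


def bank_status_msr(bank):
    """Address of IA32_MCi_STATUS, assuming the architectural MSR layout."""
    return 0x401 + 4 * bank
```

For example, `bank_module(9)` returns `"Memory module"`, and `hex(bank_status_msr(9))` gives the MSR address to look for in a register dump.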


AER Domain Fault (IIO AER)   

PCIe provides three error reporting mechanisms. They are controlled and reported through configuration registers mapped to three different regions of the configuration space.

PCI-compatible registers (mandatory): use the PCI configuration command registers to maintain backward compatibility with PCI.

PCIe capability registers (mandatory): can be used only by PCIe-aware software. Error reporting is enabled through the PCIe Device Control register in the PCI-compatible configuration space.

PCIe Advanced Error Reporting (AER) registers (optional): provide more powerful error reporting than the standard PCI Express error reporting mechanisms, including PCI Express AER, Traffic switch, IRP, IIO core, Intel VT-d, CBDMA, and other Intel-specific extensions.

 

Advanced Error Reporting (AER) is the mechanism used on Intel platforms for reporting PCIe and IIO errors.

There are three types of AER errors:

Correctable errors: handled by hardware.

Uncorrectable non-fatal errors: handled by device-specific software.

Uncorrectable fatal errors: handled by the system software.
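
To show how the fatal and non-fatal split is decided, here is a minimal sketch. It assumes the standard PCIe AER behavior, which is not described in this post: each bit set in the Uncorrectable Error Status register is reported as fatal if the same bit is set in the Uncorrectable Error Severity register, and as non-fatal otherwise. The function name and the `HANDLED_BY` labels are my own.

```python
# Who handles each AER error type, as listed in this post.
HANDLED_BY = {
    "correctable": "hardware",
    "non-fatal": "device-specific software",
    "fatal": "system software",
}


def split_uncorrectable(status, severity):
    """Split the set bits of an uncorrectable status value into (fatal, non_fatal) masks."""
    fatal = status & severity
    non_fatal = status & ~severity
    return fatal, non_fatal
```

With `status = 0b1010` and `severity = 0b0010`, bit 1 is fatal and bit 3 is non-fatal.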

CPU QPI topology

The CPU interconnection relationship of the RH5885 V3 server is as follows:

CPU1 Port0<-------------->CPU3 Port0

CPU1 Port1<-------------->CPU2 Port2

CPU1 Port2<-------------->CPU4 Port0

CPU2 Port0<-------------->CPU3 Port2

CPU2 Port1<-------------->CPU4 Port1

CPU3 Port1<-------------->CPU4 Port2

A QPI alarm is generated when the QPI bus is faulty. The following figure shows the topology.

[Figure: RH5885 V3 QPI topology]


The CPU interconnection relationship of the RH5885H V3 server is as follows:

CPU1 Port0<-------------->CPU2 Port2

CPU1 Port1<-------------->CPU3 Port2

CPU1 Port2<-------------->CPU4 Port2

CPU2 Port0<-------------->CPU3 Port0

CPU2 Port1<-------------->CPU4 Port0

CPU3 Port1<-------------->CPU4 Port1

[Figure: RH5885H V3 CPU QPI topology]
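
When a QPI alarm names one end of a link, the other end can be looked up from the two link lists above. The sketch below only transcribes those lists into `(cpu, port)` pairs. The names `RH5885_V3`, `RH5885H_V3`, `build_peers`, and `peer` are my own.

```python
# Each entry is ((cpu, port), (cpu, port)), transcribed from the link lists above.
RH5885_V3 = [
    ((1, 0), (3, 0)), ((1, 1), (2, 2)), ((1, 2), (4, 0)),
    ((2, 0), (3, 2)), ((2, 1), (4, 1)), ((3, 1), (4, 2)),
]
RH5885H_V3 = [
    ((1, 0), (2, 2)), ((1, 1), (3, 2)), ((1, 2), (4, 2)),
    ((2, 0), (3, 0)), ((2, 1), (4, 0)), ((3, 1), (4, 1)),
]


def build_peers(links):
    """Index the links in both directions so either end can be looked up."""
    peers = {}
    for a, b in links:
        peers[a] = b
        peers[b] = a
    return peers


def peer(links, cpu, port):
    """Return the (cpu, port) at the other end of the given QPI port."""
    return build_peers(links)[(cpu, port)]
```

For example, `peer(RH5885_V3, 3, 2)` returns `(2, 0)`, so a fault on CPU3 Port2 points to the link with CPU2 Port0, and both CPUs should be checked.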


 

2488H V5/5885H V5

2488 V5

[Figure: 2488 V5 CPU]

2488H V5

[Figure: 2488H V5]

 

The following shows the CPU link mapping as printed in the serial port information of the CH242 V5.

Link Exchange Parameter
-----------------------
CPU0: LEP0(1:CPU1): LEP1(0:CPU3): LEP2(2:CPU2)
CPU1: LEP0(1:CPU2): LEP1(0:CPU0): LEP2(2:CPU3)
CPU2: LEP0(1:CPU3): LEP1(0:CPU1): LEP2(2:CPU0)
CPU3: LEP0(1:CPU0): LEP1(0:CPU2): LEP2(2:CPU1)
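
The Link Exchange Parameter output above can be parsed into a port-to-port map. This is a sketch under one assumption of mine: `LEPn(p:CPUm)` on a CPU's line means that local port n connects to port p of CPUm. The output is consistent with that reading, because every link then appears once from each end. The names `parse_lep` and `is_symmetric` are my own.

```python
import re

# The four CPU lines from the serial port output above.
LEP_TEXT = """\
CPU0: LEP0(1:CPU1): LEP1(0:CPU3): LEP2(2:CPU2)
CPU1: LEP0(1:CPU2): LEP1(0:CPU0): LEP2(2:CPU3)
CPU2: LEP0(1:CPU3): LEP1(0:CPU1): LEP2(2:CPU0)
CPU3: LEP0(1:CPU0): LEP1(0:CPU2): LEP2(2:CPU1)
"""

ENTRY = re.compile(r"LEP(\d+)\((\d+):CPU(\d+)\)")


def parse_lep(text):
    """Return {(cpu, port): (peer_cpu, peer_port)} under the assumed reading."""
    links = {}
    for line in text.splitlines():
        head = re.match(r"CPU(\d+):", line)
        if not head:
            continue  # skip the title and separator lines
        cpu = int(head.group(1))
        for port, peer_port, peer_cpu in ENTRY.findall(line):
            links[(cpu, int(port))] = (int(peer_cpu), int(peer_port))
    return links


def is_symmetric(links):
    """True if every link is also reported from the peer's side."""
    return all(links.get(peer) == local for local, peer in links.items())
```

If `is_symmetric(parse_lep(text))` is false for a captured log, one end of a link is reporting a different peer than expected, which is a useful hint about where to look.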

If the CPUs are numbered from 1 to 4, the mapping is as shown in the following example:

[Figure: CH242 V5]

FDM logs are similar to the following:

[Figure: FDM log example]

 

RH8100 V3


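Logical and physical mapping of CPU sockets in single-system mode:

[Figure: CPU socket mapping, single-system mode]

Logical and physical mapping of CPU sockets in dual-system mode:

[Figure: CPU socket mapping, dual-system mode]

8100 V5

UPI topology in single-system mode:

[Figure: UPI topology, single-system mode]

UPI topology in dual-system mode:

[Figure: UPI topology, dual-system mode]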

This post stops here for now. More on CPU fault locating and analysis will be added over time.

zaheernew
MVE Author Created Jun 9, 2021 06:04:56

stephen.xu
stephen.xu Created Jun 9, 2021 06:28:52 (1) (0)
 
Rumana
Rumana Reply stephen.xu  Created Jun 9, 2021 18:24:35 (0) (0)
 
COOL
Rumana
Rumana Created Jun 11, 2021 06:23:29 (0) (0)
 
