Hello, everyone!
In this post, I summarize the methods for locating CPU faults.
Log
General information
Server Model and Serial Number
dump_info\RTOSDump\versioninfo\fruinfo.txt
OS log\mainboard\dmidecode.txt
BMC Version
dump_info\RTOSDump\versioninfo\app_revision.txt
Current alarms:
dump_info\AppDump\sensor_alarm\current_event.txt
Historical alarms:
dump_info\AppDump\sensor_alarm\sel_current.csv or sel.tar
OS log\bmc\sel.csv
KVM screenshot:
dump_info\OSDump\img*.jpeg
Maintenance logs:
dump_info\LogDump\maintenance_log
Serial port logs (these show the errors reported before the system broke down, such as MCE errors):
dump_info\OSDump\systemcom.tar
Key information log location
FDM logs:
dump_info\LogDump\fdm.bin
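As a quick way to check which of the files listed above are present in an unpacked BMC log bundle, here is a minimal Python sketch. The relative paths are copied from this post. The function name find_logs, the labels, and the idea of pointing it at a local bundle directory are my own assumptions for illustration, not part of any vendor tool.

```python
from pathlib import Path, PureWindowsPath

# Relative locations copied from the list above (Windows-style separators).
LOG_LOCATIONS = {
    "server model / serial number": r"dump_info\RTOSDump\versioninfo\fruinfo.txt",
    "BMC version": r"dump_info\RTOSDump\versioninfo\app_revision.txt",
    "current alarms": r"dump_info\AppDump\sensor_alarm\current_event.txt",
    "historical alarms (csv)": r"dump_info\AppDump\sensor_alarm\sel_current.csv",
    "historical alarms (tar)": r"dump_info\AppDump\sensor_alarm\sel.tar",
    "KVM screenshots": r"dump_info\OSDump\img*.jpeg",
    "maintenance logs": r"dump_info\LogDump\maintenance_log",
    "serial port logs": r"dump_info\OSDump\systemcom.tar",
    "FDM logs": r"dump_info\LogDump\fdm.bin",
}


def find_logs(root):
    """Return {label: [matching paths]} for a log bundle unpacked at root."""
    root = Path(root)
    found = {}
    for label, rel in LOG_LOCATIONS.items():
        # Turn the backslash-separated path into a portable glob pattern.
        pattern = "/".join(PureWindowsPath(rel).parts)
        found[label] = sorted(root.glob(pattern))
    return found
```

An empty list for a label means that file was not collected in this bundle, which is worth knowing before starting the analysis.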
Fault Diagnosis Details
Fault classification

1. In a computer system, errors are classified into detected errors and undetected errors. Undetected errors are not processed.
2. Undetected errors are classified into benign errors and silent data corruption (SDC). Benign errors do not affect data, but SDC errors corrupt data. The DIF technology of disks is used to protect against SDC errors.
3. CE: Detected errors that the hardware corrects directly are classified as correctable errors (CE). They do not affect the normal running of the system, and the system logs them. (A memory CE storm, however, does affect the system.)
4. UCE: Uncorrected errors are classified into catastrophic, fatal, and recoverable errors. Catastrophic and fatal errors cause the system to restart.
5. UCR: The hardware detects a recoverable error (UCR) and sends it to the system software for processing. For UCNA errors, the operating system only records the error information in logs. For SRAO errors, the system determines whether to handle them based on the error information. For SRAR errors, the system must rectify the error; if the processing fails, the system is restarted.
DUE: Detected but Uncorrectable Error.
UCR: Uncorrected Recoverable.
UCNA: Uncorrected No Action required.
SRAO: Software Recoverable Action Optional.
SRAR: Software Recoverable Action Required.
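The handling rules in the classification above can be summarized as a small lookup. This is only a sketch of the rules as described in this post. The names ErrorClass and expected_outcome and the outcome strings are hypothetical and chosen for illustration.

```python
from enum import Enum


class ErrorClass(Enum):
    CE = "correctable error"
    UCNA = "uncorrected, no action required"
    SRAO = "software recoverable, action optional"
    SRAR = "software recoverable, action required"
    FATAL = "catastrophic or fatal"


def expected_outcome(error_class, recovery_succeeded=True):
    """Map an error class to the system behavior described in the post."""
    if error_class is ErrorClass.CE:
        return "corrected by hardware and logged"
    if error_class is ErrorClass.UCNA:
        return "logged only"
    if error_class is ErrorClass.SRAO:
        return "logged; software decides whether to act"
    if error_class is ErrorClass.SRAR:
        # Recovery is mandatory; a failed recovery ends in a restart.
        return "recovered by software" if recovery_succeeded else "restart"
    return "restart"
```

For example, expected_outcome(ErrorClass.SRAR, recovery_succeeded=False) returns "restart", which matches the case where the system cannot rectify an SRAR error.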
MCA domain fault
Machine Check Architecture (MCA) divides the CPU into different modules. Each module corresponds to a bank (which can be simply understood as a register group), and each bank corresponds to a type of error.
Model-specific registers (MSRs) are a set of registers in x86 processors used to control CPU operation, switch functions on and off, debug, trace program execution, and monitor CPU performance.
Through the MCA, the system can detect hardware errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. The MCA automatically detects hardware errors and records the error information in the MSRs of the corresponding bank.
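To make the bank and MSR relationship concrete, the sketch below computes the MSR addresses of a bank and decodes the flag bits of a raw IA32_MCi_STATUS value. The address layout (four MSRs per bank starting at 0x400) and the bit positions follow the Intel SDM as I understand it. The classification is a simplified reading of the SDM table, and the S and AR bits only carry meaning on processors that support software error recovery. The function names are hypothetical.

```python
IA32_MC0_CTL = 0x400  # first bank; each bank owns 4 consecutive MSRs

# Flag bit positions in IA32_MCi_STATUS.
STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60, "MISCV": 59,
    "ADDRV": 58, "PCC": 57, "S": 56, "AR": 55,
}


def bank_msrs(bank):
    """Return the MSR addresses of the given MC bank."""
    base = IA32_MC0_CTL + 4 * bank
    return {"CTL": base, "STATUS": base + 1, "ADDR": base + 2, "MISC": base + 3}


def decode_status(status):
    """Split a 64-bit status value into flags and error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    fields["MCA_CODE"] = status & 0xFFFF            # architectural error code
    fields["MODEL_CODE"] = (status >> 16) & 0xFFFF  # model-specific error code
    return fields


def classify(status):
    """Simplified error class derived from the UC, PCC, S, and AR bits."""
    f = decode_status(status)
    if not f["VAL"]:
        return "no valid error"
    if not f["UC"]:
        return "CE"
    if f["PCC"]:
        return "fatal"
    if f["S"] and f["AR"]:
        return "SRAR"
    if f["S"]:
        return "SRAO"
    return "UCNA"
```

With this, a status value taken from a serial port log or an MCE record can be mapped back to the error classes from the fault classification section.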

Bank | Module | Explanation |
MC Bank 0-3 | Level-1 and level-2 caches | Bank 0: IFU (level-1 instruction cache). Bank 1 and Bank 2: DCU (level-1 data cache). Bank 3: MLC (L2 cache). |
MC Bank 4-5 | QPI module | On the Brickland platform, QPI links in the 4P environment are interconnected. One CPU provides three QPI links: QPI0, QPI1, and QPI2. Generally, when a QPI fault occurs, the QPI fault is displayed on two interconnected MC bank registers. QPI0 and QPI2 share a group of registers, and QPI0 and PCU share a group of registers. |
MC Bank 6 | IIO module | IIO errors are recorded in this bank only after the IOMCA option in the menu is enabled. In normal cases, this option is disabled. |
MC Bank 7-16 | Memory module | Ivy Bridge has iMC0 and iMC1. On the RH5885 V3, iMC1 is not used. |
MC Bank 17-31 | Level-3 cache (LLC module) |
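When reading logs, it helps to translate a bank number straight into the module it belongs to. The sketch below encodes the table above as ranges. The bank ranges are taken from this post and apply to the platform it describes. The name module_for_bank is hypothetical.

```python
# Bank ranges copied from the table above (end of range is exclusive).
BANK_MODULES = [
    (range(0, 4), "Level-1 and level-2 caches"),
    (range(4, 6), "QPI module"),
    (range(6, 7), "IIO module"),
    (range(7, 17), "Memory module"),
    (range(17, 32), "Level-3 cache (LLC module)"),
]


def module_for_bank(bank):
    """Return the module name for an MC bank number (0-31)."""
    for banks, module in BANK_MODULES:
        if bank in banks:
            return module
    raise ValueError(f"bank {bank} is outside the documented range 0-31")
```

For example, module_for_bank(9) returns "Memory module", so an error logged in MC Bank 9 points to the memory subsystem rather than to a cache.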
AER Domain Fault (IIO AER)
PCIe provides three error reporting mechanisms. These mechanisms can be controlled and reported by configuration registers mapped to three different regions of the configuration space.
PCI-compatible registers (mandatory): Error reporting is enabled through the PCI configuration command register, which maintains backward compatibility with PCI.
PCIe capability registers (mandatory): These registers can be used only by PCIe-aware software. Error reporting is enabled through the PCIe Device Control register in the PCI-compatible configuration space.
PCIe Advanced Error Reporting (AER) registers (optional): These provide more powerful error reporting than the standard PCI Express mechanisms, covering PCI Express AER, traffic switch, IRP, IIO core, Intel VT-d, CBDMA, and other Intel-specific extensions.
On Intel platforms, Advanced Error Reporting (AER) is the mechanism used to report PCIe and IIO errors.
There are three types of AER errors:
Correctable errors - handled by hardware.
Uncorrectable non-fatal errors - handled by device-specific software.
Uncorrectable fatal errors - handled by the system software.
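The three error types also show up as individual bits in the PCIe Device Status register (bit 0: correctable error detected, bit 1: non-fatal error detected, bit 2: fatal error detected, per the PCI Express specification). The following sketch decodes such a value and attaches the handler described above. The function name and the output format are assumptions for illustration only.

```python
# Device Status register bits paired with the handler named in the post.
DEVICE_STATUS_ERRORS = {
    0: ("correctable", "hardware"),
    1: ("uncorrectable non-fatal", "device-specific software"),
    2: ("uncorrectable fatal", "system software"),
}


def decode_device_status(value):
    """Return (error type, handler) pairs for the error bits set in value."""
    return [
        entry
        for bit, entry in sorted(DEVICE_STATUS_ERRORS.items())
        if (value >> bit) & 1
    ]
```

A value of 0x5, for instance, decodes to a correctable error plus an uncorrectable fatal error, and the fatal one is the one that needs attention from the system software.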
CPU QPI topology
The CPU interconnection relationship of the RH5885 V3 server is as follows:
CPU1 Port0<-------------->CPU3 Port0
CPU1 Port1<-------------->CPU2 Port2
CPU1 Port2<-------------->CPU4 Port0
CPU2 Port0<-------------->CPU3 Port2
CPU2 Port1<-------------->CPU4 Port1
CPU3 Port1<-------------->CPU4 Port2
A QPI alarm is generated when the QPI bus is faulty. The following figure shows the topology.

The CPU interconnection relationship of the RH5885H V3 server is as follows:
CPU1 Port0<-------------->CPU2 Port2
CPU1 Port1<-------------->CPU3 Port2
CPU1 Port2<-------------->CPU4 Port2
CPU2 Port0<-------------->CPU3 Port0
CPU2 Port1<-------------->CPU4 Port0
CPU3 Port1<-------------->CPU4 Port1
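When an alarm names one end of a QPI link, the link tables above tell you which CPU and port sit at the other end. Here is a small sketch that encodes both tables and looks up the peer. The link data is copied from this post. The names TOPOLOGIES, build_peer_map, and peer_of are hypothetical.

```python
# Each link is ((cpu, port), (cpu, port)), copied from the tables above.
TOPOLOGIES = {
    "RH5885 V3": [
        ((1, 0), (3, 0)), ((1, 1), (2, 2)), ((1, 2), (4, 0)),
        ((2, 0), (3, 2)), ((2, 1), (4, 1)), ((3, 1), (4, 2)),
    ],
    "RH5885H V3": [
        ((1, 0), (2, 2)), ((1, 1), (3, 2)), ((1, 2), (4, 2)),
        ((2, 0), (3, 0)), ((2, 1), (4, 0)), ((3, 1), (4, 1)),
    ],
}


def build_peer_map(links):
    """Expand a list of links into a lookup that works from either end."""
    peers = {}
    for a, b in links:
        peers[a] = b
        peers[b] = a
    return peers


def peer_of(model, cpu, port):
    """Return (peer_cpu, peer_port) for the given CPU port on a server model."""
    return build_peer_map(TOPOLOGIES[model])[(cpu, port)]
```

For example, peer_of("RH5885 V3", 2, 2) returns (1, 1), so a fault reported on CPU2 Port2 involves the link to CPU1 Port1, and both CPUs are candidates for the fault.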

2488H V5/5885H V5
2488 V5

2488H V5

The following shows the CPU link mapping as it appears in the serial port information of the CH242 V5.
Link Exchange Parameter
-----------------------
CPU0: LEP0(1:CPU1): LEP1(0:CPU3): LEP2(2:CPU2)
CPU1: LEP0(1:CPU2): LEP1(0:CPU0): LEP2(2:CPU3)
CPU2: LEP0(1:CPU3): LEP1(0:CPU1): LEP2(2:CPU0)
CPU3: LEP0(1:CPU0): LEP1(0:CPU2): LEP2(2:CPU1)
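The Link Exchange Parameter output can be parsed into a link map with a short script. The sketch below assumes that an entry such as LEP0(1:CPU1) under CPU0 means "local port 0 connects to port 1 of CPU1". That reading is my interpretation. It is consistent with the sample, because every link then appears identically from both ends, but it is not confirmed by vendor documentation. The function names are hypothetical.

```python
import re

SAMPLE = """\
CPU0: LEP0(1:CPU1): LEP1(0:CPU3): LEP2(2:CPU2)
CPU1: LEP0(1:CPU2): LEP1(0:CPU0): LEP2(2:CPU3)
CPU2: LEP0(1:CPU3): LEP1(0:CPU1): LEP2(2:CPU0)
CPU3: LEP0(1:CPU0): LEP1(0:CPU2): LEP2(2:CPU1)
"""

# One CPU header followed by its run of LEP entries (line breaks optional).
ENTRY_RE = re.compile(r"CPU(\d+):\s*((?:LEP\d+\(\d+:CPU\d+\):?\s*)+)")
LEP_RE = re.compile(r"LEP(\d+)\((\d+):CPU(\d+)\)")


def parse_lep(text):
    """Return {(cpu, port): (peer_cpu, peer_port)} from LEP output."""
    links = {}
    for entry in ENTRY_RE.finditer(text):
        cpu = int(entry.group(1))
        for port, peer_port, peer_cpu in LEP_RE.findall(entry.group(2)):
            links[(cpu, int(port))] = (int(peer_cpu), int(peer_port))
    return links


def is_symmetric(links):
    """True if every link is reported identically from both ends."""
    return all(links.get(peer) == local for local, peer in links.items())
```

Running is_symmetric(parse_lep(SAMPLE)) on healthy output should return True. An asymmetric or incomplete map would be a hint that a link did not train on one side.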
If the number of CPUs ranges from 1 to 4, the following is an example:

FDM logs are similar to the following:

RH8100 V3
Logical and physical mapping of CPU sockets in single-system mode.

Logical and physical mapping of CPU sockets in dual-system mode.

8100 V5
UPI topology in single-system mode.

UPI topology in dual-system mode.

This post stops here for the time being. More on CPU fault location and analysis will be added over time.



