1. Terminology
Machine Check: a hardware error that is detected by the hardware and logged for software.
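Machine Check Architecture (MCA): the x86 hardware programming interface that lets software detect, report, and handle corrected and uncorrected hardware errors. It is an architectural interface with some abstraction, so operating systems can remain forward and backward compatible. Details are in the Intel Architecture Software Developer Manual, Volume 3, chapter 15.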
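Machine Check Exception (MCE): the x86 CPU raises an int 18 exception to signal an uncorrected hardware error. The operating system has a special handler that processes the information in the MCA registers.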
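Error Correcting Code (ECC): a code that can detect and correct errors. Typical ECC codes detect two-bit errors and correct one-bit errors (some advanced encodings handle more). See Wikipedia's ECC entry. Server memory generally supports ECC.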
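Corrected error: a hardware error that the hardware corrected automatically (for example, a single-bit fault repaired by ECC). These errors need no immediate software action, but they are still reported for accounting and predictive failure analysis.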
Uncorrected error: a hardware error that the hardware could not correct. Data corruption has occurred. These errors require a prompt software reaction.
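Predictive Failure Analysis (PFA): using trends in (primarily) corrected errors to predict future hardware failures and automatically taking steps to avoid outages. mcelog implements automatic offlining for memory and CPU caches. Users can also configure additional actions.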
IO-MCA: reports uncorrected errors on PCI Express links on newer Xeon systems. Supported by mcelog; see IO errors.
PCI AER (PCI-Express Advanced Error Reporting): used for error reporting on PCI Express links. Not supported by mcelog, but logged to the kernel log. For details see the OLS paper, and see also IO-MCA.
RAS: Reliability, Availability, Serviceability.
DIMM: memory module.
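DMI (or SMBIOS): a standardized way for a BIOS to report the current hardware configuration to the operating system. The DMI information can be dumped with the dmidecode program. mcelog uses this information, when available, to map DIMM numbers to silk screen labels.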
APEI: an interface defined in the ACPI 4 standard that allows a BIOS to report errors to the operating system. Formerly known as WHEA (Windows Hardware Error Architecture).
EDAC: an alternative memory error reporting framework. See the FAQ entry.
2. Glossary
2.1.1. Machine Check
A hardware error detected by hardware and reported to software.
2.1.2. Machine Check Architecture (MCA)
The x86 machine check architecture is a hardware programming interface that allows software to report and handle both corrected and uncorrected hardware errors. It is an architectural interface with some abstraction, which lets operating systems remain forward and backward compatible. Details are described in the Intel Architecture Software Developer Manual, Volume 3, chapter 15.
2.1.3. Machine Check Exception (MCE)
The x86 CPU raises an int 18 exception to signal an uncorrected hardware error. The operating system has a special handler to process the information contained in the MCA registers.
2.1.4. ECC
Error Correcting Code. A specific code that can detect and correct errors. Typical ECC codes can detect two-bit errors and correct one-bit errors (there are some advanced encodings that can handle more errors). See Wikipedia's ECC entry. On servers the memory subsystem generally supports ECC.
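The "detect two, correct one" behaviour can be illustrated with a toy extended Hamming(8,4) code. This is only a sketch for intuition: real memory ECC works on much wider words and its exact encoding is hardware specific, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrected'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: exactly one bit flipped, at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, not fixable.
        return "uncorrected", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any single bit of `encode(n)` yields `("corrected", n)`, while flipping any two bits yields `("uncorrected", None)`, which mirrors the corrected and uncorrected error classes defined in the following entries.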
2.1.5. Corrected error
A hardware error that was corrected by the hardware (e.g. a single-bit data corruption that was correctable using ECC). These errors do not require immediate software action, but are still reported for accounting and predictive failure analysis.
2.1.6. Uncorrected error
A hardware error that the hardware detected but could not correct. Data corruption has occurred. These errors require a software reaction.
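Software tells corrected and uncorrected errors apart by reading the status register of an MCA bank. The sketch below decodes a 64-bit status value. The bit positions are an assumption based on the commonly documented MCi_STATUS layout and should be checked against the Intel manual referenced above; the helper name `decode_status` and the sample value are hypothetical.

```python
# Assumed flag positions in a 64-bit MCi_STATUS value (verify against the SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the companion MISC register is valid
    "ADDRV": 58,  # the companion ADDR register is valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status):
    """Split a raw status value into flags, error codes and a coarse class."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    if not flags["VAL"]:
        kind = "invalid"
    elif flags["UC"]:
        kind = "uncorrected"
    else:
        kind = "corrected"
    return {
        "kind": kind,
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Hypothetical sample: valid, uncorrected, address valid, error code 0x009F.
sample = (1 << 63) | (1 << 61) | (1 << 58) | 0x009F
info = decode_status(sample)
```

Here `info["kind"]` is `"uncorrected"` because both the VAL and UC bits are set. Clearing UC in the sample would classify the same record as a corrected error.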
2.1.7. Predictive Failure Analysis (PFA)
Using trends in (primarily) corrected errors to predict future failure of hardware components and automatically taking steps to avoid outages. mcelog implements automatic offlining for memory and CPU caches. Additional user-specified actions can also be configured.
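One simple way to turn "trends in corrected errors" into a trigger is a leaky bucket: each corrected error adds to a counter that drains over time, and an action fires when the counter reaches a threshold. The sketch below illustrates that idea only. It is not mcelog's actual implementation, and the class name, threshold, and window are hypothetical.

```python
class LeakyBucket:
    """Fires when roughly `capacity` errors arrive within `window` seconds."""

    def __init__(self, capacity, window):
        self.capacity = capacity
        self.window = window
        self.level = 0.0
        self.last = None

    def record(self, now, count=1):
        """Account `count` corrected errors at time `now` (in seconds).

        Returns True when the threshold is reached and an action such as
        offlining the affected component should be considered.
        """
        if self.last is not None:
            elapsed = now - self.last
            # Drain at a rate of `capacity` errors per `window` seconds.
            drained = elapsed * self.capacity / self.window
            self.level = max(0.0, self.level - drained)
        self.last = now
        self.level += count
        return self.level >= self.capacity


# Hypothetical policy: 10 corrected errors within 24 hours on one memory page.
page_bucket = LeakyBucket(capacity=10, window=24 * 3600)
```

With this policy, isolated errors spread over several days drain away and never trigger, while a burst of ten errors within a short period does.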
2.1.8. IO-MCA
Used for reporting uncorrected errors on PCI Express links on newer Xeon systems. Supported by mcelog, see IO errors.
2.1.9. PCI AER
PCI-Express Advanced Error Reporting. Used for error reporting on PCI Express links. Not supported by mcelog, but logged to the normal kernel log. For more details on the implementation see the OLS paper. See also IO-MCA.
2.1.10. RAS
Reliability, Availability, Serviceability.
2.1.11. DIMM
Memory module.
2.1.12. DMI (or SMBIOS)
This is a standardized way for a BIOS to report the current hardware configuration to the operating system. The DMI information can be dumped with the dmidecode program. mcelog uses this information when available to map DIMM numbers to silk screen labels.
2.1.13. APEI
An interface defined in the ACPI 4 standard that allows a BIOS to report errors to an operating system. Formerly known as WHEA.
2.1.14. EDAC
An alternative memory error reporting framework. See the FAQ entry.