1. Memory Errors
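Memory (RAM) errors are among the most common errors in typical server systems. They also scale with the amount of memory: the more memory, the more errors. In addition, large clusters with tens, hundreds, or sometimes thousands of active machines increase the total error rate of the system.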
On systems with ECC memory, memory errors can be detected and usually corrected.
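A corrected error, also called a soft error, is not necessarily a problem. On systems with long uptime there is an expected soft error rate, and errors at that rate will be reported. The hardware platform uses error correcting codes (ECC) and redundancy to handle soft errors, which is why they are called corrected errors. Unlike an uncorrected (hard) error, which means data corruption, a soft error does not directly require a software reaction. A small number of soft errors in a given time frame is generally not a problem.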
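The idea that a few soft errors per time frame are acceptable can be sketched as a sliding-window counter. This is only an illustration, not how mcelog itself is implemented: the class name `CorrectedErrorWindow` and the budget values are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class CorrectedErrorWindow:
    """Count corrected errors inside a sliding time window (hypothetical helper)."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors          # assumed budget, not an mcelog default
        self.window_seconds = window_seconds  # length of the time frame in seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if the budget is exceeded."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

A caller would feed it one timestamp per corrected error and react only when `record` returns True, which matches the point that an occasional soft error needs no action.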
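For memory errors it is important to be able to identify which component caused the problem. The mcelog daemon tracks memory errors in different buckets:
- per DIMM (if available)
- per channel
- per memory controller
- per socket (= physical CPU package)
- per page; this is used to automatically offline bad pages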
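To make the bucket idea concrete, here is a minimal sketch that keeps one counter per level of the memory hierarchy. It is not mcelog's internal data structure: `ErrorLocation`, `MemoryErrorBuckets`, and all field names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorLocation:
    """Where a corrected error was reported (hypothetical fields)."""
    socket: int
    controller: int
    channel: int
    dimm: Optional[int]  # None when the platform cannot identify the DIMM
    page: int            # physical page frame number


class MemoryErrorBuckets:
    """One error counter per bucket type."""

    def __init__(self):
        self.per_socket = Counter()
        self.per_controller = Counter()
        self.per_channel = Counter()
        self.per_dimm = Counter()
        self.per_page = Counter()

    def record(self, loc):
        """Account one error in every bucket it belongs to."""
        self.per_socket[loc.socket] += 1
        self.per_controller[(loc.socket, loc.controller)] += 1
        self.per_channel[(loc.socket, loc.controller, loc.channel)] += 1
        if loc.dimm is not None:  # the DIMM bucket exists only "if available"
            self.per_dimm[(loc.socket, loc.controller, loc.channel, loc.dimm)] += 1
        self.per_page[loc.page] += 1
```

Counting the same error at several levels is what lets an operator tell a single bad page apart from a failing DIMM or channel: the first shows up in one page bucket, the second spreads across many pages but concentrates in one DIMM bucket.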
The state of the running daemon can be queried using:
mcelog --client
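For scripted monitoring, the same query can be wrapped in a small helper. This is a sketch under assumptions: the function name `query_mcelog_daemon` is hypothetical, the mcelog daemon must already be running, the call may need sufficient privileges, and the output is returned as raw text because its exact format is not described here.

```python
import subprocess


def query_mcelog_daemon(run=subprocess.run):
    """Return the raw text printed by `mcelog --client`.

    `run` is injectable so the helper can be exercised without mcelog installed.
    """
    result = run(
        ["mcelog", "--client"],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `query_mcelog_daemon()` with no arguments runs the real command; any parsing of the result is left to the caller.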
For more details, see the LinuxKongress 2010 mcelog paper and the other references.