1. mcelog error analysis
1.1. Error type 1
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
This is currently the most common error. It is a corrected (recoverable) memory error.
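The mnemonic in the MCA line can be derived from the low 16 bits of the STATUS register. The sketch below assumes the memory controller error layout `0000 0000 1MMM CCCC` as described in the Intel SDM (MMM is the transaction type, CCCC the channel). The function name and the returned tuple are illustrative, not part of mcelog.

```python
# Transaction types (MMM field); values 101-111 are treated as reserved here.
MEM_TRANSACTIONS = {
    0b000: ("GEN", "Generic undefined request"),
    0b001: ("RD", "Memory read error"),
    0b010: ("WR", "Memory write error"),
    0b011: ("AC", "Address/Command error"),
    0b100: ("MS", "Memory scrubbing error"),
}


def decode_memory_controller_code(status):
    """Return (mnemonic, transaction) or None if this is not a
    memory controller error code. Hypothetical helper."""
    code = status & 0xFFFF          # MCA error code = STATUS bits 15:0
    if code >> 7 != 0b1:            # bits 15:7 must be 0000 0000 1
        return None
    mmm = (code >> 4) & 0b111       # transaction type
    channel = code & 0b1111         # 1111 means "channel not specified"
    if mmm not in MEM_TRANSACTIONS:
        return None
    short, text = MEM_TRANSACTIONS[mmm]
    # The "X" placeholder for an unspecified channel is this sketch's choice.
    chan = "X" if channel == 0b1111 else str(channel)
    return f"{short}_CHANNEL{chan}_ERR", text
```

For example, a STATUS value whose low 16 bits are `0x0091` decodes to `("RD_CHANNEL1_ERR", "Memory read error")`, matching the two lines above.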
1.2. Error type 2
Hardware event. This is not a software error.
MCE 29
CPU 10 THERMAL EVENT TSC 1e8c8c4e7716c51
TIME 1472780482 Fri Sep 2 09:41:22 2016
Processor 10 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88000bcb MCGSTATUS 0
MCGCAP 1000c12 APICID 9 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 30
CPU 10 THERMAL EVENT TSC 1e8c8c4e799640a
TIME 1472780482 Fri Sep 2 09:41:22 2016
Processor 10 below trip temperature. Throttling disabled
STATUS 88010a8a MCGSTATUS 0
MCGCAP 1000c12 APICID 9 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Fault type: CPU over-temperature. Possible triggers: CPU fault, fan failure, machine-room air-conditioning failure, etc.
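To see which processors are throttling and how often, the "heated above trip temperature" lines can be tallied per processor. This is a minimal sketch that assumes the log text has the same line format as the records above; `count_throttle_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
# "Processor 10 heated above trip temperature. Throttling enabled."
ABOVE_RE = re.compile(
    r"^Processor (\d+) heated above trip temperature", re.MULTILINE
)


def count_throttle_events(log_text):
    """Count over-temperature (throttling enabled) events per processor."""
    return Counter(int(cpu) for cpu in ABOVE_RE.findall(log_text))
```

A processor that appears repeatedly in the result is a reasonable starting point for checking the heatsink, fans, and room cooling.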
1.3. Error type 3
Hardware event. This is not a software error.
MCE 0
CPU 6 BANK 5
MISC 2040080886 ADDR 107cd49280
TIME 1471889312 Tue Aug 23 02:08:32 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
STATUS cc1af54000010091 MCGSTATUS 0
MCGCAP 1000c12 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 1
CPU 6 BANK 9
MISC 90000010001208c ADDR a65fd5fc0
TIME 1471889312 Tue Aug 23 02:08:32 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
STATUS cc120010000800c1 MCGSTATUS 0
MCGCAP 1000c12 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
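The "MCi status" lines in these records correspond to the top bits of the STATUS value. The sketch below assumes the MCi_STATUS flag positions from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC). The function name is illustrative only.

```python
# (bit position, short name) for the high flag bits of MCi_STATUS.
STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status_flags(status):
    """Return the names of the flag bits set in a 64-bit STATUS value."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
```

For `0xcc1af54000010091` this returns `["VAL", "OVER", "MISCV", "ADDRV"]`. UC is clear, which is consistent with the "Corrected error" line, and OVER, MISCV and ADDRV match the other three status lines.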
[root@TENCENT64_site ~]# cat /var/log/mcelog_old |grep BANK|awk '{print $2}'|sort |uniq -c
2 10
1 11
2 18
2 19
2 20
2 21
2 22
2 23
8267 6
2 7
2 8
2 9
[root@TENCENT64_site ~]# cat /var/log/mcelog_old |grep BANK|awk '{print $4}'|sort |uniq -c
8118 5
170 9
[root@TENCENT64_site ~]# cat /proc/cpuinfo |grep -E "processor|physical id"
processor : 0
physical id : 0
processor : 1
physical id : 0
processor : 2
physical id : 0
processor : 3
physical id : 0
processor : 4
physical id : 0
processor : 5
physical id : 0
processor : 6
physical id : 1
processor : 7
physical id : 1
processor : 8
physical id : 1
processor : 9
physical id : 1
processor : 10
physical id : 1
processor : 11
physical id : 1
processor : 12
physical id : 0
processor : 13
physical id : 0
processor : 14
physical id : 0
processor : 15
physical id : 0
processor : 16
physical id : 0
processor : 17
physical id : 0
processor : 18
physical id : 1
processor : 19
physical id : 1
processor : 20
physical id : 1
processor : 21
physical id : 1
processor : 22
physical id : 1
processor : 23
physical id : 1
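The per-CPU counts can be combined with the processor-to-socket mapping to see which physical CPU the errors belong to. This is a sketch under the assumption that the input is the `processor` / `physical id` text shown above; both function names are hypothetical.

```python
from collections import Counter


def parse_cpu_sockets(cpuinfo_text):
    """Map each logical processor number to its physical id."""
    mapping = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            mapping[current] = int(value)
    return mapping


def errors_per_socket(cpu_counts, mapping):
    """Sum error counts (keyed by logical CPU) per physical id."""
    totals = Counter()
    for cpu, count in cpu_counts.items():
        # CPUs missing from the mapping are grouped under "unknown".
        totals[mapping.get(cpu, "unknown")] += count
    return totals
```

With the counts from the first `uniq -c` output, every logical CPU that reported an error maps to physical id 1, and the total is 8288, which equals the sum of the two BANK counts (8118 + 170).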
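All of the errors are reported on a single physical CPU (every logical CPU in the tally above maps to physical id 1), so CPU1 may be faulty. However, for this kind of error, the memory module generally needs to be replaced once the count reaches 10,000 or more.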
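The replacement rule of thumb can be checked mechanically. The sketch below counts records by their `CPU n BANK m` line and compares the total with the 10,000 threshold. The threshold comes from the text above, while the function name and the return shape are assumptions.

```python
import re

# One "CPU <n> BANK <m>" line is logged per machine-check record.
BANK_RE = re.compile(r"^CPU \d+ BANK \d+", re.MULTILINE)

# Rule-of-thumb threshold taken from the text, not an official limit.
REPLACE_THRESHOLD = 10_000


def needs_dimm_replacement(log_text, threshold=REPLACE_THRESHOLD):
    """Return (record_count, True if the count reaches the threshold)."""
    count = len(BANK_RE.findall(log_text))
    return count, count >= threshold
```

Typical use would be `needs_dimm_replacement(open("/var/log/mcelog_old").read())`, with the path taken from the commands above.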
1.4. Error type 4
Hardware event. This is not a software error.
MCE 0
CPU 6 BANK 17
MISC b00e047000380086
TIME 1472487450 Tue Aug 30 00:17:30 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ea2000000018110a MCGSTATUS 0
MCGCAP 7000816 APICID 10 SOCKETID 1
CPUID Vendor Intel Family 6 Model 63
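The "Generic CACHE Level-2 Generic Error" line can be reproduced from the low 16 bits of STATUS. The sketch below assumes the cache hierarchy error layout `000F 0001 RRRR TTLL` from the Intel SDM, where F is the correction-report filtering bit. The function name and the dictionary keys are illustrative.

```python
REQUESTS = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD",
    0b0100: "DWR", 0b0101: "IRD", 0b0110: "PREFETCH",
    0b0111: "EVICT", 0b1000: "SNOOP",
}
TYPES = {0b00: "Instruction", 0b01: "Data", 0b10: "Generic"}
LEVELS = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "Generic"}


def decode_cache_error_code(status):
    """Decode a cache hierarchy MCA error code, or return None.
    Hypothetical helper."""
    code = status & 0xFFFF                 # MCA error code = STATUS bits 15:0
    filtered = bool(code & (1 << 12))      # F bit: some errors not reported
    base = code & ~(1 << 12)
    if base >> 8 != 0b0001:                # bits 15:8 must be 000F 0001
        return None
    return {
        "filtered": filtered,
        "request": REQUESTS.get((code >> 4) & 0xF, "reserved"),
        "type": TYPES.get((code >> 2) & 0b11, "reserved"),
        "level": LEVELS[code & 0b11],
    }
```

For `0xea2000000018110a` the low 16 bits are `0x110a`, which decodes to a filtered, generic (ERR) request of generic type at level L2. Bits 61 (UC) and 57 (PCC) are also set in this value, consistent with the "Uncorrected error" and "Processor context corrupt" lines.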