1. Mcelog Error Analysis

1.1. Error Type 1

MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error

This is currently the most common error type. It is a corrected (recoverable) memory error.
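To check which error type dominates on a given host, the "MCA:" lines in the mcelog output can be tallied. The following is a minimal sketch, not part of the original notes: the function name `count_mca_types` and the sample text are illustrative assumptions, and it only assumes that each record carries one line starting with "MCA:" as in the excerpt above.

```python
from collections import Counter


def count_mca_types(log_text):
    """Count how often each 'MCA:' description appears in mcelog output."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("MCA:"):
            # Keep only the description after the 'MCA:' prefix.
            counts[line[len("MCA:"):].strip()] += 1
    return counts


# Hypothetical sample built from lines that appear in this document.
sample = """\
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
"""

for description, n in count_mca_types(sample).most_common():
    print(n, description)
```

On a real host the text would come from the mcelog log file rather than an inline sample.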

1.2. Error Type 2

Hardware event. This is not a software error.
MCE 29
CPU 10 THERMAL EVENT TSC 1e8c8c4e7716c51 
TIME 1472780482 Fri Sep  2 09:41:22 2016
Processor 10 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88000bcb MCGSTATUS 0
MCGCAP 1000c12 APICID 9 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45

Hardware event. This is not a software error.
MCE 30
CPU 10 THERMAL EVENT TSC 1e8c8c4e799640a 
TIME 1472780482 Fri Sep  2 09:41:22 2016
Processor 10 below trip temperature. Throttling disabled
STATUS 88010a8a MCGSTATUS 0
MCGCAP 1000c12 APICID 9 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45

Fault type: CPU overtemperature. Possible causes: CPU fault, fan fault, machine-room air-conditioning fault, etc.
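Thermal events come in pairs: one record when the processor rises above the trip temperature and one when it drops back below. The sketch below pulls these events out of mcelog text so they can be correlated with fan or air-conditioning incidents. The function name `thermal_events` is an illustrative assumption, and the parser assumes the record layout shown above ("CPU n THERMAL EVENT", then "TIME", then the "Processor n ..." message).

```python
import re

THERMAL_RE = re.compile(r"^CPU (\d+) THERMAL EVENT")


def thermal_events(log_text):
    """Return (cpu, unix_time, state) tuples; state is 'above' or 'below'."""
    events = []
    cpu = None
    timestamp = None
    for line in log_text.splitlines():
        line = line.strip()
        match = THERMAL_RE.match(line)
        if match:
            # Start of a thermal record: remember which CPU it belongs to.
            cpu = int(match.group(1))
            timestamp = None
        elif cpu is not None and line.startswith("TIME "):
            timestamp = int(line.split()[1])
        elif cpu is not None and line.startswith("Processor "):
            if "above trip temperature" in line:
                events.append((cpu, timestamp, "above"))
            elif "below trip temperature" in line:
                events.append((cpu, timestamp, "below"))
            cpu = None  # record handled
    return events


# Sample taken from the two records shown in this section.
sample = """\
CPU 10 THERMAL EVENT TSC 1e8c8c4e7716c51
TIME 1472780482 Fri Sep  2 09:41:22 2016
Processor 10 heated above trip temperature. Throttling enabled.
CPU 10 THERMAL EVENT TSC 1e8c8c4e799640a
TIME 1472780482 Fri Sep  2 09:41:22 2016
Processor 10 below trip temperature. Throttling disabled
"""

for event in thermal_events(sample):
    print(event)
```

A long gap between an "above" event and the matching "below" event, or many CPUs tripping at once, would point more towards cooling than towards a single faulty CPU.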

1.3. Error Type 3

Hardware event. This is not a software error.
MCE 0
CPU 6 BANK 5
MISC 2040080886 ADDR 107cd49280
TIME 1471889312 Tue Aug 23 02:08:32 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
STATUS cc1af54000010091 MCGSTATUS 0
MCGCAP 1000c12 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 1
CPU 6 BANK 9
MISC 90000010001208c ADDR a65fd5fc0
TIME 1471889312 Tue Aug 23 02:08:32 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
STATUS cc120010000800c1 MCGSTATUS 0
MCGCAP 1000c12 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45

Number of logged errors per CPU (field 2 of the "CPU n BANK m" lines):
[root@TENCENT64_site ~]# cat /var/log/mcelog_old |grep BANK|awk '{print $2}'|sort |uniq -c
      2 10
      1 11
      2 18
      2 19
      2 20
      2 21
      2 22
      2 23
   8267 6
      2 7
      2 8
      2 9
Number of logged errors per bank (field 4 of the same lines):
[root@TENCENT64_site ~]# cat /var/log/mcelog_old |grep BANK|awk '{print $4}'|sort |uniq -c
   8118 5
    170 9
Mapping of logical processors to physical sockets:
[root@TENCENT64_site ~]# cat /proc/cpuinfo |grep -E "processor|physical id"
processor       : 0
physical id     : 0
processor       : 1
physical id     : 0
processor       : 2
physical id     : 0
processor       : 3
physical id     : 0
processor       : 4
physical id     : 0
processor       : 5
physical id     : 0
processor       : 6
physical id     : 1
processor       : 7
physical id     : 1
processor       : 8
physical id     : 1
processor       : 9
physical id     : 1
processor       : 10
physical id     : 1
processor       : 11
physical id     : 1
processor       : 12
physical id     : 0
processor       : 13
physical id     : 0
processor       : 14
physical id     : 0
processor       : 15
physical id     : 0
processor       : 16
physical id     : 0
processor       : 17
physical id     : 0
processor       : 18
physical id     : 1
processor       : 19
physical id     : 1
processor       : 20
physical id     : 1
processor       : 21
physical id     : 1
processor       : 22
physical id     : 1
processor       : 23
physical id     : 1

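Nearly all of the errors are reported on a single CPU (processor 6, physical id 1), so CPU1 may be faulty. As a rule of thumb, however, once the number of errors of this kind exceeds 10,000, the memory module should be replaced.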
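The counting done with grep, awk and uniq above can also be sketched in Python, which makes it easy to apply the 10,000-error rule of thumb automatically. The names `count_by_cpu_bank`, `cpus_over_threshold` and `REPLACE_THRESHOLD` are illustrative assumptions, and the sample input is synthetic.

```python
import re
from collections import Counter

RECORD_RE = re.compile(r"^CPU (\d+) BANK (\d+)")
REPLACE_THRESHOLD = 10_000  # rule of thumb quoted in the text above


def count_by_cpu_bank(log_text):
    """Count mcelog records per (cpu, bank) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


def cpus_over_threshold(counts, threshold=REPLACE_THRESHOLD):
    """Return the CPUs whose total error count exceeds the threshold."""
    per_cpu = Counter()
    for (cpu, _bank), n in counts.items():
        per_cpu[cpu] += n
    return sorted(cpu for cpu, n in per_cpu.items() if n > threshold)


# Synthetic sample: most records on CPU 6, one on CPU 7.
sample = "CPU 6 BANK 5\n" * 3 + "CPU 6 BANK 9\n" + "CPU 7 BANK 5\n"

counts = count_by_cpu_bank(sample)
for (cpu, bank), n in sorted(counts.items()):
    print(f"cpu={cpu} bank={bank} count={n}")
print("over threshold:", cpus_over_threshold(counts, threshold=3))
```

With the figures from the transcript above (8267 records on CPU 6), the default threshold would not yet be exceeded.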

1.4. Error Type 4

Hardware event. This is not a software error.
MCE 0
CPU 6 BANK 17
MISC b00e047000380086
TIME 1472487450 Tue Aug 30 00:17:30 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ea2000000018110a MCGSTATUS 0
MCGCAP 7000816 APICID 10 SOCKETID 1
CPUID Vendor Intel Family 6 Model 63
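The flag lines that mcelog prints (for example "Uncorrected error" and "Processor context corrupt") are derived from the high bits of the STATUS value. The sketch below decodes those bits directly. It is not from the original notes: the bit positions follow the IA32_MCi_STATUS layout in the Intel architecture manual (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57), and the name `decode_status` is an illustrative assumption.

```python
# (bit position, short name, meaning) for the top flags of IA32_MCi_STATUS.
STATUS_FLAGS = [
    (63, "VAL", "status register valid"),
    (62, "OVER", "error overflow"),
    (61, "UC", "uncorrected error"),
    (60, "EN", "error reporting enabled"),
    (59, "MISCV", "MCi_MISC register valid"),
    (58, "ADDRV", "MCi_ADDR register valid"),
    (57, "PCC", "processor context corrupt"),
]


def decode_status(status_hex):
    """Return the names of the flags set in a hexadecimal STATUS value."""
    value = int(status_hex, 16)
    return [name for bit, name, _meaning in STATUS_FLAGS if (value >> bit) & 1]


# STATUS values copied from the records in this document.
for status in ("cc1af54000010091", "ea2000000018110a"):
    print(status, decode_status(status))
```

A record with UC clear is a corrected error, as in error type 3. A record with UC and PCC both set, as in this one, is the more serious case.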

Copyright © 温玉 2021 | 浙ICP备2020032454号. All rights reserved. Powered by Gitbook. File last revised: 2022-01-02 08:22:10