我们几乎没有戴尔机器(用RHEL 7.6),当我们替换机器上的DIMM卡时,因为我们从内核消息中看到了错误。
过了一段时间,我们再次检查了内核消息,我们发现了下面的内容,我们可以看到有关内存内存的错误(也与related https://access.redhat.com/solutions/6961932相关)
[Mon May 8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Error 0, type: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: fru_text: B6
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: section_type: memory error
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_status: 0x0000000000000400
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: physical_address: 0x000000446e0d5f00
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: MISC 0
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May 9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported为了确保上面的消息不是随机消息,我们决定重新启动机器,并查看有关内存的坏消息是否被复制。
但是有关RAM内存的错误消息仍然存在。
因此,我们对从内核消息中看到的问题感到困惑。
尽管我们更换了DIMM卡,但我们仍然会对RAM产生错误。
我必须在这里提供关于我们从IDRAC看到的更多信息。

如上所示,关于DIMM卡或RAM内存,IDRAC尚未完成。
所以问题是- dmesg (内核消息)怎么会被抱怨内存内存,尽管所有的DIMM都被替换了?
有没有可能别的东西是坏的,而不是DIMM卡?比如戴尔机器的主板?
发布于 2023-05-10 14:32:47
您看到的错误是由硬件纠正的单位ECC可纠正内存错误。这不会触发iDRAC中列出的失败组件,至少在它们的数量超过某些内部定义的阈值之前,但是您应该会看到在iDRAC SEL (系统事件日志)下记录的这个内存错误。
它不建议混合单一和双重等级模块,但您的里程可能有所不同,取决于处理器/主板版本。
https://serverfault.com/questions/1130827
复制相似问题