moonshot-test3 appears to have bad RAM

RESOLVED WONTFIX

Status

Infrastructure & Operations
RelOps
a year ago
3 months ago

People

(Reporter: dustin, Assigned: dragrom)

Tracking

(Blocks: 1 bug)

Logged about every 1s in syslog:

Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [  898.641016] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [  898.641027] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
I'm not sure whether these are actual ECC errors or some Xen VM funny business, since this is running on the hypervisor.
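To confirm which csrow/channel pairs are affected and how often, the messages can be tallied straight from syslog. The following is a minimal sketch, not something that was run on this host; the function name `summarize_edac_ue` is hypothetical, and it assumes the message format shown in the two log lines above.

```shell
#!/bin/sh
# Tally EDAC "UE" (uncorrected error) kernel messages per csrow/channel.
# Reads syslog text on stdin. The function name is hypothetical.
summarize_edac_ue() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ UE / {
            # Extract the "csrow:N channel:M" part of the message.
            if (match($0, /csrow:[0-9]+ channel:[0-9]+/)) {
                key = substr($0, RSTART, RLENGTH)
                count[key]++
            }
        }
        END {
            # One line per csrow/channel pair: "<count> csrow:N channel:M"
            for (key in count) print count[key], key
        }
    ' | sort -k2
}
```

Usage would be something like `summarize_edac_ue < /var/log/syslog`, which prints one count per csrow/channel pair.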

I installed edac-utils:

[root@moonshot-test3 ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
  mc0:IE31200

[root@moonshot-test3 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

It reports no errors, although Xen might be spoofing things to the EDAC module:

[root@moonshot-test3 ~]# edac-ctl --mainboard
edac-ctl: mainboard: Xen HVM domU
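As a cross-check on `edac-util`, the raw EDAC counters can be read from sysfs. This is a sketch under the assumption that the guest exposes the usual EDAC layout under `/sys/devices/system/edac/mc`; the function name `print_edac_counts` is hypothetical, and counters that are missing are simply skipped.

```shell
#!/bin/sh
# Print the EDAC error counters for every memory controller found in sysfs.
# The optional first argument overrides the sysfs root (assumed default path).
print_edac_counts() {
    edac_root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$edac_root"/mc[0-9]*; do
        # Skip the unexpanded glob when no controller directory exists.
        [ -d "$mc" ] || continue
        for counter in ue_count ue_noinfo_count ce_count ce_noinfo_count; do
            if [ -r "$mc/$counter" ]; then
                printf '%s %s=%s\n' "$(basename "$mc")" "$counter" "$(cat "$mc/$counter")"
            fi
        done
    done
}
```

If these counters stay at zero while the kernel keeps logging UE messages, that would be consistent with the suspicion that the messages do not reflect real memory errors.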


Inspecting dmesg on the hypervisor itself shows no issues.
Looking at the HV in XenCenter, there are over 1000 errors about dom0 not having enough memory. I bumped that up from 768 to 2048.

"Control Domain Memory Usage","3","The memory required by the control domain on server 'moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com' is 95.6% of its allocated memory. Occasional performance degradation can be expected when memory swapping is forced to happen.
This alarm is set to be triggered when the memory required by the control domain is above 95.0% of its allocated memory.","moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com","Apr 6, 2017 9:08 PM"
Bumping the memory on dom0 was a shot in the dark and didn't stop the EDAC messages, but it did fix the error being reported in XenCenter.

I downloaded memtest86+ (open source), but it turns out it doesn't support UEFI, so it was unusable. I was able to get memtest86 (freeware) to boot once I disabled the UEFI optimized boot option. I'm sure the memtest will take a while to run. In the meantime, I still don't think this is really a memory issue, so I'll email HP for support.
Created attachment 8861415 [details]
Memtest86 Results.png

Memtest86 finished without errors. The kernel still logs EDAC errors every second.
I'd like to note here that while troubleshooting other issues on the node, I noticed the errors go away when the host is booted without the GPU pass-through. Once it was booted back up with the GPU enabled, the errors cropped right back up.
Assignee: relops → dcrisan
:dragrom, I'm turning this over to you. :arr suggested installing the latest Intel graphics driver to see if that clears this up.
This is the link provided by HPE for the latest Intel graphics driver

https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.2
(Assignee)

Updated

a year ago
Status: NEW → ASSIGNED
(Assignee)

Comment 8

a year ago
I ran intel-graphics-update-tool-linux-os-v2.0.2. Afterwards, the EDAC errors persisted in /var/log/syslog:

May  5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592139] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
May  5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592153] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)


Possible solution

An exact solution is unfortunately not known, but in most cases it helps to blacklist the EDAC module. In any case, make sure that the server is working properly. It is best to perform a memory test.

Alternatively, there is the option to disable EDAC logging via sysfs.

 # Uncorrectable errors
 echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ue
 # Correctable errors
 echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce

Update: Disabling QuickBoot in the BIOS eliminates some error messages (as in the example of Server 1 above). The start-up process takes 30-60 seconds longer, but because the BIOS then performs a RAM check during boot, the EDAC error messages disappear.
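The quoted text mentions blacklisting the EDAC module but gives no command. One way this might be done is a modprobe.d fragment, sketched below. The module name `ie31200_edac` is an assumption inferred from the `ie31200` driver named in the log lines, and the function name and file name `blacklist-edac.conf` are hypothetical. Note that blacklisting also hides genuine memory errors.

```shell
#!/bin/sh
# Write a modprobe.d fragment that keeps the EDAC driver from loading.
# Assumptions: module name ie31200_edac, file name blacklist-edac.conf.
# Arguments: $1 = config directory (default /etc/modprobe.d), $2 = module name.
write_edac_blacklist() {
    conf_dir="${1:-/etc/modprobe.d}"
    module="${2:-ie31200_edac}"
    printf '# Keep the EDAC driver from loading at boot\nblacklist %s\n' "$module" \
        > "$conf_dir/blacklist-edac.conf"
}
```

Running `write_edac_blacklist` as root would create the fragment; on most distributions the module would then stay unloaded after the next reboot, possibly only once the initramfs has been regenerated.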
(Assignee)

Comment 9

a year ago
There is no BIOS. All BIOS-style controls are handled in XenCenter.
(Assignee)

Comment 10

a year ago
Email from HP:

Hi Dragos,


After more digging it does appear that the issue has to do with some of the “quirks in Intel’s windows guest driver”.

How big a problem does this present to you? We can certainly talk with Intel and see what can be done.

 

Thanks

Tim
Blocks: 1366828

Comment 11

7 months ago
There is a newer 2.0.6 intel update tool now: https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.6
I'm interested in testing that to see if the errors change. It no longer supports Ubuntu 16.04, but I think there is only one dependency we'll need to install (https://packages.ubuntu.com/zesty/amd64/libpackagekit-glib2-18/download) to run the update tool (and then remove/downgrade so as not to break other things).

Still seeing these:
```
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289947] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289959] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290277] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290292] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
```

Comment 12

7 months ago
We have set the log forwarding to drop the EDAC error messages: https://bugzilla.mozilla.org/show_bug.cgi?id=1410207
```
+# Bug 1410207: drop EDAC errors for now to avoid spamming the log
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)" ~
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)" ~
```

The messages are still logged in the local syslog on these systems.

Of the possible workarounds, I like the idea of a longer boot better than disabling the memory error messages with edac_mc_log_ue. If the systems are being rebooted after every job, that time may add up, but it keeps us from ignoring actual failures from EDAC (?).
(Assignee)

Updated

3 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED
Resolution: FIXED → WONTFIX