Closed
Bug 1359190
Opened 7 years ago
Closed 6 years ago
moonshot-test3 appears to have bad RAM
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: dustin, Assigned: dragrom)
Attachments
(1 file)
151.20 KB, image/png
Logged about every 1s in syslog:
Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [ 898.641016] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [ 898.641027] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
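For triage it can help to tally these messages per memory controller, csrow, and channel rather than reading them one by one. The following is a minimal Python sketch, assuming the message layout shown in the lines above; the names `EDAC_RE` and `summarize_edac` are made up for this illustration and are not part of any EDAC tooling.

```python
import re
from collections import Counter

# Matches the EDAC message layout quoted in this bug (an assumption: other
# EDAC drivers or kernel versions may format their messages differently).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>UE|CE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+) "
    r"grain:(?P<grain>\d+)\)"
)


def summarize_edac(lines):
    """Sum reported error counts per (kind, mc, csrow, channel)."""
    totals = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match is None:
            continue  # not an EDAC line in the expected layout
        key = (
            match["kind"],
            int(match["mc"]),
            int(match["csrow"]),
            int(match["channel"]),
        )
        totals[key] += int(match["count"])
    return totals
```

Feeding it the contents of /var/log/syslog (for example `summarize_edac(open("/var/log/syslog"))`) would give one counter entry per affected csrow/channel pair.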
Comment 1•7 years ago
I'm not sure whether these are actual ECC errors or some Xen VM funny business, since this is running on the hypervisor. I installed edac-utils:
[root@moonshot-test3 ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
mc0:IE31200
[root@moonshot-test3 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.
It reports no errors, although Xen might be spoofing things to the EDAC module:
[root@moonshot-test3 ~]# edac-ctl --mainboard
edac-ctl: mainboard: Xen HVM domU
Inspecting dmesg on the hypervisor itself shows no issues.
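The kernel also exposes per-controller EDAC counters under sysfs, which gives a second way to cross-check what edac-util prints. Below is a minimal Python sketch, assuming the standard /sys/devices/system/edac/mc layout; the function name `read_edac_counters` and its `root` parameter are hypothetical, and whether a Xen guest sees meaningful values there is not established in this bug.

```python
from pathlib import Path


def read_edac_counters(root="/sys/devices/system/edac/mc"):
    """Return {controller: {counter_name: value}} for each mcN directory."""
    wanted = ("ue_count", "ce_count", "ue_noinfo_count", "ce_noinfo_count")
    counters = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in wanted:
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        counters[mc_dir.name] = entry
    return counters
```

If the sysfs counters stay at zero while the kernel keeps logging UE messages, that mismatch is worth recording alongside the edac-util output.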
Comment 2•7 years ago
Looking at the HV in XenCenter, there are over 1000 errors about dom0 not having enough memory. I bumped that up from 768 to 2048.
"Control Domain Memory Usage","3","The memory required by the control domain on server 'moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com' is 95.6% of its allocated memory. Occasional performance degradation can be expected when memory swapping is forced to happen. This alarm is set to be triggered when the memory required by the control domain is above 95.0% of its allocated memory.","moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com","Apr 6, 2017 9:08 PM"
Comment 3•7 years ago
Bumping the memory on dom0 was a shot in the dark and didn't stop the EDAC errors, but it did fix the error being reported in XenCenter. I downloaded Memtest86+ (open source), but it turns out it doesn't support UEFI, so it was unusable. I was able to get Memtest86 (freeware) to boot once I disabled the UEFI optimized boot option. I'm sure the memtest will take a while to run. In the meantime, I still don't think this is really a memory issue, so I'll email HP for support.
Comment 4•7 years ago
Memtest86 finished without errors. The kernel still logs EDAC errors every second.
Comment 5•7 years ago
I'd like to note here that while troubleshooting other issues on the node, I noticed the errors go away when the host is booted without GPU passthrough. Once it was booted back up with the GPU enabled, the errors cropped right back up.
Updated•7 years ago
Assignee: relops → dcrisan
Comment 6•7 years ago
:dragrom, I'm turning this over to you. :arr suggested installing the latest Intel graphics driver to see if that clears this up.
Comment 7•7 years ago
This is the link provided by HPE for the latest Intel graphics driver: https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.2
Updated•7 years ago
Status: NEW → ASSIGNED
Comment 8•7 years ago
I ran intel-graphics-update-tool-linux-os-v2.0.2. After that, the EDAC error persisted in /var/log/syslog:
May 5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592139] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
May 5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592153] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
Possible solution
An exact solution is unfortunately not known, but in most cases it helps to blacklist the EDAC module. In any case, make sure that the server is working properly; it is best to perform a memory test. Alternatively, there is the option to disable EDAC logging via sysfs.
# Uncorrectable errors
echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ue
# Correctable errors
echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce
Update: Disabling QuickBoot in the BIOS suppresses some error messages (as in the example of Server 1 above). The start-up process takes 30-60 seconds longer, but because the BIOS checks the RAM during boot, the EDAC error messages disappear.
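The sysfs option quoted above can also be scripted, which makes it easy to turn logging back on later. This is a minimal Python sketch, assuming the two edac_core module parameters named in the quoted text are present and that the script runs as root; the function name `set_edac_logging` and its `params_dir` argument are hypothetical. The setting does not survive a reboot.

```python
from pathlib import Path


def set_edac_logging(enabled, params_dir="/sys/module/edac_core/parameters"):
    """Write 1 (log) or 0 (do not log) to the EDAC UE/CE logging parameters.

    Returns the names of the parameters that were actually written, so the
    caller can tell when a parameter file is missing on this kernel.
    """
    value = "1" if enabled else "0"
    written = []
    for name in ("edac_mc_log_ue", "edac_mc_log_ce"):
        param = Path(params_dir) / name
        if param.exists():
            param.write_text(value)
            written.append(name)
    return written
```

Calling `set_edac_logging(False)` corresponds to the two echo commands; note that this silences every uncorrectable and correctable error report, not only the repeating one in this bug.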
Comment 9•7 years ago
There is no BIOS. All BIOS-style controls are handled in XenCenter.
Comment 10•7 years ago
Email from HP:
Hi Dragos,
After more digging it does appear that the issue is to do with some of the “quirks in Intel’s windows guest driver”. How big a problem does this present to you? We can certainly talk with Intel and see what can be done.
Thanks
Tim
Comment 11•7 years ago
There is a newer 2.0.6 Intel update tool now: https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.6
I'm interested in testing that to see if the errors change. It no longer supports Ubuntu 16.04, but I think there is only one dependency we'll need to install (https://packages.ubuntu.com/zesty/amd64/libpackagekit-glib2-18/download) to run the update tool (and then remove or downgrade so as not to break other things).
Still seeing these:
```
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289947] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289959] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290277] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290292] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
```
Comment 12•7 years ago
We have set the log forwarding to drop the EDAC error messages: https://bugzilla.mozilla.org/show_bug.cgi?id=1410207
```
+# Bug 1410207: drop EDAC errors for now to avoid spamming the log
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)" ~
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)" ~
```
The messages are still logged in the local syslog on these systems.
Of the possible workarounds, I like the idea of a longer boot better than disabling the memory error messages with edac_mc_log_ue. If the systems are being rebooted after every job, that time may add up, but it keeps us from ignoring actual failures from EDAC (?).
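To see which messages a substring-based drop rule like the one above would suppress, the matching can be emulated offline. The following is a minimal Python sketch, assuming case-sensitive substring matching as the `contains` comparison implies; `DROP_SUBSTRINGS` and `forwarded` are names invented for this illustration, and this is not how rsyslog is implemented.

```python
# The two exact message bodies named in the forwarding rules for bug 1410207.
DROP_SUBSTRINGS = (
    "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 "
    "(csrow:3 channel:0 page:0x0 offset:0x0 grain:8)",
    "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 "
    "(csrow:3 channel:1 page:0x0 offset:0x0 grain:8)",
)


def forwarded(messages, drop_substrings=DROP_SUBSTRINGS):
    """Return the messages that would still be forwarded after the drop rules."""
    return [
        message
        for message in messages
        if not any(substring in message for substring in drop_substrings)
    ]
```

Because the rules match the full message text, including csrow, channel, page, and offset, an EDAC report with different values (for example another csrow or a nonzero page) would not match and would still be forwarded.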
Updated•6 years ago
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Resolution: FIXED → WONTFIX