1359190 - moonshot-test3 appears to have bad RAM

Reporter

Description

•

7 years ago

Logged about every 1s in syslog:

Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [  898.641016] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Apr 24 12:14:58 moonshot-test3.test.releng.scl3.mozilla.com kernel: [  898.641027] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
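
For anyone triaging these lines in bulk, the fields can be pulled apart with a short Python sketch. The regular expression is an assumption based only on the two ie31200 lines quoted above (other EDAC drivers may format their messages differently), and `parse_edac_line` is a hypothetical helper name, not part of any tool mentioned in this bug.

```python
import re

# Pattern derived from the ie31200 lines quoted above (an assumption;
# other EDAC drivers may use a different message layout).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>UE|CE) (?P<driver>\S+) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+) "
    r"grain:(?P<grain>\d+)\)"
)


def parse_edac_line(line):
    """Return the EDAC fields of a syslog line as a dict, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "mc": int(fields["mc"]),
        "count": int(fields["count"]),
        "kind": fields["kind"],          # "UE" = uncorrected, "CE" = corrected
        "driver": fields["driver"],
        "csrow": int(fields["csrow"]),
        "channel": int(fields["channel"]),
        "page": int(fields["page"], 16),
        "offset": int(fields["offset"], 16),
        "grain": int(fields["grain"]),
    }
```

Both quoted lines parse to csrow 3 with page and offset equal to zero, differing only in the channel number.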

Jake Watkins [:dividehex]

Comment 1

•

7 years ago

I'm not sure whether these are actual ECC errors or some Xen VM funny business, since this is running on the hypervisor.

I installed edac-utils:

[root@moonshot-test3 ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
  mc0:IE31200

[root@moonshot-test3 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

It reports no errors, although Xen might be spoofing data to the EDAC module:

[root@moonshot-test3 ~]# edac-ctl --mainboard
edac-ctl: mainboard: Xen HVM domU


Inspecting dmesg on the hypervisor itself shows no issues.
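
As a cross-check on the `edac-util` output, the kernel's EDAC counters can also be read directly from sysfs. The sketch below assumes the standard EDAC layout under `/sys/devices/system/edac/mc/` with per-controller `ce_count`, `ce_noinfo_count`, `ue_count`, and `ue_noinfo_count` files; `read_edac_counters` is a hypothetical helper, and the `root` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path

COUNTER_FILES = ("ce_count", "ce_noinfo_count", "ue_count", "ue_noinfo_count")


def read_edac_counters(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: {counter_name: value}} for each mcN found.

    Assumes the usual EDAC sysfs layout; returns an empty dict when the
    directory is missing (for example, when no EDAC driver is loaded).
    """
    base = Path(root)
    counters = {}
    if not base.is_dir():
        return counters
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        values = {}
        for name in COUNTER_FILES:
            counter = mc_dir / name
            if counter.is_file():
                values[name] = int(counter.read_text().strip())
        counters[mc_dir.name] = values
    return counters
```

If the counters stay at zero while the kernel log keeps reporting UE events, that would be consistent with the suspicion above that the messages are not genuine memory errors.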

Jake Watkins [:dividehex]

Comment 2

•

7 years ago

Looking at the HV under XenCenter, there are over 1000 errors about dom0 not having enough memory. I bumped its allocation up from 768 to 2048.

"Control Domain Memory Usage","3","The memory required by the control domain on server 'moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com' is 95.6% of its allocated memory. Occasional performance degradation can be expected when memory swapping is forced to happen.
This alarm is set to be triggered when the memory required by the control domain is above 95.0% of its allocated memory.","moonshot-cartridge1-demo.inband.releng.scl3.mozilla.com","Apr 6, 2017 9:08 PM"

Jake Watkins [:dividehex]

Comment 3

•

7 years ago

Bumping the memory on dom0 was a shot in the dark and didn't stop the EDAC messages, but it did fix the error being reported in XenCenter.

I downloaded Memtest86+ (open source), but it turns out it doesn't support UEFI, so it was unusable. I was able to get MemTest86 (freeware) to boot once I disabled the UEFI optimized boot option. I'm sure the memtest will take a while to run. In the meantime, I still don't think this is really a memory issue, so I'll email HP for support.

Jake Watkins [:dividehex]

Comment 4

•

7 years ago

Attached image Memtest86 Results.png — Details

Memtest86 finished without errors. The kernel still logs EDAC errors every second.

Jake Watkins [:dividehex]

Comment 5

•

7 years ago

I'd like to note here that while troubleshooting other issues on the node, I noticed the errors go away when the host is booted without GPU passthrough. Once it was booted back up with the GPU enabled, the errors cropped right back up.

Jake Watkins [:dividehex]

Updated

•

7 years ago

Assignee: relops → dcrisan

Jake Watkins [:dividehex]

Comment 6

•

7 years ago

:dragrom, I'm turning this over to you. :arr suggested installing the latest Intel graphics driver to see if that clears this up.

Jake Watkins [:dividehex]

Comment 7

•

7 years ago

This is the link provided by HPE for the latest Intel graphics driver

https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.2

Dragos Crisan [:dragrom]

Assignee

Updated

•

7 years ago

Status: NEW → ASSIGNED

Dragos Crisan [:dragrom]

Assignee

Comment 8

•

7 years ago

I ran intel-graphics-update-tool-linux-os-v2.0.2. After that, the EDAC error persisted in /var/log/syslog:

May  5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592139] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
May  5 02:45:04 moonshot-test3.test.releng.scl3.mozilla.com kernel: [849679.592153] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)


Possible solution

An exact solution is unfortunately not known, but in most cases it helps to blacklist the EDAC module. In any case, make sure that the server is working properly. It is best to perform a memory test.

Alternatively, there is the option to disable EDAC logging via sysfs.

  # Uncorrectable errors
  echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ue
  # Correctable errors
  echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce

Update: Disabling QuickBoot in the BIOS suppresses some of the error messages (as in the example of Server 1 above). Startup takes 30-60 seconds longer, but because the BIOS runs a RAM check during boot, the EDAC error messages disappear.
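
The sysfs workaround quoted above can also be scripted. This is a minimal sketch, assuming the `edac_core` module exposes `edac_mc_log_ue` and `edac_mc_log_ce` under `/sys/module/edac_core/parameters`, as in the quoted commands; `set_edac_logging` is a hypothetical helper, and the `params_dir` argument is there only so the function can be exercised against a scratch directory.

```python
from pathlib import Path


def set_edac_logging(enabled, kinds=("ue", "ce"),
                     params_dir="/sys/module/edac_core/parameters"):
    """Write 1 or 0 to the edac_mc_log_<kind> parameters that exist.

    Returns the names of the parameters that were written. Writing to the
    real sysfs files requires root, and the setting does not survive a
    reboot or a reload of the module.
    """
    value = "1" if enabled else "0"
    written = []
    for kind in kinds:
        param = Path(params_dir) / f"edac_mc_log_{kind}"
        if param.exists():
            param.write_text(value + "\n")
            written.append(param.name)
    return written
```

Calling `set_edac_logging(False, kinds=("ue",))` as root would silence only the uncorrectable-error messages, which are the ones reported in this bug. Note that this hides genuine errors as well.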

Dragos Crisan [:dragrom]

Assignee

Comment 9

•

7 years ago

There is no BIOS. All BIOS-style controls are handled in XenCenter.

Dragos Crisan [:dragrom]

Assignee

Comment 10

•

7 years ago

Email from HP:

Hi Dragos,


After more digging, it does appear that the issue has to do with some of the “quirks in Intel’s Windows guest driver”.

How big a problem does this present to you? We can certainly talk with Intel and see what can be done.

 

Thanks

Tim

Amy Rich [:arr] [:arich]

Updated

•

7 years ago

Blocks: 1366828

:dhouse

Comment 11

•

7 years ago

There is a newer 2.0.6 Intel update tool now: https://01.org/linuxgraphics/downloads/intel-graphics-update-tool-linux-os-v2.0.6
I'm interested in testing that to see if the errors change. It no longer supports Ubuntu 16.04, but I think there is only one dependency we'll need to install (https://packages.ubuntu.com/zesty/amd64/libpackagekit-glib2-18/download) to run the update tool (and then remove or downgrade so it does not break other things).

Still seeing these:
```
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289947] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:36 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709299.289959] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290277] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)
Oct 26 15:09:37 t-linux64-xe-296.test.releng.mdc1.mozilla.com kernel: [3709300.290292] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)
```
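
To confirm the "about every 1s" cadence from the kernel timestamps rather than by eye, the gaps between repeats of the same message can be computed. This is a rough sketch: the regular expression assumes the `kernel: [seconds.micros] message` layout seen in the lines above, and `repeat_intervals` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumes the "kernel: [<uptime seconds>] <message>" layout quoted above.
KERNEL_TS_RE = re.compile(r"kernel: \[\s*(?P<ts>\d+\.\d+)\] (?P<msg>.*)$")


def repeat_intervals(lines):
    """Map each kernel message to the gaps (in seconds) between its repeats."""
    seen = defaultdict(list)
    for line in lines:
        match = KERNEL_TS_RE.search(line.rstrip("\n"))
        if match:
            seen[match.group("msg")].append(float(match.group("ts")))
    return {
        msg: [round(later - earlier, 6)
              for earlier, later in zip(stamps, stamps[1:])]
        for msg, stamps in seen.items()
    }
```

Fed the four lines above, this yields one gap of roughly 1.0003 seconds for each channel, which matches the once-per-second rate reported in the description.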

:dhouse

Comment 12

•

7 years ago

We have set the log forwarding to drop the EDAC error messages: https://bugzilla.mozilla.org/show_bug.cgi?id=1410207
```
+# Bug 1410207: drop EDAC errors for now to avoid spamming the log
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:8)" ~
+:msg, contains, "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x0 offset:0x0 grain:8)" ~
```

The messages are still logged in the local syslog on these systems.
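
Because the rsyslog `contains` filter is a plain substring match, it is worth checking exactly what the two rules drop. The Python sketch below mimics that matching for illustration only; it is not how rsyslog is implemented, and `should_forward` is a hypothetical name.

```python
# The two strings from the rsyslog rules above, copied verbatim.
DROP_SUBSTRINGS = (
    "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#0 "
    "(csrow:3 channel:0 page:0x0 offset:0x0 grain:8)",
    "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#3channel#1 "
    "(csrow:3 channel:1 page:0x0 offset:0x0 grain:8)",
)


def should_forward(msg):
    """Return False when the message contains one of the dropped strings."""
    return not any(pattern in msg for pattern in DROP_SUBSTRINGS)
```

Under this reading, an EDAC message for a different csrow, page, or offset would not match either string and would still be forwarded, so the filter only hides the two specific messages that repeat every second.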

Of the possible workarounds, I like the idea of a longer boot better than disabling the memory error messages with edac_mc_log_ue. If the systems are being rebooted after every job, that time may add up, but it keeps us from ignoring actual failures from EDAC (?).

Dragos Crisan [:dragrom]

Assignee

Updated

•

6 years ago

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Jake Watkins [:dividehex]

Updated

•

6 years ago

Resolution: FIXED → WONTFIX

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: dustin, Assigned: dragrom)
