Closed Bug 1395805 Opened 7 years ago Closed 6 years ago

Intermittent dom/canvas/test/webgl-mochitest/test_capture.html | application crashed [@ pages_commit]

Categories

(Core :: Graphics: CanvasWebGL, defect, P5)

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 --- wontfix
firefox58 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: cleu)

References

Details

(Keywords: crash, intermittent-failure, regression)

Crash Data

Attachments

(2 files)

Lots of jemalloc at the top of that stack. Can you please help redirect this to a proper owner, glandium?
Flags: needinfo?(mh+mozilla)
The tricky part is that pages_commit crashes are really OOMs, and the parts below jemalloc in the stack are most likely unrelated to whatever is actually eating the memory.
Flags: needinfo?(mh+mozilla)
Peter, can you please help find an owner for this?
Flags: needinfo?(howareyou322)
See Also: → 1404672
In crash stats, the signature regressed on Nightly in this range (the first crashes came in with 57.0a1 build 20170831100258): https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ab2d700fda2b4934d24227216972dce9fac19b74&tochange=04b6be50a2526c7a26a63715f441c47e1aa1f9be
Keywords: regression
Flags: needinfo?(milan)
Are we running multiple tests at once?

The only thing I can think of is that we somehow keep too many GL contexts around for too long, and that eats up the memory.  The failure in comment 0 and the failure in bug 1404672 comment 0 are very similar, but they crash in different tests - perhaps during cleanup?

I'm also trying to figure out why we're running into the 16-context limit so often.
Flags: needinfo?(milan)
Michael, please help investigate this WebGL memory issue. Is it related to memory fragmentation?
Flags: needinfo?(howareyou322) → needinfo?(cleu)
Intermittent WebGL mochitest failures that occur only in Win32 non-e10s mode are usually caused by OOM.

And we can't really fix that: 32-bit Windows only sees roughly 3.2 GB of physical memory, and a single 32-bit process gets less than 2 GB of user-mode address space (see the sketch after this comment).

I remember we once discussed disabling WebGL tests on Windows 32-bit, especially in non-e10s mode, and the response was quite positive.

So I think we can just disable this test, and others that fail intermittently on Windows 32-bit non-e10s mode, in Mochitest-errata as well.
Flags: needinfo?(cleu)
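(Editorial aside: to make the 32-bit limits mentioned above concrete, here is a minimal, hypothetical C++ sketch - not part of any patch on this bug - that asks Windows for the per-process user-mode address range and the remaining virtual address space. The exact numbers depend on whether the build is linked /LARGEADDRESSAWARE and whether it runs on a 64-bit OS.)

#include <windows.h>
#include <cstdio>

int main() {
  // User-mode address range available to this process (just under 2 GB for a
  // plain 32-bit build, larger for a large-address-aware build on 64-bit Windows).
  SYSTEM_INFO si;
  GetSystemInfo(&si);

  // How much of that virtual address space is still free right now.
  MEMORYSTATUSEX ms = { sizeof(ms) };
  GlobalMemoryStatusEx(&ms);

  printf("user-mode address range: %p - %p\n",
         si.lpMinimumApplicationAddress, si.lpMaximumApplicationAddress);
  printf("virtual address space: total %llu MB, free %llu MB\n",
         (unsigned long long)(ms.ullTotalVirtual >> 20),
         (unsigned long long)(ms.ullAvailVirtual >> 20));
  return 0;
}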
In bug 1379868 we already discussed disabling all mochitest-gl runs on Windows 7 32-bit non-e10s mode, but it seems they are still running on win7-32 debug non-e10s after all the unittest tasks were moved onto TaskCluster.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f1251736514da97cf8065f631f0c93ce5c8695c7&selectedJob=134158330

Hi Geoff, can you take a look at it?
Flags: needinfo?(gbrown)
Thanks Michael!

:jmaher -- Can you confirm that there isn't a reason to be running win7 debug non-e10s mochitest-webgl, given the discussion here and in bug 1379868? Do you have an idea of how best to stop running them only on that platform? (If not, I can work it out.)
Flags: needinfo?(gbrown) → needinfo?(jmaher)
What worries me is that the pages_commit signature is also trending as one of our top crashes on Beta in the wild right now. This feels like a situation where our CI is alerting us to a real problem our users are facing and we're basically talking about turning off the alarm instead of fixing the problem? This wasn't a top crash in 56, so what's made this worse in 57?
we can easily disable non-e10s win7 webgl bits:
http://searchfox.org/mozilla-central/source/taskcluster/ci/test/tests.yml#906

just remove that line (actually that whole block) and only e10s tests will be run.

the reason we run these in non-e10s is that there was pushback and concern when we disabled non-e10s: there are valid scenarios and configurations where we still need non-e10s (legacy addons, android), and there is a lack of coverage there.

If we understand the risks (or maybe lack thereof), then I am happy to see us remove test coverage.
Flags: needinfo?(jmaher)
Thanks Joel...and Ryan!

:Lenzak - Please address comment 17. If you still want these tests to be discontinued, ni me just one more time!
Flags: needinfo?(cleu)
I saw Bug 1229384, which discusses the same signature as well.

It seems that this crash signature happens in both 32-bit and 64-bit builds.

It's possible that something happened which makes Firefox more OOM-prone,
and these recently emerged OOM-related intermittent failures are a side effect.

gbrown: I'll consider tomorrow whether to disable them or not, maybe after some discussion.
See Also: → 1229384
After discussing with :jgilbert, I think we should not disable it for now.

So just leave it.
Flags: needinfo?(cleu)
We are going to investigate this in 58.
Assignee: nobody → cleu
I tried running the WebGL mochitests locally and found that the 32-bit Firefox build consumes less than 300 MB of memory during the tests.

It's hard to imagine that this amount of memory can lead to OOM, so I decided to create a diagnostic patch which prints some info when a memory problem occurs and push it to the try server, hoping to get lucky and obtain some useful information.
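(Editorial aside: the actual diagnostic patch is one of the two attachments on this bug; the sketch below is only a guess at the kind of instrumentation involved. LogCommitFailure is a hypothetical helper name, and the real patch's output format may differ.)

#include <windows.h>
#include <cstdio>

// Hypothetical helper: called when committing pages (VirtualAlloc with
// MEM_COMMIT) fails, to record the Windows error code and the overall
// memory state at the moment of failure.
static void LogCommitFailure(size_t aBytes) {
  DWORD err = GetLastError();  // e.g. 0x5AF = ERROR_COMMITMENT_LIMIT
  MEMORYSTATUSEX ms = { sizeof(ms) };
  GlobalMemoryStatusEx(&ms);
  fprintf(stderr,
          "commit of %zu bytes failed: error=0x%lX availPhys=%llu MB "
          "availPageFile=%llu MB availVirtual=%llu MB\n",
          aBytes, (unsigned long)err,
          (unsigned long long)(ms.ullAvailPhys >> 20),
          (unsigned long long)(ms.ullAvailPageFile >> 20),
          (unsigned long long)(ms.ullAvailVirtual >> 20));
}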
See Also: → 1400994
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f56449b75f111e9e498c200eeaeb5c7f7729b3d&selectedJob=138832999

I managed to reproduce the crash, although in a different webgl-mochitest test.

In the attachment, we can see that we failed to commit a memory region of 512 KB; the error code 0x5AF (ERROR_COMMITMENT_LIMIT) indicates that the paging file is too small to back the requested virtual memory allocation.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681385(v=vs.85).aspx

However, the "MEMORY STAT" line tells a different story: it shows a maximum contiguous block of available virtual memory of 516 MB, which is clearly more than enough for the allocation (a sketch of how such a figure can be computed follows below).

If what "MEMORY STAT" reports is correct, this allocation failure is not caused by memory fragmentation.

I suspect it is related to the page file configuration of our Windows 7 TaskCluster workers.

Hi Geoff, do you know the page file size setting on our Windows 7 TaskCluster workers?

If this crash is caused by an insufficient page file, maybe we can fix it simply by increasing the page file size.
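(Editorial aside: the sketch below shows one way a "largest contiguous free block" figure like the 516 MB above could be computed - not necessarily how the diagnostic patch does it - by walking the address space with VirtualQuery. A 516 MB free block rules out address-space fragmentation for a 512 KB request; ERROR_COMMITMENT_LIMIT instead points at the system-wide commit limit, i.e. physical RAM plus page file, which supports the page-file theory.)

#include <windows.h>
#include <cstdio>

// Walk the process address space and return the size of the largest
// free (uncommitted, unreserved) region.
static size_t LargestFreeRegion() {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const char* p = static_cast<const char*>(si.lpMinimumApplicationAddress);
  const char* end = static_cast<const char*>(si.lpMaximumApplicationAddress);
  size_t largest = 0;
  MEMORY_BASIC_INFORMATION mbi;
  while (p < end && VirtualQuery(p, &mbi, sizeof(mbi)) != 0) {
    if (mbi.State == MEM_FREE && mbi.RegionSize > largest) {
      largest = mbi.RegionSize;
    }
    p = static_cast<const char*>(mbi.BaseAddress) + mbi.RegionSize;
  }
  return largest;
}

int main() {
  printf("largest contiguous free region: %zu MB\n",
         LargestFreeRegion() >> 20);
  return 0;
}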
Flags: needinfo?(gbrown)
:grenade, would you know the win7 TaskCluster VM page file size?
Flags: needinfo?(rthijssen)
Flags: needinfo?(gbrown)
we currently use a configuration that doesn't allocate a pagefile at all.

here's a task that lists the current pagefile settings using the windows command `wmic pagefile list`:
https://tools.taskcluster.net/groups/BPFU3Ks1S-yAvNYNmjoHpw/tasks/BPFU3Ks1S-yAvNYNmjoHpw/runs/0/logs/public%2Flogs%2Flive.log

the output shows that the only pagefile is on c: and is disabled:

Z:\task_1508871491>wmic pagefile list 
AllocatedBaseSize  CurrentUsage  Description      InstallDate                Name             PeakUsage  Status  TempPageFile  
64                 0             C:\pagefile.sys  20161027152530.075200+000  C:\pagefile.sys  0                  FALSE         


and here's a task that creates a new pagefile on the y: drive using windows command `wmic pagefileset create name="Y:\\pagefile.sys",InitialSize=2048,MaximumSize=2048`:
https://tools.taskcluster.net/groups/aFPRhF2VST-iqGvj5VI9IQ/tasks/aFPRhF2VST-iqGvj5VI9IQ/runs/0/logs/public%2Flogs%2Flive.log
Flags: needinfo?(rthijssen)
Hi Joel

Is it possible to allocate a page file only for the WebGL mochitests on Win7 32-bit?

I tried your TaskCluster configuration and ran the same mochitest-gl chunk 25 times; no crash happened.

I think adding a page file for swapping may mitigate some intermittent OOM mochitest failures like this one.
Flags: needinfo?(jmaher)
:Lenzak, thanks for asking the question. I am not sure if we can make this happen; I think we would need a specific Windows image with a pagefile.sys configured. I also wonder why we don't have it on by default - it seems like a reasonable thing to have.

I see a few options:
1) turn it on by default for all win7 jobs that run in a VM
2) create a secondary config that has a pagefile and find a method for specifying that in taskcluster, then use that new method for just the webgl tests (and any others we determine need it)

In order to do this, we would need :grenade or :pmoore to help coordinate.  We would need to test this at scale (--rebuild 20) on windows 7* (or webgl if we restrict it).  Ideally that would yield no new failures or obvious [almost]perma failing tests.

:pmoore- could you comment on options 1 and 2 above?  maybe any further thoughts you have on this general topic?
Flags: needinfo?(jmaher) → needinfo?(pmoore)
See Also: → 1404541
pagefile is now created at instance instantiation. the file uses the y: drive (y:\pagefile.sys) for speed, since c: on ec2 ebs is rather slow (which is why we disabled pagefiles in the first place - the instances we use typically have 8 - 16 gb ram and shouldn't need to hit the slow hdd)

initial results look ok (nothing broke) but we can watch to see if we get reduced intermittent failures attributed to pagefile size now.
Thank you very much; I hope we can get rid of these annoying intermittent failures.
looks like this change has either caused another issue or possibly highlighted an existing problem. see bug 1412383, which started occurring right after the win 7 AMIs were updated to include the page file.
all of the failures mentioned in comment 35 above occurred before the AMI change, so it would appear that using a pagefile does indeed resolve this bug. will leave this open until the next OrangeFactor comment to be sure.
That's great, but bug 1412383 is quite concerning.

It crashes inside ANGLE's resource management code, it seems that something bad happen when deallocating memory space.

I will take a look at it.
Flags: needinfo?(pmoore)
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE