Closed Bug 1395805 Opened 7 years ago Closed 6 years ago

Intermittent dom/canvas/test/webgl-mochitest/test_capture.html | application crashed [@ pages_commit]

Categories

(Core :: Graphics: CanvasWebGL, defect, P5)

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 --- wontfix
firefox58 --- affected

People

(Reporter: intermittent-bug-filer, Assigned: cleu)

References

Details

(Keywords: crash, intermittent-failure, regression)

Crash Data

Attachments

(2 files)

Lots of jemalloc at the top of that stack. Can you please help redirect this to a proper owner, glandium?
Flags: needinfo?(mh+mozilla)
The tricky part is that pages_commit crashes are really OOMs, and the parts below jemalloc in the stack are most likely unrelated to whatever is actually eating the memory.
Flags: needinfo?(mh+mozilla)
Peter, can you please help find an owner for this?
Flags: needinfo?(howareyou322)
See Also: → 1404672
In crash stats, the signature regressed on Nightly in this range (the first crashes came in with 57.0a1 build 20170831100258): https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ab2d700fda2b4934d24227216972dce9fac19b74&tochange=04b6be50a2526c7a26a63715f441c47e1aa1f9be
Keywords: regression
Flags: needinfo?(milan)
Are we running multiple tests at once?

The only thing I can think of is that we somehow keep too many GL contexts around for too long, and that eats up the memory.  The failure in comment 0 and the failure in bug 1404672 comment 0 are very similar, but they crash in different tests - perhaps during cleanup?

I'm also trying to figure out why we're running into the 16-context limit so often.
Flags: needinfo?(milan)
Michael, please help investigate this WebGL memory issue. Is it related to memory fragmentation?
Flags: needinfo?(howareyou322) → needinfo?(cleu)
Intermittent WebGL mochitest failures that occur only in Win32 non-e10s mode are usually caused by OOM.

And we can't really fix that: 32-bit Windows only sees roughly 3.2 GB of physical memory, and a single 32-bit process gets less than 2 GB of user-mode address space (see the sketch after this comment).

I remember we once discussed disabling WebGL tests on Windows 32-bit, especially in non-e10s mode, and the response was quite positive.

So I think we can just disable this test, and others that fail intermittently on Windows 32-bit non-e10s mode, in Mochitest-errata as well.
Flags: needinfo?(cleu)
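(Editorial aside: to make the 32-bit limits mentioned above concrete, here is a minimal, hypothetical C++ sketch - not part of any patch on this bug - that asks Windows for the per-process user-mode address range and the remaining virtual address space. The exact numbers depend on whether the build is linked /LARGEADDRESSAWARE and whether it runs on a 64-bit OS.)

#include <windows.h>
#include <cstdio>

int main() {
  // User-mode address range available to this process (just under 2 GB for a
  // plain 32-bit build, larger for a large-address-aware build on 64-bit Windows).
  SYSTEM_INFO si;
  GetSystemInfo(&si);

  // How much of that virtual address space is still free right now.
  MEMORYSTATUSEX ms = { sizeof(ms) };
  GlobalMemoryStatusEx(&ms);

  printf("user-mode address range: %p - %p\n",
         si.lpMinimumApplicationAddress, si.lpMaximumApplicationAddress);
  printf("virtual address space: total %llu MB, free %llu MB\n",
         (unsigned long long)(ms.ullTotalVirtual >> 20),
         (unsigned long long)(ms.ullAvailVirtual >> 20));
  return 0;
}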
In bug 1379868 we already discussed disabling all mochitest-gl runs on Windows 7 32-bit non-e10s mode, but it seems they are still running on win7-32 debug non-e10s after all the unittest tasks were moved onto TaskCluster.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f1251736514da97cf8065f631f0c93ce5c8695c7&selectedJob=134158330

Hi Geoff, can you take a look at it?
Flags: needinfo?(gbrown)
Thanks Michael!

:jmaher -- Can you confirm that there isn't a reason to be running win7 debug non-e10s mochitest-webgl, given the discussion here and in bug 1379868? Do you have an idea of how best to stop running them only on that platform? (If not, I can work it out.)
Flags: needinfo?(gbrown) → needinfo?(jmaher)
What worries me is that the pages_commit signature is also trending as one of our top crashes on Beta in the wild right now. This feels like a situation where our CI is alerting us to a real problem our users are facing and we're basically talking about turning off the alarm instead of fixing the problem? This wasn't a top crash in 56, so what's made this worse in 57?
we can easily disable non-e10s win7 webgl bits:
http://searchfox.org/mozilla-central/source/taskcluster/ci/test/tests.yml#906

just remove that line (actually that whole block) and only e10s tests will be run.

the reason we run these in non-e10s is that there was pushback and concern when we disabled non-e10s: there are valid scenarios and configurations where we still need non-e10s (legacy addons, android), and there is a lack of coverage there.

If we understand the risks (or maybe lack thereof), then I am happy to see us remove test coverage.
Flags: needinfo?(jmaher)
Thanks Joel...and Ryan!

:Lenzak - Please address comment 17. If you still want these tests to be discontinued, ni me just one more time!
Flags: needinfo?(cleu)
I saw Bug 1229384, which discusses the same signature as well.

It seems that this crash signature happens in both 32-bit and 64-bit builds.

It's possible that something happened which makes Firefox more OOM-prone,
and these recently emerged OOM-related intermittent failures are a side effect.

gbrown: I'll consider tomorrow whether to disable them or not, maybe after some discussion.
See Also: → 1229384
After discussing with :jgilbert, I think we should not disable it for now.

So just leave it.
Flags: needinfo?(cleu)
We are going to investigate this in 58.
Assignee: nobody → cleu
I tried running the WebGL mochitests locally and found that the 32-bit Firefox build consumes less than 300 MB of memory during the tests.

It's hard to imagine that this amount of memory can lead to OOM, so I decided to create a diagnostic patch which prints some info when a memory problem occurs and push it to the try server, hoping to get lucky and obtain some useful information.
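(Editorial aside: the actual diagnostic patch is one of the two attachments on this bug; the sketch below is only a guess at the kind of instrumentation involved. LogCommitFailure is a hypothetical helper name, and the real patch's output format may differ.)

#include <windows.h>
#include <cstdio>

// Hypothetical helper: called when committing pages (VirtualAlloc with
// MEM_COMMIT) fails, to record the Windows error code and the overall
// memory state at the moment of failure.
static void LogCommitFailure(size_t aBytes) {
  DWORD err = GetLastError();  // e.g. 0x5AF = ERROR_COMMITMENT_LIMIT
  MEMORYSTATUSEX ms = { sizeof(ms) };
  GlobalMemoryStatusEx(&ms);
  fprintf(stderr,
          "commit of %zu bytes failed: error=0x%lX availPhys=%llu MB "
          "availPageFile=%llu MB availVirtual=%llu MB\n",
          aBytes, (unsigned long)err,
          (unsigned long long)(ms.ullAvailPhys >> 20),
          (unsigned long long)(ms.ullAvailPageFile >> 20),
          (unsigned long long)(ms.ullAvailVirtual >> 20));
}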
See Also: → 1400994
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f56449b75f111e9e498c200eeaeb5c7f7729b3d&selectedJob=138832999

I managed to reproduce the crash, although in a different webgl-mochitest test.

In the attachment, we can see that we failed to commit a memory region of 512 KB; the error code 0x5AF (ERROR_COMMITMENT_LIMIT) indicates that the paging file is too small to back the requested virtual memory allocation.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681385(v=vs.85).aspx

However, the "MEMORY STAT" line tells a different story: it shows a maximum contiguous block of available virtual memory of 516 MB, which is clearly more than enough for the allocation (a sketch of how such a figure can be computed follows below).

If what "MEMORY STAT" reports is correct, this allocation failure is not caused by memory fragmentation.

I suspect it is related to the page file configuration of our Windows 7 TaskCluster workers.

Hi Geoff, do you know the page file size setting on our Windows 7 TaskCluster workers?

If this crash is caused by an insufficient page file, maybe we can fix it simply by increasing the page file size.
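(Editorial aside: the sketch below shows one way a "largest contiguous free block" figure like the 516 MB above could be computed - not necessarily how the diagnostic patch does it - by walking the address space with VirtualQuery. A 516 MB free block rules out address-space fragmentation for a 512 KB request; ERROR_COMMITMENT_LIMIT instead points at the system-wide commit limit, i.e. physical RAM plus page file, which supports the page-file theory.)

#include <windows.h>
#include <cstdio>

// Walk the process address space and return the size of the largest
// free (uncommitted, unreserved) region.
static size_t LargestFreeRegion() {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const char* p = static_cast<const char*>(si.lpMinimumApplicationAddress);
  const char* end = static_cast<const char*>(si.lpMaximumApplicationAddress);
  size_t largest = 0;
  MEMORY_BASIC_INFORMATION mbi;
  while (p < end && VirtualQuery(p, &mbi, sizeof(mbi)) != 0) {
    if (mbi.State == MEM_FREE && mbi.RegionSize > largest) {
      largest = mbi.RegionSize;
    }
    p = static_cast<const char*>(mbi.BaseAddress) + mbi.RegionSize;
  }
  return largest;
}

int main() {
  printf("largest contiguous free region: %zu MB\n",
         LargestFreeRegion() >> 20);
  return 0;
}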
Flags: needinfo?(gbrown)
:grenade, would you know the win7 TaskCluster VM page file size?
Flags: needinfo?(rthijssen)
Flags: needinfo?(gbrown)
we currently use a configuration that doesn't allocate a pagefile at all.

here's a task that lists the current pagefile settings using the windows command `wmic pagefile list`:
https://tools.taskcluster.net/groups/BPFU3Ks1S-yAvNYNmjoHpw/tasks/BPFU3Ks1S-yAvNYNmjoHpw/runs/0/logs/public%2Flogs%2Flive.log

the output shows that the only pagefile is on c: and is disabled:

Z:\task_1508871491>wmic pagefile list 
AllocatedBaseSize  CurrentUsage  Description      InstallDate                Name             PeakUsage  Status  TempPageFile  
64                 0             C:\pagefile.sys  20161027152530.075200+000  C:\pagefile.sys  0                  FALSE         


and here's a task that creates a new pagefile on the y: drive using windows command `wmic pagefileset create name="Y:\\pagefile.sys",InitialSize=2048,MaximumSize=2048`:
https://tools.taskcluster.net/groups/aFPRhF2VST-iqGvj5VI9IQ/tasks/aFPRhF2VST-iqGvj5VI9IQ/runs/0/logs/public%2Flogs%2Flive.log
Flags: needinfo?(rthijssen)
Hi Joel

Is it possible to allocate a page file only for the WebGL mochitests on Win7 32-bit?

I tried your TaskCluster configuration and ran the same mochitest-gl chunk 25 times; no crash happened.

I think adding a page file for swapping may mitigate some intermittent OOM mochitest failures like this one.
Flags: needinfo?(jmaher)
:Lenzak, thanks for asking the question. I am not sure if we can make this happen; I think we would need a specific Windows image with a pagefile.sys configured. I also wonder why we don't have it on by default - it seems like a reasonable thing to have.

I see a few options:
1) turn it on by default for all win7 jobs that run in a VM
2) create a secondary config that has a pagefile and find a method for specifying that in taskcluster, then use that new method for just the webgl tests (and any others we determine need it)

In order to do this, we would need :grenade or :pmoore to help coordinate.  We would need to test this at scale (--rebuild 20) on windows 7* (or webgl if we restrict it).  Ideally that would yield no new failures or obvious [almost]perma failing tests.

:pmoore- could you comment on options 1 and 2 above?  maybe any further thoughts you have on this general topic?
Flags: needinfo?(jmaher) → needinfo?(pmoore)
See Also: → 1404541
pagefile is now created at instance instantiation. the file uses the y: drive (y:\pagefile.sys) for speed, since c: on ec2 ebs is rather slow (which is why we disabled pagefiles in the first place - the instances we use typically have 8 - 16 gb ram and shouldn't need to hit the slow hdd)

initial results look ok (nothing broke) but we can watch to see if we get reduced intermittent failures attributed to pagefile size now.
Thank you very much; I hope we can get rid of these annoying intermittent failures.
looks like this change has either caused another issue or possibly highlighted an existing problem. see bug 1412383, which started occurring right after the win 7 AMIs were updated to include the page file.
all of the failures mentioned in comment 35 above occurred before the AMI change, so it would appear that using a pagefile does indeed resolve this bug. will leave this open until the next OrangeFactor comment to be sure.
That's great, but bug 1412383 is quite concerning.

It crashes inside ANGLE's resource management code, it seems that something bad happen when deallocating memory space.

I will take a look at it.
Flags: needinfo?(pmoore)
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE