Closed Bug 1298285 Opened 8 years ago Closed 7 years ago

Intermittent dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | getError expected: NO_ERROR. Was CONTEXT_LOST_WEBGL : Should be no errors

Categories

(Core :: Graphics: CanvasWebGL, defect, P5)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Assigned: cleu)

References

Details

(Keywords: intermittent-failure, Whiteboard: [gfx-noted][stockwell disabled])

Attachments

(1 file, 2 obsolete files)

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Whiteboard: [gfx-noted]
This seems to fail mostly on win7-pgo. The recent spike is high and appears to have started on May 31st. I don't see any errors in the last 12+ hours on trunk branches; possibly this is fixed or greatly reduced?

from this log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=104673572&lineNumber=7346


I see this error:
14:23:59     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | Buffer was the correct size: 1680x1050 
14:23:59     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | context was created properly 
14:23:59     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | getError was expected value: NO_ERROR : Should be no errors 
14:23:59     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | Buffer was the correct size: 1680x1050 
14:23:59     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | context was created properly 
14:23:59     INFO - Buffered messages finished
14:23:59     INFO - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | getError expected: NO_ERROR. Was CONTEXT_LOST_WEBGL : Should be no errors 
14:23:59     INFO -     reportResults@dom/canvas/test/webgl-conf/mochi-single.html?checkout/conformance/context/context-release-upon-reload.html:22:7
14:23:59     INFO -     reportTestResultsToHarness@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:116:5
14:23:59     INFO -     testFailed@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:246:5
14:23:59     INFO -     glErrorShouldBeImpl@dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js:1590:5
14:23:59     INFO -     glErrorShouldBe@dom/canvas/test/webgl-conf/checkout/js/webgl-test-utils.js:1564:3
14:23:59     INFO -     testContext@dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:66:3
14:23:59     INFO -     @dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:83:5
14:23:59     INFO -     EventListener.handleEvent*@dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:81:1
14:23:59     INFO - Not taking screenshot here: see the one that was previously logged
14:23:59     INFO - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | Buffer was the wrong size: 0x0 
14:23:59     INFO -     reportResults@dom/canvas/test/webgl-conf/mochi-single.html?checkout/conformance/context/context-release-upon-reload.html:22:7
14:23:59     INFO -     reportTestResultsToHarness@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:116:5
14:23:59     INFO -     testFailed@dom/canvas/test/webgl-conf/checkout/js/js-test-pre.js:246:5
14:23:59     INFO -     testContext@dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:70:5
14:23:59     INFO -     @dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:83:5
14:23:59     INFO -     EventListener.handleEvent*@dom/canvas/test/webgl-conf/checkout/conformance/context/context-release-upon-reload.html:81:1
14:29:09     INFO - Not taking screenshot here: see the one that was previously logged
14:29:09     INFO - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | Test timed out. 
14:29:09     INFO -     reportError@SimpleTest/TestRunner.js:121:7
14:29:09     INFO -     TestRunner._checkForHangs@SimpleTest/TestRunner.js:142:7
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
14:29:09     INFO -     TestRunner.runTests@SimpleTest/TestRunner.js:380:5
14:29:09     INFO -     RunSet.runtests@SimpleTest/setup.js:194:3
14:29:09     INFO -     RunSet.runall@SimpleTest/setup.js:173:5
14:29:09     INFO -     hookupTests@SimpleTest/setup.js:266:5
14:29:09     INFO - parseTestManifest@http://mochi.test:8888/manifestLibrary.js:36:5
14:29:09     INFO - getTestManifest/req.onload@http://mochi.test:8888/manifestLibrary.js:49:11
14:29:09     INFO - EventHandlerNonNull*getTestManifest@http://mochi.test:8888/manifestLibrary.js:45:3
14:29:09     INFO -     hookup@SimpleTest/setup.js:246:5
14:29:09     INFO - EventHandlerNonNull*@http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&hideResultsTable=1&manifestFile=tests.json&dumpOutputDirectory=c%3A%5Cusers%5Ccltbld%5Cappdata%5Clocal%5Ctemp&cleanupCrashes=true:11:1
14:29:10     INFO - GECKO(3668) | MEMORY STAT | vsize 1611MB | vsizeMaxContiguous 95MB | residentFast 135MB | heapAllocated 66MB
14:29:10     INFO - TEST-OK | dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html | took 311824ms



:milan, can you find someone to look at this in the next 2 weeks as this seems to have increased?
Flags: needinfo?(milan)
Whiteboard: [gfx-noted] → [gfx-noted][stockwell needswork]
Did we change the hardware/drivers on the systems that run these tests?
I'll look into it.
Assignee: nobody → cleu
This failure seems to occur only in 32-bit non-e10s mode, which makes me suspect it's a memory issue.

I tried to reproduce it on my local VM with the same configuration, but no luck yet.

I also compared the MEMORY STAT output when running this mochitest: my local VM has a smaller vsize (about 700~900 MB) and a bigger vsizeMaxContiguous (about 200~400 MB), while tryserver has a bigger vsize (about 1600 MB) and a smaller vsizeMaxContiguous (about 150 MB). That may indicate more memory fragmentation on tryserver, which is a potential cause of this failure.
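To compare runs, the MEMORY STAT values can be pulled out of a log mechanically. This is a minimal sketch, assuming the line format matches the MEMORY STAT line in the log quoted earlier in this bug; `parse_memory_stat` is a hypothetical helper, not part of the mochitest harness.

```python
import re

# Matches the MEMORY STAT line as it appears in the log quoted in this bug.
MEMORY_STAT_RE = re.compile(
    r"MEMORY STAT \| vsize (\d+)MB \| vsizeMaxContiguous (\d+)MB"
    r" \| residentFast (\d+)MB \| heapAllocated (\d+)MB"
)

def parse_memory_stat(line):
    """Return the four MEMORY STAT values in MB, or None if the line has none."""
    match = MEMORY_STAT_RE.search(line)
    if match is None:
        return None
    keys = ("vsize", "vsizeMaxContiguous", "residentFast", "heapAllocated")
    return dict(zip(keys, map(int, match.groups())))

# The MEMORY STAT line from the failing run.
line = ("14:29:10     INFO - GECKO(3668) | MEMORY STAT | vsize 1611MB | "
        "vsizeMaxContiguous 95MB | residentFast 135MB | heapAllocated 66MB")
stats = parse_memory_stat(line)
print(stats)
```

A low vsizeMaxContiguous next to a high vsize is the pattern to look for when comparing a failing run with a passing one.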
We have one-click loaners available. If you click on a job inside of treeherder (try or integration branch), there is an option for a 'one click loaner' in the job details that display. Once you get into the shell (via the browser), there is a wizard to set up and run a specific test job.

We also have the ability to change the image if you feel there are things to do there.  We share the linux64 image, but have :i386 libraries installed so that the 32 bit browser and tools run successfully.
This failure happens in win7-32bit.

Windows VMs are not supported by one-click loaners AFAIK.
:lenzak, do you have any updates on this intermittent?  It looks to be failing at the same rate.
Flags: needinfo?(cleu)
I am still investigating it.

Since I cannot reproduce it on my Windows 7 32-bit VM, I can only print some logs and push to try server to gather some information.

I initially thought that it was caused by a mis-discarded GL context because of our maximum live context policy, but it turns out that policy has nothing to do with this failure.

I am now printing every context's memory address and comparing the event logs to find what force-discards the context and makes the test fail.
Flags: needinfo?(cleu)
I think the context is force-lost because of a swap failure in WebGLContext::PresentScreenBuffer. Next I will investigate why the swap fails. Since it only happens in the 32-bit non-e10s configuration, I suspect it's caused by OOM.
OK, now I can confirm it's an OOM issue.

https://dxr.mozilla.org/mozilla-central/rev/95543bdc59bd038a3d5d084b85a4fec493c349ee/gfx/layers/client/CanvasClient.cpp#484
Aside from being unable to allocate a new back screen buffer, a warning about failing to allocate a TextureClient for the canvas always appears just before this test failure happens. That warning is usually caused by memory pressure.

It also explains why this test failure only happens under the 32-bit non-e10s configuration.

I think the reason I cannot reproduce it on my local VM is that the OOM condition only happens when the VM is running multiple mochitest tasks, so maybe this issue will be fixed if we can isolate this test or split the run into even smaller chunks.
The machines we run on have 15gb of memory available. Is it possible that we are at the memory limit most of the time and we just happen to cross the limit 5-10% of the time? Is it possible that when this fails there is another condition causing us to use much more memory or not free up previously used memory?

Typically we run tests per directory which translates to per manifest.  Could we split the manifest into two parts?  In other directories we are able to run large volumes of tests in a single mochitest session.

We always seem to fail on the same test:
dom/canvas/test/webgl-conf/generated/test_conformance__context__context-release-upon-reload.html

this indicates that maybe this test or a previously run test is the root cause?
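One way to check whether a previously run test is involved is to record, for each failing test, which test finished just before it. A rough sketch, assuming the `TEST-UNEXPECTED-FAIL | test | message` and `TEST-OK | test | took ...` line shapes seen in the log above; `summarize_failures` and the `hypothetical/test_previous.html` entry in the sample are made up for illustration.

```python
from collections import Counter

def summarize_failures(log_lines):
    """Count TEST-UNEXPECTED-FAIL lines per test and note the test that finished just before."""
    failures = Counter()
    previous_test = {}      # failing test -> test that finished right before it
    last_finished = None
    for line in log_lines:
        parts = [part.strip() for part in line.split(" | ")]
        if len(parts) < 2 or not parts[0]:
            continue
        status = parts[0].split()[-1]   # drop the timestamp and "INFO -" prefix
        test = parts[1]
        if status == "TEST-UNEXPECTED-FAIL":
            failures[test] += 1
            # TEST-OK for the failing test is logged after its failures,
            # so last_finished still names the preceding test here.
            previous_test.setdefault(test, last_finished)
        elif status == "TEST-OK":
            last_finished = test
    return failures, previous_test

TEST = ("dom/canvas/test/webgl-conf/generated/"
        "test_conformance__context__context-release-upon-reload.html")
sample = [
    # Hypothetical preceding test, not taken from a real log.
    "14:20:00     INFO - TEST-OK | hypothetical/test_previous.html | took 120ms",
    "14:23:59     INFO - TEST-UNEXPECTED-FAIL | " + TEST + " | Buffer was the wrong size: 0x0 ",
    "14:29:09     INFO - TEST-UNEXPECTED-FAIL | " + TEST + " | Test timed out. ",
    "14:29:10     INFO - TEST-OK | " + TEST + " | took 311824ms",
]
failures, previous_test = summarize_failures(sample)
print(failures[TEST], previous_test[TEST])
```

Running this over several failing logs would show whether the same test always precedes the failure.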
Yes, this VM has 16GB of memory, but it is Win7 32-bit, so only 3.2GB is available.
Moreover, a 32-bit Windows app usually runs into memory problems when a single process uses more than about 1.8 GB of memory.

I think this also explains why the failure only happens in non-e10s mode.
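To put the numbers in this bug side by side, here is a small arithmetic sketch. It assumes the default 2 GiB user-mode address space for a 32-bit process on 32-bit Windows and a 4-bytes-per-pixel buffer at the 1680x1050 size reported by the test. The MEMORY STAT values come from a single log and were sampled after the test finished, so this is only a rough illustration, not a measurement of the state at the moment the allocation failed.

```python
MIB = 1024 * 1024

# Assumption: default 2 GiB user-mode address space for a 32-bit process
# on 32-bit Windows.
USER_ADDRESS_SPACE_MB = 2048

# Values from the MEMORY STAT line of the failing run.
vsize_mb = 1611
vsize_max_contiguous_mb = 95

# Assumption: one RGBA buffer at the reported canvas size, 4 bytes per pixel.
buffer_bytes = 1680 * 1050 * 4
buffer_mb = buffer_bytes / MIB

headroom_mb = USER_ADDRESS_SPACE_MB - vsize_mb
buffers_in_largest_block = int(vsize_max_contiguous_mb // buffer_mb)

print(f"address space left: {headroom_mb} MB")
print(f"one 1680x1050 RGBA buffer: {buffer_mb:.1f} MB")
print(f"buffers fitting in the largest contiguous block: {buffers_in_largest_block}")
```

The process is within a few hundred MB of the assumed limit, and the largest free block is a small fraction of that, which is consistent with allocations failing intermittently under fragmentation.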
Good point about 32-bit, I overlooked that. Do we get value from testing non-e10s mode? In 7 weeks (Firefox 57 on trunk) we will disable all non-e10s tests when there is an e10s version running, so in this case we will disable the non-e10s webgl tests. We could do this earlier :)
Actually, this is not the only one: there are other intermittent webgl test failures that happen only in win7-32bit non-e10s mode.

To avoid more oranges, maybe we can disable mochitest-gl for all win32 non-e10s runs?
that is very easy to do; I would like to hear from :milan on that before making a quick decision.
This patch adds some logging for the failures related to this test failure. I think it will help future diagnosis if a similar intermittent failure happens.
Comment on attachment 8879493 [details]
Bug 1298285 - Add Logs to diagnose GL context lost caused by swap failure;

https://reviewboard.mozilla.org/r/150800/#review155930

These are normal, and shouldn't bring down a debug build normally.
Attachment #8879493 - Flags: review?(jgilbert) → review-
Attached patch skip on win, non-e10s (obsolete) — Splinter Review
While we are waiting on the needinfo for the larger question of all mochitest-gl, let's skip this one test on win, non-e10s, since it fails so frequently.
Attachment #8886620 - Flags: review?(jmaher)
Attached patch skip on win, non-e10s (obsolete) — Splinter Review
Sorry, attached empty patch earlier.
Attachment #8886620 - Attachment is obsolete: true
Attachment #8886620 - Flags: review?(jmaher)
Attachment #8886621 - Flags: review?(jmaher)
Comment on attachment 8886621 [details] [diff] [review]
skip on win, non-e10s

tests are not running anymore on win7 non-e10s as per bug 1379868
Attachment #8886621 - Flags: review?(jmaher)
Attachment #8886621 - Attachment is obsolete: true
Whiteboard: [gfx-noted][stockwell needswork] → [gfx-noted][stockwell disabled]
Flags: needinfo?(milan)
Bulk priority update of open intermittent test failure bugs. 

P3 => P5

https://bugzilla.mozilla.org/show_bug.cgi?id=1381960
Priority: P3 → P5
https://wiki.mozilla.org/Bugmasters#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
See Also: → 1695503