Bug 845486 (Closed)
Opened 12 years ago · Closed 12 years ago
webgl conformance crashes frequently on ubuntu VMs
Categories: Core :: Graphics: CanvasWebGL, defect
Status: VERIFIED FIXED
Target Milestone: mozilla22
Tracking: firefox21: verified
People: Reporter: jmaher; Assigned: jmaher
References: Blocks 1 open bug
Attachments (2 files, 1 obsolete file):
patch, 2.72 KB
patch, 513 bytes (jgilbert: review+)
We are running the WebGL conformance mochitests on Ubuntu VMs (in our staging branch, https://tbpl.mozilla.org/?tree=Cedar). We have found that M1 is green about 70% of the time but orange the other 30% because Firefox crashes during the WebGL subharness.
These are virtual machines running llvmpipe (the Mesa software driver), with webgl.force-enabled=true set in the preferences for this platform.
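For reference, forcing WebGL on for llvmpipe amounts to a pref line like this (a sketch of the user.js form; how it is actually wired up on the slaves may differ):
user_pref("webgl.force-enabled", true);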
What can we do to debug this?
Are there problematic tests which leave us in a failure state?
Comment 1•12 years ago
The first step is running with MOZ_GL_DEBUG=1 in a debug build. Which version of llvmpipe are we using? We have blacklisted various versions of this driver for a number of reasons. Maybe bjacob remembers more.
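For example, against a local debug build something along these lines should reproduce it (the exact mach target and test directory are assumptions; adjust to the local setup):
MOZ_GL_DEBUG=1 ./mach mochitest-plain content/canvas/test/webgl/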
Flags: needinfo?(bjacob)
Assignee
Comment 2•12 years ago
Tomorrow I can get a debug build on a VM and run with MOZ_GL_DEBUG=1. That should reproduce the crash within a reasonable number of tries.
For llvmpipe, we have:
root@tst-jmaher-ubuntu64-009:/home/cltbld# dpkg -l | grep llvm
ii libllvm2.9 2.9+dfsg-3ubuntu4 Low-Level Virtual Machine (LLVM), runtime library
ii libllvm3.0 3.0-4ubuntu1 Low-Level Virtual Machine (LLVM), runtime library
ii libllvm3.0:i386 3.0-4ubuntu1 Low-Level Virtual Machine (LLVM), runtime library
ii llvm 2.9-7 Low-Level Virtual Machine (LLVM)
ii llvm-2.9 2.9+dfsg-3ubuntu4 Low-Level Virtual Machine (LLVM)
ii llvm-2.9-dev 2.9+dfsg-3ubuntu4 Low-Level Virtual Machine (LLVM), libraries and headers
ii llvm-2.9-runtime 2.9+dfsg-3ubuntu4 Low-Level Virtual Machine (LLVM), bytecode interpreter
ii llvm-dev 2.9-7 Low-Level Virtual Machine (LLVM), libraries and headers
ii llvm-runtime 2.9-7 Low-Level Virtual Machine (LLVM), bytecode interpreter
root@tst-jmaher-ubuntu64-009:/home/cltbld#
Is there something else we should run or query to get accurate llvmpipe information?
Comment 3•12 years ago
You want to check the Mesa version, since the llvmpipe backend is part of that.
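For example (assuming glxinfo from mesa-utils is installed on the slave):
glxinfo | grep -i 'opengl version'
or, via the package manager:
dpkg -l | grep -i mesa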
Assignee
Comment 4•12 years ago
to confirm:
"MOZ_GL_DEBUG=1" is something that is set during buildtime in .mozconfig?
llvmpipe backend version is found in about:support?
Assignee
Comment 5•12 years ago
from about:support:
Adapter Description VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 0x300)
Device ID Gallium 0.4 on llvmpipe (LLVM 0x300)
Driver Version 2.1 Mesa 8.0.4
GPU Accelerated Windows 0/1 Basic Blocked for your graphics driver version.
Vendor ID VMware, Inc.
WebGL Renderer VMware, Inc. -- Gallium 0.4 on llvmpipe (LLVM 0x300)
AzureCanvasBackend Cairo
AzureContentBackend none
AzureFallbackCanvasBackend none
Comment 6•12 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> to confirm:
> "MOZ_GL_DEBUG=1" is something that is set during buildtime in .mozconfig?
> llvmpipe backend version is found in about:support?
MOZ_GL_DEBUG=1 is a runtime env var.
Comment 8•12 years ago
llvmpipe was blacklisted for security bugs that are not triggered in the version of the test suite that we have in the tree, so it shouldn't crash here. As jgilbert says, you want MOZ_GL_DEBUG=1 so that stack traces will be reliable. Then, looking at a stack, we should know more. Also, if you catch this in GDB, please do
(gdb) call DumpJSStack()
so we know where it is exactly in JS code.
Also, please give the last ~100 lines of output before the crash. This should at least identify the crashing WebGL test, if we can't have a JS stack.
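Roughly, a session would look like this (the binary path and profile name are just placeholders):
gdb --args objdir/dist/bin/firefox -P webgl-debug -no-remote
(gdb) run
... wait for the crash ...
(gdb) bt
(gdb) call DumpJSStack()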
Assignee
Comment 9•12 years ago
Comment 10•12 years ago
Any chance of getting a crashdump with symbols?
Assignee
Comment 11•12 years ago
I am not sure how to do that. I downloaded the symbols for the build and pointed the harness at them.
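"Pointing the harness at it" here means something like the following (the symbols path is a placeholder and the exact invocation is an assumption):
python runtests.py --symbols-path=path/to/crashreporter-symbols.zip ...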
Comment 12•12 years ago
Not too sure for the general case. You could run GDB against it and attach the output from `bt` after the crash.
Assignee
Comment 13•12 years ago
This should give us real information!
Attachment #719282 - Attachment is obsolete: true
Assignee
Comment 14•12 years ago
Does this look like enough information? I would like to keep this bug moving, but I want to make sure I have provided enough before I move on to a couple of other bugs.
Comment 15•12 years ago
Yeah, this is definitely a Mesa bug. It's in a very normal-looking DrawArrays call, something like glDrawArrays(GL_TRIANGLES, 0, 6), I think. The issue is probably buried in the state used for the draw call.
I don't know if I have enough bandwidth to address this. It could be something we can work around, but I don't have high hopes of avoiding it from our end. I could be wrong; only root-causing it will tell us.
Assignee
Comment 16•12 years ago
glad this is helpful, and I understand the fix would take a lot of extra time. For us to get going could we:
* disable the webgl tests in general?
* disable a test or small subset of conformance tests?
* maybe run the tests on hardware only?
I see these tests failing a lot on windows (with hardware), so there could be other issues or maybe a common issue.
Any suggestions on how we could move forward so we can get more of our testing on a scalable platform would be much appreciated!
Comment 17•12 years ago
(In reply to Joel Maher (:jmaher) from comment #16)
> glad this is helpful, and I understand the fix would take a lot of extra
> time. For us to get going could we:
> * disable the webgl tests in general?
> * disable a test or small subset of conformance tests?
> * maybe run the tests on hardware only?
>
> I see these tests failing a lot on windows (with hardware), so there could
> be other issues or maybe a common issue.
>
> Any suggestions on how we could move forward so we can get more of our
> testing on a scalable platform would be much appreciated!
We can't drop any coverage for these tests. (We're working on expanding the tests here.) As long as we have these tests running somewhere, whether on hardware or in software, we're fine.
These tests don't seem to cause any issues on well-behaved hardware and drivers, though our test slaves never seem to fall into that category.
Assignee
Comment 18•12 years ago
By skipping more tests, I am able to get M1 to run 15+ times in a row with no crash, whereas before I averaged 2 crashes in 5 runs.
We are only skipping one test for linux:
http://dxr.mozilla.org/mozilla-central/content/canvas/test/webgl/skipped_tests_linux_mesa.txt
conformance/misc/type-conversion-test.html
and we have 4 tests marked as failing:
http://dxr.mozilla.org/mozilla-central/content/canvas/test/webgl/failing_tests_linux_mesa.txt
conformance/textures/texture-mips.html
conformance/textures/texture-size-cube-maps.html
conformance/extensions/oes-texture-float.html
conformance/glsl/functions/glsl-function-sin.html
Looking at linux in general, we have never skipped a test before, but we have plenty of failed tests for Fedora:
http://dxr.mozilla.org/mozilla-central/content/canvas/test/webgl/failing_tests_linux.txt
conformance/misc/uninitialized-test.html
conformance/programs/gl-get-active-attribute.html
conformance/textures/texture-mips.html
conformance/uniforms/gl-uniform-bool.html
conformance/renderbuffers/framebuffer-object-attachment.html
I am working on the minimal set of tests we can skip in order to run successfully. I started by adding the ones we skip for Android, which ran successfully; I then trimmed that list in half and got 17 green runs in a row overnight with this set of skipped tests:
conformance/misc/type-conversion-test.html
conformance/glsl/functions/glsl-function-normalize.html
conformance/glsl/functions/glsl-function-reflect.html
conformance/glsl/functions/glsl-function-sign.html
conformance/glsl/functions/glsl-function-smoothstep-float.html
conformance/glsl/functions/glsl-function-smoothstep-gentype.html
conformance/glsl/functions/glsl-function-step-float.html
conformance/glsl/functions/glsl-function-step-gentype.html
conformance/more/conformance/quickCheckAPI-B2.html
conformance/programs/gl-getshadersource.html
conformance/reading/read-pixels-test.html
conformance/textures/gl-teximage.html
conformance/textures/tex-image-and-sub-image-2d-with-image.html
My goal is to cut this list in half again, if not down to fewer than 4 skipped tests. If anyone has ideas about which tests in the list above could be problematic, please speak up.
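For the curious, the repeated runs are just a shell loop along these lines (the exact harness invocation is an assumption):
for i in $(seq 1 17); do
  ./mach mochitest-plain content/canvas/test/webgl/ 2>&1 | tee webgl-run-$i.log
done
grep -l PROCESS-CRASH webgl-run-*.log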
Assignee
Comment 19•12 years ago
In further testing, I narrowed the field down and verified that adding only one test to the skip list lets us run for 20 cycles with no failures.
Tested on 32-bit and 64-bit.
With this patch, we can now move the mochitest 1-5 jobs to Amazon EC2, which will have a major impact on our push turnaround time.
I assume somebody could look at read-pixels-test.html and figure out why that backtrace is happening. The root cause might be similar to whatever is behind the Windows crashes in the WebGL tests as well.
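To be concrete, the patch presumably boils down to one extra line in the Mesa skip list, so skipped_tests_linux_mesa.txt would read roughly:
conformance/misc/type-conversion-test.html
conformance/reading/read-pixels-test.html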
Comment 20•12 years ago
This might be related to a bug we're hitting in some emscripten demos, so I'll take a look. This is one test I'd really prefer not to disable, but we can probably live with it if we can get a solution soon (and if the test still runs on other platforms).
Updated•12 years ago
Attachment #722156 - Flags: review?(jgilbert) → review+
Assignee
Comment 21•12 years ago
Thanks for the review. These tests will still run on hardware for Windows and OS X; if we want to try fixes for this, I would be happy to help test them. For now, I will land this and work on migrating our mochitests over to Ubuntu.
Comment 22•12 years ago
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Assignee
Comment 23•12 years ago
landed on aurora:
https://hg.mozilla.org/releases/mozilla-aurora/rev/7c1795414b75
Comment 24•12 years ago
How are things looking now, Joel? Can we call this verified fixed?
Assignee
Updated•12 years ago
Status: RESOLVED → VERIFIED
Updated•12 years ago
status-firefox21: --- → fixed
Comment 25•12 years ago
Joel, can we call this verified for Firefox 21 as well?
Assignee
Comment 26•12 years ago
yeah, I am not sure how to do that.
Comment 27•12 years ago
Do the conformance tests not run against mozilla-aurora?
Assignee
Comment 28•12 years ago
Yes, we run them on Ubuntu VMs on Aurora:
https://tbpl.mozilla.org/php/getParsedLog.php?id=20662814&tree=Mozilla-Aurora&full=1
Comment 29•12 years ago
Thanks, Joel. Given that no errors or warnings are listed, I'm going to mark this verified fixed for Firefox 21.