Closed Bug 978966 Opened 10 years ago Closed 10 years ago

964 WebGL conformance test failures in mochitest-plain1 on Windows Server 2012 instances

Categories

(Core :: Graphics: CanvasWebGL, defect)

Hardware: x86
OS: Other
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla34

People

(Reporter: bugzilla, Assigned: bjacob)

References

Details

(Whiteboard: webgl-correctness)

Attachments

(4 files)

Modulo other problems caused by configuration issues, there are still 964 failures in mochitest-plain1 in the WebGL conformance test suite.
Attached file Error log
Component: Platform Support → Canvas: WebGL
Product: Release Engineering → Core
QA Contact: coop
Hardware: x86_64 → x86
Version: other → Trunk
These tests pass on regular Windows machines, but fail on the server, right?
Correct.
Note that the Windows Server 2012 test environment is set up such that the test session is running with the RemoteFX virtual GPU.
Nothing subtle about these differences.
Here's the actual WebGL error list from this run.  Looking at each individual failure, only 34 test pages actually have failures in them (some just report many, probably from the same underlying cause).

On the VMs, these are using the Windows Server virtual GPU.  Hopefully this is something that's fixable, as the virtual GPU seemed to be enough to run things like Epic Citadel just fine.
How do we get access to the virtual GPU for testing? I'd like to see screenshots of actual vs. expected at a minimum.
If you open the error log (attachment) you'll see URLs in it that say something like "paste this into your browser address to see results".  So, you can at least see what that run looked like.
Aaron, can you set up Dan with a VM on AWS?  This kind of thing should be debuggable without any of the actual testing infra, since it should reproduce by just starting up firefox in the VM and running the webgl tests.  If that's difficult, I can give Dan an instance on my AWS account.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #9)
> Aaron, can you set up Dan with a VM on AWS?  This kind of thing should be
> debuggable without any of the actual testing infra, since it should
> reproduce by just starting up firefox in the VM and running the webgl tests.
> If that's difficult, I can give Dan an instance on my AWS account.

John should be able to help out with that.
Flags: needinfo?(jhopkins)
We have two configurations of Windows Server 2012 VMs in use at the moment:

1) continuous integration VMs for testing the Date branch builds.
2) VMs that run tests against a manually specified build upon bootup and output the test logs locally.

Both of these run the tests in an RDP environment to give us the proper graphical context.

What we can't do is simply RDP to the VM and run tests manually, because then we won't have the right graphical context.  For example, the RDP server may use our local graphics card for acceleration instead of the RemoteFX-based acceleration.

> This kind of thing should be debuggable without any of the actual testing infra, since it should reproduce by just starting up firefox in the VM and running the webgl tests.

For the reasons above, I don't think this will work.  Please let me know if I've missed something.
Flags: needinfo?(jhopkins)
(In reply to John Hopkins (:jhopkins) from comment #11)
> > This kind of thing should be debuggable without any of the actual testing infra, since it should reproduce by just starting up firefox in the VM and running the webgl tests.
> 
> For the reasons above, I don't think this will work.  Please let me know if
> I've missed something.

Yes -- when Dan RDPs in to the cltbld user, he should have the same graphics config that the VM has when the autologin user sets up the RDP session back to itself.  If that doesn't reproduce the problem, we can look for another solution, but for now a VM of the #2 style should work: just something he can RDP into, or even VNC if we want to be 100% sure it'll be identical.  I don't think we have VNC servers set up, though, so that can be plan B if the tests don't reproduce.
Whiteboard: webgl-correctness
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #12)
> Yes -- when Dan RDPs in to the cltbld user, he should have the same
> graphics config that the VM has when the autologin user sets up the RDP
> session back to itself.

AFAIK, only a single RDP connection is permitted at one time so that probably won't work.

> If that doesn't reproduce the problem, we can look for another solution,
> but for now a VM of the #2 style should work: just something he can RDP
> into, or even VNC if we want to be 100% sure it'll be identical.  I don't
> think we have VNC servers set up, though, so that can be plan B if the
> tests don't reproduce.

I tried pinging you on IRC earlier re: whether you self-served this request (per your email asking about the AMI).  Please let me know.  If not, I'll create a new instance for you Thursday morning (Eastern).
Flags: needinfo?(vladimir)
It doesn't need to be simultaneous; he just needs to be able to connect in.  Him RDP'ing should be fine.  I have not yet had a chance to get a VM for Dan -- GDC stuff is taking up too much time.  If you could spin him up a Windows VM and send him the credentials, that'd be great.  Thanks!
Flags: needinfo?(vladimir)
Depends on: 983196
Loan request fulfilled in bug 983196
Dan, have you made any progress here?
Flags: needinfo?(dglastonbury)
Hi, I'm taking this for a couple days to at least diagnose what we should do about this.

Can I get a VM loaned to me?
Assignee: nobody → bjacob
Flags: needinfo?(taras.mozilla)
No progress, it dropped off my radar. I see Benoit is picking it up.
Flags: needinfo?(dglastonbury)
Chris can hook you up (if he hasn't already)
Flags: needinfo?(taras.mozilla) → needinfo?(catlee)
Thanks Chris, I am now reproducing this on the VM.

It's very strange. On the VM, running individual test pages from our content/canvas/test directory, I can reproduce the failures; but when I run the upstream 1.0.1 tests from Khronos' server, most of them pass -- even tests that are identical to the version we have in our tree. Investigating.
Flags: needinfo?(catlee)
Figured it out. The problem on these VMs is that the MSAA (multisample antialiasing) implementation is broken: it behaves as if blending were always enabled with blendFunc(SRC_ALPHA, ONE_MINUS_SRC_ALPHA). Indeed, disabling blending or trying to change the blendFunc has no effect at all, as long as MSAA is used.

For example, in the gl-clear.html test, we get failures like this:

PASS should be 0,0,0,255
FAIL should be 128,128,128,192
at (0, 0) expected: 128,128,128,192 was 32,32,32,239

The value 32,32,32,239 is what we would get if blending were enabled, but the test disables it; on this driver, that disabling has no effect. I tried keeping blending enabled and using blendFunc(ONE, ZERO) to simulate no blending, but again the blendFunc call had no effect. Then I suspected a bug in the driver's framebuffer operations, so I disabled MSAA, and then everything worked fine.
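
Here is a minimal sketch of the kind of check at play (simplified, not the exact conformance-test code; the draw call is elided):

// Clear to black, disable blending, then draw a half-transparent grey quad.
// With a correct driver, readPixels returns the raw fragment color; on this
// driver, with MSAA enabled, we get a blended value back instead.
var canvas = document.createElement("canvas");
var gl = canvas.getContext("webgl");   // antialias defaults to true
gl.clearColor(0, 0, 0, 1);
gl.clear(gl.COLOR_BUFFER_BIT);
gl.disable(gl.BLEND);                  // should make blend state irrelevant
// ... draw a quad with color (0.5, 0.5, 0.5, 0.75) here ...
var pixel = new Uint8Array(4);
gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
// Expected roughly [128, 128, 128, 192]; the broken driver hands back the
// blended value instead (e.g. 32,32,32,239 in gl-clear.html).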

That gives us a very easy way out: in WebGL, antialiasing is not mandatory. So let's disable antialiasing on this driver.
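
For reference, a page can already opt out of antialiasing via context creation attributes; what we'd do here is force the same thing in Gecko for this driver, so the tests themselves don't need to change:

// A context created with antialias: false never uses the broken MSAA path.
// The attribute is only a request, so getContextAttributes() reports what
// the implementation actually gave us.
var gl = canvas.getContext("webgl", { antialias: false });
if (gl) {
  console.log(gl.getContextAttributes().antialias); // false
}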
Disabling antialiasing in these WebGL conformance tests also seems not too bad from the perspective of lost test coverage. Antialiasing is orthogonal to the WebGL state-machine behavior that these mochitests primarily intend to cover. It still works when the default blending operations are used, so we'll still be able to exercise antialiasing in Talos and reftests.
After discussing this with Vlad:

Tweaking the WebGL conformance tests specifically to avoid antialiasing would be a non-upstreamable departure from Khronos tests, not something that we'd be thrilled to do.

Instead we probably need to accept that since antialiasing is broken on this driver, we must just avoid AA unconditionally there.

If we could agree to keep WebGL reftests and Talos tests running on real graphics hardware, where they would get antialiasing, that would make it much more acceptable to have just the mochitests run without antialiasing: again, antialiasing is not so vital to the mochitests, but it's much more important in reftests (compositing correctness) and Talos (compositing performance).
For my reference - the GL_RENDERER string that we need to check for:

Microsoft Basic Render Driver Direct3D9Ex vs_3_0 ps_3_0

Exposed by ANGLE as

ANGLE (Microsoft Basic Render Driver Direct3D9Ex vs_3_0 ps_3_0)

I guess we should match a substring.
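
The actual check belongs in Gecko's C++ WebGL context setup, but as an illustration, the same substring match written against the renderer string as content JS could see it (via the WEBGL_debug_renderer_info extension, where available) would look like this -- illustration only, not the patch:

// Illustration only -- the real fix does this check from C++ when creating
// the WebGL context.
function isBasicRenderDriver(gl) {
  var ext = gl.getExtension("WEBGL_debug_renderer_info");
  if (!ext) {
    return false;
  }
  var renderer = gl.getParameter(ext.UNMASKED_RENDERER_WEBGL);
  // Match a substring: ANGLE wraps the D3D renderer name, e.g.
  // "ANGLE (Microsoft Basic Render Driver Direct3D9Ex vs_3_0 ps_3_0)".
  return renderer.indexOf("Microsoft Basic Render Driver") !== -1;
}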
Attachment #8461773 - Flags: review?(jgilbert)
Comment on attachment 8461773 [details] [diff] [review]
no-multisample-in-redmond

Whoever gets this first.
Attachment #8461773 - Flags: review?(dglastonbury)
Comment on attachment 8461773 [details] [diff] [review]
no-multisample-in-redmond

Review of attachment 8461773 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8461773 - Flags: review?(jgilbert)
Attachment #8461773 - Flags: review?(dglastonbury)
Attachment #8461773 - Flags: review+
Benoit, try run?
Land or do not land, there is no try.

https://hg.mozilla.org/integration/mozilla-inbound/rev/fed0f0f3cb1c

We can leave this open, though, until Aaron confirms that it's fixed.

Chris, I probably don't need my VM anymore.
Flags: needinfo?(catlee)
Flags: needinfo?(aklotz)
Whiteboard: webgl-correctness → webgl-correctness [leave open]
It's WAY better, but there's still quite a bit of stuff in here:
https://tbpl.mozilla.org/php/getParsedLog.php?id=44670275&tree=Date&full=1

Any ideas?
Flags: needinfo?(aklotz)
Great, so now we're down to just 4 WebGL test pages failing:

conformance/context/context-attributes-alpha-depth-stencil-antialias.html
conformance/renderbuffers/framebuffer-object-attachment.html
conformance/state/gl-object-get-calls.html
conformance/more/functions/isTests.html

(Somehow these didn't fail when I ran the upstream 1.0.1 tests on the VM, but our mochitests' copy differs substantially from upstream).

You just need to add them (copy and paste the above 4 lines) into this file:

http://hg.mozilla.org/mozilla-central/file/75fe3b8f592c/dom/canvas/test/webgl-conformance/failing_tests_windows.txt

It may look confusing that this one currently happens to be empty (i.e. the existing Windows slaves run these tests without any failures), so here is an example of a non-empty file of the same kind:

http://hg.mozilla.org/mozilla-central/file/75fe3b8f592c/dom/canvas/test/webgl-conformance/failing_tests_android.txt
Note: I'd gladly write that (4-line) patch, but we can't land it as long as we're still running the current Windows slaves. What we could do is have two separate files, failing_tests_windows_old.txt and failing_tests_windows_new_vms.txt, add these 4 lines to the latter only, and add some code in the mochitest to switch between the two. For example, here is the current code that we use to switch between the Android failures files:

http://hg.mozilla.org/mozilla-central/file/75fe3b8f592c/dom/canvas/test/webgl-conformance/test_webgl_conformance_test_suite.html#l525

...hm, OK, let me take a stab at writing that patch.
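
Roughly, the switch in the test harness could look like the sketch below, modeled on the Android case linked above (the helper name and the renderer-string detection of the new VMs are placeholders, not the landed patch):

// Sketch only: pick a failures file per platform flavor, modeled on the
// existing Android switch in test_webgl_conformance_test_suite.html.
function failingTestsFilename(platform, renderer) {
  if (platform == "windows") {
    // The new WS2012 VMs expose the Microsoft Basic Render Driver via ANGLE.
    if (renderer.indexOf("Microsoft Basic Render Driver") != -1) {
      return "failing_tests_windows_new_vms.txt";
    }
    return "failing_tests_windows_old.txt";
  }
  return "failing_tests_" + platform + ".txt";
}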
Attachment #8463612 - Flags: review?(dglastonbury) → review+
The previous try run had the wrong trychooser command, "windows" instead of "win32".

New try run - it also includes win64, in the hope that it would run on the new WS2012 VMs, but I'm told that's not the case -

https://tbpl.mozilla.org/?tree=Try&rev=86b3fdc7a864
https://hg.mozilla.org/integration/mozilla-inbound/rev/d240749902d7

Aaron, how is it looking now?
Flags: needinfo?(aklotz)
Looks good! No more WebGL conformance test failures on M1!
https://tbpl.mozilla.org/?tree=Date&rev=10bd24ec3f55
Flags: needinfo?(aklotz)
Excellent! Let's close this bug, then.

Chris, I definitely don't need the VM anymore. Thanks for the help!
Whiteboard: webgl-correctness [leave open] → webgl-correctness
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Ok, deleted the instance. Thanks!
Flags: needinfo?(catlee)
Target Milestone: --- → mozilla34
QA Whiteboard: [qa-]