Closed Bug 701269 Opened 9 years ago Closed 9 years ago

WebGL / Canvas displaying uninitialized video memory.

Categories

(Core :: Canvas: WebGL, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla11
Tracking Status
firefox8 --- wontfix
firefox9 --- wontfix
firefox10 + wontfix
firefox11 + fixed
firefox-esr10 11+ fixed
status1.9.2 --- unaffected

People

(Reporter: mbx, Assigned: jgilbert)

References

Details

(Whiteboard: [sg:vector-high][qa-])

Attachments

(5 files, 7 obsolete files)

106.79 KB, application/zip
Details
564 bytes, text/html
Details
7.87 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
7.46 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
7.46 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
Setting the canvas width and height properties AFTER you get an webgl context can lead to reading of arbitrary video data, desktop images, other tabs, etc. This seems to occur on MacBook Airs, and not on MacBook Pros. This occurs in Firefox 6.0.2, as well as Nightly.

Machine
=======

  Model Name:	MacBook Air
  Model Identifier:	MacBookAir4,2
  Processor Name:	Intel Core i7
  Processor Speed:	1.8 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	4 MB
  Memory:	4 GB
  Boot ROM Version:	MBA41.0077.B0E
  SMC Version (system):	1.73f63

about:support
=============
 
Adapter Description: 0x24300,0x20400
WebGL Renderer: Intel Inc. -- Intel HD Graphics 3000 OpenGL Engine -- 2.1 APPLE-7.12.9GPU 
Accelerated Windows: 1/1 OpenGL
Whiteboard: sg:high
This seems to occur only on some hardware. Setting sg:high for now (information leak).
This is not the same bug as bug 684882, but that one also was specific to Intel GPUs on Mac OSX, with similar consequences.
The reason why it only happens on MacBook Airs is that we use the discrete (non-Intel) GPU for WebGL when there is one. MacBook Airs are special in that they only have Intel integrated graphics.

What is your Mac OS X version?
I'm running, Mac OS X Lion 10.7.2 (11C74)
Does the problem go away if you set webgl.msaa-level=0 ?   (in about:config)

Does it go away if you set layers.acceleration.disabled=true and restart Firefox?

Is the size 512 important in your testcase? Is there a minimal size under which the problem doesn't happen?

Note that this should be reproducible on other Macs by using the gfxCardStatus utility to force Firefox to use Intel integrated graphics.
I added webgl.msaa-level = 0 to about:config since it wasn't already there and the problem persists.

I set layers.acceleration.disabled=true and restarted Firefox and the problem still persists.

512 is not that important, the lowest I was able to set it, and yet reproduce the bug was 300. As you increase the size you can see more. I'm guessing that when using larger sizes it's more likely to encounter stretches of non zero memory.
Chrome doesn't seem to suffer from this bug. It's possible that it might be a bug on our side. investigating.
The bug is Mac-specific (but could well be a bug in our code).

The testcase results in GLContext::ResizeOffscreenFBO being called 3 times (yes, we need to make that lazy). We check on Jeff M's Mac and it didn't call ClearSafely() everytime.

But here on Linux, I do get ClearSafely called everytime by ResizeOffscreenFBO.

We need to step though ResizeOffscreenFBO carefully on a Mac to figure where we're returning early, before the ClearSafely call.

At a more meta-level, we need to make this code less prone to forgetting to ClearSafely.
Can RAII help here?
(In reply to Andreas Gal :gal from comment #10)
> Can RAII help here?

Probably, yes, since the problem is that we need to ensure that ClearSafely gets called even if we return early.

I would still like to understand where exactly things go wrong on Mac.
I can't reproduce this on a Nightly anymore. We should get a range on when it was fixed.
Michael, does this still happen for you with a nightly?
This was "fixed" because antialaising is now used on OS X. Forcing msaa-level to 0 causes the bug to show up again.
We don't call ClearSafely in ResizeOffscreenFBO because we take the following condition:

    if (!useDrawMSFBO && !aUseReadFBO)
        return true;
I don't understand that early return. Even if we don't have a MS draw FBO and don't have a read FBO, we still have a non-MS draw FBO to resize. So why return early here?

This is definitely the early return that Jeff was hitting. He had to set msaa-level=0 to be able to reproduce since AA has been turned on on Mac.

Side question: I couldn't reproduce this on Linux as there, aUseReadFBO is true (while it's false on Mac with AA disabled). Why is that?
Attachment #574094 - Flags: review?(jgilbert)
(In reply to Benoit Jacob [:bjacob] from comment #16)
> Created attachment 574094 [details] [diff] [review] [diff] [details] [review]
> remove early return in ResizeOffscreenFBO that we're hitting, and that
> doesn't seem to make sense
> 
> I don't understand that early return. Even if we don't have a MS draw FBO
> and don't have a read FBO, we still have a non-MS draw FBO to resize. So why
> return early here?
> 
> This is definitely the early return that Jeff was hitting. He had to set
> msaa-level=0 to be able to reproduce since AA has been turned on on Mac.
> 
> Side question: I couldn't reproduce this on Linux as there, aUseReadFBO is
> true (while it's false on Mac with AA disabled). Why is that?

This patch fixes it for me.
Thanks for the quick fix guys.
This isn't the proper fix, since ResizeOffscreenFBO wasn't even called for all paths until AA hit. For paths using backing PBuffers, for example, we never called it, so we had to make sure we were clearing the buffer elsewhere.

It looks like we are just not clearing the buffer on Mac w/PBuffers.
Comment on attachment 574094 [details] [diff] [review]
remove early return in ResizeOffscreenFBO that we're hitting, and that doesn't seem to make sense

Uploading a new fix shortly.
Attachment #574094 - Flags: review?(jgilbert) → review-
Try: https://tbpl.mozilla.org/?tree=Try&rev=de8f0e15f875
Assignee: nobody → jgilbert
Attachment #574094 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #574213 - Flags: review?(bjacob)
Attachment #574213 - Flags: review?(bjacob) → review+
Looks like it passes try.
Can someone on Mac replicate this bug before applying the fix, but not after?
I updated to the latest Nightly (11.0a1 (2011-11-14)) and the bug is still there.
Added MakeCurrent call before clearing after resize.
Attachment #574213 - Attachment is obsolete: true
This is fixed if we use the hidden pref cgl.prefer-fbo set to true. (default is false) This must be a problem with the PBuffers path in CGL.
What's the first version which this fails on?
We can simply fix 10 by using cgl.prefer-fbo until we can track down the elusive problem.
(In reply to Jeff Gilbert [:jgilbert] from comment #27)
> What's the first version which this fails on?
> We can simply fix 10 by using cgl.prefer-fbo until we can track down the
> elusive problem.

What's your feeling about this: is the CGL/FBO path in good enough shape for this despite not having been default until now? if we switch, we should do it ASAP to maximize testing.
I do not know. Can someone with a mac compare mochitest outputs with cgl.prefer-fbo? This is currently a pretty big problem. It may be worth switching paths until this is solved, even if we take a small conformance hit.
Just make a tryserver push?
Whiteboard: sg:high → [sg:high]
I also got this behavior with todays nightly on my MacBook Pro.
This website shows me uninitialized data at the beginning:
http://alteredqualia.com/three/examples/webgl_terrain_dynamic.html
According to Jeff M in comment 17, my patch (now r-'d and obsoleted) fixed (the effects of) this bug even if it wasn't the correct solution. Shouldn't we just take it? It just removes 2 lines that we agreed were wrong. So actually I'm not sure i understand the r- on it.
The code after the early exit relies upon the assumption that the conditions of the early exit are false. That is, the code is not designed to work without the early-out. Resize code in GLContext should not be responsible for clearing the buffers, because that's only needed by WebGL, and actually easier to assure if we deliberately clear the buffer in WebGL.

I think the reason the deguarantee patch was still failing was because the read FBO isn't blitted to before we use it as a texture for drawing. Either ClearSafely() should force-blit the FBOs, or layers code must guarantee that it will blit the buffers before using the texture. The latter seems the more proper solution, to me.
OK. Please continue driving this i.e. talk with Layers people as needed; we need to have this bug fixed ASAP.
Matt Woodrow's out most of this week. Who else knows GLContext and layers?

Is there a GLContext method the layer system needs to call before every render of a WebGL CanvasLayer? What is it? E.g., what is BasicCanvasLayer::UpdateSurface not doing that it should be doing?
Updated to assure that buffers are cleared before using the texture by adding GLContext::ReadyOffscreenTexture.

Testing via try at: https://tbpl.mozilla.org/?tree=Try&rev=f699ee432c35

I will be able to run tests/debug locally on mac and linux tomorrow.
Attachment #574403 - Attachment is obsolete: true
Attachment #575817 - Flags: review?(bjacob)
Comment on attachment 575817 [details] [diff] [review]
Deguarantee ResizeOffscreenFBO clearing its buffers, and adding clears in WebGLContext accordingly

Assumes that glFlush is sufficient to sync buffers across contexts.
Attachment #575817 - Flags: review?(bjacob) → review-
Updated to use OpenGL spec-guaranteed glFinish for syncing textures between GLContext and Layers in D3D10 and OGL.
Attachment #575817 - Attachment is obsolete: true
Attachment #575827 - Flags: review?(bjacob)
I'll also note here that I cannot reproduce this on my Mac:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0a1) Gecko/20111121 Firefox/11.0a1
ATI Technologies Inc. -- AMD Radeon HD 6630M OpenGL Engine -- 2.1 ATI-7.12.9
It must be using a path I'm not expecting, investigating...
Does it also replicate without errors on the debug builds from that try batch?
Same behavior in the debug builds.
Ok, until I can replicate this and fix it properly, we should probably hack around this and just manually blit the buffers in WebGLContext::SetDimensions.

Note that there still exists a bug where we are not correctly resolving the GLContext before drawing it out with OGL Layers.
Fixed forgetting to force the resolve after the blit.

https://tbpl.mozilla.org/?tree=Try&rev=6089a0a4ff7b
Attachment #576058 - Attachment is obsolete: true
Attachment #576058 - Flags: review?(bjacob)
Attachment #576058 - Flags: feedback?(mbebenita)
Attachment #576063 - Flags: review?(bjacob)
Attachment #576063 - Flags: feedback?(mbebenita)
While this should be fixed anyways, bug 702058 switches to using FBOs instead of PBuffers on Mac, which should hide this bug.
Comment on attachment 576063 [details] [diff] [review]
Deguarantee ResizeOffscreenFBO clearing its buffers, and adding clears in WebGLContext accordingly

Review of attachment 576063 [details] [diff] [review]:
-----------------------------------------------------------------

Just some nits, need some explanations.

::: content/canvas/src/WebGLContext.cpp
@@ +545,5 @@
>          // everything's good, we're done here
>          mWidth = gl->OffscreenActualSize().width;
>          mHeight = gl->OffscreenActualSize().height;
> +
> +        gl->ClearSafely();

Since we must not forget to call ClearSafely in SetDimensions, and I don't see a very good way to ensure it's always called (since it should not be called in error cases), I would really, really like a unit test (can take the testcase of this bug as a starting point).

::: gfx/gl/GLContext.cpp
@@ +1387,1 @@
>          fViewport(viewport[0], viewport[1], viewport[2], viewport[3]);

I don't get that. Why are we still playing at all with the viewport here? I thought that was only needed a long time ago when ClearSafely didn't take care of it. It now does, so this viewport business should have been useless for a while. Moreover I don't understand why firstTime matters here.

::: gfx/gl/GLContext.h
@@ +1010,5 @@
>  
> +    // Before reads from offscreen texture
> +    // Resolve contents in this context, before we resume work with the contents on 'other' context
> +    // If other is null, block until new commands will only be executed after the resolve. (Usually via glFinish)
> +    void ResolveBeforeResumingOther(GLContext* other = nsnull) {

As discussed, I would mimic the ARB_sync API here rather than coming up with a custom function name. Ideally it would be neat if we could implement the ARB_sync API with a fallback to using glFinish (and maybe, later, some mutex locking) when ARB_sync is not available.
Attachment #576063 - Flags: review?(bjacob) → review-
Depends on: 705024
Attachment #576063 - Flags: feedback?(mbebenita)
Although I cannot reproduce it, it seems that a similar problem persists when waking up the laptop from sleep mode. I'm guessing it may have something to do with a loss of the GL context.
Blocks: 711642
Attachment #582419 - Flags: review?(bjacob) → review+
(In reply to Michael Bebenita from comment #50)
> Although I cannot reproduce it, it seems that a similar problem persists
> when waking up the laptop from sleep mode. I'm guessing it may have
> something to do with a loss of the GL context.

Are you seeing this only in WebGL contexts, or for the whole browser?
(In reply to Jeff Gilbert [:jgilbert] from comment #53)
> (In reply to Michael Bebenita from comment #50)
> > Although I cannot reproduce it, it seems that a similar problem persists
> > when waking up the laptop from sleep mode. I'm guessing it may have
> > something to do with a loss of the GL context.
> 
> Are you seeing this only in WebGL contexts, or for the whole browser?

It was only in the WebGL context.
Target Milestone: mozilla11 → ---
inbound/m-c merge by Ed Moreley
https://hg.mozilla.org/mozilla-central/rev/91141088eb7d
Target Milestone: --- → mozilla11
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
From a quick read this sounds like a Mac OS/driver bug we're working around rather than a bug in our code per se. If that's wrong please change the whiteboard back to [sg:high].

Please create a patch to land on Firefox 10 and request approval-mozilla-beta, or attach the request to the existing patch if it works land as-is.
Whiteboard: [sg:high] → [sg:vector-high]
Attachment #589258 - Flags: review?(bjacob) → review+
Comment on attachment 589258 [details] [diff] [review]
Deguarantee ResizeOffscreenFBO clearing its buffers, and adding clears in WebGLContext accordingly (fx10)

[Approval Request Comment]
Regression caused by (bug #): This bug.
User impact if declined: Security concerns noted above.
Testing completed (on m-c, etc.): Was landed on MC in Fx11, now in Aurora.
Risk to taking this patch (and alternatives if risky): If there is anyone relying on GLContext resizes clearing its buffers (besides WebGL, as fixed in this patch), they may receive uninitialized data instead of cleared buffers.
Attachment #589258 - Flags: approval-mozilla-beta?
Try attempt here:
https://tbpl.mozilla.org/?tree=Try&rev=b60e9f97915f

Passing reftests, failing(?) talos, burning on android.
Comment on attachment 589258 [details] [diff] [review]
Deguarantee ResizeOffscreenFBO clearing its buffers, and adding clears in WebGLContext accordingly (fx10)

[Triage Comment]
Please re-set approval-mozilla-beta flag once try builds are in order and this is ready to land.
Attachment #589258 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Null try run from beta:
https://tbpl.mozilla.org/?tree=Try&rev=a43849b2a5f0

Preliminarily indicates that fails encountered on previous try build were to be expected when trying from beta.

Android appears to be burning because the makefile/mozconfig requirements changed. Talos appears to be failing because it's hitting a 404 when requesting 'http://hg.mozilla.org/try/raw-file/b60e9f97915f/testing/talos/talos.json'. 

I can spend time trying to fix these to get a clean try run, but effectively the same code has passed try and landed in Fx11. Is it inviable to land it on Beta while being prepared to back it out immediately?
Whiteboard: [sg:vector-high] → [sg:vector-high][qa+]
Having problems reproducing original using identical Macbook Air with 2011-11-09 Nightly build with the included testcase.

Info:

Build identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0a1) Gecko/20111109 Firefox/11.0a1

Machine:

  Model Name:	MacBook Air
  Model Identifier:	MacBookAir4,2
  Processor Name:	Intel Core i7
  Processor Speed:	1.8 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	4 MB
  Memory:	4 GB
  Boot ROM Version:	MBA41.0077.B0E
  SMC Version (system):	1.73f63

Display:

  Chipset Model:	Intel HD Graphics 3000
  Type:	GPU
  Bus:	Built-In
  VRAM (Total):	384 MB
  Vendor:	Intel (0x8086)
  Device ID:	0x0116
  Revision ID:	0x0009

about:support:

Vendor ID: 8086
Device ID: 0116
WebGL Renderer: Intel Inc. -- Intel HD Graphics 3000 OpenGL Engine -- 2.1 APPLE-7.14.5GPU 
Accelerated Windows: 1/1 OpenGL

Looks identical, aside from the renderer. This is Lion 10.7.2, same as mentioned in comment 5.

Since I can't repro original, can't verify fix yet.
[Triage comment]

This bug is being tracked for landing on ESR branch.  Please land patches on http://hg.mozilla.org/releases/mozilla-esr10/by Thursday March 1, 2012 in order to be ready for go-to-build on Friday March 2, 2012.

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more information.
[Triage comment]
Confirmed with dveditz that esr10 is affected by this bug and requires this patch to go to build - can someone please land today or tomorrow?  Let me know if there are any concerns we should be aware of.
Original patch applied (almost) cleanly to esr10, minus one comment line. (which conflicted)
Attachment #601483 - Flags: review?(bjacob)
Attachment #601483 - Flags: approval-mozilla-esr10?
Attachment #601483 - Flags: review?(bjacob) → review+
Comment on attachment 601483 [details] [diff] [review]
Deguarantee ResizeOffscreenFBO clearing its buffers, and adding clears in WebGLContext accordingly (esr10)

[Triage Comment]
Nomination for esr10 approved, please land this today if possible so we're ready to go to build on Friday. Thank you.
Attachment #601483 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
I haven't been able to reproduce the original problem in order to verify this bug fix. Geo, could you give it another try when you get a chance?
I'll bring my Air tomorrow and will try again.
QA is unable to verify this fix because we are unable to reproduce the original issue, even with an identical test environment. If someone is able to reproduce this issue, please verify the fix in Firefox 11.0b6 and 10.0.3esr.

Thanks
Whiteboard: [sg:vector-high][qa+] → [sg:vector-high][qa-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.