Closed Bug 613079 Opened 9 years ago Closed 9 years ago

WebGL crash [@mozilla::gl::GLContextProviderGLX::CreateOffscreen]

Categories

(Core :: Canvas: WebGL, defect, critical)

x86
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+
status1.9.2 --- unaffected
status1.9.1 --- unaffected

People

(Reporter: posidron, Assigned: bjacob)

References

(Blocks 1 open bug)

Details

(Keywords: crash, testcase, Whiteboard: [sg:dos])

Crash Data

Attachments

(5 files, 5 obsolete files)

Attached file testcase
Ubuntu 10.10 - 2.6.35-22-generic-pae
X.Org X Server 1.9.0

There was no module named i965 only i915 (video):

$ modinfo i915
filename:      
/lib/modules/2.6.35-22-generic-pae/kernel/drivers/gpu/drm/i915/i915.ko
license:        GPL and additional rights
description:    Intel Graphics
author:         Tungsten Graphics, Inc.
license:        GPL and additional rights
srcversion:     F936DAC9185D063B604C4FE
alias:          pci:v00008086d0000010Asv*sd*bc03sc00i*
depends:        drm,drm_kms_helper,video,intel-agp,i2c-algo-bit
vermagic:       2.6.35-22-generic-pae SMP mod_unload modversions 686 
parm:           modeset:int
parm:           fbpercrtc:int
parm:           powersave:int
parm:           lvds_downclock:int
Attached file callstack (obsolete) —
Summary: WebGL crash [mozilla::gl::GLContextProviderGLX::CreateOffscreen] → WebGL crash [@mozilla::gl::GLContextProviderGLX::CreateOffscreen]
How confident are you of that stack, and where'd you get the CreateOffscreen bit (it's not in the stack you attached)?  Could this actually be the stack in bug 612572?
That said, it's probably not.  (It might be good if you said what Firefox version you were using, though.)
"CreateOffscreen" is located in the console output: console.txt

Version: Mozilla/5.0 (X11; Linux i686; rv:2.0b8pre) Gecko/20101113 Firefox/4.0b8pre

I know it's a strange callstack but it keeps the same for this bug.
Keywords: crash, testcase
Could be a driver bug... does it happen anywhere else?
Is this crash in our code? Does it happen only on Linux drivers?
Assignee: nobody → bjacob
Attached file good callstack
I can reproduce on linux / x86-64 / NVIDIA.

This callstack was obtained with a debug build, with MOZ_GL_DEBUG and MOZ_X_SYNC env vars defined.
Attachment #491376 - Attachment is obsolete: true
I have set a breakpoint in GLContextGLX::~GLContextGLX(), it's never reached. So we accumulate more and more GL Contexts, using more and more memory --- even if we didn't have any crash bug here, we'd still eventually run out of memory.

The crash however occurs at a special time. At first, all the GL contexts are created by CreateOffscreenPixmapContext, and that runs fine, but eventually we run GLContextProviderGLX::CreateForWindow and that's when the crash occurs.
Forcing WebGL to use OSMesa makes it not crash, but Firefox is then using > 4G of virtual memory.
When it crashes here, 165 GLContextGLX objects have been constructed and none have been destroyed.

This is hard to reproduce: sometimes the contexts do get destroyed, in which case there's no crash.
Out of curiosity, why is your testcase JS code so weird? like the 64K characters long variable name AAAAA.....AAAAA on line 771? and with the try {} catch at every line, you don't get a chance to see any JS exception.
Checked the testcase sources, it's not forcing us to keep all those GL contexts alive simultaneously, so it's really us being stupid.

what the testcase does is basically

   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   var webgl = canvas.getContext("experimental-webgl");
   ....

My modest understanding of Javascript is that there's only one webgl variable here so as soon as a new context is assigned to it, the previous one can be garbage-collected.

Thus, the problem is that our garbage collector is running too late.

Is there a way that our WebGL implementation can tell the JS engine: "those WebGLContext objects should really be agressively garbage-collected when they fall out of scope" ?
Confirmed: If I force a GC run between each new WebGL context creation, with this code,

    netscape.security.PrivilegeManager.enablePrivilege("UniversalXPConnect");
    window.QueryInterface(Components.interfaces.nsIInterfaceRequestor)
          .getInterface(Components.interfaces.nsIDOMWindowUtils)
          .garbageCollect();

then the crash goes away, memory usage stays reasonable, and I don't have many GL contexts alive simultaneously.

So I would like to know the answer to the question I ask at the end of comment 12. WebGLContext's should immediately get GC'd when their refcount falls down to 0.
Boris is a good go-to to get to the answer to comment 13!
After a discussion with Jeff, he're where we stand:

In order to make good decisions about when to GC unreferenced objects, the JS engine needs to know the "cost" of these objects.

Typically "cost" means memory size.

However, in the case of WebGLContext's, "cost" is more subtle, because each WebGLContext has its own GL context, and GL contexts have a cost that can't be measured from the outside.

So maybe what's really needed is a way for us to inform the JS engine that WebGLContext's are more "expensive" than they look.
That might be nice, yes.  ccing some jseng folks.
So there are 2 separate bugs in this bug:
  A. That unreferenced WebGLContexts are GC'd so late that we let 165 of them accumulate. That was discussed above.
  B. That we actually crash, instead of gracefully failing to create more GL Contexts.

Here is a patch fixing B. The crashes are assertions about bad drawables, so they had to be fixable by correctly checking for X errors, this is what this patch does. Here, it makes us not crash. There is a potential speed issue as checking X server errors requires syncing. If this matters, we have a some room for improvement as CreateOffscreenPixmapContext calls CreateGLContext, both these functions sync, so here we have 2 XSyncs that can be merged.
Attachment #493119 - Flags: review?(vladimir)
Comment on attachment 493119 [details] [diff] [review]
X Error handling fixes

bumping this over to matt, he knows GLX stuff better than I do...
Attachment #493119 - Flags: review?(vladimir) → review?(matt.woodrow+bugzilla)
Comment on attachment 493119 [details] [diff] [review]
X Error handling fixes

Looks good for the most part.

Are we ever likely to want to change the error handler in other places in gfx? Making this more public (maybe gfx/src/X11Util.h) would be worthwhile?

I see a couple of uses within cairo, widget and toolkit, which might be too spread out to be worth changing.

Also what about making this a stack based class using RAII (like ScopedXFree), I think that would be cleaner and let us skip the extra goto.

SyncAndGetError could also probably return the error code (maybe as an optional param?), we don't use it here so maybe not relevant.


I'm fine with this as is, just a few possible ideas.
Attachment #493119 - Flags: review?(matt.woodrow+bugzilla) → review+
Can someone suggest a security rating for this bug?
blocking2.0: --- → betaN+
I'd say betaN+ is good enough, as I'll probably land it in the next few days.
Blocks: 616416
Attached patch X Error handling fixes, updated (obsolete) — Splinter Review
OK, updated.

(In reply to comment #19)
> Comment on attachment 493119 [details] [diff] [review]
> X Error handling fixes
> 
> Looks good for the most part.
> 
> Are we ever likely to want to change the error handler in other places in gfx?
> Making this more public (maybe gfx/src/X11Util.h) would be worthwhile?

Done.

> Also what about making this a stack based class using RAII (like ScopedXFree),
> I think that would be cleaner and let us skip the extra goto.

Done.

Note that it was very nontrivial to make sure that different ScopedXErrorHandler's would not interfere with each other, since the actual X error handler is a plain C function. So can you please review again.

> SyncAndGetError could also probably return the error code (maybe as an optional
> param?), we don't use it here so maybe not relevant.

I kept the return value unchanged, but added an optional second parameter allowing to pass a XErrorEvent * pointer to be filled.
Attachment #494999 - Flags: review?(matt.woodrow+bugzilla)
Attached patch X Error handling fixes, updated (obsolete) — Splinter Review
forgot to hg add X11Utils.cpp
Attachment #494999 - Attachment is obsolete: true
Attachment #495000 - Flags: review?(matt.woodrow+bugzilla)
Attachment #494999 - Flags: review?(matt.woodrow+bugzilla)
Fix Maemo build error found by tryserver; remove ScopedXErrorHandler copy in GLContextProviderGLX.cpp
Attachment #495000 - Attachment is obsolete: true
Attachment #495069 - Flags: review?(matt.woodrow+bugzilla)
Attachment #495000 - Flags: review?(matt.woodrow+bugzilla)
Comment on attachment 495069 [details] [diff] [review]
X Error handling fixes, updated


>+    // this ScopedXErrorHandler's ErrorEvent object
>+    ErrorEvent m_xerror;
>+

Would prefer mXError style naming, to go with the rest of the file.

Rest looks good to me!
Attachment #495069 - Flags: review?(matt.woodrow+bugzilla) → review+
Pushed the X11 part of this bug:

http://hg.mozilla.org/mozilla-central/rev/056814244ceb

This means that we don't crash anymore, instead we just fail to create WebGL contexts.

There remains the JS engine part to deal with (the fact that we fail to create WebGL contexts)
This fixes the non-libxul build by defining the functions in the .cpp file instread of putting them inline in the .h (which caused them to be compiled into the user code which could be in a different object in non-libxul builds), and applies Matt's above style comments (m_xerror ---> mXError)
Attachment #495600 - Flags: review?(dholbert)
Previous patch had a linking issue when gfx/src/X11Util.cpp tried to use Xlib symbols in non-libxul builds. Hopefully this fixes it.
Attachment #495600 - Attachment is obsolete: true
Attachment #495608 - Flags: review?(dholbert)
Attachment #495600 - Flags: review?(dholbert)
nope, still having issues with Xlib symbols - I get this in my build output w/ latest patch:
  X11Util.cpp:60: undefined reference to `XSetErrorHandler'
  X11Util.cpp:66: undefined reference to `XSetErrorHandler'
  X11Util.cpp:72: undefined reference to `XSync'
ok, this times the non-libxul build is really fixed, checked myself... :) used NS_GFX to export the class.
Attachment #495608 - Attachment is obsolete: true
Attachment #495624 - Flags: review?(dholbert)
Attachment #495608 - Flags: review?(dholbert)
Comment on attachment 495624 [details] [diff] [review]
fix non-libxul build (and apply Matt's style comments)

Looks ok to me.  Thanks for fixing!
Attachment #495624 - Flags: review?(dholbert) → review+
Group: core-security
Whiteboard: [sg:dos]
I filed bug 617453 about the JS part of this bug. Closing this one (original bug reported here was a crash, and that's now fixed).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
OS: Mac OS X → Linux
Depends on: 620356
Blocks: 658170
Crash Signature: [@mozilla::gl::GLContextProviderGLX::CreateOffscreen]
You need to log in before you can comment on or make changes to this bug.