Closed Bug 797568 Opened 7 years ago Closed 7 years ago

Crash in nsWindow::OnMotionNotifyEvent with abort message: "DRI2DestroyDrawable: BadDrawable (invalid Pixmap or Window parameter)"

Categories

(Core :: Graphics: Layers, defect, critical)

18 Branch
x86
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla19
Tracking Status
firefox18 - affected

People

(Reporter: wgianopoulos, Assigned: karlt)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(8 files, 1 obsolete file)

I was asked to file this on IRC, even though to create it you need to set alter a pref layers.acceleration.force-enabled;true.

Steps to reproduce:

1.  set layers.acceleration.force-enabled;true
2.  visit http://www.patriots.com.
3.  Hover over team and click on Trnasactions

crash occurs

bp-8883fd9e-1682-4eeb-88a4-bf40a2121003
Guess I could have spelled Transactions better :-)
The graphics section of about:support would have been here had there been a useful way to copy it to the clipboard.  I guess we need another bug for that.
Since this is an X error (that we're just killing Firefox for), I'm going to call this a Widget bug.
Component: Graphics → Widget: Gtk
Crash Signature: [@ mozalloc_abort | NS_DebugBreak_P | X11Error]
Keywords: crash
Summary: Crash @ mozalloc_abort | NS_DebugBreak_P | X11Error → Crash in nsWindow::OnMotionNotifyEvent with abort message: "DRI2DestroyDrawable: BadDrawable (invalid Pixmap or Window parameter)"
FYI I probably would not even have asked if a bug was wanted here since I was overriding something turned off on purpose was that WebGL demos run orders of magnitude faster with layers acceleration on using this OS hardware combo and this is the only crash I have so far encountered with it forced enabled.
added this as a screenshot since it seems impossible to bet useful about:support output to attach to a bug.
second try with a better image format.
Attachment #667718 - Attachment is obsolete: true
Mesa tries to suppress this error
http://cgit.freedesktop.org/mesa/mesa/tree/src/glx/dri2.c?h=8.0&id=59997d619d8957fca3b2042fe4ebeed0709c0204#n185
but that only works if the error is received during _XReply, not while waiting for subsequent events
http://cgit.freedesktop.org/xorg/lib/libX11/tree/src/xcb_io.c?id=ed00b460acb08787b695f27b864e96102dfd4867#n202

Bill wasn't able to reproduce with MOZ_X_SYNC=1 in the environment.
I expect that is because XSynchronize is implemented with an _XReply after every async request.
This is a GLX layers issue, though it might be a Mesa bug.
Component: Widget: Gtk → Graphics: Layers
Nicolas has expressed a lot of interest in getting GL layers enabled by default on Linux :)
Assignee: nobody → nical.bugzilla
If the XSync before the DRI2DestroyDrawable request were instead after the request, I expect we wouldn't see this issue.  That XSync was there in the first revision
http://cgit.freedesktop.org/mesa/mesa/tree/src/glx/x11/dri2.c?id=e82dd8c6e1fa2fff5b960de26961080ba5e9651d#n217
before the DRI2Error function was added
http://cgit.freedesktop.org/mesa/mesa/commit/?h=8.0&id=16887d042a917fa4773e4d853f50051b54e9948c

There may be a bug in Gecko calling glXDestroyPixmap too late.  

An xtrace log may contain some helpful clues.
http://xtrace.alioth.debian.org/

Perhaps there's a chance attachment 663949 [details] [diff] [review] may help by making the context not current earlier.
Does the crash still happen with Flash Player disabled?
We are calling glXDestroyPixmap in the gfxXlibSurface destructor and that is not XFinished() before PluginInstanceParent::RecvShow() returns.
If libbonoboui-2.so.0 is loaded, it suppresses all BadDrawable messages, so to reproduce that needs to be moved out of the way.
http://git.gnome.org/browse/libbonoboui/tree/bonobo/bonobo-ui-main.c#n41
Once that was out of the way I could reproduce intermittently but reasonably frequently clicking on "Watch Press Conference", pressing back and trying again if necessary.

The resource id in the abort message was for a resource created in the plugin process, supporting the theory in comment 11.
(In reply to Karl Tomlinson (:karlt) from comment #10)
> If the XSync before the DRI2DestroyDrawable request were instead after the
> request, I expect we wouldn't see this issue.  That XSync was there in the
> first revision
> http://cgit.freedesktop.org/mesa/mesa/tree/src/glx/x11/dri2.
> c?id=e82dd8c6e1fa2fff5b960de26961080ba5e9651d#n217
> before the DRI2Error function was added
> http://cgit.freedesktop.org/mesa/mesa/commit/?h=8.
> 0&id=16887d042a917fa4773e4d853f50051b54e9948c
> 
> There may be a bug in Gecko calling glXDestroyPixmap too late.  
> 
> An xtrace log may contain some helpful clues.
> http://xtrace.alioth.debian.org/
> 
> Perhaps there's a chance attachment 663949 [details] [diff] [review] may
> help by making the context not current earlier.

Unfortunately, that patch did not seem to help. I will probably find more time to investigate the other suggestions over the weekend.
I think you are right on that this is flash related.  My sense of what is happening here is that if there is a load of a flash object pending when I do something to either close the tab or navigate away form the window, there are cases when this crash occurs when the flash object load completes.
So like when the object load completes, a missing check to see if the context that caused the load still exists.
Thanks, Bill.  Comments 11 and 12 identify an issue that we need to fix, and point to that as the likely cause here.

This is probably a regression from bug 787853, because that delayed the glXDestroyPixmap until gfxXlibSurface destruction.

I expect the simplest fix would ensure that PluginInstanceParent::RecvShow() releases the reference held by the old mFrontSurface and ensures that mImageContainer is not holding a reference before calling FinishX().

The code there is remnant from when the PluginInstanceOwner or nsObjectFrame managed the ImageContainer.  Now that the PluginInstanceParent owns the image container, I suspect it would make sense for the RecvShow just to update the ImageContainer and the code called on RecvNPN_InvalidateRect would no longer need to call GetImageContainer just to update the Image in the container.

Perhaps the MacIOSurface code could be left as is if the same doesn't make sense there.  There's some callbacks attached to surfaces, and I'm not familiar with the purpose of that.
Blocks: 787853
So, is there something else I should be testing here?  OR are we waiting for a patch to test?
I think we can fix the known issue, then test that when we have a patch, before doing further investigation, thanks.
Keywords: regression
Version: Trunk → 18 Branch
OK but I would be willing to test a not ready for review patch to see if it fixes the issue I am encountering.
(In reply to Karl Tomlinson (:karlt) from comment #16)
> Thanks, Bill.  Comments 11 and 12 identify an issue that we need to fix, and
> point to that as the likely cause here.
> 
> This is probably a regression from bug 787853, because that delayed the
> glXDestroyPixmap until gfxXlibSurface destruction.
> 
> I expect the simplest fix would ensure that PluginInstanceParent::RecvShow()
> releases the reference held by the old mFrontSurface and ensures that
> mImageContainer is not holding a reference before calling FinishX().
> 
> The code there is remnant from when the PluginInstanceOwner or nsObjectFrame
> managed the ImageContainer.  Now that the PluginInstanceParent owns the
> image container, I suspect it would make sense for the RecvShow just to
> update the ImageContainer and the code called on RecvNPN_InvalidateRect
> would no longer need to call GetImageContainer just to update the Image in
> the container.
> 
> Perhaps the MacIOSurface code could be left as is if the same doesn't make
> sense there.  There's some callbacks attached to surfaces, and I'm not
> familiar with the purpose of that.

It seems you are on to something backing out the fix for bug 787853 seems to fix the issue for me.
So, I am re-spinning my builds without the fix for bug 787853.
(In reply to Karl Tomlinson (:karlt) from comment #18)
> I think we can fix the known issue, then test that when we have a patch,
> before doing further investigation, thanks.

Well I gave time here because of the uplift.  If this can't be fixed then the patch causing the regression should be backed out.
Assignee: nical.bugzilla → karlt
When trying to follow the plugin ImageContainer code involved here, I found it quite difficult to follow, partly because the implementation has changed over time but there are still remnants of old implementations to distract you.  I'll attach some cleanup patches.

Having nsObjectFrame::GetImageContainer() is essentially deceitful, because this just creates an empty ImageContainer and never really uses it.  In bug 724886, ownership of the ImageContainer moved from the nsObjectFrame to the PluginInstanceParent.

Despite appearances, the logic for dispatching NotifyPaintWaiter() is unchanged by this patch.
Attachment #676439 - Flags: review?(roc)
This code is obsolete, perhaps since we now always use async drawing for OOP plugins as of https://hg.mozilla.org/mozilla-central/rev/b8fa6f780f38
(Note that UseAsyncRendering here is quite different from IsAsyncDrawing() in PluginInstanceParent/Child.)  IP plugins don't have an ImageContainer.

Does this mean that the mIOSurface code in PluginInstanceParent (and matching Child code) for sync
NPDrawingModelCoreAnimation/NPDrawingModelInvalidatingCoreAnimation is now
unused?
Attachment #676442 - Flags: review?(bgirard)
GetImageContainer(void) creates a container and never fails.
It's a little easier to follow the code when there are fewer places that can create ImageContainers.  This code comes from Bug 651192.
Attachment #676448 - Flags: review?(roc)
Now that the PluginInstanceParent owns the container, it is easier to follow the state of the ImageContainer Image when the PluginInstanceParent controls it.

When the plugin was shutting down, RecvNPN_InvalidateRect would not cause GetImageContainer(ImageContainer**) to be called, and so there was the special case of the PluginInstanceParent updating the container if it received a null surface, but the nsPluginInstanceOwner initiating when there is a surface.
However, I think I recall new non-null surfaces arriving during destruction, and these messages should remove the old surfaces from use, which wasn't happening because GetImageContainer(ImageContainer**) wasn't called.

The surface flush was a no-op given cairo hadn't been drawing to the surface.

It is not so simple to update the mFrontIOSurface code to do the same because ref counting in nsPluginInstanceOwner::GetImageContainer(ImageContainer**) assumes that PluginInstanceParent::GetImageContainer() creates a new image for each call.
Attachment #676451 - Flags: review?(roc)
At this point, it would be possible to rearrange the order of code in RecvShow and null out mFrontSurface at the right time to run ~gfxXlibSurface.

However relying on the timing of removal of references can be quite fickle.
gfxASurface::Finish() looks like what we want.
Try debug tests pass with changes very similar to these.
https://tbpl.mozilla.org/?tree=Try&rev=3a78a1b95c29
https://tbpl.mozilla.org/?tree=Try&rev=dc6e868153ad
Still to run Android tests.
I considered setting mDrawable to None, but it seems quite reasonable to use
ReleasePixmap() after calling Finish().  Even cairo_xlib_surface_get_drawable
returns the drawable after cairo_surface_finish() has been called.
Attachment #676469 - Flags: review?(roc)
http://www.cairographics.org/manual/cairo-cairo-surface-t.html#cairo-surface-finish is what we need here.

I wonder why we weren't having the similar trouble to this bug for RENDER
layers with the XRenderFreePicture call for destruction of
cairo_xlib_surfaces.

cairo_surface_finish calls cairo_surface_flush, and the flush may be necessary even for image surfaces.
Attachment #676470 - Flags: review?(roc)
(In reply to Karl Tomlinson (:karlt) from comment #24)
> Created attachment 676442 [details] [diff] [review]
> Does this mean that the mIOSurface code in PluginInstanceParent (and
> matching Child code) for sync
> NPDrawingModelCoreAnimation/NPDrawingModelInvalidatingCoreAnimation is now
> unused?

I think so but I'd have to do a few experiments to be 100% sure.
Attachment #676442 - Flags: review?(bgirard) → review+
> tracking-firefox18: --- → ?

AFAIK this affects only those on systems without libgnomeui (to pull in libbonoboui-2.so.0) that have set layers.acceleration.force-enabled true to test an incomplete feature.

If there are crashes happening due to errors from RenderFreePicture, then those may also be addressed by this, but I haven't seen that happen.
Depends on: 807728
Uncommon pref/crash, no need to track.
Depends on: 807925
You need to log in before you can comment on or make changes to this bug.