Closed Bug 1014355 Opened 6 years ago Closed 4 years ago

Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)] (sometimes in test_bug346659.html, 626602-1.html))

Categories

(Core :: Graphics: Layers, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox40 --- fixed
b2g-v1.4 --- unaffected
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: KWierso, Assigned: nical)

References

Details

(Keywords: crash, intermittent-failure, regression)

Attachments

(1 file, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=40138520&tree=B2g-Inbound#error0
Ubuntu VM 12.04 b2g-inbound pgo test mochitest-e10s-3 on 2014-05-21 16:32:31 PDT for push 58aa8da5a45d

slave: tst-linux32-spot-192


16:37:54     INFO -  1351 INFO TEST-START | /tests/dom/tests/mochitest/bugs/test_bug346659.html
16:37:56     INFO -  System JS : ERROR chrome://global/content/browser-child.js:373 - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIFormFillController.attachToBrowser]
16:37:57     INFO -  [Parent 2391] ###!!! ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter); 3 requests ago: file /builds/slave/b2g-in-lx-pgo-0000000000000000/build/toolkit/xre/nsX11ErrorHandler.cpp, line 157
16:37:57     INFO -  [Parent 2391] ###!!! ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter); 3 requests ago: file /builds/slave/b2g-in-lx-pgo-0000000000000000/build/toolkit/xre/nsX11ErrorHandler.cpp, line 157
16:37:58     INFO -  ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
16:37:58     INFO -  TEST-INFO | Main app process: killed by SIGSEGV
16:37:58  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/bugs/test_bug346659.html | application terminated with exit code 11
16:37:58     INFO -  INFO | runtests.py | Application ran for: 0:02:18.820155
16:37:58     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp5M27k9pidlog
16:37:58     INFO -  ==> process 2391 launched child process 2430
16:37:58     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2430
16:37:58     INFO -  mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux-pgo/1400682602/firefox-32.0a1.en-US.linux-i686.crashreporter-symbols.zip
16:38:25  WARNING -  PROCESS-CRASH | /tests/dom/tests/mochitest/bugs/test_bug346659.html | application crashed [@ mozalloc_abort(char const*)]
16:38:25     INFO -  Crash dump filename: /tmp/tmpUY461u/minidumps/1185fb0f-2abb-bf92-2468c37a-221b1fc3.dmp
16:38:25     INFO -  Operating system: Linux
16:38:25     INFO -                    0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
16:38:25     INFO -  CPU: x86
16:38:25     INFO -       GenuineIntel family 6 model 26 stepping 5
16:38:25     INFO -       1 CPU
16:38:25     INFO -  Crash reason:  SIGSEGV
16:38:25     INFO -  Crash address: 0x0
16:38:25     INFO -  Thread 0 (crashed)
16:38:25     INFO -   0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:58aa8da5a45d : 30 + 0x0]
16:38:25     INFO -      eip = 0xb77cb35c   esp = 0xbfc28bf0   ebp = 0xbfc28c08   ebx = 0xb77ccd84
16:38:25     INFO -      esi = 0xb7668d9c   edi = 0xbfc28c3c   eax = 0x0000000a   ecx = 0xffffffff
16:38:25     INFO -      edx = 0xb76698ac   efl = 0x00010282
16:38:25     INFO -      Found by: given as instruction pointer in context
16:38:25     INFO -   1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:58aa8da5a45d : 435 + 0xd]
16:38:25     INFO -      eip = 0xb3a037e6   esp = 0xbfc28c10   ebp = 0xbfc29058   ebx = 0xb6e61d98
16:38:25     INFO -      esi = 0xbfc28c48   edi = 0xbfc28c3c
16:38:25     INFO -      Found by: call frame info
16:38:25     INFO -   2  libxul.so!X11Error [nsX11ErrorHandler.cpp:58aa8da5a45d : 157 + 0x2f]
16:38:25     INFO -      eip = 0xb4e80db5   esp = 0xbfc29060   ebp = 0xbfc29978   ebx = 0xb6e61d98
16:38:25     INFO -      esi = 0xbfc2910c   edi = 0x9a1d6088
16:38:25     INFO -      Found by: call frame info
16:38:25     INFO -   3  libX11.so.6.3.0 + 0x39aca
16:38:25     INFO -      eip = 0xb277aacb   esp = 0xbfc29980   ebp = 0xb729e000   ebx = 0xb2871ff4
16:38:25     INFO -      esi = 0xbfc299bc   edi = 0xb28742e8
16:38:25     INFO -      Found by: call frame info
16:38:25     INFO -   4  firefox!malloc [jemalloc.c:58aa8da5a45d : 6193 + 0xa]
16:38:25     INFO -      eip = 0x08061dab   esp = 0xbfc2998c   ebp = 0xb729e000
16:38:25     INFO -      Found by: stack scanning
16:38:25     INFO -   5  firefox!free [jemalloc.c:58aa8da5a45d : 1649 + 0x7]
16:38:25     INFO -      eip = 0x080643b5   esp = 0xbfc299a0   ebp = 0xb729e000
16:38:25     INFO -      Found by: stack scanning
Tweaking summary to avoid mis-stars.
Summary: Intermittent test_bug346659.html | application crashed [@ mozalloc_abort(char const*)] after ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter); 3 requests ago: file nsX11ErrorHandler.cpp, line 157 → Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)]
Many of these failures were filed in bug 1005056, the first being bug 1005056 comment 3.
This is earlier than comment 28:
https://tbpl.mozilla.org/php/getParsedLog.php?id=40620611&tree=Mozilla-Inbound
Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc on 2014-05-29 01:58:46 PDT for push 2430884972b6
slave: tst-linux32-spot-569

[Parent 2340] ###!!! ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter): file /builds/slave/m-in-lx-0000000000000000000000/build/toolkit/xre/nsX11ErrorHandler.cpp, line 157
(In reply to Karl Tomlinson (needinfo?:karlt) from comment #38)
> http://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=28e12f7f68be&tochange=2430884972b6
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=28e12f7f68be

and (I mean) https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=2430884972b6
(In reply to Karl Tomlinson (needinfo?:karlt) from comment #38)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/96395ad2aac9 looks
> the most likely candidate
Flags: needinfo?(bas)
https://tbpl.mozilla.org/php/getParsedLog.php?id=41021446&tree=B2g-Inbound

Adding test_bug346659.html to summary, since there are unrelated search results for that test, which means we don't hit the top frame fallback.
Summary: Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)] → Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)] (sometimes in test_bug346659.html))
Summary: Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)] (sometimes in test_bug346659.html)) → Intermittent Linux "ABORT: X_FreePixmap: BadPixmap (invalid Pixmap parameter)" in nsX11ErrorHandler.cpp, line 157 [@ mozalloc_abort(char const*)] (sometimes in test_bug346659.html, 626602-1.html))
I've pinged the request in bug 1002300 for a backout, again.
Hey Chris, could you have a look at this bug?
Assignee: nobody → chrislord.net
Flags: needinfo?(bas)
Assignee: chrislord.net → nical.bugzilla
Looking at the code (haven't reproduced it locally for a while) and not being a gtk/X11 expert, my first guess is that we should not be calling mSurface->ReleasePixmap() when allocating a TextureClientX11, since we don't know for sure that the TextureClient will effectively be shared with the compositor process (and a comment above that line says that we do this because we assume the TextureClient is always deallocated on the host side).
Try push with a patch that moves ReleasePixmap into the destructor and execute it only if the Texture was shared with the host side. https://tbpl.mozilla.org/?tree=Try&rev=1c1cd91e08f1
Comment on attachment 8441503 [details] [diff] [review]
Make sure we call ReleaseSurface only if we know thate the pixmap will be freed by the host side.

Review of attachment 8441503 [details] [diff] [review]:
-----------------------------------------------------------------

So if ShouldDeallocateInDestructor is true, what then? It looks to me like this would leak in that situation? If not, there definitely needs to be a comment in that destructor explaining what happens. r- because this seems confusing/not quite right to me and at the least, a decent explanatory comment needs to be added.
Attachment #8441503 - Flags: review?(chrislord.net) → review-
Attached patch v2Splinter Review
My understanding of gfxXlibSurface is that if a surface object is the owner of the pximap, it will deallocate it in its destructor. To take/abandon ownership one uses TakePixmap and ReleasePixmap. In this patch I moved the call to ReleasePixmap into ToSurfaceDescriptor to make more evident that the TextureClient is passing ownership to the Host at this point (if the proper texture flag is set). If it doesn't pass ownership, it means that the client will deallocate the memory. himself which can currently happen if the texture is deallocated before it was shared with the host side (hence it never got the occasion to pass the ownership on to someone else). This is as opposed to the current situation where the client always denies ownership of the pixmap, assuming that there will always be a host to take it, which isn't always true.
I hope this clears the confusion.

Note that we already assume that ToSurfaceDescriptor can only ever be called once and that we assume that it will always create a TextureHost (some D3D9 code breaks if we don't do that) so there is no risk of ToSurfaceDescriptor being called several times.
Attachment #8441503 - Attachment is obsolete: true
Attachment #8442077 - Flags: review?(chrislord.net)
Comment on attachment 8442077 [details] [diff] [review]
v2

Review of attachment 8442077 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM.

::: gfx/layers/basic/TextureClientX11.cpp
@@ +78,5 @@
>  
> +  if (!(mFlags & TextureFlags::DEALLOCATE_CLIENT)) {
> +    // Pass to the host the responsibility of freeing the pixmap. ReleasePixmap means
> +    // the underlying pixmap will not be deallocated in mSurface's destructor.
> +    mSurface->ReleasePixmap();

I think some comment about ToSurfaceDescriptor guaranteed to only be called once and that it signifies the surface descriptor will definitely be given to the host where the pixmap is freed wouldn't hurt, but this is already much better.
Attachment #8442077 - Flags: review?(chrislord.net) → review+
Added a comment in ToSurfaceDescriptor and landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2c7d100d0595
The patch applies cleanly on aurora.
Comment on attachment 8442077 [details] [diff] [review]
v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: crashes with OMTC/e10s on Linux (low impact, affects mostly e10s testing or users flipping prefs)
Testing completed (on m-c, etc.): landed in m-c on June 18th
Risk to taking this patch (and alternatives if risky): low risk
String or IDL/UUID changes made by this patch:
Attachment #8442077 - Flags: approval-mozilla-aurora?
Thank you for fixing this! :-)
https://hg.mozilla.org/mozilla-central/rev/2c7d100d0595
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Comment on attachment 8442077 [details] [diff] [review]
v2

Aurora approval granted.

Can you please confirm that this bug impacts Firefox 32 and 33 and that Firefox 31 and earlier are not impacted?
Attachment #8442077 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/4e6018ad2639

(In reply to Lawrence Mandel [:lmandel] from comment #175)
> Can you please confirm that this bug impacts Firefox 32 and 33 and that
> Firefox 31 and earlier are not impacted?

This was a regression from bug 1002300, which landed on Fx32 only.
(In reply to TBPL Robot from comment #178)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=42377458&tree=Mozilla-Aurora
> Ubuntu VM 12.04 mozilla-aurora pgo test crashtest-ipc on 2014-06-24 12:08:46
> revision: 59fe7d610045
> slave: tst-linux32-spot-453
> 
> [Parent 1687] ###!!! ABORT: X_FreePixmap: BadPixmap (invalid Pixmap
> parameter): file
> /builds/slave/m-aurora-lx-000000000000000000/build/toolkit/xre/
> nsX11ErrorHandler.cpp, line 157
> [Parent 1687] ###!!! ABORT: X_FreePixmap: BadPixmap (invalid Pixmap
> parameter): file
> /builds/slave/m-aurora-lx-000000000000000000/build/toolkit/xre/
> nsX11ErrorHandler.cpp, line 157
> TEST-UNEXPECTED-FAIL |
> file:///builds/slave/test/build/tests/reftest/tests/layout/generic/
> crashtests/479938-1.html | Exited with code 11 during test run
> PROCESS-CRASH |
> file:///builds/slave/test/build/tests/reftest/tests/layout/generic/
> crashtests/479938-1.html | application crashed [@ mozalloc_abort(char
> const*)]
> Return code: 1

Well that's fun. This was from a push after comment 177.
Do we have any idea why Aurora continues to hit these occasionally?
Flags: needinfo?(nical.bugzilla)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #186)
> Do we have any idea why Aurora continues to hit these occasionally?

Not really, I'll have a look.
I have managed to reproduce this sometimes by closing an e10s window while a video is playing. I am not sure but it looks like an X11 TextureClient is being destroyed too late. I don't know exactly at which point we can't use X11 any more (probably when the widget is destroyed?), but it could be that we need to check for this and skip all operations on TextureClientX11 (including deallocation) if we are past that point. Otherwise try to eagerly deallocate all of the X11 TextureClients when the widget dies but that may be tricky and cause other dependency issues.
Status: RESOLVED → REOPENED
Flags: needinfo?(nical.bugzilla)
Resolution: FIXED → ---
(In reply to Nicolas Silva [:nical] from comment #189)
> I have managed to reproduce this sometimes by closing an e10s window while a
> video is playing. I am not sure but it looks like an X11 TextureClient is
> being destroyed too late. I don't know exactly at which point we can't use
> X11 any more (probably when the widget is destroyed?)

The X11 display is available until nsAppRunner exits.
The Window (and dependent GL contexts IIRC) can't be used after the nsWindow is (On)Destroy()ed, but Pixmaps are independent of the Window and so can be used after the Window is destroyed.

I don't know whether it is the TextureClient or TextureHost that owns the Pixmap, but I would check the logic there.  Perhaps http://xtrace.alioth.debian.org/ might be useful to determine whether the Pixmaps are used before or after the server has created/destroyed them.  XSync() shows up as GetInputFocus.
This reproduces fairly easily with in GLContextGLX::MakeCurrent (with MOZ_X_SYNC) on the compositor thread before destroying the remaining textures. At the same time, the main thread is waiting for the ImageBirdge::ShutDown to complete and is already past the destructor of nsWindow.
 
The GLContext used by the compositor doesn't own, the X Drawable is is using, the latter being owned by the nsWindow, so it very much looks like we need either:
* a way for GLContextGLX::MakeCurrent to early-return false if the widget owning the drawable has already been destroyed
* a way to keep the widget alive longer or a way to kill all textures before the part of ~nsWindow where the window drawable gets destroyed (very hard).
You guys know more about X than I do, does comment 215 make sense?

To give a few more precisions, the Drawable I am referring to is nsWindow::mGdkWindow, which is passed to the compositor's GLContextGLX at creation time with deleteDrawable set to false (so the GLContext doesn't own the the drawable). So the Drawable is also GLContextGLX::mDrawable. The Drawable is destroyed somewhere in nsWindow::Destroy, which is called by ~nsWindow, all of this happening before all of the textures have had a chance to be destroyed (in particular, the ones used by async-video).

Would it be feasible to reference count the drawable and destroy it later (possibly on the compositor thread)? or would it be ok to notify all of the gl contexts that are created from a given widget that their drawable is gone?
Flags: needinfo?(karlt)
Flags: needinfo?(bjacob)
(In reply to Nicolas Silva [:nical] from comment #215)
> * a way for GLContextGLX::MakeCurrent to early-return false if the widget
> owning the drawable has already been destroyed

Yes to that. The other proposals seem more complicated for no benefit, but this one makes sense.

So in GLContextProvider*::CreateForWindow, as this method already receives a pointer to a Widget, you should be able to take a WeakPtr to it so you could later check if it's still there. But don't take a strong reference, as that could unnecessarily reorder the shutdown sequence and even create a cycle if the widget is conversely keeping the GLContext alive.
Flags: needinfo?(bjacob)
(In reply to Nicolas Silva [:nical] from comment #215)
> This reproduces fairly easily with in GLContextGLX::MakeCurrent (with
> MOZ_X_SYNC) on the compositor thread [...]
  
> The GLContext used by the compositor doesn't own, the X Drawable is is
> using, the latter being owned by the nsWindow,

(In reply to Nicolas Silva [:nical] from comment #216)
> To give a few more precisions, the Drawable I am referring to is
> nsWindow::mGdkWindow, which is passed to the compositor's GLContextGLX at
> creation time with deleteDrawable set to false (so the GLContext doesn't own
> the the drawable). So the Drawable is also GLContextGLX::mDrawable. The
> Drawable is destroyed somewhere in nsWindow::Destroy, which is called by
> ~nsWindow, all of this happening before all of the textures have had a
> chance to be destroyed (in particular, the ones used by async-video).

It sounds like you are describing the scenario with GL layers.
Don't these tests run with Basic layers?

B2G desktop was the only exception I knew using GL layers on X11.

This scenario would cause problems, but I don't know that they are the
problems that this bug tracks.

I wouldn't have expected this scenario to result in BadPixmap errors in
X_FreePixmap.  In the past we saw BadDrawable errors in X_CreateGC and
GLXBadDrawable in X_GLXMakeCurrent.

(In reply to Nicolas Silva [:nical] from comment #215)
> * a way for GLContextGLX::MakeCurrent to early-return false if the widget
> owning the drawable has already been destroyed

(In reply to Nicolas Silva [:nical] from comment #216)
> would it be ok to notify all of the
> gl contexts that are created from a given widget that their drawable is gone?

I'm guessing we can assume that all inaccessible resources will be released by
the implementation, though I haven't verified that.

In the past we also needed to ensure that a context associated with a window was
not current when the window was destroyed, or we got errors when a different
context was made current.  Perhaps this was due to flushing when the context
changed, or perhaps the implementation always needs the window or its glx object when the context is released - I don't know.
Flags: needinfo?(karlt)
(In reply to Karl Tomlinson (:karlt) from comment #218)
> It sounds like you are describing the scenario with GL layers.
> Don't these tests run with Basic layers?
> 
> B2G desktop was the only exception I knew using GL layers on X11.
> 
> This scenario would cause problems, but I don't know that they are the
> problems that this bug tracks.

You are right, This is probably a different problem, I'll file a separate bug.
I spent quite some today trying to reproduce the issue with OMTC + basic layers, and I unfortunately haven't managed yet.

> 
> I wouldn't have expected this scenario to result in BadPixmap errors in
> X_FreePixmap.  In the past we saw BadDrawable errors in X_CreateGC and
> GLXBadDrawable in X_GLXMakeCurrent.
> 
> (In reply to Nicolas Silva [:nical] from comment #215)
> > * a way for GLContextGLX::MakeCurrent to early-return false if the widget
> > owning the drawable has already been destroyed
> 
> (In reply to Nicolas Silva [:nical] from comment #216)
> > would it be ok to notify all of the
> > gl contexts that are created from a given widget that their drawable is gone?
> 
> I'm guessing we can assume that all inaccessible resources will be released
> by
> the implementation, though I haven't verified that.
> 
> In the past we also needed to ensure that a context associated with a window
> was
> not current when the window was destroyed, or we got errors when a different
> context was made current.  Perhaps this was due to flushing when the context
> changed, or perhaps the implementation always needs the window or its glx
> object when the context is released - I don't know.

The fact that the window is created/destroyed on the main thread while the GLContext lives in the compositor thread is quite unfortunate (both because I don't know how well drivers support/test against that, and because telling the GLContext that it can't call MakeCurrent after the window is destroyed on another thread is inconvenient).
Filed Bug 1059793
Perhaps one thing to try to induce the kind of races that might be involved here is to call XSynchronize() in only one process for each process.
https://hg.mozilla.org/mozilla-central/rev/2c717162f3f8
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
(In reply to Carsten Book [:Tomcat] from comment #233)
> https://hg.mozilla.org/mozilla-central/rev/2c717162f3f8

Oh snap! I put the wrong bug number in the patch. This patch was from bug 1147777, Sorry.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Managing textures around shutdown has improved a lot lately and there's no trace of this happening in a while on treeherder and crash-stats so I'll close. Feel free to reopen if it shows up again.
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.