Last Comment Bug 683063 - Crash opening about:memory in Fennec
: Crash opening about:memory in Fennec
Status: RESOLVED FIXED
[inbound]
: crash
Product: Core
Classification: Components
Component: IPC (show other bugs)
: Trunk
: x86_64 Linux
: -- critical (vote)
: ---
Assigned To: Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
:
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-08-29 18:06 PDT by Nicholas Nethercote [:njn]
Modified: 2011-08-31 11:26 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Valgrind output (128.43 KB, text/plain)
2011-08-29 18:21 PDT, Nicholas Nethercote [:njn]
no flags Details
part 1: Don't delete ContentParent out from under the cleanup process (2.35 KB, patch)
2011-08-29 23:40 PDT, Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
josh: review+
Details | Diff | Splinter Review
part 2: Make sure GetWindowsTable() always returns the table, even if an nsGlobalWindow hasn't been created yet (3.01 KB, patch)
2011-08-29 23:43 PDT, Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
no flags Details | Diff | Splinter Review

Description Nicholas Nethercote [:njn] 2011-08-29 18:06:30 PDT
#0  0x00007fef676a45ad in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fef676a443c in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fef64b5281e in ah_crap_handler (signum=11) at /home/njn/moz/mi3/toolkit/xre/nsSigHandlers.cpp:121
#3  0x00007fef64b56202 in nsProfileLock::FatalSignalHandler (signo=11, info=0x7fff7da366b0, context=0x7fff7da36580)
    at /home/njn/moz/mi3/mobile/d64/toolkit/profile/nsProfileLock.cpp:226
#4  <signal handler called>
#5  mozilla::dom::PContentParent::DeallocSubtree (this=0x7fef4dbc2400)
    at /home/njn/moz/mi3/mobile/d64/ipc/ipdl/PContentParent.cpp:2261
#6  0x00007fef65f0485c in mozilla::dom::PContentParent::OnChannelError (this=0x7fef4dbc2400)
    at /home/njn/moz/mi3/mobile/d64/ipc/ipdl/PContentParent.cpp:1858
#7  0x00007fef65eb864d in mozilla::ipc::AsyncChannel::NotifyMaybeChannelError (this=0x7fef4dbc2410)
    at /home/njn/moz/mi3/ipc/glue/AsyncChannel.cpp:378
#8  0x00007fef65eb8742 in mozilla::ipc::AsyncChannel::OnNotifyMaybeChannelError (this=0x7fef4dbc2410)
    at /home/njn/moz/mi3/ipc/glue/AsyncChannel.cpp:343
#9  0x00007fef65e93904 in DispatchToMethod<mozilla::plugins::PluginInstanceChild, void (mozilla::plugins::PluginInstanceChild::*)()> (this=<value optimised out>) at /home/njn/moz/mi3/ipc/chromium/src/base/tuple.h:383
#10 RunnableMethod<mozilla::plugins::PluginInstanceChild, void (mozilla::plugins::PluginInstanceChild::*)(), Tuple0>::Run (
    this=<value optimised out>) at /home/njn/moz/mi3/ipc/chromium/src/base/task.h:307
#11 0x00007fef66049e2f in MessageLoop::RunTask (this=0x7fef674ee8f0, task=0x7fef4b435c80)
    at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:318
#12 0x00007fef6604baf6 in MessageLoop::DeferOrRunPendingTask (this=<value optimised out>, pending_task=<value optimised out>)
    at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:326
#13 0x00007fef6604bd26 in MessageLoop::DoWork (this=0x7fef674ee8f0) at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:426
#14 0x00007fef65ebd393 in mozilla::ipc::MessagePump::Run (this=0x7fef5ba54f40, aDelegate=0x7fef674ee8f0)
    at /home/njn/moz/mi3/ipc/glue/MessagePump.cpp:114
#15 0x00007fef66049cf6 in MessageLoop::RunInternal (this=0x7fef674ee8f0)
    at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:208
#16 0x00007fef66049d07 in MessageLoop::RunHandler (this=<value optimised out>)
    at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:201
#17 0x00007fef6604a01a in MessageLoop::Run (this=0x7fef674ee8f0) at /home/njn/moz/mi3/ipc/chromium/src/base/message_loop.cc:175
#18 0x00007fef65d6ae0c in nsBaseAppShell::Run (this=0x7fef559d58d0)
    at /home/njn/moz/mi3/widget/src/xpwidgets/nsBaseAppShell.cpp:189
#19 0x00007fef65aed3bf in nsAppStartup::Run (this=0x7fef52a091a0)
    at /home/njn/moz/mi3/toolkit/components/startup/nsAppStartup.cpp:224


Doing some more inspection:

#5  mozilla::dom::PContentParent::DeallocSubtree (this=0x7fef4dbc2400)
    at /home/njn/moz/mi3/mobile/d64/ipc/ipdl/PContentParent.cpp:2261
2261	        for (uint32 i = 0; (i) < ((kids).Length()); (++(i))) {
(gdb) p kids
$1 = (
    InfallibleTArray<mozilla::dom::PAudioParent*> &) @0x7fef4dbc26b8: {<nsTArray<mozilla::dom::PAudioParent*, nsTArrayInfallibleAllocator>> = {<nsTArray_base<nsTArrayInfallibleAllocator>> = {
      mHdr = 0x5a5a5a5a5a5a5a5a}, <nsTArray_SafeElementAtHelper<mozilla::dom::PAudioParent*, nsTArray<mozilla::dom::PAudioParent*, nsTArrayInfallibleAllocator> >> = {<No data fields>}, <No data fields>}, <No data fields>}

That 0x5a5a5a5a pattern makes it look like this is a use-after-free error.

This isn't a 100% reliable crash;  I tried a few times and one time the content process crashed but about:memory showed data for the chrome process.
Comment 1 Nicholas Nethercote [:njn] 2011-08-29 18:07:17 PDT
This is on a 64-bit desktop Linux build, BTW.  (I tried to do a 32-bit desktop build but failed due to some problems with freetype headers, unfortunately.)
Comment 2 Mark Finkle (:mfinkle) (use needinfo?) 2011-08-29 18:12:03 PDT
I have seen this happen on my 64bit Ubuntu builds as well as my Android builds. As Nicholas points out, it seems intermittent.
Comment 3 Nicholas Nethercote [:njn] 2011-08-29 18:21:26 PDT
Created attachment 556733 [details]
Valgrind output

Valgrind finds *lots* of use-after-free errors.  There were some errors before this but they appeared unrelated so I removed them from the attached file.
Comment 4 Josh Matthews [:jdm] 2011-08-29 19:02:37 PDT
I would just like to point out that this output from valgrind is completely nonsensical:

==31828==  Address 0x1bfba168 is 632 bytes inside a block of size 816 free'd
==31828==    at 0x402A381: free (vg_replace_malloc.c:366)
==31828==    by 0x403FFE1: moz_free (mozalloc.cpp:97)
==31828==    by 0x7761C4B: mozilla::dom::ContentParent::~ContentParent() (mozalloc.h:253)
==31828==    by 0x77627A1: mozilla::dom::ContentParent::Release() (in /home/njn/moz/mi3/mobile/v64/toolkit/library/libxul.so)
==31828==    by 0x66889EC: nsCOMPtr_base::~nsCOMPtr_base() (nsAutoPtr.h:969)
==31828==    by 0x667C96F: nsBoxLayoutState::~nsBoxLayoutState() (nsCSSFrameConstructor.cpp:1463)
==31828==    by 0x776179F: mozilla::dom::ContentParent::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::RPCChannel::RPCListener>::ActorDestroyReason) (ContentParent.cpp:285)

ActorDestroy -> ~nsBoxLayoutState? Those are two function names I would not expect to see back to back. Nicholas, any idea what Valgrind is trying to say here?
Comment 5 Nicholas Nethercote [:njn] 2011-08-29 19:39:54 PDT
Valgrind just reports what the stack + debug info tells it.  There could be some frames missing due to inlining.
Comment 6 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-29 20:46:06 PDT
STR here are just to open about:memory?
Comment 7 Nicholas Nethercote [:njn] 2011-08-29 22:18:55 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #6)
> STR here are just to open about:memory?

Yes.
Comment 8 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-29 23:40:37 PDT
Created attachment 556763 [details] [diff] [review]
part 1: Don't delete ContentParent out from under the cleanup process

This is a pretty serious bug.  It looks like ContentParent was always dying when ActorDestroy() exited, which always caused a bunch of use-after-free errors.  These were probably only happening to lead to crashes when the content process died abnormally, and more stuff happened to be alive under ContentParent.

Anywho, we should take this ASAP; it's very bad.  I'm quite surprised the existing test_process_error hasn't caught crashes from this.
Comment 9 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-29 23:43:26 PDT
Created attachment 556764 [details] [diff] [review]
part 2: Make sure GetWindowsTable() always returns the table, even if an nsGlobalWindow hasn't been created yet

What was tickling the problem in part 1 (at least in my testing), is that the memory reporters were querying the windows ID table in the content process, before an nsGlobalWindow has been created.  So the static table returned is null and the memory reporters weren't testing for that.  I'm not a big fan of that initialization scheme, so I made the table getter do implicit init.

This bug isn't all that big of a deal IMHO, but we should definitely fix it.
Comment 10 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-29 23:45:20 PDT
Also: hm, I'm trying to think of a good way to write a test for that bug.  There's not really a good way in our current setup; we'd have to restart all of firefox or crash all the content processes first.  That may not be worth the effort.
Comment 11 Boris Zbarsky [:bz] (still a bit busy) 2011-08-29 23:56:50 PDT
As far as part 2 goes, see bug 681211.  But yes, making that function never return null may be better.
Comment 12 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-30 00:03:21 PDT
Shame that hasn't landed yet :/.  Which approach do you prefer?  I can land either patch ASAP, your call.
Comment 13 Nicholas Nethercote [:njn] 2011-08-30 00:03:56 PDT
> This bug isn't all that big of a deal IMHO, but we should definitely fix it.

Not having about:memory in Fennec is a bloody big deal, IMO :)
Comment 14 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-30 00:06:05 PDT
(In reply to Nicholas Nethercote [:njn] from comment #13)
> > This bug isn't all that big of a deal IMHO, but we should definitely fix it.
> 
> Not having about:memory in Fennec is a bloody big deal, IMO :)

The *particular bug* of crashing if an nsGlobalWindow hasn't been created before about:memory is loaded is what I was referring to.  But it's not really worth discussing, let's just fix it.  There are two patches to choose from now.
Comment 15 Josh Matthews [:jdm] 2011-08-30 06:09:37 PDT
Comment on attachment 556763 [details] [diff] [review]
part 1: Don't delete ContentParent out from under the cleanup process

My goodness.
Comment 16 Boris Zbarsky [:bz] (still a bit busy) 2011-08-30 13:18:43 PDT
Comment on attachment 556764 [details] [diff] [review]
part 2: Make sure GetWindowsTable() always returns the table, even if an nsGlobalWindow hasn't been created yet

We're going to do bug 667183 instead.
Comment 17 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-30 21:12:13 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/86f1809d81d7
Comment 18 Marco Bonardo [::mak] 2011-08-31 02:18:45 PDT
http://hg.mozilla.org/mozilla-central/rev/86f1809d81d7

resolving based on comment 16 saying part 2 is not going to land.
Comment 19 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-31 11:26:49 PDT
Correct, thanks.

Note You need to log in before you can comment on or make changes to this bug.