Closed Bug 519438 Opened 15 years ago Closed 15 years ago

Frequent crashes in nsAppShell::ProcessNextNativeEvent after mozilla::storage::Connection::Close()

Categories

(Toolkit :: Storage, defect, P2)

1.9.2 Branch
x86
Linux
defect

Tracking

()

RESOLVED DUPLICATE of bug 528708

People

(Reporter: johnath, Unassigned)

References

Details

(Keywords: crash)

The stack below has been crashing talos regularly since last Friday (Sep 25) on mozilla-1.9.2 linux, apparently without a bug being filed.

The crash happens during a dirty profile Ts test (ts_places_generated_max) leading Mak, Ted and I to speculate that this was a latent crash that the tests have exposed. However, bhearsum confirms that the dirty profile tests have been running longer than that which suggests to me that, instead, a recent checkin caused it.

I can't find any instances of this crash earlier than Wan-Teh's NSPR update here: http://hg.mozilla.org/releases/mozilla-1.9.2/rev/7a6b4aea8917  but since it's intermittent, it's not clear whether or not that commit is the culprit.

Filing this against Storage as the direct trigger of the crash, and because of the suspicious connection to large profiles. Who can take a look?

Instances I know of:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1253933045.1253934285.31717.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254030837.1254032087.23904.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254051540.1254052785.7507.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254144714.1254145997.25714.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254188158.1254189435.5281.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254224123.1254225413.11654.gz&fulltext=1

Stack:

0  libgtk-x11-2.0.so.0.1200.0 + 0x231c8b
    eip = 0xb6eaec8b   esp = 0xbfaa2e24   ebp = 0xbfaa2e9c   ebx = 0xb6ffe0bc
    esi = 0xb041cb80   edi = 0xb10a5cf0   eax = 0xb5e55820   ecx = 0xb041cb80
    edx = 0x656c6966   efl = 0x00010206
 1  libgtk-x11-2.0.so.0.1200.0 + 0x231f90
    eip = 0xb6eaef91   esp = 0xbfaa2ea4   ebp = 0xbfaa2ebc
 2  libgdk-x11-2.0.so.0.1200.0 + 0x175e7
    eip = 0xb6bf25e8   esp = 0xbfaa2ec4   ebp = 0xbfaa2edc
 3  libglib-2.0.so.0.1400.1 + 0x2f8d5
    eip = 0xb6a388d6   esp = 0xbfaa2ee4   ebp = 0xbfaa2f0c
 4  libglib-2.0.so.0.1400.1 + 0x2f11b
    eip = 0xb6a3811c   esp = 0xbfaa2f14   ebp = 0xbfaa2f6c
 5  libglib-2.0.so.0.1400.1 + 0x3255e
    eip = 0xb6a3b55f   esp = 0xbfaa2f74   ebp = 0xbfaa2fcc
 6  libglib-2.0.so.0.1400.1 + 0x32ac4
    eip = 0xb6a3bac5   esp = 0xbfaa2fd4   ebp = 0xbfaa2fec
 7  libxul.so!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.cpp:7a6b4aea8917 : 147 + 0x9]
    eip = 0xb7adea44   esp = 0xbfaa2ff4   ebp = 0xbfaa300c
 8  libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(int) [nsBaseAppShell.cpp:7a6b4aea8917 : 151 + 0xa]
    eip = 0xb7af2eb4   esp = 0xbfaa3014   ebp = 0xbfaa302c
 9  libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int) [nsBaseAppShell.cpp:7a6b4aea8917 : 278 + 0x9]
    eip = 0xb7af333d   esp = 0xbfaa3034   ebp = 0xbfaa305c
10  libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:7a6b4aea8917 : 508 + 0x1c]
    eip = 0xb7bbc30e   esp = 0xbfaa3064   ebp = 0xbfaa309c
11  libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd]
    eip = 0xb7b8afbb   esp = 0xbfaa30a4   ebp = 0xbfaa30cc
12  libxul.so!nsThread::Shutdown() [nsThread.cpp:7a6b4aea8917 : 468 + 0xb]
    eip = 0xb7bbc8f8   esp = 0xbfaa30d4   ebp = 0xbfaa310c
13  libxul.so!mozilla::storage::Connection::Close() [mozStorageConnection.cpp:7a6b4aea8917 : 500 + 0x8]
    eip = 0xb7a84093   esp = 0xbfaa3114   ebp = 0xbfaa313c
14  libxul.so!NS_GetXPTCallStub_P + 0x30
    eip = 0xb7bc94c3   esp = 0xbfaa3144   ebp = 0xbfaa3158
15  libxul.so!XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) [xpcwrappednative.cpp:7a6b4aea8917 : 2710 + 0x20]
    eip = 0xb731cdb8   esp = 0xbfaa3160   ebp = 0xbfaa3458
16  libxul.so!XPC_WN_CallMethod(JSContext*, JSObject*, unsigned int, int*, int*) [xpcwrappednativejsops.cpp:7a6b4aea8917 : 1740 + 0x9]
    eip = 0xb7323ca1   esp = 0xbfaa3460   ebp = 0xbfaa3528
17  libmozjs.so!js_Invoke [jsinterp.cpp:7a6b4aea8917 : 1373 + 0x27]
    eip = 0xb709a619   esp = 0xbfaa3530   ebp = 0xbfaa35f8
18  libmozjs.so!js_Interpret [jsops.cpp:7a6b4aea8917 : 2199 + 0x1d]
    eip = 0xb708b0af   esp = 0xbfaa3600   ebp = 0xbfaa3858
19  libmozjs.so!js_Invoke [jsinterp.cpp:7a6b4aea8917 : 1381 + 0xa]
    eip = 0xb709a7eb   esp = 0xbfaa3860   ebp = 0xbfaa3928
20  libxul.so!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) [xpcwrappedjsclass.cpp:7a6b4aea8917 : 1671 + 0x1a]
    eip = 0xb731a741   esp = 0xbfaa3930   ebp = 0xbfaa3b98
21  libxul.so!nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) [xpcwrappedjs.cpp:7a6b4aea8917 : 570 + 0x13]
    eip = 0xb7315b67   esp = 0xbfaa3ba0   ebp = 0xbfaa3bc8
22  libxul.so!PrepareAndDispatch [xptcstubs_gcc_x86_unix.cpp:7a6b4aea8917 : 95 + 0x1d]
    eip = 0xb7bc9fd9   esp = 0xbfaa3bd0   ebp = 0xbfaa3c98
23  libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:7a6b4aea8917 : 527 + 0xa]
    eip = 0xb7bbc368   esp = 0xbfaa3ca0   ebp = 0xbfaa3cd8
24  libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd]
    eip = 0xb7b8afbb   esp = 0xbfaa3ce0   ebp = 0xbfaa3d08
25  libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp:7a6b4aea8917 : 170 + 0x9]
    eip = 0xb7af2fd6   esp = 0xbfaa3d10   ebp = 0xbfaa3d28
26  libxul.so!nsAppStartup::Run() [nsAppStartup.cpp:7a6b4aea8917 : 182 + 0x5]
    eip = 0xb79c341e   esp = 0xbfaa3d30   ebp = 0xbfaa3d48
27  libxul.so!XRE_main [nsAppRunner.cpp:7a6b4aea8917 : 3479 + 0xb]
    eip = 0xb72e80ea   esp = 0xbfaa3d50   ebp = 0xbfaa43c8
28  firefox-bin!main [nsBrowserApp.cpp:7a6b4aea8917 : 156 + 0xe]
    eip = 0x080495aa   esp = 0xbfaa43d0   ebp = 0xbfaa4428
29  libc-2.6.1.so + 0x1604f
    eip = 0xb65fa050   esp = 0xbfaa4430   ebp = 0xbfaa4498
Mak points out that this may be related to Jesse's bug 496019. I'll add the dependency so that they're co-tracked - the stacks are different, but may point to a similar underlying problem.
Depends on: 496019
Assignee: nobody → mak77
assigned to marco for investigation, due to Places involvement.
Blocks: 438871
i was able to start the max dirty profile talos test on linux, but for now i've not been able to crash, i will try with more runs.
Marco - if you can't reproduce it locally, are there things you could land on trunk to help debug - extra logging, for instance? This one is pretty hot.
Hard to tell, this is something deep in how storage handles threads during db closing, and nsAppShell is not exactly my land. Also could be storage i only activating the native event loop, but we are crashing due to the actual event that gets executed (that maybe has a different handling from trunk where we don't crash). If i could reproduce i could at least ensure if fixing bug 496019 would actually fix this. It took a bit of time to run dirty tests on Linux due to lack of a goot howto documentation. i tried with a 100MB db i was unable to reproduce anything, will try with larger dbs.
this is on mozilla-central and looks like the same, so this crash is cross-branches

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1254756006.1254757288.12591.gz
Linux mozilla-central talos dirty on 2009/10/05 08:20:06

but quite more rare on central.
as a reference the max places.sqlite generated files that are used by max dirty profiles tests are about 210MB as of today.
This is very likely bug 469019.  I'll try to get a patch for it today.
(In reply to comment #18)
> This is very likely bug 469019.  I'll try to get a patch for it today.
Of course, I meant bug 496019.
sadly, i was unable to locally reproduce with a 210MB db too.
I'm not sure if this crash is related.  Storage is not, as far as I can tell, implicated in the crash, however there are 5 frames at the bottom (top? business-end?) of the callstack that are identical to the callstacks from other logs in this bug.  FWIW.

 3  libxul.so!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.cpp:4c58fd598abb : 147 + 0x9]
    eip = 0xb7aae298   esp = 0xbf871ad0   ebp = 0xbf871ae8
 4  libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(int) [nsBaseAppShell.cpp:4c58fd598abb : 151 + 0xa]
    eip = 0xb7ac2a0c   esp = 0xbf871af0   ebp = 0xbf871b08
 5  libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int) [nsBaseAppShell.cpp:4c58fd598abb : 296 + 0xa]
    eip = 0xb7ac2ba3   esp = 0xbf871b10   ebp = 0xbf871b38
 6  libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:4c58fd598abb : 508 + 0x1c]
    eip = 0xb7b8b3da   esp = 0xbf871b40   ebp = 0xbf871b78
 7  libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd]
    eip = 0xb7b5ba44   esp = 0xbf871b80   ebp = 0xbf871ba8


http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1253933045.1253934285.31717.gz&fulltext=1
Curtis, iirc that's another bug, not related to this, i'd say bug 505572 maybe?
(In reply to comment #26)
> Curtis, iirc that's another bug, not related to this, i'd say bug 505572 maybe?

Now that the log on 505572 is gone, it's hard to say.

The similarities in the call stack looked a little bit suspicious.  It's definitely not the exact same thing, but if does happen to be related it could have some diagnostic value.
Right, the top of the stack is just "the event loop". If storage isn't on the stack, it's not this bug, but it could be a similar crash, with something else spinning an event loop.
Flags: blocking1.9.2?
This is linux only, and it's crashing in glib/gtk.  Do we have somebody familiar with these libraries who could take a look at this?
Why is this nominated for blocking? Hasn't been a lot of activity lately, and while test issues are annoying, unless they block the active development of something I have a hard time blocking on it in particular.

Basically: why do we feel this needs to be fixed in order to release Firefox 3.6?
(In reply to comment #37)
> Why is this nominated for blocking? Hasn't been a lot of activity lately,

If it had been blocking+'d when I nominated it three weeks ago, there might have been some activity.

> and
> while test issues are annoying, unless they block the active development of
> something I have a hard time blocking on it in particular.

If the test is crashing, the browser is probably crashing too.

> 
> Basically: why do we feel this needs to be fixed in order to release Firefox
> 3.6?

The browser is crashing on shutdown, while it didn't use to.
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P2
I don't think this should block.  The reason this was given blocking was "the test is crashing, the browser is probably crashing too" but we have no data of this crash on crash-stats.  That doesn't mean users may not be crashing in it, but it does mean it doesn't happen as much as some of our other crashes which are not blockers.

This crash is in gtk code, which is called in a nested event loop.  We have nested event loops in more than just this particular bit of code.  This crash is either a bug in gtk (given the stacks), or it's a bug we have in calling out to gtk in our native event loop processing.

Have we ruled out that this isn't a new crash because we updated system libraries on our tinderboxes?  Have we looked into getting symbols for the system calls to get a better idea of the failure?  I asked for help in comment 31 for someone more familiar with linux and have gotten deafening silence in return.
Flags: blocking1.9.2+ → blocking1.9.2?
why does it crash?
Flags: blocking1.9.2? → blocking1.9.2+
it doesn't mean that it doesn't happen as often to our users as other crashes -- what do we know about what has to happen to trigger the crash?  if using newer GTK causes it, we have a ticking clock.

I don't think we need kernel symbols to make progress here, but a core dump and debuginfo for the system libraries would let us at least triage on the basis of some more data about what triggers the problem.

To solicit Linux help, I do suggest asking for it at the dev meeting, or a newsgroup post, or mailing people directly.  They may well not be already cc:d on the bug!
(In reply to comment #42)
> To solicit Linux help, I do suggest asking for it at the dev meeting, or a
> newsgroup post, or mailing people directly.  They may well not be already cc:d
> on the bug!
Who?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1257446852.1257448188.27999.gz
Linux mozilla-1.9.2 talos dirty on 2009/11/05 10:47:32

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1257448330.1257449615.12873.gz
Linux mozilla-1.9.2 talos dirty on 2009/11/05 11:12:10

Two cycles in a row, but now it's gone again.
I've requested debuginfo for the gdk/gtk libraries in bug 527223.
bug 496019 needs to land on 1.9.2, then we can reevaluate.

We need someone to get those gdk/gtk stack traces on the tinderboxes, otherwise we know that something crashes in image libraries but we can't tell what.
Depends on: amo-mxr
are we sure this is linux only?

the fx 3.6b1 stack in
http://crash-stats.mozilla.com/report/index/f8d60064-092f-4780-808c-b13712091107
looks a lot like comment zero and even more like comment 25.   Its not clear to me from comment 27 and 28 that a stack looking like this is not connected or could aid in debugging.


 	Module  	Signature [Expand]  	Source
0 	xul.dll 	nsAppShell::ProcessNextNativeEvent 	widget/src/windows/nsAppShell.cpp:181
1 	winmm.dll 	timeGetTime 	
2 	xul.dll 	nsBaseAppShell::OnProcessNextEvent 	widget/src/xpwidgets/nsBaseAppShell.cpp:296
3 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:508
4 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:170
5 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/src/nsAppStartup.cpp:182
6 	nspr4.dll 	PR_GetEnv 	
7 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:110
8 	firefox.exe 	__tmainCRTStartup 	obj-firefox/memory/jemalloc/crtsrc/crtexe.c:591
9 	kernel32.dll 	BaseProcessStart


these windows crashes are just outside the top 100 and we over 50 or those everyday.

distribution of all versions where the nsAppShell::ProcessNextNativeEvent crash was found on 20091106-crashdata.csv
  29 Firefox 3.5.4
  15 Firefox 3.0.15
  12 Firefox 3.5.5
   3 Firefox 3.5
   2 Firefox 3.5.3
   1 Firefox 3.6b1
   1 Firefox 3.5.1
   1 Firefox 3.0.7
   1 Firefox 3.0.14
   1 Firefox 3.0.13

If these are not connected let me know and I'll spin off another bug or bugs.
(In reply to comment #49)
> are we sure this is linux only?
pretty sure

> the fx 3.6b1 stack in
> http://crash-stats.mozilla.com/report/index/f8d60064-092f-4780-808c-b13712091107
> looks a lot like comment zero and even more like comment 25.   Its not clear to
> me from comment 27 and 28 that a stack looking like this is not connected or
> could aid in debugging.
These look very different to me.  You should file a new bug that crash.
Severity: normal → critical
any update here?
(In reply to comment #54)
> any update here?
Shaawn is still working on the patch for connection::close

Here we still need someone that could take a stack in libgtk on the talos box.
I'm unassigning from me, i can't do anything more here apart pinging here and there.

Shawn's patch could solve this not spinning the events loops but clearly this won't solve the crash, will just avoid firing it after connection::close()

Someone with gtk knowledge and access to talos machine should own this at this point.
Assignee: mak77 → nobody
Hi Karl, in the platform meeting they said you might know how to debug this?
It was mentioned in the platform meeting that this type of crash might apply to any nested event loop.  Is there any reason to believe there's a more general defect here than the one covered by bug 496019?
(In reply to comment #64)
> It was mentioned in the platform meeting that this type of crash might apply to
> any nested event loop.  Is there any reason to believe there's a more general
> defect here than the one covered by bug 496019?
I don't see how spinning the event loop in storage would be any different than spinning it elsewhere.  Fixing bug 496019 will likely make this crash no longer noticeable to us though...
filed bug 529380 to remove this behavior from Weave.
Blocks: 526391
Was the stack in comment 0 on this platform (Ubuntu 7.04 Desktop)
https://wiki.mozilla.org/ReferencePlatforms/Test/Linux

It should be possible to get the corresponding debug symbol packages and manually resolve the offsets into objects.

addr2line -j .text -C -f -i -e /usr/lib/debug/usr/lib/libgtk-x11-2.0.so.0.1200.0.debug 0x231c8b
etc.
(In reply to comment #68)
> Was the stack in comment 0 on this platform (Ubuntu 7.04 Desktop)
> https://wiki.mozilla.org/ReferencePlatforms/Test/Linux

The answer is no, because that release would have gtk+-2.10.11, but this is gtk+-2.12.0.

These look like good candidates:
https://launchpad.net/ubuntu/+source/gtk+2.0/2.12.0-1ubuntu3/+build/403567
https://launchpad.net/ubuntu/+source/glib2.0/2.14.1-1ubuntu1/+build/394621
(In reply to comment #66)
> (In reply to comment #64)
> > It was mentioned in the platform meeting that this type of crash might apply to
> > any nested event loop.  Is there any reason to believe there's a more general
> > defect here than the one covered by bug 496019?
> I don't see how spinning the event loop in storage would be any different than
> spinning it elsewhere. 

indeed for example bug 528708 is a trunk libgtk crash pretty similar to this (in dirty talos), and is not due to storage spinning the event loop.
Depends on: 528708
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Depends on: 529786
No longer blocks: 438871
Whiteboard: [orange]
No longer depends on: 528708, 529786
No longer blocks: 526391
This no longer blocks for the same reasons bug 528708 doesn't block.
Flags: blocking1.9.2+ → blocking1.9.2-
You need to log in before you can comment on or make changes to this bug.