Closed Bug 559473 Opened 10 years ago Closed 10 years ago

Crash when destroying window after running test_composition_text_querycontent.xul

Categories

(Core :: Widget: Cocoa, defect, critical)

x86
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- beta1+

People

(Reporter: mats, Assigned: jfkthame)

References

Details

(Keywords: crash, intermittent-failure, regression)

Attachments

(1 file)

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271270362.1271271247.4581.gz
OS X 10.5.2 mozilla-central opt test mochitest-other on 2010/04/14 11:39:22

...
8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:04:55.230710
INFO | automation.py | Reading PID log: /var/folders/TL/TLg3RrMbFAur2hBCXvCeqk+++TM/-Tmp-/tmpA3WmoEpidlog
PROCESS-CRASH | automation.py | application crashed (minidump found)
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x315b60): poll timeout, looping
SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R-
Operating system: Mac OS X
                  10.5.2 9C31
CPU: x86
     GenuineIntel family 6 model 14 stepping 8
     2 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
Crash address: 0x1

Thread 0 (crashed)
 0  libobjc.A.dylib + 0x146e4
    eip = 0x9202b6e4   esp = 0xbfffca0c   ebp = 0xbfffca38   ebx = 0x92e68253
    esi = 0x0ddbacd0   edi = 0xa04d9e30   eax = 0x00000001   ecx = 0x920c7394
    edx = 0x00000001   efl = 0x00210202
    Found by: given as instruction pointer in context
 1  AppKit + 0x6f23d
    eip = 0x92e6823e   esp = 0xbfffca40   ebp = 0xbfffca68
    Found by: previous frame's frame pointer
 2  AppKit + 0x6f1e4
    eip = 0x92e681e5   esp = 0xbfffca70   ebp = 0xbfffcaa8
    Found by: previous frame's frame pointer
 3  AppKit + 0xa372b
    eip = 0x92e9c72c   esp = 0xbfffcab0   ebp = 0xbfffcad8
    Found by: previous frame's frame pointer
 4  AppKit + 0xa20c0
    eip = 0x92e9b0c1   esp = 0xbfffcae0   ebp = 0xbfffcb28
    Found by: previous frame's frame pointer
 5  XUL!-[BaseWindow dealloc] [nsCocoaWindow.mm:1b4a2702d2dd : 1898 + 0x23]
    eip = 0x02bf6336   esp = 0xbfffcb30   ebp = 0xbfffcb68
    Found by: previous frame's frame pointer
 6  Foundation + 0x50699
    eip = 0x9479569a   esp = 0xbfffcb70   ebp = 0xbfffcb98
    Found by: previous frame's frame pointer
 7  CoreFoundation + 0x7192a
    eip = 0x90b4092b   esp = 0xbfffcba0   ebp = 0xbfffcbd8
    Found by: previous frame's frame pointer
 8  CoreFoundation + 0x72891
    eip = 0x90b41892   esp = 0xbfffcbe0   ebp = 0xbfffd198
    Found by: previous frame's frame pointer
 9  CoreFoundation + 0x72d17
    eip = 0x90b41d18   esp = 0xbfffd1a0   ebp = 0xbfffd1d8
    Found by: previous frame's frame pointer
10  HIToolbox + 0x3069f
    eip = 0x94dd76a0   esp = 0xbfffd1e0   ebp = 0xbfffd218
    Found by: previous frame's frame pointer
11  HIToolbox + 0x303f1
    eip = 0x94dd73f2   esp = 0xbfffd220   ebp = 0xbfffd2a8
    Found by: previous frame's frame pointer
12  HIToolbox + 0x3032c
    eip = 0x94dd732d   esp = 0xbfffd2b0   ebp = 0xbfffd2c8
    Found by: previous frame's frame pointer
13  AppKit + 0x407d8
    eip = 0x92e397d9   esp = 0xbfffd2d0   ebp = 0xbfffd638
    Found by: previous frame's frame pointer
14  AppKit + 0x4008d
    eip = 0x92e3908e   esp = 0xbfffd640   ebp = 0xbfffd938
    Found by: previous frame's frame pointer
15  XUL!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.mm:1b4a2702d2dd : 688 + 0x42]
I found a crash on TryServer as well:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1271273235.1271285852.12048.gz
OS X 10.5.2 try hg unit test on 2010/04/14 12:27:15

This one is also while running test_composition_text_querycontent.xul
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271331472.1271333067.23746.gz
OS X 10.5.2 mozilla-central debug test mochitest-other on 2010/04/15 04:37:52
s: moz2-darwin9-slave42
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271356451.1271356991.8172.gz#err1
OS X 10.5.2 mozilla-central opt test mochitest-other on 2010/04/15 11:34:11
s: bm-xserve18
:sigh: We don't have OS symbols for releases that old, so I can't even manually resymbolize the stack accurately (and I don't suppose those machines have the OS crash reporter turned on in server/silent mode).

This is what I get down to the first Gecko frame when using the 10.5.6 symbols (the oldest OS symbols we have); the stack makes no sense to me, so I'm sure everything has shifted too much between 10.5.2 and 10.5.6.

0 +[List initialize]
1 +[NSColor selectedMenuItemTextColor]
2 +[NSColor selectedMenuItemTextColor]
3 _NSUniqueMenuItemNamesForFullPaths
4 _NXFreeImageCache
5  XUL!-[BaseWindow dealloc] [nsCocoaWindow.mm:1b4a2702d2dd : 1898 + 0x23]
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271392437.1271393967.22778.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitest-other on 2010/04/15 21:33:57
s: talos-r3-leopard-030
OK, here's a more reasonable stack (down to frame 18) from the 10.5.8 failure in comment 5.  

Aki didn't know what sort of security updates had been applied on that slave, so I just went with "10.5.8 9L31a with Security Update 2010-002" for symbols.  Security Updates and whatnot can make a difference, although frames 0-7 at least seem plausible now (9-17 should be right; that's a common pattern at the appshell-level).

 0  CoreFoundation _CFRelease
 1  CoreFoundation __CFDictionaryDeallocate
 2  CoreFoundation _CFRelease
 3  AppKit         -[NSCell dealloc]
 4  AppKit         -[NSActionCell dealloc]
 5  AppKit         -[NSTextFieldCell dealloc]
 6  AppKit         -[NSNextStepFrame dealloc]
 7  AppKit         -[NSWindow dealloc]
 8  XUL            -[BaseWindow dealloc] [nsCocoaWindow.mm:c4a3c240ab5e : 1904 + 0x23]
 9  Foundation     __delayedPerformCleanup
10  CoreFoundation CFRunLoopTimerInvalidate
11  CoreFoundation CFRunLoopRunSpecific
12  CoreFoundation CFRunLoopRunInMode
13  HIToolbox      RunCurrentEventLoopInMode
14  HIToolbox      ReceiveNextEventCommon
15  HIToolbox      BlockUntilNextEventMatchingListInMode
16  AppKit         _DPSNextEvent
17  AppKit         -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
18  XUL            nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.mm:c4a3c240ab5e : 676 + 0x3b]
I think one of the suspected checkins was from bug 559180:

http://hg.mozilla.org/mozilla-central/rev/a8fd7c7e46ac
Although the 10.6 test boxes aren't visible in the normal tree yet, they are in &noignore=1, where you can see things like

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271511934.1271512552.15510.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/17 06:45:34  
s: talos-r3-snow-012

8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
firefox-bin(354,0x103663be0) malloc: *** error for object 0x11a3ebea8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271466698.1271467325.7374.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 18:11:38
s: talos-r3-snow-012

8302 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271425073.1271425693.27282.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 06:37:53
s: talos-r3-snow-007

8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
firefox-bin(360,0x103654be0) malloc: *** error for object 0x11aa4df60: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271556152.1271556770.32222.gz
s: talos-r3-snow-010

8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271593890.1271594502.19877.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/18 05:31:30
s: talos-r3-snow-012

8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
I've been repeatedly running mochitest-chrome locally to try and look into this. I've seen several crashes, though none that I can consistently reproduce; however, it does seem clear that there's some kind of heap damage, possibly related to management of Foundation/CF objects, that is leading to intermittent crashes.

If we could catch the malloc error (comment #8, #10) in a debugger and get a stack there, that might be interesting; so far, I have not run into this, however.
Running locally, I can't reproduce either the crash (on OS X 10.5.8)
or the malloc error (on OS 10.6.3) -- running
test_composition_text_querycontent.xul either by itself or together
with all the other Chrome tests.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271708332.1271710840.9720.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/19 13:18:52
s: talos-r3-snow-004

8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
blocking2.0: --- → ?
Keywords: regression
blocking2.0: ? → beta1+
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271720732.1271723404.8845.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/19 16:45:32
s: talos-r3-snow-010
8330 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul...
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
It seems that this is triggered by the patch in bug 559180, which provided local autorelease pools at a number of places during widget teardown, to prevent leaks that otherwise occur while the hidden window is being destroyed (outside the event loop).

It's not clear exactly what is causing the crashes, which are quite sporadic; my guess is that something ends up in one of these local autorelease pools and then gets destroyed when the pool goes out of scope, but then some other part of the code actually tries to use the object again. Given the unpredictable occurrence of the crashes, I wonder if it's related to something like GC getting triggered at an unfortunate time? Difficult to track down without a consistently reproducible testcase, or a record/replay setup.

I've tried running with the patch from bug 559180 backed-out, and this did seem to stop the crashes (in numerous test runs). Of course, the downside is that we then leak a few hundred Cocoa objects that are created during widget teardown, and leak because there's no pool to catch them.

We could argue that leaking during shutdown is irrelevant, as the entire process is about to go away anyhow, but the resulting __NSAutoreleaseNoPool messages - besides being alarming and cluttering the logs - may also swamp any new leaks that we want to be notified about.

A possible fix may be to replace the localized pools from bug 559180, which were placed just around the Cocoa methods that lead to leakage, with a single pool covering the entire shutdown process. I think this is more comparable to the level where we use pools during the normal event loop, and makes it less likely that an object will be "caught" by a temporary, local pool and then released prematurely.
This patch effectively backs-out the changes from bug 559180, and replaces the low-level pools there with a single, higher-level autorelease pool.

In repeated test runs locally, I have not seen any further crashes once this is applied.
Attachment #440189 - Flags: review?(joshmoz)
Comment on attachment 440189 [details] [diff] [review]
patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown

I'd also like Steven's opinion on this patch. It would be good to know exactly why it works so we don't make this mistake again.
Attachment #440189 - Flags: review?(smichaud)
It is possible that something is not being retained enough times but it didn't matter before because it was leaking. When the "regressing" patch landed the objected was used after the release pool sent the fatal release message, and with the new patch which has the release pool further down the stack the object stays alive long enough to avoid the bad access.

Just a theory :)
(In reply to comment #20)
> It is possible that something is not being retained enough times but it didn't
> matter before because it was leaking. When the "regressing" patch landed the
> objected was used after the release pool sent the fatal release message, and
> with the new patch which has the release pool further down the stack the object
> stays alive long enough to avoid the bad access.

Yes, I've also wondered about scenarios like this. I'm not at all sure the "regressing" patch is really "wrong", it may just have encouraged a latent problem to surface more readily. But I don't have any good ideas on how to track that down more precisely. The error need not be directly related to where the crash occurs (which is typically within several layers of Cocoa/CoreFoundation code), as this may be caused by an earlier memory management error.
Comment on attachment 440189 [details] [diff] [review]
patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown

Our shutdown sequence is fragile and unpredictable -- so, for example,
it's not currently possible to know for sure when it has finished.
(See bug 384284, and references to it in bug 491122 comment #8 and bug
539169 comment #2.)

So Jonathan's patch might not completely get rid of "no autorelease
pool" leaks on shutdown.  And it might even trigger crashes
(presumably different ones).  But the autorelease pool he proposes
will certainly cover a lot more of the shutdown sequence than does the
patch for bug 559180, and so I think it's much less likely to cause
trouble.

I agree that the patch for bug 559180 only triggers this bug's crash
-- that the real cause is some kind of memory mismanagement.  But
we're not going to be able to find the cause until we can reliably
reproduce its bad effects.
Attachment #440189 - Flags: review?(smichaud) → review+
Note:  In our tests of the shutdown sequence, we should try quitting
from the Dock menu and logging out while Minefield is still running.
Both of those substantially change how the shutdown works.
I've tried quitting from the Dock, and have not seen any ill effects (either leak warnings or crashes!). Haven't tested quitting via Logout yet but I'll give that a try.

Where I do see residual __NSAutoreleaseNoPool messages with this patch is on the first run of a new version (e.g. after I've been running FF 3.6) and the Addons Compatibility checking dialog is shown, or when using the GUI profile manager at startup. Several autoreleased objects are leaked from calls below nsCocoaWindow::DestroyNativeWindow. The patch from bug 559180 fixed these because a pool was added in that function, but this is one of the points under suspicion as a possible trigger of the crashes.

(This is similar to bug 559400, though that was a more limited fix for a startup-dialog issue that was not covered by bug 559180.)

A sample call stack for these leaks:

#0  0x00007fff85980de4 in __NSAutoreleaseNoPool ()
#1  0x00007fff858cdd19 in _CFAutoreleasePoolAddObject ()
#2  0x00007fff858cda86 in -[NSObject(NSObject) autorelease] ()
#3  0x00007fff87e3e35b in -[NSNotificationCenter(NSNotificationCenterAppKitExtensions) _removeObserver:notificationNamesAndSelectorNames:object:] ()
#4  0x00007fff87e6e31f in -[NSWindow setDelegate:] ()
#5  0x0000000100ac35ae in nsCocoaWindow::DestroyNativeWindow ()
#6  0x0000000100ac541a in nsCocoaWindow::~nsCocoaWindow ()
#7  0x0000000100af1a15 in nsBaseWidget::Release ()
#8  0x0000000100931a4b in nsXULWindow::ShowModal ()
#9  0x0000000100906394 in nsWindowWatcher::OpenWindowJSInternal ()
#10 0x0000000100906fe4 in nsWindowWatcher::OpenWindow ()
#11 0x0000000100bcf162 in NS_InvokeByIndex_P ()
#12 0x000000010004b741 in XPCWrappedNative::CallMethod ()
#13 0x0000000100051a1a in XPC_WN_CallMethod ()
#14 0x0000000101bfbfe0 in js_Invoke ()
#15 0x0000000101bea86f in js_Interpret ()
#16 0x0000000101bfc2a7 in js_Invoke ()
#17 0x000000010004748f in nsXPCWrappedJSClass::CallMethod ()
#18 0x0000000100bd03d5 in PrepareAndDispatch ()
#19 0x0000000100bcf26b in SharedStub ()
#20 0x0000000100010a8e in XRE_main ()
#21 0x0000000100001a3e in main ()

We could catch these with a pool somewhere up the call stack, though I'm not sure where is the most sensible place - or we could decide (at least for the time being) to ignore them as relatively harmless, given that they don't occur in the course of normal runs. My suggestion would be to try landing this patch first, and let it bake for a while. Shutdown leaks during a version-upgrade restart are a secondary issue that we can tackle as a followup once the crashing is resolved.
Comment on attachment 440189 [details] [diff] [review]
patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown

We should take this to solve our immediate problems but I'm not really happy with it in that it is a cover-up technique much like the original root pool was. Obviously much smaller and less harmful scope though.
Attachment #440189 - Flags: review?(joshmoz) → review+
Assignee: nobody → jfkthame
(Following up comment #23)

We should also test this patch in Camino, which had weird problems (on shutdown) with the original root autorelease pool.  See bug 559047 comment #2.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271832263.1271832944.26112.gz
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/20 23:44:23
s: talos-r3-snow-012
TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run

This looks like comments 9, 11, 12,... However, I don't see a crash in the log, like in the original report, so I'm not sure it's really the same issue. I'm just trying to fix the crasher; if these other failures persist after that, we should probably open a new bug on them.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 438871
Which of course has nothing whatsoever to do with this bug.
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.