Closed
Bug 559473
Opened 14 years ago
Closed 14 years ago
Crash when destroying window after running test_composition_text_querycontent.xul
Categories
(Core :: Widget: Cocoa, defect)
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
blocking2.0 | --- | beta1+ |
People
(Reporter: MatsPalmgren_bugz, Assigned: jfkthame)
References
Details
(Keywords: crash, intermittent-failure, regression)
Attachments
(1 file)
9.15 KB,
patch
|
jaas
:
review+
smichaud
:
review+
|
Details | Diff | Splinter Review |
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271270362.1271271247.4581.gz OS X 10.5.2 mozilla-central opt test mochitest-other on 2010/04/14 11:39:22 ... 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run INFO | automation.py | Application ran for: 0:04:55.230710 INFO | automation.py | Reading PID log: /var/folders/TL/TLg3RrMbFAur2hBCXvCeqk+++TM/-Tmp-/tmpA3WmoEpidlog PROCESS-CRASH | automation.py | application crashed (minidump found) SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- SSLTUNNEL(0x315b60): poll timeout, looping SSLTUNNEL(0x315b60): polling flags csock(0)=--, ssock(1)=R- Operating system: Mac OS X 10.5.2 9C31 CPU: x86 GenuineIntel family 6 model 14 stepping 8 2 CPUs Crash reason: EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE Crash address: 0x1 Thread 0 (crashed) 0 libobjc.A.dylib + 0x146e4 eip = 0x9202b6e4 esp = 0xbfffca0c ebp = 0xbfffca38 ebx = 0x92e68253 esi = 0x0ddbacd0 edi = 0xa04d9e30 eax = 0x00000001 ecx = 0x920c7394 edx = 0x00000001 efl = 0x00210202 Found by: given as instruction pointer in context 1 AppKit + 0x6f23d eip = 0x92e6823e esp = 0xbfffca40 ebp = 0xbfffca68 Found by: previous frame's frame pointer 2 AppKit + 0x6f1e4 eip = 0x92e681e5 esp = 0xbfffca70 ebp = 0xbfffcaa8 Found by: previous frame's frame pointer 3 AppKit + 0xa372b eip = 0x92e9c72c esp = 0xbfffcab0 ebp = 0xbfffcad8 Found by: previous frame's frame pointer 4 AppKit + 0xa20c0 eip = 0x92e9b0c1 esp = 0xbfffcae0 ebp = 0xbfffcb28 Found by: previous frame's frame pointer 5 XUL!-[BaseWindow dealloc] [nsCocoaWindow.mm:1b4a2702d2dd : 1898 + 0x23] eip = 0x02bf6336 esp = 0xbfffcb30 ebp = 0xbfffcb68 Found by: previous frame's frame pointer 6 Foundation + 0x50699 eip = 0x9479569a esp = 0xbfffcb70 ebp = 0xbfffcb98 Found by: previous frame's frame pointer 7 CoreFoundation + 0x7192a eip = 0x90b4092b esp = 0xbfffcba0 ebp = 0xbfffcbd8 Found by: previous frame's frame pointer 8 CoreFoundation + 0x72891 eip = 0x90b41892 esp = 0xbfffcbe0 ebp = 0xbfffd198 Found by: previous frame's frame pointer 9 CoreFoundation + 0x72d17 eip = 0x90b41d18 esp = 0xbfffd1a0 ebp = 0xbfffd1d8 Found by: previous frame's frame pointer 10 HIToolbox + 0x3069f eip = 0x94dd76a0 esp = 0xbfffd1e0 ebp = 0xbfffd218 Found by: previous frame's frame pointer 11 HIToolbox + 0x303f1 eip = 0x94dd73f2 esp = 0xbfffd220 ebp = 0xbfffd2a8 Found by: previous frame's frame pointer 12 HIToolbox + 0x3032c eip = 0x94dd732d esp = 0xbfffd2b0 ebp = 0xbfffd2c8 Found by: previous frame's frame pointer 13 AppKit + 0x407d8 eip = 0x92e397d9 esp = 0xbfffd2d0 ebp = 0xbfffd638 Found by: previous frame's frame pointer 14 AppKit + 0x4008d eip = 0x92e3908e esp = 0xbfffd640 ebp = 0xbfffd938 Found by: previous frame's frame pointer 15 XUL!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.mm:1b4a2702d2dd : 688 + 0x42]
Reporter | ||
Comment 1•14 years ago
|
||
I found a crash on TryServer as well: http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1271273235.1271285852.12048.gz OS X 10.5.2 try hg unit test on 2010/04/14 12:27:15 This one is also while running test_composition_text_querycontent.xul
Comment 2•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271331472.1271333067.23746.gz OS X 10.5.2 mozilla-central debug test mochitest-other on 2010/04/15 04:37:52 s: moz2-darwin9-slave42
Comment 3•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271356451.1271356991.8172.gz#err1 OS X 10.5.2 mozilla-central opt test mochitest-other on 2010/04/15 11:34:11 s: bm-xserve18
:sigh: We don't have OS symbols for releases that old, so I can't even manually resymbolize the stack accurately (and I don't suppose those machines have the OS crash reporter turned on in server/silent mode). This is what I get down to the first Gecko frame when using the 10.5.6 symbols (the oldest OS symbols we have); the stack makes no sense to me, so I'm sure everything has shifted too much between 10.5.2 and 10.5.6. 0 +[List initialize] 1 +[NSColor selectedMenuItemTextColor] 2 +[NSColor selectedMenuItemTextColor] 3 _NSUniqueMenuItemNamesForFullPaths 4 _NXFreeImageCache 5 XUL!-[BaseWindow dealloc] [nsCocoaWindow.mm:1b4a2702d2dd : 1898 + 0x23]
Comment 5•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271392437.1271393967.22778.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitest-other on 2010/04/15 21:33:57 s: talos-r3-leopard-030
OK, here's a more reasonable stack (down to frame 18) from the 10.5.8 failure in comment 5. Aki didn't know what sort of security updates had been applied on that slave, so I just went with "10.5.8 9L31a with Security Update 2010-002" for symbols. Security Updates and whatnot can make a difference, although frames 0-7 at least seem plausible now (9-17 should be right; that's a common pattern at the appshell-level). 0 CoreFoundation _CFRelease 1 CoreFoundation __CFDictionaryDeallocate 2 CoreFoundation _CFRelease 3 AppKit -[NSCell dealloc] 4 AppKit -[NSActionCell dealloc] 5 AppKit -[NSTextFieldCell dealloc] 6 AppKit -[NSNextStepFrame dealloc] 7 AppKit -[NSWindow dealloc] 8 XUL -[BaseWindow dealloc] [nsCocoaWindow.mm:c4a3c240ab5e : 1904 + 0x23] 9 Foundation __delayedPerformCleanup 10 CoreFoundation CFRunLoopTimerInvalidate 11 CoreFoundation CFRunLoopRunSpecific 12 CoreFoundation CFRunLoopRunInMode 13 HIToolbox RunCurrentEventLoopInMode 14 HIToolbox ReceiveNextEventCommon 15 HIToolbox BlockUntilNextEventMatchingListInMode 16 AppKit _DPSNextEvent 17 AppKit -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] 18 XUL nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.mm:c4a3c240ab5e : 676 + 0x3b]
I think one of the suspected checkins was from bug 559180: http://hg.mozilla.org/mozilla-central/rev/a8fd7c7e46ac
Comment 8•14 years ago
|
||
Although the 10.6 test boxes aren't visible in the normal tree yet, they are in &noignore=1, where you can see things like http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271511934.1271512552.15510.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/17 06:45:34 s: talos-r3-snow-012 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... firefox-bin(354,0x103663be0) malloc: *** error for object 0x11a3ebea8: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run
Comment 9•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271466698.1271467325.7374.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 18:11:38 s: talos-r3-snow-012 8302 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
Comment 10•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271425073.1271425693.27282.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 06:37:53 s: talos-r3-snow-007 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... firefox-bin(360,0x103654be0) malloc: *** error for object 0x11aa4df60: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug TEST-UNEXPECTED-FAIL | automation.py | Exited with code -6 during test run
Comment 11•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271556152.1271556770.32222.gz s: talos-r3-snow-010 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
Comment 12•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271593890.1271594502.19877.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/18 05:31:30 s: talos-r3-snow-012 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
Assignee | ||
Comment 13•14 years ago
|
||
I've been repeatedly running mochitest-chrome locally to try and look into this. I've seen several crashes, though none that I can consistently reproduce; however, it does seem clear that there's some kind of heap damage, possibly related to management of Foundation/CF objects, that is leading to intermittent crashes. If we could catch the malloc error (comment #8, #10) in a debugger and get a stack there, that might be interesting; so far, I have not run into this, however.
Comment 14•14 years ago
|
||
Running locally, I can't reproduce either the crash (on OS X 10.5.8) or the malloc error (on OS 10.6.3) -- running test_composition_text_querycontent.xul either by itself or together with all the other Chrome tests.
Comment 15•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271708332.1271710840.9720.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/19 13:18:52 s: talos-r3-snow-004 8304 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
Reporter | ||
Updated•14 years ago
|
blocking2.0: --- → ?
Keywords: regression
Comment 16•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271720732.1271723404.8845.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/19 16:45:32 s: talos-r3-snow-010 8330 INFO Running chrome://mochikit/content/chrome/widget/tests/test_composition_text_querycontent.xul... TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run
Assignee | ||
Comment 17•14 years ago
|
||
It seems that this is triggered by the patch in bug 559180, which provided local autorelease pools at a number of places during widget teardown, to prevent leaks that otherwise occur while the hidden window is being destroyed (outside the event loop). It's not clear exactly what is causing the crashes, which are quite sporadic; my guess is that something ends up in one of these local autorelease pools and then gets destroyed when the pool goes out of scope, but then some other part of the code actually tries to use the object again. Given the unpredictable occurrence of the crashes, I wonder if it's related to something like GC getting triggered at an unfortunate time? Difficult to track down without a consistently reproducible testcase, or a record/replay setup. I've tried running with the patch from bug 559180 backed-out, and this did seem to stop the crashes (in numerous test runs). Of course, the downside is that we then leak a few hundred Cocoa objects that are created during widget teardown, and leak because there's no pool to catch them. We could argue that leaking during shutdown is irrelevant, as the entire process is about to go away anyhow, but the resulting __NSAutoreleaseNoPool messages - besides being alarming and cluttering the logs - may also swamp any new leaks that we want to be notified about. A possible fix may be to replace the localized pools from bug 559180, which were placed just around the Cocoa methods that lead to leakage, with a single pool covering the entire shutdown process. I think this is more comparable to the level where we use pools during the normal event loop, and makes it less likely that an object will be "caught" by a temporary, local pool and then released prematurely.
Assignee | ||
Comment 18•14 years ago
|
||
This patch effectively backs-out the changes from bug 559180, and replaces the low-level pools there with a single, higher-level autorelease pool. In repeated test runs locally, I have not seen any further crashes once this is applied.
Attachment #440189 -
Flags: review?(joshmoz)
Comment 19•14 years ago
|
||
Comment on attachment 440189 [details] [diff] [review] patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown I'd also like Steven's opinion on this patch. It would be good to know exactly why it works so we don't make this mistake again.
Attachment #440189 -
Flags: review?(smichaud)
Comment 20•14 years ago
|
||
It is possible that something is not being retained enough times but it didn't matter before because it was leaking. When the "regressing" patch landed the objected was used after the release pool sent the fatal release message, and with the new patch which has the release pool further down the stack the object stays alive long enough to avoid the bad access. Just a theory :)
Assignee | ||
Comment 21•14 years ago
|
||
(In reply to comment #20) > It is possible that something is not being retained enough times but it didn't > matter before because it was leaking. When the "regressing" patch landed the > objected was used after the release pool sent the fatal release message, and > with the new patch which has the release pool further down the stack the object > stays alive long enough to avoid the bad access. Yes, I've also wondered about scenarios like this. I'm not at all sure the "regressing" patch is really "wrong", it may just have encouraged a latent problem to surface more readily. But I don't have any good ideas on how to track that down more precisely. The error need not be directly related to where the crash occurs (which is typically within several layers of Cocoa/CoreFoundation code), as this may be caused by an earlier memory management error.
Comment 22•14 years ago
|
||
Comment on attachment 440189 [details] [diff] [review] patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown Our shutdown sequence is fragile and unpredictable -- so, for example, it's not currently possible to know for sure when it has finished. (See bug 384284, and references to it in bug 491122 comment #8 and bug 539169 comment #2.) So Jonathan's patch might not completely get rid of "no autorelease pool" leaks on shutdown. And it might even trigger crashes (presumably different ones). But the autorelease pool he proposes will certainly cover a lot more of the shutdown sequence than does the patch for bug 559180, and so I think it's much less likely to cause trouble. I agree that the patch for bug 559180 only triggers this bug's crash -- that the real cause is some kind of memory mismanagement. But we're not going to be able to find the cause until we can reliably reproduce its bad effects.
Attachment #440189 -
Flags: review?(smichaud) → review+
Comment 23•14 years ago
|
||
Note: In our tests of the shutdown sequence, we should try quitting from the Dock menu and logging out while Minefield is still running. Both of those substantially change how the shutdown works.
Assignee | ||
Comment 24•14 years ago
|
||
I've tried quitting from the Dock, and have not seen any ill effects (either leak warnings or crashes!). Haven't tested quitting via Logout yet but I'll give that a try. Where I do see residual __NSAutoreleaseNoPool messages with this patch is on the first run of a new version (e.g. after I've been running FF 3.6) and the Addons Compatibility checking dialog is shown, or when using the GUI profile manager at startup. Several autoreleased objects are leaked from calls below nsCocoaWindow::DestroyNativeWindow. The patch from bug 559180 fixed these because a pool was added in that function, but this is one of the points under suspicion as a possible trigger of the crashes. (This is similar to bug 559400, though that was a more limited fix for a startup-dialog issue that was not covered by bug 559180.) A sample call stack for these leaks: #0 0x00007fff85980de4 in __NSAutoreleaseNoPool () #1 0x00007fff858cdd19 in _CFAutoreleasePoolAddObject () #2 0x00007fff858cda86 in -[NSObject(NSObject) autorelease] () #3 0x00007fff87e3e35b in -[NSNotificationCenter(NSNotificationCenterAppKitExtensions) _removeObserver:notificationNamesAndSelectorNames:object:] () #4 0x00007fff87e6e31f in -[NSWindow setDelegate:] () #5 0x0000000100ac35ae in nsCocoaWindow::DestroyNativeWindow () #6 0x0000000100ac541a in nsCocoaWindow::~nsCocoaWindow () #7 0x0000000100af1a15 in nsBaseWidget::Release () #8 0x0000000100931a4b in nsXULWindow::ShowModal () #9 0x0000000100906394 in nsWindowWatcher::OpenWindowJSInternal () #10 0x0000000100906fe4 in nsWindowWatcher::OpenWindow () #11 0x0000000100bcf162 in NS_InvokeByIndex_P () #12 0x000000010004b741 in XPCWrappedNative::CallMethod () #13 0x0000000100051a1a in XPC_WN_CallMethod () #14 0x0000000101bfbfe0 in js_Invoke () #15 0x0000000101bea86f in js_Interpret () #16 0x0000000101bfc2a7 in js_Invoke () #17 0x000000010004748f in nsXPCWrappedJSClass::CallMethod () #18 0x0000000100bd03d5 in PrepareAndDispatch () #19 0x0000000100bcf26b in SharedStub () #20 0x0000000100010a8e in XRE_main () #21 0x0000000100001a3e in main () We could catch these with a pool somewhere up the call stack, though I'm not sure where is the most sensible place - or we could decide (at least for the time being) to ignore them as relatively harmless, given that they don't occur in the course of normal runs. My suggestion would be to try landing this patch first, and let it bake for a while. Shutdown leaks during a version-upgrade restart are a secondary issue that we can tackle as a followup once the crashing is resolved.
Comment 25•14 years ago
|
||
Comment on attachment 440189 [details] [diff] [review] patch, v1 - replace temporary local pools during widget destruction with a higher-level pool covering shutdown We should take this to solve our immediate problems but I'm not really happy with it in that it is a cover-up technique much like the original root pool was. Obviously much smaller and less harmful scope though.
Attachment #440189 -
Flags: review?(joshmoz) → review+
Comment 26•14 years ago
|
||
(Following up comment #23) We should also test this patch in Camino, which had weird problems (on shutdown) with the original root autorelease pool. See bug 559047 comment #2.
Assignee | ||
Comment 27•14 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271832263.1271832944.26112.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/20 23:44:23 s: talos-r3-snow-012 TEST-UNEXPECTED-FAIL | automation.py | Exited with code -11 during test run This looks like comments 9, 11, 12,... However, I don't see a crash in the log, like in the original report, so I'm not sure it's really the same issue. I'm just trying to fix the crasher; if these other failures persist after that, we should probably open a new bug on them.
Assignee | ||
Comment 28•14 years ago
|
||
Pushed http://hg.mozilla.org/mozilla-central/rev/a6d7a5677b4c
Assignee | ||
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 30•14 years ago
|
||
Which of course has nothing whatsoever to do with this bug.
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange]
You need to log in
before you can comment on or make changes to this bug.
Description
•