Closed Bug 881413 Opened 12 years ago Closed 11 years ago

I10t test_errorhandler.js,test_TelemetryPing.js,test_update.js,test_healthreporter.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))

Categories

(Core :: Networking: HTTP, defect)

x86
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla30
Tracking Status
firefox28 --- wontfix
firefox29 --- fixed
firefox30 --- fixed
firefox-esr24 --- wontfix
b2g-v1.2 --- wontfix
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: RyanVM, Assigned: benjamin)

References

Details

(Keywords: assertion, crash)

Crash Data

Attachments

(3 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=23958467&tree=Mozilla-Inbound WINNT 6.2 mozilla-inbound debug test xpcshell on 2013-06-10 04:39:00 PDT for push eaae3921357b slave: t-w864-ix-015 05:48:44 INFO - TEST-PASS | (xpcshell/head.js) | 212 (+ 0) check(s) passed 05:48:44 INFO - TEST-INFO | (xpcshell/head.js) | 0 check(s) todo 05:48:44 INFO - WARNING: An event was posted to a thread that will never run it (rejected): file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/nsThread.cpp, line 367 05:48:44 INFO - Assertion failure: rc != 0 (destroyed timer off its target thread!), at e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/TimerThread.cpp:247 05:48:44 INFO - WARNING: NS_ENSURE_TRUE(mThread != PR_GetCurrentThread()) failed: file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/nsThread.cpp, line 445 05:48:44 INFO - <<<<<<< 05:48:50 WARNING - PROCESS-CRASH | C:\slave\test\build\tests\xpcshell\tests\services\sync\tests\unit\test_errorhandler.js | application crashed [@ TimerThread::Run()] 05:48:50 INFO - Crash dump filename: C:\slave\test\build\tests\xpcshell\tests\services\sync\tests\unit\e86e63e2-e82f-43bf-a519-61fe7435413c.dmp 05:48:50 INFO - Operating system: Windows NT 05:48:50 INFO - 6.2.9200 05:48:50 INFO - CPU: x86 05:48:50 INFO - GenuineIntel family 6 model 30 stepping 5 05:48:50 INFO - 8 CPUs 05:48:50 INFO - Crash reason: EXCEPTION_BREAKPOINT 05:48:50 INFO - Crash address: 0x737c9e31 05:48:50 INFO - Thread 7 (crashed) 05:48:50 INFO - 0 xul.dll!TimerThread::Run() [TimerThread.cpp:eaae3921357b : 247 + 0x1b] 05:48:50 INFO - eip = 0x737c9e31 esp = 0x04fff93c ebp = 0x04fff9ac ebx = 0x05ca92e0 05:48:50 INFO - esi = 0x05ca9328 edi = 0x03297608 eax = 0x00000000 ecx = 0x21a37d50 05:48:50 INFO - edx = 0x71b1e4d8 efl = 0x00000206 05:48:50 INFO - Found by: given as instruction pointer in context 05:48:50 INFO - 1 xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:eaae3921357b : 626 + 0xd] 05:48:50 INFO - eip = 0x737cccc0 esp = 0x04fff9b4 ebp = 0x04fffa04 05:48:50 INFO - Found by: call frame info 05:48:50 INFO - 2 xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:eaae3921357b : 238 + 0xc] 05:48:50 INFO - eip = 0x7378768f esp = 0x04fffa0c ebp = 0x04fffa18 05:48:50 INFO - Found by: call frame info 05:48:50 INFO - 3 xul.dll!nsThread::ThreadFunc(void *) [nsThread.cpp:eaae3921357b : 264 + 0x7] 05:48:50 INFO - eip = 0x737cc443 esp = 0x04fffa20 ebp = 0x04fffa40 05:48:50 INFO - Found by: call frame info
Crash Signature: [@ TimerThread::Run() ]
Summary: Intermittent test_errorhandler.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → Intermittent test_errorhandler.js,test_TelemetryPing.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
Summary: Intermittent test_errorhandler.js,test_TelemetryPing.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → Intermittent test_errorhandler.js,test_TelemetryPing.js,test_update.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
Summary: Intermittent test_errorhandler.js,test_TelemetryPing.js,test_update.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → I10t test_errorhandler.js,test_TelemetryPing.js,test_update.js,test_healthreporter.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
Do you have any ideas on somebody who could look at this? The assertion sounds pretty sketchy, and it is happening fairly frequently. Thanks.
Flags: needinfo?(benjamin)
No longer depends on: 963474
FWIW, if push comes to shove, we can probably disable our way to "victory" here by skipping test_update.js on Windows debug. At the rate things are going, that's looking more and more likely :(
This is complicated to debug, of course. 13:06:52 INFO - [3540] WARNING: cannot post event if not initialized: file c:/builds/moz2_slave/m-cen-w32-d-000000000000000000/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 178 13:06:52 INFO - TEST-PASS | (xpcshell/head.js) | 230 (+ 0) check(s) passed 13:06:52 INFO - TEST-INFO | (xpcshell/head.js) | 0 check(s) todo 13:06:52 INFO - [3540] WARNING: An event was posted to a thread that will never run it (rejected): file c:\\builds\\moz2_slave\\m-cen-w32-d-000000000000000000\\build\\xpcom\\threads\\nsThread.cpp, line 367 13:06:52 INFO - Assertion failure: rc != 0 (destroyed timer off its target thread!), at c:\\builds\\moz2_slave\\m-cen-w32-d-000000000000000000\\build\\xpcom\\threads\\TimerThread.cpp:259 We pretty clearly are trying to fire an XPCOM timer at a thread which is already shutting down. This causes the timer service to hold the last reference to the timer and eventually assert. The following things initialize timers during this test: startupcache mozilla::dom::workers::WorkerPrivate::SetGCTimerMode nsXMLHttpRequest::StartProgressEventTimer (from main thread) mozilla::net::nsHttpConnectionMgr::ActivateTimeoutTick (socket transport thread) mozilla::net::nsHttpConnectionMgr::nsHalfOpenSocket::SetupBackupTimer (socket transport thread) JS timer from 0 namedTimer(callback = [function], wait = 300000000, thisObj = [object Object], name = "timeoutTimer") ["resource://gre/modules/services-common/utils.js":172] this = [object Object] 1 delayTimeout() ["resource://services-common/rest.js":357] this = [object Object] 2 dispatch(method = "POST", data = "...", onComplete = undefined, onProgress = undefined) ["resource://services-common/rest.js":347] this = [object Object] 3 post(data = "...") ["resource://services-common/rest.js":242] this = [object Object] 4 uploadJSON(namespace = "test00", id = "e22a62af-f2bb-4c1d-87f6-471aca8b155a", payload = "{"version":2,"thisPingDate":"2014-02-19","geckoAppInfo":{"_v":1,"vendor":"Mozilla","name":"xpcshell","id":"xpcshell@tests.mozilla.org","version":"1","appBuildID":"20121107","platformVersion":"p-ver","platformBuildID":"20121106","os":"XPCShell","xpcomabi":"noarch-spidermonkey","updateChannel":"default"},"data":{"last":{},"days":{}}}", options = [object Object]) ["resource://services-common/bagheeraclient.js":200] this = [object Object] 5 doUpload() ["resource://gre/modules/services/healthreport/healthreporter.jsm":1427] this = [object Object] 6 TaskImpl_run(aSendResolved = true, aSendValue = 111) ["resource://gre/modules/Task.jsm":233] this = [object Object] 7 anonymous() ["resource://gre/modules/Promise.jsm":774] this = [object Object] 8 anonymous() ["resource://gre/modules/Promise.jsm":653] this = [object Object] mozilla::net::nsHttpConnection::StartShortLivedTCPKeepalives (non-main thread but doesn't appear to be the socket transport thread) nsDiskCacheMap::ResetCacheTimer nsCacheService::CreateDiskDevice mozilla::net::nsHttpConnectionMgr::PruneDeadConnectionsAfter Since the JS timers all fire at the main thread, I don't think it's possible for them to be the problem here. But I don't know which of the remaining timer/threads is the one at fault. I'd love to add more debugging code to the "destroyed timer off its target thread" warning, but even though I have a handle to the nsThreadImpl in question, it doesn't have a name or other identifying data which would make it easy to tell which thread it is. The nsHttpConnectionMgr.cpp warning is interesting and points to this being one of the networking timers...
Flags: needinfo?(benjamin)
Maybe Jason can help answer that.
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(jduell.mcbugs)
Benjamin, thanks for the very nice forensic work. Looking at the hg log for /netwerk from ~ June 10 2013 (when this started), I don't see an obvious culprit. I suppose it's possible that the patch for bug 845190 (that blocked re-initializing nsStreamTransportService during shutdown--it's marked obsolete but it landed on m-i then) might somehow have exposed some latent bug with timers here. HTTP-off-main-thread landed around then, but slightly later (Jun 13th). The TCPKeepAlive stuff is very recent, so that's not it. From the stuff in the list in comment 284, that leaves a bunch of Http cxn manager stuff, and some non-necko culprits. I'm tossing to Patrick in case he can think of any possible timers that might be to blame here, and cc-ing some other folks who might have touched code that does off-main timer stuff.
Flags: needinfo?(mcmanus)
https://tbpl.mozilla.org/?tree=Try&rev=b956126f4be7 pushed to see if we can make this more reproducible. Included mac to help with bug 963474 which is similar but not identical.
Or actually https://tbpl.mozilla.org/?tree=Try&rev=1240b70eb699 since MOZ_ASSERT doesn't take non-literals.
I caught this at https://tbpl.mozilla.org/php/getParsedLog.php?id=35441626&tree=Try The timer that was not cancelled properly is set to call nsHttpConnectionMgr::Observe. There are two timers that fire at this function. mTimer and mTimeoutTick. I don't know which one this is, but I do see this: in nsHttpConnectionMgr::OnMsgShutdown we cancel mTimeoutTick but don't cancel mTimer. I'd be willing to bet we should be cancelling both timers here. I'm going to put up a patch to do that and see if neckofolk agree.
Component: XPCOM → Networking: HTTP
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Attachment #8383939 - Flags: review?(mcmanus)
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer Review of attachment 8383939 [details] [diff] [review]: ----------------------------------------------------------------- that's obviously right benjamin - thanks. this bug has been there since at least firefox 3.5.. prbly longer.
Attachment #8383939 - Flags: review?(mcmanus) → review+
This needs an updated r= message before being pushed.
Keywords: checkin-needed
Flags: needinfo?(mcmanus)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Can we please nominate this for uplift to all affected branches (Aurora/Beta/b2g26/esr24)? @ RelMan - I know this doesn't really fit the usual ESR uplift criteria, but this is an annoying orange with a simple fix, so I'm hoping you're willing to bend the rules a bit for sanity's sake :)
Flags: needinfo?(benjamin)
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer I really don't have a good risk assessment here. This is probably very safe, but I'm not sure whether it could affect shutdown in some way I haven't thought of. I think Patrick probably ought to make that call.
Flags: needinfo?(mcmanus)
this should be very safe. otoh I wouldn't personally take it on esr as I don't actually see a single esr24 report on this page - which isn't to say it isn't present (it probably dates to firefox 4), but how annoying can it be? ymmv.
Flags: needinfo?(mcmanus)
Fine with me. Can we get noms for the other 3 branches then? :)
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer [Approval Request Comment] Bug caused by (feature/regressing bug #): old code User impact if declined: Random-orange tests, perhaps leaks or other shutdown weirdness Testing completed (on m-c, etc.): landed, orange seems to be gone? Risk to taking this patch (and alternatives if risky): Main risk is weird stuff happening at shutdown, but the risk seems minimal String or IDL/UUID changes made by this patch: none I don't think we need this on beta; it doesn't meet the reward criterion I'd normally use.
Attachment #8383939 - Flags: approval-mozilla-aurora?
Flags: needinfo?(benjamin)
I'm asking for it on beta because Gecko 28 lives on other longer-shelf life branches as well and this is a low-risk fix.
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer Per the previous comments. It's a simple, low-risk fix that fixes an annoying crash on TBPL. I would really appreciate if we could consider this for the other branches as well.
Attachment #8383939 - Flags: approval-mozilla-beta?
Attachment #8383939 - Flags: approval-mozilla-b2g26?
Attachment #8383939 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer too late for beta, we'll see this in action post-merge on 29.
Attachment #8383939 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Comment on attachment 8383939 [details] [diff] [review] bug881413-cancel-mTimer "Fun" Same reasons as before. Going to be stuck with this crash that has a simple fix for another 6 months in that case.
Attachment #8383939 - Flags: approval-mozilla-b2g28?
Attachment #8383939 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
Attachment #8383939 - Flags: approval-mozilla-b2g26?
Because this comes up occasionally and because try revisions aren't forever, I'm going to attach the two patches I used to debug this here. This one tells you what kind of callback the bad timer has.
And this patch, if the timer callback is a nsIObserver* or nsITimerCallback* will dump the address of the ::Observe or ::Notify method, on Windows/MSVC. If you need to do something similar using the GCC ABI, see https://blog.mozilla.org/nfroyd/2014/02/20/finding-addresses-of-virtual-functions/
(Removing keyword to avoid false positives on current intermittent failures)
See Also: → 1729964
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: