Closed Bug 881413 Opened 11 years ago Closed 11 years ago

I10t test_errorhandler.js,test_TelemetryPing.js,test_update.js,test_healthreporter.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))

Categories

(Core :: Networking: HTTP, defect)

x86
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla30
Tracking Status
firefox28 --- wontfix
firefox29 --- fixed
firefox30 --- fixed
firefox-esr24 --- wontfix
b2g-v1.2 --- wontfix
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: RyanVM, Assigned: benjamin)

References

Details

(Keywords: assertion, crash)

Crash Data

Attachments

(3 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=23958467&tree=Mozilla-Inbound

WINNT 6.2 mozilla-inbound debug test xpcshell on 2013-06-10 04:39:00 PDT for push eaae3921357b
slave: t-w864-ix-015

05:48:44     INFO -  TEST-PASS | (xpcshell/head.js) | 212 (+ 0) check(s) passed
05:48:44     INFO -  TEST-INFO | (xpcshell/head.js) | 0 check(s) todo
05:48:44     INFO -  WARNING: An event was posted to a thread that will never run it (rejected): file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/nsThread.cpp, line 367
05:48:44     INFO -  Assertion failure: rc != 0 (destroyed timer off its target thread!), at e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/TimerThread.cpp:247
05:48:44     INFO -  WARNING: NS_ENSURE_TRUE(mThread != PR_GetCurrentThread()) failed: file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/xpcom/threads/nsThread.cpp, line 445
05:48:44     INFO -  <<<<<<<
05:48:50  WARNING -  PROCESS-CRASH | C:\slave\test\build\tests\xpcshell\tests\services\sync\tests\unit\test_errorhandler.js | application crashed [@ TimerThread::Run()]
05:48:50     INFO -  Crash dump filename: C:\slave\test\build\tests\xpcshell\tests\services\sync\tests\unit\e86e63e2-e82f-43bf-a519-61fe7435413c.dmp
05:48:50     INFO -  Operating system: Windows NT
05:48:50     INFO -                    6.2.9200
05:48:50     INFO -  CPU: x86
05:48:50     INFO -       GenuineIntel family 6 model 30 stepping 5
05:48:50     INFO -       8 CPUs
05:48:50     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
05:48:50     INFO -  Crash address: 0x737c9e31
05:48:50     INFO -  Thread 7 (crashed)
05:48:50     INFO -   0  xul.dll!TimerThread::Run() [TimerThread.cpp:eaae3921357b : 247 + 0x1b]
05:48:50     INFO -      eip = 0x737c9e31   esp = 0x04fff93c   ebp = 0x04fff9ac   ebx = 0x05ca92e0
05:48:50     INFO -      esi = 0x05ca9328   edi = 0x03297608   eax = 0x00000000   ecx = 0x21a37d50
05:48:50     INFO -      edx = 0x71b1e4d8   efl = 0x00000206
05:48:50     INFO -      Found by: given as instruction pointer in context
05:48:50     INFO -   1  xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:eaae3921357b : 626 + 0xd]
05:48:50     INFO -      eip = 0x737cccc0   esp = 0x04fff9b4   ebp = 0x04fffa04
05:48:50     INFO -      Found by: call frame info
05:48:50     INFO -   2  xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:eaae3921357b : 238 + 0xc]
05:48:50     INFO -      eip = 0x7378768f   esp = 0x04fffa0c   ebp = 0x04fffa18
05:48:50     INFO -      Found by: call frame info
05:48:50     INFO -   3  xul.dll!nsThread::ThreadFunc(void *) [nsThread.cpp:eaae3921357b : 264 + 0x7]
05:48:50     INFO -      eip = 0x737cc443   esp = 0x04fffa20   ebp = 0x04fffa40
05:48:50     INFO -      Found by: call frame info
Crash Signature: [@ TimerThread::Run() ]
https://tbpl.mozilla.org/php/getParsedLog.php?id=26486056&tree=Mozilla-Inbound
Summary: Intermittent test_errorhandler.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → Intermittent test_errorhandler.js,test_TelemetryPing.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
https://tbpl.mozilla.org/php/getParsedLog.php?id=29521615&tree=Fx-Team
Summary: Intermittent test_errorhandler.js,test_TelemetryPing.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → Intermittent test_errorhandler.js,test_TelemetryPing.js,test_update.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
https://tbpl.mozilla.org/php/getParsedLog.php?id=30535891&tree=B2g-Inbound
Summary: Intermittent test_errorhandler.js,test_TelemetryPing.js,test_update.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!)) → I10t test_errorhandler.js,test_TelemetryPing.js,test_update.js,test_healthreporter.js | test failed (with xpcshell return code: -2147483645) | application crashed [@ TimerThread::Run()] (Assertion failure: rc != 0 (destroyed timer off its target thread!))
Do you have any ideas on somebody who could look at this?  The assertion sounds pretty sketchy, and it is happening fairly frequently.  Thanks.
Flags: needinfo?(benjamin)
Depends on: 963474
No longer depends on: 963474
FWIW, if push comes to shove, we can probably disable our way to "victory" here by skipping test_update.js on Windows debug. At the rate things are going, that's looking more and more likely :(
This is complicated to debug, of course.

13:06:52     INFO -  [3540] WARNING: cannot post event if not initialized: file c:/builds/moz2_slave/m-cen-w32-d-000000000000000000/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 178
13:06:52     INFO -  TEST-PASS | (xpcshell/head.js) | 230 (+ 0) check(s) passed
13:06:52     INFO -  TEST-INFO | (xpcshell/head.js) | 0 check(s) todo
13:06:52     INFO -  [3540] WARNING: An event was posted to a thread that will never run it (rejected): file c:\\builds\\moz2_slave\\m-cen-w32-d-000000000000000000\\build\\xpcom\\threads\\nsThread.cpp, line 367
13:06:52     INFO -  Assertion failure: rc != 0 (destroyed timer off its target thread!), at c:\\builds\\moz2_slave\\m-cen-w32-d-000000000000000000\\build\\xpcom\\threads\\TimerThread.cpp:259

We pretty clearly are trying to fire an XPCOM timer at a thread which is already shutting down. This causes the timer service to hold the last reference to the timer and eventually assert. The following things initialize timers during this test:

startupcache

mozilla::dom::workers::WorkerPrivate::SetGCTimerMode

nsXMLHttpRequest::StartProgressEventTimer (from main thread)

mozilla::net::nsHttpConnectionMgr::ActivateTimeoutTick (socket transport thread)

mozilla::net::nsHttpConnectionMgr::nsHalfOpenSocket::SetupBackupTimer (socket transport thread)

JS timer from 0 namedTimer(callback = [function], wait = 300000000, thisObj = [object Object], name = "timeoutTimer") ["resource://gre/modules/services-common/utils.js":172]
    this = [object Object]
1 delayTimeout() ["resource://services-common/rest.js":357]
    this = [object Object]
2 dispatch(method = "POST", data = "...", onComplete = undefined, onProgress = undefined) ["resource://services-common/rest.js":347]
    this = [object Object]
3 post(data = "...") ["resource://services-common/rest.js":242]
    this = [object Object]
4 uploadJSON(namespace = "test00", id = "e22a62af-f2bb-4c1d-87f6-471aca8b155a", payload = "{"version":2,"thisPingDate":"2014-02-19","geckoAppInfo":{"_v":1,"vendor":"Mozilla","name":"xpcshell","id":"xpcshell@tests.mozilla.org","version":"1","appBuildID":"20121107","platformVersion":"p-ver","platformBuildID":"20121106","os":"XPCShell","xpcomabi":"noarch-spidermonkey","updateChannel":"default"},"data":{"last":{},"days":{}}}", options = [object Object]) ["resource://services-common/bagheeraclient.js":200]
    this = [object Object]
5 doUpload() ["resource://gre/modules/services/healthreport/healthreporter.jsm":1427]
    this = [object Object]
6 TaskImpl_run(aSendResolved = true, aSendValue = 111) ["resource://gre/modules/Task.jsm":233]
    this = [object Object]
7 anonymous() ["resource://gre/modules/Promise.jsm":774]
    this = [object Object]
8 anonymous() ["resource://gre/modules/Promise.jsm":653]
    this = [object Object]

mozilla::net::nsHttpConnection::StartShortLivedTCPKeepalives (non-main thread but doesn't appear to be the socket transport thread)

nsDiskCacheMap::ResetCacheTimer

nsCacheService::CreateDiskDevice

mozilla::net::nsHttpConnectionMgr::PruneDeadConnectionsAfter

Since the JS timers all fire at the main thread, I don't think it's possible for them to be the problem here. But I don't know which of the remaining timer/threads is the one at fault. I'd love to add more debugging code to the "destroyed timer off its target thread" warning, but even though I have a handle to the nsThreadImpl in question, it doesn't have a name or other identifying data which would make it easy to tell which thread it is. The nsHttpConnectionMgr.cpp warning is interesting and points to this being one of the networking timers...
Flags: needinfo?(benjamin)
Maybe Jason can help answer that.
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(jduell.mcbugs)
Benjamin, thanks for the very nice forensic work.

Looking at the hg log for /netwerk from ~ June 10 2013 (when this started), I don't see an obvious culprit.  I suppose it's possible that the patch for bug 845190 (that blocked re-initializing nsStreamTransportService during shutdown--it's marked obsolete but it landed on m-i then) might somehow have exposed some latent bug with timers here.

HTTP-off-main-thread landed around then, but slightly later (Jun 13th). The TCPKeepAlive stuff is very recent, so that's not it.

From the stuff in the list in comment 284, that leaves a bunch of Http cxn manager stuff, and some non-necko culprits.  I'm tossing to Patrick in case he can think of any possible timers that might be to blame here, and cc-ing some other folks who might have touched code that does off-main timer stuff.
Flags: needinfo?(mcmanus)
https://tbpl.mozilla.org/?tree=Try&rev=b956126f4be7 pushed to see if we can make this more reproducible. Included mac to help with bug 963474 which is similar but not identical.
Or actually https://tbpl.mozilla.org/?tree=Try&rev=1240b70eb699 since MOZ_ASSERT doesn't take non-literals.
I caught this at https://tbpl.mozilla.org/php/getParsedLog.php?id=35441626&tree=Try

The timer that was not cancelled properly is set to call nsHttpConnectionMgr::Observe.

There are two timers that fire at this function. mTimer and mTimeoutTick. I don't know which one this is, but I do see this:

in nsHttpConnectionMgr::OnMsgShutdown we cancel mTimeoutTick but don't cancel mTimer. I'd be willing to bet we should be cancelling both timers here. I'm going to put up a patch to do that and see if neckofolk agree.
Component: XPCOM → Networking: HTTP
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Attachment #8383939 - Flags: review?(mcmanus)
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

Review of attachment 8383939 [details] [diff] [review]:
-----------------------------------------------------------------

that's obviously right benjamin - thanks. this bug has been there since at least firefox 3.5.. prbly longer.
Attachment #8383939 - Flags: review?(mcmanus) → review+
This needs an updated r= message before being pushed.
Keywords: checkin-needed
Flags: needinfo?(mcmanus)
https://hg.mozilla.org/mozilla-central/rev/5d6a519d6108
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Can we please nominate this for uplift to all affected branches (Aurora/Beta/b2g26/esr24)?

@ RelMan - I know this doesn't really fit the usual ESR uplift criteria, but this is an annoying orange with a simple fix, so I'm hoping you're willing to bend the rules a bit for sanity's sake :)
Flags: needinfo?(benjamin)
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

I really don't have a good risk assessment here. This is probably very safe, but I'm not sure whether it could affect shutdown in some way I haven't thought of. I think Patrick probably ought to make that call.
Flags: needinfo?(mcmanus)
this should be very safe.

otoh I wouldn't personally take it on esr as I don't actually see a single esr24 report on this page - which isn't to say it isn't present (it probably dates to firefox 4), but how annoying can it be? ymmv.
Flags: needinfo?(mcmanus)
Fine with me. Can we get noms for the other 3 branches then? :)
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

[Approval Request Comment]
Bug caused by (feature/regressing bug #): old code
User impact if declined: Random-orange tests, perhaps leaks or other shutdown weirdness
Testing completed (on m-c, etc.): landed, orange seems to be gone?
Risk to taking this patch (and alternatives if risky): Main risk is weird stuff happening at shutdown, but the risk seems minimal
String or IDL/UUID changes made by this patch: none

I don't think we need this on beta; it doesn't meet the reward criterion I'd normally use.
Attachment #8383939 - Flags: approval-mozilla-aurora?
Flags: needinfo?(benjamin)
I'm asking for it on beta because Gecko 28 lives on other longer-shelf life branches as well and this is a low-risk fix.
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

Per the previous comments. It's a simple, low-risk fix that fixes an annoying crash on TBPL. I would really appreciate if we could consider this for the other branches as well.
Attachment #8383939 - Flags: approval-mozilla-beta?
Attachment #8383939 - Flags: approval-mozilla-b2g26?
Attachment #8383939 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

too late for beta, we'll see this in action post-merge on 29.
Attachment #8383939 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Comment on attachment 8383939 [details] [diff] [review]
bug881413-cancel-mTimer

"Fun"

Same reasons as before. Going to be stuck with this crash that has a simple fix for another 6 months in that case.
Attachment #8383939 - Flags: approval-mozilla-b2g28?
Attachment #8383939 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
Attachment #8383939 - Flags: approval-mozilla-b2g26?
Because this comes up occasionally and because try revisions aren't forever, I'm going to attach the two patches I used to debug this here. This one tells you what kind of callback the bad timer has.
And this patch, if the timer callback is a nsIObserver* or nsITimerCallback* will dump the address of the ::Observe or ::Notify method, on Windows/MSVC.

If you need to do something similar using the GCC ABI, see https://blog.mozilla.org/nfroyd/2014/02/20/finding-addresses-of-virtual-functions/
(Removing keyword to avoid false positives on current intermittent failures)
See Also: → 1729964
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: