Closed Bug 1196237 Opened 6 years ago Closed 6 years ago
Host Resolver thread is still running late in shutdown
This is per bug 1193038, comment 0 and bug 1193038, comment 6 ff.

(In reply to Randell Jesup [:jesup] from comment #0)
> Repeated crashes in the field with 0x5a signatures. Called from
> Telemetry::Accumulate(Telemetry::DNS_FAILED_LOOKUP_TIME, millis);
> https://crash-stats.mozilla.com/report/index/aabb2399-64d7-4ec7-a888-027b02150805

The follow-up comments there make it pretty clear that nsHostResolver is still running late (after the backing Telemetry storage is already cleared out).

> Others include
> Telemetry::Accumulate(Telemetry::STS_NUMBER_OF_ONSOCKETREADY_CALLS, numberOfOnSocketReadyCalls)
> https://crash-stats.mozilla.com/report/index/c77f5b7c-cc39-44d3-9664-030ff2150807

Per that, nsSocketTransportService might have a similar issue.
I have some concern about the unlocked polling of nsHostResolver's mThreadCount, since it's not Atomic. At minimum it's a TSAN issue, but more to the point, I'm wondering if we might miss a just-created thread and continue on without waiting for it to exit. Alternatively, we could take the lock in the poll; it really wouldn't hurt to do that in shutdown. It's also possible that the DNS service isn't getting shut down at all, or is somehow getting restarted.
I don't know our shutdown dance well enough to know what's going on here. Have we seen similar issues with other code calling telemetry this late? Is there any way to simply drop telemetry requests after the backing storage is gone?
We might be able to drop them (though it's tricky, since the Telemetry code runs lock-free for perf reasons). gfritzche's patches for the bug that exposed this may make it not crash, but having these threads running in shutdown creates a myriad of failure possibilities and possible sec issues. While these are PR_Threads and not NS_Threads, the same sorts of issues apply. That these get called at all implies some lack of synchronization at shutdown, or a missing shutdown, at least on some path. I'm concerned that the fix on the other bug will simply wallpaper over the most obvious crash from a failure-to-shutdown, leaving others to be discovered later.

I suspect the shutdown code could use a fresh eye on it; I looked, and the basics (assuming DNSService gets Shutdown() called) seem reasonable. There might be some interesting interactions with the observers, which can separately cause HostResolver to shut down and restart (see ::Observe()); perhaps those are involved here. There are a few signs of other things hitting telemetry in shutdown; we need to ensure those weren't IPC failures causing the other process to get dumped (check for ipc_channel_error in the MetaData tab of a crash report; if it's there, it's likely a spurious report).
When shutdown happens, the DNS threadpool member is probably stuck in a very long blocking getaddrinfo() (that's why it's on the threadpool, after all); that's especially likely since you're seeing DNS_FAILED_LOOKUP_TIME. It will exit shortly afterwards. We specifically don't synchronously join the threads, because that getaddrinfo() timeout can block shutdown for a very long time; see the code around line 690 in nsHostResolver.cpp, where it does block for a while, but only if built with REFCOUNT_LOGGING (for leak-checking purposes). And IIRC there was a reason kill-thread semantics couldn't be portably implemented. I guess we could create some kind of per-thread lock and shutdown flag: the lock would be held whenever the thread isn't blocked on getaddrinfo() or waiting for new work, so the xpcom-shutdown event could grab the lock quickly enough and set the flag.
This is easier than I thought. The only cross-module work that gets done in that threadpool is the telemetry, and OnLookupComplete already checks the mShutdown flag under the lock to prevent using it; the other uses flagged in this bug are the outliers. So I think we can fix it by just putting them under a locked check of that flag, and making the flag an Atomic.
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown Review of attachment 8651010 [details] [diff] [review]: ----------------------------------------------------------------- Looks good
Attachment #8651010 - Flags: review?(dd.mozilla) → review+
Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown

[Security approval request comment]

How easily could an exploit be constructed based on the patch?
You could get a UAF crash. It would affect the heap (not the stack), and it would have very limited attacker-controllable data (maybe a few bytes' worth, if you could control the timing really closely).

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
Pretty much; the patch itself identifies the problem.

Which older supported branches are affected by this flaw?
All of them.

If not all supported branches, which bug introduced the flaw?
N/A.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
Can make them trivially if this does not apply cleanly.

How likely is this patch to cause regressions; how much testing does it need?
This is simple and low-risk for regression. It affects shutdown code paths, so it would be well exercised in normal testing.
Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown [Triage Comment] Let's land this at least on Aurora, a=dveditz After some testing we can then look at landing on Beta. sec-approval = dveditz
https://hg.mozilla.org/integration/mozilla-inbound/rev/a7d4d29b32ac https://hg.mozilla.org/releases/mozilla-aurora/rev/197a0c3b8d8e Sounds like this is going to need to land on esr38 as well?
Worth considering this for beta/esr38 approval now?
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #12)
> Worth considering this for beta/esr38 approval now?

Tough call. I think both risk and reward are fairly low here, so I would probably favor stability and not port it.
We can take a look at this again for esr 38.4.0.
Patrick, based on comment 13, are you suggesting we won't fix this for esr38? Or, given that the fix has been stable on other branches for a few weeks now, do you think it is safe to uplift to esr38.4.0? Thanks!
right - I would say the impact of this bug is low.
This issue is deemed low impact and therefore not worth backporting to esr38.
Hi, for 2.2 this may need a backport or rebase, since it has conflicts when applied:

merging netwerk/dns/nsHostResolver.cpp
warning: conflicts during merge.
merging netwerk/dns/nsHostResolver.cpp incomplete! (edit conflicts, then use 'hg resolve --mark')
merging netwerk/dns/nsHostResolver.h
abort: unresolved conflicts, can't continue
(use hg resolve and hg graft --continue)
I think this should be wontfixed on the b2g branches for the same reasoning as esr38 (low impact vs. merge conflicts).
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main42+]