Closed Bug 1196237 Opened 9 years ago Closed 9 years ago

nsHostResolver thread is still running late in shutdown

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla43

Tracking Flags:

Tracking

Status

firefox40

---

wontfix

firefox41

---

wontfix

firefox42

---

fixed

firefox43

---

fixed

firefox-esr38

wontfix

b2g-v2.0

---

wontfix

b2g-v2.0M

---

wontfix

b2g-v2.1

---

wontfix

b2g-v2.1S

---

wontfix

b2g-v2.2

---

affected

b2g-v2.2r

---

affected

b2g-master

---

fixed

People

(Reporter: gfritzsche, Assigned: mcmanus)

References

Details

(Keywords: csectype-race, csectype-uaf, sec-high, Whiteboard: [post-critsmash-triage][adv-main42+])

Attachments

(1 file)

telemetry after dns shutdown 9 years ago Patrick McManus [:mcmanus] 8.09 KB, patch	dragana : review+ dveditz : approval-mozilla-aurora+ dveditz : sec-approval+	Details \| Diff \| Splinter Review

Georg Fritzsche [:gfritzsche]

Reporter

Description

•

9 years ago

This is per bug 1193038, comment 0 and bug 1193038, comment 6 ff. (In reply to Randell Jesup [:jesup] from comment #0) > Repeated crashes in the field with 0x5a signatures. > Called from > Telemetry::Accumulate(Telemetry::DNS_FAILED_LOOKUP_TIME, millis); > > https://crash-stats.mozilla.com/report/index/aabb2399-64d7-4ec7-a888- > 027b02150805 The follow-up comments there make it pretty clear that nsHostResolver is still running late (after the backing Telemetry storage is already cleared out). > Others include > Telemetry::Accumulate(Telemetry::STS_NUMBER_OF_ONSOCKETREADY_CALLS, > numberOfOnSocketReadyCalls) > (https://crash-stats.mozilla.com/report/index/c77f5b7c-cc39-44d3-9664- > 030ff2150807) Per that nsSocketTransportService might have a similar issue.

Flags: needinfo?(mcmanus)

Flags: needinfo?(jduell.mcbugs)

Randell Jesup [:jesup] (needinfo me)

Comment 1

•

9 years ago

I have some concern over the unlocked polling of nsHostResolver mThreadCount since it's not Atomic... it's at least a TSAN issue, but more to the point I'm wondering if we might miss a just-created thread and continue on without waiting for it to exit. Or we could lock in the poll - it really wouldn't hurt to do that in shutdown. It's also possible that somehow the DNS service isn't getting shut down, or is somehow getting restarted.

Jason Duell

Comment 2

•

9 years ago

I don't know our shutdown dance well enough to know what's going on here. Have we seen similar issues with other code calling telemetry this late? Is there any way to simply drop telemetry requests after the backing storage is gone?

Flags: needinfo?(jduell.mcbugs)

Randell Jesup [:jesup] (needinfo me)

Comment 3

•

9 years ago

We might be able to drop them (though it's tricky since the Telemetry code runs lock-free for perf reasons). gfritzche's patches for the bug that exposed this may make it not crash, but having these threads running in shutdown creates a myriad of failure possibilities and possible sec issues. While these are PR_Threads and not NS_Threads, the same sort of issues apply. These getting called imply some lack of synchronization of shutdown or missing shutdown, at least on some path. I'm concerned that the fix on the other bug will simply wallpaper the most obvious crash from a failure-to-shutdown, leaving others to be discovered later. I suspect the shutdown code could use a fresh eye on it; I looked and the basics (assuming DNSService gets Shutdown() called) seem reasonable. There might be some interesting interactions with the observers which separately can cause HostResolver to shutdown-and-restart (see ::Observe()); perhaps those are involved here. There are a few signs perhaps of other things hitting telemetry in Shutdown; need to ensure that they weren't IPC failures causing the other process to get dumped (see ipc_channel_error in the MetaData tab of a crash report - if it's there, likely it's a spurious report apparently)

Patrick McManus [:mcmanus]

Assignee

Comment 4

•

9 years ago

when shutdown happens, the dns threadpool member is probably stuck in a very long blocking getaddrinfo() (that's why its on the threadpool, after all). especially true since you're seeing DNS_FAILED_LOOKUP_TIME. It will exit shortly afterwards. We specifically don't synchronously join the threads because that getaddrinfo timeout can block shutdown for a long long time - see the bit around line 690 in nsHostResolver.cpp where it does block for a while only if built with REFCOUNT_LOGGING (for leak purposes). and iirc there was a reason kill-thread semantics couldn't be portably implemented. I guess we could create some kind of per-thread lock and shutdown flag.. the lock could be held when not blocked on getaddrinfo or waiting for new work.. then the xpcom shutdown event could grab the lock quickly enough and set the flag.

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 5

•

9 years ago

this is easier than I thought... the only cross-module work that gets done in that threadpool is the telemetry, and onLookupComplete already checks the mshutdown flag under lock to prevent using it.. the other uses that are flagged in this bug are the outlier. so I think we can fix it by just putting them under a locked check of that flag and making that flag an atomic.

Patrick McManus [:mcmanus]

Assignee

Updated

•

9 years ago

Flags: needinfo?(dd.mozilla)

Patrick McManus [:mcmanus]

Assignee

Comment 6

•

9 years ago

Attached patch telemetry after dns shutdown — Details — Splinter Review

Attachment #8651010 - Flags: review?(dd.mozilla)

Patrick McManus [:mcmanus]

Assignee

Updated

•

9 years ago

Assignee: nobody → mcmanus

Status: NEW → ASSIGNED

Patrick McManus [:mcmanus]

Assignee

Updated

•

9 years ago

Flags: needinfo?(dd.mozilla)

Dragana Damjanovic [:dragana]

Comment 7

•

9 years ago

Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown Review of attachment 8651010 [details] [diff] [review]: ----------------------------------------------------------------- Looks good

Attachment #8651010 - Flags: review?(dd.mozilla) → review+

Patrick McManus [:mcmanus]

Assignee

Comment 8

•

9 years ago

Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown [Security approval request comment] How easily could an exploit be constructed based on the patch? you could get a uaf crash - it would effect the heap (not the stack) and it would have very limited attacker controllable data (maybe or 3 bytes worth if you could control the timing really closely). Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? pretty much the patch itself identifes the problem. Which older supported branches are affected by this flaw? all of them If not all supported branches, which bug introduced the flaw? n/a Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? can make them trivially if this does not apply cleanly How likely is this patch to cause regressions; how much testing does it need? This is simple and low risk to regression. It effects shutdown code paths so would be well exercised in normal testing.

Attachment #8651010 - Flags: sec-approval?

Daniel Veditz [:dveditz]

Comment 9

•

9 years ago

Comment on attachment 8651010 [details] [diff] [review] telemetry after dns shutdown [Triage Comment] Let's land this at least on Aurora, a=dveditz After some testing we can then look at landing on Beta. sec-approval = dveditz

Attachment #8651010 - Flags: sec-approval?

Attachment #8651010 - Flags: sec-approval+

Attachment #8651010 - Flags: approval-mozilla-aurora+

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Keywords: sec-critical → sec-high

Ryan VanderMeulen [:RyanVM]

Comment 10

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/a7d4d29b32ac https://hg.mozilla.org/releases/mozilla-aurora/rev/197a0c3b8d8e Sounds like this is going to need to land on esr38 as well?

status-b2g-v2.0: --- → wontfix

status-b2g-v2.0M: --- → wontfix

status-b2g-v2.1: --- → wontfix

status-b2g-v2.1S: --- → affected

status-b2g-v2.2: --- → affected

status-b2g-v2.2r: --- → affected

status-b2g-master: --- → affected

status-firefox42: affected → fixed

status-firefox-esr38: --- → affected

tracking-firefox-esr38: --- → ?

Flags: in-testsuite?

Ryan VanderMeulen [:RyanVM]

Comment 11

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/a7d4d29b32ac

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-b2g-master: affected → fixed

status-firefox43: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla43

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

status-firefox40: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Comment 12

•

9 years ago

Worth considering this for beta/esr38 approval now?

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 13

•

9 years ago

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #12) > Worth considering this for beta/esr38 approval now? tough call - I think both risk and reward are fairly low here, so I would probably favor stability and not port it.

Flags: needinfo?(mcmanus)

BMO Automation

Updated

•

9 years ago

Group: core-security → core-security-release

Liz Henry (:lizzard) (relman/hg->git project)

Comment 14

•

9 years ago

We can take a look at this again for esr 38.4.0.

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

status-b2g-v2.1S: affected → wontfix

status-firefox41: affected → wontfix

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 15

•

9 years ago

Patrick, based on comment 13, are you suggesting we wont fix this for esr38? Or given that the fix has been stabilized in other branches for a few weeks now, do you think it is safe to uplift to esr38.4.0? Thanks!

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 16

•

9 years ago

right - I would say the impact of this bug is low.

Flags: needinfo?(mcmanus)

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 17

•

9 years ago

This issue is deemed low impact and therefore not worth backporting to esr38.

status-firefox-esr38: affected → wontfix

tracking-firefox-esr38: ? → -

Carsten Book [:Tomcat]

Comment 18

•

9 years ago

Hi, for 2.2. this needs maybe backporting or rebase, since this has conflicts to apply: merging netwerk/dns/nsHostResolver.cpp warning: conflicts during merge. merging netwerk/dns/nsHostResolver.cpp incomplete! (edit conflicts, then use 'hg resolve --mark') merging netwerk/dns/nsHostResolver.h abort: unresolved conflicts, can't continue (use hg resolve and hg graft --continue)

Flags: needinfo?(mcmanus)

Matt Wobensmith [:mwobensmith][:matt:]

Updated

•

9 years ago

Whiteboard: [post-critsmash-triage]

Patrick McManus [:mcmanus]

Assignee

Comment 19

•

9 years ago

I think this should be wontfixed on the b2g branches for the same reasoning as esr (low impact vs conflicts)

Flags: needinfo?(mcmanus)

Al Billings [:abillings - ex-MoCo]

Updated

•

9 years ago

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main42+]

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Group: core-security-release

You need to log in before you can comment on or make changes to this bug.