Closed Bug 1498782 Opened 6 years ago Closed 6 years ago

Crash in shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown in nsHostResolver::Shutdown

Tracking

()

Status:

RESOLVED WONTFIX

Tracking Flags:

Tracking

Status

firefox-esr60

---

wontfix

firefox62

---

wontfix

firefox63

wontfix

firefox64

fixed

firefox65

---

fixed

People

(Reporter: philipp, Assigned: valentin)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [necko-triaged])

Crash Data

Attachments

(2 files, 1 obsolete file)

Bug 1498782 - Skip thread shutdown in nsHostResolver if there are still active threads after a delay r?bagder! 6 years ago Valentin Gosu [:valentin] (he/him) 46 bytes, text/x-phabricator-request		Details \| Review
Bug 1498782 - Add way to leak stuck threads r?kmag! 6 years ago Valentin Gosu [:valentin] (he/him) 46 bytes, text/x-phabricator-request		Details \| Review
Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4 6 years ago Eric Rahm [:erahm] 12.05 KB, patch	valentin : review+ pascalc : approval-mozilla-release-	Details \| Diff \| Splinter Review

[:philipp]

Reporter

Description

•

6 years ago

This bug was filed from the Socorro interface and is report bp-917aef33-c044-4c8b-884c-44ee40181013. ============================================================= Top 10 frames of crashing thread: 0 ntdll.dll NtWaitForAlertByThreadId 1 ntdll.dll RtlSleepConditionVariableSRW 2 kernelbase.dll SleepConditionVariableSRW 3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58 4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:168 5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093 6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519 7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:871 8 xul.dll nsThreadPool::Shutdown xpcom/threads/nsThreadPool.cpp:347 9 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:771 ============================================================= shutdownhang reports with such a stack have become more common across platforms in the 63 cycle after the landing of bug 1471280. the volume is low to mid level with 5-20 daily reports on beta currently. the [@shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown] signature covers a number of different issues, this socorro query should narrow it down to the shutdown of DNS resolver threads: https://crash-stats.mozilla.com/search/?proto_signature=~nsHostResolver%3A%3AShutdown&release_channel=release&release_channel=beta&release_channel=nightly&build_id=%3E20180605171542&product=Firefox&date=%3E%3D2018-06-01&_facets=signature&_facets=version&_facets=build_id&_facets=platform_pretty_version&_facets=cpu_arch#facet-version

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 1

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca9159f7ea8623140bc886859db8c02cf0872b84

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Assignee: nobody → valentin.gosu

Priority: -- → P2

Whiteboard: [necko-triaged]

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 2

•

6 years ago

Attached file Bug 1498782 - Skip thread shutdown in nsHostResolver if there are still active threads after a delay r?bagder! — Details

Pulsebot

Comment 3

•

6 years ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/a4b43a47589a Skip thread shutdown in nsHostResolver if there are still active threads after a delay r=bagder

Eliza Balazs [:ebalazs_]

Comment 4

•

6 years ago

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 5

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=89be6aa2888493780163d6dfd43af472d2191c39

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 6

•

6 years ago

Attached file Bug 1498782 - Add way to leak stuck threads r?kmag! (obsolete) — Details

In bug 1471280 we converted the DNS resolver threads from PR_Thread to using an nsThreadPool. However, sometimes the threads may be stuck in getaddrinfo with no way of killing the thread or knowing if it's ever going to resolve. This means we need a way to instruct nsThreadManager to skip shutdown for certain threads.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 7

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1b7e8114c56a53519e059d27e10ef24e5fc2390

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Flags: needinfo?(valentin.gosu)

Priority: P2 → P1

Sylvestre Ledru [:Sylvestre]

Updated

•

6 years ago

status-firefox62: --- → wontfix

status-firefox-esr60: --- → affected

Keywords: topcrash

Pascal Chevrel:pascalc

Updated

•

6 years ago

tracking-firefox63: --- → +

Eric Rahm [:erahm]

Comment 8

•

6 years ago

I commented in Phabricator, but it probably makes more sense here: The vast majority of these hangs on are on Windows, and Windows has a solution for this: GetAddrInfoEx [1]. This has been available since Vista, so it should be okay to switch over. I know that's more work, but I think it's the right solution to your problem. Of course it's possible there are technical limitations to switching over, if that's the case we can discuss changes to the thread api. In general adding a nsIThread::Cancel function might be useful, but that might requiring messing with NSPR which is a much larger endeavor. [1] https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa

Daniel Stenberg [:bagder]

Comment 9

•

6 years ago

> The vast majority of these hangs on are on Windows Because the vast majority of our users are on Windows, right? I mean, this issue is real on all platforms as getaddrinfo() can indeed take quite a long time to return. I'm simply suggesting that this (sort of) solution is probably still necessary even if we can improve it further on Windows.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 10

•

6 years ago

>>! In D9024#218150, @erahm wrote: > This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue. I agree. But I'm thinking about an immediate fix here. We can: 1. Leak the thread, or shutdown with a timeout. 2. Switch back to using PR_Thread, and still keep leaking the thread. 3. Not do anything about the threads, implement async DNS resolution (The 40-50 shutdown hangs a day we have now will probably increase on release; also, see below). > The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over. That is closer to the correct solution, but I wouldn't be confident to uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure it works on Android), and CFHost on OSX makes it even more tricky. And there's a real risk that not all APIs work the same way on the same platform, leading to weird DNS bugs.

Liz Henry (:lizzard) (relman/hg->git project)

Comment 11

•

6 years ago

Tracking for 64 as well.

tracking-firefox64: --- → +

Eric Rahm [:erahm]

Comment 12

•

6 years ago

(In reply to Daniel Stenberg [:bagder] from comment #9) > > The vast majority of these hangs on are on Windows > > Because the vast majority of our users are on Windows, right? I mean, this > issue is real on all platforms as getaddrinfo() can indeed take quite a long > time to return. > > I'm simply suggesting that this (sort of) solution is probably still > necessary even if we can improve it further on Windows. It's also possible linux/mac getaddrinfo impls are less likely to get stuck. Regardless, I think stability folks would be probably be happy with a Windows-only fix for the time being.

Eric Rahm [:erahm]

Comment 13

•

6 years ago

(In reply to Valentin Gosu [:valentin] from comment #10) > >>! In D9024#218150, @erahm wrote: > > This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue. > > I agree. But I'm thinking about an immediate fix here. We can: > 1. Leak the thread, or shutdown with a timeout. > 2. Switch back to using PR_Thread, and still keep leaking the thread. > 3. Not do anything about the threads, implement async DNS resolution (The > 40-50 shutdown hangs a day we have now will probably increase on release; > also, see below). If we need a fix for 63 reverting back to using PR_Thread just in the 63 branch might be the simplest and *safest* solution (we know it's not great, but leaking pr threads on shutdown in release isn't the end of the world). For 64 adding a timeout to thread shutdown is definitely doable, I don't think I'd feel comfortable r+'ing that for 63, and we'd probably want to wait to land until after the code freeze (and then go through proper uplift procedures). This feature would be useful outside of the scope of this bug, so it wouldn't be a waste of effort. > > The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over. > > That is closer to the correct solution, but I wouldn't be confident to > uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure > it works on Android), and CFHost on OSX makes it even more tricky. And > there's a real risk that not all APIs work the same way on the same > platform, leading to weird DNS bugs. I agree, this should be something we'd try to land in 65 with plenty of time baking on Nightly.

Eric Rahm [:erahm]

Comment 14

•

6 years ago

Attached patch Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4 — Details — Splinter Review

In order to work around shutdown hangs this backs out the changesets that converted the DNS threads to using nsThreadPool. This is the proposed backout that we'd land on beta before 63 hits release (or nominate for a dot release). Valentin, does this make sense? I think I chose the right patches, but am not an expert on all the recent DNS changes. Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072

Attachment #9018752 - Flags: review?(valentin.gosu)

Eric Rahm [:erahm]

Updated

•

6 years ago

Assignee: valentin.gosu → erahm

Status: NEW → ASSIGNED

Eric Rahm [:erahm]

Comment 15

•

6 years ago

(In reply to Eric Rahm [:erahm] from comment #14) > Created attachment 9018752 [details] [diff] [review] > Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4 > > In order to work around shutdown hangs this backs out the changesets that > converted the DNS threads to using nsThreadPool. > > This is the proposed backout that we'd land on beta before 63 hits release > (or nominate for a dot release). Valentin, does this make sense? I think I > chose the right patches, but am not an expert on all the recent DNS changes. > Try run: > > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072 Just to be clear, this is a proposed very short term fix. If we can't get a patch into 63, then I'd just go with the other two options.

Assignee: erahm → valentin.gosu

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Attachment #9018752 - Flags: review?(valentin.gosu) → review+

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 16

•

6 years ago

Comment on attachment 9018752 [details] [diff] [review] Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4 [Beta/Release Uplift Approval Request] Feature/Bug causing the regression: Bug 1471280 User impact if declined: Potential shutdownhang when getaddrinfo call doesn't return in time (possible for really slow resolvers). Bad user experience if they try to open the browser again quickly after. Is this code covered by automated tests?: Yes Has the fix been verified in Nightly?: No Needs manual test from QE?: Yes If yes, steps to reproduce: Set system resolver to a really slow one. Resolve a webpage. Close firefox. Make sure no shutdown crash occurs. Start firefox 30 seconds after, and make sure that you can. List of other uplifts needed: None Risk to taking this patch: Low Why is the change risky/not risky? (and alternatives if risky): It reverts back to a behaviour we had for years. String changes made/needed:

Attachment #9018752 - Flags: approval-mozilla-beta?

Pascal Chevrel:pascalc

Comment 17

•

6 years ago

Comment on attachment 9018752 [details] [diff] [review] Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4 We beta ship sailed off a week ago and I don't think it's a good idea to take a network patch for a crash fix not even on Nightly yet the week end before shipping 63 and build an RC3 on Monday for that. It may fix the crash but it may also introduce new unknown regressions to the product just before shipping. We are also not sure that the problem would be fixed at scale for our users as we would need several days of data to see if the crashes no longer happen for them. Given the above, the fact that we would need a full day of QA for an RC, that would put our 63 release date in jeopardy so I will not take it for our Tuesday release but I am keeping the tracking flag sets to keep it on our radar for a potential dot release. [Beta/Release Uplift Approval Request] Feature/Bug causing the regression: None User impact if declined: Is this code covered by automated tests?: Yes Has the fix been verified in Nightly?: Yes Needs manual test from QE?: Yes If yes, steps to reproduce: List of other uplifts needed: None Risk to taking this patch: Low Why is the change risky/not risky? (and alternatives if risky): String changes made/needed:

Attachment #9018752 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Depends on: 1500861

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Depends on: 1500863

Phabricator Automation

Updated

•

6 years ago

Attachment #9018026 - Attachment is obsolete: true

Pascal Chevrel:pascalc

Updated

•

6 years ago

status-firefox65: --- → affected

Pascal Chevrel:pascalc

Updated

•

6 years ago

Attachment #9018752 - Flags: approval-mozilla-release? → approval-mozilla-release-

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 18

•

6 years ago

Bug 1500861 was uplifted to 64 and fixed this issue.

status-firefox64: affected → fixed

status-firefox65: affected → fixed

Pascal Chevrel:pascalc

Comment 19

•

6 years ago

Wontfix for 63 as 64 is 3 weeks away and we don't have plans for another dot release.

status-firefox63: affected → wontfix

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 20

•

6 years ago

Thanks Pascal. Closing as WONTFIX for 63. 64+ fix landed in Bug 1500861.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → WONTFIX

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox-esr60: affected → wontfix

You need to log in before you can comment on or make changes to this bug.