Closed Bug 1498782 Opened Last year Closed Last year

Crash in shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown in nsHostResolver::Shutdown

Categories

(Core :: Networking, defect, P1, critical)

63 Branch
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr60 --- wontfix
firefox62 --- wontfix
firefox63 + wontfix
firefox64 + fixed
firefox65 --- fixed

People

(Reporter: philipp, Assigned: valentin)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [necko-triaged])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-917aef33-c044-4c8b-884c-44ee40181013.
=============================================================

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:168
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:871
8 xul.dll nsThreadPool::Shutdown xpcom/threads/nsThreadPool.cpp:347
9 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:771

=============================================================

shutdownhang reports with such a stack have become more common across platforms in the 63 cycle after the landing of bug 1471280.
the volume is low to mid level with 5-20 daily reports on beta currently.

the [@shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown] signature covers a number of different issues, this socorro query should narrow it down to the shutdown of DNS resolver threads:
https://crash-stats.mozilla.com/search/?proto_signature=~nsHostResolver%3A%3AShutdown&release_channel=release&release_channel=beta&release_channel=nightly&build_id=%3E20180605171542&product=Firefox&date=%3E%3D2018-06-01&_facets=signature&_facets=version&_facets=build_id&_facets=platform_pretty_version&_facets=cpu_arch#facet-version
Assignee: nobody → valentin.gosu
Priority: -- → P2
Whiteboard: [necko-triaged]
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a4b43a47589a
Skip thread shutdown in nsHostResolver if there are still active threads after a delay r=bagder
In bug 1471280 we converted the DNS resolver threads from PR_Thread to using
an nsThreadPool. However, sometimes the threads may be stuck in getaddrinfo
with no way of killing the thread or knowing if it's ever going to resolve.
This means we need a way to instruct nsThreadManager to skip shutdown for
certain threads.
Flags: needinfo?(valentin.gosu)
Priority: P2 → P1
I commented in Phabricator, but it probably makes more sense here:

The vast majority of these hangs on are on Windows, and Windows has a solution for this: GetAddrInfoEx [1]. This has been available since Vista, so it should be okay to switch over. I know that's more work, but I think it's the right solution to your problem. Of course it's possible there are technical limitations to switching over, if that's the case we can discuss changes to the thread api. In general adding a nsIThread::Cancel function might be useful, but that might requiring messing with NSPR which is a much larger endeavor.

[1] https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa
> The vast majority of these hangs on are on Windows

Because the vast majority of our users are on Windows, right? I mean, this issue is real on all platforms as getaddrinfo() can indeed take quite a long time to return.

I'm simply suggesting that this (sort of) solution is probably still necessary even if we can improve it further on Windows.
>>! In D9024#218150, @erahm wrote:
> This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue.

I agree. But I'm thinking about an immediate fix here. We can:
1. Leak the thread, or shutdown with a timeout.
2. Switch back to using PR_Thread, and still keep leaking the thread.
3. Not do anything about the threads, implement async DNS resolution (The 40-50 shutdown hangs a day we have now will probably increase on release; also, see below).

> The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over.

That is closer to the correct solution, but I wouldn't be confident to uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure it works on Android), and CFHost on OSX makes it even more tricky. And there's a real risk that not all APIs work the same way on the same platform, leading to weird DNS bugs.
(In reply to Daniel Stenberg [:bagder] from comment #9)
> > The vast majority of these hangs on are on Windows
> 
> Because the vast majority of our users are on Windows, right? I mean, this
> issue is real on all platforms as getaddrinfo() can indeed take quite a long
> time to return.
> 
> I'm simply suggesting that this (sort of) solution is probably still
> necessary even if we can improve it further on Windows.

It's also possible linux/mac getaddrinfo impls are less likely to get stuck. Regardless, I think stability folks would be probably be happy with a Windows-only fix for the time being.
(In reply to Valentin Gosu [:valentin] from comment #10)
> >>! In D9024#218150, @erahm wrote:
> > This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue.
> 
> I agree. But I'm thinking about an immediate fix here. We can:
> 1. Leak the thread, or shutdown with a timeout.
> 2. Switch back to using PR_Thread, and still keep leaking the thread.
> 3. Not do anything about the threads, implement async DNS resolution (The
> 40-50 shutdown hangs a day we have now will probably increase on release;
> also, see below).

If we need a fix for 63 reverting back to using PR_Thread just in the 63 branch might be the simplest and *safest* solution (we know it's not great, but leaking pr threads on shutdown in release isn't the end of the world).

For 64 adding a timeout to thread shutdown is definitely doable, I don't think I'd feel comfortable r+'ing that for 63, and we'd probably want to wait to land until after the code freeze (and then go through proper uplift procedures). This feature would be useful outside of the scope of this bug, so it wouldn't be a waste of effort.

> > The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over.
> 
> That is closer to the correct solution, but I wouldn't be confident to
> uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure
> it works on Android), and CFHost on OSX makes it even more tricky. And
> there's a real risk that not all APIs work the same way on the same
> platform, leading to weird DNS bugs.

I agree, this should be something we'd try to land in 65 with plenty of time baking on Nightly.
In order to work around shutdown hangs this backs out the changesets that
converted the DNS threads to using nsThreadPool.

This is the proposed backout that we'd land on beta before 63 hits release (or nominate for a dot release). Valentin, does this make sense? I think I chose the right patches, but am not an expert on all the recent DNS changes. Try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072
Attachment #9018752 - Flags: review?(valentin.gosu)
Assignee: valentin.gosu → erahm
Status: NEW → ASSIGNED
(In reply to Eric Rahm [:erahm] from comment #14)
> Created attachment 9018752 [details] [diff] [review]
> Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4
> 
> In order to work around shutdown hangs this backs out the changesets that
> converted the DNS threads to using nsThreadPool.
> 
> This is the proposed backout that we'd land on beta before 63 hits release
> (or nominate for a dot release). Valentin, does this make sense? I think I
> chose the right patches, but am not an expert on all the recent DNS changes.
> Try run:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072

Just to be clear, this is a proposed very short term fix. If we can't get a patch into 63, then I'd just go with the other two options.
Assignee: erahm → valentin.gosu
Attachment #9018752 - Flags: review?(valentin.gosu) → review+
Comment on attachment 9018752 [details] [diff] [review]
Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1471280

User impact if declined: Potential shutdownhang when getaddrinfo call doesn't return in time (possible for really slow resolvers).
Bad user experience if they try to open the browser again quickly after.

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: No

Needs manual test from QE?: Yes

If yes, steps to reproduce: Set system resolver to a really slow one.
Resolve a webpage.
Close firefox.
Make sure no shutdown crash occurs.
Start firefox 30 seconds after, and make sure that you can.

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): It reverts back to a behaviour we had for years.

String changes made/needed:
Attachment #9018752 - Flags: approval-mozilla-beta?
Comment on attachment 9018752 [details] [diff] [review]
Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4

We beta ship sailed off a week ago and I don't think it's a good idea to take a network patch for a crash fix not even on Nightly yet the week end before shipping 63 and build an RC3 on Monday for that. It may fix the crash but it may also introduce new unknown regressions to the product just before shipping. 

We are also not sure that the problem would be fixed at scale for our users as we would need several days of data to see if the crashes no longer happen for them. 

Given the above, the fact that we would need a full day of QA for an RC, that would put our 63 release date in jeopardy so I will not take it for our Tuesday release but I am keeping the tracking flag sets to keep it on our radar for a potential dot release.

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: 

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: Yes

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): 

String changes made/needed:
Attachment #9018752 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Depends on: 1500861
Depends on: 1500863
Attachment #9018026 - Attachment is obsolete: true
Attachment #9018752 - Flags: approval-mozilla-release? → approval-mozilla-release-
Bug 1500861 was uplifted to 64 and fixed this issue.
Wontfix for 63 as 64 is 3 weeks away and we don't have plans for another dot release.
Thanks Pascal.
Closing as WONTFIX for 63. 64+ fix landed in Bug 1500861.
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.