Closed
Bug 1498782
Opened 6 years ago
Closed 6 years ago
Crash in shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown in nsHostResolver::Shutdown
Categories
(Core :: Networking, defect, P1)
Tracking
()
People
(Reporter: philipp, Assigned: valentin)
References
(Depends on 1 open bug)
Details
(Keywords: crash, regression, topcrash, Whiteboard: [necko-triaged])
Crash Data
Attachments
(2 files, 1 obsolete file)
46 bytes,
text/x-phabricator-request
|
Details | Review | |
12.05 KB,
patch
|
valentin
:
review+
pascalc
:
approval-mozilla-release-
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-917aef33-c044-4c8b-884c-44ee40181013.
=============================================================
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 kernelbase.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:168
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:871
8 xul.dll nsThreadPool::Shutdown xpcom/threads/nsThreadPool.cpp:347
9 xul.dll nsHostResolver::Shutdown netwerk/dns/nsHostResolver.cpp:771
=============================================================
shutdownhang reports with such a stack have become more common across platforms in the 63 cycle after the landing of bug 1471280.
the volume is low to mid level with 5-20 daily reports on beta currently.
the [@shutdownhang | nsThread::Shutdown | nsThreadPool::Shutdown] signature covers a number of different issues, this socorro query should narrow it down to the shutdown of DNS resolver threads:
https://crash-stats.mozilla.com/search/?proto_signature=~nsHostResolver%3A%3AShutdown&release_channel=release&release_channel=beta&release_channel=nightly&build_id=%3E20180605171542&product=Firefox&date=%3E%3D2018-06-01&_facets=signature&_facets=version&_facets=build_id&_facets=platform_pretty_version&_facets=cpu_arch#facet-version
Assignee | ||
Comment 1•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → valentin.gosu
Priority: -- → P2
Whiteboard: [necko-triaged]
Assignee | ||
Comment 2•6 years ago
|
||
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a4b43a47589a
Skip thread shutdown in nsHostResolver if there are still active threads after a delay r=bagder
Comment 4•6 years ago
|
||
Backed out changeset a4b43a47589a (Bug 1498782) for causing multimple failures e.g.: ts_paint_webext CLOSED TREE
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cusercancel%2Crunnable&revision=a4b43a47589a481f19fd60184881ed5d8caddd73&selectedJob=206036059
Failure log:
https://treeherder.mozilla.org/logviewer.html#?job_id=206028115&repo=autoland&lineNumber=1533
https://treeherder.mozilla.org/logviewer.html#?job_id=206028122&repo=autoland&lineNumber=3397
https://treeherder.mozilla.org/logviewer.html#?job_id=206032788&repo=autoland&lineNumber=51351
https://treeherder.mozilla.org/logviewer.html#?job_id=206032764&repo=autoland&lineNumber=51331
https://treeherder.mozilla.org/logviewer.html#?job_id=206032769&repo=autoland&lineNumber=868
https://treeherder.mozilla.org/logviewer.html#?job_id=206032754&repo=autoland&lineNumber=1013
https://treeherder.mozilla.org/logviewer.html#?job_id=206032803&repo=autoland&lineNumber=1429
Backout: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cusercancel%2Crunnable&revision=86474265d1630457d800c782f6ec072af87cccd3
Flags: needinfo?(valentin.gosu)
Assignee | ||
Comment 5•6 years ago
|
||
Assignee | ||
Comment 6•6 years ago
|
||
In bug 1471280 we converted the DNS resolver threads from PR_Thread to using
an nsThreadPool. However, sometimes the threads may be stuck in getaddrinfo
with no way of killing the thread or knowing if it's ever going to resolve.
This means we need a way to instruct nsThreadManager to skip shutdown for
certain threads.
Assignee | ||
Comment 7•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Flags: needinfo?(valentin.gosu)
Priority: P2 → P1
Updated•6 years ago
|
Updated•6 years ago
|
tracking-firefox63:
--- → +
Comment 8•6 years ago
|
||
I commented in Phabricator, but it probably makes more sense here:
The vast majority of these hangs on are on Windows, and Windows has a solution for this: GetAddrInfoEx [1]. This has been available since Vista, so it should be okay to switch over. I know that's more work, but I think it's the right solution to your problem. Of course it's possible there are technical limitations to switching over, if that's the case we can discuss changes to the thread api. In general adding a nsIThread::Cancel function might be useful, but that might requiring messing with NSPR which is a much larger endeavor.
[1] https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa
Comment 9•6 years ago
|
||
> The vast majority of these hangs on are on Windows
Because the vast majority of our users are on Windows, right? I mean, this issue is real on all platforms as getaddrinfo() can indeed take quite a long time to return.
I'm simply suggesting that this (sort of) solution is probably still necessary even if we can improve it further on Windows.
Assignee | ||
Comment 10•6 years ago
|
||
>>! In D9024#218150, @erahm wrote:
> This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue.
I agree. But I'm thinking about an immediate fix here. We can:
1. Leak the thread, or shutdown with a timeout.
2. Switch back to using PR_Thread, and still keep leaking the thread.
3. Not do anything about the threads, implement async DNS resolution (The 40-50 shutdown hangs a day we have now will probably increase on release; also, see below).
> The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over.
That is closer to the correct solution, but I wouldn't be confident to uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure it works on Android), and CFHost on OSX makes it even more tricky. And there's a real risk that not all APIs work the same way on the same platform, leading to weird DNS bugs.
Comment 12•6 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #9)
> > The vast majority of these hangs on are on Windows
>
> Because the vast majority of our users are on Windows, right? I mean, this
> issue is real on all platforms as getaddrinfo() can indeed take quite a long
> time to return.
>
> I'm simply suggesting that this (sort of) solution is probably still
> necessary even if we can improve it further on Windows.
It's also possible linux/mac getaddrinfo impls are less likely to get stuck. Regardless, I think stability folks would be probably be happy with a Windows-only fix for the time being.
Comment 13•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #10)
> >>! In D9024#218150, @erahm wrote:
> > This is the type of thing we'd only want to allow at xpcom shutdown -- I don't think we want this in the nsThreadPool interface and we definitely don't want it to rely on a few external steps to work properly ("before calling LeakThreads do these things"). A better interface would be to add an optional timeout for thread shutdown, but that's still papering over an issue.
>
> I agree. But I'm thinking about an immediate fix here. We can:
> 1. Leak the thread, or shutdown with a timeout.
> 2. Switch back to using PR_Thread, and still keep leaking the thread.
> 3. Not do anything about the threads, implement async DNS resolution (The
> 40-50 shutdown hangs a day we have now will probably increase on release;
> also, see below).
If we need a fix for 63 reverting back to using PR_Thread just in the 63 branch might be the simplest and *safest* solution (we know it's not great, but leaking pr threads on shutdown in release isn't the end of the world).
For 64 adding a timeout to thread shutdown is definitely doable, I don't think I'd feel comfortable r+'ing that for 63, and we'd probably want to wait to land until after the code freeze (and then go through proper uplift procedures). This feature would be useful outside of the scope of this bug, so it wouldn't be a waste of effort.
> > The vast majority of these hangs on are on Windows, and Windows has a solution for this: [GetAddrInfoEx](https://docs.microsoft.com/en-us/windows/desktop/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa). This has been available since Vista, so it should be okay to switch over.
>
> That is closer to the correct solution, but I wouldn't be confident to
> uplift such a fix. And the fact that it's getaddrinfo_a on linux (not sure
> it works on Android), and CFHost on OSX makes it even more tricky. And
> there's a real risk that not all APIs work the same way on the same
> platform, leading to weird DNS bugs.
I agree, this should be something we'd try to land in 65 with plenty of time baking on Nightly.
Comment 14•6 years ago
|
||
In order to work around shutdown hangs this backs out the changesets that
converted the DNS threads to using nsThreadPool.
This is the proposed backout that we'd land on beta before 63 hits release (or nominate for a dot release). Valentin, does this make sense? I think I chose the right patches, but am not an expert on all the recent DNS changes. Try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072
Attachment #9018752 -
Flags: review?(valentin.gosu)
Updated•6 years ago
|
Assignee: valentin.gosu → erahm
Status: NEW → ASSIGNED
Comment 15•6 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #14)
> Created attachment 9018752 [details] [diff] [review]
> Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4
>
> In order to work around shutdown hangs this backs out the changesets that
> converted the DNS threads to using nsThreadPool.
>
> This is the proposed backout that we'd land on beta before 63 hits release
> (or nominate for a dot release). Valentin, does this make sense? I think I
> chose the right patches, but am not an expert on all the recent DNS changes.
> Try run:
>
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=3fdd130dddb1b13492f7124b33390f8b23278072
Just to be clear, this is a proposed very short term fix. If we can't get a patch into 63, then I'd just go with the other two options.
Assignee: erahm → valentin.gosu
Assignee | ||
Updated•6 years ago
|
Attachment #9018752 -
Flags: review?(valentin.gosu) → review+
Assignee | ||
Comment 16•6 years ago
|
||
Comment on attachment 9018752 [details] [diff] [review]
Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4
[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: Bug 1471280
User impact if declined: Potential shutdownhang when getaddrinfo call doesn't return in time (possible for really slow resolvers).
Bad user experience if they try to open the browser again quickly after.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: No
Needs manual test from QE?: Yes
If yes, steps to reproduce: Set system resolver to a really slow one.
Resolve a webpage.
Close firefox.
Make sure no shutdown crash occurs.
Start firefox 30 seconds after, and make sure that you can.
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): It reverts back to a behaviour we had for years.
String changes made/needed:
Attachment #9018752 -
Flags: approval-mozilla-beta?
Comment 17•6 years ago
|
||
Comment on attachment 9018752 [details] [diff] [review]
Back out changeset d7788f2fd8a2, db553cb4218d, 80dd9e57c3a4
We beta ship sailed off a week ago and I don't think it's a good idea to take a network patch for a crash fix not even on Nightly yet the week end before shipping 63 and build an RC3 on Monday for that. It may fix the crash but it may also introduce new unknown regressions to the product just before shipping.
We are also not sure that the problem would be fixed at scale for our users as we would need several days of data to see if the crashes no longer happen for them.
Given the above, the fact that we would need a full day of QA for an RC, that would put our 63 release date in jeopardy so I will not take it for our Tuesday release but I am keeping the tracking flag sets to keep it on our radar for a potential dot release.
[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: None
User impact if declined:
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: Yes
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky):
String changes made/needed:
Attachment #9018752 -
Flags: approval-mozilla-beta? → approval-mozilla-release?
Updated•6 years ago
|
Attachment #9018026 -
Attachment is obsolete: true
Updated•6 years ago
|
status-firefox65:
--- → affected
Updated•6 years ago
|
Attachment #9018752 -
Flags: approval-mozilla-release? → approval-mozilla-release-
Assignee | ||
Comment 18•6 years ago
|
||
Bug 1500861 was uplifted to 64 and fixed this issue.
Comment 19•6 years ago
|
||
Wontfix for 63 as 64 is 3 weeks away and we don't have plans for another dot release.
Assignee | ||
Comment 20•6 years ago
|
||
Thanks Pascal.
Closing as WONTFIX for 63. 64+ fix landed in Bug 1500861.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Updated•5 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•