Open Bug 1482224 Opened Last year Updated 22 days ago

Crash in shutdownhang | nsThread::Shutdown | mozilla::net::WaitForThreadShutdown::Run | nsThread::Shutdown | AsyncLocalFileWinDone::Run

Categories

(Core :: Networking, defect, P2, critical)

Tracking

REOPENED
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- unaffected
firefox62 --- unaffected
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- fix-optional
firefox69 --- fix-optional

People

(Reporter: marcia, Unassigned, NeedInfo)

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is about
report bp-c42c7a41-8703-483d-b645-64e890180809.
===========================================================

Seen during nightly crash triage - crashes started appearing in build 20180710222524: https://bit.ly/2vxQ5se. Filing in XPCOM but likely not the correct bucket. Seems to be thread related.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3d20b0701781731e0f9b08e1cd40ac842f385e03&tochange=70f901964f9725e6dfd09750f48996e9d6670492

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:168
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1167
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:939
8 xul.dll mozilla::net::WaitForThreadShutdown::Run netwerk/base/nsPACMan.cpp:187
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1235

=============================================================
This looks proxy related. We have `nsPACMan` shutting down on frame 8 [1]. If you look at the crash report, thread 35 (proxy resolution) is spinning its event loop waiting on a result [2], possibly blocking the thread shutdown. Seems like checking for shutdown in the spin event loop lambda might help?

That range has bug 356831 in it, but I think it got backed out and re-landed a week later.

[1] https://hg.mozilla.org/mozilla-central/annotate/cea8cf3f6abefaccb80a640b47aad8bcdb3115ae/netwerk/base/nsPACMan.cpp#l187
[2] https://hg.mozilla.org/mozilla-central/annotate/cea8cf3f6abefaccb80a640b47aad8bcdb3115ae/netwerk/base/ProxyAutoConfig.cpp#l455
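The suggestion above can be sketched as a wait whose predicate also checks a shutdown flag, so a lookup that never completes can no longer hang thread shutdown. This is a minimal illustration, not the actual ProxyAutoConfig code; the names (`mShutdown`, `OnLookupComplete`) only loosely mirror the real Necko classes, and the real loop spins an event queue rather than a condition variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified stand-in for the PAC thread's "wait until the pending
// request completes" logic. Illustrative only; not Mozilla code.
class PacWaiter {
 public:
  // Returns true if the request completed, false if we bailed out
  // because shutdown was requested while the lookup was still pending.
  bool WaitForResult() {
    std::unique_lock<std::mutex> guard(mLock);
    // The key change: the wait predicate also checks mShutdown.
    mCondVar.wait(guard, [this] { return mCompleted || mShutdown; });
    return mCompleted;
  }

  void OnLookupComplete() {
    { std::lock_guard<std::mutex> g(mLock); mCompleted = true; }
    mCondVar.notify_all();
  }

  void Shutdown() {
    { std::lock_guard<std::mutex> g(mLock); mShutdown = true; }
    mCondVar.notify_all();
  }

 private:
  std::mutex mLock;
  std::condition_variable mCondVar;
  bool mCompleted = false;
  bool mShutdown = false;
};
```

With this predicate, a thread stuck in `WaitForResult()` is released by `Shutdown()` even though the lookup never finished, which is exactly the behavior the crash reports are missing.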
Component: XPCOM → Networking
Adding the Mac signature which showed up starting in 20180809101613.
Crash Signature: [@ shutdownhang | nsThread::Shutdown | mozilla::net::WaitForThreadShutdown::Run | nsThread::Shutdown | AsyncLocalFileWinDone::Run] → [@ shutdownhang | nsThread::Shutdown | mozilla::net::WaitForThreadShutdown::Run | nsThread::Shutdown | AsyncLocalFileWinDone::Run] [@ shutdownhang | __psynch_cvwait | <name omitted> | nsThread::Shutdown | mozilla::net::WaitForThreadShutdown::Run | nsThre…
OS: Windows 10 → All
Hardware: Unspecified → All
Version: 61 Branch → Trunk
Assignee: nobody → polly.shaw
I have taken this, but all I plan to do is implement Eric's suggestion of exiting the spin event loop if mShutdown is true. I won't be able to get to this until this evening, and I'll have to get myself set up with Arc for reviewing, so I won't be offended if anyone de-assigns me.
Thanks for taking this Polly. If you have any trouble with this please let me know. Cheers!
Priority: -- → P2
Whiteboard: [necko-triaged]
(In reply to Eric Rahm [:erahm] from comment #1)
> This looks proxy related. We have `nsPacMan` shutting down on frame 8 [1].
> If you look at the crash report, thread 35 (proxy resolution) is spinning
> its event loop waiting on a result [2], possibly blocking the thread
> shutdown. Seems like checking for shutdown in the spin event loop lambda
> might help?
> 
> That range has bug 356831 in it, but I think it got backed out and re-landed
> a week later.
> 
> [1]
> https://hg.mozilla.org/mozilla-central/annotate/cea8cf3f6abefaccb80a640b47aad8bcdb3115ae/netwerk/base/nsPACMan.cpp#l187
> [2]
> https://hg.mozilla.org/mozilla-central/annotate/cea8cf3f6abefaccb80a640b47aad8bcdb3115ae/netwerk/base/ProxyAutoConfig.cpp#l455

I think shutting down the DNS resolver should call OnLookupComplete, which would make the spin stop. We should take a look at why this is not happening.
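The expected behavior described above is that a resolver, on shutdown, still delivers a completion callback (with a failure status) to every pending request, which is what lets the waiting loop exit. The sketch below models that contract with hypothetical names (`ToyResolver`, `kStatusAborted` standing in for `NS_ERROR_ABORT`); it is not the real nsIDNSService interface.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative resolver: shutdown completes (cancels) every
// outstanding request rather than silently dropping it.
class ToyResolver {
 public:
  static constexpr int kStatusOk = 0;
  static constexpr int kStatusAborted = -1;  // stand-in for NS_ERROR_ABORT

  void AsyncResolve(std::function<void(int)> aOnLookupComplete) {
    mPending.push_back(std::move(aOnLookupComplete));
  }

  void Shutdown() {
    // Deliver a completion callback for everything still pending.
    for (auto& cb : mPending) cb(kStatusAborted);
    mPending.clear();
  }

 private:
  std::vector<std::function<void(int)>> mPending;
};
```

If the real resolver upholds this contract, the spin loop's exit condition flips during shutdown; the open question in this comment is why that is not happening in the crashing sessions.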
This crash arises from a loop in ProxyAutoConfig which spins until a
pending PAC request is complete, even when the ProxyAutoConfig object is being
shut down. The patch prevents the crash by exiting the loop when the
mShutdown flag is set. It does not attempt to address the underlying cause
of why the request remains pending, although that is also a concern.
Files changed:
netwerk/base/ProxyAutoConfig.cpp - expanded the lambda that decides when
to stop waiting for the request to complete so that it also checks the
mShutdown flag, and logs a warning if the loop is exited because of mShutdown.
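The shape of the patched wait loop can be sketched as follows. `SpinUntil`, `WaitForRequest`, and the bounded iteration count are all illustrative stand-ins: the real code spins the thread's event queue via Mozilla's spin-event-loop helper rather than busy-polling.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>

// Polling stand-in for the spin-event-loop helper; returning false
// would correspond to the shutdownhang seen in the bug.
static bool SpinUntil(const std::function<bool()>& aPredicate,
                      int aMaxIterations) {
  for (int i = 0; i < aMaxIterations; ++i) {
    if (aPredicate()) {
      return true;
    }
  }
  return false;
}

// The expanded predicate from the patch description: stop waiting when
// the request completes OR when shutdown is flagged, warning in the
// latter case.
static bool WaitForRequest(const bool& aRequestComplete,
                           const bool& aShutdown) {
  return SpinUntil(
      [&] {
        if (aShutdown) {
          std::fprintf(stderr,
                       "warning: leaving PAC wait loop at shutdown with "
                       "a request still pending\n");
          return true;
        }
        return aRequestComplete;
      },
      1000);
}
```

The extra branch is deliberately cheap: the common path (request completes normally) is unchanged, and the shutdown path only adds one flag check plus a warning for diagnosability.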
Comment on attachment 8999432 [details]
Crash in Shutdown due to ProxyAutoConfig still aiming to resolve address

Valentin Gosu [:valentin] has approved the revision.
Attachment #8999432 - Flags: review+
My patch for bug 365831 did add something else for the PAC thread to do, namely a DHCP lookup, and I don't know whether this could be the cause of, or an exacerbation of, this issue.

I did note, however, while investigating the memory leak (which involved an object graph of classes being left on the PAC thread after shutdown), that this usually happened when the proxy was *not* being auto-detected, and therefore no DHCP request was made in the background. Invariably, the root object of type ExecutePACThreadAction left on the thread was a ProcessPendingQ action rather than a DHCP query. The reason an iteration of my patch caused this memory leak was that I had made the PAC thread shut down more frequently than it had previously done (my subsequent fix reverted this), but it could still be happening, just in circumstances too rare to appear in the CI tests.

Not sure whether this is relevant, but it seems like a similar issue - things on the PAC thread not being cleared up when they're expected to be.
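The leak symptom described here (runnables left on a thread at shutdown keeping an object graph alive) can be modeled with a toy event queue. `ToyEventQueue` and the shared_ptr standing in for the ExecutePACThreadAction graph are hypothetical; this is not nsThread's actual dispatch logic.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Toy event queue: events dispatched after shutdown are dropped here;
// in the real bug they were instead left on the dead thread, keeping
// whatever they owned alive (the leak). Purely illustrative.
class ToyEventQueue {
 public:
  bool Dispatch(std::function<void()> aEvent) {
    if (!mAccepting) {
      return false;  // too late: the event will never run
    }
    mEvents.push(std::move(aEvent));
    return true;
  }

  void DrainAndShutdown() {
    while (!mEvents.empty()) {
      mEvents.front()();
      mEvents.pop();
    }
    mAccepting = false;
  }

 private:
  bool mAccepting = true;
  std::queue<std::function<void()>> mEvents;
};
```

The interesting failure mode is the last step: an event that captures a strong reference but never runs (and is never destroyed, as on a dead thread) pins the whole graph, which matches the ProcessPendingQ observation above.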
Can we take a look at why DNS is not shutting down? It should be, and it does call OnLookupComplete for all pending requests.
Crash volume is still fairly low - are we still investigating the issue in Comment 9 or are we going to land the patch? Thanks.
> Crash volume is still fairly low - are we still investigating the issue in
> Comment 9 or are we going to land the patch? Thanks.

I haven't been able to reproduce the crash on Windows, so my plan was to land the patch, which warns if shutdown is called whilst still waiting for requests to be resolved, and leave the investigation open. According to the Lando UI, my problems with landing the patch are twofold: first, I don't have level 3 commit access, which is expected; but the patch itself also doesn't seem to be landable, as I created it using `hg diff` rather than Arcanist. I have now followed the new instructions for setting up Arcanist on Windows but haven't yet created the diff. Unfortunately I think I will need another review when I land the patch generated from Lando.
Attachment #9003132 - Attachment description: Bug 1482224 Crash in Shutdown due to ProxyAutoConfig still aiming to resolve address This crash has arisen due to a loop in ProxyAutoConfig which spins until a pending PAC request is complete, even when the ProxyAutoConfig object is being shut... → Bug 1482224 Crash in Shutdown due to ProxyAutoConfig still aiming to resolve addressThis crash has arisen due to a loop in ProxyAutoConfig which spins until apending PAC request is complete, even when the ProxyAutoConfig object is beingshut...
Have now submitted another patch using arc diff, which I hope will be landable. The previous one has the following error: 
```
This diff does not have the proper author information uploaded to Phabricator. This can happen if the diff was created using the web UI, or a non standard client. The author should re-upload the diff to Phabricator using the "arc diff" command. 
```
Comment on attachment 9003132 [details]
Bug 1482224 Crash in Shutdown due to ProxyAutoConfig still aiming to resolve address. This crash has arisen due to a loop in ProxyAutoConfig which spins until a pending PAC request is complete, even when the ProxyAutoConfig object is being shut...

Valentin Gosu [:valentin] has approved the revision.
Attachment #9003132 - Flags: review+
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/cc2cb08f3759
Crash in Shutdown due to ProxyAutoConfig still aiming to resolve address. This crash has arisen due to a loop in ProxyAutoConfig which spins until a pending PAC request is complete, even when the ProxyAutoConfig object is being shut... r=valentin
https://hg.mozilla.org/mozilla-central/rev/cc2cb08f3759
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Flags: qe-verify+
Attachment #8999432 - Attachment is obsolete: true
Crash reports with this signature are still showing up in the 63 release build. Should we reopen this bug or create a new one for this?
Flags: needinfo?(polly.shaw)
I would say we reopen this bug, as it seems to be the same thing and the original investigation didn't actually pinpoint the cause.
Flags: needinfo?(polly.shaw)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Polly, do you still intend to investigate these remaining crashes in beta 65?
Flags: needinfo?(polly.shaw)
I haven’t had time to look at it yet but was planning to over the Christmas break by January 2, so most likely for 66.
Flags: needinfo?(polly.shaw)
My findings:

In general, WPAD is not enabled in these crashes: I haven't seen any where network.proxy.type is set in the telemetry environment settings, and the activity on the ProxyResolution thread sometimes shows evidence that System Proxy settings are being used - e.g. the preference for the PAC URI is being retrieved in https://crash-stats.mozilla.com/report/index/423aaaff-41ab-4ce2-80e1-fa0600181218#allthreads. 

The ProxyResolution thread isn't always doing the same thing when the main thread crashes. In https://crash-stats.mozilla.com/report/index/423aaaff-41ab-4ce2-80e1-fa0600181218#allthreads (the same link as above) it's getting a preference, but in https://crash-stats.mozilla.com/report/index/39d64db1-05e7-4c8e-96c7-8b8eb0181219#allthreads it's disposing of the JSContext that's used to parse the PAC file.
Clarification - in https://crash-stats.mozilla.com/report/index/423aaaff-41ab-4ce2-80e1-fa0600181218#allthreads the action on the ProxyResolution thread is retrieving not a Firefox preference but the system proxy settings.
In what seems to be the majority* of these crashes, threads other than ProxyResolution are also doing proxy related things. For example in https://crash-stats.mozilla.com/report/index/f73ee32b-edaf-4506-bf15-456250181218, Thread 38 is updating the proxy settings due to windows changes, while the ProxyResolution thread is looking up the system proxy. And in https://crash-stats.mozilla.com/report/index/bc4f0415-11f5-44e4-9c8b-d87120190105#allthreads, a StreamTrans thread is looking up the system proxy while the ProxyResolution thread is also doing this. I wonder if there is some sort of deadlock happening within the Windows DLLs.

* the term 'majority' is not arrived at by data analysis but just by me clicking through a handful of crashes.

Adding 66 as affected.

Target Milestone: mozilla63 → ---

polly.shaw, since this is assigned to you, I'm checking in to see whether this will get fixed in time for the 67 soft freeze (Mar 11)?

Flags: needinfo?(polly.shaw)

Hi Neha,

No - I haven't got a way forward on this and I don't have a huge amount of time to see if I can replicate the deadlock which I think might be the cause of the issue. I have a couple of days off on March 21st-22nd so I will try to have a look at it then - but in the meantime I have unassigned myself from this.

Polly

Assignee: polly.shaw → nobody
Flags: needinfo?(polly.shaw)

Dragana, can you please look into this, or find someone who can? We still see crashes in beta and would like a patch to uplift for 67 beta, if possible.

Flags: needinfo?(dd.mozilla)
Crash Signature: nsThread::Shutdown | nsDestroyThreadEvent::Run] → nsThread::Shutdown | nsDestroyThreadEvent::Run] [@ shutdownhang | nsThread::Shutdown | AsyncLocalFileWinDone::Run ]