1500861 - Add shutdownWithTimeout method to nsIThread and nsIThreadPool

Valentin Gosu [:valentin] (he/him)

Assignee

Description

•

6 years ago

We need this to prevent shutdown hangs when a thread is making a blocking call that doesn't return.

Eric Rahm [:erahm]

Comment 1

•

6 years ago

Nathan, is this something you can look at or work with Valentin to design? It's a blocker for a somewhat frequent shutdown hang. The idea is to provide a mechanism that allows us to shut down a threadpool that has threads that are potentially blocked (calling gethostaddr in this instance).

An initial draft [1] proposed a 'leak these threads intentionally' interface, but I didn't feel comfortable with that approach.

[1] https://phabricator.services.mozilla.com/D9024

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 2

•

6 years ago

(In reply to Eric Rahm [:erahm] from comment #1)
> Nathan is this something you can look at or work with Valentin to design?
> It's a blocker for a somewhat frequent shutdown hang. The idea is to provide
> a mechanism that allows us to shutdown a threadpool that has threads that
> are potentially blocked (calling gethostaddr in this instance).
>
> An initial draft [1] proposed a 'leak these threads intentionally'
> interface, but I didn't feel comfortable with that approach.

I think there are fundamentally two approaches to this problem of threads being "stuck":

1) Some way to "cancel" the thread in the middle of an operation.
2) Accept that the thread is just going to leak.

The first one is accomplished via pthread_cancel on Unix-y systems and--I think--TerminateThread on Windows. TerminateThread is a lot stronger (the target thread is given no chance to execute any sort of cleanup) and comes with strong warnings against its use in the documentation. I don't think I'd want to use it to terminate threads running random Win32 code. pthread_cancel is at least in theory usable, but my impression is that it's only usable in an environment where you wrote code that paid some attention to the possibility of cancelability...and I'm betting that Gecko (and its numerous attendant libraries) were not written with the necessary level of attention.

That leaves the second option, which is more understandable and (to my mind) more reasonable, because you're shutting down anyway, so you don't really care about leaks.

What don't you like about the leaking approach? (Or is there a third option that I'm not thinking of?)

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 3

•

6 years ago

OK, so I talked to erahm on IRC and he indicated his concerns were not so much with the "leak the threads" idea as with the particular route taken to implement said idea:

10:49 AM <@erahm> froydnj: so my thoughts re: dns thread leaking, I'm okay leaking the threads (it's what we did before), I just didn't like the proposed api (limit the threadpool to 0 threads, sleep(100ms), nsThreadPool::leakAllTheThings()) and was suggesting making thread shutdown take a timeout instead

which seems reasonable. After some discussion, it's also clear that:

  pool->ShutdownWithTimeout(...);

is a better interface than:

  pool->SetThreadLimit(0);
  Sleep(timeout);
  pool->MarkAllThreadsAsLeaked();
  pool->Shutdown();

as the former gives you more control over sleeping (assuming you don't want to spin waiting for threads) and you don't necessarily have to mark every thread as possibly leaked. (You could be more precise about marking threads with the latter interface, but I think it falls out more naturally with the timeout version.)

I also think that we don't want to expose nsIThread.skipShutdown; I think we want nsThreadPool to reach directly into nsThread's innards for setting flags (and also we're going to have to expose bits of nsThread::ShutdownInternal to nsThreadPool anyway for this plan to work).

So we want to implement a non-scriptable, not-xpcom'able (I think?) API on nsIThreadPool:

  void shutdownWithTimeout(in int timeout);

and it should be documented as being used in situations where threads might be stuck waiting. `shutdown()` should be preferred for general use. We should also add that this interface *will* leak any threads leftover after `timeout`. (If things go well, we might be able to use it to replace the giant 20 second wait in nsHostResolver as well, which would be *great*.)

As an implementation sketch, we want to expose nsThread::ShutdownInternal to nsThreadPool. We'll set up asynchronous shutdowns for all the threads in the pool, and then we'll spin until all the threads are shut down, or the timeout has expired. Any active threads at that point will be marked as leakable.

Valentin, does that seem like enough for you to move forward?

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 4

•

6 years ago

(In reply to Nathan Froyd [:froydnj] from comment #3)
> As an implementation sketch, we want to expose nsThread::ShutdownInternal to
> nsThreadPool. We'll setup asynchronous shutdowns for all the threads in the
> pool, and then we'll spin until all the threads are shutdown, or the timeout
> has expired. Any active threads at that point will be marked as leakable.
>
> Valentin, does that seem like enough for you to move forward?

That sounds great! Much better than what I managed to think up :)

Flags: needinfo?(valentin.gosu)

Eric Rahm [:erahm]

Comment 5

•

6 years ago

Valentin, is this something you are going to be able to work on? I unfortunately don't have time to work on it in the near term.

Assignee: erahm → nobody

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 6

•

6 years ago

(In reply to Eric Rahm [:erahm] from comment #5)
> Valentin, is this something you are going to be able to work on? I
> unfortunately don't have time to work on it in the near term.

I'm also quite busy right now, but if there's no one else to do it I'll give it a shot.

Regarding the solution, would something like this be enough?

> auto startTime = Timestamp::Now();
> SpinEventLoopUntil<ProcessFailureBehavior::IgnoreAndContinue>([&](){
>   return shutdownContexts.IsEmpty() || Timestamp::Now() > startTime + BaseTimeDuration::FromMilliseconds(timeout);
> });

I assume the condition only gets evaluated when an event is processed and it might be a problem if no events are dispatched. But maybe we don't care about that?

Nathan Froyd [:froydnj]

Comment 7

•

6 years ago

SpinEventLoopUntil does indeed block, so if there are no events, that'd be a problem. That's exactly what's getting us into trouble with the shutdown hangs: we're expecting an nsThreadShutdownAckEvent to be dispatched to our event loop while we're spinning, but the thread being shut down is hanging somewhere and not able to dispatch the event, so we never see that event.

So we'd need to do something more akin to what NS_ProcessPendingEvents does (i.e. spin by not blocking when calling ProcessNextEvent).
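As a rough illustration of the non-blocking spin described above, here is a standalone C++ sketch. It does not use the real Gecko event loop: `SketchEventQueue` and `SpinUntilOrTimeout` are hypothetical names, and the 1 ms sleep between polls is an arbitrary choice made for the example. The point is that the deadline is checked on every iteration, whether or not any event ever arrives.

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical minimal event queue standing in for a thread's event loop.
class SketchEventQueue {
 public:
  void Dispatch(std::function<void()> event) {
    std::lock_guard<std::mutex> guard(mLock);
    mEvents.push_back(std::move(event));
  }

  // Non-blocking: runs one event if one is queued and returns true;
  // returns false immediately when the queue is empty.
  bool ProcessNextEvent() {
    std::function<void()> event;
    {
      std::lock_guard<std::mutex> guard(mLock);
      if (mEvents.empty()) {
        return false;
      }
      event = std::move(mEvents.front());
      mEvents.pop_front();
    }
    event();
    return true;
  }

 private:
  std::mutex mLock;
  std::deque<std::function<void()>> mEvents;
};

// Spins until `done()` returns true or `timeout` elapses.
// Returns true if `done()` became true, false if the deadline passed first.
template <typename Predicate>
bool SpinUntilOrTimeout(SketchEventQueue& queue, Predicate done,
                        std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    if (!queue.ProcessNextEvent()) {
      // Nothing queued: pause briefly instead of blocking on the queue.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }
  return true;
}
```

With a blocking wait for the next event, a missing acknowledgement would stall the loop indefinitely; here the same situation ends with `SpinUntilOrTimeout` returning false once the deadline passes.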

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 8

•

6 years ago

Attached file Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

* This is a WIP patch to get some feedback
* It changes nsHostResolver adding a task with an infinite loop to demonstrate the problem (will be removed from final patch, maybe replaced with a gtest)
* The IDL method is not [notxpcom] yet
* Needs comments

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Assignee: nobody → valentin.gosu

Phabricator Automation

Updated

•

6 years ago

Attachment #9019527 - Attachment description: Bug 1500861 - [WIP] Add shutdownWithTimeout method to nsIThreadPool → Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 9

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5db770ff7abc4a403c09175d2495b4294c1ac066

Pulsebot

Comment 10

•

6 years ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/129d9009661f Add shutdownWithTimeout method to nsIThreadPool r=froydnj,erahm

Natalia Csoregi [:nataliaCs]

Comment 11

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/129d9009661f

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox65: --- → fixed

Resolution: --- → FIXED

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 12

•

6 years ago

Comment on attachment 9019527 [details]
Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1471280

User impact if declined: Potential shutdown hang when DNS threads are stuck making blocking calls. Bad user experience when trying to reopen firefox, and they get a message that firefox is already running.

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: No

Needs manual test from QE?: Yes

If yes, steps to reproduce: Technically doesn't need QE, as we should be able to observe the crashes going away, but for additional safety it can be tested:

Set the following in /etc/resolv.conf

```
options timeout:30 attempts:5
nameserver 72.66.115.13
```

Open firefox, go to example.com, close firefox immediately... we should have a shutdown hang. With the fix, instead of a shutdown hang, firefox should exit after about 25 seconds.

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): The mechanism to leak the threads is quite straight-forward. The behaviour to leak the DNS threads is also something we did prior to bug 1471280. It would be great if we could take this early in the beta to make sure it gets 5-6 weeks of testing before release.

String changes made/needed:

Attachment #9019527 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox63: --- → wontfix

status-firefox64: --- → affected

status-firefox-esr60: --- → unaffected

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Blocks: 1471280

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Flags: qe-verify+

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Target Milestone: mozilla64 → mozilla65

Ryan VanderMeulen [:RyanVM]

Comment 13

•

6 years ago

Comment on attachment 9019527 [details]
Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

[Triage Comment]
Fixes a DNS shutdown hang. Approving for 64.0b5 so it gets bake time.

Attachment #9019527 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 14

•

6 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/442e20f46b5d

status-firefox64: affected → fixed

Cristian Baica [:cbaica], Release Desktop QA

Comment 15

•

6 years ago

I have managed to reproduce the issue using Fx65.0a1 buildID: 20181022220734. After following the steps mentioned, a hang would be noticed in the system monitor for more than 30 seconds. That hang would end with a firefox crash report.

The issue is verified as fixed using Fx64.0b5 and latest Fx65.0a1 on Ubuntu 16.04 x64 LTS. Firefox is closed and the process 'disappears' from the system monitor after about 20-25 seconds and no crash report is displayed at the end of that period.

Status: RESOLVED → VERIFIED

status-firefox64: fixed → verified

status-firefox65: fixed → verified

Flags: qe-verify+

Nathan Froyd [:froydnj]

Comment 16

•

6 years ago

(In reply to Cristian Baica [:cbaica], Release Desktop QA from comment #15)
> I have managed to reproduce the issue using Fx65.0a1 buildID:
> 20181022220734. After following the steps mentioned, a hang would be
> noticed in the system monitor for more than 30 seconds. That hang would end
> with a firefox crash report.
>
> The issue is verified as fixed using Fx64.0b5 and latest Fx65.0a1 on Ubuntu
> 16.04 x64 LTS. Firefox is closed and the process 'disappears' from the
> system monitor after about 20-25 seconds and no crash report is displayed at
> the end of that period.

\o/ Thank you! (And thank you to Valentin for fixing this!)

I wonder if it's worth decreasing that 20s timeout number; we are shutting down at this point, after all...OTOH, maybe there's still some lingering website communication that needs to happen, or something?

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Depends on: 1503725

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

6 years ago

Depends on: 1504335