Closed Bug 1500861 Opened 11 months ago Closed 11 months ago

Add shutdownWithTimeout method to nsIThread and nsIThreadPool

Categories

(Core :: XPCOM, defect)

60 Branch
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- wontfix
firefox64 --- verified
firefox65 --- verified

People

(Reporter: valentin, Assigned: valentin)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

We need this as a to prevent shutdown hangs when a thread is doing a blocking call that doesn't return.
Nathan is this something you can look at or work with Valentin to design? It's a blocker for a somewhat frequent shutdown hang. The idea is to provide a mechanism that allows us to shutdown a threadpool that has threads that are potentially blocked (calling gethostaddr in this instance).

An initial draft [1] proposed a 'leak these threads intentionally' interface, but I didn't feel comfortable with that approach.

[1] https://phabricator.services.mozilla.com/D9024
Flags: needinfo?(nfroyd)
(In reply to Eric Rahm [:erahm] from comment #1)
> Nathan is this something you can look at or work with Valentin to design?
> It's a blocker for a somewhat frequent shutdown hang. The idea is to provide
> a mechanism that allows us to shutdown a threadpool that has threads that
> are potentially blocked (calling gethostaddr in this instance).
> 
> An initial draft [1] proposed a 'leak these threads intentionally'
> interface, but I didn't feel comfortable with that approach.

I think there are fundamentally two approaches to this problem of threads being "stuck":

1) Some way to "cancel" the thread in the middle of an operation.
2) Accept that the thread is just going to leak.

The first one is accomplished via pthread_cancel on Unix-y systems and--I think--TerminateThread on Windows.  TerminateThread is a lot stronger (the target thread is given no chance to execute any sort of cleanup) and comes with strong warnings against its use in the documentation.  I don't think I'd want to use it to terminate threads running random Win32 code.  pthread_cancel is at least in theory usable, but my impression is that it's only usable in an environment where you wrote code that paid some attention to the possibility of cancelability...and I'm betting that Gecko (and its numerous attendant libraries) were not written with the necessary level of attention.

That leaves the second option, which is more understandable and (to my mind) more reasonable, because you're shutting down anyway, so you don't really care about leaks.  What don't you like about the leaking approach?  (Or is there a third option that I'm not thinking of?)
Flags: needinfo?(nfroyd)
OK, so I talked to erahm on IRC and he indicated his concerns were not so much with the "leak the threads" idea as with the particular route taken to implement said idea:

10:49 AM <@erahm> froydnj: so my thoughts re: dns thread leaking, I'm okay leaking the threads (it's what we did before), I just didn't like the proposed api (limit the threadpool to 0 threads, sleep(100ms), nsThreadPool::leakAllTheThings()) and was suggesting making thread shutdown take a timeout instead

which seems reasonable.  After some discussion, it's also clear that:

 pool->ShutdownWithTimeout(...);

is a better interface than:

 pool->SetThreadLimit(0);
 Sleep(timeout);
 pool->MarkAllThreadsAsLeaked();
 pool->Shutdown();

as the former enables you to be more control over sleeping (assuming you don't want to spin waiting for threads) and you might be able to mark not every thread as possibly leaked.  (You could be more precise about marking threads with the latter interface, but I think it falls out more naturally with the timeout version.)

I also think that we don't want to expose nsIThread.skipShutdown; I think we want nsThreadPool to reach directly into nsThread's innards for setting flags (and also we're going to have to expose bits of nsThread::ShutdownInternal to nsThreadPool anyway for this plan to work).

So we want to implement a non-scriptable, not-xpcom'able (I think?) API on nsIThreadPool:

  void shutdownWithTimeout(in int timeout);

and it should be documented as being used in situations where threads might be stuck waiting.  `shutdown()` should be preferred for general use.  We should also add that this interface *will* leak any threads leftover after `timeout`.  (If things go well, we might be able to use it to replace the giant 20 second wait in nsHostResolver as well, which would be *great*.)

As an implementation sketch, we want to expose nsThread::ShutdownInternal to nsThreadPool.  We'll setup asynchronous shutdowns for all the threads in the pool, and then we'll spin until all the threads are shutdown, or the timeout has expired.  Any active threads at that point will be marked as leakable.

Valentin, does that seem like enough for you to move forward?
Flags: needinfo?(valentin.gosu)
(In reply to Nathan Froyd [:froydnj] from comment #3)
> As an implementation sketch, we want to expose nsThread::ShutdownInternal to
> nsThreadPool.  We'll setup asynchronous shutdowns for all the threads in the
> pool, and then we'll spin until all the threads are shutdown, or the timeout
> has expired.  Any active threads at that point will be marked as leakable.
> 
> Valentin, does that seem like enough for you to move forward?

That sounds great! Much better than what I managed to think up :)
Flags: needinfo?(valentin.gosu)
Valentin, is this something you are going to be able to work on? I unfortunately don't have time to work on it in the near term.
Assignee: erahm → nobody
(In reply to Eric Rahm [:erahm] from comment #5)
> Valentin, is this something you are going to be able to work on? I
> unfortunately don't have time to work on it in the near term.

I'm also quite busy right now, but if there's no one else do to it I'll give it a shot.
Regarding the solution, would something like this be enough?
> auto startTime = Timestamp::Now();
> SpinEventLoopUntil<ProcessFailureBehavior::IgnoreAndContinue>([&](){
>   return shutdownContexts.IsEmpty() || Timestamp::Now() > startTime + BaseTimeDuration::FromMilliseconds(timeout);
> });

I assume the condition only gets evaluated when an event is processed and it might be a problem if no events are dispatched. But maybe we don't care about that?
SpinEventLoopUntil does indeed block, so if there are no events, that'd be a problem.  That's exactly what's getting us into trouble with the shutdown hangs: we're expecting an nsThreadShutdownAckEvent to be dispatched to our event loop while we're spinning, but the thread getting shutdown is hanging somewhere and not able to dispatch the event, so we never see that event.

So we'd need to do something more akin to what NS_ProcessPendingEvents does (i.e. spin by not blocking when calling ProcessNextEvent).
* This is a WIP patch to get some feedback
* It changes nsHostResolver adding a task with an infinite loop to demonstrate the problem (will be removed from final patch, maybe replaced with a gtest)
* The IDL method is not [notxpcom] yet
* Needs comments
Assignee: nobody → valentin.gosu
Attachment #9019527 - Attachment description: Bug 1500861 - [WIP] Add shutdownWithTimeout method to nsIThreadPool → Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/129d9009661f
Add shutdownWithTimeout method to nsIThreadPool r=froydnj,erahm
https://hg.mozilla.org/mozilla-central/rev/129d9009661f
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Comment on attachment 9019527 [details]
Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1471280

User impact if declined: Potential shutdown hang when DNS threads are stuck making blocking calls.
Bad user experience when trying to reopen firefox, and they get a message that firefox is already running.

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: No

Needs manual test from QE?: Yes

If yes, steps to reproduce: Technically doesn't need QE, as we should be able to observe the crashes going away, but for additional safety it can be tested:
Set the following in /etc/resolv.conf
```
options timeout:30 attempts:5
nameserver 72.66.115.13
```
Open firefox, go to example.com, close firefox immediately... we should have a shutdown hang.

With the fix, instead of a shutdown hang, firefox should exit after about 25 seconds.

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): The mechanism to leak the threads is quite straight-forward.
The behaviour to leak the DNS threads is also something we did prior to bug 1471280.
It would be great if we could take this early in the beta to make sure it gets 5-6 weeks of testing before release.

String changes made/needed:
Attachment #9019527 - Flags: approval-mozilla-beta?
Blocks: 1471280
Flags: qe-verify+
Target Milestone: mozilla64 → mozilla65
Comment on attachment 9019527 [details]
Bug 1500861 - Add shutdownWithTimeout method to nsIThreadPool

[Triage Comment]
Fixes a DNS shutdown hang. Approving for 64.0b5 so it gets bake time.
Attachment #9019527 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I have managed to reproduce the issue using Fx65.0a1 buildID: 20181022220734. After following the steps mentioned,  a hand would be noticed in the system monitor for more than 30 seconds. That hang would end with a firefox crash report.

The issue is verified as fixed using Fx64.0b5 and latest Fx65.0a1 on Ubuntu 16.04 x64 LTS. Firefox is closed and the process 'disappears' from the system monitor after about 20-25 seconds and no crash report is displayed at the end of that period.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
(In reply to Cristian Baica [:cbaica], Release Desktop QA from comment #15)
> I have managed to reproduce the issue using Fx65.0a1 buildID:
> 20181022220734. After following the steps mentioned,  a hand would be
> noticed in the system monitor for more than 30 seconds. That hang would end
> with a firefox crash report.
> 
> The issue is verified as fixed using Fx64.0b5 and latest Fx65.0a1 on Ubuntu
> 16.04 x64 LTS. Firefox is closed and the process 'disappears' from the
> system monitor after about 20-25 seconds and no crash report is displayed at
> the end of that period.

\o/ Thank you!  (And thank you to Valentin for fixing this!)  I wonder if it's worth decreasing that 20s timeout number; we are shutting down at this point, after all...OTOH, maybe there's still some lingering website communication that needs to happen, or something?
Depends on: 1503725
Depends on: 1504335
You need to log in before you can comment on or make changes to this bug.