Open Bug 1477393 Opened 6 years ago Updated 8 months ago

Replace various watchdog/hang monitor threads with timers

Categories

(Core :: XPCOM, enhancement)

enhancement

Tracking

()

People

(Reporter: kmag, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [overhead:128K])

We currently have 4 watchdog/hang monitor threads per content process in nightly and beta populations:

│     ├────8,192 B (00.04%) ── BHMgr Monitor (tid=8360)
│     ├────8,192 B (00.04%) ── BHMgr Processor (tid=8361)
│     ├────8,192 B (00.04%) ── JS Watchdog (tid=8345)
│     ├────8,192 B (00.04%) ── ProcessHangMon (tid=8441)

Three of these can almost certainly be converted to timers which dispatch to a shared thread pool. BHMgr Processor can probably just use a thread pool directly.
(In reply to Kris Maglione [:kmag] from comment #0)
> Three of these [4 watchdog/hang monitor threads] can almost certainly be
> converted to timers which dispatch to
> a shared thread pool. BHMgr Processor can probably just use a thread pool
> directly.

What if the thread pool is blocked? It should be rare, but I've seen it happen with the media-playback thread pool, when we had a "dining philosophers" bug and lots of videos!
I think watchdogs that are tasked with terminating processes in case of (probably-)blocking bugs should have at least one thread to themselves.
(In reply to Gerald Squelart [:gerald] from comment #1)
> What if the thread pool is blocked? It should be rare, but I've seen it
> happen with the media-playback thread pool, when we had a "dining
> philosophers" bug and lots of videos!
> I think watchdogs that are tasked with terminating processes in case of
> (probably-)blocking bugs should have at least one thread to themselves.

I was leaning towards having a priority queue, with watchdog runnables having the highest priority, and possibly some provision for spinning up a new thread for critical events if the rest of the pool is blocked.

Each thread on Windows uses a minimum of 32K, and we already need at least one timer thread, so I'm reluctant to add any more always-running-usually-idle threads to every content process if we can possibly avoid it.
(In reply to Kris Maglione [:kmag] from comment #0)
> We currently have 4 watchdog/hang monitor threads per content process in
> nightly and beta populations:
> 
> │     ├────8,192 B (00.04%) ── BHMgr Monitor (tid=8360)
> │     ├────8,192 B (00.04%) ── BHMgr Processor (tid=8361)
> │     ├────8,192 B (00.04%) ── JS Watchdog (tid=8345)
> │     ├────8,192 B (00.04%) ── ProcessHangMon (tid=8441)
> 
> Three of these can almost certainly be converted to timers which dispatch to
> a shared thread pool. BHMgr Processor can probably just use a thread pool
> directly.

BHMgr Monitor does some weird timer-ish stuff, and while it should almost certainly be rewritten, I'm a bit dubious as to how worthwhile that excercise is. It's not a great system right now & I spent a lot of time while working on it hunting down weird bugs :-/. (Turns out running a "profiler" in the background is kinda error-prone...)

One option which could be interesting would be for the parent process to handle the profiling of content processes, pausing them & walking their stacks when they aren't clearing out their IPC message queue fast enough.

BHMgr Processor actually used to be in a thread pool (it used to use the StreamTransportService), but it caused issues where a sufficiently slow computer would cause BHR to flood the STS with a large number of slow-running runnables.

(P.S. Both BHMgr Monitor, and BHMgr Processor should not appear outside of nightly builds, they won't even appear for beta users, so they're probably not super high priority)
Assignee: kmaglione+bmo → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.