Closed Bug 1370027 Opened 3 years ago Closed 3 years ago

Runaway subprocess_worker_unix.js worker thread is pegging an entire CPU

Categories

(Toolkit :: Async Tooling, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 + fixed

People

(Reporter: dbaron, Assigned: kmag)

References

Details

(Keywords: perf, power, Whiteboard: [qf:p1])

Attachments

(2 files)

I've had a local opt build of Firefox running for a few days (since Wednesday May 31 at 16:31 PDT, to be exact).  It's running https://hg.mozilla.org/mozilla-central/rev/34ac1a5d6576 (plus my local patches as of Monday May 29 at 20:14 PDT up to "Don't propagate style flushes to the parent document").

I notice that Firefox is pegging an entire core and then some.  "top -H" shows that the guilty thread is a worker thread, and adding "Worker" to the thread filter in the Gecko profiler (https://perf-html.io/ and see bug 1297773) shows that the problematic worker is an instance of toolkit/modules/subprocess/subprocess_worker_unix.js .

It's pretty bad that our own code has run away like this and is pegging an entire CPU core.

Profile at https://perfht.ml/2rnHMxf

[Tracking Requested - why for this release]:  This is a massive power usage bug -- it would lead to users basically reporting all the time that Firefox is killing their battery.  It's unacceptable to ship in this state.
s/all the time/whenever this bug occurs/ (since I'm not sure what triggers it)
And here's a profile that also includes the Timer thread: https://perfht.ml/2rnF3UC

(top -H says this worker thread is using 92% CPU and the Timer thread is using 37% CPU, with two other worker threads using 6% and 1%.)
Presumably this is triggered by the profiler's use of the Subprocess API, but it's hard to tell from that profile whether it's a bug in the profiler API or a bug in the subprocess implementation. But from that profile, it *looks* like we're spawning a huge number of processes and waiting for them to exit when there's an error.

It may be caused by another WebExtension that uses the native messaging API, though. What extensions do you have installed in that profile?

Also, a surprising amount of that time seems to be spent in WorkerPrivate::SetGCTimerMode, canceling GC timers.
The only other extension I have installed is https://addons.mozilla.org/en-US/firefox/addon/character-identifier/ (source at https://github.com/dbaron/char-identifier ) which doesn't use native messaging (as far as I'm aware).
(have installed *and enabled*, at least)
Also, this Firefox is still running, so I can debug it to figure out what ran this if there's a good way to do so.
(Also, perhaps this JS code could have profiler labels that say who ran it and/or what it was running?  Is that something that we can do with the profiler?)
Do you have any useful output in the browser console?

And, if you can debug that worker thread, can you check the values of the io.processes map? In particular, the exitCode and pid properties should give us some idea of whether we're in an inifinite loop polling for the same process that's already exited (or failed to spawn).

(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #7)
> (Also, perhaps this JS code could have profiler labels that say who ran it
> and/or what it was running?  Is that something that we can do with the
> profiler?)

Not at the moment, but I agree that it would be helpful.
There are 39 elements in the map.  All have an exitCode of null.  None of the PIDs correspond to running processes.
OK. That's interesting... So clearly Process.wait() is failing, and we're continuing to poll for processes that don't exist anymore. But the question is why.

I guess the most likely possibilities are that either something is changing the SIGCHLD handler for your parent process to SIG_IGN, or some other thread is calling wait() or waitpid(-1) and reaping the process before the worker gets a chance. And, unfortunately, [1] looks like a likely culprit for the latter.

So this probably winds up happening as soon as someone uses nsIProcess. From a quick code search, I suspect that's the Browser Toolbox[2] in this case.

If so, the simplest short term solution is probably to update that code not to use nsIProcess. Beyond that, we should probably try to handle that situation non-catastrophically, but I'd be very surprised if that wait loop thread doesn't cause problems elsewhere, particularly for our IPC code.


[1]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/nsprpub/pr/src/md/unix/uxproces.c#648
[2]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/devtools/client/framework/ToolboxProcess.jsm#234
Tracking 55+ for this issue.
See Also: → 1370731
Comment on attachment 8875055 [details]
Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR.

https://reviewboard.mozilla.org/r/146422/#review150436

I was looking forward to subprocess.rs but this will do in the mean time.
Attachment #8875055 - Flags: review?(aswan) → review+
(In reply to Andrew Swan [:aswan] from comment #14)
> I was looking forward to subprocess.rs

Some day...
Comment on attachment 8875056 [details]
Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process.

https://reviewboard.mozilla.org/r/146424/#review150820

This all looks reasonable, and the Browser Toolbox still works, so fine by me
Attachment #8875056 - Flags: review?(bgrinstead) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/f4f2dab8b32338f937a3661946809df4674440d6
Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR. r=aswan

https://hg.mozilla.org/integration/mozilla-inbound/rev/0d16a08bbcfa181535789a43856ef176cc9829c8
Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process. r=bgrins
https://hg.mozilla.org/mozilla-central/rev/f4f2dab8b323
https://hg.mozilla.org/mozilla-central/rev/0d16a08bbcfa
Assignee: nobody → kmaglione+bmo
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1371548
Depends on: 1374837
Whiteboard: [qf] → [qf:p1]
You need to log in before you can comment on or make changes to this bug.