Closed Bug 1370027 Opened 8 years ago Closed 8 years ago

Runaway subprocess_worker_unix.js worker thread is pegging an entire CPU

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla55

Project Flags:

Performance Impact

high

Tracking Flags:

Tracking

Status

firefox55

fixed

People

(Reporter: dbaron, Assigned: kmag)

References

Details

(Keywords: perf, power)

Attachments

(2 files)

Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR. 8 years ago Kris Maglione [:kmag] 59 bytes, text/x-review-board-request	aswan : review+	Details
Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process. 8 years ago Kris Maglione [:kmag] 59 bytes, text/x-review-board-request	bgrins : review+	Details

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Description

•

8 years ago

I've had a local opt build of Firefox running for a few days (since Wednesday May 31 at 16:31 PDT, to be exact). It's running https://hg.mozilla.org/mozilla-central/rev/34ac1a5d6576 (plus my local patches as of Monday May 29 at 20:14 PDT up to "Don't propagate style flushes to the parent document"). I notice that Firefox is pegging an entire core and then some. "top -H" shows that the guilty thread is a worker thread, and adding "Worker" to the thread filter in the Gecko profiler (https://perf-html.io/ and see bug 1297773) shows that the problematic worker is an instance of toolkit/modules/subprocess/subprocess_worker_unix.js . It's pretty bad that our own code has run away like this and is pegging an entire CPU core. Profile at https://perfht.ml/2rnHMxf [Tracking Requested - why for this release]: This is a massive power usage bug -- it would lead to users basically reporting all the time that Firefox is killing their battery. It's unacceptable to ship in this state.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 1

•

8 years ago

s/all the time/whenever this bug occurs/ (since I'm not sure what triggers it)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 2

•

8 years ago

And here's a profile that also includes the Timer thread: https://perfht.ml/2rnF3UC (top -H says this worker thread is using 92% CPU and the Timer thread is using 37% CPU, with two other worker threads using 6% and 1%.)

Kris Maglione [:kmag]

Assignee

Comment 3

•

8 years ago

Presumably this is triggered by the profiler's use of the Subprocess API, but it's hard to tell from that profile whether it's a bug in the profiler API or a bug in the subprocess implementation. But from that profile, it *looks* like we're spawning a huge number of processes and waiting for them to exit when there's an error. It may be caused by another WebExtension that uses the native messaging API, though. What extensions do you have installed in that profile? Also, a surprising amount of that time seems to be spent in WorkerPrivate::SetGCTimerMode, canceling GC timers.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 4

•

8 years ago

The only other extension I have installed is https://addons.mozilla.org/en-US/firefox/addon/character-identifier/ (source at https://github.com/dbaron/char-identifier ) which doesn't use native messaging (as far as I'm aware).

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 5

•

8 years ago

(have installed *and enabled*, at least)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 6

•

8 years ago

Also, this Firefox is still running, so I can debug it to figure out what ran this if there's a good way to do so.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 7

•

8 years ago

(Also, perhaps this JS code could have profiler labels that say who ran it and/or what it was running? Is that something that we can do with the profiler?)

Kris Maglione [:kmag]

Assignee

Comment 8

•

8 years ago

Do you have any useful output in the browser console? And, if you can debug that worker thread, can you check the values of the io.processes map? In particular, the exitCode and pid properties should give us some idea of whether we're in an inifinite loop polling for the same process that's already exited (or failed to spawn). (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #7) > (Also, perhaps this JS code could have profiler labels that say who ran it > and/or what it was running? Is that something that we can do with the > profiler?) Not at the moment, but I agree that it would be helpful.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 9

•

8 years ago

There are 39 elements in the map. All have an exitCode of null. None of the PIDs correspond to running processes.

Kris Maglione [:kmag]

Assignee

Comment 10

•

8 years ago

OK. That's interesting... So clearly Process.wait() is failing, and we're continuing to poll for processes that don't exist anymore. But the question is why. I guess the most likely possibilities are that either something is changing the SIGCHLD handler for your parent process to SIG_IGN, or some other thread is calling wait() or waitpid(-1) and reaping the process before the worker gets a chance. And, unfortunately, [1] looks like a likely culprit for the latter. So this probably winds up happening as soon as someone uses nsIProcess. From a quick code search, I suspect that's the Browser Toolbox[2] in this case. If so, the simplest short term solution is probably to update that code not to use nsIProcess. Beyond that, we should probably try to handle that situation non-catastrophically, but I'd be very surprised if that wait loop thread doesn't cause problems elsewhere, particularly for our IPC code. [1]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/nsprpub/pr/src/md/unix/uxproces.c#648 [2]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/devtools/client/framework/ToolboxProcess.jsm#234

Marcia Knous [:marcia]

Comment 11

•

8 years ago

Tracking 55+ for this issue.

tracking-firefox55: ? → +

Kris Maglione [:kmag]

Assignee

Updated

•

8 years ago

Comment 14

•

8 years ago

mozreview-review

Comment on attachment 8875055 [details] Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR. https://reviewboard.mozilla.org/r/146422/#review150436 I was looking forward to subprocess.rs but this will do in the mean time.

Attachment #8875055 - Flags: review?(aswan) → review+

Kris Maglione [:kmag]

Assignee

Comment 15

•

8 years ago

(In reply to Andrew Swan [:aswan] from comment #14) > I was looking forward to subprocess.rs Some day...

Brian Grinstead [:bgrins]

Comment 16

•

8 years ago

mozreview-review

Comment on attachment 8875056 [details] Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process. https://reviewboard.mozilla.org/r/146424/#review150820 This all looks reasonable, and the Browser Toolbox still works, so fine by me

Attachment #8875056 - Flags: review?(bgrinstead) → review+

Kris Maglione [:kmag]

Assignee

Comment 17

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f4f2dab8b32338f937a3661946809df4674440d6 Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR. r=aswan https://hg.mozilla.org/integration/mozilla-inbound/rev/0d16a08bbcfa181535789a43856ef176cc9829c8 Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process. r=bgrins

Ryan VanderMeulen [:RyanVM]

Comment 18

•

8 years ago

https://hg.mozilla.org/mozilla-central/rev/f4f2dab8b323 https://hg.mozilla.org/mozilla-central/rev/0d16a08bbcfa

Assignee: nobody → kmaglione+bmo

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox55: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

atlanto

Updated

•

8 years ago

Depends on: 1371548

Brian Grinstead [:bgrins]

Updated

•

7 years ago

Depends on: 1374837

Ciprian Muresan [:cmuresan], Ecosystem QA, OOO until the 16th of Dec needinfo @mcoman or @poiegas

Updated

•

7 years ago

Depends on: 1377326

Jean Gong :jgong

Updated

•

7 years ago

Whiteboard: [qf] → [qf:p1]

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P1

Whiteboard: [qf:p1]

You need to log in before you can comment on or make changes to this bug.