child processes aren't cleaned up when restarting for update
Categories
(Core :: XPCOM, defect, P3)
Tracking
()
People
(Reporter: jcristau, Assigned: pbone)
References
(Blocks 1 open bug)
Details
(Whiteboard: [fission])
Attachments
(1 file, 1 obsolete file)
When applying an update and restarting we don't seem to waitpind() our child processes so the new firefox keeps accumulating a bunch of zombie children. Most visible in nightly, e.g.:
182639 ? Sl 765:07 | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin
182641 ? Z 0:00 | | \_ [firefox-bin] <defunct>
182721 ? Z 30:11 | | \_ [Socket Process] <defunct>
182792 ? Z 52:19 | | \_ [Web Content] <defunct>
182836 ? Z 329:44 | | \_ [Web Content] <defunct>
182860 ? Z 142:04 | | \_ [Web Content] <defunct>
182868 ? Z 25:40 | | \_ [Web Content] <defunct>
182875 ? Z 23:52 | | \_ [Web Content] <defunct>
182879 ? Z 8:08 | | \_ [Web Content] <defunct>
182886 ? Z 7:22 | | \_ [Web Content] <defunct>
182907 ? Z 7:07 | | \_ [Web Content] <defunct>
183090 ? Z 2:29 | | \_ [Privileged Cont] <defunct>
183113 ? Z 10:34 | | \_ [WebExtensions] <defunct>
183343 ? Z 3:49 | | \_ [RDD Process] <defunct>
224459 ? Z 0:00 | | \_ [pingsender] <defunct>
224463 ? Z 0:00 | | \_ [firefox-bin] <defunct>
224560 ? Z 0:00 | | \_ [Socket Process] <defunct>
225096 ? Z 0:00 | | \_ [RDD Process] <defunct>
225206 ? Z 0:00 | | \_ [pingsender] <defunct>
225209 ? Z 0:00 | | \_ [pingsender] <defunct>
225212 ? Z 0:00 | | \_ [pingsender] <defunct>
225216 ? Z 0:00 | | \_ [firefox-bin] <defunct>
225279 ? Z 0:00 | | \_ [Socket Process] <defunct>
225355 ? Z 81:01 | | \_ [Web Content] <defunct>
225395 ? Z 4:10 | | \_ [Web Content] <defunct>
225402 ? Z 21:22 | | \_ [Web Content] <defunct>
225404 ? Z 27:42 | | \_ [Web Content] <defunct>
225423 ? Z 10:05 | | \_ [Web Content] <defunct>
225427 ? Z 6:52 | | \_ [Web Content] <defunct>
225451 ? Z 2:41 | | \_ [Web Content] <defunct>
225459 ? Z 7:23 | | \_ [Web Content] <defunct>
225635 ? Z 1:18 | | \_ [Privileged Cont] <defunct>
225658 ? Z 5:11 | | \_ [WebExtensions] <defunct>
225839 ? Z 8:07 | | \_ [RDD Process] <defunct>
239455 ? Z 0:00 | | \_ [pingsender] <defunct>
239459 ? Z 0:00 | | \_ [firefox-bin] <defunct>
239538 ? Z 0:00 | | \_ [Socket Process] <defunct>
239601 ? Z 30:14 | | \_ [Web Content] <defunct>
239642 ? Z 5:07 | | \_ [Web Content] <defunct>
239649 ? Z 5:43 | | \_ [Web Content] <defunct>
239664 ? Z 32:14 | | \_ [Web Content] <defunct>
239672 ? Z 40:46 | | \_ [Web Content] <defunct>
239678 ? Z 14:01 | | \_ [Web Content] <defunct>
239700 ? Z 5:11 | | \_ [Web Content] <defunct>
239721 ? Z 8:41 | | \_ [Web Content] <defunct>
239876 ? Z 1:12 | | \_ [Privileged Cont] <defunct>
239934 ? Z 7:09 | | \_ [WebExtensions] <defunct>
240124 ? Z 0:00 | | \_ [RDD Process] <defunct>
282295 ? Z 0:00 | | \_ [pingsender] <defunct>
282299 ? Z 0:00 | | \_ [firefox-bin] <defunct>
282362 ? Z 0:00 | | \_ [Socket Process] <defunct>
282444 ? Z 4:11 | | \_ [Web Content] <defunct>
282485 ? Z 0:55 | | \_ [Web Content] <defunct>
282492 ? Z 0:35 | | \_ [Web Content] <defunct>
282507 ? Z 2:20 | | \_ [Web Content] <defunct>
282513 ? Z 2:19 | | \_ [Web Content] <defunct>
282520 ? Z 0:57 | | \_ [Web Content] <defunct>
282558 ? Z 0:25 | | \_ [Web Content] <defunct>
282580 ? Z 0:26 | | \_ [Web Content] <defunct>
282719 ? Z 0:09 | | \_ [Privileged Cont] <defunct>
282776 ? Z 0:31 | | \_ [WebExtensions] <defunct>
282964 ? Z 0:00 | | \_ [RDD Process] <defunct>
294944 ? Z 0:00 | | \_ [pingsender] <defunct>
294948 ? Z 0:00 | | \_ [firefox-bin] <defunct>
295013 ? Z 0:00 | | \_ [Socket Process] <defunct>
295092 ? Z 17:31 | | \_ [Web Content] <defunct>
295133 ? Z 6:12 | | \_ [Web Content] <defunct>
295140 ? Z 12:39 | | \_ [Web Content] <defunct>
295155 ? Z 18:44 | | \_ [Web Content] <defunct>
295161 ? Z 27:58 | | \_ [Web Content] <defunct>
295183 ? Z 10:52 | | \_ [Web Content] <defunct>
295209 ? Z 3:09 | | \_ [Web Content] <defunct>
295215 ? Z 2:34 | | \_ [Web Content] <defunct>
295367 ? Z 0:57 | | \_ [Privileged Cont] <defunct>
295425 ? Z 4:30 | | \_ [WebExtensions] <defunct>
295588 ? Z 0:00 | | \_ [RDD Process] <defunct>
306549 ? Z 0:00 | | \_ [pingsender] <defunct>
306553 ? Z 0:00 | | \_ [firefox-bin] <defunct>
306632 ? Z 0:00 | | \_ [Socket Process] <defunct>
306693 ? Z 17:02 | | \_ [Web Content] <defunct>
306733 ? Z 4:27 | | \_ [Web Content] <defunct>
306740 ? Z 2:25 | | \_ [Web Content] <defunct>
306756 ? Z 26:07 | | \_ [Web Content] <defunct>
306763 ? Z 18:51 | | \_ [Web Content] <defunct>
306768 ? Z 9:31 | | \_ [Web Content] <defunct>
306793 ? Z 4:15 | | \_ [Web Content] <defunct>
306813 ? Z 3:16 | | \_ [Web Content] <defunct>
306967 ? Z 0:50 | | \_ [Privileged Cont] <defunct>
307026 ? Z 3:57 | | \_ [WebExtensions] <defunct>
307182 ? Z 0:02 | | \_ [RDD Process] <defunct>
318841 ? Z 0:00 | | \_ [pingsender] <defunct>
318845 ? Z 0:00 | | \_ [firefox-bin] <defunct>
318924 ? Z 0:00 | | \_ [Socket Process] <defunct>
318984 ? Z 16:25 | | \_ [Web Content] <defunct>
319026 ? Z 8:41 | | \_ [Web Content] <defunct>
319033 ? Z 4:48 | | \_ [Web Content] <defunct>
319048 ? Z 14:02 | | \_ [Web Content] <defunct>
319056 ? Z 18:40 | | \_ [Web Content] <defunct>
319064 ? Z 10:10 | | \_ [Web Content] <defunct>
319072 ? Z 1:55 | | \_ [Web Content] <defunct>
319091 ? Z 1:53 | | \_ [Web Content] <defunct>
319260 ? Z 0:38 | | \_ [Privileged Cont] <defunct>
319318 ? Z 2:35 | | \_ [WebExtensions] <defunct>
319486 ? Z 0:00 | | \_ [RDD Process] <defunct>
329723 ? Z 0:00 | | \_ [pingsender] <defunct>
329727 ? Z 0:00 | | \_ [firefox-bin] <defunct>
329806 ? Sl 0:00 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -parentBuildID 20200401095134 -prefsLen 1 -prefMapSize 223822 -appdi
329862 ? Sl 20:26 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 1 -isForBrowser -prefsLen 308 -prefMapSize 223822 -parentBu
329903 ? Sl 3:39 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329910 ? Sl 3:57 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329925 ? Sl 37:57 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329930 ? Sl 22:50 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329938 ? Sl 5:24 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 6 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329957 ? Sl 2:32 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 7 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
329980 ? Sl 2:14 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 8 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
330137 ? Sl 0:45 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 9 -isForBrowser -prefsLen 625 -prefMapSize 223822 -parentBu
330195 ? Sl 3:02 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 10 -isForBrowser -prefsLen 6695 -prefMapSize 223822 -parent
330347 ? Sl 0:00 | | \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -parentBuildID 20200401095134 -prefsLen 7658 -prefMapSize 223822 -ap
Comment 1•4 years ago
|
||
There's some ideas about what could be going wrong in the IPC meeting notes: https://docs.google.com/document/d/19hFtByA2dKB1WSjwWnThDIiMnngptb54T76Zw-Uf354/edit#
This is likely an updater bug though.
Comment 2•4 years ago
|
||
FYI, we should get this prioritized and fixed. Who should be looking at this?
Comment 3•4 years ago
|
||
The updater isn't responsible for conducting the restart, all we do is call into Services.startup
. No update operations happen during that shutdown; at that point we've already arranged the necessary state to be processed very early during the next startup. If that processing is happening after any other processes have already been started, then that's the problem, but otherwise I don't understand how this can be an updater bug.
Comment 4•4 years ago
•
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #2)
FYI, we should get this prioritized and fixed. Who should be looking at this?
The only updater change to land since Dec 2019 is 1551306 that went in 3 days ago....so the timeline is at least right for this to be seen in Nightly.
:mhowell, do you think it's possible that any of those changes could have had an impact here? Or, if you're not sure, I wonder if we could try backing them out and then checking to see if this is still happening?
Other than that set of changes, it's hard to think of how the updater might be responsible for these issues in Nightly only since there have been no other recent changes made (ie, I'd expect to see them across the board, not just Nightly).
Comment 5•4 years ago
|
||
(In reply to Rachel Tublitz [:rachel] from comment #4)
The only updater change to land since Dec 2019 is 1551306 that went in 3 days ago....so the timeline is at least right for this to be seen in Nightly.
:mhowell, do you think it's possible that any of those changes could have had an impact here? Or, if you're not sure, I wonder if we could try backing them out and then checking to see if this is still happening?
I hadn't seen that patch, thanks for pointing that out. But it definitely doesn't look relevant here; all it's doing is dropping a context parameter, which nothing was using anyway, through some HTTP request handling stuff. Looks like just some totally safe cleanup to me.
Comment 6•4 years ago
|
||
I was able to reproduce this bug (using Ubuntu 19.10 with yesterday's Nightly) by restarting to apply an update, but then I was also able to reproduce the bug by using the safe mode restart button in about:support when there was no update available (in a new instance, I made sure all the processes from the previous one were gone first), so I think that's enough to show that this is not an updater issue.
Updated•4 years ago
|
Comment 7•4 years ago
|
||
If we're looking at the code calling execve
, this is probably XPCOM or Toolkit :: Startup or something like that.
It's probably not DOM: Content Processes, or IPC, because other types of IPC child processes and non-IPC child processes like pingsender
are also affected.
Comment 8•4 years ago
|
||
Recent changes in that code were in XPCOM:Core, so redirecting there.
Comment 9•4 years ago
|
||
NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)
Comment 10•4 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)
NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)
I will send this bug to Fission triage.
Sounds like this bug can affect non-Fission users too, but is much more likely to affect Fission users because they have more content processes.
Comment 11•4 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)
NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)
I don't think that's the case: zombie processes should have all their fds closed (and memory unmapped) at _exit
time; essentially they're just placeholders for a pid that can't be reclaimed yet.
Reporter | ||
Comment 12•4 years ago
|
||
I did run into X11 maxclients due to this fwiw.
Comment 13•4 years ago
|
||
Blocks Fission M6c Nightly for zombie process leaks.
This problem affects all users, but Fission makes it worse because we'll have more processes.
Comment 14•4 years ago
|
||
Nathan, could you review the severity of this bug (S2) and prioritize it?
Comment 15•4 years ago
|
||
I don't know what the right priority is; based on comment 11, it sounds like things might not be too bad (once the x11 maxclients issue is fixed, comment 12).
What is the right solution: when we restart for an update, we have to wait for all of our child processes to exit/forcibly terminate them, or something else?
Reporter | ||
Comment 16•4 years ago
|
||
Would it be worth forking before exec on restart, to let the parent exit and its children get reaped by init?
Comment 17•4 years ago
|
||
S1 or S2 bugs need an assignee - could you find someone for this bug?
Comment 19•4 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #16)
Would it be worth forking before exec on restart, to let the parent exit and its children get reaped by init?
That would be the simplest solution, but if there's anything out there that expects the browser won't exit and the updated browser will have the same pid, it would break. That doesn't mean we can't change the behavior, but it's something to consider.
The alternatives are not great: we may not want to just SIGKILL
our own child processes without giving them a chance to exit normally, but that means delaying the restart, and there may be helper apps launched via nsIProcess
that we don't want to kill on restart but we also don't want them to become zombies when they eventually exit.
Comment 20•4 years ago
|
||
Concretely, we'd replace this line with one of our process launching APIs: nsIProcess
, base::LaunchApp
, even PR_CreateProcess
if necessary because its odd side effects (bug 227246) won't matter, followed by exit
or _exit
.
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 22•3 years ago
|
||
Re-assigned during Fission meeting.
Assignee | ||
Comment 23•3 years ago
|
||
Notes from fission meeting.
This is where it’s happening: https://searchfox.org/mozilla-central/rev/31ddf859c57e812878a5f817e4097efb06de4d97/toolkit/xre/nsAppRunner.cpp#2232
Test code is here: https://searchfox.org/mozilla-central/rev/31ddf859c57e812878a5f817e4097efb06de4d97/testing/marionette/harness/marionette_harness/tests/unit/test_quit_restart.py#77
Assignee | ||
Comment 24•3 years ago
|
||
Depends on D101700
Assignee | ||
Comment 25•3 years ago
|
||
Depends on D101701
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 26•3 years ago
|
||
Comment on attachment 9197009 [details]
Bug 1626682 - Improve code formatting r=nika
Revision D101701 was moved to bug 1688150. Setting attachment 9197009 [details] to obsolete.
Comment 27•3 years ago
|
||
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fbd96279c66d Don't restart using exec on Unix r=marionette-reviewers,whimboo,kmag
Comment 28•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Description
•