Closed Bug 1626682 Opened 4 years ago Closed 3 years ago

child processes aren't cleaned up when restarting for update

Categories

(Core :: XPCOM, defect, P3)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
87 Branch
Fission Milestone M6c
Tracking Status
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- wontfix
firefox87 --- fixed

People

(Reporter: jcristau, Assigned: pbone)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fission])

Attachments

(1 file, 1 obsolete file)

When applying an update and restarting we don't seem to waitpind() our child processes so the new firefox keeps accumulating a bunch of zombie children. Most visible in nightly, e.g.:

 182639 ?        Sl   765:07  |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin
 182641 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 182721 ?        Z     30:11  |   |   \_ [Socket Process] <defunct>
 182792 ?        Z     52:19  |   |   \_ [Web Content] <defunct>
 182836 ?        Z    329:44  |   |   \_ [Web Content] <defunct>
 182860 ?        Z    142:04  |   |   \_ [Web Content] <defunct>
 182868 ?        Z     25:40  |   |   \_ [Web Content] <defunct>
 182875 ?        Z     23:52  |   |   \_ [Web Content] <defunct>
 182879 ?        Z      8:08  |   |   \_ [Web Content] <defunct>
 182886 ?        Z      7:22  |   |   \_ [Web Content] <defunct>
 182907 ?        Z      7:07  |   |   \_ [Web Content] <defunct>
 183090 ?        Z      2:29  |   |   \_ [Privileged Cont] <defunct>
 183113 ?        Z     10:34  |   |   \_ [WebExtensions] <defunct>
 183343 ?        Z      3:49  |   |   \_ [RDD Process] <defunct>
 224459 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 224463 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 224560 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 225096 ?        Z      0:00  |   |   \_ [RDD Process] <defunct>
 225206 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 225209 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 225212 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 225216 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 225279 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 225355 ?        Z     81:01  |   |   \_ [Web Content] <defunct>
 225395 ?        Z      4:10  |   |   \_ [Web Content] <defunct>
 225402 ?        Z     21:22  |   |   \_ [Web Content] <defunct>
 225404 ?        Z     27:42  |   |   \_ [Web Content] <defunct>
 225423 ?        Z     10:05  |   |   \_ [Web Content] <defunct>
 225427 ?        Z      6:52  |   |   \_ [Web Content] <defunct>
 225451 ?        Z      2:41  |   |   \_ [Web Content] <defunct>
 225459 ?        Z      7:23  |   |   \_ [Web Content] <defunct>
 225635 ?        Z      1:18  |   |   \_ [Privileged Cont] <defunct>
 225658 ?        Z      5:11  |   |   \_ [WebExtensions] <defunct>
 225839 ?        Z      8:07  |   |   \_ [RDD Process] <defunct>
 239455 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 239459 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 239538 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 239601 ?        Z     30:14  |   |   \_ [Web Content] <defunct>
 239642 ?        Z      5:07  |   |   \_ [Web Content] <defunct>
 239649 ?        Z      5:43  |   |   \_ [Web Content] <defunct>
 239664 ?        Z     32:14  |   |   \_ [Web Content] <defunct>
 239672 ?        Z     40:46  |   |   \_ [Web Content] <defunct>
 239678 ?        Z     14:01  |   |   \_ [Web Content] <defunct>
 239700 ?        Z      5:11  |   |   \_ [Web Content] <defunct>
 239721 ?        Z      8:41  |   |   \_ [Web Content] <defunct>
 239876 ?        Z      1:12  |   |   \_ [Privileged Cont] <defunct>
 239934 ?        Z      7:09  |   |   \_ [WebExtensions] <defunct>
 240124 ?        Z      0:00  |   |   \_ [RDD Process] <defunct>
 282295 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 282299 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 282362 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 282444 ?        Z      4:11  |   |   \_ [Web Content] <defunct>
 282485 ?        Z      0:55  |   |   \_ [Web Content] <defunct>
 282492 ?        Z      0:35  |   |   \_ [Web Content] <defunct>
 282507 ?        Z      2:20  |   |   \_ [Web Content] <defunct>
 282513 ?        Z      2:19  |   |   \_ [Web Content] <defunct>
 282520 ?        Z      0:57  |   |   \_ [Web Content] <defunct>
 282558 ?        Z      0:25  |   |   \_ [Web Content] <defunct>
 282580 ?        Z      0:26  |   |   \_ [Web Content] <defunct>
 282719 ?        Z      0:09  |   |   \_ [Privileged Cont] <defunct>
 282776 ?        Z      0:31  |   |   \_ [WebExtensions] <defunct>
 282964 ?        Z      0:00  |   |   \_ [RDD Process] <defunct>
 294944 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 294948 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 295013 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 295092 ?        Z     17:31  |   |   \_ [Web Content] <defunct>
 295133 ?        Z      6:12  |   |   \_ [Web Content] <defunct>
 295140 ?        Z     12:39  |   |   \_ [Web Content] <defunct>
 295155 ?        Z     18:44  |   |   \_ [Web Content] <defunct>
 295161 ?        Z     27:58  |   |   \_ [Web Content] <defunct>
 295183 ?        Z     10:52  |   |   \_ [Web Content] <defunct>
 295209 ?        Z      3:09  |   |   \_ [Web Content] <defunct>
 295215 ?        Z      2:34  |   |   \_ [Web Content] <defunct>
 295367 ?        Z      0:57  |   |   \_ [Privileged Cont] <defunct>
 295425 ?        Z      4:30  |   |   \_ [WebExtensions] <defunct>
 295588 ?        Z      0:00  |   |   \_ [RDD Process] <defunct>
 306549 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 306553 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 306632 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 306693 ?        Z     17:02  |   |   \_ [Web Content] <defunct>
 306733 ?        Z      4:27  |   |   \_ [Web Content] <defunct>
 306740 ?        Z      2:25  |   |   \_ [Web Content] <defunct>
 306756 ?        Z     26:07  |   |   \_ [Web Content] <defunct>
 306763 ?        Z     18:51  |   |   \_ [Web Content] <defunct>
 306768 ?        Z      9:31  |   |   \_ [Web Content] <defunct>
 306793 ?        Z      4:15  |   |   \_ [Web Content] <defunct>
 306813 ?        Z      3:16  |   |   \_ [Web Content] <defunct>
 306967 ?        Z      0:50  |   |   \_ [Privileged Cont] <defunct>
 307026 ?        Z      3:57  |   |   \_ [WebExtensions] <defunct>
 307182 ?        Z      0:02  |   |   \_ [RDD Process] <defunct>
 318841 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 318845 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 318924 ?        Z      0:00  |   |   \_ [Socket Process] <defunct>
 318984 ?        Z     16:25  |   |   \_ [Web Content] <defunct>
 319026 ?        Z      8:41  |   |   \_ [Web Content] <defunct>
 319033 ?        Z      4:48  |   |   \_ [Web Content] <defunct>
 319048 ?        Z     14:02  |   |   \_ [Web Content] <defunct>
 319056 ?        Z     18:40  |   |   \_ [Web Content] <defunct>
 319064 ?        Z     10:10  |   |   \_ [Web Content] <defunct>
 319072 ?        Z      1:55  |   |   \_ [Web Content] <defunct>
 319091 ?        Z      1:53  |   |   \_ [Web Content] <defunct>
 319260 ?        Z      0:38  |   |   \_ [Privileged Cont] <defunct>
 319318 ?        Z      2:35  |   |   \_ [WebExtensions] <defunct>
 319486 ?        Z      0:00  |   |   \_ [RDD Process] <defunct>
 329723 ?        Z      0:00  |   |   \_ [pingsender] <defunct>
 329727 ?        Z      0:00  |   |   \_ [firefox-bin] <defunct>
 329806 ?        Sl     0:00  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -parentBuildID 20200401095134 -prefsLen 1 -prefMapSize 223822 -appdi
 329862 ?        Sl    20:26  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 1 -isForBrowser -prefsLen 308 -prefMapSize 223822 -parentBu
 329903 ?        Sl     3:39  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329910 ?        Sl     3:57  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329925 ?        Sl    37:57  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329930 ?        Sl    22:50  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329938 ?        Sl     5:24  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 6 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329957 ?        Sl     2:32  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 7 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 329980 ?        Sl     2:14  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 8 -isForBrowser -prefsLen 531 -prefMapSize 223822 -parentBu
 330137 ?        Sl     0:45  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 9 -isForBrowser -prefsLen 625 -prefMapSize 223822 -parentBu
 330195 ?        Sl     3:02  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -childID 10 -isForBrowser -prefsLen 6695 -prefMapSize 223822 -parent
 330347 ?        Sl     0:00  |   |   \_ /home/jcristau/firefox/nightly/firefox/firefox-bin -contentproc -parentBuildID 20200401095134 -prefsLen 7658 -prefMapSize 223822 -ap

There's some ideas about what could be going wrong in the IPC meeting notes: https://docs.google.com/document/d/19hFtByA2dKB1WSjwWnThDIiMnngptb54T76Zw-Uf354/edit#

This is likely an updater bug though.

FYI, we should get this prioritized and fixed. Who should be looking at this?

Flags: needinfo?(rtublitz)

The updater isn't responsible for conducting the restart, all we do is call into Services.startup. No update operations happen during that shutdown; at that point we've already arranged the necessary state to be processed very early during the next startup. If that processing is happening after any other processes have already been started, then that's the problem, but otherwise I don't understand how this can be an updater bug.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #2)

FYI, we should get this prioritized and fixed. Who should be looking at this?

The only updater change to land since Dec 2019 is 1551306 that went in 3 days ago....so the timeline is at least right for this to be seen in Nightly.

:mhowell, do you think it's possible that any of those changes could have had an impact here? Or, if you're not sure, I wonder if we could try backing them out and then checking to see if this is still happening?

Other than that set of changes, it's hard to think of how the updater might be responsible for these issues in Nightly only since there have been no other recent changes made (ie, I'd expect to see them across the board, not just Nightly).

Flags: needinfo?(rtublitz) → needinfo?(mhowell)

(In reply to Rachel Tublitz [:rachel] from comment #4)

The only updater change to land since Dec 2019 is 1551306 that went in 3 days ago....so the timeline is at least right for this to be seen in Nightly.

:mhowell, do you think it's possible that any of those changes could have had an impact here? Or, if you're not sure, I wonder if we could try backing them out and then checking to see if this is still happening?

I hadn't seen that patch, thanks for pointing that out. But it definitely doesn't look relevant here; all it's doing is dropping a context parameter, which nothing was using anyway, through some HTTP request handling stuff. Looks like just some totally safe cleanup to me.

Flags: needinfo?(mhowell)

I was able to reproduce this bug (using Ubuntu 19.10 with yesterday's Nightly) by restarting to apply an update, but then I was also able to reproduce the bug by using the safe mode restart button in about:support when there was no update available (in a new instance, I made sure all the processes from the previous one were gone first), so I think that's enough to show that this is not an updater issue.

Component: Application Update → General
Product: Toolkit → Firefox
Component: General → DOM: Content Processes
Product: Firefox → Core

If we're looking at the code calling execve, this is probably XPCOM or Toolkit :: Startup or something like that.

It's probably not DOM: Content Processes, or IPC, because other types of IPC child processes and non-IPC child processes like pingsender are also affected.

Recent changes in that code were in XPCOM:Core, so redirecting there.

Component: DOM: Content Processes → XPCOM

NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)

Flags: needinfo?(cpeterson)
Whiteboard: [fission]

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)

NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)

I will send this bug to Fission triage.

Sounds like this bug can affect non-Fission users too, but is much more likely to affect Fission users because they have more content processes.

Severity: normal → S2
Fission Milestone: --- → ?
Flags: needinfo?(cpeterson)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #9)

NI cpeterson to add it to the list of fission issues - zombie browsers can lock up a linux desktop by running you out of xclient connections (default max is 256; use can up it but none do normally). (it can also cause issues with memory/swap usage I imagine, and other resource issues)

I don't think that's the case: zombie processes should have all their fds closed (and memory unmapped) at _exit time; essentially they're just placeholders for a pid that can't be reclaimed yet.

I did run into X11 maxclients due to this fwiw.

Blocks Fission M6c Nightly for zombie process leaks.

This problem affects all users, but Fission makes it worse because we'll have more processes.

Fission Milestone: ? → M6c

Nathan, could you review the severity of this bug (S2) and prioritize it?

Flags: needinfo?(nfroyd)

I don't know what the right priority is; based on comment 11, it sounds like things might not be too bad (once the x11 maxclients issue is fixed, comment 12).

What is the right solution: when we restart for an update, we have to wait for all of our child processes to exit/forcibly terminate them, or something else?

Flags: needinfo?(nfroyd)
Priority: -- → P2

Would it be worth forking before exec on restart, to let the parent exit and its children get reaped by init?

S1 or S2 bugs need an assignee - could you find someone for this bug?

Flags: needinfo?(nfroyd)

(In reply to Julien Cristau [:jcristau] from comment #16)

Would it be worth forking before exec on restart, to let the parent exit and its children get reaped by init?

That would be the simplest solution, but if there's anything out there that expects the browser won't exit and the updated browser will have the same pid, it would break. That doesn't mean we can't change the behavior, but it's something to consider.

The alternatives are not great: we may not want to just SIGKILL our own child processes without giving them a chance to exit normally, but that means delaying the restart, and there may be helper apps launched via nsIProcess that we don't want to kill on restart but we also don't want them to become zombies when they eventually exit.

Concretely, we'd replace this line with one of our process launching APIs: nsIProcess, base::LaunchApp, even PR_CreateProcess if necessary because its odd side effects (bug 227246) won't matter, followed by exit or _exit.

Flags: needinfo?(nfroyd)
Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED
Priority: P2 → P3

Re-assigned during Fission meeting.

Assignee: kmaglione+bmo → pbone

Depends on D101700

Depends on D101701

Attachment #9197009 - Attachment description: Bug 1626682 - Improve code formatting r=kmag → Bug 1626682 - Improve code formatting r=nika
Attachment #9197010 - Attachment description: Bug 1626682 - Don't restart using exec on Unix r=kmag → Bug 1626682 - Don't restart using exec on Unix r=nika
Blocks: 1688150
Whiteboard: [fission] → [fission], [2/3] patch r+, can land today
Blocks: 1690691

Comment on attachment 9197009 [details]
Bug 1626682 - Improve code formatting r=nika

Revision D101701 was moved to bug 1688150. Setting attachment 9197009 [details] to obsolete.

Attachment #9197009 - Attachment is obsolete: true
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fbd96279c66d
Don't restart using exec on Unix r=marionette-reviewers,whimboo,kmag
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch
Whiteboard: [fission], [2/3] patch r+, can land today → [fission]
Regressions: 1692851
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: