WaitPidDaemonThread() stealing return status and asserting
Categories
(NSPR :: NSPR, defect, P2)
Tracking
(Not tracked)
People
(Reporter: beddingfield, Unassigned, NeedInfo)
Details
(Whiteboard: [specification][type:bug])
Updated•7 years ago
|
Comment 1•7 years ago
|
||
Not sure how to triage this one.
Comment 2•7 years ago
|
||
So, there are actually two different parts to this bug, one that makes sense and I don't see how to fix, and one that kind of doesn't make sense.
-
The part where
system()returns -1 makes sense. Unfortunately, there is no system call for "wait for a pid in this list". The best that could be done would be do loop through all child processes, and waitpid for each, but that would have to be an active loop sucking 100% CPU, and that wouldn't be a great solution. Process groups are not an option because they would require a different process group leader. On Linux, there might be the possibility to use cgroups, but that seems heavyweight, and that wouldn't solve the problem on other platforms (where basically everything but Windows is likely affected). NSPR kind of assumes everything passes through it, which makes some kind of sense, but doesn't really in a larger ecosystem. In the context of Gecko, we should probably move away from NSPR to execute processes. -
The part where the assertion happens makes less sense.
ProcessReapedChildInternaltries to find the pid that waitpid returns in NSPR's table. The assert only happens when it does, which would mean it knows about that pid, suggesting it was not created by system(). Or... the process has been running for a looooong time, and pids have been recycled and a pid that had been used in the past has been reused for the system() call. NSPR is not being a very nice actor here, but we're back to "NSPR kind of assumes everything passes through it".
TL;DR: There clearly is a bug or two in NSPR here, but I don't know what to do about them. Separately, we should probably file a separate bug for Gecko to get rid of PR_CreateProcess and related calls.
Looping in other people that might have ideas?
Comment 3•7 years ago
|
||
Yes, we reached the assert when when the PID's got recycled. It's doesn't take too long if you're creating a lot of short-lived processes. While reviewing the code I noticed the waitpid() stealing.
We fixed it in-house by looping over the known PID's. It doesn't consume 100% CPU because the WaitPidDaemonThread() blocks until a SIGCHLD happens. (Since we're using Linux, _PR_NATIVE_THREADS is undefined, thus the SIGCHLD implementation of WaitPidDaemonThread().) Prosaically, you need to hold the lock while iterating through the list. So we copy the PID's from each bucket to a space that only WaitPidDaemonThread() can access... and iterate over those without the lock. If we get a hit, we re-acquire the lock and call ProcessReapedChildInternal(). There's a slight race condition here if the PID's are getting recycled at lightning speed.
Note: we're using v4.8.6... in case I just said something out-of-date. :-)
Updated•7 years ago
|
Comment 4•6 years ago
|
||
(In reply to Gabriel Beddingfield from comment #3)
Note: we're using v4.8.6... in case I just said something out-of-date. :-)
That release is 9 years old...
https://groups.google.com/forum/#!topic/mozilla.dev.tech.nspr/7utcRVUm_bU
What's the reason you haven't upgraded? Any fix would need to be applied to most recent NSPR, and for backporting you'd be on your own.
Regarding the potential fix, I have never worked on the related code.
From the various options you gave in your newsgroup post, I like solution 1, it seems the most elegant to me, and threads are supposed to be lightweight:
"1. For each process that NSPR spawns, spawn a corresponding thread that
will call waitpid() with the actual pid as the argument (in lieu of -1)."
Comment 5•6 years ago
|
||
(In reply to Kai Engert (:kaie:) from comment #4)
(In reply to Gabriel Beddingfield from comment #3)
What's the reason you haven't upgraded?
Because I don't run the show, here. :-)
Any fix would need to be applied to most recent NSPR, and for backporting you'd be on your own.
Understood. As I said, we've already fixed it in-house. I'm merely reporting the bug to you. (To be honest, closing the bug as WAI would not have surprised me... depending on the goals and priorities for your project.)
Comment 6•6 years ago
|
||
FYI, I just rebased our internal fixes to 4.24 and the issue is still present in that release.
Comment 7•6 years ago
|
||
Comment 8•6 years ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #7)
(In reply to Gabriel Beddingfield from comment #5)
As I said, we've already fixed it in-house.
How?
I described the fix in comment #3.
Updated•3 years ago
|
Description
•