Open Bug 1483328 Opened 7 years ago Updated 3 years ago

WaitPidDaemonThread() stealing return status and asserting

Categories

(NSPR :: NSPR, defect, P2)

All
Other

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: beddingfield, Unassigned, NeedInfo)

Details

(Whiteboard: [specification][type:bug])

What did you do? ================ 1. Run a program using NSPR 2. Spawn at least one program using PR_CreateProcess() 3. Spawn several programs using system(3). What happened? ============== Every once in a while, I'll get either: A. system(3) returns -1 with errno == 10 (ECHILD) B. Assertion failure: pRec->state != _PR_PID_REAPED, at pr/src/md/unix/uxproces.c:543 What should have happened? ========================== A. system(3) should have returned the exit status of the program instead of an error. B. uxprocess.c should not be assert-ing. Is there anything else we should know? ====================================== I posted about this on the mailing list: https://groups.google.com/forum/#!topic/mozilla.dev.tech.nspr/M6qJFI6glVs Related bug (possibly duplicate): https://bugzilla.mozilla.org/show_bug.cgi?id=603004 Related Debian bugs: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700639 https://groups.google.com/forum/#!msg/linux.debian.bugs.dist/0sjh8MtJpgk/vfY-qYfoBqMJ The root cause is that NSPR spawns a thread that is constantly calling waitpid(-1, ...) and that this will sometimes steal the return code of any sub-process created outside of the NSPR framework (e.g. using system(3) instead of PR_CreateProcess()).
Component: API → NSPR
Product: developer.mozilla.org → NSPR
Version: unspecified → other

Not sure how to triage this one.

Flags: needinfo?(mh+mozilla)
QA Contact: jjones

So, there are actually two different parts to this bug, one that makes sense and I don't see how to fix, and one that kind of doesn't make sense.

  • The part where system() returns -1 makes sense. Unfortunately, there is no system call for "wait for a pid in this list". The best that could be done would be do loop through all child processes, and waitpid for each, but that would have to be an active loop sucking 100% CPU, and that wouldn't be a great solution. Process groups are not an option because they would require a different process group leader. On Linux, there might be the possibility to use cgroups, but that seems heavyweight, and that wouldn't solve the problem on other platforms (where basically everything but Windows is likely affected). NSPR kind of assumes everything passes through it, which makes some kind of sense, but doesn't really in a larger ecosystem. In the context of Gecko, we should probably move away from NSPR to execute processes.

  • The part where the assertion happens makes less sense. ProcessReapedChildInternal tries to find the pid that waitpid returns in NSPR's table. The assert only happens when it does, which would mean it knows about that pid, suggesting it was not created by system(). Or... the process has been running for a looooong time, and pids have been recycled and a pid that had been used in the past has been reused for the system() call. NSPR is not being a very nice actor here, but we're back to "NSPR kind of assumes everything passes through it".

TL;DR: There clearly is a bug or two in NSPR here, but I don't know what to do about them. Separately, we should probably file a separate bug for Gecko to get rid of PR_CreateProcess and related calls.

Looping in other people that might have ideas?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kaie)
Flags: needinfo?(jld)

Yes, we reached the assert when when the PID's got recycled. It's doesn't take too long if you're creating a lot of short-lived processes. While reviewing the code I noticed the waitpid() stealing.

We fixed it in-house by looping over the known PID's. It doesn't consume 100% CPU because the WaitPidDaemonThread() blocks until a SIGCHLD happens. (Since we're using Linux, _PR_NATIVE_THREADS is undefined, thus the SIGCHLD implementation of WaitPidDaemonThread().) Prosaically, you need to hold the lock while iterating through the list. So we copy the PID's from each bucket to a space that only WaitPidDaemonThread() can access... and iterate over those without the lock. If we get a hit, we re-acquire the lock and call ProcessReapedChildInternal(). There's a slight race condition here if the PID's are getting recycled at lightning speed.

Note: we're using v4.8.6... in case I just said something out-of-date. :-)

Priority: -- → P2

(In reply to Gabriel Beddingfield from comment #3)

Note: we're using v4.8.6... in case I just said something out-of-date. :-)

That release is 9 years old...
https://groups.google.com/forum/#!topic/mozilla.dev.tech.nspr/7utcRVUm_bU

What's the reason you haven't upgraded? Any fix would need to be applied to most recent NSPR, and for backporting you'd be on your own.

Regarding the potential fix, I have never worked on the related code.

From the various options you gave in your newsgroup post, I like solution 1, it seems the most elegant to me, and threads are supposed to be lightweight:
"1. For each process that NSPR spawns, spawn a corresponding thread that
will call waitpid() with the actual pid as the argument (in lieu of -1)."

Flags: needinfo?(kaie)

(In reply to Kai Engert (:kaie:) from comment #4)

(In reply to Gabriel Beddingfield from comment #3)
What's the reason you haven't upgraded?

Because I don't run the show, here. :-)

Any fix would need to be applied to most recent NSPR, and for backporting you'd be on your own.

Understood. As I said, we've already fixed it in-house. I'm merely reporting the bug to you. (To be honest, closing the bug as WAI would not have surprised me... depending on the goals and priorities for your project.)

FYI, I just rebased our internal fixes to 4.24 and the issue is still present in that release.

(In reply to Gabriel Beddingfield from comment #5)

As I said, we've already fixed it in-house.

How?

(In reply to Kai Engert (:KaiE:) from comment #7)

(In reply to Gabriel Beddingfield from comment #5)

As I said, we've already fixed it in-house.

How?

I described the fix in comment #3.

Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.