Closed Bug 543441 Opened 11 years ago Closed 11 years ago

Crash when using process.kill() on a process that has ended

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a5

People

(Reporter: glandium, Assigned: glandium)

References

Details

Attachments

(2 files, 3 obsolete files)

This was uncovered at the same time as bug 543438.
When stdin is not a tty during the nsIProcess unit test, TestBlockingProcess returns immediately, and it appears that is it ends way before process.kill is called from the unit test.

What happens then is that PR_KillProcess is called with a NULL argument, which is then dereferenced, leading to a segmentation fault.

The problem could be considered a NSPR bug, where PR_KillProcess should be checking for non NULLness of its argument (but maybe the API requires that the argument is not NULL).

It could also be considered a bug in nsProcess::Kill, which doesn't check if mProcess is NULL before calling PR_KillProcess.

I'm going to attach a possible patch, but I'm not convinced by the return value. Should process.kill() call throw an exception when the process is already done ? That doesn't feel quite right.
Attached patch Possible patch (obsolete) — Splinter Review
Assignee: nobody → mh+mozilla
Attachment #424573 - Flags: review?(benjamin)
Attachment #424573 - Flags: review?(benjamin) → review?(dtownsend)
We should be able to have an automated test for this.
Flags: in-testsuite?
Comment on attachment 424573 [details] [diff] [review]
Possible patch

This looks ok to me but please can you write up an automated test, it should just be as simple as blocking the main thread until you're sure a background process has exited and then trying to kill it.
(In reply to comment #3)
> (From update of attachment 424573 [details] [diff] [review])
> This looks ok to me but please can you write up an automated test, it should
> just be as simple as blocking the main thread until you're sure a background
> process has exited and then trying to kill it.

I'm okay with writing a test for that, I just wanted to make sure we all agree that failing is the right way to handle the condition, since the test would obviously be different if we'd decide not to fail, but silently return.
Status: NEW → ASSIGNED
Comment on attachment 424573 [details] [diff] [review]
Possible patch

Failing is correct I think, r+ but don't land without a test
Attachment #424573 - Flags: review?(dtownsend) → review+
(In reply to comment #5)
> (From update of attachment 424573 [details] [diff] [review])
> Failing is correct I think, r+ but don't land without a test

Just a stupid question: should that be a unit test or a crashtest ?
xpcshell unit test
Attached patch Test (obsolete) — Splinter Review
I don't like having to use a global variable, but while https://developer.mozilla.org/en/Writing_xpcshell-based_unit_tests claims it's possible to use a function instead of a string for do_timeout(), it doesn't to be the case (yet) on 1.9.1, which is all I have at hand to do the testing.

The do_timeout is necessary to avoid race conditions, which happened a few times when testing without a do_timeout.
Attachment #427200 - Flags: review?(dtownsend)
Attachment #427200 - Flags: review?(dtownsend) → review-
Comment on attachment 427200 [details] [diff] [review]
Test

You need to use a function for do_timeout not a string, this should work on 1.9.2 as well, but using a string will not work on trunk at all which is where we must land first. Perhaps you are seeing a problem because you are trying to call finish_test_kill rather than finish_test_kill_2?

Even after correcting both of those the test passes on trunk without your patch to fix the bug applied so this test is not properly testing the bug.

>diff --git a/xpcom/tests/unit/test_nsIProcess.js b/xpcom/tests/unit/test_nsIProcess.js
>index 06d8e1e..8bbce3e 100644
>--- a/xpcom/tests/unit/test_nsIProcess.js
>+++ b/xpcom/tests/unit/test_nsIProcess.js
>@@ -185,6 +185,32 @@ function test_notify_killed()
>   process.kill();
> }
> 
>+var process;
>+
>+function finish_test_kill_2()
>+{

Nit: Braces go at the end of the function declaration.
(In reply to comment #8)
> Created an attachment (id=427200) [details]
> Test
> 
> I don't like having to use a global variable, but while
> https://developer.mozilla.org/en/Writing_xpcshell-based_unit_tests claims it's
> possible to use a function instead of a string for do_timeout(), it doesn't to
> be the case (yet) on 1.9.1, which is all I have at hand to do the testing.

You won't be able to use a function on 1.9.1 so you'll need to test on a more recent branch, ideally trunk.
(In reply to comment #9)
> (From update of attachment 427200 [details] [diff] [review])
> You need to use a function for do_timeout not a string, this should work on
> 1.9.2 as well, but using a string will not work on trunk at all which is where
> we must land first. Perhaps you are seeing a problem because you are trying to
> call finish_test_kill rather than finish_test_kill_2?

I actually edited that after testing, and it worked properly then, but only when using a string.

> Even after correcting both of those the test passes on trunk without your patch
> to fix the bug applied so this test is not properly testing the bug.

Mmmmm interesting. I must say i didn't try the crash case with do_timeout. What happens if you call finish_test_kill_2() directly ?
> You won't be able to use a function on 1.9.1 so you'll need to test on a more
> recent branch, ideally trunk.

I won't be able to do so before next week, at which point I may be able to test on 1.9.2. Testing on trunk will take more time.
I've been trying to get the test do something useful, but unfortunately, there is some kind of race here. Unless I'm killing right after the process.run(), without a do_timeout, I have a hard time triggering the segfault, and even then it doesn't happen reliably. And when it doesn't segfault, process.kill *does* kill the process because it is not terminated yet... which makes it difficult to have a test that doesn't fail in such cases... 
So the only way I can think of is to do a process.run/process.kill loop a bunch of times, catching any exception, but no do_check_* or do_throw...

Any better idea?
Attached patch Test v2 (obsolete) — Splinter Review
cf. comment 13, this is the best I could come up with. Usually, it takes between 1 and 100 loops to get a crash on my machine without the patch.
Attachment #427200 - Attachment is obsolete: true
Attachment #437287 - Flags: review?(dtownsend)
Attached patch Test v2.01Splinter Review
More accurate comment
Attachment #437287 - Attachment is obsolete: true
Attachment #437288 - Flags: review?(dtownsend)
Attachment #437287 - Flags: review?(dtownsend)
Comment on attachment 437288 [details] [diff] [review]
Test v2.01

Just a thought, what if you run the process then delay for a short time using a simple loop and then kill?

Otherwise this feels kind of nasty but I guess if that is all that works then at least we have something.
Attachment #437288 - Flags: review?(dtownsend) → review+
You made me wonder, so I went back to the code and the problem is actually a pure race condition, which I didn't spot first, and is why it's hard to come up with something that works accurately. With my patch, there still remains a race condition, by the way, though it has a much smaller window.

What seems to happen without the patch is that in the non blocking case, you can call Kill slightly after nsProcess was notified the process terminated, but before it sets mThread to null. mThread is nulled in nsProcess::ProcessComplete, which happens long enough after mProcess is nulled in Monitor that there is a race condition window.

With the patch, there is still a race condition window, though tighter. Kill may be called before Monitor acquires the lock, the mProcess test then occurs, and as being non null, Kill continues, then the scheduler may switch threads, Monitor gets the lock, nulls mProcess, releases the lock, and then whenever the scheduler goes back to the thread currently in Kill(), and PR_KillProcess is called with a null argument.

I think the mProcess test in Kill() should be moved after the lock.
Attached patch Patch v2Splinter Review
See comment 17
Attachment #424573 - Attachment is obsolete: true
Attachment #441252 - Flags: review?(dtownsend)
Attachment #441252 - Flags: review?(dtownsend) → review+
http://hg.mozilla.org/mozilla-central/rev/46c6105f5429

http://hg.mozilla.org/mozilla-central/rev/5e3ddc9cd1c5
(with a wrong bug number in the commit message, damnit)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a5
Depends on: 620101
You need to log in before you can comment on or make changes to this bug.