"ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in opt mochitest-browser-chrome-2

RESOLVED WONTFIX

Status

()

Core
IPC
RESOLVED WONTFIX
4 years ago
4 years ago

People

(Reporter: mccr8, Assigned: mccr8)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

4 years ago
On this try push from the 19th, in the Linux Opt e10s bc1, I see a single instance of "TEST-START | Shutdown".  Similarly, there's a single instance of "MochitestServer : launching [u'/builds/slave/test/build/tests/bin/xpcshell'," etc.
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=24d497d18607

But then on this try push from the 21st (today), in the log from the same test, I see 13 instances of "TEST-START | Shutdown" and 13 instances of MochitestServer launching xpcshell.
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=82f7ce409c07

The second log also has 12 instances of "ERROR Got second suite_start message before suite_end."

This seems very bizarre.  Is it an intentional change?  What is spawning all of these MochitestServers?

I'm asking because in a later try push (with a patch that fixed the orange in the previous try run) I'm getting some new crashes:
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=7345732409cf

The crashing process, 3387, was spawned by 3347, according to the log, but the pid 3347 doesn't really seem to appear anywhere, so I'm trying to figure out what is going on, and how to reproduce it locally.
(Assignee)

Comment 1

4 years ago
Do you have some idea what might be happening, Joel?  Thanks.
Flags: needinfo?(jmaher)
This week we switched to --run-by-dir where for a given test job we run each directory as a unique instance of mochitest (fresh profile, fresh servers, fresh browser).

I am sure that is what you see, I wonder if that helps clarify the confusion a bit?  Is there anything I can do to help?  This is on for all browser-chrome jobs (not devtools though).
Flags: needinfo?(jmaher)
(Assignee)

Comment 3

4 years ago
Ah, that makes sense, thanks.  I'll hopefully be able to reproduce the error I'm seeing with that.
(Assignee)

Comment 4

4 years ago
I was able to reproduce this locally with:
  ./mach mochitest-browser --e10s --run-by-dir --total-chunks 3 --this-chunk 2
I think it isn't happening all the time.  I think the underlying problem is that run-by-dir is exposing all sorts of new shutdown behavior that we weren't seeing before.
(Assignee)

Comment 5

4 years ago
The failure changed again.  It went back to only affecting bc2.  This may be due to bug 1037625.  I'm having trouble reproducing it locally.
Summary: "aborting because of MsgProcessingError" in ContentChild in mochitest-browser-chrome → "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in mochitest-browser-chrome-2
(Assignee)

Comment 6

4 years ago
I still can't reproduce locally, but I looked at my try logs again, and I did get a useful stack.  TabChild::RecvLoadRemoteScript() does a bunch of stuff that ends up calling DOMStorageCache::StartDatabase(), which ends up calling PContentChild::SendPStorageConstructor(), which fails, presumably because we're in shutdown.  I'm going to try making RecvLoadRemoteScript not doing anything if the TabChild is destroyed.
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=a5ef3715faba
(Assignee)

Comment 7

4 years ago
Joel, how can I run some mochitest-browser tests locally in the same way they are run on TBPL?

I tried
  ./mach mochitest-browser --e10s --run-by-dir netwerk/test/browser/
but with this billm and I are both unable to reproduce this crash, so if there's some way I can use the test harness more directly or whatever to more closely reproduce what try is doing that would be appreciated.  It could also be an issue with shutdown speed.

More recent test run: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=710496b18c26
Flags: needinfo?(jmaher)
(Assignee)

Comment 8

4 years ago
Oops, I wasn't able to reproduce before because I was using a debug build.  A non-debug build reproduced the failure, as expected.
Flags: needinfo?(jmaher)
(Assignee)

Updated

4 years ago
Summary: "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in mochitest-browser-chrome-2 → "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in opt mochitest-browser-chrome-2
(Assignee)

Comment 9

4 years ago
As I'd guessed, I can reproduce the crash in a debug build by making the QuickExit() call in ContentChild::ActorDestroy() happen in debug builds.
(Assignee)

Comment 10

4 years ago
billm spent some time looking at this with me.  We determined that it seems like the test harness is creating a new child process in waitForWindowsState, with "about:blank", which causes us to start creating a new child process. Then the parent starts to shut down, but the child is not far enough along in startup to be able to get a message from the parent telling it that the parent shuts down.  Then the child continues startup, but the parent is dead, so the next time it tries to send it a message, the constructor fails and we abort.

We decided that probably the better solution here is to change IPDL code generation so that on a constructor failure we do _exit(0) and just gently exit.  That is basically just ensuring that with bug  1091766 we maintain the existing behavior we have today for constructor failures.

An alternative would be to change the test harness code to create "about:about" instead of "about:blank", because the former is not remote, so we don't create a child process.  That fixes the immediate crash, but we could see these RUNTIME_ABORTs in the wild with my other patch, which doesn't seem great, and it leaves a little landmine waiting for anybody who might make about:about remote or somehow cause a child process to happen.
Component: Mochitest → IPC
Product: Testing → Core
(Assignee)

Updated

4 years ago
Assignee: nobody → continuation
(Assignee)

Comment 11

4 years ago
Created attachment 8534071 [details] [diff] [review]
Exit quietly on constructor abort. WIP

Comment 12

4 years ago
Why aren't we killing (TerminateProcess/SIGKILL) all active children when a parent exits? ISTM that the child behavior here is "correct": the only case where the IPC connection should fall down is when the chrome process dies a violent/crashy death.

Otherwise, we should design the system so that either the parent waits for the child to stand up and then shut down (perhaps something we'd do in leak-checking configurations) or just hard-kill any children when we reach shutdown (for normal configurations).
(Assignee)

Comment 13

4 years ago
Bill was saying that we do wait for the child to finish starting up, but only once it has gotten past some graphics thing.  He hacked up some kind of thing to try to wait for children to finish starting up, but it didn't work for whatever reason.  I can try looking at it some more.

We do have ContentParent::ShutDownProcess(), which is called at xpcom-shutdown, but it seems to KillHard() only for the Nuwa process.  I think when we looked at this particular case, it was calling Close(), but leaving the child process alone otherwise.  I'm not sure what close does, but I guess it just shuts down the IPC channels.
(Assignee)

Comment 14

4 years ago
Created attachment 8534587 [details] [diff] [review]
QuickExit in debug build in ContentChild::ActorDestroy() to cause the crash in debug builds. Not for landing.
(Assignee)

Comment 15

4 years ago
I'm currently not planning to land the patch in the bug this is blocking, so this doesn't need to be fixed either.  Instead, bug 1103036 is the new magic.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.