Closed Bug 1103279 Opened 11 years ago Closed 11 years ago

"ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in opt mochitest-browser-chrome-2

Tracking

()

Status:

RESOLVED WONTFIX

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

Attachments

(2 files)

Exit quietly on constructor abort. WIP 11 years ago Andrew McCreight [:mccr8] 1.63 KB, patch		Details \| Diff \| Splinter Review
QuickExit in debug build in ContentChild::ActorDestroy() to cause the crash in debug builds. Not for landing. 11 years ago Andrew McCreight [:mccr8] 988 bytes, patch		Details \| Diff \| Splinter Review

Andrew McCreight [:mccr8]

Assignee

Description

•

11 years ago

On this try push from the 19th, in the Linux Opt e10s bc1, I see a single instance of "TEST-START | Shutdown". Similarly, there's a single instance of "MochitestServer : launching [u'/builds/slave/test/build/tests/bin/xpcshell'," etc. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=24d497d18607 But then on this try push from the 21st (today), in the log from the same test, I see 13 instances of "TEST-START | Shutdown" and 13 instances of MochitestServer launching xpcshell. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=82f7ce409c07 The second log also has 12 instances of "ERROR Got second suite_start message before suite_end." This seems very bizarre. Is it an intentional change? What is spawning all of these MochitestServers? I'm asking because in a later try push (with a patch that fixed the orange in the previous try run) I'm getting some new crashes: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=7345732409cf The crashing process, 3387, was spawned by 3347, according to the log, but the pid 3347 doesn't really seem to appear anywhere, so I'm trying to figure out what is going on, and how to reproduce it locally.

Andrew McCreight [:mccr8]

Assignee

Comment 1

•

11 years ago

Do you have some idea what might be happening, Joel? Thanks.

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 2

•

11 years ago

This week we switched to --run-by-dir where for a given test job we run each directory as a unique instance of mochitest (fresh profile, fresh servers, fresh browser). I am sure that is what you see, I wonder if that helps clarify the confusion a bit? Is there anything I can do to help? This is on for all browser-chrome jobs (not devtools though).

Flags: needinfo?(jmaher)

Andrew McCreight [:mccr8]

Assignee

Comment 3

•

11 years ago

Ah, that makes sense, thanks. I'll hopefully be able to reproduce the error I'm seeing with that.

Andrew McCreight [:mccr8]

Assignee

Comment 4

•

11 years ago

I was able to reproduce this locally with: ./mach mochitest-browser --e10s --run-by-dir --total-chunks 3 --this-chunk 2 I think it isn't happening all the time. I think the underlying problem is that run-by-dir is exposing all sorts of new shutdown behavior that we weren't seeing before.

Andrew McCreight [:mccr8]

Assignee

Comment 5

•

11 years ago

The failure changed again. It went back to only affecting bc2. This may be due to bug 1037625. I'm having trouble reproducing it locally.

Summary: "aborting because of MsgProcessingError" in ContentChild in mochitest-browser-chrome → "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in mochitest-browser-chrome-2

Andrew McCreight [:mccr8]

Assignee

Comment 6

•

11 years ago

I still can't reproduce locally, but I looked at my try logs again, and I did get a useful stack. TabChild::RecvLoadRemoteScript() does a bunch of stuff that ends up calling DOMStorageCache::StartDatabase(), which ends up calling PContentChild::SendPStorageConstructor(), which fails, presumably because we're in shutdown. I'm going to try making RecvLoadRemoteScript not doing anything if the TabChild is destroyed. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=a5ef3715faba

Andrew McCreight [:mccr8]

Assignee

Comment 7

•

11 years ago

Joel, how can I run some mochitest-browser tests locally in the same way they are run on TBPL? I tried ./mach mochitest-browser --e10s --run-by-dir netwerk/test/browser/ but with this billm and I are both unable to reproduce this crash, so if there's some way I can use the test harness more directly or whatever to more closely reproduce what try is doing that would be appreciated. It could also be an issue with shutdown speed. More recent test run: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=710496b18c26

Flags: needinfo?(jmaher)

Andrew McCreight [:mccr8]

Assignee

Comment 8

•

11 years ago

Oops, I wasn't able to reproduce before because I was using a debug build. A non-debug build reproduced the failure, as expected.

Flags: needinfo?(jmaher)

Andrew McCreight [:mccr8]

Assignee

Updated

•

11 years ago

Summary: "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in mochitest-browser-chrome-2 → "ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in opt mochitest-browser-chrome-2

Andrew McCreight [:mccr8]

Assignee

Comment 9

•

11 years ago

As I'd guessed, I can reproduce the crash in a debug build by making the QuickExit() call in ContentChild::ActorDestroy() happen in debug builds.

Andrew McCreight [:mccr8]

Assignee

Comment 10

•

11 years ago

billm spent some time looking at this with me. We determined that it seems like the test harness is creating a new child process in waitForWindowsState, with "about:blank", which causes us to start creating a new child process. Then the parent starts to shut down, but the child is not far enough along in startup to be able to get a message from the parent telling it that the parent shuts down. Then the child continues startup, but the parent is dead, so the next time it tries to send it a message, the constructor fails and we abort. We decided that probably the better solution here is to change IPDL code generation so that on a constructor failure we do _exit(0) and just gently exit. That is basically just ensuring that with bug 1091766 we maintain the existing behavior we have today for constructor failures. An alternative would be to change the test harness code to create "about:about" instead of "about:blank", because the former is not remote, so we don't create a child process. That fixes the immediate crash, but we could see these RUNTIME_ABORTs in the wild with my other patch, which doesn't seem great, and it leaves a little landmine waiting for anybody who might make about:about remote or somehow cause a child process to happen.

Component: Mochitest → IPC

Product: Testing → Core

Andrew McCreight [:mccr8]

Assignee

Updated

•

11 years ago

Assignee: nobody → continuation

Andrew McCreight [:mccr8]

Assignee

Comment 11

•

11 years ago

Attached patch Exit quietly on constructor abort. WIP — Details — Splinter Review

Benjamin Smedberg

Comment 12

•

11 years ago

Why aren't we killing (TerminateProcess/SIGKILL) all active children when a parent exits? ISTM that the child behavior here is "correct": the only case where the IPC connection should fall down is when the chrome process dies a violent/crashy death. Otherwise, we should design the system so that either the parent waits for the child to stand up and then shut down (perhaps something we'd do in leak-checking configurations) or just hard-kill any children when we reach shutdown (for normal configurations).

Andrew McCreight [:mccr8]

Assignee

Comment 13

•

11 years ago

Bill was saying that we do wait for the child to finish starting up, but only once it has gotten past some graphics thing. He hacked up some kind of thing to try to wait for children to finish starting up, but it didn't work for whatever reason. I can try looking at it some more. We do have ContentParent::ShutDownProcess(), which is called at xpcom-shutdown, but it seems to KillHard() only for the Nuwa process. I think when we looked at this particular case, it was calling Close(), but leaving the child process alone otherwise. I'm not sure what close does, but I guess it just shuts down the IPC channels.

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

11 years ago

Attached patch QuickExit in debug build in ContentChild::ActorDestroy() to cause the crash in debug builds. Not for landing. — Details — Splinter Review

Andrew McCreight [:mccr8]

Assignee

Comment 15

•

11 years ago

I'm currently not planning to land the patch in the bug this is blocking, so this doesn't need to be fixed either. Instead, bug 1103036 is the new magic.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → WONTFIX

You need to log in before you can comment on or make changes to this bug.

Bugzilla

"ABORT: constructor for actor failed" in PContentChild.cpp, line 1634 in opt mochitest-browser-chrome-2

Categories

(Core :: IPC, defect)

Tracking

()

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Comment 9

Comment 10

Updated

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Attachment

General

Description

File Name

Content Type