test_bug260264.html can kill the child process because of a mismatched Msg_UpdateZoomConstraints__ID message

RESOLVED FIXED in Firefox 47

Status


defect
RESOLVED FIXED

People

(Reporter: Ehsan, Assigned: kats)

Tracking

({regression})

47 Branch
mozilla48
Points: ---

Firefox Tracking Flags

(firefox46 unaffected, firefox47 fixed, firefox48 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(1 attachment, 2 obsolete attachments)

STR:

1. Be on Linux.  (This may also happen elsewhere.)
2. Apply the patch in bug 1255121.
3. Run:
$ MOZ_CHAOSMODE=0 ./mach mochitest --e10s test_bug260264.html --debugger=rr --debugger-args="record -h"

This happens once out of three or four runs when debugging under rr chaos mode.

You'll sometimes get:
...
TEST-PASS | /tests/dom/tests/mochitest/bugs/test_bug260264.html | not properly blocked
TEST-PASS | /tests/dom/tests/mochitest/bugs/test_bug260264.html | not properly opened
TEST-PASS | /tests/dom/tests/mochitest/bugs/test_bug260264.html | not properly blocked
TEST-PASS | /tests/dom/tests/mochitest/bugs/test_bug260264.html | not properly blocked
###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x7,name=???) Route error: message sent to unknown actor ID
[Parent 29704] WARNING: pipe error (34): Connection reset by peer: file /home/ehsan/moz/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[Parent 29704] WARNING: pipe error (36): Connection reset by peer: file /home/ehsan/moz/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
###!!! [Parent][MessageChannel] Error: (msgtype=0x2C0077,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

Followed by the child being killed.  msgtype 0x7 is Msg_UpdateZoomConstraints__ID, based on <https://wiki.mozilla.org/Electrolysis/Debugging#Working_backwards_from_a_C.2B.2B_Message_to_its_IPDL_message>.
I talked to Ehsan about this on IRC, I'll investigate.
Assignee: nobody → bugmail.mozilla
So what appears to be happening here is that:

1) parent process creates a PAPZParent and sends over the constructor
2) child process creates a PAPZChild
3) parent process destroys the PAPZParent and sends over the delete (it's an async message)
4) child process sends over the UpdateZoomConstraints message
5) child process receives the delete

when the message in 4 goes over to the parent, it can't find the actor anymore, because it destroyed it in step 3, and barfs.

According to IRC conversation we need to have a two-step destruction process for protocols with an async delete, so that this sort of thing doesn't happen when a message was in flight before the delete was received.

Also I'm fairly certain this was a regression from bug 1020199. Peter, do you have cycles to pick this up in the immediate future?
Blocks: 1020199
Flags: needinfo?(peterv)
The immediate future has come and gone. I wrote a patch which turns the PAPZ shutdown into a two-step process. Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ecccfefebf9
Flags: needinfo?(peterv)
Try push seems green.
I forget if we have a bug filed (bill would know), but we want that two-step handshake pattern to be built into IPDL at some point. It's kind of ridiculous we need to hand-roll it in every new actor.
Attachment #8737924 - Flags: review?(dvander) → review+
I haven't heard back from peterv and I'd like to land this asap to get more bake time before uplift, so I'm going to go ahead and do that. Leaving r? on the patch to get a post-landing review from him, since I'd still like him to look at the patch.
Comment on attachment 8737924 [details]
MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?

https://reviewboard.mozilla.org/r/44189/#review41233

Thanks for picking this up!

::: gfx/layers/ipc/APZChild.cpp:93
(Diff revision 1)
>    if (mObserver) {
>      nsCOMPtr<nsIObserverService> os = services::GetObserverService();
>      os->RemoveObserver(mObserver, "tab-child-created");
> -  } else {
> +  } else if (mBrowser) {
>      mBrowser->SetAPZChild(nullptr);
> +    mBrowser = nullptr;

Is there a reason why you're nulling a member in the destructor?

::: gfx/layers/ipc/RemoteContentController.cpp:339
(Diff revision 1)
>  RemoteContentController::Destroy()
>  {
>    RefPtr<RemoteContentController> controller = this;
>    NS_DispatchToMainThread(NS_NewRunnableFunction([controller] {
>      if (controller->CanSend()) {
> -      DeletePAPZParent(controller);
> +      if (controller->SendDestroy()) {

Maybe combine these to |if (controller->CanSend() && controller->SendDestroy()) {|.

::: gfx/layers/ipc/RemoteContentController.cpp:347
(Diff revision 1)
> +        // __delete__ back, otherwise we might get destroyed in the meantime and
> +        // the IPC code will crash on try to handle the __delete__.
> +        uint64_t key = controller->mLayersId;
> +        MOZ_ASSERT(sDestroyedControllers.find(key) == sDestroyedControllers.end());
> +        sDestroyedControllers[key] = controller;
> +      }

Do we really need a hash? It seems like we could just call AddRef here and Release in ActorDestroy.
Attachment #8737924 - Flags: review?(peterv) → review+
Sigh, I reviewed it this morning, but probably didn't push the right buttons in MozReview :-(.
^ was for crashes in PAPZParent::DestroySubtree. I'm not sure off hand what's going on, will look into it.

16:06:11 KWierso: https://treeherder.mozilla.org/logviewer.html#?job_id=25289631&repo=mozilla-inbound is one
16:06:21 KWierso: another https://treeherder.mozilla.org/logviewer.html#?job_id=25286481&repo=mozilla-inbound
16:06:30 KWierso: and another https://treeherder.mozilla.org/logviewer.html#?job_id=25284784&repo=mozilla-inbound
(In reply to Peter Van der Beken [:peterv] from comment #10)
> > +    mBrowser = nullptr;
> 
> Is there a reason why you're nulling a member in the destructor?

Hm, good point, that shouldn't be needed.

> > -      DeletePAPZParent(controller);
> > +      if (controller->SendDestroy()) {
> 
> Maybe combine these to |if (controller->CanSend() &&
> controller->SendDestroy()) {|.

Will do.

> > +        MOZ_ASSERT(sDestroyedControllers.find(key) == sDestroyedControllers.end());
> > +        sDestroyedControllers[key] = controller;
> > +      }
> 
> Do we really need a hash, it seems like we could just call AddRef here and
> Release in ActorDestroy?

We could, yeah. I'm usually not a fan of manually calling AddRef/Release though because even though it might be fine now, future copypasta or modifications can get them out of sync and then it's harder to track down the problem.
I'm having a hard time reproducing the crash, but from tracing through the code it looks like something like this is happening:

1. RemoteContentController::Destroy() is called. This sends a destroy message to the child and puts the RemoteContentController in the sDestroyedControllers map (aka AddRef's it).
2. The child process gets the destroy message from the parent and sends back a __delete__ message
3. ActorDestroy is called on the parent side (unsure why). This releases the ref and actually destroys the RemoteContentController.
4. The parent side gets the __delete__ message and tries to call DestroySubtree() on the PAPZParent which crashes because it derefs a dangling pointer to RemoteContentController.

I suspect the right fix is that in RemoteContentController::ActorDestroy, we skip the ref release bit for certain values of ActorDestroyReason. However I'm not sure what values, and the documentation isn't very clear either. Guessing wrong could end up leaking stuff.

Also note that the crashes in comment 13 are during tests that intentionally crash the child process and trigger some sort of error shutdown in IPC code.
I moved the release-ref into an overridden Recv__delete__. Hopefully this will solve the problem without leaking stuff. I added some more comments in the code to explain why I think this should work. Try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=40d8b05d79d7
That version leaked a single RemoteContentController on NormalShutdown (the ActorDestroy(NormalShutdown) call happened between the SendDestroy() and the Recv__delete__, which prevented the Recv__delete__ from ever arriving, and leaked the entry in the map).

New attempt: https://treeherder.mozilla.org/#/jobs?repo=try&revision=60047c2bc5e7
Ok, that one looks good. Updated patch coming up.
Comment on attachment 8737924 [details]
MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44189/diff/1-2/
Attachment #8737924 - Attachment description: MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?peterv,dvander → MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?
Comment on attachment 8737924 [details]
MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?

Flipping flags back to r? for a review of the updated patch.
Attachment #8737924 - Flags: review?(peterv)
Attachment #8737924 - Flags: review?(dvander)
Attachment #8737924 - Flags: review+
Comment on attachment 8737924 [details]
MozReview Request: Bug 1255823 - Add a two-step destruction process to PAPZ. r?

https://reviewboard.mozilla.org/r/44189/#review42877

::: gfx/layers/ipc/RemoteContentController.cpp:364
(Diff revision 2)
> +      // the IPC code will crash on try to handle the __delete__.
> +      uint64_t key = controller->mLayersId;
> +      MOZ_ASSERT(sDestroyedControllers.find(key) == sDestroyedControllers.end());
> +      sDestroyedControllers[key] = controller;
> +    } else {
> +      // If the send failed, then this object must already have had ActorDestroy

Is this always true? I thought that ActorDestroy was posted to the event loop, so if the Send fails for some other reason, it might not be called yet. But I could be wrong.
Attachment #8737924 - Flags: review?(dvander)
(In reply to David Anderson [:dvander] from comment #21)
> > +      // If the send failed, then this object must already have had ActorDestroy
> 
> Is this always true? I thought that ActorDestroy was posted to the event
> loop

In the generated ipdl code (e.g. objdir/ipc/ipdl/PAPZParent.cpp) I see ActorDestroy being called directly during IPC message handling, not getting posted to the event loop. I'm not sure if that's what you mean.

> so if the Send fails for some other reason, it might not be called
> yet.

What other failure causes are there? If the send fails for other reasons, I'm assuming the channel is in a bad/error state anyway and is going to get torn down shortly. Will ActorDestroy still be called under those circumstances, and if so, do we know what the ActorDestroyReason will be? Because then I can keep a ref to the actor if the condition |CanSend() && !SendDestroy()| is true, and clear it when I get appropriate ActorDestroyReason.

> But I could be wrong.

So could I! :)
Really I think the wonky part about all this is that the IPC code can call Recv__delete__ on an actor *after* it has already called ActorDestroy on it. That just seems wrong, and that's the crash I described in comment 15. The ActorDestroy in step 3 there was for NormalShutdown, and it was followed by a Recv__delete__.
err I mean AbnormalShutdown. I think.
Posted patch Alternate patch (obsolete) — Splinter Review
Here's a slightly different fix - let me know if you prefer this one. The diff between the first approach and this one can be found at https://hg.mozilla.org/try/rev/7cf2492cc13e5ea2de8a1c489967f577c7ee1899 (try push is in the last comment).
Attachment #8741738 - Flags: review?(dvander)
Comment on attachment 8741738 [details] [diff] [review]
Alternate patch

Review of attachment 8741738 [details] [diff] [review]:
-----------------------------------------------------------------

I do like this one better, I hope it works - it's terrible we have to do any of this.
Attachment #8741738 - Flags: review?(dvander) → review+
Attachment #8737924 - Attachment is obsolete: true
Attachment #8737924 - Flags: review?(peterv)
This apparently caused some tests to leak like https://treeherder.mozilla.org/logviewer.html#?job_id=26026924&repo=mozilla-inbound

Possibly only on Linux64?

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/6e6c5143feaa
Flags: needinfo?(bugmail.mozilla)
For some reason I'm not able to reproduce the problem I was seeing before where the Recv__delete__ is called after the ActorDestroy. Without that problem the fix is relatively straightforward, just unconditionally release the ref in ActorDestroy. Try push with that seems to be pretty green so far.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=90aed9aeea7e&group_state=expanded
Flags: needinfo?(bugmail.mozilla)
Posted patch Patch v3Splinter Review
Ok, that does seem to be looking good, and I can't reproduce the scenario I was seeing before. I'll wait for the few remaining jobs on the try push and then land this, carrying r=dvander (the only change in this version is that I took out the "aWhy != ActorDestroyReason::AbnormalShutdown" condition and made that code always execute).
Attachment #8741738 - Attachment is obsolete: true
Attachment #8741738 - Flags: review?(peterv)
Attachment #8742509 - Flags: review?(peterv)
https://hg.mozilla.org/mozilla-central/rev/e9500cac731d
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
I'm going to wontfix this for 47 because it carries some risk that I don't want to uplift to beta. It didn't spend very much time on Nightly before merging to Aurora, and as far as I can tell the underlying issue it fixes hasn't been reported anywhere else or shown up in crash-stats. Also, e10s/APZ will not go to release in 47, so it should be ok to wontfix for 47.
kats, did you test your fix with the patch for bug 1255121?  Or should I do that separately before landing that patch?
Flags: needinfo?(bugmail.mozilla)
I did test it with that patch locally but not on try. It's probably worth testing again just to make sure.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #34)
> And as far as I can tell the underlying issue
> that it fixes hasn't been reported anywhere else or shown up in crash-stats.

My crash-stats-fu is apparently horrible, because this appears to be the #1 topcrash on 47, bug 1268881. We should uplift this fix after checking crash-stats to make sure this actually fixes stuff and doesn't make anything else worse.
Comment on attachment 8742509 [details] [diff] [review]
Patch v3

Approval Request Comment
[Feature/regressing bug #]: bug 1020199
[User impact if declined]: lotsa crashes, see bug 1268881.
[Describe test coverage new/current, TreeHerder]: not a whole lot, but we have crash-stats data that indicates this fixed the crashes in bug 1268881.
[Risks and why]: medium-ish risk, IPDL shutdown is hairy at the best of times. But I assume if something went wrong it would have shown up on crash-stats by now.
[String/UUID change made/needed]: none
Attachment #8742509 - Flags: review?(peterv)
Attachment #8742509 - Flags: review+
Attachment #8742509 - Flags: approval-mozilla-beta?
We would very much like to get this in 47 beta build 2.
Flags: needinfo?(rkothari)
(In reply to Jim Mathies [:jimm] from comment #39)
> We would very much like to get this in 47 beta build 2.

Sure. I will approve the uplift if you think it helps! I was looking at the crash data from bug 1268881 and whether we've had any occurrences on 48.0a2 since 04-20 (when this landed in central), and I do see an instance from 04-28 https://crash-stats.mozilla.com/report/index/a3ac3947-5019-4d14-9876-f887c2160429 and one from 04-25 https://crash-stats.mozilla.com/report/index/2a7db262-7034-40c3-8c97-20b2b2160425. 

This is for the first of 3 crash signatures associated with bug 1268881. The data for 2nd and 3rd signature looks promising from 48.0a2.
Flags: needinfo?(rkothari)
Comment on attachment 8742509 [details] [diff] [review]
Patch v3

Top crash on 47.0b1, Beta47+
Attachment #8742509 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ritu Kothari (:ritu) from comment #40)
> https://crash-stats.mozilla.com/report/index/a3ac3947-5019-4d14-9876-
> f887c2160429 and one from 04-25
> https://crash-stats.mozilla.com/report/index/2a7db262-7034-40c3-8c97-
> 20b2b2160425. 

These two crashes have the following in their ipc_channel_error:

(msgtype=0xEC0003,name=PTCPSocket::Msg_Data) Processing error: message was deserialized, but the handler returned false (indicating failure)

so it seems to have a different root cause.

Thanks for the quick approval, I'll land the uplift shortly.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #42)
> (In reply to Ritu Kothari (:ritu) from comment #40)
> > https://crash-stats.mozilla.com/report/index/a3ac3947-5019-4d14-9876-
> > f887c2160429 and one from 04-25
> > https://crash-stats.mozilla.com/report/index/2a7db262-7034-40c3-8c97-
> > 20b2b2160425. 
> 
> These two crashes have the following in their ipc_channel_error:
> 
> (msgtype=0xEC0003,name=PTCPSocket::Msg_Data) Processing error: message was
> deserialized, but the handler returned false (indicating failure)
> 
> so it seems to have a different root cause.
> 
> Thanks for the quick approval, I'll land the uplift shortly.

Ok. That's good to know. I am glad you got a chance to review those two reports. 

Awesome job all around flagging this one as a top crasher and finding a good fix to uplift! :)
Duplicate of this bug: 1268881
Doesn't look like beta is liking the uplift, there's a bunch of orange :(
So there's three paths we can follow here:

1) Back out bug 1020199 from 47. Although the patch is conceptually not too complex, it's a monster patch in terms of lines touched. Still, I created a backout patch and pushed it to try based on top of current 47 beta [1].

2) Bisect through the stuff that landed on 48 before this patch (pushlog at [2]) to see which of those patches fixed the OS X crashes. If we can find it, and it's upliftable, then we just need to uplift that along with this and we're good. I've started doing this bisection via try pushes ([3], [4]) and will continue as I can over the weekend.

3) Investigate the crashes in situ (they are reproducible on a local beta OS X build) and find some fix for it. Given my poor understanding of IPC/shutdown code this seems pretty hard to do, but if somebody else is willing to do so please go right ahead.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=02e8416c0deb
[2] http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=552ef9bae233&tochange=e9500cac731d
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=d23741cc3a5b
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9bf652334ad6
Quick update:

1) The backout try push looks mostly clean, there's some failures but they look unrelated.

2) I've narrowed down the range to http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e17ded471c3f&tochange=e7a1c3636af1 so far, and am continuing to bisect.
Narrowed the range further to http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=95e61ede373d&tochange=a57dc2f4a6f6, so I'm pretty sure it's bug 1219296 which is responsible. It moves CSS scroll snapping into the compositor and eliminates a RequestFlingSnap IPC message, which I presume is the thing that's causing the crash.

I rebased those patches onto beta, try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=444d4e11fab6
That try push looks ok. I'll request uplift on bug 1219296.
Ritu, I flagged bug 1219296 and its follow-up bug 1267471 for uplift to beta. Given the options I think uplifting those two bugs along with this one to beta is probably the safest approach.

An alternate approach, as I said in comment 48, would be to back out bug 1020199 from 47, but I think that would be more risky (you can see how big that patch is). I can also try to find a "smaller" version of bug 1219296 for beta, but it's unlikely I can do that before the Monday deadline.
Flags: needinfo?(rkothari)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51)
> That try push looks ok. I'll request uplift on bug 1219296.

Thanks kats!
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #52)
> Ritu, I flagged bug 1219296 and its follow-up bug 1267471 for uplift to
> beta. Given the options I think uplifting those two bugs along with this one
> to beta is probably the safest approach.
> 
> An alternate approach, as I said in comment 48, would be to back out bug
> 1020199 from 47, but I think that would be more risky (you can see how big
> that patch is). I can also try to find a "smaller" version of bug 1219296
> for beta, but it's unlikely I can do that before the Monday deadline.

Yup, if this is the least risky option to take to address the top crash, let's do it. I am glad this is e10s-specific, so the overall impact to the 47 release (which will ship with e10s off by default) should be minimal.
Flags: needinfo?(rkothari)
Duplicate of this bug: 1270134
Version: unspecified → 47 Branch