Crash in nsGlobalWindow::SetOpenerWindow

NEW
Unassigned

Status

()

Core
DOM
P2
critical
7 months ago
4 months ago

People

(Reporter: philipp, Unassigned)

Tracking

({crash, leave-open, regression})

52 Branch
crash, leave-open, regression
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr45 unaffected, firefox51 unaffected, firefox52 fixed, firefox-esr52 fixed, firefox53 wontfix, firefox54 fixed, firefox55 fixed)

Details

(crash signature)

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

7 months ago
This bug was filed from the Socorro interface and is 
report bp-342a68bf-ed97-46a3-a811-1be472170126.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsGlobalWindow::SetOpenerWindow(nsPIDOMWindowOuter*, bool) 	dom/base/nsGlobalWindow.cpp:3235
1 	xul.dll 	nsWindowWatcher::ReadyOpenedDocShellItem(nsIDocShellTreeItem*, nsPIDOMWindowOuter*, bool, bool, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:2164
2 	xul.dll 	nsWindowWatcher::OpenWindowInternal(mozIDOMWindowProxy*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, bool, bool, nsIDocShellLoadInfo*, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:1077
3 	xul.dll 	nsWindowWatcher::OpenWindow2(mozIDOMWindowProxy*, char const*, char const*, char const*, bool, bool, bool, nsISupports*, bool, bool, nsIDocShellLoadInfo*, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:442
4 	xul.dll 	nsGlobalWindow::OpenInternal(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, bool, bool, bool, bool, bool, nsIArray*, nsISupports*, nsIDocShellLoadInfo*, bool, nsPIDOMWindowOuter**) 	dom/base/nsGlobalWindow.cpp:12590
5 	xul.dll 	nsGlobalWindow::OpenNoNavigate(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, nsPIDOMWindowOuter**) 	dom/base/nsGlobalWindow.cpp:8463
6 	xul.dll 	nsDocShell::InternalLoad(nsIURI*, nsIURI*, bool, nsIURI*, unsigned int, nsIPrincipal*, nsIPrincipal*, unsigned int, nsAString_internal const&, char const*, nsAString_internal const&, nsIInputStream*, nsIInputStream*, unsigned int, nsISHEntry*, bool, nsAString_internal const&, nsIDocShell*, nsIURI*, nsIDocShell**, nsIRequest**) 	docshell/base/nsDocShell.cpp:10068
7 	xul.dll 	nsDocShell::OnLinkClickSync(nsIContent*, nsIURI*, char16_t const*, nsAString_internal const&, nsIInputStream*, nsIInputStream*, nsIDocShell**, nsIRequest**) 	docshell/base/nsDocShell.cpp:14043
8 	xul.dll 	OnLinkClickEvent::Run() 	docshell/base/nsDocShell.cpp:13822
9 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1216
10 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:96
11 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
12 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
13 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:262
14 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:283
15 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4467
16 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp:4600
17 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:4691

this is a cross-platform crash that is regressing since firefox 52 with the MOZ_RELEASE_ASSERT(!contentOpener || !mTabGroup || mTabGroup == Cast(contentOpener)->mTabGroup) that was added in bug 1303196.
Michael, can you take a look?
Flags: needinfo?(michael)
Priority: -- → P2
Ehsan, you've been looking at this type of crash recently, and may have a better idea of where this sort of problem would be coming from.

Our best bet for 52 may be to remove the assertion and try to fix it in 54.
Flags: needinfo?(michael) → needinfo?(ehsan)
(In reply to Michael Layzell [:mystor] from comment #2)
> Ehsan, you've been looking at this type of crash recently, and may have a
> better idea of where this sort of problem would be coming from.

I have?  This is the first I'm seeing this type of crash as far as my memory serves.  Are you confusing this with bug 1334281 (unrelated)?
Flags: needinfo?(ehsan) → needinfo?(michael)
(In reply to :Ehsan Akhgari from comment #3)
> I have?  This is the first I'm seeing this type of crash as far as my memory
> serves.  Are you confusing this with bug 1334281 (unrelated)?

Yup, I think I am. sorry about that.
Flags: needinfo?(michael)
This seems like it's only occurring in Firefox 52 (at least I don't see any crashes on 53/54) - not sure why we wouldn't be hitting this crash in Firefox 53+.

I fixed a lot of bugs like this one when I originally landed the OpenerWindow changes, but I don't really have any ideas about what in particular could be triggering this crash. The stack doesn't really tell me much.
(Reporter)

Comment 6

7 months ago
there are crashes with 53&54 as well: http://bit.ly/2jx205H
(In reply to [:philipp] from comment #6)
> there are crashes with 53&54 as well: http://bit.ly/2jx205H

Ah, I'm just bad at using the crashstats API :)

Comment 8

6 months ago
Michael, could you suggest how to progress this bug? Thanks.
Flags: needinfo?(michael)
Created attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

I looked into this crash a bit. There are a couple of interesting things to note:

a) All but one of these crashes occurs in the chrome process, and
b) IIRC we don't actually care about TabGroups being 100% accurate outside of the content process.

This means that, at least as a temporary solution, we can simply disable this assertion in the parent process, as it shouldn't cause us any problems to simply have slightly incorrect TabGroups when in the chrome process.

Ehsan, do you have thoughts on this as, at least a temporary solution. I spent a few hours about a week ago looking through a minidump from one of these crash reports and I wasn't able to figure out where the setting of the opener went wrong. The code paths for opening new windows with openers in the parent process are much more complex than the ones in the content process, so limiting ourselves to only having to ensure that the content process sets openers correctly would greatly simplify things.

MozReview-Commit-ID: A9kIBs6lC5x
Attachment #8837200 - Flags: review?(ehsan)
Flags: needinfo?(michael)
Can you provide a breakdown of how many of the existing reports are from the parent vs the child?  Also, how are you distinguishing whether the crash is really coming from the parent vs the case where e10s is disabled?  I'm OK to disable this check on beta to deal with the current bleeding but this is a wallpaper, and I would be more comfortable with it if we were on a path towards discovering why this happens...
Flags: needinfo?(michael)
(In reply to :Ehsan Akhgari from comment #10)
> Can you provide a breakdown of how many of the existing reports are from the
> parent vs the child?  Also, how are you distinguishing whether the crash is
> really coming from the parent vs the case where e10s is disabled?  I'm OK to
> disable this check on beta to deal with the current bleeding but this is a
> wallpaper, and I would be more comfortable with it if we were on a path
> towards discovering why this happens...

I used the crash reporter to filter by which process was causing the crash ("Process type"). In the results of that check, I discovered that there was only 1 crash which occurred in a non-browser process (where browser as far as I can tell means chrome, as the options are "content", "browser", "gpu", and "plugin"). This is that crash: https://crash-stats.mozilla.com/report/index/d254b080-dfc9-498f-900f-ca95c2161013

The stack in that crash report looks mildly insane, as for some reason the call to Initialize in gfxGDIFont is calling into nsGlobalWindow::SetOpenerWindow, so I imagine that this crash is due to heap corruption of some form, rather than because of a problem with our opener logic. In addition, unlike the other crashes, this crash was not caused by the MOZ_RELEASE_ASSERT, but rather by a different memory access violation.

The other crashes all occurred in the parent process.

I also would prefer to be on a path toward discovering why this happens. I imagine that it occurs because there is some code path in the parent process where the opener window is not set correctly when a new window is created. I tried inspecting a minidump which you sent to me, and the codepath which it appears would have been used should have worked fine, so I don't know what the problem was.

One advantage of this change is that the codepaths in the content process for correctly setting window.opener are much simpler than the codepaths in the parent process, meaning that we will have much less code to look through in the future if TabGroups end up being incorrect.
Flags: needinfo?(michael)
(In reply to Michael Layzell [:mystor] from comment #11)
> (In reply to :Ehsan Akhgari from comment #10)
> > Can you provide a breakdown of how many of the existing reports are from the
> > parent vs the child?  Also, how are you distinguishing whether the crash is
> > really coming from the parent vs the case where e10s is disabled?  I'm OK to
> > disable this check on beta to deal with the current bleeding but this is a
> > wallpaper, and I would be more comfortable with it if we were on a path
> > towards discovering why this happens...
> 
> I used the crash reporter to filter by which process was causing the crash
> ("Process type"). In the results of that check, I discovered that there was
> only 1 crash which occurred in a non-browser process (where browser as far
> as I can tell means chrome, as the options are "content", "browser", "gpu",
> and "plugin"). This is that crash:
> https://crash-stats.mozilla.com/report/index/d254b080-dfc9-498f-900f-
> ca95c2161013

Are you paying attention to the absense of the ProcessType field?  AFAICT that field only gets added to the crash report in e10s.  And most of these crashes indeed seem to be from non-e10s browsers (for example, comment 0.)

> The stack in that crash report looks mildly insane, as for some reason the
> call to Initialize in gfxGDIFont is calling into
> nsGlobalWindow::SetOpenerWindow, so I imagine that this crash is due to heap
> corruption of some form, rather than because of a problem with our opener
> logic. In addition, unlike the other crashes, this crash was not caused by
> the MOZ_RELEASE_ASSERT, but rather by a different memory access violation.

Hmm, the stack trace there is surely suspect.  I have no good theory what's going on here.

> The other crashes all occurred in the parent process.

I don't think that's correct, per above.

> I also would prefer to be on a path toward discovering why this happens. I
> imagine that it occurs because there is some code path in the parent process
> where the opener window is not set correctly when a new window is created. I
> tried inspecting a minidump which you sent to me, and the codepath which it
> appears would have been used should have worked fine, so I don't know what
> the problem was.
> 
> One advantage of this change is that the codepaths in the content process
> for correctly setting window.opener are much simpler than the codepaths in
> the parent process, meaning that we will have much less code to look through
> in the future if TabGroups end up being incorrect.

Again, after looking at about 10 or so of these crashes, I'm now fairly certain that it happens in non-e10s builds rather than the parent process.  So r- for the patch because of that.

My suggestion for what to do next: try out the URLs that I sent you a while back with these crashes in a non-e10s browser to see if you can reproduce (assuming you can figure out how to create a new window from those sites).  If that doesn't reveal anything, then I think the best and only remaining option would be to assert our invariants earlier and hope that one of such assertions would fire off earlier than the crash to shed more light on what's happening.
Attachment #8837200 - Flags: review?(ehsan) → review-
Another thing to note is that this crash happens under https://hg.mozilla.org/releases/mozilla-beta/annotate/78ae21055d9f/docshell/base/nsDocShell.cpp#l9990.  So you should look for code that does a window.open with a target window name but without an existing docshell registered for that target (it could be "_blank" or a real name I think if I'm reading the code correctly).

Updated

6 months ago
status-firefox-esr52: --- → affected
Hey Michael, I don't suppose you've had a chance to come back to this? Fx52 goes to RC next Monday, so we're getting low on time.
Flags: needinfo?(michael)
We should remove the assertion on 52 beta.
Comment on attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

r? on this patch for Beta so we can disable the assertion there.
Flags: needinfo?(michael)
Attachment #8837200 - Flags: review- → review?(ehsan)
Comment on attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

Review of attachment 8837200 [details] [diff] [review]:
-----------------------------------------------------------------

The commit message also needs changing...

::: dom/base/nsGlobalWindow.cpp
@@ +3424,5 @@
> +  // when in a content process.
> +  if (XRE_IsContentProcess()) {
> +    nsPIDOMWindowOuter* contentOpener = GetSanitizedOpener(aOpener);
> +    MOZ_RELEASE_ASSERT(!contentOpener || !mTabGroup ||
> +                       mTabGroup == Cast(contentOpener)->mTabGroup);

No need to assert in any process on beta.  Please instead of doing this, just remove the MOZ_RELEASE_ASSERT completely.  r=me with that for beta.
Attachment #8837200 - Flags: review?(ehsan) → review+
Created attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

MozReview-Commit-ID: 3cAsrJzTvXt
Attachment #8837200 - Attachment is obsolete: true
Keywords: leave-open
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

Approval Request Comment
[Feature/Bug causing the regression]: bug 1303196
[User impact if declined]: Occasional unnecessary crashes
[Is this code covered by automated tests?]: N/A
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No - this removes an unused-in-beta RELEASE_ASSERT which was occasionally being hit.
[Why is the change risky/not risky?]: The assertion which is being removed cannot lead to memory unsafety in beta.
[String changes made/needed]: None
Attachment #8840498 - Flags: approval-mozilla-beta?
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

remove a release assert in beta52
Attachment #8840498 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
We may want to turn this into a diagnostic assert on trunk and aurora for now, so that we don't have to keep fixing this on beta when 53 goes to beta...

Comment 22

6 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/3ba4a9e4bf35
status-firefox52: affected → fixed
Created attachment 8840550 [details] [diff] [review]
Change the SetOpenerWindow tabgroup assertion to a DIAGNOSTIC_ASSERT

MozReview-Commit-ID: 13zUBotz7Tr
Attachment #8840550 - Flags: review?(ehsan)
Setting qe-verify- based on Michael's assessment on manual testing needs (see Comment 19).
Flags: qe-verify-

Comment 25

6 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-esr52/rev/3ba4a9e4bf35
status-firefox-esr52: affected → fixed
Attachment #8840550 - Flags: review?(ehsan) → review+

Comment 26

6 months ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/32dce6ccde94
Change the SetOpenerWindow tabgroup assertion to a DIAGNOSTIC_ASSERT, r=ehsan

Comment 27

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/32dce6ccde94
Should this still be open?
Flags: needinfo?(michael)
(In reply to Mark Banner (:standard8) from comment #28)
> Should this still be open?

Yes, we haven't had a chance to fix the actual underlying problem, we have only disabled the problematic code in beta/release.
Flags: needinfo?(michael)
Michael and I just discussed how this *could* be caused by addons but looking at recent reports it does appear to be happening to people who only have the default addons.
(In reply to Michael Layzell [:mystor] from comment #29)
> (In reply to Mark Banner (:standard8) from comment #28)
> > Should this still be open?
> 
> Yes, we haven't had a chance to fix the actual underlying problem, we have
> only disabled the problematic code in beta/release.

This does not appear to be the case:

https://hg.mozilla.org/releases/mozilla-release/annotate/d345b657d381/dom/base/nsGlobalWindow.cpp#l3191

That revision is tagged as 53 release, and the crash reports for this signature are still occurring in 53 release.  What went wrong?
Flags: needinfo?(michael)
Peculiarly, it only seems to be happening on 53 and earlier; no reports for 54 and 55.
How odd. It looks like when the patch was landed it was only landed on beta and central, and not into aurura (which is presumably 53 release today). Applying the patch from comment 19 should make this go away.
Flags: needinfo?(michael)
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

Approval Request Comment
see comment 19

The fix for this problem seems like it was only landed on beta and central when it was landed, and it was not landed on aurora (which at the time would have been 53). Landing this patch on 53 would make this crash signature go away.

It's probably not worth triggering a dot release, but might be good as a ride-along.
Attachment #8840498 - Flags: approval-mozilla-release?
This landed on m-b on Feb. 23 and m-c on Feb 25. Beta was 52 and nightly was 54 at that point. Because the status-firefox54 flag wasn't ever set to fixed (as it should be automatically from bugherder), we didn't see this as a "missed uplift" candidate.
status-firefox54: affected → fixed
status-firefox55: --- → fixed
This isn't very high volume on release, so I think wontfixing is reasonable for 53.
status-firefox53: affected → wontfix
Attachment #8840498 - Flags: approval-mozilla-release? → approval-mozilla-release-
You need to log in before you can comment on or make changes to this bug.