Crash in nsGlobalWindow::SetOpenerWindow

NEW
Unassigned

Status

()

P3
critical
2 years ago
15 days ago

People

(Reporter: philipp, Unassigned)

Tracking

({crash, leave-open, regression})

52 Branch
crash, leave-open, regression
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr45 unaffected, firefox51 unaffected, firefox52 fixed, firefox-esr52 fixed, firefox53 wontfix, firefox54 fixed, firefox55 fixed)

Details

(crash signature)

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-342a68bf-ed97-46a3-a811-1be472170126.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsGlobalWindow::SetOpenerWindow(nsPIDOMWindowOuter*, bool) 	dom/base/nsGlobalWindow.cpp:3235
1 	xul.dll 	nsWindowWatcher::ReadyOpenedDocShellItem(nsIDocShellTreeItem*, nsPIDOMWindowOuter*, bool, bool, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:2164
2 	xul.dll 	nsWindowWatcher::OpenWindowInternal(mozIDOMWindowProxy*, char const*, char const*, char const*, bool, bool, bool, nsIArray*, bool, bool, nsIDocShellLoadInfo*, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:1077
3 	xul.dll 	nsWindowWatcher::OpenWindow2(mozIDOMWindowProxy*, char const*, char const*, char const*, bool, bool, bool, nsISupports*, bool, bool, nsIDocShellLoadInfo*, mozIDOMWindowProxy**) 	embedding/components/windowwatcher/nsWindowWatcher.cpp:442
4 	xul.dll 	nsGlobalWindow::OpenInternal(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, bool, bool, bool, bool, bool, nsIArray*, nsISupports*, nsIDocShellLoadInfo*, bool, nsPIDOMWindowOuter**) 	dom/base/nsGlobalWindow.cpp:12590
5 	xul.dll 	nsGlobalWindow::OpenNoNavigate(nsAString_internal const&, nsAString_internal const&, nsAString_internal const&, nsPIDOMWindowOuter**) 	dom/base/nsGlobalWindow.cpp:8463
6 	xul.dll 	nsDocShell::InternalLoad(nsIURI*, nsIURI*, bool, nsIURI*, unsigned int, nsIPrincipal*, nsIPrincipal*, unsigned int, nsAString_internal const&, char const*, nsAString_internal const&, nsIInputStream*, nsIInputStream*, unsigned int, nsISHEntry*, bool, nsAString_internal const&, nsIDocShell*, nsIURI*, nsIDocShell**, nsIRequest**) 	docshell/base/nsDocShell.cpp:10068
7 	xul.dll 	nsDocShell::OnLinkClickSync(nsIContent*, nsIURI*, char16_t const*, nsAString_internal const&, nsIInputStream*, nsIInputStream*, nsIDocShell**, nsIRequest**) 	docshell/base/nsDocShell.cpp:14043
8 	xul.dll 	OnLinkClickEvent::Run() 	docshell/base/nsDocShell.cpp:13822
9 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1216
10 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:96
11 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
12 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
13 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:262
14 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:283
15 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4467
16 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp:4600
17 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:4691

this is a cross-platform crash that is regressing since firefox 52 with the MOZ_RELEASE_ASSERT(!contentOpener || !mTabGroup || mTabGroup == Cast(contentOpener)->mTabGroup) that was added in bug 1303196.
Michael, can you take a look?
Flags: needinfo?(michael)
Priority: -- → P2

Comment 2

2 years ago
Ehsan, you've been looking at this type of crash recently, and may have a better idea of where this sort of problem would be coming from.

Our best bet for 52 may be to remove the assertion and try to fix it in 54.
Flags: needinfo?(michael) → needinfo?(ehsan)

Comment 3

2 years ago
(In reply to Michael Layzell [:mystor] from comment #2)
> Ehsan, you've been looking at this type of crash recently, and may have a
> better idea of where this sort of problem would be coming from.

I have?  This is the first I'm seeing this type of crash as far as my memory serves.  Are you confusing this with bug 1334281 (unrelated)?
Flags: needinfo?(ehsan) → needinfo?(michael)

Comment 4

2 years ago
(In reply to :Ehsan Akhgari from comment #3)
> I have?  This is the first I'm seeing this type of crash as far as my memory
> serves.  Are you confusing this with bug 1334281 (unrelated)?

Yup, I think I am. sorry about that.
Flags: needinfo?(michael)

Comment 5

2 years ago
This seems like it's only occurring in Firefox 52 (at least I don't see any crashes on 53/54) - not sure why we wouldn't be hitting this crash in Firefox 53+.

I fixed a lot of bugs like this one when I originally landed the OpenerWindow changes, but I don't really have any ideas about what in particular could be triggering this crash. The stack doesn't really tell me much.
(Reporter)

Comment 6

2 years ago
there are crashes with 53&54 as well: http://bit.ly/2jx205H

Comment 7

2 years ago
(In reply to [:philipp] from comment #6)
> there are crashes with 53&54 as well: http://bit.ly/2jx205H

Ah, I'm just bad at using the crashstats API :)
Michael, could you suggest how to progress this bug? Thanks.
Flags: needinfo?(michael)

Comment 9

2 years ago
Created attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

I looked into this crash a bit. There are a couple of interesting things to note:

a) All but one of these crashes occurs in the chrome process, and
b) IIRC we don't actually care about TabGroups being 100% accurate outside of the content process.

This means that, at least as a temporary solution, we can simply disable this assertion in the parent process, as it shouldn't cause us any problems to simply have slightly incorrect TabGroups when in the chrome process.

Ehsan, do you have thoughts on this as, at least a temporary solution. I spent a few hours about a week ago looking through a minidump from one of these crash reports and I wasn't able to figure out where the setting of the opener went wrong. The code paths for opening new windows with openers in the parent process are much more complex than the ones in the content process, so limiting ourselves to only having to ensure that the content process sets openers correctly would greatly simplify things.

MozReview-Commit-ID: A9kIBs6lC5x
Attachment #8837200 - Flags: review?(ehsan)

Updated

2 years ago
Flags: needinfo?(michael)

Comment 10

2 years ago
Can you provide a breakdown of how many of the existing reports are from the parent vs the child?  Also, how are you distinguishing whether the crash is really coming from the parent vs the case where e10s is disabled?  I'm OK to disable this check on beta to deal with the current bleeding but this is a wallpaper, and I would be more comfortable with it if we were on a path towards discovering why this happens...
Flags: needinfo?(michael)

Comment 11

2 years ago
(In reply to :Ehsan Akhgari from comment #10)
> Can you provide a breakdown of how many of the existing reports are from the
> parent vs the child?  Also, how are you distinguishing whether the crash is
> really coming from the parent vs the case where e10s is disabled?  I'm OK to
> disable this check on beta to deal with the current bleeding but this is a
> wallpaper, and I would be more comfortable with it if we were on a path
> towards discovering why this happens...

I used the crash reporter to filter by which process was causing the crash ("Process type"). In the results of that check, I discovered that there was only 1 crash which occurred in a non-browser process (where browser as far as I can tell means chrome, as the options are "content", "browser", "gpu", and "plugin"). This is that crash: https://crash-stats.mozilla.com/report/index/d254b080-dfc9-498f-900f-ca95c2161013

The stack in that crash report looks mildly insane, as for some reason the call to Initialize in gfxGDIFont is calling into nsGlobalWindow::SetOpenerWindow, so I imagine that this crash is due to heap corruption of some form, rather than because of a problem with our opener logic. In addition, unlike the other crashes, this crash was not caused by the MOZ_RELEASE_ASSERT, but rather by a different memory access violation.

The other crashes all occurred in the parent process.

I also would prefer to be on a path toward discovering why this happens. I imagine that it occurs because there is some code path in the parent process where the opener window is not set correctly when a new window is created. I tried inspecting a minidump which you sent to me, and the codepath which it appears would have been used should have worked fine, so I don't know what the problem was.

One advantage of this change is that the codepaths in the content process for correctly setting window.opener are much simpler than the codepaths in the parent process, meaning that we will have much less code to look through in the future if TabGroups end up being incorrect.
Flags: needinfo?(michael)

Comment 12

2 years ago
(In reply to Michael Layzell [:mystor] from comment #11)
> (In reply to :Ehsan Akhgari from comment #10)
> > Can you provide a breakdown of how many of the existing reports are from the
> > parent vs the child?  Also, how are you distinguishing whether the crash is
> > really coming from the parent vs the case where e10s is disabled?  I'm OK to
> > disable this check on beta to deal with the current bleeding but this is a
> > wallpaper, and I would be more comfortable with it if we were on a path
> > towards discovering why this happens...
> 
> I used the crash reporter to filter by which process was causing the crash
> ("Process type"). In the results of that check, I discovered that there was
> only 1 crash which occurred in a non-browser process (where browser as far
> as I can tell means chrome, as the options are "content", "browser", "gpu",
> and "plugin"). This is that crash:
> https://crash-stats.mozilla.com/report/index/d254b080-dfc9-498f-900f-
> ca95c2161013

Are you paying attention to the absense of the ProcessType field?  AFAICT that field only gets added to the crash report in e10s.  And most of these crashes indeed seem to be from non-e10s browsers (for example, comment 0.)

> The stack in that crash report looks mildly insane, as for some reason the
> call to Initialize in gfxGDIFont is calling into
> nsGlobalWindow::SetOpenerWindow, so I imagine that this crash is due to heap
> corruption of some form, rather than because of a problem with our opener
> logic. In addition, unlike the other crashes, this crash was not caused by
> the MOZ_RELEASE_ASSERT, but rather by a different memory access violation.

Hmm, the stack trace there is surely suspect.  I have no good theory what's going on here.

> The other crashes all occurred in the parent process.

I don't think that's correct, per above.

> I also would prefer to be on a path toward discovering why this happens. I
> imagine that it occurs because there is some code path in the parent process
> where the opener window is not set correctly when a new window is created. I
> tried inspecting a minidump which you sent to me, and the codepath which it
> appears would have been used should have worked fine, so I don't know what
> the problem was.
> 
> One advantage of this change is that the codepaths in the content process
> for correctly setting window.opener are much simpler than the codepaths in
> the parent process, meaning that we will have much less code to look through
> in the future if TabGroups end up being incorrect.

Again, after looking at about 10 or so of these crashes, I'm now fairly certain that it happens in non-e10s builds rather than the parent process.  So r- for the patch because of that.

My suggestion for what to do next: try out the URLs that I sent you a while back with these crashes in a non-e10s browser to see if you can reproduce (assuming you can figure out how to create a new window from those sites).  If that doesn't reveal anything, then I think the best and only remaining option would be to assert our invariants earlier and hope that one of such assertions would fire off earlier than the crash to shed more light on what's happening.

Updated

2 years ago
Attachment #8837200 - Flags: review?(ehsan) → review-

Comment 13

2 years ago
Another thing to note is that this crash happens under https://hg.mozilla.org/releases/mozilla-beta/annotate/78ae21055d9f/docshell/base/nsDocShell.cpp#l9990.  So you should look for code that does a window.open with a target window name but without an existing docshell registered for that target (it could be "_blank" or a real name I think if I'm reading the code correctly).

Updated

2 years ago
status-firefox-esr52: --- → affected
Hey Michael, I don't suppose you've had a chance to come back to this? Fx52 goes to RC next Monday, so we're getting low on time.
Flags: needinfo?(michael)

Comment 15

2 years ago
We should remove the assertion on 52 beta.

Comment 16

2 years ago
Comment on attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

r? on this patch for Beta so we can disable the assertion there.
Flags: needinfo?(michael)
Attachment #8837200 - Flags: review- → review?(ehsan)

Comment 17

2 years ago
Comment on attachment 8837200 [details] [diff] [review]
Don't check opener's TabGroup when in parent process

Review of attachment 8837200 [details] [diff] [review]:
-----------------------------------------------------------------

The commit message also needs changing...

::: dom/base/nsGlobalWindow.cpp
@@ +3424,5 @@
> +  // when in a content process.
> +  if (XRE_IsContentProcess()) {
> +    nsPIDOMWindowOuter* contentOpener = GetSanitizedOpener(aOpener);
> +    MOZ_RELEASE_ASSERT(!contentOpener || !mTabGroup ||
> +                       mTabGroup == Cast(contentOpener)->mTabGroup);

No need to assert in any process on beta.  Please instead of doing this, just remove the MOZ_RELEASE_ASSERT completely.  r=me with that for beta.
Attachment #8837200 - Flags: review?(ehsan) → review+

Comment 18

2 years ago
Created attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

MozReview-Commit-ID: 3cAsrJzTvXt

Updated

2 years ago
Attachment #8837200 - Attachment is obsolete: true

Updated

2 years ago
Keywords: leave-open

Comment 19

2 years ago
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

Approval Request Comment
[Feature/Bug causing the regression]: bug 1303196
[User impact if declined]: Occasional unnecessary crashes
[Is this code covered by automated tests?]: N/A
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No - this removes an unused-in-beta RELEASE_ASSERT which was occasionally being hit.
[Why is the change risky/not risky?]: The assertion which is being removed cannot lead to memory unsafety in beta.
[String changes made/needed]: None
Attachment #8840498 - Flags: approval-mozilla-beta?
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

remove a release assert in beta52
Attachment #8840498 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 21

2 years ago
We may want to turn this into a diagnostic assert on trunk and aurora for now, so that we don't have to keep fixing this on beta when 53 goes to beta...

Comment 22

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/3ba4a9e4bf35
status-firefox52: affected → fixed

Comment 23

2 years ago
Created attachment 8840550 [details] [diff] [review]
Change the SetOpenerWindow tabgroup assertion to a DIAGNOSTIC_ASSERT

MozReview-Commit-ID: 13zUBotz7Tr
Attachment #8840550 - Flags: review?(ehsan)
Setting qe-verify- based on Michael's assessment on manual testing needs (see Comment 19).
Flags: qe-verify-

Comment 25

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-esr52/rev/3ba4a9e4bf35
status-firefox-esr52: affected → fixed

Updated

2 years ago
Attachment #8840550 - Flags: review?(ehsan) → review+

Comment 26

2 years ago
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/32dce6ccde94
Change the SetOpenerWindow tabgroup assertion to a DIAGNOSTIC_ASSERT, r=ehsan
Should this still be open?
Flags: needinfo?(michael)

Comment 29

2 years ago
(In reply to Mark Banner (:standard8) from comment #28)
> Should this still be open?

Yes, we haven't had a chance to fix the actual underlying problem, we have only disabled the problematic code in beta/release.
Flags: needinfo?(michael)
Michael and I just discussed how this *could* be caused by addons but looking at recent reports it does appear to be happening to people who only have the default addons.
(In reply to Michael Layzell [:mystor] from comment #29)
> (In reply to Mark Banner (:standard8) from comment #28)
> > Should this still be open?
> 
> Yes, we haven't had a chance to fix the actual underlying problem, we have
> only disabled the problematic code in beta/release.

This does not appear to be the case:

https://hg.mozilla.org/releases/mozilla-release/annotate/d345b657d381/dom/base/nsGlobalWindow.cpp#l3191

That revision is tagged as 53 release, and the crash reports for this signature are still occurring in 53 release.  What went wrong?
Flags: needinfo?(michael)
Peculiarly, it only seems to be happening on 53 and earlier; no reports for 54 and 55.

Comment 33

2 years ago
How odd. It looks like when the patch was landed it was only landed on beta and central, and not into aurura (which is presumably 53 release today). Applying the patch from comment 19 should make this go away.
Flags: needinfo?(michael)

Comment 34

2 years ago
Comment on attachment 8840498 [details] [diff] [review]
Don't check opener's TabGroup in beta

Approval Request Comment
see comment 19

The fix for this problem seems like it was only landed on beta and central when it was landed, and it was not landed on aurora (which at the time would have been 53). Landing this patch on 53 would make this crash signature go away.

It's probably not worth triggering a dot release, but might be good as a ride-along.
Attachment #8840498 - Flags: approval-mozilla-release?
This landed on m-b on Feb. 23 and m-c on Feb 25. Beta was 52 and nightly was 54 at that point. Because the status-firefox54 flag wasn't ever set to fixed (as it should be automatically from bugherder), we didn't see this as a "missed uplift" candidate.
status-firefox54: affected → fixed
status-firefox55: --- → fixed
This isn't very high volume on release, so I think wontfixing is reasonable for 53.
status-firefox53: affected → wontfix
Attachment #8840498 - Flags: approval-mozilla-release? → approval-mozilla-release-
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.