Focus/window activation on Linux really doesn't like changing whether we load about:blank in the initial browser

RESOLVED INVALID

Status

()

enhancement
P2
normal
RESOLVED INVALID
9 months ago
3 months ago

People

(Reporter: Gijs, Unassigned)

Tracking

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox64 affected)

Details

Reporter

Description

9 months ago
For the past few weeks I've been trying to get the tree green with bug 1362774 landed. The effective patch there is very small - https://hg.mozilla.org/try/rev/fad02c752832ee33382e6c4610d1992cccc7db6c .

The unfortunate reality is that a *lot* of very different tests on Linux go orange with that change. The root cause seems to be a difference in how focus and window activation is handled on Linux.

As a relatively simple example, let's take browser_ext_themes_dynamic_getCurrent.js - a webextension test that checks how webextension theme APIs pick the "current" window. ( https://searchfox.org/mozilla-central/source/toolkit/components/extensions/test/browser/browser_ext_themes_dynamic_getCurrent.js ).

The test does a bunch of set up, then creates 2 (additional) windows, then changes their themes to be red-ish (first window) and green-ish (second window).

Then it tries to set focus to the (first) red window, which is behind the (second) green window.

After applying the patch mentioned above, the test fails reproducibly for me on both Linux and Linux64 opt builds. It times out waiting for the webextension onFocusChanged event from giving focus to the first window. If I make the test wait a bit, e.g. add:

await new Promise(r => setTimeout(r, 2000));

before every focus change, the test passes.

The problem seems to be that (without those hacky timeouts), when that code runs, we're still only halfway through with making the second window active/focused, and finishing that job races with attepting to change focus to the first window.

Obviously there's various ways of fixing this particular test by making it wait for the "right" events or whatever. The issue I'm having is that this problem is fairly pervasive - *lots* of tests create windows, close them, and expect focus to follow more or less synchronously, and with these changes they either reproducibly or "often intermittently" start failing.

It's also unclear to me why the change to the initial about:blank loading triggers these race conditions. The best I have is some handwavy "timings of various events change, so now failures are more likely".

Neil, do you have any ideas as to what's going on here, if there's a gtk/widget/focus-handling fix we could use instead, and/or why the about:blank document existing in the initial browser in newly opened windows would make a difference to focus/activate behavior on Linux? Happy to do more debugging, but at the moment I'm just not sure where to start, or if this is something that we can solve at the gecko/platform level rather than by fixing a bunch of tests.

(Comparatively speaking, there are almost no problems on Windows/macOS, to the point that I'm considering landing the above change for those platforms only, in order to get test coverage and reap the benefits of the change on our more heavily used platforms, rather than blocking it on seemingly idiosyncratic Linux issues.)
Flags: needinfo?(enndeakin)
See Also: → 521233
Reporter

Updated

9 months ago
Depends on: 1492782

Comment 1

9 months ago
> After applying the patch mentioned above, the test fails reproducibly for me
> on both Linux and Linux64 opt builds. It times out waiting for the
> webextension onFocusChanged event from giving focus to the first window. If
> I make the test wait a bit, e.g. add:
> 

Does the test check if the window is already focused before waiting for a focus event? I don't see that from a skim of browser_ext_themes_dynamic_getCurrent.js . Is it checking the content window or top-level window?

Window activation and focus events aren't necessarily immediate, especially on Linux.

Can you verify when nsGlobalWindowInner::SetReadyForFocus is called? Focus events shouldn't happen before then and should afterwards.
Flags: needinfo?(enndeakin)
According to comment #1, add need info to the reporter.
Flags: needinfo?(gijskruitbosch+bugs)
Priority: -- → P2
Reporter

Comment 3

9 months ago
(In reply to Neil Deakin from comment #1)
> > After applying the patch mentioned above, the test fails reproducibly for me
> > on both Linux and Linux64 opt builds. It times out waiting for the
> > webextension onFocusChanged event from giving focus to the first window. If
> > I make the test wait a bit, e.g. add:
> > 
> 
> Does the test check if the window is already focused before waiting for a
> focus event? I don't see that from a skim of
> browser_ext_themes_dynamic_getCurrent.js . Is it checking the content window
> or top-level window?
> 
> Window activation and focus events aren't necessarily immediate, especially
> on Linux.
> 
> Can you verify when nsGlobalWindowInner::SetReadyForFocus is called? Focus
> events shouldn't happen before then and should afterwards.

In a passing run, I see: https://gist.github.com/gijsk/e5d21d9f5bbc8d50ae3bb2e00a9a98fb

In the case where the test times out: https://gist.github.com/gijsk/a5cf3d8e8c7e36eb7558ae5c95ef4481

This logs when SetReadyForFocus happens, and stacks for ActivateOrDeActivate() in the parent, plus some front-end logging.

There's clearly some difference as to when `SetReadyForFocus` is called. As far as I can tell it's to do with whether the window is activated - adding a capturing focus handler shows that we get a focus event in the window we're trying to focus in the working case, whereas we do not in the broken case. That in turn seems to match up with the child thinking it's inactive and so changing behavior here - https://searchfox.org/mozilla-central/rev/819cd31a93fd50b7167979607371878c4d6f18e8/dom/base/nsFocusManager.cpp#915-920 . But maybe it's something else - the gist of it seems to be that in the working case, we activate/focus/focusin the window that gets created much sooner than in the broken case. It happening later in the broken case causes a race, which means we never end up focusing the window we want.

But it's not clear to me *why* this happens - the events seem to all come from the gtk widget manager, so I don't see how to correlate them to any gecko behavior changes. Plus, the focus code is just... really hard to reason about. Like, why do we check against `mFocusedWindow` (here https://searchfox.org/mozilla-central/rev/819cd31a93fd50b7167979607371878c4d6f18e8/dom/base/nsFocusManager.cpp#922-923 ), when I can literally *never* see that check passing? It seems to always be either nullptr or some other window (both before and after the patch). Why do we try to send child focus updates in the child process? ( https://searchfox.org/mozilla-central/rev/819cd31a93fd50b7167979607371878c4d6f18e8/dom/base/nsFocusManager.cpp#1212-1214 ). I expect the latter isn't relevant here, but I'm less sure about the former. It doesn't help that by the nature of the patch, we create a different number of windows and so the window IDs vary between the runs, making it hard to compare them.
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(enndeakin)
Reporter

Comment 4

9 months ago
In particular, in the successful case, we get:

 0:05.65 INFO Testing getCurrent() with after theme.update(windowId)
 0:05.79 GECKO(59342) SetReadyForFocus 2147483654
 0:05.79 GECKO(59342) Setting tabchild 0x7ff48fe171a0 as (active)
 0:05.79 GECKO(59342) Sending parent activated to children of 2147483654
 0:05.79 GECKO(59342) Window 2147483654 shown, aNeedsFocus: yes
 0:05.79 GECKO(59342) Focused window (2147483654) not the same as mFocusedWindow (0).
 0:06.10 GECKO(59342) SetReadyForFocus 23
 0:06.10 GECKO(59342) Window 23 shown, aNeedsFocus: yes
 0:06.10 GECKO(59342) Focused window (23) not the same as mFocusedWindow (3).

<identical JS goop>

 0:06.22 GECKO(59342) #01: nsWebShellWindow::WidgetListenerDelegate::WindowDeactivated() (/home/gijs/mozilla-central/xpfe/appshell/nsWebShellWindow.cpp:494)
 0:06.22 GECKO(59342) #02: nsWindow::OnContainerFocusOutEvent(_GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/nsWindow.cpp:6375)
 0:06.22 GECKO(59342) #03: focus_out_event_cb(_GtkWidget*, _GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/../../mfbt/RefPtr.h:47)

<remaining stack>

 0:06.24 GECKO(59342) Sending parent deactivated to children of 3
 0:06.24 GECKO(59342) Recvd parent inactive for window 6442450945
 0:06.24 GECKO(59342) Sending parent deactivated to children of 6442450945
 0:06.24 GECKO(59342) Recvd parent inactive for window 2147483654
 0:06.24 GECKO(59342) Sending parent deactivated to children of 2147483654
 0:06.25 GECKO(59342) #01: nsWebShellWindow::WindowActivated() (/home/gijs/mozilla-central/xpfe/appshell/nsWebShellWindow.cpp:479)
 0:06.26 GECKO(59342) #02: nsWebShellWindow::WidgetListenerDelegate::WindowActivated() (/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/atomic_base.h:524)
 0:06.26 GECKO(59342) #03: nsWindow::OnContainerFocusInEvent(_GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/nsWindow.cpp:2881)
 0:06.26 GECKO(59342) #04: focus_in_event_cb(_GtkWidget*, _GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/../../mfbt/RefPtr.h:47)

<remaining stack>

 0:06.29 GECKO(59342) Sending parent activated to children of 23
 0:06.35 GECKO(59342) Recvd parent inactive for window 8589934593
 0:06.35 GECKO(59342) Sending parent deactivated to children of 8589934593
 0:06.40 GECKO(59342) Recvd parent active for window 8589934593
 0:06.40 GECKO(59342) Sending parent activated to children of 8589934593

<js sees URL bar being focused>

 0:06.58 GECKO(59342) SetReadyForFocus 8589934593
 0:06.58 GECKO(59342) Setting tabchild 0x7ff9c5dd09a0 as (active)
 0:06.58 GECKO(59342) Sending parent activated to children of 8589934593
 0:06.58 GECKO(59342) Window 8589934593 shown, aNeedsFocus: yes
 0:06.58 GECKO(59342) Focused window (8589934593) not the same as mFocusedWindow (0).

and then the subtest (which opened window 23) runs.


In the bad case:

 0:04.28 INFO Testing getCurrent() with after theme.update(windowId)
 0:04.46 GECKO(59972) SetReadyForFocus 2147483653
 0:04.46 GECKO(59972) Setting tabchild 0x7f9b610171a0 as (active)
 0:04.46 GECKO(59972) Sending parent activated to children of 2147483653
 0:04.46 GECKO(59972) Window 2147483653 shown, aNeedsFocus: yes
 0:04.46 GECKO(59972) Focused window (2147483653) not the same as mFocusedWindow (0).
 0:04.66 GECKO(59972) SetReadyForFocus 23
 0:04.66 GECKO(59972) Window 23 shown, aNeedsFocus: yes
 0:04.66 GECKO(59972) Focused window (23) not the same as mFocusedWindow (3).

<identical JS goop>

 0:04.96 GECKO(59972) Recvd parent inactive for window 8589934593
 0:04.96 GECKO(59972) Sending parent deactivated to children of 8589934593


and that subtest (which opened window 23) just runs, before the window has been activated. Then after that:

 0:05.05 PASS Waiting for window to be 3 - 
 0:05.18 GECKO(59972) #01: nsWebShellWindow::WidgetListenerDelegate::WindowDeactivated() (/home/gijs/mozilla-central/xpfe/appshell/nsWebShellWindow.cpp:494)
 0:05.18 GECKO(59972) #02: nsWindow::OnContainerFocusOutEvent(_GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/nsWindow.cpp:6375)
 0:05.18 GECKO(59972) #03: focus_out_event_cb(_GtkWidget*, _GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/../../mfbt/RefPtr.h:47)

 <more stack>

 0:05.19 GECKO(59972) Sending parent deactivated to children of 3
 0:05.19 GECKO(59972) Recvd parent inactive for window 6442450945
 0:05.19 GECKO(59972) Recvd parent inactive for window 2147483653
 0:05.19 GECKO(59972) Sending parent deactivated to children of 6442450945
 0:05.19 GECKO(59972) Sending parent deactivated to children of 2147483653
 0:05.21 GECKO(59972) #01: nsWebShellWindow::WindowActivated() (/home/gijs/mozilla-central/xpfe/appshell/nsWebShellWindow.cpp:479)
 0:05.21 GECKO(59972) #02: nsWebShellWindow::WidgetListenerDelegate::WindowActivated() (/usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/atomic_base.h:524)
 0:05.21 GECKO(59972) #03: nsWindow::OnContainerFocusInEvent(_GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/nsWindow.cpp:2881)
 0:05.21 GECKO(59972) #04: focus_in_event_cb(_GtkWidget*, _GdkEventFocus*) (/home/gijs/mozilla-central/widget/gtk/../../mfbt/RefPtr.h:47)

 <more stack>

 0:05.24 GECKO(59972) Sending parent activated to children of 23

<js sees URL bar being focused>

 0:05.30 GECKO(59972) SetReadyForFocus 8589934593
 0:05.31 GECKO(59972) Setting tabchild 0x7f67717ce9a0 as (inactive)
 0:05.31 GECKO(59972) Sending parent deactivated to children of 8589934593
 0:05.31 GECKO(59972) Window 8589934593 shown, aNeedsFocus: yes
 0:05.31 GECKO(59972) Focused window (8589934593) not the same as mFocusedWindow (0).
 0:05.31 GECKO(59972) Recvd parent active for window 8589934593
 0:05.31 GECKO(59972) Sending parent activated to children of 8589934593


So there aren't activate/deactivate events for the parent immediately when it's being opened, unlike in the 'good' case - the active window switches after this subtest runs. Again, I don't know *why* - the stacks between the good/bad case for the parent level active window changing look exactly the same, and come from the gtk event loop. I imagine it's us putting those events in somehow, but I don't know where or how to look for them.

As noted in comment #0, while I can obviously fix this test by making it do more waiting, there are a lot of these tests, and it also seems undesirable for focus behavior here to actually be changing. I'm also wondering about the wpt test changes from bug 1489700 being right, and if I'm not just running into bugs in our gtk/focus/widget implementation due to the about:blank window no longer being created here and all the nth-order effects from that.


I'll also note that leaving the about:blank window creation alone but making the initial browser remote earlier has exactly the same symptoms ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=306f577a29c017f5352497b82e6bd0054fc616df&selectedJob=201443200 )

Comment 5

9 months ago
I took a look at debugging the test browser_ext_themes_dynamic_getCurrent.js

It creates a window, does some checks, then creates a second window via 'browser.windows.create()' The test doesn't wait for the second window to be focused before continuing. The test then uses 'browser.windows.update' with a focused flag to try to focus the first window. But the first window is still focused, so this flag is effectively ignored (the active window is the same as the requested window at the beginning of nsFocusManager::RaiseWindow). So the focus event the test is expecting never happens.

So I don't see anything wrong with the focus system from here.

Presumably, changing the way the page loads makes some event loop processing not happen immediately, so the response to the focus request doesn't happen at the same time.
Flags: needinfo?(enndeakin)

Comment 6

9 months ago
>  Like, why do we check against `mFocusedWindow` (here
> https://searchfox.org/mozilla-central/rev/
> 819cd31a93fd50b7167979607371878c4d6f18e8/dom/base/nsFocusManager.cpp#922-923
> ), when I can literally *never* see that check passing? It seems to always
> be either nullptr or some other window (both before and after the patch).

I think there are cases where a page can try to focus before painting has occurred. I don't remember what specific cases they are.

> Why do we try to send child focus updates in the child process? (
> https://searchfox.org/mozilla-central/rev/
> 819cd31a93fd50b7167979607371878c4d6f18e8/dom/base/nsFocusManager.cpp#1212-
> 1214 ).

This is mostly to update the -moz-window-inactive state on all child frames.

Neither of these are relevant to this bug I think.
Reporter

Comment 7

9 months ago
(In reply to Neil Deakin from comment #5)
> I took a look at debugging the test browser_ext_themes_dynamic_getCurrent.js

Thanks very much for taking a look.

> So I don't see anything wrong with the focus system from here.

Sure, it's less that I think that there's something wrong with the focus system than that I'm concerned about the implications of the change and not fully understanding those, and whether these automated test failures are telling us something about the window open sequence post-patch that is a real issue, e.g. where keypresses could get lost or where content windows/docshells get marked 'active' later (or, in bad situations, not at all), causing issues with APIs that depend on that (e.g. webauthn, whose tests are also failing with this change, only on Linux, because a bunch of their APIs point-blank reject if run in inactive windows).

> It creates a window, does some checks, then creates a second window via
> 'browser.windows.create()' The test doesn't wait for the second window to be
> focused before continuing. The test then uses 'browser.windows.update' with
> a focused flag to try to focus the first window. But the first window is
> still focused,

OK, sure, but why? Why do we always focus the second window immediately pre-patch, and don't we do that post-patch?

> Presumably, changing the way the page loads makes some event loop processing
> not happen immediately, so the response to the focus request doesn't happen
> at the same time.

I'd really like to understand this better.

In an ideal world, this patch is causing us to load fewer documents. And actually, doing some stack tracing from WindowShown (this time also in the child process), I found something that does seem odd. Specifically, comparing:

https://gist.github.com/gijsk/e6112a85d55f403103ea183d7a9146cb (bad)
and
https://gist.github.com/gijsk/99b4020c259d3216ae157b17f297f0ac (good)

I see 3 calls (1 parent, 2 child) in the 'good' case, between ' 0:05.20 INFO Testing getCurrent() with after theme.update(windowId)' and the testing happening (  0:05.97 PASS Theme 1 header URL should be applied - ).

In the bad case, I see only 2 calls (1 parent, 1 child). 1 of the child calls has gone AWOL. I don't really understand our focus system, but it seems from the stacks that when a document loads, we (sometimes?) call SetReadyForFocus, because we run layout then and as a result we end up here:

https://searchfox.org/mozilla-central/rev/819cd31a93fd50b7167979607371878c4d6f18e8/layout/base/PresShell.cpp#3977

So it stands to reason that, loading 1 fewer document (because we no longer load about:blank in the content process before loading our test document, or something along those lines) there's a change in how often we call SetReadyForFocus on the content windows, and that influences focus for the parent windows. Or something. Neil, is that plausible?

The other thing I'm wondering here is if loading about:newtab in the new window (which the test does), it's possible that we swap in a preloaded browser, and if this is causing us to be confused about the active state of the docshell. In that respect, it seems interesting that in the "bad" case, 1 of the triggers of the focus event seems to be hanging off an SVG image loading (causing us to mark the document as loaded), and that doesn't happen in the "good" case. Mike, any chance you can quickly exclude preloaded about:newtab playing a role here and/or pointing me to code that'd be involved in that case?
Flags: needinfo?(mconley)
Flags: needinfo?(enndeakin)

Comment 8

9 months ago
I would expect SetReadyForFocus to be called on blank documents and would be called less when blank documents aren't being loaded. But I think SetReadyForFocus is not related to the issue here.

My testing shows that the event loop is being processed significantly more often without this patch (in fact, almost at every line of the test), and much less often with this patch. This leads to the focus change events which were before received just after 'browser.windows.update' on the newly created second window, to now not being received until the next 'update' call.
Flags: needinfo?(enndeakin)
(In reply to :Gijs (he/him) from comment #7)
> Mike, any chance you can quickly exclude preloaded
> about:newtab playing a role here and/or pointing me to code that'd be
> involved in that case?

The preloaded about:newtab goop can be disabled by setting browser.newtab.preload to false.

FWIW, I seem to recall hitting some focus issues when I was working on bug 1353013... if I recall correctly, it had something to do with Marionette, and was also due to the fact that TabChild's always think they're visible[1], which I _think_ means that it's convinced that it can focus itself.

Unfortunately, my notes[2] are kinda fragmentary and incomplete about it. :/

[1]: https://searchfox.org/mozilla-central/rev/924e3d96d81a40d2f0eec1db5f74fc6594337128/dom/ipc/TabChild.cpp#887
[2]: https://www.evernote.com/l/AbIywYX5mXVIh73D7sbjC5ozNCPVQEHa0fo
Flags: needinfo?(mconley)
QA Contact: overholt
QA Contact: overholt
Reporter

Comment 10

8 months ago
OK, so based on Neil's comments, I think there isn't much else to do here - I'll just have to go fix all the tests this breaks.
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → INVALID
See Also: → 1497364
Component: Event Handling → User events and focus handling
You need to log in before you can comment on or make changes to this bug.