Closed Bug 1589022 Opened 6 years ago Closed 6 years ago

Page scrolls in wrong direction in popup window created by chrome.windows.create({ type: "popup"});

Categories

(Core :: Panning and Zooming, defect, P2)

69 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla72
Tracking Status
firefox-esr68 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- verified

People

(Reporter: dw-dev, Assigned: botond)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0

Steps to reproduce:

This defect was reported in Bug 1573795, which was closed as a duplicate.

This defect was not fixed in Bug 1565525, which has been closed as fixed.

  1. Start Firefox 69.0.1 or Nightly 71.0a1 (2019-10-15)
  2. Install my Tile Tabs WE extension.
  3. Load two tabs with Wikipedia and select first tab.
  4. Click on the Tile Tabs WE toolbar button to create a layout with two tiled windows.
  5. In either tiled window, right-click on Tile Tabs WE toolbar button, select Toggle Toolbars, and then select All Tiles.
  6. In either tiled window, try scrolling the window contents using the mouse wheel.

Actual results:

The page contents are scrolled in the WRONG direction (e.g. down instead of up).

Expected results:

The page contents should be scrolled in the correct direction.

Component: Untriaged → Panning and Zooming
Product: Firefox → Core

(In reply to dw-dev from comment #0)

This defect was reported in Bug 1573795, which was closed as a duplicate.

This defect was not fixed in Bug 1565525, which has been closed as fixed.

It looks like steps 6-8 from bug 1573795 were fixed by bug 1565525, but step 9, which seems to be a different issue, was overlooked. Thanks for filing a separate bug about it.

The issue seems to be specific to WebRender (workaround: disable WebRender via gfx.webrender.force-disabled=true).

Running mozregression with default settings pointed to the patch that enabled WebRender by default.

Running mozregression with --pref gfx.webrender.all:true pointed to https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=a3acb30248ff81e6a0a10a61783fa25896e23e21&tochange=c513275730f60c1366be89fc39ae10b81529bf37.

Glenn, this seems to be a regression from bug 1545006. Could you have a look please?

Component: Panning and Zooming → Graphics: WebRender
Flags: needinfo?(gwatson)
Regressed by: 1545006

I can reproduce this, I'm not quite sure what's going on though.

There's (as far as I know) only one place in the WR API where scroll layers can be defined (the wr_dp_define_scroll_layer binding) and only one place where scrolls can occur (the wr_transaction_scroll_layer binding).

The most likely cause of a WR bug I can think of is if there are nested scroll roots. I did a WR capture and can only see a single scroll root in both cases (when loading the tab normally, and then after applying the repro steps above), so it doesn't seem to be that.

Is there anything you're aware of that Gecko might be doing differently in this case, in terms of how scroll roots / scroll offsets are defined? Is there anything special about popup windows that might be a clue to what's going on?

I'll see if I can create a smaller repro today, which might be easier to debug.

Flags: needinfo?(gwatson)

I think this might be a bug in Gecko - but the interactions between various moving parts are complex enough that I might be misinterpreting the outputs.

I added some logging to ScrollFrameInfo::new() and also SpatialNode::set_scroll_origin().

When opening the initial page, I see logs showing that there are new scroll frames created when a new display list arrives, and then scroll offsets being set as the scrolls occur:

set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,2.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,5.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,10.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,15.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,20.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,26.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,32.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,38.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,45.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,51.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,57.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,63.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,69.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,75.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,80.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,84.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,89.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,92.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,96.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,98.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,100.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,101.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,102.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,102.0) NoClamping [0.0×5524.0]
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×5524.0 e=(0.0,306.0)
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]
set_scroll_origin (0.0,0.0) NoClamping [0.0×5524.0]

I see similar outputs once I click the "Tile Tabs WE" button.

However, once I apply the 'Toggle Toolbars' option, I see:

SFI::new s=0.0×5414.0 e=(0.0,100.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×5414.0 e=(0.0,101.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×5414.0 e=(0.0,102.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×5414.0 e=(0.0,102.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×0.0 e=(0.0,0.0)
SFI::new s=0.0×5414.0 e=(0.0,102.0)

Noteworthy things in this case:
(1) There are no calls from Gecko to set_scroll_origin.
(2) A new display list is being sent every frame (so I guess that means APZ is not active?).

I think what's occurring is that Gecko is creating new scroll frames in this case with an external scroll offset, but not setting the correct scroll origins, which is confusing things.

Normally I'd ask kats at this point if that makes sense, since I don't know the gecko-side WR code at all :)

Does it seem possible that it should be setting scroll origins and it's missing that path? Is it expected that there is no APZ (a new DL every frame) once we get this popup window case? Could that be a clue to where the code paths differ (whether the eventual bug is in WR or Gecko).

Flags: needinfo?(botond)

I'll have a more detailed look at your comment tomorrow, but for now just wanted to say that not having APZ in popups is a known issue (bug 1493208). However, scrolling should still work fine in the absence of APZ. Perhaps we unintentionally started relying on APZ for scrolling to work properly with WebRender?

So yeah, I'm not seeing any code that would call set_scroll_origin in the absence of APZ, at least not via wr_transaction_scroll_layer.

While I'm not sure, I would guess the expected thing to happen in this case is that we do get a new DL every frame, and the DL contains the scroll offset.

(In reply to Glenn Watson [:gw] from comment #5)

I think what's occurring is that Gecko is creating new scroll frames in this case with an external scroll offset, but not setting the correct scroll origins, which is confusing things.

Is there a distinction between "scroll offset" and "scroll origin" here?

Looking at the logging during the incorrect behaviour:

SFI::new s=0.0×5414.0 e=(0.0,102.0)

Assuming e here is the scroll offset, that looks correct, i.e. matching the same offset that showed up in the set_scroll_origin calls.

Could you clarify which value here is unexpected / contributes to the shifted rendering?

Flags: needinfo?(botond) → needinfo?(gwatson)

A little bit of background first:

Gecko modifies the coordinates of all primitives each time a new display list is sent, such that the "true" (original) local coordinates of the primitives are offset by the current scroll amount at the time the DL was created. WR doesn't really need / want this change in the coordinates of each primitive due to scrolling, because it wants to identify if content in the picture caches is the same by comparing hashes of the primitive content (and just apply the scroll offsets once via a transform on GPU, rather than per-primitive).

Since it seems that this change of primitive coordinates is difficult to change in Gecko, we have the concept in the WR API of an "external scroll offset" and a "scroll offset / origin". The external scroll offset is the amount that primitive coordinates have been offset by in the display list, while the scroll offset / origin is the scroll amount of the current display list. I think you can roughly consider the scroll offset to be the APZ scroll amount, if I have my terminology correct.


So, with that said, the e field above is the supplied external scroll offset. There's no way during definition of a scroll frame in a display list to also set the current scroll offset. It does seem like, if we're not running with APZ in this window that there probably shouldn't be a need to call set_scroll_origin then, since I would assume it's always going to be zero from Gecko's perspective? I wonder if the issue is simply that in the non-APZ case, the external scroll offset being supplied should be negated? Or perhaps you could confirm if the scroll offset would always be zero in this case from Gecko's perspective? (if not, maybe adding a set_scroll_origin case in the non-APZ path is what's needed).

I apologize for the vagueness of the above description - I certainly don't have a clear picture in my head of where / what is going wrong. I'll take another look today and see if I can see the exact cause of the artifact

Flags: needinfo?(gwatson)

(In reply to Glenn Watson [:gw] from comment #8)

It does seem like, if we're not running with APZ in this window that there probably shouldn't be a need to call set_scroll_origin then, since I would assume it's always going to be zero from Gecko's perspective?

Yeah, based on your description of external scroll offset vs. scroll offset, I would expect the latter to always be zero if APZ is disabled. (It should be easy to test by doing a scroll in a regular browsing environment (no extensions / popups) with layers.async-pan-zoom.enabled=false.)

I wonder if the issue is simply that in the non-APZ case, the external scroll offset being supplied should be negated?

Hmm, why would it need to be negated only in the non-APZ case? My understanding is that the external offset is still used with APZ, i.e. once things stabilize and we repaint, the scroll offset becomes the external scroll offset. So, it seems to me that if it needed to be negated, the behaviour with APZ would be wrong as well. (However, it's very possible that I'm missing or misunderstanding something.)

Confirmed that regular scrolling seems to work with layers.async-pan-zoom.enabled=false. I'll dig into this a bit more today and try to get a clearer picture what's happening.

So, with layers.async-pan-zoom.enabled=false set, the bug can no longer be reproduced with WR running. I would have thought that wouldn't affect the bug since APZ is disabled for those popup windows anyway? Does that seem surprising to you, or would you expect that?

Flags: needinfo?(botond)

Interesting! That does seem surprising to me.

A couple of theories:

  • Perhaps in the popup case, there is some code that still believes / expects APZ to be enabled.
  • In the popup case, the page goes from using APZ (before the "Toggle Toolbars") to not using APZ (after the "Toggle Toolbars"). Perhaps there is some state left over somewhere from the time it was using APZ, that's messing things up.
Flags: needinfo?(botond)

So, while changing something about how early the layout tree is created (for bug 1584935), I've seen this assert firing:

Where the differing options is mUseApz. May be worth checking that's not what's going on. Extensions are doing something really weird with <browser> and <panel> and what not, see bug 1397876 comment 48, etc.

It'd be good to double-check that that assertion isn't firing.

That assertion almost certainly is firing. We should downgrade it to a warning or something, because addons that move tabs into popup windows trip it all the time. I don't think it's the cause of this bug, though, at least it's hard to see how given the regressing change.

With APZ disabled, the external scroll offset sent to WR is always zero. I think this makes sense to me, since each display list is considered an absolute positioning of the content (I think my explanation of this above was slightly wrong).

With APZ enabled, the external scroll offset is typically non-zero (it changes each time a new display list is sent where the scroll offset is not at the origin).

However, in the popup window case, if APZ is enabled, there are non-zero external scroll offsets supplied, even though APZ is not being used.

So, I think your theory above is correct - I suspect some code in Gecko is getting confused about whether APZ is on/off (or not accounting for whether APZ is enabled).

I don't know the Gecko-side code for this stuff at all, but I'll have a dig and see what I can find (or if you might know where to look, please let me know!).

Based on a brief look, the first thing I would suggest checking is whether, perhaps, the place where the scroll frame item is added shifts from build_scroll_frame() (which provides a potentially nonzero external offset) to build_iframe() or push_root() (which always provide zero as the external offset).

For build_scroll_frame(), the external offset comes from here; I would be surprised if that were zero when we're actually scrolled.

Ah, an interesting finding based on your suggestion:

For the bug repro:

  • With layers.async-pan-zoom.enabled=true - we get a scroll frame defined in the WR display list, and it has external scroll offsets.
  • With layers.async-pan-zoom.enabled=false - the WR display list in the popup window doesn't contain any ScrollFrame items.

Without knowing much about how Gecko builds WR display lists, this seems reasonable - that is, if there's no APZ, there's no reason to define scroll frames in the WR display list, since any scroll is handled by Gecko producing a new absolutely positioned display list.

So, perhaps in the normal case of APZ=false, Gecko (correctly) doesn't create ScrollFrame display items, but in the case of this bug, where APZ=true, but APZ=disabled for the popup window, it is (incorrectly) creating ScrollFrame display items, and so there are scrolling offsets being applied by both WR and Gecko.

I'm going to dig a bit further into the Gecko code to see what decides if ScrollFrame items are created, but does the above sound like a possible explanation?

(In reply to Glenn Watson [:gw] from comment #17)

I'm going to dig a bit further into the Gecko code to see what decides if ScrollFrame items are created, but does the above sound like a possible explanation?

It does, and in fact this is making me think that perhaps the issue is related to the assertion mentioned in comment 13, after all.

I don't know what exactly the creation of ScrollFrame items is conditioned on, but I note that here is one place where WebRender display list building code checks for APZ being enabled. That ultimately calls into here, which is basically checking CompositorOptions::mUseApz.

The assertion mentioned in comment 13 is checking that when a tab is moved from one window into another, the two windows' compositor options are the same.

So, a theory for what could be happening is:

  • the compositor options are stored in both the parent process and the content process
  • initially, the tab is in a window with mUseApz=true, and this is reflected in both places
  • we move the tab into a window with mUseApz=false
  • the flag changes on the parent side, but the object on the content side is reused
  • we don't bother updating the content side compositor options (since we're assuming they were the same as before)

Yep, that does seem to be the problem.

I confirmed that with APZ enabled, this bool is reporting that APZ is enabled in the popup window, which results in the creation of the scrollframe in the WR display list.

Fixing that looks to be out of my depth - is there someone else we could assign to look at this, or perhaps a bug this should be merged into?

Thanks for confirming the theory! I can take the bug from here.

Assignee: nobody → botond

(In reply to Botond Ballo [:botond] from comment #18)

So, a theory for what could be happening is:

  • the compositor options are stored in both the parent process and the content process

This is a simplification, as there are in fact three processes in play: GPU, parent, and content.

Compositor options originate in the GPU process, are communicated to the parent process via PCompositorBridge, and are further propagated to the content process via PBrowser.

The place where we currently become aware of the discrepancy between old and new compositor options is in the GPU process, when receiving the PCompositorBridge::AdoptChild message from the parent process.

So, I think we want to do something along these lines:

  • Send a "compositor options changed" message from the GPU process back to the parent process via PCompositorBridge. Note, it can't be a direct response to AdoptChild since AdoptChild is asynchronous. The message would contain the LayersId to which the change pertains (i.e. the LayersId that was adopted), and the new compositor options.
  • When handling this message in CompositorBridgeChild, use the LayersId to look up the correct BrowserParent via BrowserParent::GetBrowserParentFromLayersId().
  • Send a "compositor options changed" message to the content process via PBrowser.

Handling all compositor options changes correctly in the content process would be involved, but we should be able to handle mUseApz=true --> mUseApz=false fairly easily (basically, we just change the flag).

Handling mUseApz=false --> mUseApz=true would be more involved, because we'd have to create APZ actors like PAPZCTreeManager. However, if we're only looking to handle the case where mUseApz is changed back to true (which should be enough for Tile Tabs), we can "cheat" by leaving the APZ actors around during the true-->false transition, and just going back to using them during the false-->true transition. (The APZ actors on the content side don't change during AdoptChild, they're just re-wired on the parent process side to be hooked up to the right window, so the same actors will continue working with the new window; this is how moving tabs between two ordinary windows works today.)

Component: Graphics: WebRender → Panning and Zooming
Priority: -- → P2

I wrote up a patch that does what I described in comment 21, but I'm running into a crash here. I guess we need some more work for WebRender to handle the sudden change from APZ disabled to APZ enabled gracefully.

The crash seems to be intermittent, caused by a race or something. I haven't been able to trigger it inside rr.

UPDATE: I can trigger it pretty consistently outside of rr, though.

(In reply to Glenn Watson [:gw] from comment #19)

I confirmed that with APZ enabled, this bool is reporting that APZ is enabled in the popup window, which results in the creation of the scrollframe in the WR display list.

Glenn, do you have an understanding of the mechanism by which that bool causes scrollframe items to be defined?

What I'm seeing with my patch is that the bool in question is now false, but the fact that we get here suggests that we are still trying to create scrollframe display items.

For reference, my patch is here.

Flags: needinfo?(gwatson)

I don't know the Gecko side of that at all, unfortunately. Jeff might know who to ask about it?

Flags: needinfo?(gwatson) → needinfo?(jmuizelaar)

I think the assertion can be ignored/removed if APZ is not enabled there. There is already a graceful-handling early return on the next line that should probably be ok. Effectively we will not tell WR about the scrollframe which is fine because we will be doing main thread repaints for scrolling anyway.

Thanks for the suggestion! I can confirm that removing the assertion makes the patch work, or at least I didn't see any noticeable problems.

I do wonder though whether the fact that we get there in the first place is indicative a subtler underlying problem.

In particular, I would expect the PBrowser message that clears the "APZ enabled" flag on the content side to be sequenced wrt. paints (i.e. not be processed in the middle of a paint). So, if WebRender is wanting to create scroll frames even with the flag set to false, I figure it's doing it based on some state that persists across paints (e.g. the displayport content properties). That state may have other side effects; we may instead want to clear it when we clear the "APZ enabled" flag.

(In reply to Botond Ballo [:botond] from comment #28)

So, if WebRender is wanting to create scroll frames even with the flag set to false, I figure it's doing it based on some state that persists across paints (e.g. the displayport content properties).

I've confirmed that this is what seems to be happening: WebRender wants to create the scroll frame based on ASR information, and the ASR information is in turn based on a displayport being set.

So I think ideally, we would clear all displayports when APZ goes from enabled to disabled. Alternatively, to avoid having to walk the DOM or the frame tree to find and clear them all, we could have GetDisplayPortImpl() ignore the properties if APZ is disabled; they should expire and be removed in due course.

Flags: needinfo?(jmuizelaar)

(I think that also explains why I couldn't repro the crash with rr: the window shenanigans that Tile Tabs does are so slow while recording, that the displayports expire before they get a chance to cause problems.)

If only the APZ enablement changed, produce a warning rather than an assertion.

This involves two new IPC messages (both async) to propagate the change in
compositor options (of which APZ enablement is one) from the GPU process to
the parent process (via PCompositorBridge) and on to the content process
(via PBrowser).

The support is only partial, in that going from non-APZ to APZ is only
supported if APZ was enabled at the time the window was created.

Depends on D51467

This facilitates disabling APZ "live", such as when moving a tab from an APZ
window into a non-APZ window.

Depends on D51468

Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e6305a1457f0 Make the assertion about a compositor options mismatch in RecvAdoptChild more nuanced. r=tnikkel https://hg.mozilla.org/integration/autoland/rev/408a199d1d47 Partial support for moving a tab between windows with different APZ enablement. r=nika,tnikkel https://hg.mozilla.org/integration/autoland/rev/081aa9877fc1 Ignore displayport properties in GetDisplayPortImpl() if APZ is disabled. r=tnikkel
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Flags: qe-verify+

Reproduced the issue using 71.0a1 (20191016212918) on Windows 10x64 and STR from comment 0.
The issue is verified fixed using Firefox 72.0.2 (20200117190643) and 75.0b11 (20200328010112) on Windows 10x64.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: