Closed Bug 1683188 Opened 3 years ago Closed 3 years ago

Perma-checkerboarding and broken paints after tab switching, sometimes (with Ctrl+Tab); fixed by another tab switch

Categories

(Firefox :: Tabbed Browser, defect)

defect

Tracking

()

VERIFIED FIXED
86 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox84 --- unaffected
firefox85 + verified
firefox86 --- verified

People

(Reporter: mstange, Assigned: emilio)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: correctness, regression)

Attachments

(6 files)

This happened to me about 5 times this week. It never happened to me before.

Basically, what happens is that a tab simply stops painting. The process is still responsive, APZ still works (albeit with perma-checkerboarding), the mouse cursor updates as I move the mouse, but the screen just never updates with new content. Until I switch tabs, that is!

One time it started happening after I opened a link in a background tab. The other times it happened after a tab switch.

Severity: -- → S2

Markus, is this with webrender enabled or disabled?

Flags: needinfo?(mstange.moz)

WebRender enabled.

I just hit this again, with build ID 20201218095607. Profile: https://share.firefox.dev/3hcpY0F

It looks like the tab thinks it's in the background. I found one refresh tick call, here: https://share.firefox.dev/37tI7Ua
It was triggered by InactiveRefreshDriverTimer::TimerTickOne. The inactive refresh driver timer should only be used for background tabs.

Next time this happens, I will check what gBrowser.selectedBrowser.docShellIsActive on the browser console says.

Flags: needinfo?(mstange.moz)
Blocks: wr-mac
OS: Unspecified → macOS
Hardware: Unspecified → x86_64

Just happened again. gBrowser.selectedBrowser.docShellIsActive is false. Is the tab switcher maybe not activating the tab's docshell? Or is it trying to activate it but some other bug causes it to not stick?

Component: Graphics: WebRender → Tabbed Browser
Product: Core → Firefox
See Also: → 1683782
See Also: 1683782
See Also: → 1682615

I think this mostly happens if I switch to the tab using Ctrl+Tab. I have browser.ctrlTab.recentlyUsedOrder enabled. The duplicate bug 1683782 also mentions Ctrl+Tab.

QA Whiteboard: [qa-regression-triage]

This is a long shot, but could it be related to bug 1625590 (ctrl+tab + macos native fonts => high memory) which just landed for 85?

OS: macOS → All
Hardware: x86_64 → All

I did my best to find a regression window for what I observe on Linux (see bug 1684643) by opening several tabs with different websites by repeatedly using Ctrl+T and entering a random URL so that the pages were loaded in the background, and then switching between tabs with Ctrl+Tab (thanks for this hint!). I could reproduce the bug most easily with Twitter pages, but that might be a coincidence.

Result:
Last good revision: 8ade6a20744e637d81f8e3bfecd05526d144d8a3
First bad revision: a3add3f43cbcbfdf053d901689882975479842eb
Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=8ade6a20744e637d81f8e3bfecd05526d144d8a3&tochange=a3add3f43cbcbfdf053d901689882975479842eb

This is bug 1678786, and there already is bug 1682158 as a regression. I have no idea what this is about. Does this explain your observations?

I will clear out the flag for regressionwindow-wanted condiering the fact that Viktor provided one.

Has Regression Range: --- → yes

Bug 1678786 only touched a test file so it can't have caused this.

Has Regression Range: yes → ---

Sorry, I should have seen that.

When I used mozregression, I found that the Nightly build 2020-12-11 (changeset 1130661c79c222fb1acd29b7ec5dc5202cdd0d2d) is not affected by this bug. Although I had tested this one for quite long, I tested it again and still couldn't reproduce this bug, so I am quite sure that the changeset we are looking for is listed in the following pushlog containing 11 changesets:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=1130661c79c222fb1acd29b7ec5dc5202cdd0d2d&tochange=a3add3f43cbcbfdf053d901689882975479842eb

Looking at the bugs I think bug 1635914 looks related, so I tested the corresponding revision and could reproduce this bug after about 10 minutes. The pushlog from above can reduced to:

Has Regression Range: --- → yes
Regressed by: 1635914
Flags: needinfo?(emilio)

I'll poke, thanks!

Assignee: nobody → emilio

To be clear, str is enabling the pref so that control+tab switches tabs, and just tabswitching for a while right? Curious, do you have fission enabled?

I'm not at a computer right now but the other thing I'd check is whether there's any tab switcher assertion when this happens on a debug build.

Viktor, do you know how could I more reliably reproduce this? Could you post the steps you found to do the bisection? I only could reproduce this once, and using quite a weird combination of actions involving restoring tabs and so on (and I haven't been able to repro it ever since). FTR:

so, open tab in the background, wait for it to load, close (with the context menu), Ctrl+shift+T to restore it, then Ctrl+Tab to switch to the original tab, then Ctrl+Tab again

Thanks in advance.

Flags: needinfo?(viktor_jaegerskuepper)
Summary: Perma-checkerboarding and broken paints after tab switching, sometimes; fixed by another tab switch → Perma-checkerboarding and broken paints after tab switching, sometimes (with Ctrl+Tab); fixed by another tab switch

(In reply to Viktor Jägersküpper from comment #8)

I did my best to find a regression window for what I observe on Linux (see bug 1684643) by opening several tabs with different websites by repeatedly using Ctrl+T and entering a random URL so that the pages were loaded in the background, and then switching between tabs with Ctrl+Tab (thanks for this hint!). I could reproduce the bug most easily with Twitter pages, but that might be a coincidence.

I also tried going to twitter, literally ctrl+clicking everything I found (so that it gets loaded in the background), then switching tabs with Ctrl+Tab, but also no dice.

Do you need to tab-switch while the page is loading? Or after it's loaded?

Also, curious if this reproduces with browser.tabs.remote.warmup.enabled=false.

Flags: needinfo?(emilio)
Flags: needinfo?(emilio)

I am just trying to find the shortest STR, give me a few minutes. The problem is that I can't easily reproduce the bug, I have to try many many times.

These are my STR:

  1. Go to a random website in the first tab and wait until it's loaded (I used bugzilla.mozilla.org)
  2. Open a second tab with Ctrl+T, type a random Twitter URL (e.g. twitter.com/firefoxnightly) and hit Enter
  3. Immediately open another tab with Ctrl+T and wait about 15 seconds so that the page in the second tab should have been loaded
  4. Enter another Twitter URL and immediately switch to second tab with Ctrl+Tab
  5. Scroll down and hope that you can reproduce this bug
  6. If you weren't successful, wait about 15 seconds so that the page in the third tab should have been loaded
  7. Switch to the third tab with Ctrl+Tab, scroll down and hope that you can reproduce this bug
  8. If you still can't reproduce the bug, close the second and third tab and repeat steps 2-5/7

Unfortunately it can take some time to reproduce the bug this way, so this is really ugly. I am not even sure if Ctrl+Tab is relevant, I used it because Markus and Albert mentioned it. I don't know why the first tab is necessary, but without it I couldn't reproduce the bug.

For me (on Linux) browser.ctrlTab.recentlyUsedOrder is enabled by default. I don't have fission enabled.

browser.tabs.remote.warmup.enabled is set to true by default for me. I will test with setting it to false, although I don't know what this is about. I will also test with a debug build. (Not clearing the needinfo at the moment, I will test tomorrow.)

Since I just observed the bug on a live blog news site, you should know that I have a very slow CPU (Intel Core 2 Duo) and I think any site that is "bloated" like Twitter (Javascript plus images and whatever) might do the trick so that the page isn't loaded yet when you switch tabs with Ctrl+Tab.

Awesome, thanks! I tweaked a bit that method, using session-restore, and I seem to be able to reproduce it somewhat consistently like this. Basically, having two twitter tabs open, then Ctrl+Tab to the second so it starts loading (it's nice that we remember recently used tabs across a restart, was not expecting that!), wait for a paint, then quickly Ctrl+Tab back, then Ctrl+Tab once everything seems idle on the last tab.

I have a bunch of other twitter tabs in the background, but those are leftovers from a test.

I'll try to play myself with those prefs, but this gives me a way to investigate this. Thank you!

Attached file Logging patch

Catching this on rr has been tricky so far (since it's not so consistent as I thought).

Here's a logging patch I'm using. Will attach the log asap.

Interesting context is 44. I think this is the relevant bit, and there seems to be a race between the parent and the child process:

console.log: "set docShellIsActive(true): 44, false, https://twitter.com/firefoxnightly, set docShellIsActive@chrome://global/content/elements/browser-custom-element.js:398:142\nmaybeActivateDocShell@resource:///modules/AsyncTabSwitcher.jsm:552:7\nupdateDisplay@resource:///modules/AsyncTabSwitcher.jsm:481:16\npostActions@resource:///modules/AsyncTabSwitcher.jsm:673:10\nhandleEvent@resource:///modules/AsyncTabSwitcher.jsm:1116:12\nqueueUnload@resource:///modules/AsyncTabSwitcher.jsm:1052:10\nrequestTab@resource:///modules/AsyncTabSwitcher.jsm:1048:10\nupdateCurrentBrowser@chrome://browser/content/tabbrowser.js:996:31\n_setupEventListeners/<@chrome://browser/content/tabbrowser.js:5287:16\nset selectedIndex@chrome://global/content/elements/tabbox.js:197:14\nset selectedPanel@chrome://global/content/elements/tabbox.js:216:7\nset selectedIndex@chrome://global/content/elements/tabbox.js:545:11\nset selectedItem@chrome://global/content/elements/tabbox.js:565:35\nset selectedTab@chrome://global/content/elements/tabbox.js:87:11\nset selectedTab@chrome://browser/content/tabbrowser.js:290:7\nctrlTab_close@chrome://browser/content/browser-ctrlTab.js:427:9\nctrlTab_pick@chrome://browser/content/browser-ctrlTab.js:345:12\nctrlTab_handleEvent@chrome://browser/content/browser-ctrlTab.js:598:18\n"
[parent 267120] DidSetIsActive(44, 0 -> 1)
[child 267622] DidSetIsActive(44, 0 -> 1)
[child 267622] DidSetIsActive(44, 1 -> 0)
[parent 267120] DidSetIsActive(44, 1 -> 1)
[parent 267120] DidSetIsActive(44, 1 -> 0)

It seems like The parent activates, then the child, then the child overrides it at some point... I believe unless Chrome JS code is doing funky stuff (which I don't think is doing) the only possible callsite is this one. Which I already flagged as suspect, but was replacing this.

So I that call was wrong before, but now that's propagating to the parent process and we don't handle no-op activeness changes it is causing even more badness...

I flagged this as sketchy before (though it was trying to preserve
existing behavior).

However now that that state propagates to the parent process and races
with the state that the parent process reads, it started causing
correctness issues.

Just remove this line, it shouldn't be needed. I'm not sure how to write
a test for this, unfortunately :(

Viktor, whenever the build with the patch is ready here, can you confirm the patch fixes the issue?

I haven't been able to repro the issue with the patch, but I'm not reproducing so consistently so sanity-checking would be awesome.

Thanks a lot for helping us track this down.

Flags: needinfo?(emilio)

Unfortunately the issue is still there. I used my STR from comment 19 and reproduced the bug at the first attempt. To be sure, I repeated the steps several times until I reproduced it again.

Just to be sure that I didn't mess this up, this is how I downloaded the try build: On the treeherder page I clicked on "+3" behind "Bpgo" in the "Linux x64 shippable opt" line, then I clicked on "B", then I clicked on "Artifacts" (appearing below) and then I downloaded the "target.tar.bz2" file from the list. I used a new profile for testing.

Flags: needinfo?(emilio)

hmmm, oh well, thanks for testing, will dig a bit more then / keep trying to repro with that patch too.

Ugh, I wish this was not so timing dependent so I could capture this with rr...

To be clear, when you reproduce the issue with the patch applied, if you press Ctrl+Shift+Alt+I and go to the console tab, and type gBrowser.tabs[gBrowser.tabs.length - 1].linkedBrowser.docShellIsActive, the result is still false?

Will check that.

I couldn't reproduce the bug after setting browser.tabs.remote.warmup.enabled=false during about 15 minutes of repeated testing.

I couldn't reproduce the bug after setting browser.tabs.remote.warmup.enabled=false during about 15 minutes of repeated testing.

With the regular Nightly build.

I get gBrowser is not defined (see attached screenshot), but I also get this with my regular Nightly build. Am I doing something wrong?

Yes, you don't want the devtools console, you want the browser console, you can enable from DevTools -> Settings -> Enable broser chrome and add-on debugging toolboxes + Enable remote debugging.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #28)

To be clear, when you reproduce the issue with the patch applied, if you press Ctrl+Shift+Alt+I and go to the console tab, and type gBrowser.tabs[gBrowser.tabs.length - 1].linkedBrowser.docShellIsActive, the result is still false?

The result is still false.

In the meantime I reproduced the bug with fission enabled for a regular Nightly build.

Do you still want me to try a (recent) debug build and attach the terminal output here? (I know how I can get a debug build with mozregression: mozregression --repo autoland --build-type debug --launch 2021-01-06)

There's JS running since we save the active status till we restore it,
so arbitrary things can happen, including receiving an IPC message from
the child saying that we're now really active.

If then we restore the wrong (old) status, stuff gets confused and
sadness ensues.

Screenshotting background tabs seems to work without this so it's not
clear to me why messing with the activeness state was necessary to begin
with.

(In reply to Viktor Jägersküpper from comment #33)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #28)

To be clear, when you reproduce the issue with the patch applied, if you press Ctrl+Shift+Alt+I and go to the console tab, and type gBrowser.tabs[gBrowser.tabs.length - 1].linkedBrowser.docShellIsActive, the result is still false?

The result is still false.

Bummer, then I think there may be yet another bug lurking around, but I think the other patch I posted is correct.

Do you still want me to try a (recent) debug build and attach the terminal output here? (I know how I can get a debug build with mozregression: mozregression --repo autoland --build-type debug --launch 2021-01-06)

No, no worries, that's super-helpful already. I discarded the CrossProcessPaint code because it looked synchronous to me, and because I didn't think it'd be used in this case, but after dumping a couple stacks I managed to reproduce it and see where it goes wrong / what overrides the activeness again from the child (and then propagates to the parent).

In hindsight, it seems obvious how that code is involved... Ctrl+T shows a switcher with tab captures, that are created on the fly using this API... And that code is not really necessarily synchronous because flushing notifications and painting can do IPC...

If you have the time, can you confirm that with this build you can no longer see the issue?

If you can, I'd be a bit baffled, but hey, I've been wrong before so... :-)

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #35)

If you have the time, can you confirm that with this build you can no longer see the issue?

Good news! I couldn't reproduce the bug with this build (using a fresh profile with fission disabled by default).

If you have to change the patches, feel free to ask me again for testing.

Flags: needinfo?(viktor_jaegerskuepper)
Keywords: leave-open
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/badff6002a34
BrowserChild::MakeHidden() shouldn't mess with tab state. r=nika
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f4af0087a1b4
CrossProcessPaint code shouldn't mess with tab state from the content process. r=mattwoodrow,NeilDeakin
Keywords: leave-open
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

Please request beta uplift when you get a chance.

Flags: needinfo?(emilio)

Comment on attachment 9195771 [details]
Bug 1683188 - BrowserChild::MakeHidden() shouldn't mess with tab state. r=nika

Beta/Release Uplift Approval Request

  • User impact if declined: Comment 0
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: See various comments for different str which reproduce slightly more frequently.
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Somewhat trivial removals, but in a tricky area. However I think these should be safe enough for uplift.
  • String changes made/needed: none
Flags: needinfo?(emilio)
Attachment #9195771 - Flags: approval-mozilla-beta?
Attachment #9195931 - Flags: approval-mozilla-beta?
Flags: qe-verify+
QA Whiteboard: [qa-regression-triage] → [qa-regression-triage] [qa-triaged]

Unfortunately, I cannot confirm this fix, since I was not able to reproduce the initial issue. I've tried on two different mac machines (OS version 10.15 and 11.1), by following the steps from comment 19 and comment 21, on Nightly 86.0a1 (2020-12-17).

Markus, can you please help us checking if this bug is fixed on latest Nightly 86.0a1?

Flags: needinfo?(mstange.moz)

Would it be safer to back out the regressing change from 85? We're about to build the last beta today.

Flags: needinfo?(emilio)

The regressing change blocked some other fixes so it's not clear to me everything that should be backed out. Uplifting these seems less risky to me.

Flags: needinfo?(emilio)

(In reply to Ciprian Georgiu [:ciprian_georgiu], Release Desktop QA from comment #43)

Markus, can you please help us checking if this bug is fixed on latest Nightly 86.0a1?

I'll report if I run into it again. If I don't hit it during the next 8 hours, it's probably fixed.

Flags: needinfo?(mstange.moz)

Comment on attachment 9195771 [details]
Bug 1683188 - BrowserChild::MakeHidden() shouldn't mess with tab state. r=nika

approved for 85.0b9

Attachment #9195771 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9195931 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

I haven't seen the issue the whole day since updating my Dev Edition today to 85.0b9

I haven't seen this again on Nightly since the patch landed.

Status: RESOLVED → VERIFIED

Comment on attachment 9195931 [details]
Bug 1683188 - CrossProcessPaint code shouldn't mess with tab state from the content process. r=mattwoodrow

(In reply to Emilio Cobos Álvarez (:emilio) from comment #34)

Screenshotting background tabs seems to work without this so it's not
clear to me why messing with the activeness state was necessary to begin
with.

The removed code was added to fix bug 1657036, but presumably bug 1635914 fixed that in a better way.

Yeah, that came up during review of that patch :)

Marking this verified as fixed on latest Nightly and Beta, based on comments 49 and 50. Thank you!

No longer blocks: gfx-triage
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: