Closed Bug 1362987 Opened 7 years ago Closed 7 years ago

CHECKERBOARD_DURATION regression on 2017-05-04

Categories

(Core :: Panning and Zooming, defect, P1)

55 Branch
defect

RESOLVED FIXED
Performance Impact low
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: kats, Assigned: kmag)

References

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Poking around in the evolution dashboard, it looks like the regression is limited to Windows and Linux, but affects both 32-bit and 64-bit builds. Also, it affects desktop but not mobile. It might just be that the volumes are too low on OS X / Android to make out the regression, but the spike is very clearly visible on Windows/Linux so I suspect it's actually platform-specific.

From the regression range, these are the bugs I consider most likely to have caused this, in order of likelihood:

bug 1359868 - Scrollbar is not painted fully after mouse leaves and reenters Firefox window
bug 1353060 - Remote <browser>s are not visible as children of XUL <popup>s
bug 1358185 - Content incorrectly clipped on fennec on http://www.zpravy.cz/

The scrollbar one seems most likely to me because scrollbars are different on Linux/Windows vs OS X / Android, unless you fiddle with OS defaults. Also the nature of the regression (a spike in the biggest bucket of checkerboard duration values, with a change in the checkerboard peak values) seems to indicate we introduced some sort of perma-checkerboarding state. This seems most plausible with either scrollbars (bug 1359868) or a new type of window (bug 1353060) that we're checkerboarding.
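For reference, here is a rough sketch of the kind of check that "a spike in the biggest bucket" describes. It is not how the evolution dashboard computes anything; the histogram shape (bucket lower bound mapped to sample count), the function names, and the 2x threshold are all assumptions made up for illustration.

```python
from typing import Dict, List


def top_bucket_share(histogram: Dict[int, int]) -> float:
    """Return the fraction of samples that fall in the largest bucket.

    `histogram` maps a bucket's lower bound to its sample count
    (a hypothetical shape, not the real telemetry format).
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return histogram[max(histogram)] / total


def builds_with_spike(
    per_build: Dict[str, Dict[int, int]],
    baseline_share: float,
    factor: float = 2.0,  # assumed threshold, chosen arbitrarily
) -> List[str]:
    """List build dates whose top-bucket share exceeds factor * baseline."""
    return sorted(
        build
        for build, histogram in per_build.items()
        if top_bucket_share(histogram) > factor * baseline_share
    )
```

A build date gets flagged when an unusually large share of its samples lands in the top bucket, which is what a perma-checkerboarding state would be expected to produce.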
One path forward here is to back out both bug 1359868 and bug 1353060 to see if that fixes the regression. If so, we can reland them one at a time to determine which one reintroduces it. We don't necessarily need to wait days between each action: as long as at least one nightly is generated with each combination, we should be able to look at the telemetry data after the fact and figure out which bug caused it. That is:
  back out both bugs for May 9 nightly
  reland bug 1353060 for May 10 nightly
  reland bug 1359868 for May 11 nightly
and then by early next week we should have enough data to tell which bug (if either of those two) caused the problem.
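The inference behind that schedule can be sketched as follows. This assumes a single regressor, so that the spike shows up exactly on the builds that contain it. The function name and data layout are hypothetical, and the spike outcomes in the usage note are made up to show the logic, not real telemetry.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Build date -> (candidate patches present in that nightly, spike observed?)
Observations = Dict[str, Tuple[FrozenSet[str], bool]]


def candidate_regressors(
    observations: Observations, candidates: Iterable[str]
) -> List[str]:
    """Return the candidates whose presence matches the spike on every build.

    Assumes exactly one patch is responsible. An empty result means
    neither candidate explains the data.
    """
    consistent = []
    for candidate in candidates:
        if all(
            (candidate in present) == spike
            for present, spike in observations.values()
        ):
            consistent.append(candidate)
    return consistent
```

With the schedule above and hypothetical outcomes of no spike on May 9 but a spike on May 10 and May 11, only bug 1353060 is consistent. If the spike were still present on May 9 with both patches backed out, the result would be empty and the regressor would have to be something else in the range.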

Botond/Kris, any thoughts?
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(botond)
Bug 1358185 should only affect android, so I would discount that.
I'd be quite surprised if bug 1359868 were the culprit. The only "checkerboarding" that bug should affect is the checkerboarding of scrollbar thumbs, which is not measured by our checkerboarding telemetry (and even that checkerboarding the patch should strictly reduce). The patch does not affect our displayport heuristics, or anything on the pipeline from input events to scrolling.

That said, I've had patches cause unexpected consequences in the past, so I'm happy to submit it to the backout experiment proposed in comment 2.
Flags: needinfo?(botond)
I have a try push with the patches backed out [1], it's looking fairly green. I'll push those later today. So the schedule will be:
  back out both bugs for May 10 nightly
  reland bug 1353060 for May 11 nightly
  reland bug 1359868 for May 12 nightly

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c41899fe07592ecaa8710a5dc0283f936949540a
Also, tagging as [qf] since I believe "no checkerboarding regressions" was part of the QF mandate and this probably needs to be tracked somewhere by somebody.
Whiteboard: [qf]
(In reply to Botond Ballo [:botond] from comment #4)
> I'd be quite surprised if bug 1359868 were the culprit. The only
> "checkerboarding" that bug should affect is the checkerboarding of scrollbar
> thumbs, which is not measured by our checkerboarding telemetry (and even
> that checkerboarding the patch should strictly reduce). The patch does not
> affect our displayport heuristics, or anything on the pipeline from input
> events to scrolling.

And yeah, you have a valid point - it's unlikely that bug 1359868 could have caused this. Windows and Linux are probably similar with respect to popup windows as well as scrollbars, so bug 1353060 is the more likely candidate.
I backed out the two bugs:
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/3e9a56b96d0fc439f672254ef77094e105700ba0
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/310707ea9db8332c499ed0b3716125094b026c2e

Will reland tomorrow and the day after, assuming the backouts make it into the May 10 nightly.
Flags: needinfo?(kmaglione+bmo)
Whiteboard: [qf] → [gfx-noted][qf]
FWIW, both May 10 and May 11 had issues on nightly and there were backouts/respins, so I decided to give this a little extra time just to make sure it went into a nightly that people will actually use.
I relanded bug 1353060. It might end up in the May 12 nightly, or May 13.
Bug 1353060 made it to the May 12 nightly. Will reland bug 1359868.
Telemetry data is finally back in. The spike shows up on all build dates except May 11. So that pretty clearly points to bug 1353060 as the regressor. Kris, are you looking into the fallout from your bug? It seems to have caused a number of issues, and if you don't have any plans to address them in the immediate future we should back it out until you have time to do so.
Blocks: 1353060
Flags: needinfo?(kmaglione+bmo)
Whiteboard: [gfx-noted][qf] → [gfx-noted][qf:p3]
Given the discussion in bug 1362621, this should also have been fixed by the changes I landed for bug 1365660.
Depends on: 1365660
Flags: needinfo?(kmaglione+bmo)
Indeed. The telemetry data from May 20 onwards shows the spike gone.
Assignee: nobody → kmaglione+bmo
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Performance Impact: --- → P3
Whiteboard: [gfx-noted][qf:p3] → [gfx-noted]