Closed Bug 1452632 Opened 6 years ago Closed 6 years ago

Checkerboarding regression on Apr 5

Categories

(Core :: Panning and Zooming, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 + wontfix

People

(Reporter: kats, Unassigned)

References

Details

(Keywords: regression)

I got telemetry alerts for changes in distribution for all 4 of the checkerboarding probes:

http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1722/alerts/?from=2018-04-05&to=2018-04-05
http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1717/alerts/?from=2018-04-05&to=2018-04-05
http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1720/alerts/?from=2018-04-05&to=2018-04-05
http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1721/alerts/?from=2018-04-05&to=2018-04-05

With this regression range:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=071ee904485e21e19ca08456d32bce6825b77a26&tochange=2f5ffe4fa2153a798ed8b310a597ea92abd1b868

The checkerboarding regression, if real, seems significant. The severity distribution has shifted to the right, i.e. more checkerboarding instances are more severe.

From the regression range, these bugs jump out at me (in order of likelihood):
bug 1449268 - Treat document-level touch event listeners as passive
bug 1450099 - Mac scroll bar scrolls on click even when not shown
bug 1447193 - Rendering error after scroll

The last one should have *reduced* the amount of checkerboarding, but it's possible that there was a bug. However I don't see how that last bug could have affected the POTENTIAL_DURATION probe; that seems more likely to have been affected by one of the first two bugs on the list.
Ooof. :(

Regressions that only show up in Telemetry like this are, from my experience, a pain to deal with. Unless we can figure this out by inspection, we might have to do selective backouts and wait a few days to narrow down to the culprit. :(
Could bug 1449268 have caused this, since we try to pan faster and not wait for event handling in the web page? If so, that is in a way a positive change, I'd say.
I'm pretty sure bug 1450099 didn't cause this. Functionally, it's a partial revert of bug 1422070, which didn't have any effect on checkerboarding.

(In reply to Olli Pettay [:smaug] (only webcomponents and event handling reviews, please) from comment #2)
> Could bug 1449268 have caused this, since we try to pan faster and not wait
> for event handling in the web page? If so, that is in a way a positive change,
> I'd say.

I think that's the most likely explanation.
The number of pings and samples for all four of the probes is in free-fall. See the second and third plots on https://mzl.la/2JsBTGh for the CHECKERBOARD_SEVERITY version.

This means that the distribution changes aren't shifts: populations of measurements are no longer being sent.

Sometimes this is a good thing (when unpleasant things happen less often).
Sometimes it means a code change is now bypassing your measurement some of the time (in which case it's a 'measurement's broken, can't tell' thing).

Not necessarily a bad thing.
In my experience every probe shows that "free fall" pattern over the last few days because people are still using builds from a few days ago so all the data for those buildids isn't in yet. I'd expect that "free fall" to disappear in a few more days.
The sample counts for the checkerboard probes drop by more than half from April 4 to 5. In contrast, GC_MS shows a drop in sample counts of about 15% over that same period. The _submission_ (ping) counts, on the other hand, follow the ~15% drop we'd expect.

So you're right, the submission count does follow the usual "we're waiting for the data to come in" curve.

But the drop in sample count is rather more drastic than that (unless the last 15% of pings (submissions) contain 50% of the checkerboard samples).
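
To make the arithmetic above concrete, here is a minimal sketch. The counts are made up for illustration (they are not taken from the telemetry dashboards); only the rough proportions, a ~15% drop in pings against a >50% drop in samples, come from the comments above.

```python
# Hypothetical counts, chosen only to match the proportions discussed above:
# pings drop by ~15%, checkerboard samples drop by more than half.
apr4 = {"pings": 1000, "samples": 10000}
apr5 = {"pings": 850, "samples": 4500}


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


ping_change = relative_change(apr4["pings"], apr5["pings"])
sample_change = relative_change(apr4["samples"], apr5["samples"])

# Samples per ping on each day. If the drop were only late-arriving data,
# this rate should stay roughly constant.
rate_apr4 = apr4["samples"] / apr4["pings"]
rate_apr5 = apr5["samples"] / apr5["pings"]

# For "waiting for data" to explain the drop, the missing pings would have
# to carry all of the missing samples. Compute the rate that would imply.
missing_pings = apr4["pings"] - apr5["pings"]
missing_samples = apr4["samples"] - apr5["samples"]
implied_late_rate = missing_samples / missing_pings

print(f"ping change:   {ping_change:+.0%}")
print(f"sample change: {sample_change:+.0%}")
print(f"samples/ping:  {rate_apr4:.1f} -> {rate_apr5:.1f}")
print(f"late pings would need {implied_late_rate:.1f} samples/ping")
```

With these illustrative numbers the late pings would need several times the per-ping sample rate of the pings already received, which is why the drop looks like more than the usual lag in data arriving.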
So while I would expect the sample count for CHECKERBOARD_DURATION, CHECKERBOARD_PEAK, and CHECKERBOARD_SEVERITY to change in response to code changes, I wouldn't really expect CHECKERBOARD_POTENTIAL_DURATION to change. The sample count for that probe should be correlated with user input rather than code changes - that is, the number of POTENTIAL_DURATION samples submitted should be roughly based on the number of fling/scroll actions the user did rather than how much checkerboarding they actually saw. So if that sample count drops as well I'd be quite surprised.

Regardless, I agree with :cpeterson that it's probably worth adding a pref to control the passive listener behaviour so that we can disable it for a couple of days and confirm that's the root cause.
Flags: needinfo?(bugs)
So putting Olli's changes behind a disabled pref didn't affect the numbers. That means it wasn't the passive listeners change that caused this regression; it must be something else in the regression range.
mconley, neither Mac scroll bar bug 1450099 (as per comment 3) nor passive event listener bug 1449268 (as per comment 9) caused this checkerboarding telemetry regression. The other suspicious changeset in comment 0's regression range was your fix for "Rendering error after scroll" bug 1447193. Do you want to back out bug 1447193 or try something else first?
Flags: needinfo?(mconley)
See Also: → 1454668
:kats and I had a chat on IRC just now and we think we understand what happened.

Nightly build 20180313100127 has https://hg.mozilla.org/mozilla-central/rev/664c633802d4 which turns on Tab Warming via bug 1423220. The sample count of CHECKERBOARD_SEVERITY (et al) doubles: https://mzl.la/2JsBTGh (see the third plot)

Nightly build 20180405104009 has https://hg.mozilla.org/mozilla-central/rev/1b258f938525fda65ef80ffa0408bc665d5d8948 which fixes bug 1447193. The sample count of CHECKERBOARD_SEVERITY (et al) halves.

In between these two events we have an unexplained event around 2018-03-22 where CHECKERBOARD_SEVERITY goes from looking like this: https://mzl.la/2qG50x1

To looking like this: https://mzl.la/2qDSsGB

Note that it has developed a second mode[1] around the 60k-90k bucket. At the same time, the number of pings with CHECKERBOARD_SEVERITY in them (submissions) rises by about 50%: https://mzl.la/2JsBTGh (see the second plot)

This unexplained event was hidden by the increased submissions caused by bug 1423220 on the first mode, so when the regression was fixed, it looked all of a sudden as though something horrible had happened.

This is not actually the case. Bug 1447193 (if responsible for this change) resulted in a halving of the total number of checkerboarding events, which is a good thing. It just looks like a bad thing because the removal of the low-severity events uncovered the high-severity events from around 2018-03-22.
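
As a rough illustration of how that can happen, here is a small sketch with made-up histogram counts. The bucket values and counts are hypothetical; only the 60k-90k second mode and the halving of the total come from the discussion above. Removing low-severity samples leaves the number of high-severity samples untouched, yet their share of the distribution doubles, which is what a "shift to the right" alert would pick up.

```python
from collections import Counter

# Hypothetical severity histograms: bucket lower bound -> sample count.
# The second mode around the 60k-90k buckets is the same in both periods.
high_mode = Counter({60_000: 120, 90_000: 80})

# Low-severity samples before and after the fix (illustrative numbers only).
low_before = Counter({100: 1000, 1_000: 600, 10_000: 200})
low_after = Counter({100: 450, 1_000: 250, 10_000: 100})

before = low_before + high_mode
after = low_after + high_mode


def high_count(hist: Counter, threshold: int = 60_000) -> int:
    """Number of samples in buckets at or above `threshold`."""
    return sum(n for bucket, n in hist.items() if bucket >= threshold)


def high_share(hist: Counter, threshold: int = 60_000) -> float:
    """Fraction of all samples in buckets at or above `threshold`."""
    return high_count(hist, threshold) / sum(hist.values())


print("total samples:", sum(before.values()), "->", sum(after.values()))
print("high-severity samples:", high_count(before), "->", high_count(after))
print(f"high-severity share: {high_share(before):.0%} -> {high_share(after):.0%}")
```

In this sketch the total sample count halves and the absolute number of high-severity samples does not change, but the normalized distribution looks markedly worse afterwards.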

So you can consider this bug as not really a bug but an artefact of multiple overlapping signals. Sorry for the noise.

I have filed bug 1454668 to look into the true regression. Feel free to +Cc if you'd like to follow along.

[1]: https://en.wikipedia.org/wiki/Multimodal_distribution
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mconley)
Resolution: --- → INVALID