Closed Bug 1292145 Opened 3 years ago Closed 3 years ago
July 28 2016 Regression in Performance Measure FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS
From July 28 to July 29 there was a sudden [1] increase in the aggregate measures of FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS. The mean jumped from 0 to 1, the 75th percentile from 1 to 11.

Submission volumes in terms of clients and pings are roughly the same (compare metric/sample count in [2] vs [3]), which is consistent with a change in the measured code, not a change in what is being measured. (IOW: we're not suddenly measuring more/fewer things, which would skew the distribution. The actual things we're measuring have become larger.)

This was detected by medusa [4], which came up with a pushlog for the regression range [5]. Unfortunately, I can't tell what change in that list may have caused this, and whether the change is intentional.

The relevant files according to dxr [6] are layout/base/nsRefreshDriver.cpp and layout/generic/nsGfxScrollFrame.cpp. The only change in the pushlog for either of those files [7] was for bug 1282408 and appears to just add some annotations for static analysis. Just in case, I've +Cc'd the two people visible from that commit.

Questions:
Is this change in the measure expected/intentional?
Is this change in the measure acceptable?
Is this evidence that the measure is incorrect/not useful?

[1]: https://mzl.la/2anRdkY
[2]: https://mzl.la/2anRAfl
[3]: https://mzl.la/2anRXGJ
[4]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1683/alerts/?from=2016-07-29&to=2016-07-29
[5]: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=db3ed1fdbbeaf5ab1e8fe454780146e7499be3db&tochange=2ea3d51ba1bb9f5c3b6921c43ea63f70b4fdf5d2
[6]: https://dxr.mozilla.org/mozilla-central/search?q=FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS&case=true&=mozilla-central
[7]: https://hg.mozilla.org/mozilla-central/rev/d8b267ae7f69
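For readers who want to see the argument concretely, here is a minimal sketch of how a mean and a 75th percentile can be read off bucketed histogram counts, and how both can move while the sample count stays the same. The function name `histogram_summary` and all bucket counts are hypothetical, invented for illustration; they are not the real telemetry data and this is not how the telemetry dashboards necessarily compute their aggregates.

```python
def histogram_summary(buckets):
    """Summarize a bucketed histogram.

    buckets maps a bucket's lower bound (in ms) to its sample count.
    Returns the total sample count, the mean (using each bucket's lower
    bound as its value), and the lower bound of the bucket that contains
    the 75th percentile.
    """
    total = sum(buckets.values())
    mean = sum(value * count for value, count in buckets.items()) / total

    # Walk the buckets in ascending order until 75% of samples are covered.
    threshold = 0.75 * total
    running = 0
    p75 = None
    for value in sorted(buckets):
        running += buckets[value]
        if running >= threshold:
            p75 = value
            break

    return {"count": total, "mean": mean, "p75": p75}


# Hypothetical bucket counts, made up for this example only.
before = {0: 700, 1: 200, 5: 80, 11: 20}
after = {0: 400, 1: 200, 5: 100, 11: 300}

summary_before = histogram_summary(before)
summary_after = histogram_summary(after)

print(summary_before)
print(summary_after)
```

Both made-up days have the same sample count, so the shift in mean and 75th percentile comes from samples moving into larger buckets rather than from a change in how much is being measured.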
I don't suppose the simultaneous change [1][2] in COMPOSITE_FRAME_ROUNDTRIP time might be related?

[1]: https://mzl.la/2anSXdV
[2]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1645/alerts/?from=2016-07-29&to=2016-07-29
What release channel(s) did this happen on? Did things change across platforms or only on one?
This appears to be Windows-only (Mac+Linux [1] remain flat) and it was detected on Nightly.

[1]: https://mzl.la/2ao5Uoq
Due to timing, it may have crept into Aurora as well, but submission volumes are too low to say for sure just yet.
It looks like FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS went up but FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS did not. This suggests that it may have been a change to the Firefox frontend and not a graphics change that caused the difference.
I didn't see anything obvious in the regression-window. Here's the best I was able to come up with: https://hg.mozilla.org/mozilla-central/rev/2b3f0b8a6318
I don't think that the patch that was mentioned here has something to do with the issue.
I also don't see anything in the regression range that stands out as a probable culprit.

(In reply to Chris H-C :chutten from comment #0)
> Questions:
> Is this change in the measure expected/intentional?

No.

> Is this change in the measure acceptable?

No.

> Is this evidence that the measure is incorrect/not useful?

No.

(In reply to Chris H-C :chutten from comment #1)
> I don't suppose the simultaneous change in COMPOSITE_FRAME_ROUNDTRIP
> time might be related?

Seems that way, yeah.
If it's any help, it appears to be concentrated on Windows. Linux and Mac show no corresponding increases on that day.

https://mzl.la/2bJH5EF
https://mzl.la/2bJGfbh
https://mzl.la/2bJGlPO
I've been puzzling over this a bit. I sort of agree with Jeff's conclusion in comment 5 that since this only affects the parent process it's unlikely to be a gfx change, and is likely something that changed in the browser front-end. Given that the regression is in the sync smooth-scrolling code (ScrollFrameHelper::AsyncScroll and ScrollFrameHelper::AsyncSmoothMSDScroll are the two classes that enable recording of the FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS histogram), and given that APZ should be enabled by default, it seems that the regression is limited to some part of the browser chrome that is somehow doing smooth-scrolling without APZ. Really there shouldn't be anything that does this at all. I'll try setting a breakpoint in that code in a local build and see what trips it, that might provide some clues.
One example of an action that triggers this behaviour is scrolling down a page using the spacebar on about:addons or about:preferences. Another is wheel-scrolling inside select element popups: these basically spawn non-APZ popup windows, so wheel scrolling inside them with smooth scrolling enabled does sync scrolling on the main thread.
Looking over the regression pushlog again I also noticed bug 1278408, which actually might be the cause. If the vsync timestamps are shifted around then the refresh driver will tick at different times, and so any sync-scroll animations will be affected. It may not be an actual performance regression, but just a shift in timestamps being used for the probes. I'm not entirely sure why the content process escaped unscathed but maybe the IPC overhead negated some of this? Mason, any thoughts? It'll be interesting to see if bug 1295214 fixes this regression as well.
(In reply to Kartikaya Gupta (email:firstname.lastname@example.org) from comment #12)
> Looking over the regression pushlog again I also noticed bug 1278408, which
> actually might be the cause. If the vsync timestamps are shifted around then
> the refresh driver will tick at different times, and so any sync-scroll
> animations will be affected. It may not be an actual performance regression,
> but just a shift in timestamps being used for the probes. I'm not entirely
> sure why the content process escaped unscathed but maybe the IPC overhead
> negated some of this? Mason, any thoughts?
>
> It'll be interesting to see if bug 1295214 fixes this regression as well.

From [1], we track the latency between the vsync timestamp and when the parent process runs the refresh drivers. For the child process, we track the latency since the last tick. Like Kats said, this is kind of a regression, but not a real one. It's noisy because we had bugs in reporting the vsync timestamp with Intel drivers that we forgot to correct for, which bug 1295214 should fix.

[1] http://searchfox.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp#452
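To illustrate why a timestamp shift alone can move the probe, here is a small sketch. It assumes the parent-process delay is simply the tick time minus the reported vsync timestamp, and that the reporting bug makes the vsync timestamp appear earlier than it really was. The helper name `frame_delays_ms`, the 10 ms skew, its direction, and all the numbers are assumptions chosen for the example; none of them are taken from nsRefreshDriver.cpp.

```python
def frame_delays_ms(vsync_timestamps_ms, tick_times_ms):
    """Delay recorded per frame: when the refresh driver ticked, minus the
    vsync timestamp it was given (both in milliseconds)."""
    return [tick - vsync for vsync, tick in zip(vsync_timestamps_ms, tick_times_ms)]


# Hypothetical timeline: a vsync every 16 ms, with the refresh driver
# ticking 1 ms after each one. Values are made up for illustration.
true_vsync = [0.0, 16.0, 32.0, 48.0]
ticks = [1.0, 17.0, 33.0, 49.0]

# Assumed reporting error: each vsync timestamp is reported 10 ms early.
skew_ms = 10.0
reported_vsync = [t - skew_ms for t in true_vsync]

delays_with_true_timestamps = frame_delays_ms(true_vsync, ticks)
delays_with_skewed_timestamps = frame_delays_ms(reported_vsync, ticks)

print(delays_with_true_timestamps)    # [1.0, 1.0, 1.0, 1.0]
print(delays_with_skewed_timestamps)  # [11.0, 11.0, 11.0, 11.0]
```

The tick times are identical in both runs, so nothing actually got slower; only the reference timestamp moved, and the recorded delay grew by the size of the skew.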
Priority: -- → P3
The telemetry data appears to be back down as of the Aug 27 build, which corresponds to the first build that has the fix from bug 1295214. However, I want a few more days' data before I call it.
Ok, calling this fixed by bug 1295214, on both nightly and aurora.