Regression from bug 1150727
On a side note, I got a 7% decrease in tresize, lol. Improvement: Mozilla-Inbound-Non-PGO - TResize - Ubuntu HW 12.04 x64 - 7.18% decrease
Assignee: nobody → mchang
Status: NEW → ASSIGNED
There is also a -9.11% regression on SVG, Opacity Row Major for Ubuntu 12.04 x64: http://alertmanager.allizom.org:8080/alerts.html?rev=2e0ebba4a766&showAll=1&testIndex=0&platIndex=0
I see two tests in this talos suite:

big-optimizable-group-opacity-2500.svg
small-group-opacity-2500.svg

From the try run, it looks like small-group-opacity-2500.svg is the one causing the regression. Profiles are based off commit a530b5c3b713 for Linux opt (both x64 and x32 regress, but x32 regresses more).

Without the vsync refresh driver: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/7c1bf924b0ca4afa7025fa202b93168fff83be01b5bac93ad3b20d369c6bc7d2cca3b8c428c78ec24d7e9bb16869420d0a0f692c4ce342b511bb160bc6a844e6&pathInZip=profile_tsvgr_opacity/small-group-opacity-2500.svg/cycle_0.sps#report=4b2e793f6e48aec9e3e51c5a3551f9ad83299c01

With the vsync refresh driver: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/5e6d66b946b025a996f5bfcca6ff678effaebe66d00a38aa68e3670b8cc102d33ce2fb0268881393f6de736a4ca16e5b39d41d6203a5eee00225b0613ba337fb&pathInZip=profile_tsvgr_opacity/small-group-opacity-2500.svg/cycle_0.sps#report=9b1da598428bc08f1d263dcc44dfed59cbf2907e

Interesting that there are no vsync markers in the profile, but that's because the Compositor is asleep the whole time.
From the profiles, it looks like during every test run we tick the refresh driver an extra time at the start of the page load with the vsync refresh driver. This causes 3 total refresh ticks per test versus 2 without the vsync refresh driver. This extra tick takes ~65-70 ms, which is the regression here. Investigating.
This is a sad state of affairs. This test only uses the refresh driver and not the compositor.

Without Silk, between each test load we stop the refresh timer from ticking, since there is about 300 ms between each test. When we load the page, we start the refresh timer again with an nsITimer. Starting this timer means the first tick will fire one timer interval from Now(), e.g. 16 ms from when nsRefreshTimer::StartTimer() is called. Between the page load and that first refresh driver tick, the main thread is actually busy with nsInputStreamPump::OnInputStreamReady, so the first actual refresh driver tick lands 58 ms after the page loads. The time between the page load and the nsInputStreamPump is 5 ms.

With Silk, because the time between tests is 300 ms, we disable vsync, since the only thing listening to vsync is the refresh driver and the refresh timer was stopped between tests. This is on Linux, so it uses the software vsync source. When the test starts and the page loads, we start the refresh timer, which eventually enables vsync. The software vsync source posts a vsync task on the vsync thread immediately, by design, which ticks the refresh driver BEFORE the nsInputStreamPump gets to execute. The unfortunate side effect is that the refresh driver tick is very expensive in this test, so we get a large regression. Because nsITimers don't fire immediately but one interval later, they effectively skip one refresh driver tick.

If I force the software vsync source to start the first vsync task 16 ms after enabling vsync, like the nsITimer does, this regression goes away. I'm not sure we actually want to make the software vsync source more like nsITimers, though. With real hardware vsync sources, we don't have any control over when the first vsync will occur after we enable vsync. We just flip the switch and wait for the callback. On OS X, this can be anywhere from 1-10 ms.
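To make the timing concrete, here's a tiny illustrative model (not Gecko code; the function name is hypothetical, and the 5 ms / 16 ms numbers come from the profile described above) of whether the first scheduled tick preempts the page-load work on the main thread:

```cpp
#include <cassert>
#include <cstdint>

// All times are in milliseconds relative to the page load.
// tickDelayMs: how long after (re)starting the timer/vsync source the first
//   refresh tick is scheduled. nsITimer schedules it one full interval out
//   (16 ms); the software vsync source posts it immediately (0 ms).
// loadWorkStartMs: when the main thread picks up the page-load work
//   (nsInputStreamPump::OnInputStreamReady ran ~5 ms after the load here).
// If the tick is scheduled before the load work starts, it runs first and we
// pay for an extra (expensive) refresh driver tick.
bool TickPreemptsLoadWork(int64_t tickDelayMs, int64_t loadWorkStartMs) {
  return tickDelayMs < loadWorkStartMs;
}
```

With the numbers above, the software vsync source (delay 0) ticks before the 5 ms pump work runs, while the nsITimer path (delay 16) does not, matching the 3-vs-2 tick counts seen in the profiles.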
On Windows, we post a task immediately to the vsync thread and wait for DwmFlush to wake up, which happens whenever DWM decides the next vblank is, regardless of what Gecko does, because the whole OS is throttled by DWM. So while ticking immediately doesn't always consistently mimic hardware vsync, it models it better than nsITimers, which are consistently 16 ms - something that rarely happens with real hardware. An alternative would be to post the task at a random time between 1-16 ms, but we'd still have this regression whenever the random number is < 5 ms, and this test would also become a lot noisier.

@Kats - Any strong opinion on whether we should make the software vsync source more like nsITimers? I'm leaning towards no at the moment, to make it mimic hardware vsync. Thanks!

https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp#264
https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/SoftwareVsyncSource.cpp?from=SoftwareVsyncSource.cpp#52
Thanks for a great investigation! In order to understand better how to proceed, I'd ask the following questions:

1. How much of this would be noticeable to a user, and do we understand the scope of cases where the user could tell the difference, or of other cases which might be affected by a similar sequence of events?

2. Assuming the regressed test results do reflect a worse user experience, how much worse would it actually be? (This is subjective of course, but an assessment could help nevertheless.)

3. If you think that this test doesn't represent user-experience cases we should care about WRT Silk, let us know.

As for when to start the timer, you could do what the old refresh timer did in software timers and stick to a virtual vsync signal. E.g. set the origin timestamp once when you initialize Silk, and from then onwards assume vsyncs according to some virtual vsync hardware source. When you need to start the timer, just fire the next event whenever the virtual vsync source says it should.

While this indeed adds some uncertainty to the timing, it will not be different than a real hardware vsync source, AND it will cut the latency on average by half: whenever you randomly wait for the next vsync signal, on average it will arrive at half the vsync interval.
(In reply to Avi Halachmi (:avih) from comment #7)
> Thanks for a great investigation!
>
> 1. How much of this would be noticeable to a user, and do we understand the
> scope of cases where the user could tell the difference of other cases which
> might be affected by a similar sequence of events?

This regression is, what, 60 ms? So it would be noticeable only to Linux users, and only in the case where they were idle for a while and then loaded a new page that took a long time to paint. I don't think users would be able to see this, because by the time they clicked on a page to load, those chrome interactions would have started the refresh driver / vsync.

> 2. Assuming the regressed test results do reflect a worse user experience,
> how much worse would it actually be? (this is subjective of course, but an
> assessment could help nevertheless).

I don't expect much. I can't tell the difference between 200 ms and 260 ms, and this only affects Linux desktop users.

> 3. If you think that this test doesn't represent user-experience cases we
> should care about WRT Silk, let us know.

Page load is pretty important, and we are doing more work; it's just irksome that this is based off timings causing us to do more work. It's also interesting that the display list generation takes 60 ms, which is something we should probably fix. I do wonder how often we load a page without having the refresh driver start ticking beforehand, though.

> As for when to start the timer, you could do what the old refresh timer did
> in software timers, and stick to a virtual vsync signal. E.g. set the origin
> timestamp once when you initialize silk, and from then onwards assume vsyncs
> according to some virtual vsync hardware source. When you need to start the
> timer, just fire the next event whenever the virtual vsync source says it
> should.
>
> While this indeed add some uncertainty to the timing, it will not be
> different than real hardware vsync source, AND it will cut the latency on
> average by half, since whenever randomly you wait for the next vsync signal,
> on average it will arrive at half the vsync interval.

Yeah, that could work, but we'd need a lot of extra timekeeping. Software vsync is for platforms that have lower priority, so I'm not sure how much we'd gain from it. I'd be more inclined to have the next vsync occur at a random interval, just because of the simpler implementation.
Created attachment 8589372 [details] [diff] [review]
Pick a random time between 0-15 ms to vsync in software vsync

Let's see if randomizing the vsync start time fixes this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=545bd113a6fe
I think that Windows XP also doesn't have HW vsync, so that adds a few more users.

As for hard, what about something like:

> nextVsyncIn = interval - (now - init_time) % interval;

and if you care about rounding errors, reset init_time once a day?

Also, if you don't care about consistency of a virtual vsync signal, why wait interval ms (or random ms) before the next vsync instead of firing one right when you know you should start providing vsync events? That should cut the latency down to probably the minimum possible with this system, would it not?

In general, IMO, responsiveness and refresh frequency usually have higher weight than delaying other work (e.g. refreshing the screen more and making the page load slightly longer is preferable IMO to finishing the load 50 ms earlier but with fewer screen refreshes).

As an anecdote, some very old Gecko code was designed to reflect the exact opposite: it would prioritize internal Gecko events for up to 2000 ms (2 s!) from page load in order to finish the load without any interruptions, and while doing so it would freeze screen updates almost completely (since paint events have lower priority). That's bug 930793 and it's still open.
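The virtual-vsync formula above can be sketched as a small helper (hypothetical name, integer milliseconds; not actual Gecko code). Given a fixed origin timestamp set once at initialization, it phase-aligns the next software vsync to the virtual vsync grid:

```cpp
#include <cassert>
#include <cstdint>

// Delay (in ms) until the next virtual vsync, given a fixed origin initTime
// set once when the virtual vsync source was initialized. If `now` falls
// exactly on a vsync boundary, this waits a full interval rather than 0.
int64_t NextVsyncIn(int64_t now, int64_t initTime, int64_t intervalMs) {
  return intervalMs - (now - initTime) % intervalMs;
}
```

For example, with initTime = 0 and a 16 ms interval, enabling vsync at now = 37 ms waits 11 ms, so the tick lands on the 48 ms grid point - on average about half an interval of latency, as noted above.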
Created attachment 8589379 [details] [diff] [review]
Randomize next vsync timestamp when enabling software vsync.

Updated to also update the timestamp passed to the vsync task.
Attachment #8589372 - Attachment is obsolete: true
(In reply to Avi Halachmi (:avih) from comment #10)
> I think that Windows XP also doesn't have HW vsyncs, so that adds few more
> users.
>
> As for hard, what about something like:
>
> > nextVsyncIn = interval - (now - init_time) % interval;
>
> and if you care about rounding errors, reset init_time once a day?

Ahh yeah, that would work - math fail. Thanks!

> Also, if you don't care about consistency of a virtual vsync signal, why
> wait interval ms (or random ms) before the next vsync instead of firing one
> right when you know you should start providing vsync events? That should cut
> the latency down to probably the minimum possible with this system, would it
> not?

That's what we do now, and that's why we get this regression :(. Unless I'm misunderstanding you.
Created attachment 8589407 [details] [diff] [review]
Randomize next vsync tick when enabling software vsync.

Randomize the next vsync tick when enabling software vsync. Before, the next vsync event would fire ASAP after enabling software vsync. This patch changes it to choose a random time between 0-16 ms, to more closely mimic hardware vsync, where we have no control over when the vsync will occur after enabling it. It also mostly fixes this regression. There will still be cases where we tick too early, which will cause a large refresh driver tick, but it shouldn't happen on every test load anymore. http://compare-talos.mattn.ca/dev/?oldBranch=Try&oldRevs=4b1c508b52d2&newBranch=Try&newRev=545bd113a6fe&submit=true
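A sketch of the idea in this patch (hypothetical names, not the real SoftwareVsyncSource code): instead of posting the first vsync task immediately when vsync is enabled, pick a random delay within one vsync interval, mimicking hardware vsync where we have no control over when the first vsync arrives after we flip the switch.

```cpp
#include <cassert>
#include <random>

// Random delay (in ms) before the first software vsync after enabling vsync,
// uniform in [0, intervalMs - 1]. The rng is passed in so callers control
// seeding (the real code would just use a single source of randomness).
int RandomFirstVsyncDelayMs(std::mt19937& rng, int intervalMs = 16) {
  std::uniform_int_distribution<int> dist(0, intervalMs - 1);
  return dist(rng);
}
```

Note this only mostly fixes the regression: whenever the drawn delay is shorter than the ~5 ms of pending page-load work, the early tick still happens, which is why the test becomes bimodal.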
In my experience, having a test provide a consistent, noise-free value is more important here. This patch will make the test bimodal: in some cases we do an extra paint and in some cases we do not, and the times for those two are different. Sure, it makes the test more representative of real-world cases, but it also means future regressions will be harder to diagnose and debug because you don't get deterministic behavior across runs. For example, say the extra paint occurs 30% of the time but regresses by 5x. This will cause a significant overall regression in the test score, but when running the test locally it will only be reproducible 30% of the time, which makes it annoying to debug.

I would argue that instead we should not change anything and just accept the regression. Mason did a great job in digging through and understanding why the regression is happening. If I understood it correctly, the test is now representative of the worst-case scenario that the user might encounter with the refresh driver code, to the extent that the test is representative of any real-world use case. And exercising the worst-case scenario is perfectly fine IMHO, as regressions will be more easily picked up.

Also, I want to expand on comment 8 a little about the real-world implications of this change. The regression identified by the test means that when we go from vsync idle to vsync enabled, we reduce the latency to painting by ~15 ms, at the cost of possibly doing an extra paint in scenarios where there are going to be multiple paints, such as page loads. So in the case where we do a single paint, we benefit by reducing latency, and this patch undoes that (randomly). In the case where we are doing multiple paints already, an extra paint is not going to hurt all that much unless the paints are super expensive. And I agree with Mason that in most page-load scenarios we will not be starting from a vsync-idle state. Thoughts?
Comment on attachment 8589407 [details] [diff] [review]
Randomize next vsync tick when enabling software vsync.

Clearing for now.
My foremost concern is impact on users. Test numbers are just numbers: they're useful if they reflect changes we care about, and we should not make any effort to "artificially" improve test numbers only for the sake of improving or not regressing the numbers.

That being said, this test in its current form does seem to reflect a worst-case scenario for users. I don't think we know how frequently we'll hit similar cases in the real world, but my guess is that it's not rare (feel free to correct me).

As for noise, I think the results have been relatively consistent already, but I'm unable to assess how much noise would be added by randomizing the first timeout.

I'd suggest to only make changes which we consider best for the users, and if that ends up showing "worse" test results, so be it. Makes sense?
I agree 100% that we should make the change that is best for the user. I don't think the patch that Mason posted does that, though. If you think of a delay of 0 as "low latency, possibly an extra repaint" and a delay of 15 ms as "high latency, no repaint", then setting a random delay is basically picking a random point on this cost-benefit slider. I don't see how that's any better than just sticking with the delay=0 case always. IMO all of the points on the slider are pretty much equivalent from the overall user-benefit point of view (i.e. averaged across all users across all page loads). So, given that, we might as well pick the point that keeps the test less noisy, as a secondary factor in the decision-making process.
Here's a perf-o-matic chart for the test (I think I did this right): http://graphs.mozilla.org/graph.html#tests=[[225,63,33],[225,63,35]]&sel=1428147510174.1204,1428508377644&displayrange=30&datatype=geo

The test looks pretty consistent. I think doing nothing here and accepting the regression is the right thing to do. I agree with :kats that a test providing a consistent, noise-free value is more important; debugging intermittents is a very large pain, and making further regressions easier to debug is the right thing to do. I also agree with Avi that we should do what's best for the users, and doing nothing is best for the user, since we want to show content to the user ASAP and the current implementation paints immediately. Since we assume we can paint and composite within 16 ms, a paint of more than 16 ms is considered an anomaly; this test case of a 60 ms paint shouldn't be happening most of the time. In the normal case, painting ASAP is actually a performance improvement from a responsiveness perspective. Avi pointed this out in comment 10 and I agree.

Finally, I do agree that this test stresses a case that will probably not occur in the real world, because this test doesn't composite. Any animations or anything in the background will be ticking the refresh driver / compositor, which will keep vsync alive. Now that off-main-thread animations are enabled, we will be ticking the compositor independently of the refresh driver, which reduces the scenarios this test measures.

It seems like doing nothing here is a win-win, except for the actual test numbers. We keep the test in a worst-case but consistent state, and users actually get a reduction in painting latency. Any objections?
Sounds good to me!
No objection here if you both agree this is best for users. Thanks again for an excellent investigation!
@Joel - We all agree we should just accept this regression. Any objections to closing this bug? Thanks!
Mason! Thanks for investigating this. I will be happy to accept the new baseline :)
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WONTFIX
BTW, we probably should have an effort to minimize noise in tests by controlling when the refresh driver and compositor tick. That would give us a lot more accuracy versus waiting on timers.
This only affects tp5o and the svg opacity tests, which measure "pure" page load time until mozafterpaint, and tresize, which measures the same thing after each resize "step" (and historically does show good correlation to gfx performance).

Most of the other tests do some animation in ASAP mode and so are unaffected by hw/sw vsync timing.

And the current Silk behavior should have more consistent results (compared to pre-Silk) if I understand correctly.

So what tests/cases would such an effort still improve?
(In reply to Avi Halachmi (:avih) from comment #24)
> This only affects tp5o and svg opacity tests which measure "pure" page load
> time till mozafterpaint, and tresize which measure the same thing after each
> resize "step" (and historically does show good correlation to gfx
> performance).
>
> Most of the other tests do some animation in ASAP mode and so are unaffected
> by hw/sw vsync timing.
>
> And the current Silk behavior should have more consistent results (compared
> to pre-Silk) if I understand correctly.
>
> So what tests/cases would such effort still improve?

Only the non-ASAP mode ones, which seem to be the test cases that I introduced regressions in, heh. It seems like any perf test that relies on internal timing mechanisms introduces some randomness. Removing that randomness should improve the noise situation of these two tests.
Resolution: WONTFIX → FIXED
(In reply to Mason Chang [:mchang] from comment #25)
> Only the non-ASAP mode ones, which seems to be the test cases that I
> introduced regressions in heh. Seems like any perf tests that rely on
> internal timing mechanisms introduces some randomness. Removing that
> randomness should improve the noise situation of these two tests.

Well, yes, but which cases are those? The current Silk behavior doesn't include randomness? Or do you mean to replace HW vsync with SW vsync on all platforms (while running non-ASAP talos tests) to reduce the apparent randomness from real HW vsync?
(In reply to Avi Halachmi (:avih) from comment #26)
> Well, yes, but which cases are those? The current silk behavior doesn't
> include randomness? Or do you mean to replace HW vsync with SW vsync on all
> platforms (while running non-ASAP talos tests) to reduce the apparent
> randomness from real hw vsync?

Yes, both the current Silk behavior and the non-Silk behavior include randomness. I mean the tests should control the paints explicitly and disable the timers, e.g. the test should:

1) Load the page
2) Tick the refresh driver
3) Composite
4) Check whether the page load has finished; otherwise repeat 2-3
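The loop sketched above could look something like this in a hypothetical deterministic harness (all names here are made up for illustration; this is not talos or Gecko code). The point is that the tick and composite counts become exact instead of depending on timer/vsync timing:

```cpp
#include <cassert>

// A fake engine where the test, not a timer, drives refresh and composite.
// pendingFrames stands in for "work left before the page load is finished".
struct FakeEngine {
  int pendingFrames = 0;
  int ticks = 0;
  int composites = 0;
  void LoadPage(int frames) { pendingFrames = frames; }
  void TickRefreshDriver() {
    ++ticks;
    if (pendingFrames > 0) {
      --pendingFrames;
    }
  }
  void Composite() { ++composites; }
  bool PageLoadFinished() const { return pendingFrames == 0; }
};

// Drives the four steps from the comment above; returns the exact tick count.
int RunDeterministicLoad(FakeEngine& e, int framesToFinish) {
  e.LoadPage(framesToFinish);   // 1) load the page
  while (!e.PageLoadFinished()) {
    e.TickRefreshDriver();      // 2) tick the refresh driver
    e.Composite();              // 3) composite
  }                             // 4) check for finish, else repeat 2-3
  return e.ticks;
}
```

Because the harness controls every tick, a regression would reproduce on every run instead of only when the timing happens to line up.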
Hmm, I think otherwise here (but still open for discussion). In general I want the tests to be as black-box as possible, such that they don't depend much on internal implementations - which is exactly how this bug was created: _because_ the test doesn't trigger internal things, it was able to detect that the first paint with Silk could happen, and we could have this discussion and conclude whatever we concluded.

If the tests start driving internal components, imagine how much harder it would have been to maintain them over transitions to the following technologies: OMTC, OMTA, APZC, e10s, and any other internal changes which would require changes in the "driving" logic/code. And do recall that the same tests are executed on all branches and all platforms, which include different configurations (for instance of OMTC, e10s, etc.). If the implementation is such that it exhibits some micro-randomness, that's fine - we want to know that and see that the results are noisy.

On some cases, however, we do violate this black-box principle, like with waiting for the mozAfterPaint event, or by using ASAP mode, because those are relatively generic and do improve the quality of the tests very meaningfully (in the case of ASAP, many tests are actually meaningless without it - and it indeed required lots of extra maintenance with OMTC and e10s).

As for repetitions, most tests (page load tests included) are repeated 12-25 times for each page (we call them "replicates"), and then they're crunched into a single value which is considered that page's score (typically after throwing away the first 1-5 values and taking the median of the rest). Then all the pages' (or subtests') scores are further crunched into yet another number which represents the overall score of the suite, such as the overall result of the "tp5o" run. Depending on which system is used (graph server / datazilla / treeherder), some or all of these intermediate results are kept.

So the noise levels should be relatively stable due to these averages, even if each single load does include inherent micro-randomness. The same goes for GC/CC or any other component which could add uncertainty to timing: they average out, and we want to know what the noise level is and how the average changes over time.