Investigate TART ".error" regressions on UX branch on OS X 10.8

RESOLVED WONTFIX

Status

()

defect
RESOLVED WONTFIX
6 years ago
6 years ago

People

(Reporter: avih, Unassigned)

Tracking

(Blocks 1 bug, {perf})

Trunk
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [Australis:M9][Australis:P1])

Attachments

(4 attachments, 1 obsolete attachment)

mconley recently noticed a regression where some TART .error values on OS X 10.8 on the UX branch are considerably higher than the m-c ones (reminder: the .error values indicate target vs actual animation duration, e.g. if the animation should last 200ms but in practice lasts 207, then .error is 7):

http://compare-talos.mattn.ca/breakdown.html?oldTestIds=31243565,31252017,31252031,31252075,31252083,31252091,31252101,31252127,31252135,31252145&newTestIds=31242563,31252259,31252363,31252409,31252547,31252555,31252567,31252593,31252601,31252629&testName=tart&osName=Mac%2010.8&server=graphs.mozilla.org

The most meaningful ones are (highest %/absolute regressions):
- simple-close-DPI1.error where m-c is 0.77 while UX is 7.19 (830%)
- icon-open-DPI1.error    where m-c is 19.7 while UX is 34.7 (15ms)


Initially we thought they've been there all along and we somehow dismissed them, but looking at a similar comparison from Aug-25 shows that the regressions were considerably smaller two months ago - and they were indeed dismissible:

http://compare-talos.mattn.ca/breakdown.html?oldTestIds=29935085,29943989,29944011,29944037,29944043,29944049,29944065,29944073,29944079,29944143&newTestIds=29935065,29944097,29944113,29944119,29944125,29944149,29944155,29944161,29944167,29944191&testName=tart&osName=Mac%2010.8&server=graphs.mozilla.org

Where:
- simple-close-DPI1.error where m-c is 4.7  while UX is 6.4
- icon-open-DPI1.error    where m-c is 33.3 while UX is 34.5


Datazilla shows this (the graph starts on Sep): https://datazilla.mozilla.org/?start=1379500722&stop=1383236782&product=Firefox&repository=Firefox&os=mac&os_version=OS%20X%2010.8&test=tart&compare_product=Firefox&compare_repository=UX&project=talos

The .error results are very noisy and hard to read/interpret, but:
- We didn't notice a meaningful change over time of the .error values on UX/m-c.
- Of the above two .error values, _maybe_ UX is worse than m-c despite the noise - about twice as high - but it could be just the way datazilla overlays the two sets one over the other. Hard to tell.


So the question is what causes the higher .error regressions in recent builds compared to two months ago - as indicated by the above compare-talos links, and possibly also by datazilla.

Jeff suggested that it might be noise from +1 frame to complete the animation due to some recent gfx changes, since the first high regression is about 7ms, which is similar to a single frame duration of that animation. Such regression would be acceptable IMO.

It could also be due to the relatively high noise of the .error values.

However, considering that each talos TART run executes each test 25 times, and both comparisons supposedly include 10 talos TART runs (the links do seem to indicate this), then if there's a 1-frame noise, it should average over so many executions, and not show up. But it seems it does show.


So step one is to verify/reproduce the results locally. I executed TART locally on a 2010 MBA (OS X 10.8.4). I've used official UX and m-c nightlies from 2013-11-07 with clean profiles. I installed the TART v1.5 addon from bug 848358, and set the prefs to make it similar to the talos setup (ASAP mode, OMTC off).

Here are my results of the tests which showed the highest regression on compare talos (I ran the tests more than once, restarted the browser in between, with different number of repetitions etc, and all the results/runs were in the same ballpark):

m-c 2013-11-07 (30 runs):
-------------------------
simple-close-DPI1.half.TART    Average (30): 8.44  stddev: 2.97
simple-close-DPI1.all.TART     Average (30): 10.55 stddev: 3.72
simple-close-DPI1.error.TART   Average (30): 9.63  stddev: 6.90 <--

icon-open-DPI1.half.TART       Average (30): 10.45 stddev: 2.05
icon-open-DPI1.all.TART        Average (30): 13.85 stddev: 2.35
icon-open-DPI1.error.TART      Average (30): 36.03 stddev: 5.92 <--


UX 2013-11-07 (30 runs):
------------------------
simple-close-DPI1.half.TART    Average (30): 7.75  stddev: 3.33
simple-close-DPI1.all.TART     Average (30): 8.94  stddev: 3.60
simple-close-DPI1.error.TART   Average (30): 11.18 stddev: 5.17 <--

icon-open-DPI1.half.TART       Average (30): 10.31 stddev: 1.34
icon-open-DPI1.all.TART        Average (30): 11.96 stddev: 1.23
icon-open-DPI1.error.TART      Average (30): 37.16 stddev: 7.67 <--


While the .error were slightly worse on UX on my local runs, it's nowhere near the indications we've had initially (highest of those is ~1.5ms = ~15%). Also, roughly, the stddev of the .error values is ~70% of 1 frame interval of that animation - which sounds reasonable considering that the number of frames might differ by 1 between runs.


Since the local results were relatively consistent, then maybe:

1. Compare talos doesn't average all the runs properly.

2. One/some of the builds which the compare-talos links used were not representative.

3. When Datazilla overlays two sets of results onto the same graph, it first draws one set, then the other on top of it, which may lead to incorrect interpretation of the graphs.

4. Maybe the noise is between builds (rather than between runs), and the two UX/mc builds which I tested just happen to perform similarly.

5. There is a consistent UX regression (on average) which, for some reason, didn't show on my local runs.
  
6. Other ideas?


I think we should:
1. Check the values which compare-talos used for those links.
2. Get the datazilla values and calculate the stats manually.

If we're lucky and the values from these two sets indicate results similar to my local runs, then we're good (1.5ms .error regression on average is negligible/not a blocker), but it'd still be nice to fix compare-talos and think what to do about datazilla's display.

Otherwise, we'll need to investigate if there's a between-builds noise (so datazilla's interpretation is correct, my local run was lucky builds, and compare-talos was unlucky builds), and possibly the source of the regression-on-average.
Flags: needinfo?(jmaher)
Flags: needinfo?(MattN+bmo)
Whiteboard: [Australis:M9][Australis:P1]
I have pulled the last 75 days worth of runs from UX and MC as found in datazilla:
http://people.mozilla.org/~jmaher/mc.zip
http://people.mozilla.org/~jmaher/ux.zip

These are just the tart results for 10.8 on the respective branches. We can now go through this and validate numbers or graph them independently of datazilla UI.
Flags: needinfo?(jmaher)
Summary: Investigate TART ".error: regressions on UX branch on OS X 10.8 → Investigate TART ".error" regressions on UX branch on OS X 10.8
Joel posted the same files from comment 1, but file names are ordered by landing date (the content is unmodified AFAIK).

http://people.mozilla.org/~jmaher/mc-ordered.zip
http://people.mozilla.org/~jmaher/ux-ordered.zip

Otherwise, with the files from comment 1, the date within each JSON blob is the upload-to-datazilla date (e.g if a test was retriggered a month after the initial run, then it would have 2 dates - a month apart). There's still no direct way to tell the landing date from these files, but at least we'll know the dates are ordered - by file names only.
Jeff and I just looked at an SPS profile that we got off of a 10.8 talos box, and it looks like (at least while profiling) we're actually spending *less* time doing the animation than expected. The difference that error calculates is an absolute value - so that's why it's in the positive.

So now I'm going to determine if this is in fact the case when *not* profiling - if the duration of some of these animations is *less* than expected. If so, the next step is to find out why.
Bah, not going to be that easy. The profile I took was garbage - looking at the raw TART interval values, we do spend 260ms or so animating for icon-open-DPI1.
So here's one instance where we stop recording 260ms after we start:

         29.89
         22.59
         16.03
         6.92
         9.55
         4.20
         13.60
         5.65
         8.25
         3.68
         3.72
         9.29
         5.56
         11.10
         5.54
         11.13
         5.63
         11.01
         3.59
         3.70
         9.44
         5.59
         11.02
         4.26
         3.85
233.10   8.36
         5.61
         11.21
         5.40
         11.32

So after that 8.36 ms frame...we go ahead and draw 4 more frames. WTF?
So I added some profile markers to mark calls to nsTransitionManager::FlushTransitions, as well as for calls to tabbrowser.xml's addTab and the point where tabbrowser.xml sets the fadein attribute to "true" (which kicks off the transition).

On my local UX build, there was a 26ms delay between the call to addTab, and the setting of the fadein attribute. Once the fadein attribute was set, the FlushTransitions was called every frame as one would expect, and from the point of setting the fadein attribute to the transitionend event was really close to 230ms.

Another thing to note is that tabbrowser.xml's addTab waits for the next frame via mozRequestAnimationFrame before settings the fadein attribute, so this could explain a bunch of the noise in the error measurements - sometimes we have to wait a while for the next frame, and sometimes we don't.

So my current hypothesis is that the animation itself is fine, and that the error measure is from something that occurs between the addTab call, and the point where fadein is set on the tab.

I'll instrument an m-c, push them to try, and get some comparison profiles to see what's different.
I have a theory:

Suppose for icon-open-DPI1 there's a 26-30ms delay between calling addTab in tabbrowser.xml, and the fadein attribute being set on the tab, which fires off the transition.

Suppose now that this delay is common for both UX and m-c, and roughly equal. Let's choose 26ms, because that's the number I'm seeing crop up the most.

The error measurement for TART works by summing up the recorded frame intervals since starting a subtest, and taking the absolute difference with the expected duration (230ms for UX, 250ms for m-c).

Assuming for the second that the layout and animation stuff is all sane and doing its job correctly, to me, the only plausible explanation is that UX is doing some paints in that 26ms interval that are being recorded. These are paint intervals *before* we set the fadein attribute.

That would certainly bump the error values here. I really can't think of another explanation for it beyond maybe a bug in our CSS transition stuff or the frame recorder (but that seems like a real longshot).

So I think that's what we should be looking for - things that paint between when addTab is called, and when fadein is set on the tab. This, to me, is the most likely explanation.
Mike, your analysis is correct. The .error usually reflects the work which is being done after sending the command (open/close tab) but before the actual animation is triggered - which right now is roughly similar to the first frame interval at the recording (which is typically longer than the later intervals). The noise level of this value is roughly 1 frame interval.

Also, it could be, for instance, by design. We've experimented in the past with triggering the animation a bit later (e.g. using setTimeout) in the hope that it would perform smoother once it starts. On such case (unless we also shorten the duration), it would increase the .error value.

However, this bug is about a seemingly very big change in this value over the past couple of months - not about the fact that it's not 0.

My suspicion is that compare talos sometimes lies. Either back then which led us to dismiss it, or right now where it shows a relatively big difference.

It just doesn't make sense to me that m-c errors _on average_ by 0.7ms on tab close, while ux errors by 7ms also on average, and 2 months ago they were both 5-6 ms. That 0.7ms is just too close to an not-averaged value, which raised my suspicion in this direction.

But until we can confirm that, we should assume it could be an actual regression. Our goal here is not to "fix the numbers" (which we can do, for instance, by triggering the animation earlier), but rather to understand where did this supposed change come from.
(In reply to Avi Halachmi (:avih) from comment #8)
> It just doesn't make sense to me that m-c errors _on average_ by 0.7ms on
> tab close, while ux errors by 7ms also on average, and 2 months ago they
> were both 5-6 ms. That 0.7ms is just too close to an not-averaged value,
> which raised my suspicion in this direction.

I don't believe compare-talos is using averages though - I think it's using medians. And if mozilla-central has a greater tendency to have almost-zero error values, that could certainly explain the 0.7ms error.

That to me sounds like something happened to make mozilla-central paint fewer things during the simple-close-DPI1 test...
Attachment #830262 - Attachment description: mconley local run of TART with m-c → mconley local run of TART with m-c - f003c386c77a
Attachment #830262 - Attachment mime type: text/x-log → text/plain
Comment on attachment 830262 [details]
mconley local run of TART with m-c - f003c386c77a

This build is out of date and doesn't match Avi's. Re-running with a more up-to-date build.
Attachment #830262 - Attachment is obsolete: true
Attachment #830314 - Attachment mime type: text/x-log → text/plain
Attachment #830296 - Attachment mime type: text/x-log → text/plain
Attachment #830298 - Attachment mime type: text/x-log → text/plain
TL;DR: Some regression exists and has been mostly unchanged since we started collecting measurements. We think they're not big enough to block Australis.

The data:

- These .error values are rather noisy. The main reason is that the measured duration (and as a derivative the delta to the fixed target duration) is a sum of a discrete number of animation frame durations, and since each frame is ~9ms (depending on the specific animation, and on HW), a minor perf change between builds could "jump" the result by ~9ms either way. Within the same build, the values are relatively consistent.

- mconley collected and analyzed the historic data from comment 2, and graphed medians and averages for both simple-close and icon-open:
- https://docs.google.com/spreadsheet/ccc?key=0Asj8iLTl0K0UdGJzVnVYYkJTSDREbndRWWFTLWNyS1E&usp=sharing#gid=21
- https://docs.google.com/spreadsheet/ccc?key=0Asj8iLTl0K0UdGJzVnVYYkJTSDREbndRWWFTLWNyS1E&usp=sharing#gid=22

These graphs show that:
- For simple-close.error, the UX branch has been stable with values of 3-9ms through the entire period. m-c was more noisy, with recent values between 2-5ms.
- For icon-open.error, UX is consistently noisy around 32ms, while m-c was even more noisy with recent values around 25ms.

mconley and myself also ran local talos TART tests on different OS X 10.8 HW (run logs at comments 10-15) with 2013-11-11 nightlies, while focusing on icon-open.error which showed the highest absolute regression, with the following results:

- 2012 rMBP: m-c 19ms, UX: 27ms
- 2010 MBA:  mc: 30ms, UX: 37ms

full summary of 25 cycles here: https://docs.google.com/spreadsheet/ccc?key=0Asj8iLTl0K0UdGJzVnVYYkJTSDREbndRWWFTLWNyS1E&usp=sharing#gid=23

Since these .error values mostly reflect the initial "lag" between the command and the animation start, mconley and myself also tried to subjectively assess if UX looks to have bigger lag compared to m-c (on the same 2012 rMBP and 2010 MBA). We tested it with a normal profile rather than the no-OMTC and ASAP mode which talos uses (and hence talos is more sensitive to perf changes) - to get as close results to what users will actually experience.

Both of us couldn't notice a subjective difference between the two.


Summary:
- the UX branch does appear to perform slightly worse than m-c on these .error values. In absolute numbers, the maximum regression on average is ~9ms on both talos HW and 2012 rMBP. Surprisingly, on a lower end HW as the 2010 MBA, the absolute regression is lower at around 7ms (and much lower in percentage). We guess that it might be related to the high-DPI of retina displays, but we didn't try to investigate further why the relative regression is smaller on lower-end HW.

- The apparent regression of UX compared to m-c over time is a combination of improvements in m-c perf which didn't affect UX, and the fact that the results are apparently noisy per build due to the discrete nature of this measurement.

- In practice, UX hasn't regressed in itself over time, while m-c improved slightly.

Considering that:
- Under normal working conditions frames are at least 17ms, and the maximum absolute regression (lag) we measured was around 9.
- On low-end HW the absolute regression is even lower.
- We couldn't notice a subjective regression on both low and high end HW under normal conditions.

We think it's not worth the time to investigate those further as even if a regression exists and is measurable, it's not something which users are likely to notice (even if they try really hard), and considering that the UX branch is quite different than m-c (SVG, more stuff to update at the tabstrip, etc), it's acceptable if it has slightly higher cost, and especially if it's not something which users will notice.

I'm resolving it as WONTFIX and as acceptable. If someone thinks otherwise, please reopen.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Flags: needinfo?(MattN+bmo)
You need to log in before you can comment on or make changes to this bug.