Closed Bug 1295292 Opened 3 years ago Closed 3 years ago

tpaint results look like they get worse with each iteration because it sorts the results

Categories

(Testing :: Talos, defect)

Unspecified
All
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zbraniecki, Assigned: Gijs)

References

Details

Attachments

(4 files, 1 obsolete file)

At the moment tpaint is reliably regressing with each run, which means that we really don't know what we're measuring and the results are very questionable.

It also means that increasing the sample size does not reduce the noise and basically makes the whole test very unreliable.

Example results from tpaint:

12:42:19     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [315.7099999999991, 316.119999999999, 319.33000000000175, 322.46500000000015, 325.77000000000044, 326.4350000000004, 330.43499999999995, 330.78499999999985, 331.65999999999985, 331.869999999999, 335.4549999999981, 337.20500000000175, 337.65999999999985, 339.9150000000009, 340.3199999999997, 345.3550000000014, 350.4449999999997, 363.9450000000006, 368.03999999999905, 418.4200000000001], "name": "tpaint", "value": 337.65999999999985}], "extraOptions": ["e10s"], "name": "tpaint"}]}


12:43:24     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [308.84000000000015, 311.9599999999991, 327.14500000000044, 328.3650000000016, 331.6999999999998, 332.9349999999977, 337.01000000000204, 338.6349999999984, 341.3999999999978, 344.0149999999999, 346.0149999999994, 347.744999999999, 348.4200000000001, 349.1749999999993, 354.4750000000022, 356.77000000000044, 366.0600000000013, 377.0600000000004, 391.96999999999935, 420.17999999999984], "name": "tpaint", "value": 348.4200000000001}], "extraOptions": ["e10s"], "name": "tpaint"}]}


I see bug 1179377, whose title indicates a relation to this bug (noisiness), but which then dives into what tpaint really tests vs. what ts_paint really tests.

I also see bug 1008163, which indicates that this problem has existed for quite a while.

While we can argue that the results are fairly reproducible and stable, I would say that the trend indicates an intrinsic error within the test and makes it impossible to increase the sample size in order to get statistically significant results.
Summary: tspaint and tpaint are regressing with each iteration → tpaint are regressing with each iteration
This is how it looks when I increase the number of runs to 100 for tpaint.

I'm testing on Ubuntu 16.04
Joel, can you take a look at this and help me CC the right people? :)
Flags: needinfo?(jmaher)
Attached image ts_paint RUNS=100
btw. I couldn't reproduce the same regression for ts_paint. Here's a chart of 100 runs of ts_paint.
this is a great bug, and could be the cause of a lot of random frustrations (or missed regressions).

I have :jimm as the contact, he might be able to help us find someone to look into this.
Flags: needinfo?(jmaher) → needinfo?(jmathies)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #1)
> Created attachment 8781259 [details]
> chart of two RUNS=100 tpaints
> 
> This is how it looks when I increase the number of runs to 100 for tpaint.
> 
> I'm testing on Ubuntu 16.04

You reproduced here, but..

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #3)
> Created attachment 8781261 [details]
> ts_paint RUNS=100
> 
> btw. I couldn't reproduce the same regression for ts_paint. Here's a chart
> of 100 runs of ts_paint.

couldn't reproduce here? Can you clarify?
Flags: needinfo?(gandalf)
I don't know who can work on this but I'll try and come up with some ideas.
Flags: needinfo?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #5)
> You reproduced here, but..

(...)

> couldn't reproduce here? Can you clarify?


I can reproduce on tpaint, not on ts_paint. Does it make sense?
Flags: needinfo?(gandalf) → needinfo?(jmathies)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #7)
> (In reply to Jim Mathies [:jimm] from comment #5)
> > You reproduced here, but..
> 
> (...)
> 
> > couldn't reproduce here? Can you clarify?
> 
> 
> I can reproduce on tpaint, not on ts_paint. Does it make sense?

Ah, sorry, didn't notice the different tests. thanks.
Flags: needinfo?(jmathies)
Any chance you can take a look soon at this?

It's killing us atm. We have a serious regression on tpaint and we're trying to track it down, but with the current results it may be a phantom, and we can't even vary the sample size to reduce the noise.
Flags: needinfo?(jmathies)
Also, I think it may be e10s only.

Look at those sample results: https://treeherder.mozilla.org/perf.html#/compare?originalProject=larch&originalRevision=7214320961ed&newProject=try&newRevision=2e0cb369c04f&framework=1&showOnlyImportant=0

tpaint opt e10s has 13-14% stdev, tpaint opt has 0.9% stdev.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #9)
> Any chance you can take a look soon at this?
> 
> It's killing us atm. We have a serious regression on tpaint and we're trying
> to track it down, but with the current results it may be a phantom, and we
> can't even vary the sample size to reduce the noise.

Not soon but I can look at it when I have free time. This could be anything since we're opening a new window (ctrl-n) and seeing progressively worse performance with each open.
Flags: needinfo?(jmathies)
We've been doing more work to understand the tpaint test better.

The description of the tpaint test from [0] states that it "tests the amount of time it takes to open a new window."

We found out that this statement is inaccurate. The test covers opening a new window *and* loading a new document into it, and it only measures MozAfterPaint after the document is loaded.
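
Roughly, the flow looks like this (a sketch pieced together from the descriptions in this bug, using illustrative names; it is not the actual tpaint.html source):

  // Opener side (sketch):
  var start = window.performance.now();
  var win = window.open("data:text/html,<html>...</html>"); // constant data: URI, no network
  // The new window loads that document, waits for its first MozAfterPaint,
  // and then calls back into the opener:
  function childIsOpen() {
    openTimes.push(window.performance.now() - start); // covers window open *and* document load/paint
    win.close();
  }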

This matters for this bug because we discovered that the steady regression is not visible when measuring the first MozAfterPaint inside chrome JS (browser.js loaded by browser.xul), but is visible from within the content document's JS.

What's more, loading that document (fed in as a "data:" URL) takes time. The MozAfterPaint of the document itself comes in at ~80ms (measured from the inception of the document context to the first MozAfterPaint).

There's an overlap with the chrome UI load, but here are the numbers we got on my machine:

* tpaint - 269ms
* browser.js MozAfterPaint - 229ms
* content document MozAfterPaint - 80ms

And, as I said before - browser.js MozAfterPaint doesn't have the slowdown on every window open. 

Questions:
 * do we want tpaint to test window open + document loading?
 * why does window.open + document loading have this bug, while window.open by itself doesn't?
 * is it a bug in talos test or in Firefox and does it affect users?

Since tpaint is used to inform all our work on the Firefox UI, I believe we should answer those questions rather urgently.

Here's raw data:

tpaint:

22:30:23     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [250.1550000000002, 251.15000000000055, 252.6949999999997, 252.71500000000015, 254.0649999999987, 255.54999999999927, 257.6399999999994, 262.6650000000009, 267.1550000000025, 267.75, 268.97000000000116, 269.15499999999884, 269.9449999999997, 270.75, 273.8650000000007, 273.9399999999987, 276.9349999999995, 277.15999999999985, 281.52499999999986, 283.994999999999], "name": "tpaint", "value": 269.9449999999997}], "name": "tpaint"}]}

browser.js MozAfterPaint (printing performance.now() which measures from document context inception):

22:18:49     INFO -  PROCESS | 32467 | 746.78: Paint for tID: 3 (after interactive)
22:18:51     INFO -  PROCESS | 32467 | 242.56: Paint for tID: 2 (after interactive)
22:18:52     INFO -  PROCESS | 32467 | 253.21: Paint for tID: 2 (after interactive)
22:18:54     INFO -  PROCESS | 32467 | 254.315: Paint for tID: 2 (after interactive)
22:18:55     INFO -  PROCESS | 32467 | 240.74: Paint for tID: 2 (after interactive)
22:18:57     INFO -  PROCESS | 32467 | 225.4: Paint for tID: 2 (after interactive)
22:18:58     INFO -  PROCESS | 32467 | 247.45000000000002: Paint for tID: 2 (after interactive)
22:19:00     INFO -  PROCESS | 32467 | 243.97: Paint for tID: 2 (after interactive)
22:19:01     INFO -  PROCESS | 32467 | 230.29500000000002: Paint for tID: 2 (after interactive)
22:19:03     INFO -  PROCESS | 32467 | 222.395: Paint for tID: 2 (after interactive)
22:19:05     INFO -  PROCESS | 32467 | 224.035: Paint for tID: 2 (after interactive)
22:19:06     INFO -  PROCESS | 32467 | 247.62: Paint for tID: 2 (after interactive)
22:19:08     INFO -  PROCESS | 32467 | 225.565: Paint for tID: 2 (after interactive)
22:19:09     INFO -  PROCESS | 32467 | 223.055: Paint for tID: 2 (after interactive)
22:19:11     INFO -  PROCESS | 32467 | 241.495: Paint for tID: 2 (after interactive)
22:19:12     INFO -  PROCESS | 32467 | 238.6: Paint for tID: 2 (after interactive)
22:19:14     INFO -  PROCESS | 32467 | 255.87: Paint for tID: 2 (after interactive)
22:19:16     INFO -  PROCESS | 32467 | 224.815: Paint for tID: 2 (after interactive)
22:19:17     INFO -  PROCESS | 32467 | 240.9: Paint for tID: 2 (after interactive)
22:19:19     INFO -  PROCESS | 32467 | 240.385: Paint for tID: 2 (after interactive)
22:19:20     INFO -  PROCESS | 32467 | 227.69: Paint for tID: 2 (after interactive)


[0] https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint
Mike, I know you've spent a lot of time squinting at tpaint recently. Any thoughts? :)
Flags: needinfo?(mconley)
Attached patch browser.js MozAfterPaint patch (obsolete) — Splinter Review
This is the patch I used to measure the difference. I applied it and ran tpaint locally, getting the tpaint and browser chrome MozAfterPaint numbers.
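
Conceptually, the instrumentation boils down to listening for the first MozAfterPaint in the new browser window's chrome JS and dumping performance.now() for it (a sketch of the idea only; the actual change is in the attachment):

  // Added to the chrome JS of the newly opened window (sketch):
  window.addEventListener("MozAfterPaint", function onChromePaint() {
    window.removeEventListener("MozAfterPaint", onChromePaint);
    // performance.now() is relative to this window's document context
    // inception, which is what the "Paint for tID" log lines above show.
    dump(window.performance.now() + ": chrome MozAfterPaint\n");
  });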
Simplified patch. STR:

1) Apply the patch on top of m-c
2) ./mach build faster
3) ./mach talos-test --activeTests tpaint --disable-e10s
4) ./mach talos-test --activeTests tpaint
Attachment #8789770 - Attachment is obsolete: true
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #12)
> Questions:
>  * do we want tpaint to test window open + document loading?

I'd think yes. We prefer to measure black-box or high-level things which have some meaningful semantics. Users don't open a window without something in it, be it the home page or the first page or whatever. Therefore IMO it makes sense to measure the entire thing.

Another advantage of measuring high-level things is that it makes the test's semantics and meaningfulness less susceptible to changes in internal modules or implementations.

It's fine to also measure higher-resolution things, but then it's important not to lose the big picture.

It's also fine if we think measuring a different thing could be better due to $whatever reasons.

>  * is it a bug in talos test or in Firefox and does it affect users?

While that's possible, unless the test itself changes constantly (which AFAIK it doesn't), I'd think it's unlikely to be a talos bug that the results it reports seem to regress over time.
(In reply to Avi Halachmi (:avih) from comment #16)
> (In reply to Zibi Braniecki [:gandalf][:zibi] from comment #12)
> > Questions:
> >  * do we want tpaint to test window open + document loading?
> 
> I'd think yes.

The way I understand it is that if we keep it this way, tpaint shouldn't actually be used to measure just changes to browser.xul & friends, because tpaint tests more (and as a result may be affected by other things and be noisier).

Would it be valuable to create a new test that takes the MozAfterPaint from within browser.xul?

> >  * is it a bug in talos test or in Firefox and does it affect users?
> 
> While that's possible, unless the test itself changes constantly (which AFAIK
> it doesn't), I'd think it's unlikely to be a talos bug that the results it
> reports seem to regress over time.

Which indicates that this bug is actually affecting production builds and our users. It would mean that opening a new window (and loading a page into it) slows down over time.
But we now know that just opening a new window is not affected by this bug, so it has to be something related to loading the document.

One hypothesis would be that we grow some cache in which we save the document and this cache slows down document loading over time?

That would, in turn, indicate that it may not be related to the new window at all, and that opening the same page in consecutive tabs would also regress?
We should look for a way to snap a profile early on in the test and then another after numerous window opens; the difference between the two should be pretty obvious.
Pretty clearly not a Talos bug but a real product bug.
Component: Talos → General
Product: Testing → Firefox
Version: Version 3 → unspecified
The tpaint test is pretty carefully constructed: we care about how much time it takes to get content in front of a user, but we don't want this test to depend on most of the network stack, which is noisy. That's why we use a constant data: URI.

It's a good idea to subdivide this test into pieces to measure time to paint chrome and time to paint content separately, in order to diagnose things. But at the end of the day we still want to avoid regressing this "complete" test, since it's the end-to-end time that the user experiences.

I think Jim is right that a comparative profile would tell us a lot. I don't know how to get a comparative profile, though. Do we have API calls that will let us start/stop a profile at particular points in time? Avi, do you know?
Flags: needinfo?(avihpit)
I wasn't involved with tpaint, but tests can, and some do, add markers and/or start/stop the profiler around the test itself to make the analysis easier.
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #21)
> I wasn't involved with tpaint, but tests can, and some do, add markers
> and/or start/stop the profiler around the test itself to make the analysis
> easier.

Avi, can you please try to move forward with comments 18 and 20? i.e. adding markers where needed and collecting the profiles.
Flags: needinfo?(avihpit)
If we do add these markers, it would probably make sense to reconstruct this test as a "perfherder summary test". In that case, we'd track and alert on a summary value (same as now), but also have the times to the specific markers as subtests available when needed.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> I also see bug 1008163, which indicates that this problem has existed for
> quite a while.

Just so we're clear, that bug was filed about the trend that *overall* tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than for fx40), in very slow increments, throughout our release cycles. Because the increments are slow, we don't "catch" regressing bugs.

This is completely orthogonal to the trend you're seeing within an individual test run, where the samples themselves are increasing.

Dumb questions first: have you verified that the perfherder (or other) code doesn't simply sort the results of the individual runs?
(In reply to :Gijs Kruitbosch from comment #24)
> (In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> > I also see bug 1008163 that indicates that this problem exists for quite a
> > while.
> 
> Just so we're clear, that bug was filed about the trend that *overall*
> tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than
> for fx40), in very slow increments, throughout our release cycles. Because
> the increments are slow, we don't "catch" regressing bugs.
> 
> This is completely orthogonal to the trend you're seeing within an individual
> test run, where the samples themselves are increasing.

Eh, I forgot to make my 'point' explicit, which is: this *is* fairly new (ie not something that dates back to 2014 like the bug you cited). I haven't seen it before. It might be bisectable.

(In reply to :Gijs Kruitbosch from comment #24)
> Dumb questions first: have you verified that the perfherder (or other) code
> doesn't simply sort the results of the individual runs?

I just checked, and it does:

http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/testing/talos/talos/startup_test/tpaint.html#44-48

Note that array.sort() sorts |array| in place in JS, and object (incl. array) values are passed by reference.
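
A trivial illustration (not the test code itself):

  var openTimes = [315.7, 252.6, 283.9];
  var sorted = openTimes.sort(function(a, b) { return a - b; });
  // sorted === openTimes: sort() mutates the array in place, so openTimes is
  // now [252.6, 283.9, 315.7] and any later dump of it comes out sorted.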

http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/testing/talos/talos/startup_test/tpaint.html#66-72

  med = calcMedian(openTimes);

  if (auto) {
    dumpLog("__start_report" + openTimes.join('|') + "__end_report");
    var now = (new Date()).getTime();
    dumpLog("__startTimestamp" + now + "__endTimestamp\n");
    dumpLog("openingTimes="+openTimes.slice(1)+"\n");

If I copy that last line and run it before we calculate |med|, I see:

13:21:51     INFO -  PROCESS | 85262 | openingTimes=249.33000000000038,227.4350000000004,276.1400000000003,272.3900000000003,260.5749999999989,265.2049999999999,268.90999999999985,285.119999999999,276.4349999999995,243.51000000000022,227.0649999999987,305.6750000000029,227.96500000000015,219.52499999999782,260.4200000000019,233.73499999999694,244.59000000000015,260.5,244.01499999999942
13:21:51     INFO -  PROCESS | 85262 | __start_report219.52499999999782|227.0649999999987|227.4350000000004|227.96500000000015|233.73499999999694|243.51000000000022|244.01499999999942|244.59000000000015|249.33000000000038|260.4200000000019|260.5|260.5749999999989|265.2049999999999|268.90999999999985|272.3900000000003|276.1400000000003|276.4349999999995|285.119999999999|292.5300000000002|305.6750000000029__end_report__startTimestamp1473682911034__endTimestamp
13:21:51     INFO -  PROCESS | 85262 | openingTimes=227.0649999999987,227.4350000000004,227.96500000000015,233.73499999999694,243.51000000000022,244.01499999999942,244.59000000000015,249.33000000000038,260.4200000000019,260.5,260.5749999999989,265.2049999999999,268.90999999999985,272.3900000000003,276.1400000000003,276.4349999999995,285.119999999999,292.5300000000002,305.6750000000029

which shows both the unsorted and then the sorted list of opening times of the windows. ts_paint, because it's cross-process, can't do this sorting in JS and therefore doesn't exhibit the bug.

In other words, I think this bug is INVALID as filed, though it might be worth fixing the code anyway, to reduce future confusion.
This patch makes the median-calculating code operate on a clone of the array, avoiding the confusing sorted results.
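
One way to do that, sketched out (the exact diff is in the attachment): have the median calculation sort a copy rather than the caller's array.

  function calcMedian(values) {
    // slice() copies the array, so the caller's openTimes keeps its original order
    var sorted = values.slice().sort(function(a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return (sorted.length % 2) ? sorted[mid]
                               : (sorted[mid - 1] + sorted[mid]) / 2;
  }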
Attachment #8790253 - Flags: review?(jmaher)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Egh, didn't intend to take this in case we want to do other things in this bug.

Mike, I think it would still be interesting to understand why the e10s version of this test is so much noisier.

Zibi, I'm afraid you might still be looking at a real regression for your patchset.

As to what we "want" tpaint to measure, I expect Joel and Mike have ideas, as I know we recently changed some things for e10s because it wasn't measuring the right thing.


(In reply to Benjamin Smedberg [:bsmedberg] from comment #19)
> Pretty clearly not a Talos bug but a real product bug.

This is extra-ironic because in a sense, this is very much a Talos bug, and yet this move was how I noticed the bug (I watch fx::general). Moving it back, though. :-)
Assignee: gijskruitbosch+bugs → nobody
Status: ASSIGNED → NEW
Component: General → Talos
Product: Firefox → Testing
Summary: tpaint are regressing with each iteration → tpaint results look like they get worse with each iteration because it sorts the results
Comment on attachment 8790253 [details] [diff] [review]
don't sort tpaint results,

Review of attachment 8790253 [details] [diff] [review]:
-----------------------------------------------------------------

good catch Gijs!
Attachment #8790253 - Flags: review?(jmaher) → review+
Keywords: leave-open
(In reply to :Gijs Kruitbosch from comment #24)
> Just so we're clear, that bug was filed about the trend that *overall*
> tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than
> for fx40)

That was my initial understanding as well, which appears to have been wrong.

> Dumb questions first: have you verified that the perfherder (or other) code
> doesn't simply sort the results of the individual runs?

Good catch.


As for profiler start/stop, the test seems to already do that, and also adds a 'tpaint' marker on pause/resume.

However, the test seems intentionally constructed in a very simple way (and rightly so IMO) where it just opens a new window and measures the MozAfterPaint of the result.

It does not lend itself at all to intermediate "checkpoints" where we could add more markers or measure partial progress, because at the high level at which the test runs there are just no intermediate places to observe.

The most we can do here IMO, without completely rewriting the test in a much more complex way, is to also measure from the new window at the first chance we have (i.e. right when the code starts to run rather than waiting for MozAfterPaint).

However, this is quite susceptible to various DOM, JS and other optimizations and is likely to change over time in a way which would make us scratch our heads while trying to understand whether or not a change is actually a regression. Therefore I'd vote against it.


As for causes of the noise, I see that mconley already added GC and CC before the test starts about 6 months ago, so that should have helped.

Another factor which IMO could affect the results and the noise levels is that the result is measured at the opener window (i.e. the child calls opener.childIsOpen() and then the opener takes the end timestamp).

Another possible approach, which might be more correct and possibly less noisy, is to measure the end timestamp at the child itself and send this value when calling the opener. However, this might make it impossible to use performance.now(), which the test currently uses, since its 0 point-in-time is relative to the window.

Depending on the kind of values we're getting, however, Date.now() could offer enough resolution for this case - and it seems that it does: values of 200-300 ms are high enough IMO for Date resolution.
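
Something along these lines, as a sketch of the proposal (illustrative names, not the current test code):

  // Opener side:
  var start = Date.now();
  window.open(testUrl); // whatever tpaint opens today
  function childIsOpen(childEnd) {
    openTimes.push(childEnd - start); // Date.now() values are comparable across windows
  }

  // Child side, on its first MozAfterPaint:
  window.addEventListener("MozAfterPaint", function onPaint() {
    window.removeEventListener("MozAfterPaint", onPaint);
    window.opener.childIsOpen(Date.now()); // end timestamp taken in the child itself
  });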

Mike, thoughts about measuring the end timestamp at the child?
Flags: needinfo?(avihpit)
Coming up for air to say that I've seen this bug and the mailing list thread, but I'm clearing other stuff from my plate today. I hope to get to this later today, fingers crossed.
That's a great find, thanks :gijs!

I thought about the possibility of sorting, but the fact that ts_paint didn't display the regression made me confident that it must be a real regression.
I should have read the source. :)
(In reply to Avi Halachmi (:avih) from comment #30)
> 
> Mike, thoughts about measuring the end timestamp at the child?

I hacked together a patch which passed around a Date.now() instead of window.performance.now(), where the Date.now() was calculated within the opened window, and it didn't seem to improve the noise.

What is needed here, in all honesty, is profiling.

I hacked together a patch that uses TalosContentProfiler.js to profile the tpaint test with e10s. I had hoped to land such a patch back when I was doing e10s tpaint regression stuff, but I never got around to fixing it for the non-profiling case.

I also had to use --symbolsPath to point to my local symbols, which I found a bit odd. Unfortunately, I find that --spsProfile is sometimes broken when I use it, since it's used infrequently, and the next person who uses it usually has to fix it. :/

Anyhow, long story short, here's a profile from my local OS X machine:

https://cleopatra.io/#report=9595dfcc9c5608395e9baff0100c9fabbea85d61

That run gave these results: 295,295,298,303,310,316,323,325,327,339,339,339,348,349,356,368,371,396,398

So that's a spread of about 100ms of difference. Hopefully this is enough for us to track down what the difference is between the 295ms case and the 398ms case.
Flags: needinfo?(mconley)
So this analysis is for the tpaint runs profiled in comment 35.

The two tpaint runs differ by about 114ms. Let's see if we can find out where that's going.

In tpaint run 0, I notice that the content process spends 267ms waiting for the PContent::SendCreateWindow sync message to be responded to by the parent. This is in contrast to tpaint run 8, which spends 201ms doing the same thing, for a difference of 66ms.

So that seems to account for a little over half of the 114ms. 48ms remains. Where is it?

Now looking at the interval between the paint of the content, and the composite / MozAfterPaint in tpaint run 0, the interval is about 101ms. The interval in tpaint run 8 for the same thing is about 52ms, for a difference of 49ms.

Allowing for eyeball-error, my conclusion from this is:

The parent process is spending more time responding to the SendCreateWindow call for tpaint run 0. Also, there's more time between the painting of the content and its composition in tpaint run 0.

So, now the question is, why?
I have some things I'm trying to get into mozilla-central before the uplift, so I'm not able to spend more time on this today, but hopefully these profiles and this analysis is helpful to get someone else started. If there's been no forward movement by the time the uplift is over, ping me and I'll keep looking.
OS: Unspecified → BSDI
OS: BSDI → All
Can we close it?
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #39)
> Can we close it?

Depends if mconley thinks there's something to be gained from continuing to look at the e10s noise here.
Flags: needinfo?(mconley)
Personally, I don't think so.
Flags: needinfo?(mconley)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED