Closed Bug 1295292 Opened 3 years ago Closed 3 years ago

tpaint results look like they get worse with each iteration because it sorts the results

Categories

(Testing :: Talos, defect)

Unspecified
All
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zbraniecki, Assigned: Gijs)

References

Details

Attachments

(4 files, 1 obsolete file)

At the moment tpaint is reliably regressing with each run, which means that we really don't know what we're measuring and the results are very questionable.

It also means that increasing the sample size does not reduce the noise and basically makes the whole test very unreliable.

Example results from tpaint:

12:42:19     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [315.7099999999991, 316.119999999999, 319.33000000000175, 322.46500000000015, 325.77000000000044, 326.4350000000004, 330.43499999999995, 330.78499999999985, 331.65999999999985, 331.869999999999, 335.4549999999981, 337.20500000000175, 337.65999999999985, 339.9150000000009, 340.3199999999997, 345.3550000000014, 350.4449999999997, 363.9450000000006, 368.03999999999905, 418.4200000000001], "name": "tpaint", "value": 337.65999999999985}], "extraOptions": ["e10s"], "name": "tpaint"}]}


12:43:24     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [308.84000000000015, 311.9599999999991, 327.14500000000044, 328.3650000000016, 331.6999999999998, 332.9349999999977, 337.01000000000204, 338.6349999999984, 341.3999999999978, 344.0149999999999, 346.0149999999994, 347.744999999999, 348.4200000000001, 349.1749999999993, 354.4750000000022, 356.77000000000044, 366.0600000000013, 377.0600000000004, 391.96999999999935, 420.17999999999984], "name": "tpaint", "value": 348.4200000000001}], "extraOptions": ["e10s"], "name": "tpaint"}]}


I see bug 1179377, whose title indicates a relation to this bug (noisiness), but which then dives into what tpaint really tests vs. what ts_paint really tests.

I also see bug 1008163, which indicates that this problem has existed for quite a while.

While we can argue that the results are fairly reproducible and stable, I would say that the trend indicates an intrinsic error within the test and makes it impossible to increase the sample size in order to get statistically significant results.
Summary: tspaint and tpaint are regressing with each iteration → tpaint are regressing with each iteration
This is how it looks when I increase the number of runs to 100 for tpaint.

I'm testing on Ubuntu 16.04
Joel, can you take a look at this and help me CC the right people? :)
Flags: needinfo?(jmaher)
Attached image ts_paint RUNS=100
btw. I couldn't reproduce the same regression for ts_paint. Here's a chart of 100 runs of ts_paint.
this is a great bug, and could be the cause of a lot of random frustrations (or missed regressions).

I have :jimm as the contact, he might be able to help us find someone to look into this.
Flags: needinfo?(jmaher) → needinfo?(jmathies)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #1)
> Created attachment 8781259 [details]
> chart of two RUNS=100 tpaints
> 
> This is how it looks when I increase the number of runs to 100 for tpaint.
> 
> I'm testing on Ubuntu 16.04

You reproduced here, but..

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #3)
> Created attachment 8781261 [details]
> ts_paint RUNS=100
> 
> btw. I couldn't reproduce the same regression for ts_paint. Here's a chart
> of 100 runs of ts_paint.

couldn't reproduce here? Can you clarify?
Flags: needinfo?(gandalf)
I don't know who can work on this but I'll try and come up with some ideas.
Flags: needinfo?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #5)
> You reproduced here, but..

(...)

> couldn't reproduce here? Can you clarify?


I can reproduce on tpaint, not on ts_paint. Does it make sense?
Flags: needinfo?(gandalf) → needinfo?(jmathies)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #7)
> (In reply to Jim Mathies [:jimm] from comment #5)
> > You reproduced here, but..
> 
> (...)
> 
> > couldn't reproduce here? Can you clarify?
> 
> 
> I can reproduce on tpaint, not on ts_paint. Does it make sense?

Ah, sorry, didn't notice the different tests. thanks.
Flags: needinfo?(jmathies)
Any chance you can take a look soon at this?

It's killing us atm. We have a serious regression on tpaint and we're trying to track it down, but with the current results it may be a phantom, and we can't even vary the sample size to reduce the noise.
Flags: needinfo?(jmathies)
Also, I think it may be e10s only.

Look at those sample results: https://treeherder.mozilla.org/perf.html#/compare?originalProject=larch&originalRevision=7214320961ed&newProject=try&newRevision=2e0cb369c04f&framework=1&showOnlyImportant=0

tpaint opt e10s has 13-14% stdev, tpaint opt has 0.9% stdev.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #9)
> Any chance you can take a look soon at this?
> 
> It's killing us atm. We have a serious regression on tpaint and we're trying
> to track it down, but with the current results it may be a phantom, and we
> can't even vary the sample size to reduce the noise.

Not soon but I can look at it when I have free time. This could be anything since we're opening a new window (ctrl-n) and seeing progressively worse performance with each open.
Flags: needinfo?(jmathies)
We've been doing more work to understand the tpaint test better.

The description of the tpaint test from [0] states that it "tests the amount of time it takes to open a new window."

We found out that this statement is inaccurate. The test covers opening a new window *and* loading a new document into it, and it only measures MozAfterPaint after the document is loaded.
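
Roughly, the flow looks like this (a sketch pieced together from the descriptions in this bug, using illustrative names; it is not the actual tpaint.html source):

  // Opener side (sketch):
  var start = window.performance.now();
  var win = window.open("data:text/html,<html>...</html>"); // constant data: URI, no network
  // The new window loads that document, waits for its first MozAfterPaint,
  // and then calls back into the opener:
  function childIsOpen() {
    openTimes.push(window.performance.now() - start); // covers window open *and* document load/paint
    win.close();
  }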

This matters for this bug because we discovered that the steady regression is not visible when measuring the first MozAfterPaint inside chrome JS (browser.js loaded by browser.xul), but is visible from within the content document's JS.

What's more, loading that document (fed in as a "data:" URL) takes time. The MozAfterPaint of the document itself comes in at ~80ms (measured from the inception of the document context to the first MozAfterPaint).

There's an overlap with the chrome UI load, but here are the numbers we got on my machine:

* tpaint - 269ms
* browser.js MozAfterPaint - 229ms
* content document MozAfterPaint - 80ms

And, as I said before - browser.js MozAfterPaint doesn't have the slowdown on every window open. 

Questions:
 * do we want tpaint to test window open + document loading?
 * why does window.open + document loading have this bug, while window.open by itself doesn't?
 * is it a bug in talos test or in Firefox and does it affect users?

Since tpaint is used to inform all our work on the Firefox UI, I believe we should answer those questions rather urgently.

Here's raw data:

tpaint:

22:30:23     INFO -  PERFHERDER_DATA: {"framework": {"name": "talos"}, "suites": [{"subtests": [{"replicates": [250.1550000000002, 251.15000000000055, 252.6949999999997, 252.71500000000015, 254.0649999999987, 255.54999999999927, 257.6399999999994, 262.6650000000009, 267.1550000000025, 267.75, 268.97000000000116, 269.15499999999884, 269.9449999999997, 270.75, 273.8650000000007, 273.9399999999987, 276.9349999999995, 277.15999999999985, 281.52499999999986, 283.994999999999], "name": "tpaint", "value": 269.9449999999997}], "name": "tpaint"}]}

browser.js MozAfterPaint (printing performance.now() which measures from document context inception):

22:18:49     INFO -  PROCESS | 32467 | 746.78: Paint for tID: 3 (after interactive)
22:18:51     INFO -  PROCESS | 32467 | 242.56: Paint for tID: 2 (after interactive)
22:18:52     INFO -  PROCESS | 32467 | 253.21: Paint for tID: 2 (after interactive)
22:18:54     INFO -  PROCESS | 32467 | 254.315: Paint for tID: 2 (after interactive)
22:18:55     INFO -  PROCESS | 32467 | 240.74: Paint for tID: 2 (after interactive)
22:18:57     INFO -  PROCESS | 32467 | 225.4: Paint for tID: 2 (after interactive)
22:18:58     INFO -  PROCESS | 32467 | 247.45000000000002: Paint for tID: 2 (after interactive)
22:19:00     INFO -  PROCESS | 32467 | 243.97: Paint for tID: 2 (after interactive)
22:19:01     INFO -  PROCESS | 32467 | 230.29500000000002: Paint for tID: 2 (after interactive)
22:19:03     INFO -  PROCESS | 32467 | 222.395: Paint for tID: 2 (after interactive)
22:19:05     INFO -  PROCESS | 32467 | 224.035: Paint for tID: 2 (after interactive)
22:19:06     INFO -  PROCESS | 32467 | 247.62: Paint for tID: 2 (after interactive)
22:19:08     INFO -  PROCESS | 32467 | 225.565: Paint for tID: 2 (after interactive)
22:19:09     INFO -  PROCESS | 32467 | 223.055: Paint for tID: 2 (after interactive)
22:19:11     INFO -  PROCESS | 32467 | 241.495: Paint for tID: 2 (after interactive)
22:19:12     INFO -  PROCESS | 32467 | 238.6: Paint for tID: 2 (after interactive)
22:19:14     INFO -  PROCESS | 32467 | 255.87: Paint for tID: 2 (after interactive)
22:19:16     INFO -  PROCESS | 32467 | 224.815: Paint for tID: 2 (after interactive)
22:19:17     INFO -  PROCESS | 32467 | 240.9: Paint for tID: 2 (after interactive)
22:19:19     INFO -  PROCESS | 32467 | 240.385: Paint for tID: 2 (after interactive)
22:19:20     INFO -  PROCESS | 32467 | 227.69: Paint for tID: 2 (after interactive)


[0] https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint
Mike, I know you've spent a lot of time squinting at tpaint recently. Any thoughts? :)
Flags: needinfo?(mconley)
Attached patch browser.js MozAfterPaint patch (obsolete) — Splinter Review
This is the patch I used to measure the difference. I applied it and ran tpaint locally, getting the tpaint and browser chrome MozAfterPaint numbers.
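
Conceptually, the instrumentation boils down to listening for the first MozAfterPaint in the new browser window's chrome JS and dumping performance.now() for it (a sketch of the idea only; the actual change is in the attachment):

  // Added to the chrome JS of the newly opened window (sketch):
  window.addEventListener("MozAfterPaint", function onChromePaint() {
    window.removeEventListener("MozAfterPaint", onChromePaint);
    // performance.now() is relative to this window's document context
    // inception, which is what the "Paint for tID" log lines above show.
    dump(window.performance.now() + ": chrome MozAfterPaint\n");
  });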
Simplified patch. STR:

1) Apply the patch on top of m-c
2) ./mach build faster
3) ./mach talos-test --activeTests tpaint --disable-e10s
4) ./mach talos-test --activeTests tpaint
Attachment #8789770 - Attachment is obsolete: true
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #12)
> Questions:
>  * do we want tpaint to test window open + document loading?

I'd think yes. We prefer to measure black-box or high-level things which have some meaningful semantics. Users don't open a window without something in it, be it the home page or the first page or whatever. Therefore IMO it makes sense to measure the entire thing.

Another advantage of measuring high-level things is that it makes the test's semantics and meaningfulness less susceptible to changes in internal modules or implementations.

It's fine to also measure higher-resolution things, but then it's important not to lose the big picture.

It's also fine if we think measuring a different thing could be better due to $whatever reasons.

>  * is it a bug in talos test or in Firefox and does it affect users?

While that's possible, unless the test itself changes constantly (which AFAIK it doesn't), I'd think it's unlikely to be a talos bug that the results it reports seem to regress over time.
(In reply to Avi Halachmi (:avih) from comment #16)
> (In reply to Zibi Braniecki [:gandalf][:zibi] from comment #12)
> > Questions:
> >  * do we want tpaint to test window open + document loading?
> 
> I'd think yes.

The way I understand it is that if we keep it this way, tpaint shouldn't actually be used to measure just changes to browser.xul & friends, because tpaint tests more (and as a result may be affected by other things and be noisier).

Would it be valuable to create a new test that takes the MozAfterPaint from within browser.xul?

> >  * is it a bug in talos test or in Firefox and does it affect users?
> 
> While that's possible, unless the test itself changes constantly (which AFAIK
> it doesn't), I'd think it's unlikely to be a talos bug that the results it
> reports seem to regress over time.

Which indicates that this bug is actually affecting production builds and our users. It would mean that opening a new window (and loading a page into it) slows down over time.
But we now know that just opening a new window is not affected by this bug, so it has to be something related to loading the document.

One hypothesis would be that we grow some cache in which we save the document and this cache slows down document loading over time?

That would, in turn, indicate that it may not be related to the new window at all, and that opening the same page in consecutive tabs would also regress?
We should look for a way to snap a profile early on in the test and then another after numerous window opens; the difference between the two should be pretty obvious.
Pretty clearly not a Talos bug but a real product bug.
Component: Talos → General
Product: Testing → Firefox
Version: Version 3 → unspecified
The tpaint test is pretty carefully constructed: we care about how much time it takes to get content in front of a user, but we don't want this test to depend on most of the network stack, which is noisy. That's why we use a constant data: URI.

It's a good idea to subdivide this test into pieces to measure time to paint chrome and time to paint content separately, in order to diagnose things. But at the end of the day we still want to avoid regressing this "complete" test, since it's the end-to-end time that the user experiences.

I think Jim is right that a comparative profile would tell us a lot. I don't know how to get a comparative profile, though. Do we have API calls that will let us start/stop a profile at particular points in time? Avi, do you know?
Flags: needinfo?(avihpit)
I wasn't involved with tpaint, but tests can, and some do, add markers and/or start/stop the profiler around the test itself to make the analysis easier.
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #21)
> I wasn't involved with tpaint, but tests can, and some do, add markers
> and/or start/stop the profiler around the test itself to make the analysis
> easier.

Avi, can you please try to move forward with comments 18 and 20? i.e. adding markers where needed and collecting the profiles.
Flags: needinfo?(avihpit)
If we do add these markers, it would probably make sense to reconstruct this test as a "perfherder summary test". In that case, we'd track and alert on a summary value (same as now), but also have the times to the specific markers as subtests available when needed.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> I also see bug 1008163, which indicates that this problem has existed for
> quite a while.

Just so we're clear, that bug was filed about the trend that *overall* tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than for fx40), in very slow increments, throughout our release cycles. Because the increments are slow, we don't "catch" regressing bugs.

This is completely orthogonal to the trend you're seeing within an individual test run, where the samples themselves are increasing.

Dumb questions first: have you verified that the perfherder (or other) code doesn't simply sort the results of the individual runs?
(In reply to :Gijs Kruitbosch from comment #24)
> (In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> > I also see bug 1008163 that indicates that this problem exists for quite a
> > while.
> 
> Just so we're clear, that bug was filed about the trend that *overall*
> tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than
> for fx40), in very slow increments, throughout our release cycles. Because
> the increments are slow, we don't "catch" regressing bugs.
> 
> This is completely orthogonal to the trend you're seeing within an individual
> test run, where the samples themselves are increasing.

Eh, I forgot to make my 'point' explicit, which is: this *is* fairly new (ie not something that dates back to 2014 like the bug you cited). I haven't seen it before. It might be bisectable.

(In reply to :Gijs Kruitbosch from comment #24)
> Dumb questions first: have you verified that the perfherder (or other) code
> doesn't simply sort the results of the individual runs?

I just checked, and it does:

http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/testing/talos/talos/startup_test/tpaint.html#44-48

Note that array.sort() sorts |array| in place in JS, and object (incl. array) values are passed by reference.
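
A trivial illustration (not the test code itself):

  var openTimes = [315.7, 252.6, 283.9];
  var sorted = openTimes.sort(function(a, b) { return a - b; });
  // sorted === openTimes: sort() mutates the array in place, so openTimes is
  // now [252.6, 283.9, 315.7] and any later dump of it comes out sorted.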

http://searchfox.org/mozilla-central/rev/591354c5e5521cf845bf6b6ef44e8b3b0aeda17d/testing/talos/talos/startup_test/tpaint.html#66-72

  med = calcMedian(openTimes);

  if (auto) {
    dumpLog("__start_report" + openTimes.join('|') + "__end_report");
    var now = (new Date()).getTime();
    dumpLog("__startTimestamp" + now + "__endTimestamp\n");
    dumpLog("openingTimes="+openTimes.slice(1)+"\n");

If I copy that last line and run it before we calculate |med|, I see:

13:21:51     INFO -  PROCESS | 85262 | openingTimes=249.33000000000038,227.4350000000004,276.1400000000003,272.3900000000003,260.5749999999989,265.2049999999999,268.90999999999985,285.119999999999,276.4349999999995,243.51000000000022,227.0649999999987,305.6750000000029,227.96500000000015,219.52499999999782,260.4200000000019,233.73499999999694,244.59000000000015,260.5,244.01499999999942
13:21:51     INFO -  PROCESS | 85262 | __start_report219.52499999999782|227.0649999999987|227.4350000000004|227.96500000000015|233.73499999999694|243.51000000000022|244.01499999999942|244.59000000000015|249.33000000000038|260.4200000000019|260.5|260.5749999999989|265.2049999999999|268.90999999999985|272.3900000000003|276.1400000000003|276.4349999999995|285.119999999999|292.5300000000002|305.6750000000029__end_report__startTimestamp1473682911034__endTimestamp
13:21:51     INFO -  PROCESS | 85262 | openingTimes=227.0649999999987,227.4350000000004,227.96500000000015,233.73499999999694,243.51000000000022,244.01499999999942,244.59000000000015,249.33000000000038,260.4200000000019,260.5,260.5749999999989,265.2049999999999,268.90999999999985,272.3900000000003,276.1400000000003,276.4349999999995,285.119999999999,292.5300000000002,305.6750000000029

which shows both the unsorted and then the sorted list of opening times of the windows. ts_paint, because it's cross-process, can't do this sorting in JS and therefore doesn't exhibit the bug.

In other words, I think this bug is INVALID as filed, though it might be worth fixing the code anyway, to reduce future confusion.
This patch makes the median-calculating code operate on a clone of the array, avoiding the confusing sorted results.
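
One way to do that, sketched out (the exact diff is in the attachment): have the median calculation sort a copy rather than the caller's array.

  function calcMedian(values) {
    // slice() copies the array, so the caller's openTimes keeps its original order
    var sorted = values.slice().sort(function(a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return (sorted.length % 2) ? sorted[mid]
                               : (sorted[mid - 1] + sorted[mid]) / 2;
  }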
Attachment #8790253 - Flags: review?(jmaher)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Egh, didn't intend to take this in case we want to do other things in this bug.

Mike, I think it would still be interesting to understand why the e10s version of this test is so much noisier.

Zibi, I'm afraid you might still be looking at a real regression for your patchset.

As to what we "want" tpaint to measure, I expect Joel and Mike have ideas, as I know we recently changed some things for e10s because it wasn't measuring the right thing.


(In reply to Benjamin Smedberg [:bsmedberg] from comment #19)
> Pretty clearly not a Talos bug but a real product bug.

This is extra-ironic because in a sense, this is very much a Talos bug, and yet this move was how I noticed the bug (I watch fx::general). Moving it back, though. :-)
Assignee: gijskruitbosch+bugs → nobody
Status: ASSIGNED → NEW
Component: General → Talos
Product: Firefox → Testing
Summary: tpaint are regressing with each iteration → tpaint results look like they get worse with each iteration because it sorts the results
Comment on attachment 8790253 [details] [diff] [review]
don't sort tpaint results,

Review of attachment 8790253 [details] [diff] [review]:
-----------------------------------------------------------------

good catch Gijs!
Attachment #8790253 - Flags: review?(jmaher) → review+
Keywords: leave-open
(In reply to :Gijs Kruitbosch from comment #24)
> Just so we're clear, that bug was filed about the trend that *overall*
> tpaint/ts_paint has increased over time (ie tpaint for fx44 was higher than
> for fx40)

That was my initial understanding as well, which appears to have been wrong.

> Dumb questions first: have you verified that the perfherder (or other) code
> doesn't simply sort the results of the individual runs?

Good catch.


As for profiler start/stop, the test seems to already do that, and also adds a 'tpaint' marker on pause/resume.

However, the test seems intentionally constructed in a very simple way (and rightly so IMO) where it just opens a new window and measures the MozAfterPaint of the result.

It does not lend itself at all to intermediate "checkpoints" where we could add more markers or measure partial progress, because at the high level at which the test runs there are just no intermediate places to observe.

The most we can do here IMO, without completely rewriting the test in a much more complex way, is to also measure from the new window at the first chance we have (i.e. right when the code starts to run rather than waiting for MozAfterPaint).

However, this is quite susceptible to various DOM, JS and other optimizations and is likely to change over time in a way which would make us scratch our heads while trying to understand whether or not a change is actually a regression. Therefore I'd vote against it.


As for causes of the noise, I see that mconley already added GC and CC before the test starts about 6 months ago, so that should have helped.

Another factor which IMO could affect the results and the noise levels is that the result is measured at the opener window (i.e. the child calls opener.childIsOpen() and then the opener takes the end timestamp).

Another possible approach, which might be more correct and possibly less noisy, is to measure the end timestamp at the child itself and send this value when calling the opener. However, this might make it impossible to use performance.now(), which the test currently uses, since its 0 point-in-time is relative to the window.

Depending on the kind of values we're getting, however, Date.now() could offer enough resolution for this case - and it seems that it does: values of 200-300 ms are high enough IMO for Date resolution.
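
Something along these lines, as a sketch of the proposal (illustrative names, not the current test code):

  // Opener side:
  var start = Date.now();
  window.open(testUrl); // whatever tpaint opens today
  function childIsOpen(childEnd) {
    openTimes.push(childEnd - start); // Date.now() values are comparable across windows
  }

  // Child side, on its first MozAfterPaint:
  window.addEventListener("MozAfterPaint", function onPaint() {
    window.removeEventListener("MozAfterPaint", onPaint);
    window.opener.childIsOpen(Date.now()); // end timestamp taken in the child itself
  });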

Mike, thoughts about measuring the end timestamp at the child?
Flags: needinfo?(avihpit)
Coming up for air to say that I've seen this bug and the mailing list thread, but I'm clearing other stuff from my plate today. I hope to get to this later today, fingers crossed.
That's a great find, thanks :gijs!

I thought about the possibility of sorting, but the fact that ts_paint didn't display the regression made me confident that it must be a real regression.
I should have read the source. :)
(In reply to Avi Halachmi (:avih) from comment #30)
> 
> Mike, thoughts about measuring the end timestamp at the child?

I hacked together a patch which passed around a Date.now() instead of window.performance.now(), where the Date.now() was calculated within the opened window, and it didn't seem to improve the noise.

What is needed here, in all honesty, is profiling.

I hacked together a patch that uses TalosContentProfiler.js to profile the tpaint test with e10s. I had hoped to land such a patch back when I was doing e10s tpaint regression stuff, but I never got around to fixing it for the non-profiling case.

I also had to use --symbolsPath to point to my local symbols, which I found a bit odd. Unfortunately, I find that --spsProfile is sometimes broken when I use it, since it's used infrequently, and the next person who uses it usually has to fix it. :/

Anyhow, long story short, here's a profile from my local OS X machine:

https://cleopatra.io/#report=9595dfcc9c5608395e9baff0100c9fabbea85d61

That run gave these results: 295,295,298,303,310,316,323,325,327,339,339,339,348,349,356,368,371,396,398

So that's a spread of about 100ms of difference. Hopefully this is enough for us to track down what the difference is between the 295ms case and the 398ms case.
Flags: needinfo?(mconley)
So this analysis is for the tpaint runs profiled in comment 35.

The two tpaint runs differ by about 114ms. Let's see if we can find out where that's going.

In tpaint run 0, I notice that the content process spends 267ms waiting for the PContent::SendCreateWindow sync message to be responded to by the parent. This is in contrast to tpaint run 8, which spends 201ms doing the same thing, for a difference of 66ms.

So that seems to account for a little over half of the 114ms. 48ms remains. Where is it?

Now looking at the interval between the paint of the content, and the composite / MozAfterPaint in tpaint run 0, the interval is about 101ms. The interval in tpaint run 8 for the same thing is about 52ms, for a difference of 49ms.

Allowing for eyeball-error, my conclusion from this is:

The parent process is spending more time responding to the SendCreateWindow call for tpaint run 0. Also, there's more time between the painting of the content and its composition in tpaint run 0.

So, now the question is, why?
I have some things I'm trying to get into mozilla-central before the uplift, so I'm not able to spend more time on this today, but hopefully these profiles and this analysis is helpful to get someone else started. If there's been no forward movement by the time the uplift is over, ping me and I'll keep looking.
OS: Unspecified → BSDI
OS: BSDI → All
Can we close it?
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #39)
> Can we close it?

Depends if mconley thinks there's something to be gained from continuing to look at the e10s noise here.
Flags: needinfo?(mconley)
Personally, I don't think so.
Flags: needinfo?(mconley)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED