Closed Bug 1284370 Opened 4 years ago Closed 4 years ago

9.89 - 48.25% tp5o responsiveness (windowsxp) regression on push bb16dc5e693602f19ade6b212d8babd6d038b918 (Thu Jun 30 2016)

Categories

(Core :: JavaScript Engine, defect)

x86
Windows XP
defect
Not set

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s ? ---

People

(Reporter: wlach, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Hi Nick, could you please have a look at this? I haven't done any retriggers but there don't seem to be any other likely culprits for this regression. Strangely this only seems to occur on Windows XP -- Windows 7 is fine.

I have marked this as blocking 1283245 but have not confirmed that this is related to that, and not one of the other bugs in the push. We can experiment with backing out specific patches and seeing what the effect is but I thought I'd file the bug first to see if you had any initial thoughts.

--

Talos has detected a Firefox performance regression from push bb16dc5e693602f19ade6b212d8babd6d038b918. As author of one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  tp5o responsiveness windowsxp opt e10s - 28.97% worse
  tp5o responsiveness windowsxp pgo e10s - 48.25% worse
  tp5o responsiveness windowsxp pgo - 9.89% worse
  tp5o responsiveness windowsxp opt e10s - 37.76% worse


You can find links to graphs and comparsion views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1672

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Flags: needinfo?(nfitzgerald)
Blocks: 1263289
Whiteboard: [devtools-html] [triage]
tracking-e10s: --- → ?
Posting snippets from IRC for posterity while I dig in further:

> 10:31 AM <terrence> fitzgen: I would not assume off hand that the new code is either slower or less fair than our old code
> 10:31 AM <terrence> fitzgen: I think it's quite likely that the test is encoding a dependency on a broken [condition variable] implementation
> 10:32 AM <terrence> fitzgen: our best bat may be to compare actual performance numbers across platforms to see if it's an actual slowdown or just if the wakeup order changed
> 10:34 AM <terrence> fitzgen: note that the regression is from 5 to 7, 7 to 10 and 45 to 49...  presumably milliseconds, so in absolute terms it's not a catastrophe for our remaining 2% winxp users even if we do decide to take a real slowdown
> 10:34 AM <terrence> fitzgen: also, it's only on this one test
> 10:35 AM <terrence> fitzgen: given how much we use [condition variables], I'd expect every test to be slower if it was not a wakeup order issue
> 10:35 AM <terrence> fitzgen: and given that every winxp [condition variable] impl I've read has been less fair than our new impl, I highly doubt that the new code is objectively worse
Flags: needinfo?(nfitzgerald)
Is the definition of tp5o in the separate talos repo/package? The units seem to be ms or latency when switching tabs, or something like that? Can I get access to the talos repo/package?
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #2)
> Is the definition of tp5o in the separate talos repo/package? The units seem
> to be ms or latency when switching tabs, or something like that? Can I get
> access to the talos repo/package?

The definition is at the main repo, but the test files are not. You need tp5n.zip . see https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5n_pages_set
Looking at [0] and [1], the results seem to be bimodal, which might give credence to the this-could-be-wakeup-ordering-related hypothesis: each mode could be the result of a different wakeup ordering, one which happens to benefit the test more.

[0] https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,22b942243cce9b43b263b83c473af3256a138e58,1,1%5D&series=%5Bfx-team,22b942243cce9b43b263b83c473af3256a138e58,1,1%5D&series=%5Bmozilla-central,22b942243cce9b43b263b83c473af3256a138e58,1,1%5D&selected=%5Bmozilla-inbound,22b942243cce9b43b263b83c473af3256a138e58,33890,31099793,1%5D

[1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=%5Bmozilla-inbound,22b942243cce9b43b263b83c473af3256a138e58,1%5D&series=%5Bmozilla-central,22b942243cce9b43b263b83c473af3256a138e58,0%5D&series=%5Bfx-team,22b942243cce9b43b263b83c473af3256a138e58,0%5D&selected=%5Bmozilla-inbound,22b942243cce9b43b263b83c473af3256a138e58%5D

> 10:32 AM <terrence> fitzgen: our best bat may be to compare actual performance numbers across platforms to see if it's an actual slowdown or just if the wakeup order changed

What follows is a breakdown between winxp and other platforms running these tests.

## tp5o responsiveness opt e10s:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,22b942243cce9b43b263b83c473af3256a138e58,1,1%5D&series=%5Bmozilla-central,17776dbfe42bfdd7473bf1712f41ff2fdafcf4bb,1,1%5D&series=%5Bmozilla-central,3ece050aac95021a51363205ef7747cce7a62b24,1,1%5D&series=%5Bmozilla-central,bf41e17491286132034748a5f95035a0aaf50458,1,1%5D

winxp looks like it went from in the middle of the herd to the slowest of the platforms running tp5o responsiveness pgo, but by a very thin margin and within others' range of scores.

## tp5o responsiveness pgo:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,69600f98d411b5fde1ea437bb7e2d19ec9696b62,1,1%5D&series=%5Bmozilla-central,77bb16a7e7821467fe581797abe86744ad1519f7,1,1%5D&series=%5Bmozilla-central,26940bc490dde1433819e0eed0a7a6258cb9bc88,1,1%5D&series=%5Bmozilla-central,7809f0aef254efd820d1d3fdb711262797f85e64,1,1%5D

ditto

## tp5o responsiveness pgo e10s:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,5ec2754e398f6f440316bd82ff738cb21ba9ff70,1,1%5D&series=%5Bmozilla-central,cb7337b04c6d9c8e9e0c83272887d05ef94c0de6,1,1%5D&series=%5Bmozilla-central,3715ed80f45aebf89c53be5eceb7698f14f9493d,1,1%5D&series=%5Bmozilla-central,3cc5ddc12d27748f13e6ac101f094dfca2df1470,1,1%5D

Same story, but winxp is even closer to the others than before.
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #4)

Given that it seems fairly likely that the tp5o responsiveness tests were sensitive to condition variable wakeup order, and that winxp's performance isn't non-normative compared to other platforms, I think we could take the regression.

Terrence: what do you think?
Flags: needinfo?(terrence)
FYI, in case this helps: I did some bisection on try for tp5o responsiveness opt e10s, and the results:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,22b942243cce9b43b263b83c473af3256a138e58,1,1%5D&zoom=1467660068591.113,1467660436000,6.265769992628058,11.622603053274668

Point to this commit (in the changeset of 9 pushes) causing the actual perf regression:

c62dc1a368c8	Nick Fitzgerald — Bug 1283245 - Use js::Mutex and js::ConditionVariable instead of PRLock and PRCondVar in HelperThreads; r=terrence
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #5)
> (In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #4)
> 
> Given that it seems fairly likely that the tp5o responsiveness tests were
> sensitive to condition variable wakeup order, and that winxp's performance
> isn't non-normative compared to other platforms, I think we could take the
> regression.
> 
> Terrence: what do you think?

The PGO numbers, in particular the PGO e10s number (i.e. what we ship), show winXP ahead of win7 as often as not, full stop. There is no real regression here, so we can close. That said, I also think this level of regression would be worth shipping on a niche platform given the other gains that we are getting from the change.
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(terrence)
Resolution: --- → WONTFIX
No longer blocks: 1263289
Whiteboard: [devtools-html] [triage]
thanks for figuring this out and coming to a resolution!
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.