18.75 - 25.73% tp6_facebook (windows10-64) regression on push 467d285e001c (Tue Jan 23 2018)

RESOLVED FIXED in Firefox 60

Status

Type: defect
Priority: P3
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Last modified: 2 years ago

People

(Reporter: igoldan, Assigned: jonco)

Tracking

(Blocks 1 bug, {perf, regression, talos-regression})

Version: unspecified
Target Milestone: mozilla60
Hardware: Unspecified
OS: Windows 10
Points: ---

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox58 unaffected, firefox59 unaffected, firefox60 fixed)

Details

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=95b09e6e9613435bd1934a274cbeeefe23e3bab1&tochange=467d285e001c568039af5e3e067cff11c7ac43cf

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

 26%  tp6_facebook windows10-64 pgo e10s           283.31 -> 356.21
 19%  tp6_facebook windows10-64 pgo 1_thread e10s  276.38 -> 328.21


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=11260

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
:jonco This regression is pretty visible on PGO builds. On OPT builds it's hard to notice, though a very small increase still seems to be there.

It's possible this is caused by the PGO build process itself. If that's the case, there's a chance this regression will go away on its own.

Please investigate bug 1431353 and let us know whether any performance penalties are expected.
Flags: needinfo?(jcoppeard)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
This was expected to improve performance, if anything, by using more JS helper threads for parsing.

It's possible, however, that this is affecting the scheduling of other tasks run on JS helper threads.  It might be that parsing is now blocking GC tasks and holding things up.  Maybe we should allow higher-priority tasks to interrupt lower-priority ones.
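For context, the kind of priority-aware dispatch being suggested could look roughly like this. This is a hypothetical sketch, not SpiderMonkey's actual helper-thread code; the task struct, priority values, and function names are invented for illustration:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical helper-thread task: a higher priority value means more urgent
// (e.g. GC tasks would outrank parse tasks).
struct HelperTask {
    int priority;
    int id;
};

struct ByPriority {
    bool operator()(const HelperTask& a, const HelperTask& b) const {
        return a.priority < b.priority;  // makes the queue a max-heap on priority
    }
};

using TaskQueue =
    std::priority_queue<HelperTask, std::vector<HelperTask>, ByPriority>;

// A helper thread always picks the highest-priority pending task, so a
// queued GC task is never stuck behind a backlog of parse tasks.
HelperTask pickNextTask(TaskQueue& pending) {
    HelperTask next = pending.top();
    pending.pop();
    return next;
}
```

This only reorders dispatch; letting a high-priority task actually interrupt an already-running lower-priority one (as suggested above) would additionally need a yield point in the running task.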
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
I've reproduced this on try where I get a 300 -> 342 change for Windows 10 PGO builds.

I've tried a few approaches to mitigating this but haven't succeeded so far.  I'm continuing to investigate.
(In reply to Jon Coppeard (:jonco) from comment #3)
I didn't realise that talos ignores the first 5 results and then takes the median of the rest.  My test results actually show 259 -> 365, which is closer to the reported numbers.
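The summarisation behaviour described above can be sketched as follows. This mirrors the description in this comment (drop the first five replicates as warm-up, then take the median of the rest); it is not the actual talos source, and the function name is invented:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Talos-style summary as described above: ignore the first five replicates,
// then take the median of the remainder. If five or fewer replicates exist,
// this sketch just takes the median of all of them.
double talosSummary(std::vector<double> replicates) {
    if (replicates.size() > 5)
        replicates.erase(replicates.begin(), replicates.begin() + 5);
    std::sort(replicates.begin(), replicates.end());
    size_t n = replicates.size();
    return n % 2 ? replicates[n / 2]
                 : (replicates[n / 2 - 1] + replicates[n / 2]) / 2.0;
}
```

This explains why a raw average of all replicates (which includes slow warm-up runs) can differ noticeably from the reported score.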

The source of the problem is contention on the 'exclusive access lock' used to protect the atoms table.  Parsing creates a lot of atoms, and even though we have a per-zone cache (hit rate 70% for this test) concurrent parse tasks spend a lot of time waiting on this lock.

The real fix for this is to allow concurrent modification of the atoms table, possibly by splitting it into many sub-tables each with their own lock (like Java's ConcurrentHashMap).  This would also have the side effect of enabling parallel sweeping of this table.  This is quite a large amount of work however.

There are several smaller things we can do to improve the current situation:
 - reduce the concurrency for off-thread parsing to a more acceptable level
 - reduce the scope where the lock is held while atomizing
 - split the exclusive access lock into several smaller locks

I'll experiment with these first.
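The ConcurrentHashMap-style split described above could be sketched like this: one logical table divided into N sub-tables, each guarded by its own lock, so concurrent parse tasks atomizing different strings rarely contend. All names here are invented for illustration; this is not SpiderMonkey's atoms-table code:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>

// Striped (sharded) atom table: each shard has its own lock, replacing the
// single global 'exclusive access lock'. Two threads only contend when their
// strings hash to the same shard.
template <size_t NumShards = 16>
class StripedAtomTable {
    struct Shard {
        std::mutex lock;
        std::unordered_set<std::string> atoms;
    };
    Shard shards_[NumShards];

    Shard& shardFor(const std::string& s) {
        return shards_[std::hash<std::string>{}(s) % NumShards];
    }

  public:
    // Returns true if the string was newly atomized, false if it was
    // already present.
    bool atomize(const std::string& s) {
        Shard& shard = shardFor(s);
        std::lock_guard<std::mutex> guard(shard.lock);  // per-shard, not global
        return shard.atoms.insert(s).second;
    }
};
```

A per-shard layout would also enable the parallel sweeping mentioned above, since each shard can be swept independently under its own lock.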
The patch in bug 1432794 shows a large improvement on try.  My theory is that although it doesn't remove the contention, it stops it from affecting the main thread, by removing the prototype/constructor initialisation that happened on the main thread and atomized a load of strings.

I'm going to land that patch and see if that improves things.
(In reply to Jon Coppeard (:jonco) from comment #5)
This seems to me to have been fixed by the patch in bug 1432794.  What do you think?
Flags: needinfo?(igoldan)
(In reply to Jon Coppeard (:jonco) from comment #6)
> (In reply to Jon Coppeard (:jonco) from comment #5)
> This seems to me to have been fixed by the patch in bug 1432794.  What do
> you think?

I agree. The regressions on PGO builds are definitely fixed.
Still, there's something wrong with the OPT builds from [1] and [2]: the tests became noisier, as the min-max value range increased. Do you know why that might happen?
I will do some retriggers on the OPT builds.

[1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-inbound,1535542,1,1&series=mozilla-inbound,1535578,1,1
[2] https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=mozilla-inbound,1552299,1,1&series=mozilla-inbound,1550394,1,1
Flags: needinfo?(igoldan)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #7)
Interesting, I hadn't seen that.  I don't know why that would be.
Priority: -- → P3
:jmaher I want to close this as FIXED, unless you're worried that the OPT tests are now too noisy.
Flags: needinfo?(jmaher)
this sounds good to me.
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Depends on: 1432794
Target Milestone: --- → mozilla60