Bug 1295396 (Closed): Opened 8 years ago, Closed 8 years ago

2.05 tp5o (windows7-32) regression on push 02016837381082d203059b4cc193f5f4ced3ef50 (Fri Jul 22 2016)

Categories

(Firefox :: Untriaged, defect)

50 Branch
defect
Not set
normal

Tracking


RESOLVED WONTFIX
Tracking Status
firefox48 --- unaffected
firefox49 --- unaffected
firefox50 --- affected
firefox51 --- affected

People

(Reporter: ashiue, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push 02016837381082d203059b4cc193f5f4ced3ef50. As you are the author of one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  tp5o summary windows7-32 pgo e10s: 248.84 -> 253.94 (2.05% worse)


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1985

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running
After doing some retriggers, here is a zoomed-in view that shows it better:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,1,1%5D&zoom=1469114706249.2698,1469126253778.5845,228.9126188527439,265.36307270248767&selected=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,34728,33195019,1%5D

We found that the problem push is 3baa895f4402.
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=2f91720817ab2f49b2e61460d92a28220868d73e&tochange=3baa895f4402d11258e5801f41cc259b58f4b4d5

In order to find the specific changeset that might have caused the perf issue, I did a bisection on the try server:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Btry,f8524944919a2706fa1bcccc34af9f4ccacd84f7,1,1%5D&zoom=1470795703458.3064,1470798035490.4482,245.52742350688078,259.9489126472228

According to the above graph, the upward trend starts at 7e0b4b253b95; therefore, we suspect changeset d03640d31c5f is causing this issue.

Hi Michael, as you are the patch author, can you take a look at this and determine what the root cause is?
Although the Aurora merge did not fire this alert, we think it is still worth noting.
Thanks!
Blocks: 1280481, 1285272
Flags: needinfo?(mili)
Given that those patches only affect things when accessibility is enabled, and that shouldn't be true for this test, it seems very unlikely those patches are responsible.
Doing a bisection on try did a pretty good job of pinpointing this:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d03640d31c5f

Looking at the raw data for the comparison over the larger range we see on inbound:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=2f91720817ab2f49b2e61460d92a28220868d73e&newProject=mozilla-inbound&newRevision=3baa895f4402d11258e5801f41cc259b58f4b4d5&originalSignature=e9786d3bc11a5fff1fa013567494164638d623e9&newSignature=e9786d3bc11a5fff1fa013567494164638d623e9&framework=1

I see a larger pile of regressed pages; looking at the bisection on the try server, I see a subset of those pages:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=1be903294d486712e61af2b34e20b4ca80302329&newProject=try&newRevision=7e0b4b253b95bf9991ec41d20449930898780b60&originalSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&newSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&framework=1

Does accessibility become enabled via a pref or other user action, or are there components that are activated on general pages?  This is restricted to a small regression on a single platform and single test; possibly this is just something odd with the way we do pgo optimizations, as that is not as deterministic but depends on a variety of factors including the size of the libraries.
(In reply to Joel Maher ( :jmaher) from comment #3)
> Doing a bisection on try did a pretty good job of pinpointing this:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/d03640d31c5f

That precise patch?  That seems really odd; assuming a vaguely competent compiler, that change shouldn't affect generated code at all.  It might even be interesting to compare the generated code before and after the change; I'd expect it's identical.
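For illustration, one hedged way to do that comparison locally, assuming Windows builds of both revisions are at hand and the MSVC dumpbin tool is on PATH (the object file path is only an example, not a known location):

  dumpbin /disasm objdir-before\accessible\base\ARIAMap.obj > before.asm
  dumpbin /disasm objdir-after\accessible\base\ARIAMap.obj > after.asm
  fc before.asm after.asm

If the output really is identical, that would point at PGO-level effects rather than the patch's own code.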

> I see a larger pile of regressed pages, looking at the bisection on try
> server, I see a subset of those pages:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=try&originalRevision=1be903294d486712e61af2b34
> e20b4ca80302329&newProject=try&newRevision=7e0b4b253b95bf9991ec41d20449930898
> 780b60&originalSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&newSignatur
> e=f8524944919a2706fa1bcccc34af9f4ccacd84f7&framework=1
> 
> Does accessibility become enabled via a pref or other user action, or are
> there components that are activated on general pages?  This is restricted to
> a small regression on a single platform and single test; possibly this is
> just something odd with the way we do pgo optimizations, as that is not as
> deterministic but depends on a variety of factors including the size of the
> libraries.

It's possible, I suppose, that somehow these changes caused other code to get less well optimized, and so we regressed the test a little bit.  However, I find that rather hard to believe.
(In reply to Trevor Saunders (:tbsaunde) from comment #4)
> (In reply to Joel Maher ( :jmaher) from comment #3)
> > Doing a bisection on try did a pretty good job of pinpointing this:
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/d03640d31c5f
> 
> That precise patch?  That seems really odd; assuming a vaguely competent
> compiler, that change shouldn't affect generated code at all.

Confirming that I have seen this popping up in profiling (HasGenericType and ARIA map stuff).

> It might even
> be interesting to compare the generated code before and after the
> change; I'd expect it's identical.

Might not be identical, then.
I'm not sure why Windows 7 was specifically slower in this case; I expected the tests for the other OSes, like Windows 8/XP, to be slower too.
That being said, this slowness might be similar to what's happening in bug 1290776, where Accessible::HasGenericType uses up a lot of time, and that function calls GetRoleMapFromIndex. It would be nice to figure out why HasGenericType is being called so often and what to do about it, as that seems to be the root of the problem.
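To make that concrete, here is a hedged, self-contained C++ sketch of the pattern being described; the names echo the ones above, but the types and bodies are simplified stand-ins, not the actual Gecko implementation:

  #include <cstdint>

  using AccGenericType = std::uint32_t;

  // Simplified stand-in for an ARIA role map entry.
  struct RoleMapEntry {
    AccGenericType genericTypes;
    bool IsOfType(AccGenericType aType) const { return (genericTypes & aType) != 0; }
  };

  // Stand-in for GetRoleMapFromIndex(): a table lookup done on every call.
  static const RoleMapEntry kRoleMap[] = {{0x1}, {0x2}, {0x4}};
  const RoleMapEntry* GetRoleMapFromIndex(int aIndex) {
    const int count = static_cast<int>(sizeof(kRoleMap) / sizeof(kRoleMap[0]));
    return (aIndex < 0 || aIndex >= count) ? nullptr : &kRoleMap[aIndex];
  }

  class Accessible {
   public:
    // The hot predicate: every call re-consults the role map.  If a page load
    // hits this for many nodes, the lookups add up in profiles even though each
    // individual call is cheap; caching the role-derived bits alongside
    // mGenericTypes at construction would reduce this to a single bit test.
    bool HasGenericType(AccGenericType aType) const {
      const RoleMapEntry* entry = GetRoleMapFromIndex(mRoleMapIndex);
      return (mGenericTypes & aType) || (entry && entry->IsOfType(aType));
    }
   private:
    AccGenericType mGenericTypes = 0;
    int mRoleMapIndex = -1;
  };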
Flags: needinfo?(mili)
I did the talos tests on try reverting the first of the two patches, here's a comparison of central/default from two days ago and central/default with that patch reverted: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0ac38eeaf6b8&newProject=try&newRevision=3be8778b9238&framework=1&showOnlyImportant=0
tp5o opt e10s win7-32 slower with the patch reverted by 0.26%, low confidence. Full results of that test: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=0ac38eeaf6b8&newProject=try&newRevision=3be8778b9238&originalSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&newSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&framework=1
No significant results here.

I also did the talos tests reverting both patches, here's a comparison of central/default and central/default with both patches reverted: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0ac38eeaf6b8&newProject=try&newRevision=50bb9006823b&framework=1&showOnlyImportant=0
tp5o opt e10s win7-32 slower with both patches reverted by 0.63%, low confidence. Full results of that test: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=0ac38eeaf6b8&newProject=try&newRevision=50bb9006823b&originalSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&newSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&framework=1
Again, no significant results, so I'm not sure what caused the slowness reported in the regression and whether it was my patches that caused it.
I believe you need pgo builds on try:
https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build

To reduce the usage on try, you can just do this:
try: -b o -p win32 -u none[Windows 7] -t tp5o-e10s[Windows 7] --rebuild 5
Oh thanks for the info! I wanted to try all the talos tests to see if there were other performance issues originating from my patch.
Hey Joel, I ran ./mach try -b o -p win32 -u none[Windows 7] -t tp5o-e10s[Windows 7] --rebuild 5
and mach said:
"It looks like you passed an unrecognized argument into mach.

The try command does not accept the arguments: 7]"

Looks like mach's not parsing the [Windows 7] part properly, and I've pulled from central. Do you know what might be wrong? Does this look like a bug with mach?
Flags: needinfo?(jmaher)
That doesn't work with mach try; it has to be in the commit message with the syntax from comment 8.
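For example, the usual pattern looks something like this (a hedged illustration; it assumes a "try" path is configured in your hgrc and that you don't mind a throwaway commit carrying the syntax):

  hg commit -m "try: -b o -p win32 -u none[Windows 7] -t tp5o-e10s[Windows 7] --rebuild 5"
  hg push -f try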
Flags: needinfo?(jmaher)
Oh thanks, I thought it would work, since the TryChooser Syntax Builder at http://trychooser.pub.build.mozilla.org/ output that command when I selected the same options.
Reverting my 2 patches caused a 1.4% speed up in the tp5o opt e10s test, a bit less than the 2% originally reported:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=559cf067cf99&newProject=try&newRevision=2f630dc1079c&framework=1&showOnlyImportant=0
The websites that ran a bit faster after reverting were also mostly different from the ones originally reported:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=559cf067cf99&newProject=try&newRevision=2f630dc1079c&originalSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&newSignature=f8524944919a2706fa1bcccc34af9f4ccacd84f7&framework=1
The interesting result, though, is that reverting only the later of the two patches didn't result in much speed-up, though this is with low confidence; the later patch is the one that actually puts the earlier patch into use in other parts of the code, i.e., the earlier patch just added functionality that wasn't yet being used anywhere else in the code.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=559cf067cf99&newProject=try&newRevision=afe14a4aab5c&framework=1&showOnlyImportant=0
Michael, thanks for the analysis here.  Given that this is pgo only, I suspect the addition of the code is causing the library to adjust sizes/thresholds and making the pgo bits different enough.  I would vote for closing this as wontfix; it doesn't look as though there is anything we can really do.
That sounds good to me, thanks!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX