Closed Bug 1265480 Opened 8 years ago Closed 8 years ago

7.03% tp5o responsiveness e10s (windowsxp) regression on push 76e8f6ad9ded (Thu Apr 14 2016)

Categories

(Core :: JavaScript: GC, defect, P3)

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s + ---

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

Talos has detected a Firefox performance regression from push 76e8f6ad9ded. Because you authored one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=877

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32 -u none -t tp5o-e10s[Windows XP] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a tp5o --e10s

Making a decision:
We need your feedback as the patch author to help us handle this regression.
*** Please let us know your plans by Thursday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → JavaScript: GC
Product: Firefox → Core
I pushed to try to bisect this down (xp takes a while on try, this should be ready in 12 hours or so):
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=19623552&fromchange=0488ff56c381&tochange=a9ca7f760697

:terrence, this is your favorite test this cycle of firefox!  as far as I know this is windows xp only, keep that in mind.  Can you help make a decision here?
Flags: needinfo?(terrence)
(In reply to Joel Maher (:jmaher) from comment #1)
> :terrence, this is your favorite test this cycle of firefox!  as far as I
> know this is windows xp only, keep that in mind.  Can you help make a
> decision here?

What exactly does that mean? Does this test run only on WinXP or did it only regress on WinXP?
Flags: needinfo?(terrence)
sorry, it appears to only have regressed on windows xp.
Also, can you please (1) fix the broken link to https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code and (2) include the commit messages in the description? Making me have to |hg log -r| to figure out what's even implicated is super annoying.
ack, thanks for the catch on the broken link, here it is:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

as for the commit messages, I have a script that bisects; it would be nice to update it to include commit messages. that is a good tip.

what I do is look at the try pushes:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=19623552&fromchange=0488ff56c381&tochange=a9ca7f760697

then I match it up to:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=76e8f6ad9ded
(In reply to Joel Maher (:jmaher) from comment #3)
> sorry, it appears to only have regressed on windows xp.

Ugh, it's the same binary on all windows platforms. Does WinXP run on different hardware?
winxp/win7 are the same binary and same hardware, but different OS.  here is a graph of the 3 windows platforms:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,22b942243cce9b43b263b83c473af3256a138e58,1%5D&series=%5Bmozilla-inbound,bf41e17491286132034748a5f95035a0aaf50458,1%5D&series=%5Bmozilla-inbound,3ece050aac95021a51363205ef7747cce7a62b24,1%5D&zoom=1460335708874.058,1461001200000,1.5055762081784394,12.54646840148699

I have been working on bumping up the priority for the winxp jobs so we can get results faster.

here is a link to the machines we use in automation:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

A few of the try jobs are starting to run.
I have little data, but I am leaning towards:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b23a6286c125
(In reply to Joel Maher (:jmaher) from comment #8)
> I have little data, but I am leaning towards:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/b23a6286c125

I'd keep looking: that patch does no work without the later patches that landed.
tracking-e10s: --- → ?
Priority: -- → P3
(In reply to Joel Maher (:jmaher) from comment #10)
> ok, a lot of overlap in the data, with 6 data points each we have:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/86bd74d49e63

Thanks for the testing. We're doing the same amount of work before and after, but split into smaller chunks. The assumption we're making is that the work-stealing queue is basically zero cost. Which is true everywhere else. The one glaring difference on WinXP is the software condition variable emulation. Looks like I need to land the optimizations for this in bug 956899.
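To illustrate the point in the comment above, here is a minimal sketch of why splitting the same work into smaller chunks only stays free if each handoff is cheap. It is not the SpiderMonkey implementation: it uses a plain lock-protected shared queue rather than a real work-stealing queue, and `run_chunked`, its parameters, and the item counts are hypothetical names chosen for illustration. The total work is identical in both runs; only the number of synchronized handoffs changes, which is where a slow synchronization primitive would add cost.

```python
import threading
from collections import deque


def run_chunked(total_items, chunk_size, workers=2):
    """Process total_items units of work in chunks.

    Returns (items_done, handoffs), where handoffs counts how many
    times a worker had to take the lock to fetch a chunk.
    """
    # Pre-split the work into chunks (hypothetical stand-in for GC tasks).
    queue = deque(
        range(start, min(start + chunk_size, total_items))
        for start in range(0, total_items, chunk_size)
    )
    lock = threading.Lock()
    done = 0
    handoffs = 0

    def worker():
        nonlocal done, handoffs
        while True:
            with lock:  # one synchronized handoff per chunk
                if not queue:
                    return
                chunk = queue.popleft()
                handoffs += 1
            processed = sum(1 for _ in chunk)  # stand-in for real work
            with lock:
                done += processed

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, handoffs


if __name__ == "__main__":
    print(run_chunked(1000, 100))  # few large chunks
    print(run_chunked(1000, 10))   # same work, ten times the handoffs
```

If each handoff costs roughly a fixed amount, the overhead grows linearly with the chunk count, so a primitive that is cheap on most platforms but expensive on one would show up as a regression only there.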
great update, I see recent activity on bug 956899, including a patch under review; looking forward to it landing.  This regression will roll into Aurora next week.  We don't need to uplift the fix there, but it would be nice.
This is an absolutely vile hack. It simply disables sweeping and compaction parallelization on winxp. We might want to take it as a temporary measure until the new software CV is landed, assuming it actually gets us back the perf we lost.
oh no, I see build failures on that try push. I'm not sure if it is a bad base; the error wasn't obvious to me from looking at the patch.
this is on aurora now!
Version: unspecified → Trunk
I am checking in to see if there is anything remaining to do here.
Flags: needinfo?(terrence)
I think this is seeing the same CV wakeup ordering issue that Nick saw in his CV landings. In particular, the score jumps back down to where it was a few days later and seems to be fairly bistable around the two scores.
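One rough way to check the "bistable" reading of a score series is to split it into two levels and count how often consecutive runs flip between them. The sketch below is only an illustration under assumptions: `two_level_split` is a hypothetical helper (a one-dimensional two-means split), and the scores in the usage example are made-up numbers, not data from this bug. A one-time step regression would show a single flip, while a bistable series flips repeatedly.

```python
from statistics import mean


def two_level_split(scores, iterations=20):
    """Split scores into a low and a high level (1-D two-means).

    Returns (low_center, high_center, flips), where flips is the number
    of times consecutive scores land on different levels.
    """
    low_c, high_c = min(scores), max(scores)
    if low_c == high_c:
        return low_c, high_c, 0  # all scores identical, nothing to split

    for _ in range(iterations):
        low = [s for s in scores if abs(s - low_c) <= abs(s - high_c)]
        high = [s for s in scores if abs(s - low_c) > abs(s - high_c)]
        low_c, high_c = mean(low), mean(high)

    # Label each score by its nearer center, then count level changes.
    labels = [abs(s - low_c) > abs(s - high_c) for s in scores]
    flips = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return low_c, high_c, flips


if __name__ == "__main__":
    # Hypothetical responsiveness scores, for illustration only.
    scores = [10.0, 10.2, 10.7, 10.8, 10.1, 10.75, 10.05]
    print(two_level_split(scores))
```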
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(terrence)
Resolution: --- → WORKSFORME