Closed Bug 1266735 Opened 8 years ago Closed 8 years ago

2.42 - 4.58% tp5o / tp5o % Processor Time (windows7-32, windows8-64, windowsxp) regression on push b3b5be1d2027f915c3c54b966fb41de8444923ee (Thu Apr 21 2016)

Categories

(Core :: Networking, defect)

48 Branch
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla50
Tracking Status
firefox47 --- unaffected
firefox48 --- disabled
firefox49 + disabled
firefox50 --- fixed

People

(Reporter: jmaher, Assigned: u408661)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][necko-active])

Talos has detected a Firefox performance regression from push b3b5be1d2027f915c3c54b966fb41de8444923ee. Since you are the author of one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=961

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32,win64 -u none -t tp5o-e10s[Windows 7,Windows XP,Windows 8],tp5o[Windows 7,Windows XP,Windows 8] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)
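As a concrete example, one common way to push with that syntax (a sketch only; it assumes an hg checkout with a "try" path alias pointing at ssh://hg.mozilla.org/try in your .hgrc, and that amending the top commit's message is acceptable in your workflow):

hg commit --amend -m "try: -b o -p win32,win64 -u none -t tp5o-e10s[Windows 7,Windows XP,Windows 8],tp5o[Windows 7,Windows XP,Windows 8] --rebuild 5"
hg push -f try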

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a tp5o

(add --e10s to run tests in e10s mode)
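For example (a sketch; the path to the firefox binary is just a placeholder for wherever your local or downloaded build lives):

talos --develop -e ~/builds/firefox/firefox -a tp5o
talos --develop -e ~/builds/firefox/firefox -a tp5o --e10s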

Making a decision:
Since you are the patch author, we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Networking
Product: Firefox → Core
:hurley, can you take a look at this? I assume this wasn't intentional; possibly there is an easy fix or an easy explanation :)
Flags: needinfo?(hurley)
I'm not surprised by this (I expected a talos regression of some sort). Basically, this is the same thing that happened when we landed the original predictor (which that push added more features to) - we've added work that, in the talos world, gives us no improvement (in this case, we start prefetching things after so many loads of a page). So yeah, some extra CPU time is not in the least bit surprising. We didn't regress the actual pageload time (even though we're moving more bits around) because localhost has so much bandwidth available, our extra bit-moving doesn't matter there. But yeah, extra work to determine what to prefetch, plus the work of actually prefetching will use extra CPU cycles.

:jmaher, does this sound reasonable? I'll spend some time over the next few days figuring out if there's anything particularly egregious that I can fix that could reduce the impact, but I still expect there to be some impact no matter what.
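
If anyone wants to confirm locally that the prefetch work is what's showing up here, flipping the predictor prefs and re-running tp5o should make the extra processor time disappear. A sketch of the relevant about:config prefs, from memory - please double-check the names against modules/libpref/init/all.js before relying on them:

network.predictor.enabled           (master switch for the whole predictor)
network.predictor.enable-prefetch   (controls just the prefetch behavior added in this push)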
Flags: needinfo?(hurley)
Thanks for the quick reply, :hurley! This makes a lot of sense; explaining regressions is half the battle. While these regressions are not large, they show up on Windows and I am not seeing them on OSX/Linux, so if you do think of something to reduce the impact, that would be really cool. That said, I don't think it would be useful to spend a week of your time focusing on this and trying to optimize it with a bunch of hacks.
There's nothing Windows-specific about the code I landed. I could easily imagine, however, Windows' socket implementation (or something else in the layers underneath us) being less efficient than the OSX/Linux implementations, which would account for the fact that we see this regression on Windows only. Unfortunately, I don't (currently) have a Windows machine available to me, but I'm working on it :) Once I get one, I should have some cycles to at least run this through a profiler to see what (if anything) can be done. Perhaps I could tweak prefetch parallelism on Windows, for example, to reduce the amount of work we do that competes with everything else. Just a thought, though; I don't want to get too crazy without having some real data about where the actual CPU usage lies.
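
Once profiling is possible, the cheapest way to get profile data out of the Talos runs themselves is the mozharness option mentioned in the bot's instructions above, appended to the try syntax. A sketch, narrowed to one platform to keep the runs cheap:

try: -b o -p win32 -u none -t tp5o[Windows 7] --rebuild 5 mozharness: --spsProfile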
Assignee: nobody → hurley
Whiteboard: [talos_regression] → [talos_regression][necko-active]
this is now on aurora, so we should consider fixes for 48 (aurora) and 49 (trunk).
No longer relevant on aurora - prefetch has been disabled there (bug 1268208)
will we see this on beta when we uplift?
No, the feature has been disabled to give it a full nightly cycle before continuing along the trains.
Blocks: 1267562
Tracking for 49 since this is a recent regression.  Marking as disabled for 48 too.
With the exception of the processor time, this regression seems to be fixed by bug 1273882; let's mark this as resolved fixed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED