Closed Bug 778718 Opened 8 years ago Closed 6 years ago

30% Windows Ts regression since 1st March

Categories

(Firefox :: General, defect, major)

x86
Windows 7

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox15 - ---
firefox16 - ---
firefox17 - ---
firefox19 --- affected

People

(Reporter: emorley, Unassigned)

Details

(Keywords: perf, regression, Whiteboard: [ts][snappy])

Attachments

(4 files)

We appear to have gradually regressed Ts on Windows by ~30% since 1st March 2012.

ie:
http://graphs.mozilla.org/graph.html#tests=[[53,131,12]]&sel=none&displayrange=180&datatype=running

(Non-pgo & inbound so as to give as few changesets coalesced between each data point as possible)

I've done some preliminary poking through the graph, but with coalescing, large merges & the fact this is due to many smaller increases rather than one massive regression, I don't think this is going to be straightforward.

I also can't remember off the top of my head what talos changes might have landed in this timeframe that may have caused Ts baselines to change.

CCing release drivers and a few talos-ish people; marking tracking given the size of the regression.
Adding Lawrence and Taras since this feels like something that would be covered by the snappy effort.

If this is truly an iterative slowdown, it's not clear that we should be tracking for a specific Firefox version, and definitely not the version currently stabilizing on Beta (15). Even tracking for 16/17 is a bit dubious, but we'll + for visibility.
Wouldn't slower machines be less likely to be idle? Which would mean newer data might be skewed in telemetry.

In the perfstats, I make out three distinct jumps with some gradual rise. Mar 6, Apr 27, and then the stair step run up starting around Jul 16th.
Whiteboard: [ts]
(In reply to Jim Mathies [:jimm] from comment #3)
> Wouldn't slower machines be less likely to be idle? Which would mean newer
> data might be skewed in telemetry.

This would result in other measures being significantly better in newer telemetry too.

> 
> In the perfstats, I make out three distinct jumps with some gradual rise.
> Mar 6, Apr 27, and then the stair step run up starting around Jul 16th.

Now that I think about it, ts is more focused on hot startups, so this could get lost in telemetry data due to noise from cold startups.
Whiteboard: [ts] → [ts][snappy]
FWIW, the jump-and-fall at the end of July is most likely from bug 778855.
Sounds like we don't fully trust the data, and Taras let us know that there isn't much work here that we'll be able to uplift into branches prior to release. No need to track for release in that case.
We should be able to take a build from March and one from today, run talos locally on a local desktop and see a similar difference.  TS is a very easy to run test.
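A minimal sketch of the comparison suggested above, assuming the Ts samples (in milliseconds) from each local Talos run have already been collected into plain lists. The function name `percent_change` and all sample values are hypothetical placeholders, not measurements from this bug.

```python
from statistics import median


def percent_change(baseline_ms, current_ms):
    """Return the percent change in median Ts from baseline to current."""
    base = median(baseline_ms)
    cur = median(current_ms)
    return (cur - base) / base * 100.0


# Placeholder samples for illustration only; NOT data from this bug.
march_build = [500.0, 510.0, 495.0, 505.0, 500.0]
today_build = [650.0, 660.0, 645.0, 655.0, 650.0]

change = percent_change(march_build, today_build)
print(f"Ts median change: {change:+.1f}%")
```

The median is used rather than the mean so that a single noisy startup does not dominate the comparison.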
(In reply to comment #7)
> We should be able to take a build from March and one from today, run talos
> locally on a local desktop and see a similar difference.  TS is a very easy to
> run test.

It is also very easy to profile.  I don't agree with the assertion that we cannot do anything about this at all.  Has anybody tried yet?
Taras asked me to look into this earlier this week.  I'll try to get builds and play them off against each other.
Assignee: nobody → nfroyd
Setting the tracking flags back on until someone says why comment 6 is correct.
(In reply to Ehsan Akhgari [:ehsan] from comment #10)
> Setting the tracking flags back on until someone says why comment 6 is
> correct.

This was based upon Comment 2 and follow up in email with Taras. Glad to see we now think we can make some gains here.
This bug hasn't become actionable and we're a couple of weeks from release, so untracking for FF16 given that.
We're just over a week away from merging 17 to Beta channel.  Nathan can you look into those builds and see if this bug can become actionable for 17's release?
(In reply to Lukas Blakk [:lsblakk] from comment #13)
> We're just over a week away from merging 17 to Beta channel.  Nathan can you
> look into those builds and see if this bug can become actionable for 17's
> release?

I am uncertain of how much can actually be done here.  I've been looking at a smaller startup regression that happened in the FF 15 timeframe (bug 792939) and it takes a couple of days to analyze regressions over a much smaller range of changes.  If you want something before it goes to Beta, I'd say that's a very very tall task.
I couldn't reproduce this issue on Latest Nightly (2013-01-28) and FF 19b3 on Windows 7 x64.

Can anyone still reproduce this issue on Latest builds of Beta, Aurora or Nightly?
(In reply to comment #15)
> I couldn't reproduce this issue on Latest Nightly (2013-01-28) and FF 19b3 on
> Windows 7 x64.

How did you try to reproduce this?
(In reply to :Ehsan Akhgari (Away 2/7-2/15) from comment #16)
> How did you try to reproduce this?

I compared telemetry histograms between builds from 1st March 2012 and 2013-01-29, and I also created a telemetry metric using the URL from comment 2 for all data after 1st March 2012. In both cases the results are similar.

I don't know exactly how to test this using Talos on Windows; in fact, I don't know exactly which steps to follow, but I can try using:

https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Can you help me in doing that?
that link to "running locally - source code", should tell you all you need for running Ts.
(In reply to Joel Maher (:jmaher) from comment #18)
> that link to "running locally - source code", should tell you all you need
> for running Ts.

What do I need to install on Windows besides Python, Mercurial and MinGW so I can complete all the steps from the URL mentioned in comment 17?
you need mercurial, python, pywin32 package and that should work.  The tests run in a standard windows prompt, not inside Mingw or some other unix'ish shell.  

What problems are you seeing?  I can help you on irc or if you reply with what part of the instructions is not clear or failing.
I tested this issue using the Talos tool on FF 15b2, Nightly 17a1 from the same date as FF 15b2, and FF 19b6. The attachment contains the results I got.
(In reply to Joel Maher (:jmaher) from comment #20)

Thanks for helping me in working with talos.
Attached file Talos Results FF15b2
Sorry for the wrong attachment. These are the correct ones:
> you need mercurial, python, pywin32 package and that should work. 

(talos's setup.py *should* install pywin32 appropriately)
the numbers between 17 and 19 are very noticeable.  There could be issues prior to 17, but with these posted numbers it is large enough that it is worthwhile to look into the issue more.
Joel, is there any way to automatically find the build that caused the regression using Talos? The only way I know to find a regression range is using mozregression and hg bisect.
In this case, as I can see from comment 27, I can use FF 17 as the last good build and FF 19 as the first bad build.
Should the regression range be restricted? As Joel said in comment 27, the numbers between FF 17 and FF 19 are very noticeable. If it should be, how can I restrict it?
Flags: needinfo?
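For illustration, the last-good/first-bad search described above can be sketched as a plain binary search over an ordered list of builds. This is only a sketch under stated assumptions: `first_bad_build`, `measure_ts`, the build names and the timings are all hypothetical, and `measure_ts` stands in for whatever actually runs Ts on a build. It is not how mozregression or hg bisect are implemented.

```python
def first_bad_build(builds, measure_ts, threshold_ms):
    """Binary-search an ordered list of builds for the first "bad" one.

    Assumes builds[0] is known good, builds[-1] is known bad, and that
    once Ts rises above threshold_ms it stays above it.
    """
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure_ts(builds[mid]) > threshold_ms:
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return builds[hi]


# Hypothetical stand-in for real Ts runs; placeholder values only.
fake_results = {"b0": 500, "b1": 502, "b2": 498, "b3": 560, "b4": 565, "b5": 570}
builds = list(fake_results)

print(first_bad_build(builds, fake_results.get, threshold_ms=530))
```

A single threshold only finds the point where Ts crosses that value. For a regression made up of many small increases, as described at the top of this bug, the search would have to be repeated on narrower ranges with different thresholds.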
I think at this point, we have looked into this bug as much as reasonably possible- I vote to close this and focus on current regressions and fixes/enhancements to other parts of the code!
Flags: needinfo?
(In reply to Joel Maher (:jmaher) from comment #31)
> I think at this point, we have looked into this bug as much as reasonably
> possible- I vote to close this and focus on current regressions and
> fixes/enhancements to other parts of the code!

WFM!
Assignee: nfroyd → nobody
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE