Closed Bug 1171653 Opened 9 years ago Closed 8 years ago

4% Linux*/Win7 tp5o regression on Firefox e10s on June 01, 2015 from push baa9c64fea6f

Categories

(Testing :: Talos, defect, P5)

Tracking

(e10s+, firefox41- wontfix, firefox42- affected, firefox43- affected)

RESOLVED WONTFIX

People

(Reporter: jmaher, Assigned: tnikkel)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit baa9c64fea6f.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=baa9c64fea6f&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tsvg.2C_tsvgx

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
our first e10s only regression!  yay for talos.

the problem is that this shows up on m-c only, with a large merge range.  We can't really run talos e10s on try (though we can hack it).

not sure what route to take here.
Flags: needinfo?(mconley)
This was a merge from fx-team. Do we see a similar bump there?
Flags: needinfo?(mconley) → needinfo?(jmaher)
we only run e10s talos on m-c (not fx-team or inbound), and this is the *first* regression that is e10s-only!
Flags: needinfo?(jmaher)
Ugh. Ok.

How possible would it be to try to backfill the missing fx-team e10s talos data from around that date?
Flags: needinfo?(jmaher)
we don't have a way to run the tests on fx-team. Maybe we can add that in and just not schedule it (i.e. create the builders).

:catlee, can we create builders for talos e10s jobs on fx-team (and ideally mozilla-inbound)?  Then when we find an issue we can backfill.  Do let me know how this might work.
Flags: needinfo?(jmaher) → needinfo?(catlee)
hm, I don't think we have a way right now to create the tests and not run them.

could we use seta for this?
Flags: needinfo?(catlee)
There's not much we can do here until we have a changeset. :(
Flags: needinfo?(jmaher)
:catlee, SETA doesn't work on Talos at the moment. If it did, we could apply the same SETA logic to talos on inbound/fx-team for e10s and reduce resources.  Maybe it makes sense to do this.

kmoir, can you weigh in on how much work it might be to apply SETA to the talos builders?
Flags: needinfo?(jmaher) → needinfo?(kmoir)
Well, we would have to change the talos scheduler to use the class that looks at the skipconfig data. And we would need talos data generated by your SETA scripts so we could consume it.  So it does require significant testing, like the previous implementation. However, we do have the code that works for opt and debug tests, so the work should be more on the testing side; implementation shouldn't be that difficult in theory.
Flags: needinfo?(kmoir)
Flags: needinfo?(mconley)
Blocks: 1144120
Is tying this into our current automation practical, given that we'll probably only need to do this once?

I seem to recall MattN had some scripts that let us do some backfilling of talos data when we were working on Australis... MattN, are those scripts still around? Perhaps we could modify them for our purposes.
Flags: needinfo?(mconley) → needinfo?(MattN+bmo)
http://hg.mozilla.org/users/mozilla_noorenberghe.ca/talos-tart/file/b53e872a557f/tart-nightlies.sh
http://hg.mozilla.org/users/mozilla_noorenberghe.ca/talos-tart/file/b53e872a557f/README-TART

A patch to moznightly was also needed so it didn't delete the downloaded build. I believe moznightly is gone now though there is a github issue to bring it back.

I'm on my phone on a plane now so these instructions aren't thorough, but they may get you started. I can help next week at Whistler. I will need to add your machine ID to my server IIRC if you want to post there like the one script does. If you save the result files you can also POST them later.
Flags: needinfo?(MattN+bmo)
Assignee: nobody → mconley
Alright, I don't think it makes much sense to get a bunch of releng or ateam people hacking on making this happen, since this is probably one-time-only.

I've requested a Linux talos machine. My plan is to write a script that will run the tp5o test on the machine for each push to fx-team within the regression range, and report the results.

jmaher - is it possible / advisable for me to have talos report the results from this machine to graph server for analysis? Or should I do the old trick of posting the results to a Google Spreadsheet or into a file for manual analysis?
Flags: needinfo?(jmaher)
it would be just fine to report to graph server as long as you have the branch and machine names correct.
Flags: needinfo?(jmaher)
So the fastest path (at least while at Whistler) seemed to be bisecting with try pushes between sessions.

Just a reminder, this is the regression range that was identified: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2c815cc65cc9&tochange=8707d35414f4


8707d35414f4: Bad

70a376c0f23d: Bad

dc2e19e737c7: Bad

d5adc9e191d7: Bad

09b27d21c789: Bad

507b6aba4555: Bad

d551aa12ebb1: Good

1a955124eccc: Good

9aed76a4ee0b: Good

2c815cc65cc9: Good

My bisection leads me to believe that this was caused by bug 1148582.
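
For reference, the manual bisection above amounts to a binary search over the tested revisions. A minimal sketch in Python, assuming the list above is ordered newest first (as the Bad-then-Good ordering suggests). The `is_bad` callback is hypothetical: in practice each verdict came from a try push and a look at the tp5o numbers, not from a function call.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad revision.

    `revisions` must be ordered oldest to newest, with revisions[0]
    known good and revisions[-1] known bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


# Revisions from this comment, reversed so the oldest comes first
# (assumption: the comment lists them newest first).
REVISIONS = [
    "2c815cc65cc9", "9aed76a4ee0b", "1a955124eccc", "d551aa12ebb1",
    "507b6aba4555", "09b27d21c789", "d5adc9e191d7", "dc2e19e737c7",
    "70a376c0f23d", "8707d35414f4",
]

# Verdicts recorded in this comment; stands in for a real tp5o run.
KNOWN_BAD = {
    "507b6aba4555", "09b27d21c789", "d5adc9e191d7",
    "dc2e19e737c7", "70a376c0f23d", "8707d35414f4",
}

culprit = first_bad(REVISIONS, lambda rev: rev in KNOWN_BAD)
print(culprit)
```

With the verdicts recorded above, the search stops at the oldest revision marked Bad. Mapping that revision to a bug is still a manual step.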
Blocks: 1148582
This was marked m7 so we could at least identify the regressing changeset. I think we've done that now.
I'm guessing this has the same cause as bug 1169756. Pushing

https://hg.mozilla.org/try/rev/20272d58e2e5

to try with whatever other options are needed to reproduce this could confirm.
There's a patch to force e10s enabled for talos by pointing it at an alternative talos repo / revision. That's the best we can do until bug 1174780 is fixed.

I'll do the try push comparison:

Before tn's patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=84968faa49e8

After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=183f5d88ea27
Retriggers in - we have a winner!
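
A before/after comparison like this one boils down to comparing the mean tp5o score across the retriggers of each push. A rough sketch in Python, assuming tp5o reports a page load time where lower is better. The numbers are placeholders for illustration only, not results from these try pushes, and `percent_change` is a hypothetical helper, not part of Talos.

```python
from statistics import mean
from typing import Sequence


def percent_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Percent change of mean(after) relative to mean(before)."""
    base = mean(before)
    return (mean(after) - base) / base * 100.0


# Placeholder scores, one per retrigger. NOT real Talos data.
before_patch = [104.0, 104.0, 104.0]
after_patch = [100.0, 100.0, 100.0]

change = percent_change(before_patch, after_patch)
# A negative value means the score dropped after the patch,
# which is an improvement if lower is better.
print(f"{change:.2f}%")
```

Several retriggers per push matter here because a single Talos run is noisy. A mean over a handful of runs is a crude summary and not a significance test.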
Unfortunately landing that patch would be a correctness regression. See bug 1169756 comment 16 for example.
Hrm. Well, at least we know this is where the bottleneck is. Let me know if / when you've got another patch you'd like to test.
Assignee: mconley → tnikkel
[Tracking Requested - why for this release]: regression in 41
Tracking in 41 because of comment 23.
FF41 does not have e10s enabled by default. Moved tracking to 42 and 43 to ensure this gets attention there.
It doesn't look like it'll be useful to track this any more; I'd like to know though, how we will be testing and prioritizing performance issues when e10s is turned on. From talking with joel it sounds like we have e10s tests turned on for all pushes for talos now and so it will be easier to pinpoint future regressions. 

Should we close this, or is it still useful to leave it open?  Brad,  what do you think?
Flags: needinfo?(blassey.bugs)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #28)
> It doesn't look like it'll be useful to track this any more; I'd like to
> know though, how we will be testing and prioritizing performance issues when
> e10s is turned on. From talking with joel it sounds like we have e10s tests
> turned on for all pushes for talos now and so it will be easier to pinpoint
> future regressions. 
> 
> Should we close this, or is it still useful to leave it open?  Brad,  what
> do you think?

IMO, regressions should track the release they regressed in. If we are saying we don't care to fix this regression then close it as won't fix.
Flags: needinfo?(blassey.bugs)
tnikkel, it may be up to you then. I asked Brad before I noticed you were assigned to the bug. Improving performance would be great of course, and I don't want to close this if you're still intending to work on it.
Flags: needinfo?(tnikkel)
Blocks: e10s-perf
Priority: -- → P5
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(tnikkel)
Resolution: --- → WONTFIX