Bug 713055 (closed): opened 13 years ago, closed 11 years ago

roll out mozharness desktop talos to mozilla-central + project branches when ready

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: k0scist, Assigned: jyeo)

References

Details

(Whiteboard: [mozharness+talos][leave open])

Attachments

(10 files, 7 obsolete files)

4.67 KB, patch (mozilla: review+, jyeo: feedback+, armenzg: checked-in+)
1.50 KB, patch (armenzg: review+, armenzg: checked-in+)
660 bytes, patch (jmaher: review+, armenzg: checked-in+)
773 bytes, patch (jmaher: review+, armenzg: checked-in+)
773 bytes, patch (jmaher: review+, armenzg: checked-in+)
774 bytes, patch (jmaher: review+, armenzg: checked-in+)
2.43 KB, patch (jmaher: review+, armenzg: checked-in+)
12.62 KB, patch (jhopkins: review+, armenzg: checked-in+)
9.54 KB, text/plain
9.72 KB, text/plain
[tracking bug]

Goal: get Talos on Mozharness in production

(Secondary goal: generally make Talos nicer to use and more maintainable)

Hard Blockers [P1]:

* Bug 650887 - desktop talos runner in mozharness; This is actually
  checked in and "works", though as Aki notes there may need to be
  further improvements, e.g. configuration for a production setup. We
  should get Aki to file what else needs to be done. Bug 650887 is
  blocked by bug 700722

* Bug 650890 - port remote talos to mozharness; Aki is working on
  this, but he's been busy.  Not sure what the status is.  See also
  bug 713003

* Bug 701506 - create python package webserver; If we're using
  mozharness to install talos into a virtualenv, we will need packages
  from somewhere (a pip sketch is at the end of this comment).

* Bug 700722 - Talos process checking is over-ambitious and wrong; The
  short story: running `python talos_mozharness.py --appname
  /some/path/firefox` will error out since the talos subprocess check
  will detect this command as a running firefox (see the sketch after
  this list).  The patch there does this much better, but breaks on
  Mac, presumably because it is looking for the wrong thing (maybe
  firefox.app?).  I haven't had a chance to investigate what it wants
  to find, but it is reproducible in staging.  All that is needed is
  time to tackle this.

* Bug 713003 - combine desktop talos mozharness and remote talos
  mozharness; strictly speaking, this isn't blocking, but Aki wants it
  and it's probably a good idea

* Bug 713017 - get buildbotcustom to use mozharness + production talos;
  Probably a releng task. Not sure who should head up this effort.
  It is blocked by all of the above (mostly). Of course, this will all
  have to be heavily staged as well.

* Bug 694625 - talos should consume mozprocess; not strictly necessary
  but would be a substantive improvement.
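
For the process-checking problem above (bug 700722), a minimal sketch of the kind of check that avoids the false positive: match running browsers by their executable rather than substring-matching every process command line. This assumes psutil (or an equivalent) is available and is illustrative only, not the actual talos code; on Mac it would additionally need to resolve the binary inside firefox.app.

import os
import psutil  # assumption: psutil or an equivalent is available to talos

def find_running_firefox(binary_path):
    """Return processes actually executing the given binary, ignoring
    processes (like 'python talos_mozharness.py --appname .../firefox')
    that merely mention the path in their arguments."""
    target = os.path.realpath(binary_path)
    running = []
    for proc in psutil.process_iter():
        try:
            exe = proc.exe()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        # compare the process executable, not the full command line
        if exe and os.path.realpath(exe) == target:
            running.append(proc)
    return running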

Soft Blockers [P2]:

None of these are strictly necessary, but they are nice improvements
to take as part of this effort and most of them are low-hanging fruit.

* Bug 705809 - Talos should not depend on scripts being run from the
  talos directory; See this magical workaround:
  http://hg.mozilla.org/build/mozharness/file/be92a9addbd2/scripts/talos_script.py#l91

* Bug 705811 - run_tests.py should become a console_script;
  This goes for PerfConfigurator, etc., too (see the entry-points
  sketch after this list).

* install story: currently we pretend to install pageloader as part of
  setup.py.  However, we don't extract it correctly :( We need to
  repack it, since there is a top-level directory we don't need (bug
  709340).  However, we also have an install script
  (http://hg.mozilla.org/build/talos/file/cb1b7a64f98e/INSTALL.py
  ). Maybe that is a better place to do this?  It is probably worth
  figuring out what to do with the install script.  Currently it
  clones a whole new talos, but it might be worth setting up in place
  if talos is already downloaded/checked out.  See also bug 701490,
  pageloader.xpi does not get installed from easy_install

* Bug 709349 - Getting broken pipe running talos in --develop mode; we
  probably don't want these showing up in production logs.

* Bug 694638 - talos should consume mozprofile
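
For the console_script item above (bug 705811), a minimal sketch of what the entry points could look like in talos' setup.py; the module paths are assumptions, not the actual talos layout:

# setup.py fragment -- illustrative only; module paths are assumptions
from setuptools import setup

setup(
    name='talos',
    # ... other metadata ...
    entry_points={
        'console_scripts': [
            # installed onto PATH by easy_install/pip, so nothing
            # depends on being run from the talos directory
            'talos = talos.run_tests:main',
            'PerfConfigurator = talos.PerfConfigurator:main',
        ],
    },
)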

Also, talos.zip:
https://bugzilla.mozilla.org/show_bug.cgi?id=707218 . We should do
something about that.
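
Finally, for the package webserver item above (bug 701506), roughly what the virtualenv setup could look like once packages are hosted internally; a sketch only, and the index URL is a placeholder:

# Sketch: create a virtualenv and install talos from an internal
# package webserver.  The URL is a placeholder, not a real host;
# POSIX bin/ layout assumed.
import subprocess

FIND_LINKS = "http://pypi.example.internal/packages/"  # placeholder

def create_venv_and_install(venv_dir, packages):
    subprocess.check_call(["virtualenv", venv_dir])
    pip = "%s/bin/pip" % venv_dir
    for package in packages:
        # --no-index + --find-links keeps test slaves off the public PyPI
        subprocess.check_call([pip, "install", "--no-index",
                               "--find-links", FIND_LINKS, package])

# e.g. create_venv_and_install("talos-venv", ["talos"])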
(In reply to Jeff Hammel [:jhammel] from comment #0)
> * Bug 650890 - port remote talos to mozharness; Aki is working on
>   this, but he's been busy.  Not sure what the status is.  See also
>   bug 713003

If this bug is tracking the Q2 goals, this isn't a hard blocker.
The Q2 goal for remote talos is a working proof of concept only, not running in production.

> * Bug 713017 - get buildbotcustom to use mozharness + production talos;
>   Probably a releng task. Not sure who should head up this effort.
>   It is blocked by all of the above (mostly). Of course, this will all
>   have to be heavily staged as well.

This will use the same code paths as peptest + Jordan's unit tests.  I imagine we'll run in parallel with non-mozharness talos for a while.
We'll need config files for talos before buildbot code changes.
I'm removing the "get talos on mozbase" dependencies from this bug.  While we want to do this (soon), this bug is scheduled to be done by EOQ2 and these are not in scope for that
No longer depends on: 694625, 694638
Depends on: 760320
One month left before this is supposed to be "done".  Triaging says:

* bug 650890 - port remote talos to mozharness: not a strict blocker; assigned to Callek

* bug 701506 - create python package webserver: not sure what the status is at the moment and it is unowned, though it looks like most of what RelEng wanted is basically done? We'll have to file bugs to get the packages Talos requires uploaded once this is in a finished state

* bug 713017 - get buildbotcustom to use mozharness + talos: assigned to :aki, not sure what the status is

* bug 760320 - production config files for mozharness Talos : unassigned, filed today

Is there anything else I'm missing here?  The unassigned bugs should probably find owners.  Not sure if we need any changes on the talos or mozharness side, but if so I could take those or otherwise assist in getting any of this done.
(In reply to Jeff Hammel [:jhammel] from comment #3)
> * bug 760320 - production config files for mozharness Talos : unassigned,
> filed today

IMO, this is part of the mozharness script.
I can help you understand mozharness config files if needed.
Aki, what is the status of bug 713017?
I just picked it up on Wednesday and haven't started it.
It will require config files for Talos, and we're blocked on pushing that to production until after we have the python package server.

I have buildduty + a release next week, so I probably won't get to spend much time on it until the week of the 11th.
Depends on: 761809
No longer depends on: 650890
Summary: get Talos on Mozharness in production → get desktop Talos on Mozharness in production
Blocks: 764588
Adding bug 764592 as a blocker.  This is probably not strictly true but close enough.  We'll want to go into this at least knowing our versioning story wrt talos + m-c.
Depends on: 764592
Depends on: 766692
Depends on: 767042
No longer depends on: 764592
Depends on: 795236
Summary: get desktop Talos on Mozharness in production → roll out mozharness desktop talos to mozilla-central + project branches when ready
Depends on: 794587
Depends on: 795531
Depends on: 802801
Depends on: 803647
Depends on: 804385
Depends on: 805925
Depends on: 805931
No longer depends on: 805931
Depends on: 812609
Depends on: 812726
Depends on: 823306
Last Wednesday (Dec 19, 2012) we had a mozharness meeting where we audited the status of this bug.  We're still hammering out dependencies.  I'll give an A*Team POV on what else needs to be done there:

* bug 795236 : we have fixed bug 822478, which allows --authfile to take URLs, not just file paths; we still need config file changes, which involves someone who knows where the authfile actually lives, so probably releng (see the sketch at the end of this list)

* bug 795531 : being worked on by releng

* bug 802801 : someone needs to look at this; jmaher is assigned

* bug 804385 : this probably shouldn't block deployment if it's rare

* bug 805925 :
** we need to upload a new talos + other packages
** this alone won't fix the problem, but it will put us at parity with what is happening in production

* bug 812609 : releng is working on this
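
For the --authfile item above, a sketch of the shape of the bug 822478 change: accept either a local path or a URL. The function name and details are illustrative, not the actual mozharness code.

# Sketch only: resolve --authfile whether it is a path or a URL.
import os
import tempfile
import urllib2  # Python 2, which talos/mozharness ran on at the time

def resolve_authfile(authfile):
    """Return a local path for authfile, downloading it first if it is a URL."""
    if authfile.startswith(("http://", "https://")):
        fd, path = tempfile.mkstemp(suffix=".authfile")
        with os.fdopen(fd, "w") as f:
            f.write(urllib2.urlopen(authfile).read())
        return path
    return os.path.abspath(authfile)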
Depends on: 838485
Blocking on bug 837022's env update.
Depends on: 837022
Depends on: 843464
No longer depends on: 843464
Depends on: 843479
Depends on: 853599
Depends on: 853679
Depends on: 855554
Depends on: 865311
No longer depends on: 865311
No longer depends on: 843479
Assignee: nobody → yshun
Depends on: 878572
Depends on: 880414
Depends on: 880876
Landed a followup fix https://hg.mozilla.org/build/mozharness/rev/19e86b400db1 to get unit.sh passing again.
Depends on: 891525
I've hidden the rev3 machines from Cedar so we can focus:
https://tbpl.mozilla.org/?tree=Cedar&jobname=talos

Did someone from the a-team say they can help you verify the talos numbers?
I don't know what the right tool is these days.

Is the minidump bug still happening?
Perhaps you can ask ted or the a-team if they have a patch that can ensure crashing so we can do a try push.

I think our next branch to target should be try rather than mozilla-inbound and/or mozilla-central.
Should we be landing this patch to mozilla-inbound now-ish?
That way we don't have problems when we switch to talos mozharness.
Attachment #774674 - Flags: feedback?(yshun)
Attachment #774674 - Flags: feedback?(aki)
Attachment #774674 - Flags: feedback?(yshun) → feedback+
Attachment #774674 - Flags: feedback?(aki) → review?(aki)
Attachment #774674 - Flags: review?(aki) → review+
Attached patch enable talos on moz-inbound (obsolete) — Splinter Review
Attachment #774741 - Flags: review?(aki)
Comment on attachment 774741 [details] [diff] [review]
enable talos on moz-inbound

Let's roll out to try first, then all m-c level branches next.
I don't think it makes sense to only roll out to inbound.
Attachment #774741 - Flags: review?(aki)
Attachment #774674 - Attachment description: talos.json changes → [checked-in] talos.json changes
Whiteboard: [mozharness+talos] → [mozharness+talos][leave open]
Attachment #775683 - Flags: review?(armenzg)
Comment on attachment 775683 [details] [diff] [review]
[checked-in] enable talos on try

Review of attachment 775683 [details] [diff] [review]:
-----------------------------------------------------------------

Let's talk first before landing it.
Attachment #775683 - Flags: review?(armenzg) → review+
Comment on attachment 775683 [details] [diff] [review]
[checked-in] enable talos on try

https://hg.mozilla.org/build/buildbot-configs/rev/bacbba917d32
Attachment #775683 - Attachment description: enable talos on try → [checked-in] enable talos on try
Attachment #774741 - Attachment is obsolete: true
Merged to production and live on the try server.
https://tbpl.mozilla.org/?tree=Try&jobname=talos

We will have a look for a few days and then enable it across the board.
Depends on: 894980
Depends on: 895721
Attachment #778586 - Flags: review?(aki)
Comment on attachment 778586 [details] [diff] [review]
only consider return codes for talos mozharness

Before we land this and reconfig, we should verify the talos script actually exits with the appropriate self.return_code.

To do that we may have to add more self.buildbot_status() calls in mozharness.mozilla.testing.talos.  However, iirc, talos never goes orange, only green/red/retry.
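
A sketch of the pattern being asked for here: propagate the talos subprocess return code into a buildbot-visible status and make sure the script exits with it. The constants and the mapping below are illustrative; mozharness has its own status values and a buildbot_status() helper.

# Illustrative only -- not the mozharness API.
import subprocess
import sys

SUCCESS, WARNINGS, FAILURE = 0, 1, 2  # assumed status values

def run_talos(cmd):
    return_code = subprocess.call(cmd)
    if return_code == 0:
        return SUCCESS    # green
    elif return_code == 1:
        return WARNINGS   # orange: test failures / crashes
    else:
        return FAILURE    # red: harness error

if __name__ == "__main__":
    # the exit code of this script is what buildbot ultimately sees
    sys.exit(run_talos(sys.argv[1:]))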
Attachment #778586 - Flags: review?(aki) → review+
(In reply to Aki Sasaki [:aki] from comment #22)
> However, iirc, talos never goes orange,
> only green/red/retry.

Talos also goes orange for crashes and test-unexpected-fails as of bug 829728, via:
https://hg.mozilla.org/build/buildbotcustom/file/f24d9219c221/steps/talos.py#l115
Depends on: 896015
Moved the talos return codes to bug 896015.
Component: Talos → Release Engineering: Automation (General)
Product: Testing → mozilla.org
QA Contact: catlee
Version: unspecified → other
Attachment #778586 - Attachment is obsolete: true
Attached patch talos_mozharness.diff (obsolete) — Splinter Review
Enable talos mozharness on all FF25 branches.
Attachment #779916 - Flags: review?(jhopkins)
Attached patch different_builders.diff (obsolete) — Splinter Review
This shows which builders are being modified
Comment on attachment 779924 [details] [diff] [review]
[mc] Match talos.zip to talos_revision

Review of attachment 779924 [details] [diff] [review]:
-----------------------------------------------------------------

woot!
Attachment #779924 - Flags: review?(jmaher) → review+
Comment on attachment 779927 [details] [diff] [review]
[ma] Match talos.zip to talos_revision and use talos_repo

Review of attachment 779927 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/talos/talos.json
@@ +4,5 @@
>          "path": ""
>      },
>      "global": {
> +        "talos_repo": "http://hg.mozilla.org/build/talos",
> +        "talos_revision": "a11542b55a70"

the revision is wrong here
Attachment #779927 - Flags: review?(jmaher) → review-
Thanks for catching the mismatched revision.

Even though we're *not* enabling the new talos on older branches and will ride the trains (I think we can revisit this if we want to), I want talos.json to look the right way.
Attachment #779932 - Flags: review?(jmaher) → review+
Comment on attachment 779933 [details] [diff] [review]
[mr] Match talos.zip to talos_revision and use talos_repo

Review of attachment 779933 [details] [diff] [review]:
-----------------------------------------------------------------

this is not a valid revision :)
Attachment #779933 - Flags: review?(jmaher) → review-
Comment on attachment 779927 [details] [diff] [review]
[ma] Match talos.zip to talos_revision and use talos_repo

This one seems correct:
http://hg.mozilla.org/build/talos/rev/a11542b55a70
Attachment #779927 - Flags: review- → review?(jmaher)
I have no idea what that was about. Sorry!

http://hg.mozilla.org/build/talos/rev/560806cfa208
Attachment #779933 - Attachment is obsolete: true
Attachment #779935 - Flags: review?(jmaher)
Comment on attachment 779927 [details] [diff] [review]
[ma] Match talos.zip to talos_revision and use talos_repo

Review of attachment 779927 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/talos/talos.json
@@ +4,5 @@
>          "path": ""
>      },
>      "global": {
> +        "talos_repo": "http://hg.mozilla.org/build/talos",
> +        "talos_revision": "a11542b55a70"

ok, this is the right revision for the right branch.
Attachment #779927 - Flags: review?(jmaher) → review+
Comment on attachment 779935 [details] [diff] [review]
[mr] Match talos.zip to talos_revision and use talos_repo

Review of attachment 779935 [details] [diff] [review]:
-----------------------------------------------------------------

looks good!
Attachment #779935 - Flags: review?(jmaher) → review+
I think we're going to have to postpone enabling this tomorrow since I would not like to start using an incorrect talos_revision and possibly cause regressions by using an older talos version.
It's unfortunate but I think it is the safest option.
I will have to keep an eye on newer talos.zip revision requests.

Does anyone object? I was thinking of pushing it to Monday.

On another note, can I land a patch on mozilla-central without waiting for a merge from m-i? Or would I cause a merge conflict?

https://hg.mozilla.org/integration/mozilla-inbound/rev/0d4ab37e3f3e
https://hg.mozilla.org/releases/mozilla-aurora/rev/ac9b464cb3c8
https://hg.mozilla.org/releases/mozilla-beta/rev/64046aafb054
https://hg.mozilla.org/releases/mozilla-release/rev/481145f83cc6
Attachment #779924 - Flags: checked-in+
Attachment #779927 - Flags: checked-in+
Attachment #779932 - Flags: checked-in+
Attachment #779935 - Flags: checked-in+
Attachment #774674 - Flags: checked-in+
Attachment #775683 - Flags: checked-in+
I'm waiting on jyeo to let me know if we're making use of "suites" inside of talos.json yet.

If we are, then we have to wait for this patch to spread across the branches.
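
For context, this is roughly how a consumer would pull a suite's test list out of talos.json; the schema shown ("suites" -> suite name -> "tests") is an assumption based on the snippets in this bug, and the URL is just an example.

# Sketch: read the per-suite test list from an in-tree talos.json.
import json
import urllib2

def get_suite_tests(talos_json_url, suite_name):
    talos_json = json.load(urllib2.urlopen(talos_json_url))
    return talos_json["suites"][suite_name]["tests"]

# e.g.
# get_suite_tests("http://hg.mozilla.org/mozilla-central/raw-file/default/"
#                 "testing/talos/talos.json", "svgr")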
Attachment #779958 - Flags: review?(jmaher)
Comment on attachment 779958 [details] [diff] [review]
update talos.json to match values from config.py

Review of attachment 779958 [details] [diff] [review]:
-----------------------------------------------------------------

Update buildbot-configs first; I backed out some stuff, sorry.

::: testing/talos/talos.json
@@ +63,5 @@
> +                "ignore_first:5",
> +                "--filter",
> +                "median"
> +            ]
> +        },

we can delete both of these: otherx, and svgx.

@@ +74,5 @@
>                  "median"
>              ]
>          },
> +        "rafx": {
> +            "tests": ["tcanvasmark"],

this should have tsvgx, tscrollx
Attachment #779958 - Flags: review?(jmaher) → review-
Comment on attachment 779975 [details] [diff] [review]
update talos.json to match values from config.py

Review of attachment 779975 [details] [diff] [review]:
-----------------------------------------------------------------

this looks great!
Attachment #779975 - Flags: review?(jmaher) → review+
Comment on attachment 779975 [details] [diff] [review]
update talos.json to match values from config.py

I've landed it without removing the talos.zip block as that would have been bad:
https://hg.mozilla.org/integration/mozilla-inbound/rev/496a7582cf9e
Attachment #779975 - Flags: checked-in+
Attachment #779958 - Attachment is obsolete: true
Testing needed before we can enable talos mozharness across the board:
https://tbpl.mozilla.org/?tree=Try&rev=b5feca0c0c50
https://hg.mozilla.org/projects/cedar/rev/7cfd536a3d8a
Comment on attachment 779916 [details] [diff] [review]
talos_mozharness.diff

As discussed on IRC:
armenzg: try, cedar and ash should not change
jhopkins: "Rev4 MacOSX Lion 10.7 try talos" seems to have changed

I was comparing dump_master.py output before and after the patch and noticed the changed builder factories (note that 'try opt test' is unchanged but 'try talos' does indeed use a different factory than before):

< Rev4 MacOSX Lion 10.7 try opt test mochitest-browser-chrome ScriptFactory
> Rev4 MacOSX Lion 10.7 try opt test mochitest-browser-chrome ScriptFactory
< Rev4 MacOSX Lion 10.7 try opt test mochitest-other ScriptFactory
> Rev4 MacOSX Lion 10.7 try opt test mochitest-other ScriptFactory
< Rev4 MacOSX Lion 10.7 try opt test reftest-ipc ScriptFactory
> Rev4 MacOSX Lion 10.7 try opt test reftest-ipc ScriptFactory
< Rev4 MacOSX Lion 10.7 try opt test reftest ScriptFactory
> Rev4 MacOSX Lion 10.7 try opt test reftest ScriptFactory
< Rev4 MacOSX Lion 10.7 try opt test xpcshell ScriptFactory
> Rev4 MacOSX Lion 10.7 try opt test xpcshell ScriptFactory
< Rev4 MacOSX Lion 10.7 try talos chromez ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos chromez TalosFactory
< Rev4 MacOSX Lion 10.7 try talos dirtypaint ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos dirtypaint TalosFactory
< Rev4 MacOSX Lion 10.7 try talos dromaeojs ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos dromaeojs TalosFactory
< Rev4 MacOSX Lion 10.7 try talos other ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos other TalosFactory
< Rev4 MacOSX Lion 10.7 try talos svgr ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos svgr TalosFactory
< Rev4 MacOSX Lion 10.7 try talos tp5o ScriptFactory
> Rev4 MacOSX Lion 10.7 try talos tp5o TalosFactory
Attachment #779916 - Flags: review?(jhopkins) → review-
Talos.zip is needed for mobile. Re-testing:
https://tbpl.mozilla.org/?tree=Cedar

Status summary:
###############
* I need to provide a patch for Monday to enable talos mozharness for all FF25 trees
* I want to make sure that the talos.json modifications from comment 44 do not cause any regressions on Monday
* I want to make sure that talos.json is consistent across older branches in case a developer wants to push a change to the try server (since it will be using talos mozharness).

[1] https://hg.mozilla.org/integration/mozilla-inbound/rev/496a7582cf9e
Only one line has been added.
Attachment #781042 - Flags: review?(jhopkins)
Attachment #779916 - Attachment is obsolete: true
Attachment #779917 - Attachment is obsolete: true
I'm looking at http://perf.snarkfest.net/compare-talos/index.html?oldRevs=f479167e92d2&newRev=7cfd536a3d8a&submit=true, where I'm comparing two different revisions of Cedar, and I noticed a huge regression in tp5n_main_normal_fileio_paint, but I can't find which job on Cedar runs it.
Any ideas?

https://tbpl.mozilla.org/?tree=Cedar&jobname=talos&showall=1
I am not sure how to use the compare-talos toolchain, but we could look at some revisions on cedar and see what the raw values are.  This looks to be related to xperf, and those values are subject to change a lot.
This is tedious.
It would be great if compare-talos allowed creating graph URLs to compare against other branches or within the same branch (that is what I do in the following links).
From:
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=f479167e92d2&newRev=7cfd536a3d8a&submit=true

I've built the following URLs:
http://graphs.mozilla.org/graph.html#tests=[[224,63,24],[224,26,24]]&sel=none&displayrange=7&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[251,26,25],[251,63,25]]&sel=none&displayrange=7&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[244,26,25],[244,63,25]]&sel=none&displayrange=7&datatype=running
http://graphs.mozilla.org/graph.html#tests=[[245,26,25],[245,63,25]]&sel=none&displayrange=7&datatype=running
I would say that we're fine and it is due to noisy talos jobs.

Unless someone strongly believes that I should keep posting URLs for every regression of more than 3% deviation mentioned on compare-talos, I would like to assume that we're fine.

I'm facing the same situation as when I tried to determine whether talos-mozharness causes clear regressions compared to buildbotcustom/factory.py. It requires lots of eyeball analysis.

Missing URLs for these regressions:
- Ts Paint, MED Dirty Profile (tspaint_places_generated_med) - Mac 10.8 
- Ts Paint, MAX Dirty Profile (tspaint_places_generated_max) - Mac 10.8 & Ub. 32 
- TResize (tresize) - Win8
- Tp5 Optimized Responsiveness (tp5o_responsiveness_paint) - Mac 10.8, Win7 & Win8
- Tp5 Optimized MozAfterPaint (tp5o_shutdown_paint) - all but win7 & win8
- Tp5 Optimized (Modified Page List Bytes) (tp5o_modlistbytes_paint) - Win7
We'll deploy this tomorrow morning.
Review is expected to come through today.
Comment on attachment 781042 [details] [diff] [review]
enable talos mozharness for all FF25 trees

armenzg: can you please post an updated patch?  This one no longer applies cleanly to buildbot-configs.
Attachment #781042 - Flags: review?(jhopkins) → review-
Attachment #781042 - Attachment is obsolete: true
Attachment #782698 - Flags: review?(jhopkins)
Attachment #782698 - Flags: review?(jhopkins) → review+
I don't really have a clear way to determine which suites will be affected by talos mozharness tomorrow.

The last two attachments were obtained with this:
python talos/compare.py --revision=14e3e9ab9994 --branch=Cedar --masterbranch=Firefox --print-graph-url
python talos/compare.py --revision=14e3e9ab9994 --branch=Cedar --masterbranch=Firefox --print-graph-url

I wanted to understand things a little better and be sure that I know exactly what is going to change so I can answer any questions from developers. I wanted to be more educated in a way.

For now, I will re-paste the analysis that jmaher did a bit ago and hope that nothing new has cropped up.

From bug 802801:

(In reply to Joel Maher (:jmaher) from comment #10)
> So it appears that our ts test (start/stop the browser 20 times) is
> problematic.  the places_med|max are just different profiles used while
> running the test.  I have verified this on a few different changesets.  
> 
> I really don't understand how changing to mozharness could cause this, but
> maybe there is some additional overhead induced on the system with
> mozharness when it comes to launching a new process.  
> 
> We could take this as a bump in the numbers and accept that.  I am open to
> any thoughts here.
Attachment #782698 - Flags: checked-in+
This is live now.
Blocks: 899784
Depends on: 899793
Depends on: 899795
No longer blocks: 899784
Depends on: 899784
No longer depends on: 899795
Depends on: 900015
Depends on: 899570
Depends on: 900545
Depends on: 900605
Product: mozilla.org → Release Engineering
Blocks: 734466
I'm going to guess we're done here... ?
I don't think the two existing blocker bugs are actually strictly blocking.
I agree; I think this can be closed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → Mozharness