Closed Bug 1241535 Opened 8 years ago Closed 6 years ago

Create an action task that allows retriggering tests on an existing Taskcluster build to generate a geckoprofile for talos

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox64 fixed)

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: jmaher, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Right now, if we add "mozharness: --spsProfile" to the end of our try syntax, we can generate sps profile files for the duration of our test run. This has proven to be useful in many cases!

There is a need to compare profiles between a revision with a patch and a previous base revision, which means we need to generate profile data for both.

Right now the only method is a fresh try push, which means new builds and then new tests, and this has to be done with 2 pushes.

Given that we can retrigger jobs, I would like to pass a flag to the jobs, or toggle some bit, that makes the run generate the profile data. It might not be possible to add custom mozharness/try syntax to an existing try push. I do wonder if something with push-extender could trigger a job, but with certain properties set.

One other way to simplify this is to push to try and use existing builds from the offending changeset and base changeset and just run the talos job.  I know we can hack this in taskcluster, but I am not sure if some magic with buildbot bridge could allow us to take an arbitrary build and run a given test job on it with the right flags.

If we restrict this to try runs, then we don't have to worry about data being posted from the talos run while being profiled (those test results are useless).

We should outline a couple approaches (or the one that is realistic) and turn this into an actionable bug that can be picked up in due time.
:armenzg, could you weigh in on any approaches to this problem that come to your mind?
Flags: needinfo?(armenzg)
jmaher: would you be taking a given push on try and wanting some builders with those parameters?
Or a push from other repos than try? Where would you want those new scheduled jobs to appear? Under the same revision you're retriggering from?

We can create new jobs with new properties [1]. Not a retrigger request, even though, in effect it will be the same.

We will need to change Mozharness to check both places for the parameters it needs [2]
> self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')
to also support:
> self.buildbot_config['properties'].get('mozharness_extra_parameters')
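
A minimal sketch of that dual lookup, assuming `buildbot_config` has the shape implied by the two lines above. The helper name `extra_mozharness_args` is hypothetical and this is not the actual Mozharness code; it only illustrates checking the try comment first and the job property second.

```python
import shlex


def extra_mozharness_args(buildbot_config):
    """Collect extra mozharness arguments from try syntax or job properties.

    Hypothetical helper: looks for a 'mozharness:' section in the last
    change comment, then for a 'mozharness_extra_parameters' property.
    """
    args = []

    # 1) Try syntax: everything after 'mozharness:' in the last commit comment.
    changes = buildbot_config.get('sourcestamp', {}).get('changes', [])
    if changes:
        comments = changes[-1].get('comments', '')
        _, sep, tail = comments.partition('mozharness:')
        if sep:
            args.extend(shlex.split(tail))

    # 2) Job properties: set when a job is scheduled without a new push.
    extra = buildbot_config.get('properties', {}).get('mozharness_extra_parameters')
    if extra:
        for arg in shlex.split(extra):
            if arg not in args:  # avoid passing the same flag twice
                args.append(arg)

    return args
```

With this shape, a job scheduled with only the property set would get the same arguments as a try push that carried them in its commit message.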

We can very easily write a script that would do this by having the right credentials and hacking Mozharness.

---------

A more sophisticated approach would be to have a web app to help us do this:

You pass the web app two variables:
* repo_name
* revision

The web app loads the various available builds on Treeherder.
The user is prompted to select the builds that we want to base our tests on.

In the next step the user will be able to select from all the test jobs (the ones those builds could trigger) and choose the subset it cares about.

In the next step the user will be able to add extra parameters to be passed to the job.
Perhaps we can have a list to pick from.

On try, we will assume that we want to add jobs to that same revision.
If we *don't* want the jobs showing up on the same revision as the builds are taken from, we can either receive another revision to show the jobs under or 

On any other repository, we will not assume that we want to see those jobs running on that same revision. Instead 
the user will be redirected to a task graph. If we want treeherder pushes we will need to discuss this a bit further.


[1] https://github.com/mozilla/build-buildapi/blob/a19f3d79dd78e221a763b341f7c2d8281bea94a7/buildapi/controllers/selfserve.py#L511
[2] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/talos.py#165
Flags: needinfo?(armenzg)
Depends on: 1241644
<armenzg> let me see if I get it right this time
<armenzg> you trigger 6 jobs of a specific test (rather than all tests for a specific platform)
<armenzg> I assume you do this across various revisions
<armenzg> once you know which revision has the regression
<armenzg> you would go ahead and schedule a 7th job for it
<armenzg> that revision and the previous one
<jmaher> well, the only other difference is I schedule 6 retriggers for ALL talos tests on revision and base rev, which sometimes finds a few other regressions we didn't detect on initially running the test
Joel, do you use "trigger all talos" for this?

What exact flow leads you to want spsProfile runs? Do you get perf alerts, trigger "all talos" and determine later which pushes need spsProfile?

Is running "all talos" from Treeherder not sufficient to spot which revision caused the regression?

####################

Some notes for myself (do not read this section until jmaher and I have clarified exactly the flow we need):

Currently pulse_actions listens to "trigger all talos" requests [1] which calls trigger_all_talos_jobs() [2]
We will need to make pulse_actions pass an extra_properties 'mozharness_extra_properties' set to --spsProfile
We will need mozharness to look for a 'mozharness_extra_properties' property [3]
Side bug, in trigger_all_talos_jobs() we need to change from trigger_range() to trigger_arbitrary_job() since we're not triggering jobs across revisions.
[1] https://github.com/mozilla/pulse_actions/blob/master/pulse_actions/handlers/treeherder_resultset.py#L57
[2] https://github.com/mozilla/mozilla_ci_tools/blob/master/mozci/mozci.py#L560
[3] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/talos.py#165
for sps profiles the flow would be for specific tests which have regressed, so not all.  In many cases it would be the same test on many platforms, for example 'tp'.  We would manually retrigger these, probably via treeherder interface, and typically on try server, but it could be on an integration branch.

those are my wishes :)
When you say 'specific tests' do you mean 'specific talos jobs'?

From what you say, "trigger all talos" from Treeherder does not help you as-is when you're dealing with a specific talos job (instead of *all* talos jobs).

Would you want another action on Treeherder that would look like this? (assuming I'm starting to understand what you need)
* Select a specific talos suite (e.g. tp)
* Choose a new action on Treeherder (Create baseline + sps profile)
* pulse_actions determines what are all the 'tp' builders for every platform that can be scheduled on that push
* We schedule 5 normal runs + an sps profile run
* Schedule missing builds if necessary
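
The scheduling part of that list could be sketched roughly as follows. This is only an illustration: the function name, the builder naming convention (names ending in `talos <suite>`) and the idea of passing the flag as a per-job argument list are all assumptions, and scheduling missing builds is not covered.

```python
def plan_baseline_and_profile(builders, suite, normal_runs=5,
                              profile_flag='--spsProfile'):
    """Return (builder, extra_args) pairs to schedule for one talos suite.

    Hypothetical helper: for every platform builder of the suite it plans
    `normal_runs` plain runs plus one run with the profiling flag.
    """
    plan = []
    for builder in builders:
        # Keep only builders for the selected suite, on any platform.
        if not builder.endswith(' talos ' + suite):
            continue
        # Baseline runs carry no extra arguments.
        plan.extend((builder, []) for _ in range(normal_runs))
        # One additional run that generates the profile.
        plan.append((builder, [profile_flag]))
    return plan
```

Each pair would then be handed to whatever actually schedules the job on the push.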
I think that would be good.  We normally want extra data points for each job, so doing that and then the additional sps version would be nice.  How would we differentiate the spsprofile run... would the user have to iterate through all the other 6 jobs before finding the one with the artifact?
I think so.
Once bug 1218537 is fixed they would *not* need to click on "inspect task" to determine which of the tasks has the artifact.

Another option would be for pulse_actions to send an email with direct links to where the artifacts would be found.
Maybe give the sps profile job a different colour so developers would not re-trigger by mistake an sps profile job (if they were expecting a normal run).
Developers put the profiles into a web app.

TODO: find developers to talk about this process with them (mstange, mconley and BenWa).
possibly a dup of bug 1322433
Blocks: 1307197
wlach, this is similar to what you are working on
Looks like this isn't a priority anymore; also it looks more like a treeherder-based feature and not a talos framework issue, so closing it out.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
This would still be super interesting to have. I agree this is probably more of a Treeherder thing.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Component: Talos → Treeherder
Product: Testing → Tree Management
Version: unspecified → ---
Hi! Reading the bug subject in the bugmail before opening the bug made me think this was not a Treeherder bug, but reading the comments I see that the bug has changed directions a few times, so that may no longer be the base. However I'm still a little unclear on what the use-case / ask is here? It sounds like new jobs need to be scheduled with an additional parameter passed in the try syntax? If so, that seems like something that try fuzzy or similar tools should be doing instead of Treeherder?

Could someone summarise the bug?
Eugh, s/base/case/
(In reply to Ed Morley [:emorley] from comment #15)
> 
> Could someone summarise the bug?

Yeah, sure, I can try.

Sometimes we get reports of a Talos performance regression on one of our patches. One of our key diagnostic tools for a performance regression is getting Gecko Profiler profiles from the Talos machines for the test runs.

In order to get those profiles, it's necessary for us to push to try with the mozharness: --geckoProfile argument to the try syntax. This means, in the worst case, spinning up a new build, which can take a while.

In actuality, re-running the Talos test with profiling enabled shouldn't require a re-build. This bug is a feature request that allows us to look at a regressing revision in Treeherder, and go "Oh, okay, please re-run that suite of Talos tests, but give me profiles for them".

Is that sufficient?
Flags: needinfo?(emorley)
Ah yes that's helpful - I think I was missing the part about not needing a rebuild to enable profiles.

So the steps for this are:
- Devise a way to schedule new test jobs on an existing push, but that have additional parameters set. (Reading back the comments it seems there's bug 1322433 and friends for something similar? I really don't know much about that)
- Create tooling that:
 * Makes it easy to select the correct subset of jobs
 * Reschedules those jobs using the above functionality with the appropriate sps profile options set
 * (Optionally) Makes it easy to find the URLs for the uploaded profiles and compare them

I agree this definitely sounds like a workflow that should be improved. 

My first concern is just that we try to avoid hardcoding too much test suite related business logic into Treeherder or anything else that isn't in mozilla-central. ie If Treeherder could schedule a second decision task (or something similar) that ran these steps based on a config in mozilla-central that would be ideal (this would make it easy to add mach support for the feature too). I'm presuming this isn't the only use-case that would follow this pattern so it seems worth having discussions between the taskcluster, treeherder and test automation teams to decide the best way forwards?
Flags: needinfo?(emorley)
Summary: find a way to quickly "retrigger" a job to generate a sps profile for talos → Find a way to quickly "retrigger" tests on an existing build to generate an sps profile for talos
Flags: needinfo?(mconley)
Bstack just added a way to initiate jobs that sounds like it may help here.  Or perhaps we can augment it to handle the case you want.  Mike: would you check out the drop-down on a given push in Treeherder and select "Custom push action..." and see if that's sufficient?  Or if it looks like a mod there may get you where you want to go?
Hi emorley, camd,

This "Custom push action..." business sounds like it might do what I want, but it seems to want some kind of job syntax that I don't have. There appears to be a dropdown that presumably lets me choose job parameters from a template or something... would it be possible to add a template that re-runs one or more talos test suites with mozharness: --geckoProfile ?
Flags: needinfo?(mconley) → needinfo?(cdawson)
Greg, what do you think would be required in taskcluster to accomplish this?

wlach, it sounds like you tried something similar to this, can you summarize a bit what you found?
Flags: needinfo?(wlachance)
Flags: needinfo?(garndt)
(In reply to Stuart Philp :sphilp from comment #21)
> Greg, what do you think would be required in taskcluster to accomplish this?
> 
> wlach, it sounds like you tried something similar to this, can you summarize
> a bit what you found?

Yeah, action tasks would be the way to implement this, assuming talos is running on taskcluster. I was working on this in the early part of the year:

https://wlach.github.io/blog/2017/04/easier-reproduction-of-intermittent-test-failures-in-automation/

I'm not up-to-date on where the action task stuff is these days, but I know Brian Stack and others on the tc team have been pushing this forward. Greg can no doubt give more detail.
Flags: needinfo?(wlachance)
It looks like Talos is a mixture of BuildBot and Taskcluster initiated jobs.

Adding bstack since he wrote the original "Custom push action..." impl.  Is there a template that could be added for what's described in comment 20?
Flags: needinfo?(cdawson)
osx jobs are 100% taskcluster; linux/win7/win10 are scheduled via taskcluster and run via buildbot-bridge on buildbot.  When we switch to the new hardware in the coming months everything will be on taskcluster, but that could take a few months, maybe even 6.
(In reply to Stuart Philp :sphilp from comment #21)
> Greg, what do you think would be required in taskcluster to accomplish this?
> 
> wlach, it sounds like you tried something similar to this, can you summarize
> a bit what you found?

Action tasks can now be defined in tree.  Here is some documentation about them:
http://firefox-source-docs.mozilla.org/taskcluster/taskcluster/actions.html

Here is an existing action task for inspiration:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#20

I believe something like this might be a "retrigger talos with parameters" type action, such as the mochitest action:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/mochitest_retrigger.py#33

These will then show up in the custom action menu within TH.
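
As a rough idea of what the core of such an action might do, here is a sketch that takes an existing test task definition and returns a copy that also collects a profile. It does not use the in-tree action registration API from the docs linked above; the function name, the flat `payload.command` list and the `-p` symbol suffix are assumptions for illustration only.

```python
import copy


def add_profile_flag(task_def, flag='--geckoProfile', symbol_suffix='-p'):
    """Return a copy of a test task definition that also collects a profile.

    Hypothetical helper: assumes payload.command is a flat list of strings.
    """
    new_task = copy.deepcopy(task_def)  # leave the original task untouched

    # Append the profiling flag to the test command, once.
    command = new_task['payload']['command']
    if flag not in command:
        command.append(flag)

    # Mark the job in Treeherder so it is not mistaken for a normal run.
    treeherder = new_task.setdefault('extra', {}).setdefault('treeherder', {})
    treeherder['symbol'] = treeherder.get('symbol', '') + symbol_suffix
    return new_task
```

An in-tree action would wrap something like this in a registered callback, so that the existing build is reused and only the test task is re-created with the extra flag.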
Flags: needinfo?(garndt)
Component: Treeherder → Treeherder: Job Triggering & Cancellation
What Greg suggested sounds great - most of the business logic living in-repo (where it's more easily maintained), with Treeherder only needing to know enough to trigger that custom task.
See Also: → 1202718
See Also: → 1412009
Is --spsProfile related to --gecko-profile?

If so, bug 1412009 can be duped to this one.
spsProfile was renamed to geckoProfile a year or so ago
See Also: 1412009
Blocks: 1411995
Summary: Find a way to quickly "retrigger" tests on an existing build to generate an sps profile for talos → Find a way to quickly "retrigger" tests on an existing build to generate a geckoprofile for talos
Going by comment 25, can (/needs to) be implemented in-tree, so this belongs in the Talos component instead.
Status: REOPENED → NEW
Component: Treeherder: Job Triggering & Cancellation → Talos
Product: Tree Management → Testing
Summary: Find a way to quickly "retrigger" tests on an existing build to generate a geckoprofile for talos → Create an action task that allows retriggering tests on an existing Taskcluster build to generate a geckoprofile for talos
Version: --- → unspecified
Is this a dupe of bug 1465117?
99% a duplicate, I would like to make this more of a one click than a custom action, <edit a bunch>, click ok- I think we can avoid the <edit a bunch> step and make it a hardcoded custom action.
Yeah that definitely sounds like a good idea (making it a one-click). I meant more that in concept at least, this bug is a dupe of the two-parter "add action task in bug 1465117 + add treeherder UI parts in bug <TODO>"? :-)
Add support for 'geckoprofile' action task in-tree.
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab32d2dc93e2
add support for 'geckoprofile' action task in-tree. r=bstack
https://hg.mozilla.org/mozilla-central/rev/ab32d2dc93e2
Status: NEW → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Assignee: nobody → jmaher
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/fd20a61c8e3216dd0a5db244e86c71aafa6ea4ce
Bug 1241535 - Add support to job actions for collecting gecko profiles of performance tests. r=camd (#4128)