Create an action task that allows retriggering tests on an existing Taskcluster build to generate a geckoprofile for talos

NEW
Unassigned

Status

3 years ago
3 months ago

People

(Reporter: jmaher, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
Right now, if we enter "mozharness: --spsProfile" at the end of our try syntax, we can generate sps profile files for the duration of our test run. This has proven useful in many cases!

There is a need to compare profiles between a revision with a patch and a previous base revision, which means we need to generate profile data for both.

Right now the method is a fresh try push, which means new builds and then new tests. This would need to be done with two pushes.

Given that we can retrigger jobs, I would like to pass a flag to the jobs or toggle some bit that makes the retriggered job generate the profile data. It might not be possible to add custom mozharness/try syntax to an existing try push. I do wonder if something with push-extender could trigger a job, but with certain properties set.

One other way to simplify this is to push to try and use existing builds from the offending changeset and base changeset and just run the talos job.  I know we can hack this in taskcluster, but I am not sure if some magic with buildbot bridge could allow us to take an arbitrary build and run a given test job on it with the right flags.

If we restrict this to try runs, then we don't have to worry about data being posted from the talos run while it is being profiled (those test results are useless).

We should outline a couple approaches (or the one that is realistic) and turn this into an actionable bug that can be picked up in due time.
(Reporter)

Comment 1

3 years ago
:armenzg, could you weigh in on any approaches to this problem that come to your mind?
Flags: needinfo?(armenzg)

Comment 2

3 years ago
jmaher: would you be taking a given push on try and wanting some builders with those parameters?
Or a push from other repos than try? Where would you want those new scheduled jobs to appear? Under the same revision you're retriggering from?

We can create new jobs with new properties [1]. This is not a retrigger request, even though in effect it will be the same.

We will need to change Mozharness to check both places for the parameters it needs [2]
> self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')
to also support:
> self.buildbot_config['properties'].get('mozharness_extra_parameters')

We can very easily write a script that would do this by having the right credentials and hacking Mozharness.
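A minimal sketch of the "check both places" lookup described above, written as a standalone function rather than as a patch to the real Mozharness code. The function name `extra_mozharness_args` is hypothetical; the property name `mozharness_extra_parameters` and the `buildbot_config` layout are taken from the two snippets quoted in this comment, and the real parsing in talos.py may well do more than this.

```python
def extra_mozharness_args(buildbot_config):
    """Return extra mozharness arguments found in a buildbot config dict.

    Assumption: the 'mozharness_extra_parameters' property (the name
    proposed in this comment) wins when present; otherwise we fall back
    to the 'mozharness:' part of the try syntax in the last change.
    """
    props = buildbot_config.get('properties', {})
    from_props = props.get('mozharness_extra_parameters')
    if from_props:
        return from_props.split()

    changes = buildbot_config.get('sourcestamp', {}).get('changes', [])
    if not changes:
        return []
    comments = changes[-1].get('comments', '')
    # str.partition returns (before, separator, after); 'after' is an
    # empty string when 'mozharness:' does not appear in the comment.
    _, _, after = comments.partition('mozharness:')
    return after.split()
```

With this shape, a job scheduled through the API with the property set and a job pushed with try syntax would both end up with the same argument list.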

---------

A more sophisticated approach would be to have a web app to help us do this:

You pass the web app two variables:
* repo_name
* revision

The web app loads the various available builds on Treeherder.
The user is prompted to select the builds that we want to base our tests on.

In the next step the user will be able to select from all the test jobs (the ones those builds could trigger) and choose the subset it cares about.

In the next step the user will be able to add extra parameters to be passed to the job.
Perhaps we can have a list to pick from.

On try, we will assume that we want to add jobs to that same revision.
If we *don't* want the jobs showing up on the same revision as the builds are taken from, we can either receive another revision to show the jobs under or 

On any other repository, we will not assume that we want to see those jobs running on that same revision. Instead, the user will be redirected to a task graph. If we want Treeherder pushes we will need to discuss this a bit further.


[1] https://github.com/mozilla/build-buildapi/blob/a19f3d79dd78e221a763b341f7c2d8281bea94a7/buildapi/controllers/selfserve.py#L511
[2] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/talos.py#165
Flags: needinfo?(armenzg)
(Reporter)

Updated

3 years ago
Depends on: 1241644

Comment 3

3 years ago
<armenzg> let me see if I get it right this time
<armenzg> you trigger 6 jobs of a specific test (rather than all tests for a specific platform)
<armenzg> I assume you do this across various revisions
<armenzg> once you know which revision has the regression
<armenzg> you would go ahead and schedule a 7th job for it
<armenzg> that revision and the previous one

Comment 4

3 years ago
<jmaher> well, the only other difference is I schedule 6 retriggers for ALL talos tests on revision and base rev, which sometimes finds a few other regressions we didn't detect on initially running the test

Comment 5

3 years ago
Joel, do you use "trigger all talos" for this?

What exact flow leads you to want spsProfile runs? Do you get perf alerts, trigger "all talos" and determine later which pushes need spsProfile?

Is running "all talos" from Treeherder not sufficient to spot which revision caused the regression?

####################

Some notes for myself (do not read this section until jmaher and I have clarified exactly the flow we need):

Currently pulse_actions listens to "trigger all talos" requests [1] which calls trigger_all_talos_jobs() [2]
We will need to make pulse_actions pass an extra_properties 'mozharness_extra_properties' set to --spsProfile
We will need mozharness to look for a 'mozharness_extra_properties' property [3]
Side bug, in trigger_all_talos_jobs() we need to change from trigger_range() to trigger_arbitrary_job() since we're not triggering jobs across revisions.
[1] https://github.com/mozilla/pulse_actions/blob/master/pulse_actions/handlers/treeherder_resultset.py#L57
[2] https://github.com/mozilla/mozilla_ci_tools/blob/master/mozci/mozci.py#L560
[3] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/talos.py#165
(Reporter)

Comment 6

3 years ago
for sps profiles the flow would be for specific tests which have regressed, so not all.  In many cases it would be the same test on many platforms, for example 'tp'.  We would manually retrigger these, probably via treeherder interface, and typically on try server, but it could be on an integration branch.

those are my wishes :)

Comment 7

3 years ago
When you say 'specific tests' do you mean 'specific talos jobs'?

From what you say, 'trigger all talos' from Treeherder does not help you as-is when you're dealing with a specific talos job (instead of *all* talos jobs).

Would you want another action on Treeherder that would look like this? (assuming I'm starting to understand what you need)
* Select a specific talos suite (e.g. tp)
* Choose a new action on Treeherder (Create baseline + sps profile)
* pulse_actions determines what are all the 'tp' builders for every platform that can be scheduled on that push
* We schedule 5 normal runs + an sps profile run
* Schedule missing builds if necessary
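The scheduling step in the list above can be sketched as a small planning function. This is only an illustration under stated assumptions: the function name, the request dict layout and the builder names are hypothetical, and nothing here talks to pulse_actions or buildbot. It just shows the "5 normal runs + 1 profile run per builder" shape.

```python
def plan_baseline_and_profile(builders, normal_runs=5,
                              profile_flag='--spsProfile'):
    """Build a list of job requests for each builder.

    Each builder gets `normal_runs` plain runs followed by one run that
    carries the profiling flag. The dict keys are hypothetical.
    """
    requests = []
    for builder in builders:
        for _ in range(normal_runs):
            requests.append({'buildername': builder, 'extra_args': []})
        # One additional run per builder with profiling enabled.
        requests.append({'buildername': builder,
                         'extra_args': [profile_flag]})
    return requests
```

Keeping the profile run as a separate request with its own arguments would also make it easier to label or colour that job differently from the normal runs.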
(Reporter)

Comment 8

3 years ago
I think that would be good.  We normally want extra data points for each job, so doing that and then the additional sps version would be nice.  How would we differentiate the spsprofile run? Would the user have to iterate through all the other 6 jobs before finding the one with the artifact?

Comment 9

3 years ago
I think so.
Once bug 1218537 is fixed they would *not* need to click on "inspect task" to determine which of the tasks has the artifact.

Another option would be for pulse_actions to send an email with direct links to where the artifacts would be found.

Comment 10

3 years ago
Maybe give the sps profile job a different colour so developers would not re-trigger by mistake an sps profile job (if they were expecting a normal run).
Developers put the profiles into a web app.

TODO: find developers to talk about this process with them (mstange, mconley and BenWa).
(Reporter)

Comment 11

2 years ago
possibly a dup of bug 1322433
Blocks: 1307197
(Reporter)

Comment 12

2 years ago
wlach, this is similar to what you are working on
Looks like this isn't a priority anymore; also it looks more like a treeherder-based feature and not a talos framework issue, so closing it out.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX
This would still be super interesting to have. I agree this is probably more of a Treeherder thing.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Component: Talos → Treeherder
Product: Testing → Tree Management
Version: unspecified → ---
Hi! Reading the bug subject in the bugmail before opening the bug made me think this was not a Treeherder bug, but reading the comments I see that the bug has changed directions a few times, so that may no longer be the base. However I'm still a little unclear on what the use-case / ask is here? It sounds like new jobs need to be scheduled with an additional parameter passed in the try syntax? If so, that seems like something that try fuzzy or similar tools should be doing instead of Treeherder?

Could someone summarise the bug?
Eugh, s/base/case/
(In reply to Ed Morley [:emorley] from comment #15)
> 
> Could someone summarise the bug?

Yeah, sure, I can try.

Sometimes we get reports of a Talos performance regression on one of our patches. One of our key diagnostic tools for a performance regression is getting Gecko Profiler profiles from the Talos machines for the test runs.

In order to get those profiles, it's necessary for us to push to try with the mozharness: --geckoProfile argument to the try syntax. This means, in the worst case, spinning up a new build, which can take a while.

In actuality, re-running the Talos test with profiling enabled shouldn't require a re-build. This bug is a feature request that allows us to look at a regressing revision in Treeherder, and go "Oh, okay, please re-run that suite of Talos tests, but give me profiles for them".

Is that sufficient?
Flags: needinfo?(emorley)
Ah yes that's helpful - I think I was missing the part about not needing a rebuild to enable profiles.

So the steps for this are:
- Devise a way to schedule new test jobs on an existing push, but that have additional parameters set. (Reading back the comments it seems there's bug 1322433 and friends for something similar? I really don't know much about that)
- Create tooling that:
 * Makes it easy to select the correct subset of jobs
 * Reschedules those jobs using the above functionality with the appropriate sps profile options set
 * (Optionally) Makes it easy to find the URLs for the uploaded profiles and compare them

I agree this definitely sounds like a workflow that should be improved. 

My first concern is just that we try to avoid hardcoding too much test suite related business logic into Treeherder or anything else that isn't in mozilla-central. ie If Treeherder could schedule a second decision task (or something similar) that ran these steps based on a config in mozilla-central that would be ideal (this would make it easy to add mach support for the feature too). I'm presuming this isn't the only use-case that would follow this pattern so it seems worth having discussions between the taskcluster, treeherder and test automation teams to decide the best way forwards?
Flags: needinfo?(emorley)

Updated

a year ago
Summary: find a way to quickly "retrigger" a job to generate a sps profile for talos → Find a way to quickly "retrigger" tests on an existing build to generate an sps profile for talos
Flags: needinfo?(mconley)
Bstack just added a way to initiate jobs that sounds like it may help here.  Or perhaps we can augment it to handle the case you want.  Mike: would you check out the drop-down on a given push in Treeherder and select "Custom push action..." and see if that's sufficient?  Or if it looks like a mod there may get you where you want to go?
Hi emorley, camd,

This "Custom push action..." business sounds like it might do what I want, but it seems to want some kind of job syntax that I don't have. There appears to be a dropdown that presumably lets me choose job parameters from a template or something... would it be possible to add a template that re-runs one or more talos test suites with mozharness: --geckoProfile ?
Flags: needinfo?(mconley) → needinfo?(cdawson)
Greg, what do you think would be required in taskcluster to accomplish this?

wlach, it sounds like you tried something similar to this, can you summarize a bit what you found?
Flags: needinfo?(wlachance)
Flags: needinfo?(garndt)
(In reply to Stuart Philp :sphilp from comment #21)
> Greg, what do you think would be required in taskcluster to accomplish this?
> 
> wlach, it sounds like you tried something similar to this, can you summarize
> a bit what you found?

Yeah, action tasks would be the way to implement this, assuming talos is running on taskcluster. I was working on this in the early part of the year:

https://wlach.github.io/blog/2017/04/easier-reproduction-of-intermittent-test-failures-in-automation/

I'm not up-to-date on where the action task stuff is these days, but I know Brian Stack and others on the tc team have been pushing this forward. Greg can no doubt give more detail.
Flags: needinfo?(wlachance)
It looks like Talos is a mixture of BuildBot and Taskcluster initiated jobs.

Adding bstack since he wrote the original "Custom push action..." impl.  Is there a template that could be added for what's described in comment 20?
Flags: needinfo?(cdawson)
(Reporter)

Comment 24

a year ago
osx jobs are 100% taskcluster; linux/win7/win10 are scheduled via taskcluster and run via buildbot-bridge on buildbot.  When we switch to the new hardware in the coming months everything will be on taskcluster, but that could take a few months, maybe even 6.

Comment 25

11 months ago
(In reply to Stuart Philp :sphilp from comment #21)
> Greg, what do you think would be required in taskcluster to accomplish this?
> 
> wlach, it sounds like you tried something similar to this, can you summarize
> a bit what you found?

Action tasks can now be defined in tree.  Here is some documentation about them:
http://firefox-source-docs.mozilla.org/taskcluster/taskcluster/actions.html

Here is an existing action task for inspiration:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#20

I believe something like this might be a "retrigger talos with parameters" type action, such as the mochitest action:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/mochitest_retrigger.py#33

These will then show up in the custom action menu within TH.
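As a rough illustration of what a "retrigger talos with parameters" action would need to do to the task it re-runs, here is a self-contained sketch. It deliberately does not use the in-tree action registration API; the function name is hypothetical, and it assumes the task's `payload.command` is a flat list of strings, which may not hold for every worker type. The `--geckoProfile` flag is the one named earlier in this bug.

```python
import copy


def with_gecko_profile(task_def, flag='--geckoProfile'):
    """Return a copy of a task definition with the profiling flag added.

    Assumption: task_def['payload']['command'] is a flat list of
    strings. The original task definition is left untouched.
    """
    new_task = copy.deepcopy(task_def)
    command = new_task.setdefault('payload', {}).setdefault('command', [])
    # Avoid adding the flag twice if the task already profiles.
    if flag not in command:
        command.append(flag)
    return new_task
```

An in-tree action callback could apply a transformation like this to the selected talos task before creating the new task against the existing build.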
Flags: needinfo?(garndt)

Updated

11 months ago
Component: Treeherder → Treeherder: Job Triggering & Cancellation

Comment 26

11 months ago
What Greg suggested sounds great - most of the business logic living in-repo (where it's more easily maintained), with Treeherder only needing to know enough to trigger that custom task.

Updated

11 months ago
See Also: → bug 1202718

Updated

11 months ago
See Also: → bug 1412009

Comment 27

11 months ago
Is --spsProfile related to --gecko-profile?

If so, bug 1412009 can be duped to this one.
(Reporter)

Comment 28

11 months ago
spsProfile was renamed to geckoProfile a year or so ago

Updated

11 months ago
See Also: bug 1412009
Duplicate of this bug: 1412009

Updated

11 months ago
Blocks: 1411995

Updated

11 months ago
Summary: Find a way to quickly "retrigger" tests on an existing build to generate an sps profile for talos → Find a way to quickly "retrigger" tests on an existing build to generate a geckoprofile for talos

Comment 30

3 months ago
Going by comment 25, this can (and needs to) be implemented in-tree, so this belongs in the Talos component instead.
Status: REOPENED → NEW
Component: Treeherder: Job Triggering & Cancellation → Talos
Product: Tree Management → Testing
Summary: Find a way to quickly "retrigger" tests on an existing build to generate a geckoprofile for talos → Create an action task that allows retriggering tests on an existing Taskcluster build to generate a geckoprofile for talos
Version: --- → unspecified