Closed Bug 1274310 Opened 8 years ago Closed 8 years ago

Don't run PGO tests on every build

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: mtabara)

References

Details

Attachments

(1 file)

Linux PGO builds are now running as tier 2 in TC, but they're running on every commit. In buildbot these builds were running on a timer and/or every 5 commits.

We should use the TaskCluster hooks service to perform similar scheduling for PGO in TC.
I think this actually needs something like SETA?  Hooks can't really help with this, unless you *only* want to run them on a schedule and not on pushes.
See Also: → 1274022
After some digging over the past few days, it turns out that:
* I cannot use hooks, unless I want to run them on a scheduled basis
* I cannot use SETA, since so far it is only wired up for tests
* I can use the coalescer for TC builds, which is great!

Had a longer chat with dividehex today as he was kind enough to explain the basics of the coalescer for TC builds to me.

Long story short:
* more info on the coalescer can be found at https://mana.mozilla.org/wiki/display/IT/Releng+Coalescer+for+Taskcluster
* the PR for adding the production threshold is https://github.com/mozilla/tc-coalesce/pull/13

I will follow up shortly with the gecko patch to supersede the Linux64 PGO builds.
See Also: → 1268187
Enable the usage of coalescer for Linux64 PGO builds.
Attachment #8756131 - Flags: review?(jwatkins)
:dustin, coop: 

After doing some at-first-glance measurements on mozilla-inbound, I started with these values: https://github.com/mozilla/tc-coalesce/blob/master/config/config.py#L12 . Not sure if they make much sense, but they are a rough estimate to start with: if more than 5 pending jobs are in the queue and the oldest one has been waiting for more than 30 minutes, they all get coalesced into a single task.
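
For illustration, here is how I read that rule, as a rough Python sketch (my own rendering, not the actual tc-coalesce code; the names are made up):

    from datetime import datetime, timedelta

    # Threshold values as described above (illustrative constants, not
    # the actual tc-coalesce config):
    PENDING_COUNT_THRESHOLD = 5
    PENDING_AGE_THRESHOLD = timedelta(minutes=30)

    def should_coalesce(pending_submit_times, now=None):
        # pending_submit_times: submission datetimes of the jobs still
        # waiting in a given coalescing key.  Reading the rule above as
        # "both conditions must hold": more than 5 pending jobs AND the
        # oldest one waiting for over 30 minutes.
        now = now or datetime.utcnow()
        if len(pending_submit_times) <= PENDING_COUNT_THRESHOLD:
            return False
        oldest = min(pending_submit_times)
        return (now - oldest) > PENDING_AGE_THRESHOLD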

Question: do we have any kind of data as to the average number of pushes/pending job time/etc per task in mozilla-inbound in order to infer better values here?


:dividehex: thanks for landing the changes to production: https://coalesce.mozilla-releng.net/v1/threshold
Flags: needinfo?(dustin)
Flags: needinfo?(coop)
Attachment #8756131 - Flags: feedback?(dustin)
Comment on attachment 8756131 [details] [diff] [review]
Bug 1274310 - Use TC coalescer for Linux64 PGO builds. r=dividehex

Review of attachment 8756131 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/taskcluster/tasks/builds/opt_linux64_pgo.yml
@@ +10,5 @@
>  
>    routes:
>      - 'index.buildbot.branches.{{project}}.linux64-pgo'
>      - 'index.buildbot.revisions.{{head_rev}}.{{project}}.linux64-pgo'
> +    - 'coalesce.v1.builds.opt_linux64_pgo'

Let's stick with the build name used in the index here, and be sure to include the project so that we're not coalescing across projects (aka branches) -- '{{project}}.linux64-pgo'.

Also -- and it's OK to leave this for a follow-up patch -- we'll need to add some code in taskcluster/taskgraph/kind/legacy.py to remove this route and supersederUrl for try jobs.
Attachment #8756131 - Flags: feedback?(dustin)
I don't know if we have better data.

My comment 1 still stands -- this is a way to do "queue collapsing" of PGO builds under load, but under ordinary circumstances we'll still run PGO builds on every commit.
Flags: needinfo?(dustin)
The 20+ hours of tests run in parallel, so this will eat up hundreds of thousands of extra jobs until we resolve it.  If we want to spend the resources, that is fine - I don't want to assume SETA will solve this.  Why wouldn't hooks work for an every-3-hour job?
Hooks would work for an every-3-hour job.  If you then drop the jobs from the tree, then you'll reduce the overall test load, but you said "and by demand" -- I assume that means on some pushes?
On demand would be as needed.  A few common scenarios:
1) we have a pgo-only failure (as seen on Monday) and have to backfill pgo jobs/tests, since pgo only runs periodically
2) sheriffs want to merge between branches, so they trigger on-demand pgo jobs for a given revision to ensure it passes all the tests prior to merge
3) developers know there is a potential for a pgo difference in their patches (a recent example being a patch with a lot of build changes) and trigger pgo builds.

When we run on demand, we currently expect a build to be produced and all tests to be launched afterwards.
Sorry to derail this, mihai!  I'd *really* like to get coalescer started, and there's no reason not to start with this particular task -- but as we've learned, it doesn't solve the issue in this bug.

So, yes, we can do PGOs on a scheduled basis.  That will require:

 * setting up a hook to schedule a decision task
   * ideally the frequency and details of this task would live in-tree, in some kind of `.cron.yml`; that `.cron.yml` would be read and interpreted by a trusted "gecko cron task" that is, in turn, run from hooks.  The "gecko cron task" would spawn the decision tasks based on the schedule and description in `.cron.yml`.

 * Implementing a "nightly" mode for `mach taskgraph decision`
   * at the least, being able to run without a given revision
   * possibly implementing some kind of buildbot-like "find the last working revision" support?

As for backfilling, I think that the up-and-coming try-extender support can handle that, however we implement it.  Basically, the full task graph for every push would contain the PGO builds, but they would not be targeted and thus would be omitted from the target task graph.  The try extender (or mozci or whatever) would look in the full task graph, find the relevant tasks, and create them.
Would hooks be OK?  Currently we run pgo on a timer every X (I believe 3) hours, and on demand.  Hooks would solve the timer but not the on-demand case; currently there is no other on-demand solution, so hooks might be the right solution.

SETA is used to intentionally skip jobs - we could use the eventual mechanism to force build skipping as well, but it will be 4+ weeks before that works in taskcluster.

The biggest concern here is that if we build pgo much more frequently, we are going to be running 20+ hours of tests for each of those builds.  I still think the new <90 minute pgo builds are too long to be run instead of opt, and I am not sure we will be able to optimize them to <60 minutes.
Hooks is an option for nightly-style builds, yes, but we'd still need SETA to make decisions about when to not run on-push (otherwise we'd just be adding builds).

I'm not sure what the "eventual mechanism" is, but 4+ weeks of running a few extra long builds doesn't seem that bad.  And Armen has a contributor working on SETA in TaskCluster, too.

I guess I don't understand the issue -- if the PGO tests take 20+ hours to finish, surely they're not of much value on *any* push?  Why bother running them at all?  Or why not run them but hide them by default, and only check occasionally?
See Also: → 1275972
Attachment #8756131 - Flags: review?(jwatkins)
:dustin - no worries about derailing here!

I filed a separate bug 1275972 to follow up on the coalescer "under load" scenario and will keep this bug for the ordinary-circumstances PGO build triggering behavior.

I'm not sure yet that I have enough skills to implement the aforementioned stuff, but I would definitely like to take a deeper look.
Flags: needinfo?(coop)
:dustin Before I forget, I promised to follow up with a patch to exclude try from the coalescer queue collapsing that was hooked up in bug 1275972. Can you please point me again at the resource I need to tweak for that? I searched for taskcluster/taskgraph/kind/legacy.py without success.

As to the scheduled approach, I have one question. I may be completely off the rails here, but we've got several things under heavy development that are going to be deployed soon (SETA for builds, try-extender support, etc.) that will ease this anyway in the near future. With that in mind, why bother with the scheduling mechanism, since builds are cheap anyway?

How about allowing PGO builds (which are a <90-minute effort now) on every commit, but intentionally skipping some of the test jobs, which seem to be the more expensive part here (20+ hours of tests)? I'm not sure how/if that can be tweaked for the "on demand" case as well, but I thought I should ask here.

Sorry if I misunderstood something here.
Flags: needinfo?(dustin)
There are a few clauses in legacy.py similar to this:

https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/kind/legacy.py#242
            # try builds don't use cache
            if project == "try":
                remove_caches_from_task(build_task)
                set_expiration(build_task, json_time_from_now(TRY_EXPIRATION))

to which you could add a transformation to remove coalescing.
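
A sketch of what that transformation might look like (assuming the task definition is a plain dict with a 'routes' list and a payload 'supersederUrl', as in the coalescer patch above; this is not actual legacy.py code):

    def remove_coalescing(task_def):
        # Strip the coalesce route and the supersederUrl so the task is
        # never collapsed by the coalescer (field names assumed from the
        # coalescer patch attached to this bug).
        task_def['routes'] = [r for r in task_def.get('routes', [])
                              if not r.startswith('coalesce.v1.')]
        task_def.get('payload', {}).pop('supersederUrl', None)

    # ...called from the same "if project == 'try':" clause shown above,
    # next to remove_caches_from_task().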

The effort question is really one for Joel -- it's less about the cost of the compute time and more about the appearance to developers, I think.
Flags: needinfo?(dustin)
Sorry for the delay here. Before I push a patch for the legacy.py thing, I don't want to forget to ask Joel about the effort question as well.

:jmaher

As to the scheduled approach, I have one question. I may be completely off the rails here, but we've got several things under heavy development that are going to be deployed soon (SETA for builds, try-extender support, etc.) that will ease this anyway in the near future. With that in mind, why bother with the scheduling mechanism, since builds are cheap anyway?

How about allowing PGO builds (which are a <90-minute effort now) on every commit, but intentionally skipping some of the test jobs, which seem to be the more expensive part here (20+ hours of tests)? I'm not sure how/if that can be tweaked for the "on demand" case as well, but I thought I should ask here.

Sorry if I misunderstood something here.
Flags: needinfo?(jmaher)
Just to be clear, I'm already working on the hook thing; I just wanted to throw out the above question to make sure I've got things right.
SETA could skip the pgo jobs, but that is more of an end-of-July thing, maybe August.  Try-extender support will need more attention and we would need to smoothly support pgo on try; again, that full package is an end-of-July thing.  As a note, we have never used SETA for builds; that would be a feature we would need to add.

For the next 6-8 weeks, it would be nice if we used hooks or something to reduce the number of builds+tests we do for pgo, as:
1) they are not needed
2) it costs money - we just had to expand our pool of machines

It is important to run these regularly to ensure they stay green and track test values similar to those from buildbot.

So technically, we could avoid this and live with the cost of builds/tests running.  Solving this now sets us up for a good solution in the future.
Flags: needinfo?(jmaher)
:jmaher 

Thanks for the informed reply! Makes perfect sense now. I'm working on getting the hooks to work, ideally before London.
The in-tree optimization framework might be a good solution here, too.  You could consult the last matching job in the index and look at the time it was performed -- if less than 3h ago, optimize the job away.  It's a little tricky since you'd need to make the same checks for the tests, but it could be done.
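As a rough sketch of that check (assuming we already have a way to look up when the last matching indexed job ran; the names here are made up):

    from datetime import datetime, timedelta

    MAX_INDEXED_AGE = timedelta(hours=3)

    def should_optimize_away(last_indexed_run, now=None):
        # last_indexed_run: datetime of the most recent matching task
        # found in the index, or None if there is none.
        now = now or datetime.utcnow()
        if last_indexed_run is None:
            return False   # nothing recent to reuse, so run the job
        return (now - last_indexed_run) < MAX_INDEXED_AGE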
Just some quick thoughts on this:
* apologies for the long delay in updating the status here
* managed to get the related hook https://tools.taskcluster.net/hooks/#releng/pgo up and running during Mozlondon, even though at this moment it's just a blank hello-world job (I had some issues with the scopes at first)
* working towards understanding whether `mach taskgraph` can be pointed at an alternate template - as in a "nightly"-mode decision task that schedules only the PGO build. I don't have much progress yet, but IIUC <garbas> has already played a bit with hooks, so I might poke around for some help
That's not so much an alternative template as a new target_tasks_method; see http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/parameters.html#target-set
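For what it's worth, such a method could be tiny; here's a rough sketch (the (full_task_graph, parameters) signature and the label matching are my assumptions, not taken from the tree):

    def target_tasks_pgo_only(full_task_graph, parameters):
        # Sketch only: assumes the full graph's tasks are keyed by label
        # and that the PGO build labels contain 'pgo' (a guess, not
        # something taken from the tree).
        return [label for label in full_task_graph.tasks
                if 'pgo' in label]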
Any update on scheduling here?
Flags: needinfo?(mtabara)
Kind of abandoned this for a few days to deal with something releaseduty/release promotion related.
Hope to have some progress by mid next-week on this.
Apologies for the delay.
Flags: needinfo?(mtabara)
See Also: → 1289122
Took a little bit of time to better understand some of the taskcluster internals before jumping back to working with hooks.
Also played with the `mach taskgraph` command in bug 1289122 to familiarize myself with it.

Note to self:
Am currently attempting to finish up this bug in iterations:
* first, get the hook working on a scheduled basis using try syntax within the commit message
* then, tweak `mach taskgraph` to use a target_tasks_method to schedule only the PGO target task
* finally, understand how this can be tweaked for mozilla-inbound
Tried to get the hook working but keep failing with "errorCode: InputValidationError, statusCode: 400, requestInfo: method: updateHook params" when I update the hook definition https://tools.taskcluster.net/hooks/#releng/pgo with a modified version of the latest decision task. Time for me to poke for some help in #taskcluster.
Managed to get past the schema validation errors and create something almost like a decision task.
However, it seems like I've hit some issues:
1. I'm missing "assume:repo:hg.mozilla.org/try:*" in my releng/pgo hook. However, I do have "assume:repo:hg.mozilla.org/mozilla-central:*".
2. Not sure how to insert {from_now} tags for datetime strings (e.g. the payload artifacts' expiration attribute).
3. I attempted to run plain <make taskgraph> to see if it works, and for some reason the task exits with code 1. The logs aren't very generous, so I'll keep investigating.

Will poke for some help in #taskcluster.
I fixed #1.

#2 isn't supported -- but artifact expiration matches task expiration by default, I believe.

#3 there's no `make taskgraph` command -- it's `mach taskgraph <subcommand>`.
Finally making some progress here, thanks to dustin!

re: earlier questions:

#1 works now, all good

#2 hm. If I omit expires, it's not added to the task definition. But it doesn't complain either. So it's good for now :-)

#3 apologies, I had a typo in the bugzilla comment. I specified <mach taskgraph> in the task definition the right way, but am still getting a failed task.

Note to self: after dustin debugged on a loaner, it seems $GECKO_BASE_REPOSITORY was indeed missing. I should've looked in https://dxr.mozilla.org/mozilla-central/source/testing/docker/decision/bin/checkout-gecko first.

Also, gps landed some changes to the decision image, including upgrading it twice. After I reused the latest decision task definition and changed the user info accordingly (to myself), it worked nicely - https://tools.taskcluster.net/task-inspector/#ZM1PdEusQE-3B4vjdoADPg/ created a TC Linux64 PGO build - https://tools.taskcluster.net/task-inspector/#AEacL_U2S9CD2y9q3iJl-Q/ - and an extra eslint job too :)

Will follow-up with some questions in a bit.
So the next questions that come to mind are:

1. How do I tweak DECISION_ARGS to grab the latest changes from the default head? Something like rev=$(hg parent --template '{node}'). The existing task just uses some existing job info from try.

2. How do I make this work for mozilla-inbound? My first thought is that if:
> GECKO_HEAD_REPOSITORY is properly pointed to mozilla-inbound,
> GECKO_HEAD_REF and GECKO_HEAD_REV to the latest green revision,
> <project> points to "mozilla-inbound",
> <pushlog-id> TO-BE-DETERMINED,
> <head_repository> points to mozilla-inbound,
> <head-ref> and <head-rev> are the same as GECKO_HEAD_REF and GECKO_HEAD_REV,

.... we may end up running that build under mozilla-inbound?

Note to self: I still need to tweak the extra dict in the hook task definition + routes accordingly once all this is sorted out.

Excerpt from IRC:

21:11:22 <dustin> mtabara: I think that should be done explicitly in the decision task
21:11:39 <dustin> mtabara: through some sort of special command-line arg
21:11:56 <dustin> and I'd prefer that the *result* of that command be recorded in parameters.yml, but I could be convinced otherwise

3. "explicitly in the decision task through some sort of special cmd line arg" -> as in something that gets evaluated in https://tools.taskcluster.net/hooks/#releng/pgo task definition?

4. Uhm, isn't the parameters list ... fixed? http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/parameters.html

Apologies for so many questions. And above all, sorry if I went completely off rails with the any of logic above.
Flags: needinfo?(dustin)
My thinking for DECISION_ARGS is to include something like `--calculate-latest-green-ref`.  When the decision task sees that option, it calculates the latest green ref, then sets the `head_ref` and `head_rev` parameters appropriately.  So if I download the `parameters.yml` from a decision task, I build with the same rev it used, rather than determining a new latest green rev.  I think that answers #1, #3, and #4.
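
In pseudo-Python, the flow I have in mind is roughly this (names are illustrative; find_latest_green_rev is the yet-to-be-written piece):

    def set_revision_parameters(options, parameters, find_latest_green_rev):
        # Rough sketch of the flow described above; the option name and
        # the find_latest_green_rev helper are hypothetical placeholders.
        if options.get('calculate_latest_green_ref'):
            rev = find_latest_green_rev(parameters['project'])
            parameters['head_ref'] = rev
            parameters['head_rev'] = rev
        # Writing the resolved value back into parameters means a
        # downloaded parameters.yml rebuilds the exact same revision.
        return parameters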

It would probably be friendly to add a `./mach taskgraph calculate-latest-green --project=<project>` command to help with debugging that code, since otherwise it would only run in decision tasks.  And of course some awesome unit tests :)

Regarding #2, I'm not sure what you mean by "under mozilla-inbound".  The bits where the project are relevant are:

 - scopes (assume:repo:hg.mozilla.org/<whatever>, which generally translate to assume:moz-tree:level:L)
 - where to pull the rev from (GECKO_HEAD_REPOSITORY and GECKO_BASE_REPOSITORY)
 - target task method (try option syntax for the `try` project, etc.)
 - per-project tweaks (which we want to minimize, but for example always clobbering on try)

is there one of those in particular that you're thinking of?
Flags: needinfo?(dustin)
Hacking on <mach taskgraph> to add some functionality sounds pretty cool!

@dustin:
I'd like to sign up for that if that's possible. If yes, should I add a separate bug to track it?
I think it would fit nicely in this bug.
Sounds good, thanks! I'll be back with some comments & questions once I better understand what I'm up against. I need to dig a bit in the code beforehand.
@dustin: Sorry for the delay with a status update here; I got derailed by something else today.

So I looked a bit at the code and have a few questions, if possible:

1. Regarding this "nightly" behavior that we want for these PGO builds - what should we consider as "latest green"? Do test failures count? Is there a similar behavior I can look at? (AFAIK we've used that for nightly builds.)

2. Regarding its implementation, I was thinking of adding the 'calculate-latest-green-ref' logic in a function under taskcluster/taskgraph/decision.py that would be reused in taskcluster/mach_commands for the `./mach taskgraph calculate-latest-green --project=<project>` subcommand.

3. Supposing I've got #1 and #2 already solved, I'm still missing a piece of the puzzle.
Before the logic hits `mach taskgraph decision`, it will first need to check out gecko - https://dxr.mozilla.org/mozilla-central/source/testing/docker/decision/bin/run-decision?q=run-decision&redirect_type=direct#16 . Which GECKO_HEAD_REV and GECKO_HEAD_REF are we going to use in https://tools.taskcluster.net/hooks/#releng/pgo in this case?

4. (off-topic) Is there a method to version the source of hooks? Do we define them in-tree or just in the web UI?

Thanks in advance and sorry for so many questions.
1. I don't know the answer to this.  I think there is some existing code to do this in the Buildbot schedulers?

2. Yes, that sounds good.

3. I think that we will always need to check out the latest commit ("tip"? "default"?) of mozilla-central; the decision logic would then find the latest green rev and create tasks for that rev.  Another alternative may be to run a task on the latest commit that *just* determines the latest green rev, then runs another task with that rev spelled out explicitly.

4. No, hooks are just JSON blobs.  Ideally the hook itself will be pretty dead simple and all of the interesting bits will be in the task definitions in-tree.
Oh, for #4, the best design we have for how to do that is a hook that just runs a hard-coded task that knows how to download the latest ${topsrcdir}/.cron.yml file and interpret it.  We don't have to get there immediately, but that's the idea.
Note to self:

So I had a detailed conversation with Coop yesterday to break this into pieces. Turns out I was confused with respect to these hooks/PGO builds:
* there are nightly builds, which actually need the latest green revision to run
* there is the current PGO build on inbound, which is what this bug refers to

Having PGO builds on every commit on mozilla-central is understandable, as the builds are already green and have passed a check in mozilla-inbound. The frequency issue is a problem for mozilla-inbound only, where we expect more traffic and could use some resource trimming. Sheriffs would benefit from that too. As to mozilla-central, once the commits are merged from inbound, it's unlikely that we'll run PGO builds on inbound for that specific revision, since the opt builds have already been green.

Our proposal is to run PGO builds on every commit on inbound (already happening) but *not to run the tests by default*. With the build at hand, sheriffs should find it easier to trigger tests later on, if needed. Additionally, our build machine capacity is quite generous and build resources are cheap, hence I assume it's safe to say that doing PGO builds on every commit, which are a <90-minute effort now, should not be a problem. This leaves us with a separate problem to deal with at https://tools.taskcluster.net/hooks/#releng/pgo, which is triggering BUILDS + TESTS on a scheduled basis (say every 3h or 5 commits or whatever).

In conclusion, the proposal would be:
* mozilla-central is out of the equation in this case - leave it as it is: PGO builds + tests on every commit.
* mozilla-inbound - run PGO builds for every commit but disable the tests. Set up the hook to trigger BUILD + TESTS on a scheduled basis.


@dustin:
* if we're going down this path, how should I track the PGO triggering to figure out the threshold params (time, number of commits, etc.) for running a new one?

@gbrown/jmaher:
* just to double-check: if we have the PGO builds, it's safe to say that sheriffs will easily be able to trigger the tests against that build later on if needed, right?
* if we are to go down this path of disabling the tests, are they part of the build graph or a separate graph? How are tests scheduled in the graph?

This should make sheriffs' lives easier, as they won't need to regenerate PGO builds in the event of a merge issue. We should definitely bring them into this conversation too, but I thought it better to ask for your second thoughts first.

Thanks!
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Flags: needinfo?(dustin)
It sounds like the coalescer service may be what you need here.  Or something like it.  We already have SETA (/cc joel) and Coalescer both trying to handle load-trimming, and I'd hate to add a third service trying to do the same thing.
Flags: needinfo?(dustin)
(In reply to Mihai Tabara [:mtabara] from comment #38)
> In conclusion, the proposal would be:
> * mozilla-central is out of the equation in this case - leave it as it is:
> PGO builds + tests on every commit.
> * mozilla-inbound - run PGO builds for every commit but disable the tests.
> Set up the hook to trigger BUILD + TESTS on a schedule basis

To be clear, let's explicitly talk about all the other repos too:
 - pgo build per commit + scheduled pgo tests on all integration repos: mozilla-inbound, fx-team, autoland, ...?
 - pgo build+test per commit on mozilla-central, mozilla-aurora, mozilla-beta, ...?
 - pgo build+test per commit on project repos like ash, cedar, ...?

> @gbrown/jmaher:
> * just to double check, if we have the PGO builds, it's safe to say that
> Sheriffs will easily trigger the tests against that build later on if
> needed, right?

I think so. I'm thinking "Add new jobs" from treeherder would work, right?
Flags: needinfo?(gbrown)
(In reply to Dustin J. Mitchell [:dustin] from comment #39)
> It sounds like the coalescer service may be what you need here.  Or
> something like it.  We already have SETA (/cc joel) and Coalescer both
> trying to handle load-trimming, and I'd hate to add a third service trying
> to do the same thing..

@dustin:

Yeah, makes sense. AFAIK the coalescer is more about queue collapsing under load - we've already got one working for mozilla-inbound PGO builds (https://coalesce.mozilla-releng.net/v1/threshold) - so I'll poke around to learn more about SETA. AFAIK we've already got some working examples there, so it should be straightforward once we decide to do that.

(In reply to Geoff Brown [:gbrown] from comment #40)
> (In reply to Mihai Tabara [:mtabara] from comment #38)
> > In conclusion, the proposal would be:
> > * mozilla-central is out of the equation in this case - leave it as it is:
> > PGO builds + tests on every commit.
> > * mozilla-inbound - run PGO builds for every commit but disable the tests.
> > Set up the hook to trigger BUILD + TESTS on a schedule basis
> 
> To be clear, let's explicitly talk about all the other repos too:
>  - pgo build per commit + scheduled pgo tests on all integration repos:
> mozilla-inbound, fx-team, autoland, ...?
>  - pgo build+test per commit on mozilla-central, mozilla-aurora,
> mozilla-beta, ...?
>  - pgo build+test per commit on project repos like ash, cedar, ...?

 
Good idea! Yep, that's what I have in mind too:

1. main repos: mozilla-central, mozilla-aurora, mozilla-beta, mozilla-release, etc
PGO build + tests per commit.

2. integration repos:
PGO builds per commit + scheduled tests

3. project repos: ash, cedar, etc
To be honest, I'm not sure, but I guess PGO build + tests per commit makes sense?

Now, taking into consideration Dustin's observation above, I think the new approach would be something like:
1. for main repos: do nothing, leave things as they are
2. for integration repos: leave builds + tests per commit as they currently are, but add (SETA + coalescer) for test load-trimming
3. for project repos: do nothing, leave things as they are

This way: a) we always run PGO builds for every project since they are cheap; b) for integration repos only, we do not touch the test scheduling mechanism but rather use existing tools to load-trim the tests; c) sheriffs can trigger tests as needed in this particular case later on.

So if my understanding is correct, I guess the hook is no longer needed at this point.

> > @gbrown/jmaher:
> > * just to double check, if we have the PGO builds, it's safe to say that
> > Sheriffs will easily trigger the tests against that build later on if
> > needed, right?
> 
> I think so. I'm thinking "Add new jobs" from treeherder would work, right?

Yep, that's what I had in mind too.
I don't think we can assume that the sheriffs will add PGO jobs.  "Add new jobs" is great for one or two jobs, not 50+ jobs/platform.  There is something which exists in buildbot for 'trigger all tests', but that is limited to opt, I believe.  The current workflow is to manually add pgo jobs only prior to a merge, when they think there is a green changeset.  As this is a 1-2 times/day event, that would be a lot of missing pgo data unless the sheriffs triggered more.  Still, we need to solve the problem of making it a single button click to trigger tests, otherwise this is a non-starter.

I really like the idea of pgo builds being available.  Would this apply to linux/windows (when all are on taskcluster)?  If so, changing our workflows to meet this proposal sounds more and more plausible.

What is interesting is the concept of using SETA to reduce pgo tests assuming we have more frequent builds.  A few thoughts here:
1) we could use SETA to reduce PGO builds to every so often; then all tests would be scheduled whenever a build completes
2) SETA will effectively find little to no unique failures in PGO tests, therefore all PGO-related tests will be considered low-value and not run as often.  In fact, here we would probably use SETA's preseed.json concept to force all PGO-related jobs into the low-value state.

If we do design SETA to be more flexible, in the sense of multiple priorities and/or schedules, we could really make it work well for reducing test overhead.  Of course, this means SETA would be doing more than it was intended to, by using predetermined information (load/end-to-end time) to determine which jobs to run.

:armenzg, do you know how much work it would be to add functionality into treeherder to schedule all pgo test jobs for a specific revision on taskcluster?
Flags: needinfo?(jmaher) → needinfo?(armenzg)
(In reply to Mihai Tabara [:mtabara] from comment #41)
> 3. project repos: ash, cedar, etc
> To be honest, I'm not sure, but I guess PGO build + tests per commit makes
> sense?

If the project branch has nightlies, it should be treated as if it were in bucket #2. Project branches should have the lowest priority, so I'm expecting SETA will skip the majority of PGO tests here.

We could get away with not even offering PGO builds on non-nightly project branches, but that just introduces potential merge problems, and really that's the reason we decided to do PGO builds on every push anyway.

(In reply to Joel Maher ( :jmaher ) from comment #42) 
> I really like the idea of pgo builds being available.  Would this apply to
> linux/windows (when all are on taskcluster)?  If so, changing our workflows
> to meet this proposal sounds more and more plausible.

Our thinking is that build capacity is cheap, and it's test capacity that is expensive. If we can save sheriffs the time of generating the builds when they do need to bisect/back out, that's a small win.
  
> :armenzg, do you know how much work it would be to add functionality into
> treeherder to schedule all pgo test jobs for a specific revision on
> taskcluster?

Yes, the back-filling ability is the key here. 

Here's where I betray my lack of knowledge about how the current system works: would it be helpful to generate the complete testing graph as an artifact for each PGO build task and simply not run it? That way you wouldn't need to compute anything when you go to back-fill later.
I will leave the question of full vs optimized task graph to Dustin and Armen.

Keep in mind that SETA as designed is only going to work on integration branches (mozilla-inbound/fx-team/autoland).  If we want it to work on more branches, we need to rethink things.  That is something I overlooked in this strategy.  Also, what do we do about try server?  Do we have a way to specify a buildtype of 'p' as we do 'd' for debug and 'o' for opt?
For reference's sake (as it took me a bit to find the jobs), this is what running opt + pgo builds looks like on Treeherder:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Linux%20x64%20opt%20Submitted%20by%20taskcluster%20%5BTC%5D%20Linux64%20Opt%20tc(B)&selectedJob=33410715

Using SETA is a fine approach. However, I would also like to mention the pulse_actions approach, since it has some minor advantages. The reason SETA is mentioned is that both approaches can achieve the same effect.

Adding tests to a PGO build is no different from adding new jobs.
Instead of a human requesting the right labels, we have a system that determines which pushes get to have PGO tests added.
We would only need to listen for completion of gecko tasks.
The system could even deal with scheduling the build, so you don't have to run the PGO build on every push.
The system could even re-trigger PGO tests a second time if they fail once.
It could even notify sheriffs if a second test run came back orange for them to investigate/backfill.

Having PGO builds running for every push is the same as saying "we're OK with spending more than double on build costs for integration repositories" (40-50 mins + ~90 mins). I don't pay for it, so I'm OK with it, and it makes any solution easier.


> 1) we could use SETA to reduce PGO builds every random time, then all tests would be scheduled if a build is completed
Yes, this should not be difficult. You don't even need to check data from SETA for low/high values (as you mentioned).


> There is something which exists for buildbot of 'trigger all tests', but that is limited to opt I believe.
We can add "trigger all pgo missing tests" to Treeherder.
It should take between a week to two weeks.


> Here's where I betray my lack of knowledge about how the current system works: would it be helpful to 
> generate the complete testing graph as an artifact for each PGO build task and simply not run it?
> That way you wouldn't need to compute anything when you go to back-fill later.
That's more or less how it works. We create an artifact on the gecko decision task which the action tasks will use to see "what else could have been scheduled?".

> Do we have a way to specify a buildtype of 'p' as we do 'd' for debug and 'o' for opt?
AFAIK we don't atm.

(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> I guess I don't understand the issue -- if the PGO tests take 20+ hours to
> finish, surely they're not of much value on *any* push?  Why bother running
> them at all?  Or why not run them but hide them by default, and only check
> occasionally?
PGO tests are the same as opt tests.
20+ hours refers to the sum of all the test jobs that a build would trigger.
Flags: needinfo?(armenzg)
We're still figuring out how we'll actually do this, but I think we're in agreement that we'll always create PGO builds and find a way to scale back the testing on those builds.

mtabara: are you the correct person to continue driving this?
Summary: Don't run PGO builds on every commit → Don't run PGO tests on every build
To sum this up, IIUC:

0. Using hooks is not the way to go anymore here. 

1. We're all in agreement that we'll always create PGO builds since our build capacity is cheap.

@jmaher:
2. Having said that, I'm trying to understand which of SETA or the pulse_actions approach is the way to go in order to move forward. Armen mentioned above that adding "trigger all pgo missing tests" to Treeherder should take between one and two weeks. Does letting the sheriffs add PGO jobs whenever they want, with a single button click to trigger tests, mean we've got a third plausible option here besides SETA and the pulse_actions approach? (As in: always have PGO builds with tests disabled by default, and let the sheriffs schedule the tests whenever they want.)

3. If SETA is the way to go, does our proposal become the two options jmaher mentioned above?

(In reply to Joel Maher ( :jmaher) from comment #42)
> What is interesting is the concept of using SETA to reduce pgo tests
> assuming we have more frequent builds.  A few thoughts here:
> 1) we could use SETA to reduce PGO builds every random time, then all tests
> would be scheduled if a build is completed
> 2) SETA will effectively find little to no unique failures in PGO tests,
> therefore all PGO related tests will be considered low-value and not run as
> often.  In fact here, we would probably use SETA's preseed.json concept to
> force all PGO related jobs into the low value state.

Also, what should we do for project branches in this case? For integration repos we've got the options above; for main repos (m-c, m-a, m-b, m-r, etc.) we have PGO builds + tests per commit by default; but what about the project branches? If I understood correctly, SETA is designed to work on integration branches only.
Flags: needinfo?(jmaher)
Thanks for the great summary here.

I believe we need to have a method for sheriffs to trigger all pgo tests before reducing the tests.  On-demand pgo generation is important.  Doing this is really pulse_actions - just an interface with a button that uses pulse_actions to extend the task graph.

As for SETA, that will reduce the jobs, but it will require some work on the SETA side to put in a hack to force ALL PGO jobs to be considered low-value.  Once we have that, we need to ensure that PGO jobs are included in the periodic runs of low-value jobs; that shouldn't be too hard.  Overall, it is probably one week of work on the SETA side to make this happen - but that cannot happen this week or next; we have a few weeks of SETA work to make it work in taskcluster and finish our deployment story, as SETA is fragile right now.

Lastly, let's discuss the other branches.  SETA only works on integration branches, not try, project branches, or release branches.  For release branches, I see no concern.  For project branches and try there could be some issues.  I wish we had a pgo build option on try; currently we have opt and debug, but right now if you build opt there will be both opt and pgo builds.  Adding a 'p' option to the build type will be confusing until everything is on taskcluster.  We could make SETA work on project branches to limit pgo jobs only - my understanding is that the project branches are for developing certain large features; some branches are very low volume and would benefit from the pgo jobs, for others it might not matter.  How many project branches do we have?  What is the volume of pushes on those branches?  Maybe a breakdown per month for the last 6 months would help us understand the issues on project branches better.

It sounds like we need to file a bug for pulse_actions to support a 'trigger all pgo jobs' action, and another bug for SETA to force all PGO jobs to low-value.
Flags: needinfo?(jmaher)
Apologies for the delay in answering here; I got derailed by some other unrelated work.

@jmaher:
Thanks for the clear answer!
I really like this approach of taking things step by step. So it sounds like we need to break things down in this order:

1. Before trimming anything, make sure we have on-demand pgo generation, i.e. the pulse_actions button for sheriffs to trigger all pgo tests.

2. Hack SETA to force all PGO jobs to be considered low-value.

3. Hack SETA even more to ensure PGO jobs are included in the periodic runs of low-value jobs.

4. Gather data on project branches to figure out what to do there:

(In reply to Joel Maher ( :jmaher) from comment #48)
> some branches are very low volume and would benefit from the pgo jobs, others it
> might not matter.  How many project branches do we have?  what is the volume
> of pushes on those branches?  Maybe a breakdown per month for the last 6
> months would help understand the issues on project branches better.

Sounds like a good starting point. I'll have a look at that. Do we need a rough estimate or something more precise here? As in, should I poke some people around, or is an estimate good enough?

Misc:
* we might need to let 2) and 3) wait in line a few more days, as SETA is fragile right now
* the bug for 1) should be a blocker for this bug as well as for the bugs filed for 2) and 3)

I'll go ahead and file the bugs. Is there anything above I could help with or take responsibility for, or should I leave it to #ateam? Also, since the TC hooks are no longer of use here, should we change the bug's component to Release Engineering - General Automation?
Depends on: 1297690
Depends on: 1297692
Depends on: 1297694
It will be a while before SETA support can be tweaked; we need to get fully migrated to Heroku and proven working on taskcluster first.  We will probably have support for pgo in there, but might limit its use until we have everything proven.

I would move things to testing::general.
Component: Hooks → General
Product: Taskcluster → Testing
(In reply to Joel Maher ( :jmaher) from comment #50)
> it will be a while before SETA support can be tweaked, we need to get
> migrated fully to heroku and proven working on taskcluster first.  We will
> probably have support for pgo in there, just might limit the use of it until
> we have everything proven.

Sounds good. We'll put this on hold for a while.
Meanwhile, I'll try to gather all that data with respect to project branches.
Mihai: did you collect the PGO data about the project branches?
Also note that I made a modification to this to avoid running coalescing on try in bug 1286075.  Unfortunately I can't request review from mihai because of a bug in MozReview:
  https://reviewboard.mozilla.org/r/75706/diff/2#index_header
(In reply to Chris Cooper [:coop] from comment #52)
> Mihai: did you collect the PGO data about the project branches?

Bringing gps into the conversation.

:gps

Context: we're talking about PGO builds/tests and how they should behave on the project branches in the context of the TC migration. For release branches we have no concern; for integration branches we're tweaking SETA and pulse_actions; but we've yet to reach an agreement on project branches, as we need to gather some more data.

From my understanding, the project branches are for developing important features; some are low-volume, while others are the opposite. My questions are:
* how many project branches do we have?
* what is the volume of pushes on those branches?
* is there any way I can get a breakdown per month for the last 3-6 months?

Is there any automated way I can get this info, or do you already have a centralized tool that gathers all this data?
Flags: needinfo?(gps)
(In reply to Dustin J. Mitchell [:dustin] from comment #53)
> Also note that I made a modification to this to avoid running coalescing on
> try in bug 1286075.  Unfortunately I can't request review from mihai because
> of a bug in mozreview
>   https://reviewboard.mozilla.org/r/75706/diff/2#index_header

lgtm, r+ ;)
We don't have a great "dashboard" for hg.mo data. However, it is possible to mine the data.

The project "branches" are mostly all located under https://hg.mozilla.org/projects/.

If you install the "firefoxtree" and "mozext" extensions from the version-control-tools repo, you can aggregate pushlog data locally to get some useful info.

  $ hg clone -U https://hg.mozilla.org/mozilla-unified pushlog-mining
  $ cd pushlog-mining
  $ hg pull twigs
  # Wait several minutes for it to pull all the project repos and synchronize their pushlogs

If you are comfortable with Mercurial revsets and templates, run `hg help -e mozext` to see what revsets and templates are available to query and format pushlog data. e.g.

  # Print tip changeset of pushes to the cypress repo
  $ hg log -r 'pushhead(cypress)' -T '{node}'
  
  # Print dates of pushes for tip changesets of pushes to the cypress repo
  $ hg log -r 'pushhead(cypress)' -T '{dates(pushheaddates, "%Y-%m-%d", " ")}\n'

You can also open .hg/changetracker.db and look at the SQLite data yourself.
Flags: needinfo?(gps)
@gps: thanks a lot for the useful information!

I'll follow the above instructions to come up with some data here, in order to decide how to proceed with PGO builds/tests on project branches.
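
For example, a small helper along these lines could turn the pushheaddates output above into the per-month breakdown (a sketch, assuming it runs inside the pushlog-mining clone and that the output matches the template gps showed):

    import subprocess
    from collections import Counter

    def pushes_per_month(repo_tag):
        # Count pushheads per month for a project repo (e.g. 'cypress'),
        # using the pushheaddates template shown by gps above.
        out = subprocess.check_output([
            'hg', 'log',
            '-r', 'pushhead(%s)' % repo_tag,
            '-T', '{dates(pushheaddates, "%Y-%m-%d", " ")}\\n',
        ]).decode()
        months = Counter()
        for line in out.splitlines():
            for date in line.split():
                months[date[:7]] += 1   # 'YYYY-MM'
        return months

    # e.g. pushes_per_month('cypress') -> Counter({'2016-05': 12, ...})
    # (numbers illustrative)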
@jmaher: good job on fixing bug 1297692 and bug 1297694!

I still owe some project branch data gathering here - apologies, I abandoned the work a couple of weeks back as higher-priority stuff came along. I hit an issue with `hg pull twigs` but didn't escalate it properly. I'll resume the data gathering as soon as I can.
can we mark this completed?
(In reply to Joel Maher ( :jmaher) from comment #59)
> can we mark this completed?

Summoning Coop for some second thoughts here.

@jmaher:
Not sure if there's anything left we should be addressing for project branches. Do bug 1297692 and bug 1297694 cover us for all branches?
Flags: needinfo?(coop)
I think we can call this done.

Unless there's an outstanding reason not to, we should run PGO builds on project branches same as we do on integration branches and use SETA aggressively if we're worried about volume.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED
See Also: → 1382204