Open Bug 1521515 Opened Last year Updated 5 months ago

Allow instance types to be specified when retriggering/backfilling from Treeherder

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(Not tracked)

People

(Reporter: davehunt, Unassigned)

Details

To assist investigation of build metrics regressions, it would be useful to have a way to enforce the instance type used when pushing to try. See bug 1513173 comment 19 for more context.

Component: General → Task Configuration
Product: Taskcluster → Firefox Build System

You can select a worker type (any gecko-1-b-*), but if that workerType uses several EC2 instance types, then you can't control which worker (and thus which instance type) will get the task. If you want to be able to do that, you'll need to reduce the worker type to use a single instance type (with the potential for increased cost).

If we define 2 new worker types, gecko-3-perf-linux and gecko-3-perf-win2012, and limit them to a single EC2 instance type, would they be able to run all kinds of build tasks?

By build tasks I mean: pgo, debug, opt, asan, etc.

Yes, they would. We have several options on the table:

  1. limit the instance type of the existing worker types (pick the most commonly used)
  2. create new worker types, like you suggest, specifically for performance monitoring

It depends mostly on usage - if all tasks are providing performance metrics, and we end up running tasks on both the standard worker type and the performance worker type, that wouldn't be good. But if we only care about performance monitoring on a subset of tasks, we can filter those tasks off to a dedicated worker type, to ensure they are pinned to a given instance type.
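The "filter those tasks off to a dedicated worker type" idea can be sketched in a few lines. This is a hypothetical illustration only: the `gecko-3-perf-linux` name comes from the comment above, but the `perf-monitor` attribute and the routing function are assumptions, not real in-tree configuration.

```python
# Hypothetical sketch: route only perf-monitored tasks to a dedicated,
# instance-type-pinned pool. The "perf-monitor" attribute is an
# illustrative assumption, not a real task attribute.

STANDARD_WORKER = "gecko-3-b-linux"   # mixed EC2 instance types
PERF_WORKER = "gecko-3-perf-linux"    # pinned to one EC2 instance type

def pick_worker_type(task_attributes):
    """Filter perf-monitored tasks off to the dedicated worker type;
    everything else stays on the standard pool."""
    if task_attributes.get("perf-monitor", False):
        return PERF_WORKER
    return STANDARD_WORKER
```

Under this sketch, only the subset of tasks flagged for performance monitoring would ever land on the pinned pool, which is the usage pattern the comment above calls out as the good case.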

As I understand it, the worker types this bug covers are the gecko-3-b-{platform} worker types.

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

As I understand it, the worker types this bug covers are the gecko-3-b-{platform} worker types.

I need to skim the other perf alerts to confirm that.

Summary: Allow instance types to be specified when pushing to try → Allow instance types to be specified when retriggering/backfilling from Treeherder

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)

I need to skim the other perf alerts to confirm that.

The gecko-{1,2}-b-{platform} worker types should also be affected, and also likely the mobile-{1,2,3}-b-* worker types which recently spun off from those.

From a simplicity standpoint, I'm fine to optimize on a single instance type. However, we need to answer some questions first:

  1. Should we optimize for price or performance? I don't think we should simply settle for whatever instances we end up with (the most commonly used), but rather make an explicit choice. The cheapest is likely not the most performant.

  2. Using multiple instance types helps us avoid being starved for capacity. Are we OK with occasional (how occasional TBD) outages due to this?

  3. Because these are build machines and we need fewer of them relative to testers, would we be better served by using fewer, higher-spec, reserved instances to better leverage local caches, etc?

(In reply to Pete Moore [:pmoore][:pete] from comment #4)

Yes, they would. We have several options on the table:

  1. limit the instance type of the existing worker types (pick the most commonly used)

This is an ideal option for me, but given :coop's questions above, the side effects may be too severe. I don't like the idea of capacity starvation. I don't even think it's my call to answer that.

  2. create new worker types, like you suggest, specifically for performance monitoring

It depends mostly on usage - if all tasks are providing performance metrics, and we end up running tasks on both the standard worker type and the performance worker type, that wouldn't be good.

I'm not sure I'm following well enough. Why wouldn't this be good?
The way I see things is this (correct me if I'm wrong): create an extra minimal pool of performance-specific worker types, with a single instance type per OS platform, which are able to do all build tasks. Basically, they claim tasks as regular workers.
But when a Perf sheriff spots build time increases and issues retriggers on a Treeherder job, these workers will claim all those tasks. When those are finished, the workers return to claiming regular tasks from the Queue service.

But if we only care about performance monitoring on a subset of tasks, we can filter those tasks off to a dedicated worker type, to ensure they are pinned to a given instance type.

Well, we care about all build tasks. We could consider that this subset of tasks consists of Perf sheriff-triggered tasks (job retriggers).

(In reply to Chris Cooper [:coop] pronoun: he from comment #8)

  2. Using multiple instance types helps us avoid being starved for capacity. Are we OK with occasional (how occasional TBD) outages due to this?

Could you elaborate more on this issue? Why would these occasional outages happen?

Do EC2 instance types have a limited capacity? What I mean by this: assuming AWS offers weak/medium/powerful machine specs, are there times when there aren't any medium machines available?

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #10)

Could you elaborate more on this issue? Why would these occasional outages happen?

Do EC2 instance types have a limited capacity? What I mean by this: assuming AWS offers weak/medium/powerful machine specs, are there times when there aren't any medium machines available?

Yes, even AWS has finite capacity. We rarely hit it, but since we are occasionally spinning up 1000s of instances, it does happen. Usually this is because we hit our bid ceiling, i.e. due to increased demand in the pool, the cost of the instances goes above what we are willing to pay. This is where having multiple instance types in multiple regions/availability zones (AZs) can save us.

If you look at https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux/ as an example, you'll see that we use multiple different instance types (c5d.4xlarge, m5d.4xlarge, c5.4xlarge, m5.4xlarge) to supply that single worker type. The performance characteristics of each of those underlying instance types are different. This method of ensuring pool capacity makes it hard to decipher perf issues, which is why Dave is asking in comment #0 for a way to target each type directly.

My posit is that, rather than make it possible to select a particular instance type, we should align on a single instance type -- for linux builds I would choose c5d.4xlarge -- and then spread our coverage across more regions/AZs to provide some resilience.
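The before/after here can be pictured with a simplified worker type definition. The field names below loosely mirror the AWS provisioner's worker type schema, but the exact shape and the utility values are assumptions for illustration:

```python
# A simplified, hypothetical picture of what a worker type definition
# like gecko-1-b-linux carries in the AWS provisioner: several EC2
# instance types backing a single worker type. Field names loosely
# mirror the provisioner's schema; values are illustrative.
worker_type = {
    "workerType": "gecko-1-b-linux",
    "instanceTypes": [
        {"instanceType": "c5d.4xlarge", "utility": 1.0},
        {"instanceType": "m5d.4xlarge", "utility": 1.0},
        {"instanceType": "c5.4xlarge",  "utility": 1.0},
        {"instanceType": "m5.4xlarge",  "utility": 1.0},
    ],
}

# Aligning on a single instance type amounts to dropping all but one
# entry from the list:
pinned = dict(worker_type, instanceTypes=[
    it for it in worker_type["instanceTypes"]
    if it["instanceType"] == "c5d.4xlarge"
])
```

The resilience then has to come from spreading that one instance type across more regions/AZs rather than from mixing instance types.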

(In reply to Chris Cooper [:coop] pronoun: he from comment #11)

My posit is that, rather than make it possible to select a particular instance type, we should align on a single instance type -- for linux builds I would choose c5d.4xlarge -- and then spread our coverage across more regions/AZs to provide some resilience.

Thanks a lot for the detailed explanation!
I now understand this better. So this is the option we'll go with :)

(In reply to Chris Cooper [:coop] pronoun: he from comment #8)

From a simplicity standpoint, I'm fine to optimize on a single instance type. However, we need to answer some questions first:

  1. Should we optimize for price or performance? I don't think we should simply settle for whatever instances we end up with (the most commonly used), but rather make an explicit choice. The cheapest is likely not the most performant.

From the Perf sheriffing perspective, weaker specs would make build time changes more obvious, so we would opt for the lower price.
But this isn't that big a deal. We would rather go with your suggestions, such as picking c5d.4xlarge for Linux.

  2. Using multiple instance types helps us avoid being starved for capacity. Are we OK with occasional (how occasional TBD) outages due to this?

I'm ok with that. Using some extra day or two for retriggering isn't too bad.

  3. Because these are build machines and we need fewer of them relative to testers, would we be better served by using fewer, higher-spec, reserved instances to better leverage local caches, etc?

Yes, I think using fewer, higher-spec & reserved instances would serve us better.

Do you have enough data to proceed with the changes the Perf team requires?

Flags: needinfo?(coop)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #13)

From the Perf sheriffing perspective, weaker specs would make build time changes more obvious, so we would opt for the lower price.
But this isn't that big a deal. We would rather go with your suggestions, such as picking c5d.4xlarge for Linux.

Builds need to finish before tests can run, and we run many more tests than builds, as I'm sure you're aware as a perf sheriff. I'll always opt for faster builds in CI. The number of people working on build performance is relatively small and we can make individual config arrangements for them if necessary.

  3. Because these are build machines and we need fewer of them relative to testers, would we be better served by using fewer, higher-spec, reserved instances to better leverage local caches, etc?

Yes, I think using fewer, higher-spec & reserved instances would serve us better.

That seems at odds with your preference for lower price above. Higher spec instances will be more expensive.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #14)

Do you have enough data to proceed with the changes the Perf team requires?

Your preferences are noted, but we're talking about a decision that could potentially cost Mozilla thousands of dollars a month and possibly some developer downtime due to capacity.

I'll float the proposal to align on a single, more performant instance type for builds at the CI mtg on Thursday (Mar 21). I can make the changes quickly thereafter once the other stakeholders have weighed in. I'll reflect their notes back here, if any.

Flags: needinfo?(coop)

(In reply to Chris Cooper [:coop] pronoun: he from comment #15)

I'll float the proposal to align on a single, more performant instance type for builds at the CI mtg on Thursday (Mar 21). I can make the changes quickly thereafter once the other stakeholders have weighed in. I'll reflect their notes back here, if any.

So, we're going to try this, and re-evaluate after a week.

I'm gathering the baseline data now (both cost and perf) so we can reason about the outcome, and will limit each builder to a single instance type on Monday.

(In reply to Chris Cooper [:coop] pronoun: he from comment #16)

(In reply to Chris Cooper [:coop] pronoun: he from comment #15)

I'll float the proposal to align on a single, more performant instance type for builds at the CI mtg on Thursday (Mar 21). I can make the changes quickly thereafter once the other stakeholders have weighed in. I'll reflect their notes back here, if any.

So, we're going to try this, and re-evaluate after a week.

I'm gathering the baseline data now (both cost and perf) so we can reason about the outcome, and will limit each builder to a single instance type on Monday.

Nice. I'll be waiting for the outcome results then.

(In reply to Chris Cooper [:coop] pronoun: he from comment #16)

I'm gathering the baseline data now (both cost and perf) so we can reason about the outcome, and will limit each builder to a single instance type on Monday.

Clarification: I'm going to start with the gecko-1-b-linux (vs all builders) to minimize the number of changes in flight.

(In reply to Chris Cooper [:coop] pronoun: he from comment #18)

Clarification: I'm going to start with the gecko-1-b-linux (vs all builders) to minimize the number of changes in flight.

I've updated the worker type definition for gecko-1-b-linux to only use c5d.4xlarge instances:

https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux/view

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

I've updated the worker type definition for gecko-1-b-linux to only use c5d.4xlarge instances:

https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux/view

We're now running in this config. I terminated the other workers not using this instance type.

Chris, could you apply this update to gecko-3-b-linux also? I have a bunch of new alerts which originate from these workers.

Also: do you know any way of filtering jobs on Treeherder based on the workerType that handled them? I mean: show only jobs that were run by gecko-3-b-linux or gecko-1-b-linux, for example.

Flags: needinfo?(coop)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #21)

Chris, could you apply this update to gecko-3-b-linux also? I have a bunch of new alerts which originate from these workers.

Are we happy with the change, i.e. does the perf graph for gecko-1-b-linux jobs look more stable now? That was the only reason I was holding off here.

Also: do you know any way of filtering jobs on Treeherder based on the workerType that handled them? I mean: show only jobs that were run by gecko-3-b-linux or gecko-1-b-linux, for example.

Not that I'm aware of. You should file a feature request bug with the Treeherder team if that would be useful for you.

Assignee: nobody → coop
Status: NEW → ASSIGNED
Flags: needinfo?(coop)

(In reply to Chris Cooper [:coop] pronoun: he from comment #22)

Are we happy with the change, i.e. does the perf graph for gecko-1-b-linux jobs look more stable now? That was the only reason I was holding off here.

Here's the graph:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=try,1686265,1,2&series=try,1678618,1,2&series=try,1678602,1,2&series=try,1699389,1,2

Looks pretty consistent. In fact, it looks like a build time regression may have been introduced around April 1st. :/

I've gone ahead and disabled all the instance types except c5d.4xlarge for both gecko-2-b-linux and gecko-3-b-linux.

(In reply to Chris Cooper [:coop] pronoun: he from comment #23)

I've gone ahead and disabled all the instance types except c5d.4xlarge for both gecko-2-b-linux and gecko-3-b-linux.

Wander reverted this for gecko-3-b-linux because the missing instance types caused failures in provisioning: https://bugzilla.mozilla.org/show_bug.cgi?id=1545352#c2

The outage also exposed some capacity concerns around aligning on a single instance type. We fell behind really quickly when provisioning broke down, so I'd like to highlight comment #8 again now that we've just seen what one of those outages looks like.

What I would suggest now is as follows:

  • remove any older-spec instance types, i.e. m4 or c4, from the instance type list, but keep the current mix of c5/c5d/m5. That will give us better average performance, without the full downside of being tied to a single instance type. We can keep the most performant instance type, c5d, weighted so that it is preferred (this is currently the case).

  • for build perf work like in comment #0, use the gecko-1-b-linux-large(1) or gecko-1-b-linux-xlarge(2) worker types in Try. These worker types aren't currently used for much else, could be easily constrained to a single AWS instance type, and could even have regular builds set up to make bisection and debugging easier.

Does that make sense?

  1. https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux-large
  2. https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux-xlarge

Once Bug 1532783 lands, there will be a single place in taskcluster/ci/config.yml that will need to change to switch all workers using one instance type to another.
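For illustration, that single place might be a worker alias along these lines. The key names below are a rough sketch of the config.yml shape, not the exact schema:

```yaml
# Hypothetical fragment of taskcluster/ci/config.yml after bug 1532783
# lands. Key names are illustrative; the real schema may differ.
workers:
  aliases:
    b-linux:
      provisioner: aws-provisioner-v1
      implementation: docker-worker
      os: linux
      # Retargeting every build task (and thus the instance types it
      # runs on) would be a one-line change here:
      worker-type: 'gecko-{level}-b-linux'
```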

(In reply to Chris Cooper [:coop] pronoun: he from comment #24)

What I would suggest now is as follows:

  • remove any older-spec instance types, i.e. m4 or c4, from the instance type list, but keep the current mix of c5/c5d/m5. That will give us better average performance, without the full downside of being tied to a single instance type. We can keep the most performant instance type, c5d, weighted so that it is preferred (this is currently the case).

Are there any bugs filed for this?

Flags: needinfo?(coop)

Coincidentally, yes: bug 1572312.

Flags: needinfo?(coop)

(In reply to Pete Moore [:pmoore][:pete] from comment #4)

Yes, they would. We have several options on the table:

  1. limit the instance type of the existing worker types (pick the most commonly used)
  2. create new worker types, like you suggest, specifically for performance monitoring

It depends mostly on usage - if all tasks are providing performance metrics, and we end up running tasks on both the standard worker type and the performance worker type, that wouldn't be good. But if we only care about performance monitoring on a subset of tasks, we can filter those tasks off to a dedicated worker type, to ensure they are pinned to a given instance type.

Dave, I think we need to revisit this bug and go with the 2nd option. Also, this ticket should be focused on the autoland, mozilla-inbound and mozilla-beta branches. Having this mechanism on try won't help us too much.

Flags: needinfo?(dave.hunt)

:igoldan, perf jobs use a fixed hardware pool, so there is no variability. The only exceptions are the windows xperf and awsy jobs. My understanding is this concern about worker types is for the builds that are performed and the resulting build times that get filed. Can you elaborate on what exactly you are looking to fix?

Flags: needinfo?(igoldan)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #29)

:igoldan, perf jobs use a fixed hardware pool, so there is no variability. The only exceptions are the windows xperf and awsy jobs. My understanding is this concern about worker types is for the builds that are performed and the resulting build times that get filed. Can you elaborate on what exactly you are looking to fix?

I want build tasks to run on the same instance type each time I retrigger or backfill them.
It's not mandatory for this instance type to be the same as the one that generated the perf alert; it simply shouldn't vary over time, depending on the real-time bidding algorithm. This way, Perf sheriffs have a single platform to analyze and don't have to pull in plots from 5 or 6 instance types.

Current problem:
Let's say we have A, B, C and D instance types. For simplicity, we'll narrow this discussion to the Windows desktop platform only.
At time T1, a build times alert is generated for instance type C, because C was the one allocated at T1.
At T2, when a Perf sheriff starts investigating and does retriggers/backfills, the build times test will run on B.
At T3, it's possible they need more data and do extra retriggers/backfills; the test could now run on A, and so on.

My suggestion:
Keep the same A, B, C and D instance types for Windows.
But always have an E instance type at the Perf sheriffs' disposal. It should only run retriggered and backfilled build jobs.
At time T1, an alert is generated on C.
At T2, a Perf sheriff retriggers/backfills and gets results on E.
At T3, a Perf sheriff does another round of retriggers/backfills and gets even more results on E.

Flags: needinfo?(igoldan)
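The A-to-E scenario above can be sketched as a small simulation. This is a toy model with made-up pool contents and a uniform-random draw; it is not how the provisioner actually assigns instances:

```python
import random

# Toy model of the suggestion: regular CI builds keep drawing from the
# mixed pool (A-D), while sheriff-issued retriggers/backfills are
# always pinned to a dedicated instance type E. Pool makeup and the
# random draw are illustrative assumptions.
MIXED_POOL = ["A", "B", "C", "D"]
SHERIFF_POOL = "E"

def assign_instance(rng, sheriff_retrigger=False):
    """Return the instance type a (re)triggered build lands on."""
    return SHERIFF_POOL if sheriff_retrigger else rng.choice(MIXED_POOL)

rng = random.Random(42)
regular = {assign_instance(rng) for _ in range(50)}
retriggers = {assign_instance(rng, sheriff_retrigger=True) for _ in range(50)}
# Regular builds scatter across the mixed pool, while every sheriff
# retrigger lands on E, giving a single platform to compare against.
```

The point of the sketch is that the sheriff-facing set collapses to one instance type regardless of what the bidding algorithm does to the regular pool.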

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #30)

My suggestion:
Keep the same A, B, C and D instance types for Windows.
But always have an E instance type at the Perf sheriffs' disposal. It should only run retriggered and backfilled build jobs.
At time T1, an alert is generated on C.
At T2, a Perf sheriff retriggers/backfills and gets results on E.
At T3, a Perf sheriff does another round of retriggers/backfills and gets even more results on E.

This sounds good to me. We could eventually tie this into the retrigger bot being worked on in bug 1562178, to filter out false positives.

Flags: needinfo?(dave.hunt)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #28)

Dave, I think we need to revisit this bug and go with the 2nd option. Also, this ticket should be focused on the autoland, mozilla-inbound and mozilla-beta branches. Having this mechanism on try won't help us too much.

Yes, we've tried #1, and even with the very large capacity of AWS, we sometimes get outbid (or hit the capacity ceiling) on particular instance types. We need the variability in the pools to ensure enough capacity to keep up with CI volume. This might get better as we move builds to GCP because their instance configs are more dynamic, but we're getting ahead of ourselves here.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #30)

My suggestion:
Keep the same A, B, C and D instance types for Windows.
But always have an E instance type at the Perf sheriffs' disposal. It should only run retriggered and backfilled build jobs.
At time T1, an alert is generated on C.
At T2, a Perf sheriff retriggers/backfills and gets results on E.
At T3, a Perf sheriff does another round of retriggers/backfills and gets even more results on E.

This sounds workable.

Assignee: coop → nobody
Status: ASSIGNED → NEW