Closed Bug 1641675 Opened 5 years ago Closed 2 years ago

workerPoolId gecko-t/t-linux-metal is ambiguous

Categories

(Firefox Build System :: Task Configuration, defect)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: trink, Unassigned)

References

Details

workerPoolId gecko-t/t-linux-metal can represent two types of hardware (r5/m5.metal) with different costs and capacity configurations. This breaks some assumptions in the cost analysis dashboards and leads to incorrect totals.

In addition, we currently have a mix of m5.metal (15 instances) and r5.metal (32 instances). The difference is that we pay per instance at $0.30/hour for m5.metal and $0.18/hour for r5.metal.

I think minimizing our use of m5.metal would yield better cost savings, unless we determine that r5.metal is hard to keep full. My understanding is that before the migration we were paying something like $0.28/hour for workers, so either way this is an improvement, assuming we don't leave too much idle overhead; at worst it ends up about the same.
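For illustration, the blended hourly rate of the current instance mix can be computed from the counts and rates quoted above; it comes out below the roughly $0.28/hour pre-migration rate. This is a back-of-the-envelope sketch, not output from the actual cost dashboards.

```python
# Illustrative only: blended hourly cost of the current t-linux-metal mix,
# using the instance counts and per-instance rates quoted above.
m5 = {"count": 15, "rate": 0.30}  # m5.metal, $/hour per instance
r5 = {"count": 32, "rate": 0.18}  # r5.metal, $/hour per instance

total_instances = m5["count"] + r5["count"]
blended = (m5["count"] * m5["rate"] + r5["count"] * r5["rate"]) / total_instances
print(f"blended rate: ${blended:.4f}/hour")  # ~$0.2183/hour
```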

Blocks: 1641680

Most of our existing pools have multiple instance types, with different costs already, so I'm surprised that the different types of instances would be an issue, if the existing variation is not already an issue.

Regarding the capacity variation, my understanding is that :wcosta did performance measurements of various instance types fully loaded, and found that the given capacities gave acceptable performance on the different worker types (I think largely due to available memory). I think there may also have been capacity issues with using a single instance type. :wcosta is no longer at Mozilla, but perhaps :coop has some more details.

I do think that this is something that cost-analysis will need to take into account rather than changing the pools.

Flags: needinfo?(coop)

(In reply to Tom Prince [:tomprince] from comment #2)

Most of our existing pools have multiple instance types, with different costs already, so I'm surprised that the different types of instances would be an issue, if the existing variation is not already an issue.

With the same capacity (in most cases 1), their costs are simply averaged together, which in general is good enough.
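A toy illustration of why that simple averaging stops being good enough once capacities differ: the per-instance average and the per-unit-of-capacity cost diverge. The capacities and rates below are hypothetical, not the pool's real configuration.

```python
# Hypothetical: two instance types in one pool, with different per-instance
# rates and different configured capacities (tasks each can run at once).
instances = [
    {"type": "m5.metal", "rate": 0.30, "capacity": 8},
    {"type": "r5.metal", "rate": 0.18, "capacity": 4},
]

# Naive rollup: average the per-instance rates (fine when every capacity is 1).
naive = sum(i["rate"] for i in instances) / len(instances)

# Capacity-aware rollup: cost per unit of capacity.
per_capacity = sum(i["rate"] for i in instances) / sum(i["capacity"] for i in instances)

print(naive, per_capacity)  # the two roll-ups disagree once capacities differ
```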

I do think that this is something that cost-analysis will need to take into account rather than changing the pools.

The data made available by Taskcluster has to match up with the billing tags; the tagging has been evolving from workerType only, to workerPoolId, to workerPoolId+workerGroup. The pool type can be made distinct to address the issue, or new metadata can be added, but without one of these the cost analysis cannot take the capacity into account. It just depends on how we want the costs rolled up, and on the lowest common denominator between the various clouds/datacenters.

The pool type can be made distinct to address the issue

Doing that would remove the benefit of having two instance types, as tasks are assigned to a worker pool, and the intent is to run those tasks on whichever instance is allocated by Taskcluster.

I'm not sure what information it would make sense to add to the metadata to improve the cost data. Each task run records the worker it ran on, which I believe (for cloud providers) includes the instance name, from which the instance type can be cross-referenced. It would probably be useful if worker-manager included the instance type and image id in the metadata it reports for workers; it does already include the capacity as known by worker-manager[1].


Regarding :jmaher's point about costs, it would be good if we could prefer one instance type over the other (either automatically, or via manual annotations), but Taskcluster doesn't currently support that.


[1] This does not necessarily match how the worker is actually configured (as evidenced by Bug 164008), but that is definitely a bug in the configuration and probably not worth worrying about in the cost-data handling.

(In reply to Tom Prince [:tomprince] from comment #2)

Regarding the capacity variation, my understanding is that :wcosta did performance measurements of various instance types fully loaded, and found that the given capacities gave acceptable performance on the different worker types (I think largely due to available memory). I think there may also have been capacity issues with using a single instance type. :wcosta is no longer at Mozilla, but perhaps :coop has some more details.

Bug 1578460 has the details for the cost-to-perf exploration that Wander did, including links to his spreadsheets.

In short, metal instances are bigger and beefier because of the types of workloads people would want to run on them. Consequently, there are fewer of them available, so we need to widen our options to avoid long delays in acquiring capacity.

I do think that this is something that cost-analysis will need to take into account rather than changing the pools.

Absolutely. If we need to tag things better or adjust our nomenclature to make this easier to parse, please let us know.

Flags: needinfo?(coop)

The instanceId is extremely useful for accurately mapping the Taskcluster data to the billing data; unfortunately, the only provider that currently includes it in the billing data is AWS. We have been using it where we can, e.g. to track down provisioning issues, but it would be nice to get our tagging/metadata in order so we can compute costs the same way on each cloud and avoid a lot of one-off estimates, averages, and manual tweaks per provider.

In the long term I believe switching from workerPoolId to the workerId/instanceId as the mapping mechanism is the right direction to go.

It looks like at least the AWS provisioner keeps the instance type around (though Google and Azure don't). That could perhaps be exposed in the API.

Alternatively or additionally, worker-manager could track the launch config that was used to spawn a worker (see this comment) and expose that data, which includes the instance type.

The AWS-specific workerPoolIds have been addressed. The billing hours are now adjusted by the configured capacity for each workerPoolId, in combination with its instanceType, before being rolled up into the final workerPoolId costs. This required a smaller change than using the instanceId, and it kept the higher-level cost rollups compatible with the other providers, so AWS didn't have to be separated or special-cased.
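A minimal sketch of the adjustment described above: billing hours are scaled by the configured capacity for each (workerPoolId, instanceType) pair before being summed into a single per-pool total. The capacities and billing line items here are hypothetical placeholders, and this is not the actual ETL code.

```python
from collections import defaultdict

# Hypothetical configured capacities per (workerPoolId, instanceType).
capacity = {
    ("gecko-t/t-linux-metal", "m5.metal"): 8,
    ("gecko-t/t-linux-metal", "r5.metal"): 4,
}

# Hypothetical billing line items: (workerPoolId, instanceType, hours, cost).
billing = [
    ("gecko-t/t-linux-metal", "m5.metal", 100.0, 30.0),
    ("gecko-t/t-linux-metal", "r5.metal", 200.0, 36.0),
]

# Scale billed hours by capacity, then roll up to a single per-pool figure.
totals = defaultdict(lambda: {"capacity_hours": 0.0, "cost": 0.0})
for pool, itype, hours, cost in billing:
    totals[pool]["capacity_hours"] += hours * capacity[(pool, itype)]
    totals[pool]["cost"] += cost

for pool, t in totals.items():
    print(pool, t["cost"] / t["capacity_hours"])  # cost per capacity-hour
```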

See https://bugzilla.mozilla.org/show_bug.cgi?id=1641680#c2

No further action is required to correct this specific issue, so it can be closed if this solution is acceptable. However, it would still be good to create a forward-looking solution for all providers regarding variable capacity within a pool.

Commit: https://github.com/mozilla-services/lua_sandbox_extensions/commit/5bdbe65a9365b67209a3c6601c02e226d98caab2

Hm, so I just stumbled across this in triage. If I'm parsing the conversation here right, it looks like Mike implemented the ability to retrieve all the instance types used by a pool, but only for the AWS provider. We are now almost entirely on GCP, so I think that might mean the Taskcluster cost ETL does not know about the possibility of multiple instance types there? I think most (or all?) of the GCP pools define a single instance type anyway, but we'll likely be adding more as we go along.

Jason, do you know if my assessment sounds correct? Or who can I ask about this? I have a vague worry that since we migrated to GCP the cost data is no longer accurate.

Flags: needinfo?(jthomas)

We've given up on the cost dashboard and no longer trust its results. The t-linux-metal pool also no longer exists.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jthomas)
Resolution: --- → INVALID