Open Bug 1346565 Opened 8 years ago Updated 5 years ago

Taskcluster uses randomly generated 'machine' names causing machine table bloat

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P5)

Tracking

(Not tracked)

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

The `job` table has a machine_id field that foreign-keys to the `machine` table. The `machine` table only has two fields, `id` and `name`.

On prod the machine table contains 5.3 million rows (although some will be deleted when cycle_data is fixed; the jobs table 'only' contains 2.6 million distinct machine_ids).

Sample recent machine names:
i-0d927185f235e24dd
i-016e9043de0c1bb2c
i-036f2c95e98d36be6
i-0e0cc5628c05ba4f7
i-0b3926af182bbe46b
i-09dd45c1c6e750422
i-0853fde24a27f425d

(Buildbot used names like `bld-linux64-spot-001` etc)

The jobs table was normalised into jobs and machine years ago on the premise that the set of machine names was finite; however, this appears to no longer be the case with Taskcluster. As such, perhaps we should just store the machine name directly on the jobs table instead, since it's turning into a one-to-one mapping?

Though in fact, with Taskcluster, is a randomly generated machine name even useful? The use cases previously have been:
(a) help identify if a class of machines is causing an issue (eg all spot instances or all machines in one datacenter)
(b) help identify particular bad machines (eg bad RAM, dirty work directory etc)

With AWS, containers, spot instances etc I think (b) is not only occurring less often, but a random machine name doesn't really help answer it anyway.

Perhaps a machine_class_name or similar would solve (a) but mean we can still ditch machine_name? (Allowing for a transition while we still have Buildbot jobs, of course.) The machine instance id/other details can still be looked up in the log.

Thoughts? (Also CCing sheriffs and others who might look at machine name)
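To make the bloat concrete, here is a minimal sketch of the normalised layout described above, using an in-memory SQLite database. The schema is a hypothetical simplification (the real Treeherder tables have many more columns), and `record_job` is an illustrative helper, not Treeherder code. It shows that reusable Buildbot-style names share one `machine` row, while each Taskcluster-style instance id adds a new one.

```python
import sqlite3

# Hypothetical, simplified schema; the real tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machine (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE job (
        id         INTEGER PRIMARY KEY,
        machine_id INTEGER NOT NULL REFERENCES machine(id)
    );
""")


def record_job(machine_name):
    """Insert a job, creating the machine row only if the name is new."""
    conn.execute(
        "INSERT OR IGNORE INTO machine (name) VALUES (?)", (machine_name,)
    )
    (machine_id,) = conn.execute(
        "SELECT id FROM machine WHERE name = ?", (machine_name,)
    ).fetchone()
    conn.execute("INSERT INTO job (machine_id) VALUES (?)", (machine_id,))


# Three jobs on one Buildbot-style machine, then three jobs that each
# ran on a different Taskcluster-style instance.
names = [
    "bld-linux64-spot-001", "bld-linux64-spot-001", "bld-linux64-spot-001",
    "i-0d927185f235e24dd", "i-016e9043de0c1bb2c", "i-036f2c95e98d36be6",
]
for name in names:
    record_job(name)

job_count = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
machine_count = conn.execute("SELECT COUNT(*) FROM machine").fetchone()[0]
distinct_used = conn.execute(
    "SELECT COUNT(DISTINCT machine_id) FROM job"
).fetchone()[0]

print(job_count, machine_count, distinct_used)
```

In this toy data the three Buildbot jobs share one `machine` row, whereas every Taskcluster job creates its own, which is the near one-to-one mapping the description worries about.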
Blocks: 1178227
I think we should just remove the machine_name table entirely, and just let people look at the logs for this information. Thoughts?
I would vote for that.
How difficult would it be when sheriffs are triaging intermittent issues to discover that it's the same machine failing them? We do have a machine dashboard now that allows one to see the last 20 tasks claimed, but connecting the dots from TH to that dashboard is not clear.
I don't think I have any opinion about removing the table, since if it's ever done anything for me I don't know what it is, but I would very very much prefer that the machine name not be removed from the Job panel, since I use that pretty close to once a day to identify that something is specific to a broken instance or machine (despite Taskcluster being everything, it does still use hardware, not just AWS instances with annoying-to-recognize names).
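The "same machine keeps failing" check that sheriffs do by eye can be sketched roughly as below. This is only an illustration: the job records, their field names, and the `repeat_offenders` helper are hypothetical and do not reflect Treeherder's actual data model or API.

```python
from collections import Counter

# Hypothetical failed-job records; field names are illustrative only.
failed_jobs = [
    {"job_id": 1, "machine": "i-0cda6675a9779be9a"},
    {"job_id": 2, "machine": "i-036f2c95e98d36be6"},
    {"job_id": 3, "machine": "i-0cda6675a9779be9a"},
    {"job_id": 4, "machine": "i-0cda6675a9779be9a"},
]


def repeat_offenders(jobs, threshold=2):
    """Return machines that appear in at least `threshold` failed jobs."""
    counts = Counter(job["machine"] for job in jobs)
    return {name: n for name, n in counts.items() if n >= threshold}


suspects = repeat_offenders(failed_jobs)
print(suspects)
```

A grouping like this only works if the machine name stays available alongside each job, whether in a table, on the job row, or extracted from the log.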
So to clarify: by machine name, for taskcluster we're referring to the string of form `i-036f2c95e98d36be6` - and that even though it looks pseudo random and there are 11 million of them in Treeherder's `machine` table, they are actually reused?
Yep. I don't know whether the 11 million number reflects the fact that some instances are very short-lived, or just that we run a completely ridiculous number of jobs. But both https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64/workers/us-east-1/i-0cda6675a9779be9a (that's a Win10 instance with an "Activate Windows" overlay up) and many of the lines in https://mozilla.logbot.info/?ch=taskcluster&q=kill say that individual instances last long enough that we recognize them as being broken. And while Aryx might well be cunning enough to get the instance name out of the log or the task inspector link, whenever I do it, it's because I recognize it from the Job panel's "Machine:" line (mostly from the last four or five chars; I don't actually memorize all 17).
Priority: -- → P3

Although I agree this does not need to be a table, it is not a priority to change the schema.

Component: Treeherder → Treeherder: Infrastructure
Priority: P3 → P5