Closed Bug 1644110 Opened 5 years ago Closed 5 years ago

Resource usage artifacts and Taskcluster ETL

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: trink)

References

(Blocks 1 open bug)

Details

There are some tasks that produce artifacts containing resource usage samples. We're hoping to expand this with a new artifact that records task resource usage, such as CPU and memory information.

The exact artifact name and structure are still to be decided, but it will likely be a JSON file produced as a task artifact and announced via a Pulse message.

Since we already process the live log this way (a Pulse message arrives, the artifact is retrieved and processed, and the data is added to BigQuery), it would be good if this new artifact could be processed the same way.

It would contain samples of the resources used every $time_interval (likely 1 second), so perhaps we want to aggregate that before adding it to BigQuery; I'm happy to take opinions on that.
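
To make the aggregation idea concrete, here's a minimal sketch in Python of the kind of summary we could compute before the load; the per-sample field names ("cpu_percent", "memory") are placeholders since the artifact structure isn't decided yet:

    # Minimal sketch; assumes a non-empty list of samples, each a dict with
    # hypothetical "cpu_percent" and "memory" fields recorded per $time_interval.
    def summarize(samples):
        cpus = [s["cpu_percent"] for s in samples]
        mems = [s["memory"] for s in samples]
        return {
            "sample_count": len(samples),
            "cpu_mean": sum(cpus) / len(cpus),
            "cpu_max": max(cpus),
            "memory_mean": sum(mems) / len(mems),
            "memory_max": max(mems),
        }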

The questions we'd want to answer with this data include:

  • Are the worker instances an appropriate size, and do they have the right concurrency set?

    • Could more cores be added?
    • Could we use less memory?
    • Could we instead fit two tasks per worker if we increased the size slightly, which would decrease our overall bill?
  • Is there a difference in resource usage after a given date/time (when a commit or infra change was made, for example)?

Points: --- → 3
Priority: -- → P2
Depends on: 1649145

We've now produced this tool and have settled on an output format, at least for version 1. The files themselves contain a version number to hopefully make life easier if there are changes later. An example can be found here:

https://firefoxci.taskcluster-artifacts.net/CahkYcxZTSq5_kSchXEV5Q/0/public/monitoring/resource-monitor.json

How should we proceed with getting data through the ETL? At a minimum the summary section would be useful, so we could start with that.
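
For anyone who wants to poke at the format, something like this dumps the top-level shape of the example above (assuming the document is a JSON object; the "version" and "summary" key names are my guesses at how those pieces are exposed):

    # Quick inspection of the example artifact linked above; purely illustrative.
    import requests

    URL = ("https://firefoxci.taskcluster-artifacts.net/"
           "CahkYcxZTSq5_kSchXEV5Q/0/public/monitoring/resource-monitor.json")

    doc = requests.get(URL, timeout=30).json()
    print(sorted(doc))                              # top-level keys
    print(doc.get("version"), doc.get("summary"))   # assumed key names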

Flags: needinfo?(mtrinkala)

I just have to make a BigQuery schema and update the ETL: if there is a resource-monitor.json in the artifact list, it will be fetched, validated, and loaded into BigQuery. The ETL process could summarize the data, but generally the raw data is loaded in as close to its original form as possible; from there a summarized view can be generated if necessary. I should be able to start testing this tomorrow.
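
Roughly, the flow looks like the sketch below; the table name, client calls, and helper are placeholders rather than the actual pipeline code:

    # Rough sketch of the intended flow, not the real ingestion code.
    # The BigQuery table name and artifact listing are assumptions.
    import jsonschema
    import requests
    from google.cloud import bigquery

    ARTIFACT = "public/monitoring/resource-monitor.json"

    def maybe_load(task_id, run_id, artifact_names, schema,
                   table="moz-fx-data.taskclusteretl.resource_monitor"):
        if ARTIFACT not in artifact_names:
            return
        url = ("https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/"
               f"{task_id}/runs/{run_id}/artifacts/{ARTIFACT}")
        doc = requests.get(url, timeout=30).json()
        jsonschema.validate(doc, schema)            # reject malformed submissions
        bigquery.Client().insert_rows_json(table, [doc])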

Assignee: nobody → mtrinkala
Flags: needinfo?(mtrinkala)
Points: 3 → 2
Priority: P2 → P1

I am seeing some very large vms numbers (e.g. 958274947641344), which are throwing schema errors (since the field is defined as an integer). This seems like bad data, but I could just change the schema type to 'number' and allow them.
https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json

Also write_bytes 18446744073708775765 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ew47lYVPRiWg5fpAtbB6nA/runs/0/artifacts/public/monitoring/resource-monitor.json

And write_count 18446744073709550984 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JzaV1-ECSIGP4S7sWfc66Q/runs/0/artifacts/public/monitoring/resource-monitor.json

Flags: needinfo?(sfraser)

(In reply to Mike Trinkala [:trink] from comment #3)
> I am seeing some very large vms numbers (e.g. 958274947641344), which are
> throwing schema errors (since the field is defined as an integer). This
> seems like bad data, but I could just change the schema type to 'number'
> and allow them.
> https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json

Ahh, I think I should just turn off collection of that. I suspect it's all expanded VMS without any allocation, and then we're summing it across processes on top of that. The current generation shouldn't have more than 256 GiB / 274877906944 bytes. What's your maxint?

Flags: needinfo?(sfraser)
Depends on: 1657980

JSON Schema does not specify a limit/range for the integer type; technically the interoperable range is up to 2^53 - 1 (before losing precision), but it is implementation dependent.

I will relax the samples schema to use number instead of integer everywhere to avoid throwing away the entire submission for a few anomalous samples.
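
Concretely, the relaxation is just a type change on the affected fields. An illustrative fragment (the real samples schema is larger; field names taken from the examples above):

    # Illustrative fragment only. With "number" instead of "integer", the
    # anomalous samples validate rather than causing the whole submission
    # to be rejected.
    import jsonschema

    RELAXED_SAMPLE_SCHEMA = {
        "type": "object",
        "properties": {
            "vms": {"type": "number"},          # was "integer"
            "write_bytes": {"type": "number"},  # was "integer"
            "write_count": {"type": "number"},  # was "integer"
        },
    }

    jsonschema.validate(
        {"vms": 958274947641344, "write_bytes": 18446744073708775765},
        RELAXED_SAMPLE_SCHEMA,
    )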

I will let it run on dev over the weekend and deploy to production on Monday if everything looks good.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General