Resource usage artifacts and Taskcluster ETL
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: sfraser, Assigned: trink)
References
(Blocks 1 open bug)
Details
There are some tasks which produce artifacts containing resource usage samples. We're hoping to expand this with a new artifact that records task resource usage, such as CPU and memory information.
The exact artifact name and structure are still to be decided, but it will likely be a JSON file produced as a task artifact and announced via a Pulse message.
Since we already process the live log this way (a Pulse message arrives, the artifact is retrieved and processed, and the data is added to BigQuery), it would be good if this new artifact could be processed the same way.
It would contain samples of the resources used every $time_interval (likely 1 second), so perhaps we want to aggregate that before adding it to BigQuery; I'm happy to take opinions on that.
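One possible pre-aggregation step, sketched here as an illustration only: the artifact structure was still undecided at this point, so the field names (`cpu_percent`, `memory_bytes`) and the summary statistics chosen are assumptions, not the real format.

```python
from statistics import mean

def summarize(samples):
    """Collapse per-second resource samples into one summary row.

    `samples` is a list of dicts with hypothetical keys "cpu_percent"
    and "memory_bytes"; the real artifact field names were still
    undecided when this bug was filed.
    """
    cpu = [s["cpu_percent"] for s in samples]
    mem = [s["memory_bytes"] for s in samples]
    return {
        "cpu_mean": mean(cpu),
        "cpu_max": max(cpu),
        "mem_mean": mean(mem),
        "mem_max": max(mem),
        "duration_seconds": len(samples),  # one sample per second
    }

samples = [
    {"cpu_percent": 50.0, "memory_bytes": 1_000_000},
    {"cpu_percent": 90.0, "memory_bytes": 3_000_000},
]
print(summarize(samples))
```

Whether to aggregate like this in the ETL, or load the raw samples and summarize in a view, is exactly the open question raised above.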
The questions we'd want to answer with this data are ones such as:
- Are the worker instances an appropriate size, or do they have the right concurrency set?
  - Could more cores be added?
  - Could we use less memory?
  - Could we instead fit 2 tasks per worker if we increased the size slightly, which would decrease our overall bill? etc.
- Is there a difference in resource usage after a given date/time (when a commit or infra change was made, for example)?
Comment 1 • 5 years ago (Reporter)
We've now produced this tool and have settled on an output format, at least for version 1. The files themselves contain a version number to hopefully make life easier if there are changes later. An example can be found here:
How should we proceed with getting data through the ETL? At a minimum the summary section would be useful, so we could start with that.
Comment 2 • 5 years ago (Assignee)
I just have to make a BigQuery schema and update the ETL: If there is a resource-monitor.json in the artifact list it will be fetched, validated and loaded into BigQuery. The ETL process could summarize the data, but generally the raw data is loaded in as close to its original form as possible. From there a summarized view can be generated if necessary. I should be able to start testing this tomorrow.
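A minimal sketch of that gating step. The queue root and artifact path are taken from the URLs later in this bug; the helper name and the names-only artifact list are assumptions about the ETL's internals, not its actual code.

```python
# Taskcluster queue root and artifact path, as seen in the URLs in this bug.
QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
MONITOR = "public/monitoring/resource-monitor.json"

def monitor_artifact_url(task_id, run_id, artifact_names):
    """Return the artifact URL if this run produced resource-monitor.json.

    Mirrors the described ETL step: scan the run's artifact list and
    only fetch/validate/load when the monitoring artifact is present.
    (Hypothetical helper; the real ETL code is not shown in this bug.)
    """
    if MONITOR not in artifact_names:
        return None  # task did not run the resource monitor
    return f"{QUEUE}/task/{task_id}/runs/{run_id}/artifacts/{MONITOR}"

# Example using the task ID from a later comment (artifact list abbreviated):
url = monitor_artifact_url("ew47lYVPRiWg5fpAtbB6nA", 0,
                           ["public/logs/live.log", MONITOR])
print(url)
```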
Comment 3 • 5 years ago (Assignee)
I am seeing some very large vms values (e.g. 958274947641344), which are throwing schema errors (since the field is defined as an integer). This seems like bad data, but I could just change the schema type to 'number' and allow them:
https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json
Also, write_bytes is 18446744073708775765 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ew47lYVPRiWg5fpAtbB6nA/runs/0/artifacts/public/monitoring/resource-monitor.json
and write_count is 18446744073709550984 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JzaV1-ECSIGP4S7sWfc66Q/runs/0/artifacts/public/monitoring/resource-monitor.json
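Those I/O counter values sit just below 2^64, which is the signature of an unsigned 64-bit counter that went slightly negative and wrapped around. A quick check (the underflow interpretation is my reading of the numbers, not something confirmed in this bug):

```python
U64 = 2**64  # modulus of an unsigned 64-bit counter

# Each anomalous counter is a small negative delta wrapped around 2^64.
for value in (18446744073708775765, 18446744073709550984):
    print(value, "= 2^64 -", U64 - value)
```

Both deltas (775851 and 632) are tiny compared to the values themselves, consistent with wrap-around rather than genuinely huge I/O.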
Comment 4 • 5 years ago (Reporter)
(In reply to Mike Trinkala [:trink] from comment #3)
> I am seeing some very large vms numbers 958274947641344 which is throwing
> some schema errors (since it is defined as an integer). This seems like bad
> data but I could just change the schema type to 'number' and allow them.
> https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json

Ahh, I think I should just turn off collection of that. I suspect it's all expanded VMS without any actual allocation, and then we're summing it across processes on top of that. The current generation shouldn't have more than 256 GB (274877906944 bytes). What's your max int?
Comment 5 • 5 years ago (Assignee)
JSON Schema does not specify a limit/range for the integer type; technically the interoperable range is up to 2^53 - 1 (before losing precision), but it is implementation dependent.
Comment 6 • 5 years ago (Assignee)
I will relax the samples schema to use number instead of integer everywhere, to avoid throwing away an entire submission for a few anomalous samples.
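For context on why integer typing fails here: assuming the integer fields map to a signed 64-bit type (as BigQuery's INTEGER does), the anomalous counters overflow it outright, while a floating-point "number" column accepts them with rounding and keeps the rest of the submission loadable.

```python
INT64_MAX = 2**63 - 1  # range of a signed 64-bit integer (BigQuery INTEGER)

# Both anomalous counters from comment 3 overflow INT64; as floats
# (BigQuery FLOAT64 / JSON Schema "number") they load, approximately.
for v in (18446744073708775765, 18446744073709550984):
    print(v > INT64_MAX, float(v))
```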
Comment 7 • 5 years ago (Assignee)
I will let it run on dev over the weekend and deploy to production on Monday if everything looks good.
Comment 8 • 5 years ago (Assignee)
Deployed. The data is available at:
https://console.cloud.google.com/bigquery?project=moz-fx-data-taskclu-prod-8fbf&p=moz-fx-data-taskclu-prod-8fbf&d=taskclusteretl&t=resource_monitor&page=table