Closed Bug 1644110 Opened 5 years ago Closed 5 years ago

Resource usage artifacts and Taskcluster ETL

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: trink)

References

(Blocks 1 open bug)

Details

There are some tasks that produce artifacts containing resource usage samples. We're hoping to expand this with a new artifact that records task resource usage, such as CPU and memory information.

The exact artifact name and structure are still to be decided, but it will likely be a JSON file produced as a task artifact and announced via a Pulse message.

Since we already process the live log this way (a Pulse message arrives, the artifact is retrieved and processed, and the data is added to BigQuery), it would be good if this new artifact could be processed the same way.

It would contain samples of the resources used every $time_interval (likely 1 second), so perhaps we want to aggregate that before adding it to BigQuery; I'm happy to take opinions on that.
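
To make the aggregation idea concrete, here's a minimal sketch in Python of the kind of summary we could compute before the load; the per-sample field names ("cpu_percent", "memory") are placeholders since the artifact structure isn't decided yet:

    # Minimal sketch; assumes a non-empty list of samples, each a dict with
    # hypothetical "cpu_percent" and "memory" fields recorded per $time_interval.
    def summarize(samples):
        cpus = [s["cpu_percent"] for s in samples]
        mems = [s["memory"] for s in samples]
        return {
            "sample_count": len(samples),
            "cpu_mean": sum(cpus) / len(cpus),
            "cpu_max": max(cpus),
            "memory_mean": sum(mems) / len(mems),
            "memory_max": max(mems),
        }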

The questions we'd want to answer with this data include:

  • Are the worker instances an appropriate size, and do they have the right concurrency set?

    • Could more cores be added?
    • Could we use less memory?
    • Could we instead fit two tasks per worker if we increased the size slightly, which would decrease our overall bill?
  • Is there a difference in resource usage after a given date/time (when a commit or infra change was made, for example)?

Points: --- → 3
Priority: -- → P2
Depends on: 1649145

We've now produced this tool and have settled on an output format, at least for version 1. The files themselves contain a version number to hopefully make life easier if there are changes later. An example can be found here:

https://firefoxci.taskcluster-artifacts.net/CahkYcxZTSq5_kSchXEV5Q/0/public/monitoring/resource-monitor.json

How should we proceed with getting data through the ETL? At a minimum the summary section would be useful, so we could start with that.
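
For anyone who wants to poke at the format, something like this dumps the top-level shape of the example above (assuming the document is a JSON object; the "version" and "summary" key names are my guesses at how those pieces are exposed):

    # Quick inspection of the example artifact linked above; purely illustrative.
    import requests

    URL = ("https://firefoxci.taskcluster-artifacts.net/"
           "CahkYcxZTSq5_kSchXEV5Q/0/public/monitoring/resource-monitor.json")

    doc = requests.get(URL, timeout=30).json()
    print(sorted(doc))                              # top-level keys
    print(doc.get("version"), doc.get("summary"))   # assumed key names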

Flags: needinfo?(mtrinkala)

I just have to make a BigQuery schema and update the ETL: if there is a resource-monitor.json in the artifact list, it will be fetched, validated, and loaded into BigQuery. The ETL process could summarize the data, but generally the raw data is loaded in as close to its original form as possible; from there a summarized view can be generated if necessary. I should be able to start testing this tomorrow.
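
Roughly, the flow looks like the sketch below; the table name, client calls, and helper are placeholders rather than the actual pipeline code:

    # Rough sketch of the intended flow, not the real ingestion code.
    # The BigQuery table name and artifact listing are assumptions.
    import jsonschema
    import requests
    from google.cloud import bigquery

    ARTIFACT = "public/monitoring/resource-monitor.json"

    def maybe_load(task_id, run_id, artifact_names, schema,
                   table="moz-fx-data.taskclusteretl.resource_monitor"):
        if ARTIFACT not in artifact_names:
            return
        url = ("https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/"
               f"{task_id}/runs/{run_id}/artifacts/{ARTIFACT}")
        doc = requests.get(url, timeout=30).json()
        jsonschema.validate(doc, schema)            # reject malformed submissions
        bigquery.Client().insert_rows_json(table, [doc])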

Assignee: nobody → mtrinkala
Flags: needinfo?(mtrinkala)
Points: 3 → 2
Priority: P2 → P1

I am seeing some very large vms numbers (e.g. 958274947641344), which are throwing schema errors (since the field is defined as an integer). This seems like bad data, but I could just change the schema type to 'number' and allow them.
https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json

Also write_bytes 18446744073708775765 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ew47lYVPRiWg5fpAtbB6nA/runs/0/artifacts/public/monitoring/resource-monitor.json

And write_count 18446744073709550984 in
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JzaV1-ECSIGP4S7sWfc66Q/runs/0/artifacts/public/monitoring/resource-monitor.json

Flags: needinfo?(sfraser)

(In reply to Mike Trinkala [:trink] from comment #3)
> I am seeing some very large vms numbers (e.g. 958274947641344), which are
> throwing schema errors (since the field is defined as an integer). This
> seems like bad data, but I could just change the schema type to 'number'
> and allow them.
> https://firefoxci.taskcluster-artifacts.net/ehSPWe90SsK_3E9L7E87zA/0/public/monitoring/resource-monitor.json

Ahh, I think I should just turn off collection of that. I suspect it's all expanded VMS without any allocation, and then we're summing it across processes on top of that. The current generation shouldn't have more than 256 GiB / 274877906944 bytes. What's your maxint?

Flags: needinfo?(sfraser)
Depends on: 1657980

JSON Schema does not specify a limit/range for the integer type; technically the interoperable range is up to 2^53 - 1 (before losing precision), but it is implementation dependent.

I will relax the samples schema to use number instead of integer everywhere to avoid throwing away the entire submission for a few anomalous samples.
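
Concretely, the relaxation is just a type change on the affected fields. An illustrative fragment (the real samples schema is larger; field names taken from the examples above):

    # Illustrative fragment only. With "number" instead of "integer", the
    # anomalous samples validate rather than causing the whole submission
    # to be rejected.
    import jsonschema

    RELAXED_SAMPLE_SCHEMA = {
        "type": "object",
        "properties": {
            "vms": {"type": "number"},          # was "integer"
            "write_bytes": {"type": "number"},  # was "integer"
            "write_count": {"type": "number"},  # was "integer"
        },
    }

    jsonschema.validate(
        {"vms": 958274947641344, "write_bytes": 18446744073708775765},
        RELAXED_SAMPLE_SCHEMA,
    )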

I will let it run on dev over the weekend and deploy to production on Monday if everything looks good.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General