Closed Bug 1847235 Opened 9 months ago Closed 5 months ago

upload timing metrics

Categories

(Tecken :: Upload, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

One of the critical roles Tecken performs is handling the upload API. Uploaded zip files are large. Processing them is complex and needs to be fast and reliable, and it has to complete during the HTTP request/response cycle.

Because of this, we need a way to monitor upload success/failure as well as other metrics that tell us whether changes we've made (upgrading Python, upgrading libraries, adjusting the processing code, etc.) have improved or degraded uploading.

There are some metrics scattered around in various places. It's possible we're capturing everything we need already, but we need to centralize it and wrap it in a dashboard that lets us determine upload processing SLO success.

We need to have this in place before we do the GCP migration, so I'm making this work block that.

Assignee: nobody → willkg
Blocks: 1687804
Status: NEW → ASSIGNED

Bug #1751303 covers "Intermittent Symbols [taskcluster:error] Task timeout after 900 seconds. Force killing container. | The write operation timed out", which suggests there are roughly 1 or 2 failures every week out of about 3,800 runs. You can see this failure aggregated here:

https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-07-28&endday=2023-08-04&tree=trunk&bug=1751303

Bug #1779449 covers "upload-symbols jobs are slow". That bug suggests the timing for the upload task is in the logs for every build, but metrics on task execution times aren't collected anywhere. NB: the upload task also includes a bunch of work to build the zip file being uploaded. You can see this failure aggregated here:

https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-07-28&endday=2023-08-04&tree=trunk&bug=1658531

Both of those graphs can be used to get a coarse understanding of whether changes we've made affected the Firefox build system (and maybe Fenix, too--I'm not sure).

In Grafana, we have several graphs tracking overall upload API timings. Zip file sizes vary widely. We could probably track the 95th percentile line for upload API timing, figuring that shows us uploads of big files. Do we get a timing metric emitted for failed uploads? How about uploads that exceed the timeout and terminate with a connection reset?
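
Something like this (a sketch only, assuming a markus-style metrics interface and a hypothetical process_upload helper, not Tecken's actual view code) would make sure a timing gets emitted even when the upload fails, tagged by outcome:

```python
import time

import markus

metrics = markus.get_metrics("tecken")


def upload_view(request):
    # Hypothetical wrapper: time the whole request and emit the timing in a
    # finally block so failed uploads get a data point too.
    start = time.perf_counter()
    outcome = "failure"
    try:
        response = process_upload(request)  # hypothetical processing helper
        outcome = "success"
        return response
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics.timing("upload_api", value=elapsed_ms, tags=[f"outcome:{outcome}"])
```

A timing emitted in a finally block still won't fire if the worker itself gets killed mid-request (timeout/connection reset), so that case needs separate tracking.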

I see a graph of timings for the step that dumps and extracts the files to disk. Do we have timings for other steps?

Would it help to have a second set of metrics for overall upload API timing and timings by step that are normalized by the size of the payload?
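
For example (a sketch with assumed metric and helper names, not existing Tecken metrics), emitting a per-megabyte figure alongside the raw timing would put a 50 MB zip and a 500 MB zip on a comparable scale:

```python
import markus

metrics = markus.get_metrics("tecken")


def emit_normalized_upload_timing(elapsed_ms, payload_bytes):
    # Hypothetical helper: normalize the timing by payload size so the panel
    # isn't dominated by how big the uploaded zip happened to be.
    size_mb = max(payload_bytes / (1024 * 1024), 0.001)  # guard tiny/zero payloads
    metrics.histogram("upload_api_per_mb", value=elapsed_ms / size_mb)
```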

Ultimately, I think I want all of this in a Grafana dashboard that works for stage and prod. I'm going to focus my efforts on that and probably do this incrementally.

I tinkered with Gunicorn server hooks today. We can capture how often the Gunicorn worker times out when handling uploads by adding a Gunicorn server hook for worker_abort and emitting an INCR.

https://docs.gunicorn.org/en/stable/settings.html#worker-abort
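
A sketch of what that could look like in a gunicorn config module (the filename and markus wiring are assumptions; markus would need to be configured with a statsd/Datadog backend elsewhere in app startup):

```python
# gunicorn_config.py (hypothetical filename)
import markus

metrics = markus.get_metrics("tecken")


def worker_abort(worker):
    # gunicorn calls this server hook when a worker is aborted, e.g. after
    # hitting the worker timeout; this emits tecken.gunicorn_worker_abort.
    metrics.incr("gunicorn_worker_abort")
```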

I have an upload api metrics dashboard I'm building:

https://earthangel-b40313e5.influxcloud.net/d/6gimTZ6Vz/tecken-upload-api-metrics

I added the instance metrics I used on the Socorro processor analysis dashboard when investigating Socorro processor stability issues.

Then I started skulking around for periods where usage_iowait rises above usage_user, which indicates the instances are spending more time waiting on disk. I found one here:

https://earthangel-b40313e5.influxcloud.net/d/6gimTZ6Vz/tecken-upload-api-metrics?orgId=1&var-env=prod&from=1700074199240&to=1700083096967

That period corresponds with monotonically increasing disk usage, which happens when gunicorn workers take too long to handle upload API requests, hit the 5 minute timeout, and get terminated by the gunicorn manager, leaving files behind on disk.

The disk read/write panels don't look like Socorro's, where it's clear that disk usage is suddenly getting throttled. However, I think it's probably the case that switching to a c5d.large (with a local disk) will reduce the upload API handling time, the number of gunicorn workers getting terminated, and out-of-disk errors. I'll work on that next week.

I pushed the tecken.gunicorn_worker_abort metric emission to prod just now in bug #1867844 and added it to the upload metrics dashboard.

I don't see any tecken.gunicorn_worker_abort data so far. Maybe that's bursty. Maybe we don't have gunicorn worker timeouts at all.

Another thing we should look at is HTTP 5xx responses. nginx returns a 504 (gateway timeout) when the timeout is exceeded. I added ELB 5xx (prod) and ELB 5xx (stage) panels to the dashboard.

Further, we can look at the nginx logs for when it returns a 504 for an upload API request.
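
Something along these lines (a rough sketch; the "combined" log format and the /upload path prefix are assumptions) would pull those out of an access log:

```python
import re
import sys

# Assumes nginx "combined"-style access logs; the /upload prefix is an
# assumption about where the upload API lives.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_upload_504s(log_path):
    count = 0
    with open(log_path) as fp:
        for line in fp:
            m = LINE_RE.search(line)
            if m and m["status"] == "504" and m["path"].startswith("/upload"):
                count += 1
    return count


if __name__ == "__main__":
    print(count_upload_504s(sys.argv[1]))
```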

I think that probably gives us all the data we need to monitor upload API health over time. I don't think we need to break it down by payload size or number of files or anything like that, yet, but those might be interesting avenues of investigation when we have problems in the future.

I'm going to mark this FIXED.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED

The last set of changes was deployed just now in bug #1869351.

