Open Bug 1898749 Opened 10 months ago Updated 5 months ago

Tecken doesn't handle upload spikes gracefully

Categories

(Tecken :: Upload, defect, P3)

Tracking

(Not tracked)

People

(Reporter: sven, Unassigned)

References

Details

Attachments

(3 files)

When many symbol upload requests arrive within a short time period, Tecken can become unresponsive. An example of this happening is in the attachments.

This is probably caused by all gunicorn workers being busy, so the backend can't respond to any further requests.

We probably don't want to do anything to address this before finishing the GCP migration.

Priority: -- → P3

Brain dump:

"spike" is a very small number--it's like 15 or 20. We can see these spikes when looking at cumulative time spent handling symbol upload API requests.

This has been an issue for a long time, but it got worse when symbols files got bigger after we added inline function information in September 2022. We don't have a good way of measuring how this affects the upload-symbols builds. There are a couple of bugs about intermittent failures, but they don't have any metrics for failures.

A couple of years ago, to reduce the problem, I split the symbolication API out of Tecken and moved it to a dedicated service: Eliot. Symbolication API requests can take a long time to run, depending on whether they have to parse large symbols files. That reduced some of the unhealthy-host issues.

The upload API and download API are vastly different in a number of ways. Jason and I talked about possibly breaking the upload API into a separate service as well. That would let us tune its infrastructure specifically for handling large uploads that take a long time to process. It's not clear how much this would help on its own.

The other thought was to change the upload API to work more like how the Symbolicator service handles upload requests: the upload API handler saves the payload, returns a response, and a background process does the actual processing. This would be a fundamental architecture change for Tecken. The upload-symbols build step needs to know whether the upload was successfully processed, so this also requires changes to the upload-symbols code. I was thinking it might be better for us to create an upload client first, switch everything to use that, and then we could more easily change the symbol upload protocol since we'd maintain both sides.
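
To make the shape of that change concrete, here's a minimal sketch assuming a Django-style view; the storage and background-task helpers are hypothetical names for illustration, not Tecken's actual code:

```python
# Hypothetical sketch of an accept-then-process-later upload view.
import uuid

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

# Assumed helpers -- names are illustrative only, not Tecken code.
from tecken_sketch.storage import save_payload_to_bucket
from tecken_sketch.tasks import process_upload_in_background


@csrf_exempt
def upload_symbols(request):
    """Accept the upload immediately; process it out of band."""
    upload_id = str(uuid.uuid4())

    # Persist the raw payload so the request can return quickly.
    save_payload_to_bucket(upload_id, request.body)

    # Hand the heavy work (unpacking the zip, validating, storing the
    # individual symbols files) to a background worker.
    process_upload_in_background(upload_id)

    # The client would later poll a status endpoint with this id to learn
    # whether processing succeeded -- the protocol change mentioned above.
    return JsonResponse({"upload_id": upload_id, "status": "queued"}, status=202)
```

The key point is that the expensive work happens outside the request/response cycle, so gunicorn workers free up quickly; the cost is that the client needs a second round trip to learn whether processing succeeded, which is why this touches the upload-symbols side as well.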

Re-architecting the upload API is covered in bug #1652835. Writing an upload client is covered in bug #1806578.

See Also: → 1806578, 1652835

We currently run a minimum of 5 Tecken instances with 5 gunicorn workers each. It would take at least 25 upload requests to tie up all the gunicorn workers. Since uploads can take multiple minutes to process, these 25 requests don't all need to arrive at exactly the same time, just within a short enough window.

I think the number of simultaneous upload requests needed can be lower than 25, which is why I was saying it could be between 15 and 20; that may not be easy to see with our dashboards. Tecken does other things besides upload requests. It's the combination of all that work, coupled with the periodic health checks, that causes instances to get marked unhealthy over time.
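
As a back-of-envelope check of those numbers (illustrative only; the average upload duration here is an assumption):

```python
# Rough arithmetic for the worker-saturation threshold described above.
instances = 5              # minimum Tecken instances
workers_per_instance = 5   # gunicorn workers per instance
total_workers = instances * workers_per_instance   # 25

avg_upload_minutes = 3     # assumed; uploads can take multiple minutes
# Roughly how many uploads per minute keep every worker busy.
saturating_uploads_per_minute = total_workers / avg_upload_minutes  # ~8.3

# In practice the threshold is lower than 25 concurrent uploads, because
# workers are also handling downloads, health checks, and other requests.
print(total_workers, round(saturating_uploads_per_minute, 1))
```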

We discussed in Slack that we would like to try increasing the maximum number of Tecken instances to see whether that helps with the issue. We also discussed increasing the number of gunicorn workers per instance, but considered increasing the instance count lower risk.

Copying more of our Slack discussion here.

We've had several alerts for Tecken unhealthy hosts in the last 24 hours. That's why this is coming up now.

Increasing instances is low risk, but ultimately costs more. The theory is that when we get a burst of upload API requests, having more instances and workers to handle them reduces the likelihood that specific instances are working solely on long-running requests and therefore not responding to health checks. Maybe this is a good-enough short-term fix? Maybe it doesn't fundamentally improve anything?

Increasing gunicorn workers per instance doesn't increase cost, but it possibly increases the risk of service degradation and outages. If the number of workers on an instance goes up, so does the maximum number of upload API requests that instance can handle at the same time, and each of those requests has memory and disk needs. Upload API requests where the zip file is in the payload are limited to 2 GB. However, upload API requests where the payload is a URL to the zip file are not limited. Tecken tries to keep everything on disk rather than in memory, but even so, memory usage fluctuates.
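
For reference, these are the kinds of gunicorn settings in play; the values here are illustrative assumptions, not Tecken's actual configuration:

```python
# gunicorn.conf.py -- illustrative settings only.

# More workers per instance means more uploads handled concurrently, but
# each in-flight upload needs disk (and some memory), so the instance's
# resource ceiling has to scale with this number.
workers = 5

# A generous timeout, since a single upload can take multiple minutes to
# process; too low and gunicorn kills legitimate long-running uploads.
timeout = 600

# Recycle workers periodically to limit memory growth from large payloads
# (memory usage fluctuates even when data is kept on disk).
max_requests = 100
max_requests_jitter = 10
```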

We have an upload API metrics dashboard:

https://earthangel-b40313e5.influxcloud.net/d/6gimTZ6Vz/tecken-upload-api-metrics

I adjusted the panels to include the things Sven was looking at.

We deployed this change to prod in bug #1908904. I see more instances now. We'll see how this affects things over the next month.

In bug #1910613, we switched Tecken from rate-limiting upload requests to connection-limiting upload requests, which keeps some gunicorn workers free for non-upload requests. In this model, Tecken instances are more likely to pass health checks since gunicorn workers are more likely to be available to handle them.
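
As a rough illustration of the connection-limiting idea (not the code that landed in bug #1910613), a per-process sketch might look like this; in a real multi-process gunicorn deployment the limit would need shared state or enforcement at the load balancer:

```python
# Hypothetical Django middleware capping concurrent uploads so some
# capacity stays free for health checks and download requests.
import threading

from django.http import JsonResponse

MAX_CONCURRENT_UPLOADS = 3  # assumed value for illustration
_upload_slots = threading.BoundedSemaphore(MAX_CONCURRENT_UPLOADS)


class UploadConnectionLimitMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Only limit the upload API; the path prefix is an assumption.
        if not request.path.startswith("/upload/"):
            return self.get_response(request)

        # Non-blocking acquire: if all upload slots are taken, reject
        # immediately instead of tying up another worker.
        if not _upload_slots.acquire(blocking=False):
            return JsonResponse(
                {"error": "too many concurrent uploads"}, status=429
            )
        try:
            return self.get_response(request)
        finally:
            _upload_slots.release()
```

The effect is that even during an upload burst, some capacity per instance stays free for health checks and other requests.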

I pushed everything we've landed so far to production in bug #1911874. I'll keep tabs on Tecken instance health for the week and update this bug accordingly.

We did some work on this, but we're lacking a good view of the metrics, so as best I can figure it was a wash. Unassigning myself.

Assignee: willkg → nobody
Status: ASSIGNED → NEW