change upload limiting to connection limiting
Categories
(Tecken :: Upload, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Unassigned)
Details
Attachments
(4 files)
- GitHub pull request, 65 bytes, text/x-github-pull-request
- [mozilla-services/tecken] bug-1910613: update guidance on 429s and symbols zip archive sizes (#2984), 52 bytes, text/x-github-pull-request
- GitHub pull request, 65 bytes, text/x-github-pull-request
- GitHub pull request, 65 bytes, text/x-github-pull-request
Handling upload API requests can be extremely time-consuming. Because of this, a single instance can get swamped with upload requests that take a long time to process, which makes the instance look unhealthy and causes new connections from nginx to gunicorn to fail with HTTP 5xx errors.
To account for that, when Tecken was set up, rate limiting was added in nginx to restrict the number of uploads per time period for a given instance. Some uploads are small and get handled very quickly; some are large and take a long time to process. Rate limiting is a heavy-handed approach that doesn't take this into account at all, so we end up with instances that are blocked from taking upload requests even though they aren't really doing anything.
Sven tested out switching from rate limiting to connection limiting, where nginx instead restricts the number of simultaneous upload requests. If we set the limit to something like (gunicorn workers - 2), we think that will improve upload handling in prod and vastly improve upload handling in stage, where we run the system tests.
This bug covers making the changes in our AWS and GCP nginx configuration.
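For reference, here's a minimal sketch of the nginx change being described; the zone names, the upstream name, and the numbers are illustrative, not the actual AWS/GCP configuration:

# Old approach: rate limit uploads per time period per instance.
# limit_req_zone $server_name zone=upload_rate:1m rate=6r/m;

# New approach: cap the number of simultaneous upload requests so a
# couple of gunicorn workers stay free for health checks and other
# traffic (assuming 5 workers, the cap is workers - 2 = 3).
limit_conn_zone $server_name zone=upload_conn:1m;

server {
    location /upload/ {
        limit_conn upload_conn 3;
        limit_conn_status 429;              # tell clients to back off and retry
        proxy_pass http://tecken_gunicorn;  # hypothetical upstream name
    }
}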
Reporter
Comment 1 • 7 months ago
Sven will fix this in GCP.
WillKG will fix this in AWS.
Reporter
Updated • 7 months ago
Reporter
Comment 2 • 7 months ago
Assigning to Sven to do the GCP configuration change.
Comment 3 • 7 months ago
Here are the changes I made for GCP: https://github.com/mozilla-it/webservices-infra/pull/2355/commits/a0585def02201c8fe738ec4d432bef7207d6ee37
They are deployed to the stage environment and appear to be working fine. I didn't do any load testing, though.
Reporter
Comment 4 • 7 months ago
Running the existing system tests against stage 10 times in a row with no breaks yields this set of timings:
round 0: elapsed: 0:01:04.864894
round 1: elapsed: 0:02:35.779171
round 2: elapsed: 0:02:35.855127
round 3: elapsed: 0:02:38.264483
round 4: elapsed: 0:02:36.296327
round 5: elapsed: 0:02:36.322767
round 6: elapsed: 0:02:05.452292
round 7: elapsed: 0:02:35.818059
round 8: elapsed: 0:02:36.022828
round 9: elapsed: 0:02:37.009649
The first round takes about 1 minute. The others all hit rate limiting and take roughly two and a half minutes to run.
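A loop like the following Python sketch produces timings in this format; the systemtest command is a placeholder, not the actual invocation:

import datetime
import subprocess

# Run the systemtests repeatedly and print per-round elapsed times in the
# same format as above. Substitute however you invoke the Tecken
# systemtests against stage.
for round_number in range(10):
    start = datetime.datetime.now()
    subprocess.run(["./run-systemtests.sh", "stage"], check=True)
    print(f"round {round_number}: elapsed: {datetime.datetime.now() - start}")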
If I understand correctly, the systemtests (skipping large_files tests) are doing:
- 2 uploads
- 1 upload with a bad token
- 1 upload-by-download
That's 4 uploads. When the tests hit rate-limiting, they sleep for 30 seconds and retry. It's probably the case that each test run is retrying 3 times.
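The retry behavior has roughly this shape; a minimal Python sketch assuming a requests-based upload helper (the function name, header handling, and 30-second sleep illustrate the pattern rather than being code copied from the systemtests):

import time

import requests

def upload_with_retry(url, zip_path, auth_token, max_retries=5):
    """POST a symbols zip, sleeping and retrying when the server says 429."""
    for attempt in range(max_retries):
        with open(zip_path, "rb") as fp:
            resp = requests.post(
                url,
                headers={"Auth-Token": auth_token},
                files={"symbols.zip": fp},
                timeout=600,
            )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Rate limited (or, after this change, too many simultaneous
        # uploads): back off and try again.
        time.sleep(30)
    raise RuntimeError(f"upload still throttled after {max_retries} attempts")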
We have the upload API dashboard showing the period where I ran 10 rounds of the systemtests. I'll use that as a baseline to measure changes against.
I can use the new systemtest bits to put together a rough load test.
Reporter
Comment 5 • 7 months ago
Reporter
Comment 6 • 7 months ago
Reporter
Comment 7 • 7 months ago
I ran the systemtests 10 times in rapid succession:
round 0: elapsed: 0:00:05.332435
round 1: elapsed: 0:00:04.751373
round 2: elapsed: 0:00:04.554610
round 3: elapsed: 0:00:04.767880
round 4: elapsed: 0:00:05.808504
round 5: elapsed: 0:00:04.283236
round 6: elapsed: 0:00:05.042892
round 7: elapsed: 0:00:04.950264
round 8: elapsed: 0:00:04.417888
round 9: elapsed: 0:00:04.812643
We went from 1 minute (the best time in comment #4; the rest were around 2.5 minutes) to about 5 seconds per test run. That's cool.
I built a load test that simulates 3 users uploading 10 MB zip payloads full of 1 MB sym files.
Load test before landing that change and doing a stage deploy:
Tue Jul 30 21:13:41 UTC 2024: Locust end 20240730-210000-aws_stage-normal.
20240730-210000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240730-210000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 203 | 193 | 0.85 | 2,948.42 | 3,000 | 4,700
Failures:
Method | Name | Error | Occurrences
--------+----------+---------------------------------------------------+-------------
POST | /upload/ | HTTPError('429 Client Error: for url: /upload/') | 193
Load test after landing that change and doing a stage deploy:
Wed Jul 31 00:28:26 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 137 | 0 | 0.58 | 4,652.26 | 4,300 | 6,900
Failures: None
The zip files were 10 MB with 1 MB sym files in them. Before the change, only 10 of the 203 requests succeeded; after it, all 137 requests succeeded. That's bonkers, so much better.
Grafana looks good, too.
In production, payloads are in the hundreds-of-megabytes range. I changed the load test to simulate 3 users uploading 100 MB zip files with 10 MB sym files in them.
Wed Jul 31 00:38:38 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 15 | 0 | 0.06 | 39,338.47 | 40,000 | 45,000
Failures: None
Still looks good.
Then I tried 5 users with 400 MB zip files with 10 MB sym files in them. That completely saturated the upload bandwidth on my Internet connection, so none of the uploads finished.
Wed Jul 31 00:45:10 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=5 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
------+----------+----------+-------+---------------+----------+----------
Failures: None
Oh well.
I tweaked the load test and landed it:
https://github.com/mozilla-services/tecken-loadtests/pull/33
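For anyone curious, here's a stripped-down Python sketch of what a Locust user for this kind of test looks like; the class name, payload builder, auth env var, and sizes are illustrative rather than the actual tecken-loadtests code:

import io
import os
import zipfile

from locust import HttpUser, between, task


def build_zip_payload(num_files=10, file_size=1024 * 1024):
    # Build an in-memory zip of fake .sym files. Random bytes only exercise
    # the upload path; a realistic run needs valid sym contents so the
    # server-side processing succeeds.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(num_files):
            zf.writestr(f"fake_{i}.sym", os.urandom(file_size))
    buf.seek(0)
    return buf


class UploadUser(HttpUser):
    # Pause a little between uploads so the run approximates real clients.
    wait_time = between(1, 5)

    @task
    def upload_symbols(self):
        self.client.post(
            "/upload/",
            headers={"Auth-Token": os.environ["AUTH_TOKEN"]},
            files={"symbols.zip": build_zip_payload()},
        )

Running it headless with something like "locust -f upload.py --headless --users 3 --run-time 4m --host <stage base URL>" approximates the 3-user runs above.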
I think this change is good enough to go to production.
Reporter
Comment 8 • 7 months ago
This went to AWS production in bug #1910917. I'll keep an eye on Grafana for the rest of the week to see how it affects performance.
Reporter
Comment 9 • 7 months ago
Mike: Is there a way I can look at upload-symbols task timings for builds before and after this was deployed to production? Who can I talk to for help here?
Comment 10 • 7 months ago
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #9)
> Mike: Is there a way I can look at upload-symbols task timings for builds before and after this was deployed to production? Who can I talk to for help here?
Trying Johan.
Comment 11 • 7 months ago
Hello there!
Here's a heatmap of the duration of all upload symbol tasks on a given day: https://sql.telemetry.mozilla.org/queries/101620/source?p_project=%5B%22mozilla-central%22%2C%22mozilla-beta%22%2C%22mozilla-release%22%5D&p_start_date=2024-06-01&p_task_kind=upload-symbols#250376
Feel free to tinker with it and/or ask any questions 🙂
Reporter
Comment 12 • 7 months ago
There's only been a day of data since we made the change, but I'm not seeing evidence that this improved upload-symbols or uploading symbols in general. I do think we want to change the guidance for what to do when a client gets a 429. I'll update the docs later.
We probably want to follow this up by adding more gunicorn workers to each instance. Currently the worker count is set to (num_processor * 2) + 1, which works out to 5. We could increase that and the connection limit together and see how that affects things.
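Since the gunicorn config file is just Python, that knob is a one-line change; a sketch under the assumption the worker count lives in a gunicorn.conf.py (file name and the bumped value are illustrative):

# gunicorn.conf.py
import multiprocessing

# Current formula: (num processors * 2) + 1, which works out to 5 here.
workers = multiprocessing.cpu_count() * 2 + 1

# Possible follow-up: bump the multiplier and raise the nginx limit_conn
# value to match (workers - 2), then watch the upload dashboards.
# workers = multiprocessing.cpu_count() * 4 + 1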
Reporter
Comment 13 • 7 months ago
Reporter
Comment 14 • 7 months ago
Reporter
Comment 15 • 7 months ago
Reporter
Comment 16 • 7 months ago
Reporter
Comment 17 • 7 months ago
Reporter
Comment 18 • 7 months ago
Reporter
Comment 19 • 7 months ago
Everything so far was pushed to production in bug #1911874.
I looked at Grafana dashboards for a while. The change in gunicorn workers doesn't seem to have made anything worse. I'll do more analysis at the end of this week.
Reporter
Comment 20 • 5 months ago
We did some work on this, but we're lacking a good view of the metrics, so the best I can figure is that it was a wash. Unassigning myself.