change upload limiting to connection limiting
Categories
(Tecken :: Upload, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Unassigned)
Details
Attachments
(4 files)
- GitHub pull request, 65 bytes, text/x-github-pull-request
- [mozilla-services/tecken] bug-1910613: update guidance on 429s and symbols zip archive sizes (#2984), 52 bytes, text/x-github-pull-request
- GitHub pull request, 65 bytes, text/x-github-pull-request
- GitHub pull request, 65 bytes, text/x-github-pull-request
Handling upload API requests can be extremely time-consuming. Because of this, a single instance can get swamped with upload requests that take a long time to process, which makes the instance look unhealthy and causes new connections from nginx to gunicorn to fail with HTTP 5xx errors.
To account for that, when Tecken was set up, rate limiting was added in nginx to restrict the number of uploads per time period for a given instance. Some uploads are small and get handled very quickly; some are large and take a long time to process. Rate limiting is a heavy-handed approach that doesn't take this into account at all, so we end up with instances that are blocked from taking upload requests even though they aren't really doing anything.
Sven tested out switching from rate limiting to connection limiting, where nginx instead restricts the number of simultaneous upload requests. If we set the limit to something like (gunicorn workers - 2), we think that will improve upload handling in prod and vastly improve upload handling in stage, where we run the system tests.
This bug covers making the changes in our AWS and GCP nginx configuration.
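For reference, here's a minimal sketch of the nginx change being described; the zone names, the upstream name, and the numbers are illustrative, not the actual AWS/GCP configuration:

# Old approach: rate limit uploads per time period per instance.
# limit_req_zone $server_name zone=upload_rate:1m rate=6r/m;

# New approach: cap the number of simultaneous upload requests so a
# couple of gunicorn workers stay free for health checks and other
# traffic (assuming 5 workers, the cap is workers - 2 = 3).
limit_conn_zone $server_name zone=upload_conn:1m;

server {
    location /upload/ {
        limit_conn upload_conn 3;
        limit_conn_status 429;              # tell clients to back off and retry
        proxy_pass http://tecken_gunicorn;  # hypothetical upstream name
    }
}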
Reporter
Comment 1 • 7 months ago
Sven will fix this in GCP.
WillKG will fix this in AWS.
Reporter
Updated • 7 months ago
Reporter
Comment 2 • 7 months ago
Assigning to Sven to do the GCP configuration change.
Comment 3 • 7 months ago
Here are the changes I made for GCP: https://github.com/mozilla-it/webservices-infra/pull/2355/commits/a0585def02201c8fe738ec4d432bef7207d6ee37
They are deployed to the stage environment and appear to be working fine. I didn't do any load testing, though.
Reporter
Comment 4 • 7 months ago
Running the existing system tests against stage 10 times in a row with no breaks yields this set of timings:
round 0: elapsed: 0:01:04.864894
round 1: elapsed: 0:02:35.779171
round 2: elapsed: 0:02:35.855127
round 3: elapsed: 0:02:38.264483
round 4: elapsed: 0:02:36.296327
round 5: elapsed: 0:02:36.322767
round 6: elapsed: 0:02:05.452292
round 7: elapsed: 0:02:35.818059
round 8: elapsed: 0:02:36.022828
round 9: elapsed: 0:02:37.009649
The first round takes about 1 minute. The others all hit rate limiting and take roughly two and a half minutes to run.
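A loop like the following Python sketch produces timings in this format; the systemtest command is a placeholder, not the actual invocation:

import datetime
import subprocess

# Run the systemtests repeatedly and print per-round elapsed times in the
# same format as above. Substitute however you invoke the Tecken
# systemtests against stage.
for round_number in range(10):
    start = datetime.datetime.now()
    subprocess.run(["./run-systemtests.sh", "stage"], check=True)
    print(f"round {round_number}: elapsed: {datetime.datetime.now() - start}")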
If I understand correctly, the systemtests (skipping large_files tests) are doing:
- 2 uploads
- 1 upload with a bad token
- 1 upload-by-download
That's 4 uploads. When the tests hit rate-limiting, they sleep for 30 seconds and retry. It's probably the case that each test run is retrying 3 times.
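The retry behavior has roughly this shape; a minimal Python sketch assuming a requests-based upload helper (the function name, header handling, and 30-second sleep illustrate the pattern rather than being code copied from the systemtests):

import time

import requests

def upload_with_retry(url, zip_path, auth_token, max_retries=5):
    """POST a symbols zip, sleeping and retrying when the server says 429."""
    for attempt in range(max_retries):
        with open(zip_path, "rb") as fp:
            resp = requests.post(
                url,
                headers={"Auth-Token": auth_token},
                files={"symbols.zip": fp},
                timeout=600,
            )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Rate limited (or, after this change, too many simultaneous
        # uploads): back off and try again.
        time.sleep(30)
    raise RuntimeError(f"upload still throttled after {max_retries} attempts")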
We have the upload API dashboard showing the period where I ran 10 rounds of the systemtests. I'll use that as a baseline to measure changes against.
I can use the new systemtest bits to put together a rough load test.
Reporter
Comment 5 • 7 months ago
Reporter
Comment 6 • 7 months ago
Reporter
Comment 7 • 7 months ago
I ran the systemtests 10 times in rapid succession:
round 0: elapsed: 0:00:05.332435
round 1: elapsed: 0:00:04.751373
round 2: elapsed: 0:00:04.554610
round 3: elapsed: 0:00:04.767880
round 4: elapsed: 0:00:05.808504
round 5: elapsed: 0:00:04.283236
round 6: elapsed: 0:00:05.042892
round 7: elapsed: 0:00:04.950264
round 8: elapsed: 0:00:04.417888
round 9: elapsed: 0:00:04.812643
We went from 1 minute (the best time in comment #4; the rest were around 2.5 minutes) to about 5 seconds per test run. That's cool.
I built a load test that simulates 3 users uploading 10 MB zip payloads full of 1 MB sym files.
Load test before landing that change and doing a stage deploy:
Tue Jul 30 21:13:41 UTC 2024: Locust end 20240730-210000-aws_stage-normal.
20240730-210000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240730-210000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 203 | 193 | 0.85 | 2,948.42 | 3,000 | 4,700
Failures:
Method | Name | Error | Occurrences
--------+----------+---------------------------------------------------+-------------
POST | /upload/ | HTTPError('429 Client Error: for url: /upload/') | 193
Load test after landing that change and doing a stage deploy:
Wed Jul 31 00:28:26 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 137 | 0 | 0.58 | 4,652.26 | 4,300 | 6,900
Failures: None
The zip files were 10 MB with 1 MB sym files in them. Before the change, only 10 of the 203 requests succeeded; after it, all 137 requests succeeded. That's bonkers, so much better.
Grafana looks good, too.
In production, payloads are in the hundreds-of-megabytes range. I changed the load test to simulate 3 users uploading 100 MB zip files with 10 MB sym files in them.
Wed Jul 31 00:38:38 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=3 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
----------+----------+----------+-------+---------------+----------+----------
/upload/ | 15 | 0 | 0.06 | 39,338.47 | 40,000 | 45,000
Failures: None
Still looks good.
Then I tried 5 users with 400 MB zip files with 10 MB sym files in them. That completely saturated the upload bandwidth on my Internet connection, so none of the uploads finished.
Wed Jul 31 00:45:10 UTC 2024: Locust end 20240731-000000-aws_stage-normal.
20240731-000000-aws_stage-normal users=5 runtime=4m
Runname: logs/20240731-000000-aws_stage-normal
Requests:
Name | Requests | Failures | Req/s | Avg Time (ms) | 50% (ms) | 95% (ms)
------+----------+----------+-------+---------------+----------+----------
Failures: None
Oh well.
I tweaked the load test and landed it:
https://github.com/mozilla-services/tecken-loadtests/pull/33
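For anyone curious, here's a stripped-down Python sketch of what a Locust user for this kind of test looks like; the class name, payload builder, auth env var, and sizes are illustrative rather than the actual tecken-loadtests code:

import io
import os
import zipfile

from locust import HttpUser, between, task


def build_zip_payload(num_files=10, file_size=1024 * 1024):
    # Build an in-memory zip of fake .sym files. Random bytes only exercise
    # the upload path; a realistic run needs valid sym contents so the
    # server-side processing succeeds.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(num_files):
            zf.writestr(f"fake_{i}.sym", os.urandom(file_size))
    buf.seek(0)
    return buf


class UploadUser(HttpUser):
    # Pause a little between uploads so the run approximates real clients.
    wait_time = between(1, 5)

    @task
    def upload_symbols(self):
        self.client.post(
            "/upload/",
            headers={"Auth-Token": os.environ["AUTH_TOKEN"]},
            files={"symbols.zip": build_zip_payload()},
        )

Running it headless with something like "locust -f upload.py --headless --users 3 --run-time 4m --host <stage base URL>" approximates the 3-user runs above.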
I think this change is good enough to go to production.
Reporter
Comment 8 • 7 months ago
This went to AWS production in bug #1910917. I'll keep an eye on Grafana for the rest of the week to see how it affects performance.
Reporter
Comment 9 • 7 months ago
Mike: Is there a way I can look at upload-symbols task timings for builds before and after this was deployed to production? Who can I talk to for help here?
Comment 10 • 7 months ago
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #9)
> Mike: Is there a way I can look at upload-symbols task timings for builds before and after this was deployed to production? Who can I talk to for help here?
Trying Johan.
Comment 11 • 7 months ago
Hello there!
Here's a heatmap of the duration of all upload symbol tasks on a given day: https://sql.telemetry.mozilla.org/queries/101620/source?p_project=%5B%22mozilla-central%22%2C%22mozilla-beta%22%2C%22mozilla-release%22%5D&p_start_date=2024-06-01&p_task_kind=upload-symbols#250376
Feel free to tinker with it and/or ask any questions 🙂
Reporter
Comment 12 • 7 months ago
There's only been a day of data since we made the change, but I'm not seeing evidence that this improved upload-symbols or uploading symbols in general. I do think we want to change the guidance for what to do when a client gets a 429. I'll update the docs later.
We probably want to follow this up by adding more gunicorn workers to each instance. Currently the worker count is set to (num_processor * 2) + 1, which works out to 5. We could increase that and the connection limit together and see how that affects things.
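Since the gunicorn config file is just Python, that knob is a one-line change; a sketch under the assumption the worker count lives in a gunicorn.conf.py (file name and the bumped value are illustrative):

# gunicorn.conf.py
import multiprocessing

# Current formula: (num processors * 2) + 1, which works out to 5 here.
workers = multiprocessing.cpu_count() * 2 + 1

# Possible follow-up: bump the multiplier and raise the nginx limit_conn
# value to match (workers - 2), then watch the upload dashboards.
# workers = multiprocessing.cpu_count() * 4 + 1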
Reporter
Comment 13 • 7 months ago
Reporter
Comment 14 • 7 months ago
Reporter
Comment 15 • 7 months ago
Reporter
Comment 16 • 7 months ago
Reporter
Comment 17 • 7 months ago
Reporter
Comment 18 • 7 months ago
Reporter
Comment 19 • 7 months ago
Everything so far was pushed to production in bug #1911874.
I looked at Grafana dashboards for a while. The change in gunicorn workers doesn't seem to have made anything worse. I'll do more analysis at the end of this week.
Reporter
Comment 20 • 5 months ago
We did some work on this, but we're lacking a good view of the metrics, so the best I can figure is that it was a wash. Unassigning myself.