Open Bug 1652814 Opened 4 years ago Updated 4 years ago

need better rate limiting

Categories

(Tecken :: General, enhancement, P2)

Tracking

(Not tracked)

People

(Reporter: willkg, Unassigned)

Details

We're having difficulties with bursts of requests arriving in short periods of time and flooding the system. We're currently doing some rate limiting in nginx, but it's not quite what we want.

This bug covers looking at switching to django-ratelimit and doing most/all of the rate limiting we need in the app.
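To illustrate the in-app approach: django-ratelimit keys its counters in Django's cache backend, so with a per-process local-memory cache the limits end up per-instance, which is what's wanted here. The mechanism underneath is essentially a token/leaky bucket; here is a self-contained sketch of that idea in plain Python (class and parameter names are illustrative, not django-ratelimit's API):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`.

    Illustrative sketch of per-instance, in-app rate limiting; this is not
    django-ratelimit's implementation, just the underlying idea.
    """

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full so an idle instance can burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A view wrapper would call `allow()` and return a 429 when it comes back False.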

I'll add some additional thoughts.

The primary goal of this rate limiting is to prevent saturating the worker processes on individual EC2 instances, so it should be per-instance rather than per-environment.

Tecken has three classes of endpoints:

  1. Symbols redirects - hit fairly constantly, about 15 times a second; each takes less than 10ms.
  2. Symbolication - comes in bursts every couple of hours, where several thousand requests are spread across a half hour. Can take a few ms if cached or a few seconds if not.
  3. Symbol uploads - comes in bursts frequently throughout the day. Can take anywhere from 30 seconds to several minutes.

All of these are served by the same synchronous Django app. We run 5 gunicorn worker processes on 2-CPU instances. That may be higher than ideal from a latency perspective; it is definitely high enough to saturate the CPUs on an instance.

Our current rate limiting in nginx is at https://github.com/mozilla-services/cloudops-deployment/blob/32f8e6d6b75b5be89cb10867003f1e070854a45f/projects/symbols/puppet/modules/symbols/templates/http_symbols.conf.erb. It allows an instance to initially accept 3 upload requests and then to accept a new one every 30 seconds. This can saturate an instance if uploads average more than 30 seconds each.
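The behavior described (three uploads up front, then one every 30 seconds) maps onto nginx's limit_req module with burst and nodelay. A minimal sketch, with illustrative zone name, key, and upstream (the real config is at the link above):

```nginx
# One token every 30 seconds = rate=2r/m; burst=3 nodelay admits the
# first three uploads immediately instead of queueing them.
# Keying on $server_name gives one shared bucket per instance rather
# than one per client IP (key choice is illustrative).
limit_req_zone $server_name zone=uploads:1m rate=2r/m;

location /upload/ {
    limit_req zone=uploads burst=3 nodelay;
    limit_req_status 429;
    proxy_pass http://tecken_app;  # assumed upstream name
}
```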

An improvement would be to express that when four workers are busy processing uploads, the fifth should reject new ones.
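That's concurrency-based admission control rather than rate-based. Within a single process it's just a non-blocking semaphore; a sketch, with hypothetical helper names. The catch, and part of why off-the-shelf rate limiters don't fit, is that gunicorn workers are separate processes, so a real version would need a cross-process counter (shared memory, Redis, etc.):

```python
import threading

class ConcurrencyLimiter:
    """Reject new work when all slots are busy (single-process sketch only).

    NOTE: hypothetical helper. Gunicorn workers are separate processes, so
    this semaphore would not be shared between them as written.
    """

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self):
        # Non-blocking: return False instead of queueing the request.
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()

def process(body):
    # Stand-in for the real (possibly minutes-long) upload work.
    return len(body)

upload_slots = ConcurrencyLimiter(max_concurrent=4)

def handle_upload(body):
    # Hypothetical view wrapper: 429 when all upload slots are busy.
    if not upload_slots.try_acquire():
        return ("429 Too Many Requests", None)
    try:
        return ("200 OK", process(body))
    finally:
        upload_slots.release()
```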

The alternative solution at the infra level is to split the upload and non-upload traffic (e.g. across different ASGs at the LB level, or just across different gunicorn worker pools at the instance/nginx level).

The alternative solution at the app level is to split accepting upload requests and completing them into two separate processes, and make accepting async (e.g. the socorro antenna and processor model).
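The accept/complete split can be sketched with an in-memory queue standing in for whatever durable handoff (object storage plus a queue, say) an Antenna-style design would actually use; all names here are illustrative:

```python
import queue
import threading

upload_queue = queue.Queue()
results = []

def accept_upload(payload):
    """Hypothetical accept endpoint: enqueue and return immediately (a 202)."""
    upload_queue.put(payload)
    return "202 Accepted"

def processor_loop():
    """Hypothetical separate processor: drains the queue and does the slow work."""
    while True:
        payload = upload_queue.get()
        if payload is None:  # sentinel to stop the loop
            break
        results.append(payload.upper())  # stand-in for minutes-long upload work
        upload_queue.task_done()

worker = threading.Thread(target=processor_loop, daemon=True)
worker.start()
```

With this shape, accepting an upload is cheap and fast, so it no longer ties up a worker process for the duration of the upload.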

Note that improved rate limiting doesn't actually help bug 1652803; it just prevents users of symbols redirects or symbolication from being affected by upload bursts. I still have to address scaling so that tasks don't fail due to uploads failing, regardless of whether they fail with a 500 or a 429.

Ahhh... ok. I don't think this is something we can solve with rate-limiting using any of the systems that I can think of. I could write it by hand, but that seems dumb.

We've talked about moving the bulk of the upload work out of the HTTP request/response cycle. I think that's the only way we're going to solve some other impending problems. For some reason, I haven't written up a bug for that. I'll do that now.

I'll leave this open, but it's probably a WONTFIX.
