reduce gunicorn worker restarting
Categories: Location :: General, task, P2
Tracking: Not tracked
People: Reporter: willkg, Assigned: willkg
Details
gunicorn is set to restart workers after every 10k +/- 1k requests. John was looking at logs and sees a worker restarting every 2 minutes.
That's a lot of restarting and seems excessive. This bug covers finding a better value for max_requests.
As a side bonus, it'd be nice to know whether restarting the worker process also picks up a changed maxminddb.
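For reference, worker recycling is controlled by two gunicorn settings. A minimal sketch of the current values (the filename and layout here are illustrative, not necessarily our actual config):

```python
# gunicorn.conf.py -- illustrative sketch of the current values, not the real MLS config.
# Each worker is recycled after max_requests +/- max_requests_jitter requests;
# the jitter keeps all workers from restarting at the same moment.
max_requests = 10_000
max_requests_jitter = 1_000
```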
Comment 1 • 5 years ago (Assignee)
Grabbing this to work on this week.
Comment 2 • 5 years ago (Assignee)
According to graphs, MLS gets (on average) 106k requests per minute (pretty sure the graph is saying the interval is 1 minute).
We've got worker concurrency set to 4. We've got 16 web nodes. So that's 64 workers handling 106k requests per minute. Each worker is handling 1,656 requests per minute (106k / 64). Each worker will run for 5.4 to 6.6 minutes.
That seems too short to me. I think we should double the max_requests to 20k.
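Rough math behind those numbers (sketch only; the 2k jitter for the 20k case is a guess for illustration, the current setting is 10k +/- 1k):

```python
# Back-of-the-envelope worker lifetime from the numbers above.
requests_per_minute = 106_000
workers = 16 * 4  # 16 web nodes x worker concurrency 4

per_worker_rpm = requests_per_minute / workers  # ~1,656 requests/minute/worker

for max_requests, jitter in [(10_000, 1_000), (20_000, 2_000)]:
    low = (max_requests - jitter) / per_worker_rpm
    high = (max_requests + jitter) / per_worker_rpm
    print(f"max_requests={max_requests}: worker lives {low:.1f}-{high:.1f} minutes")

# max_requests=10000: worker lives 5.4-6.6 minutes
# max_requests=20000: worker lives 10.9-13.3 minutes
```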
Things we're thinking about here:
- it takes time to recycle a worker, so the more frequently it happens, the more time wasted
- I don't know what the performance characteristics of web workers in MLS are, but I'm not seeing indications of memory leaks or other badness in the graphs, so going longer seems like it'd be ok
- we theorize that when the web worker restarts, it'll pick up the new maxminddb (needs to be verified); if true, we wouldn't want a web worker going for hours before picking up the update
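To illustrate that theory (a sketch under assumptions, not our actual code): if the reader is opened once when the worker process starts (and the app isn't preloaded in the master), the worker keeps serving the old data until it gets recycled, so a restart is what would pick up the new file. The path below is made up.

```python
import maxminddb

# Opened once at worker startup; this worker keeps reading whatever copy of
# the database was on disk at that moment until the process is recycled.
GEOIP_READER = maxminddb.open_database("/path/to/GeoLite2-City.mmdb")

def lookup(ip_address):
    # Returns the record for the IP from the database file that was open
    # when this worker started.
    return GEOIP_READER.get(ip_address)
```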
What other things particular to MLS should we consider here?
Tagging ckolos with a needinfo to weigh in with the ops angle.
Comment 3 • 5 years ago
I think we'd be fine doubling (possibly tripling) the number currently in use. Webheads run lean wrt memory utilization, and that would be my greatest concern when cycling workers out. The average memory free across all webheads is ~86%, which leaves us with some room to stretch.
Comment 4 • 5 years ago (Assignee)
I don't see a lot of risk here, so let's triple it and see how that goes. I'll do up a pull request now.
Comment 5 • 5 years ago (Assignee)
Comment 6 • 5 years ago (Assignee)
ckolos did some experimentation and we're going to use 100k for max_requests:
| Connection Limit (max_requests) | Jitter | Start Time | Worker 1 Restart Time | Worker 2 Restart Time | Worker 3 Restart Time | Worker 4 Restart Time | Memory Utilization |
|---|---|---|---|---|---|---|---|
| 20000 | 2000 | 15:44 | 15:52 | 15:54 | 15:54 | 15:57 | 2 |
| 50000 | 1000 | 16:10 | 16:33 | 16:35 | 16:36 | 16:36 | 2.1 |
| 100000 | 1000 | 16:43 | 17:24 | 17:28 | 17:35 | 17:50 | 2.3 |
| _ | _ | _ | 18:10 | 18:28 | 18:34 | 18:34 | _ |
This will go in with the next prod deploy.
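For reference, a sketch of the settings that match the chosen value (the deploy may set these differently, e.g. via command-line flags or environment variables):

```python
# Sketch of the new values (matches the 100000 / 1000 row in the table above).
max_requests = 100_000
max_requests_jitter = 1_000
```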