Bug 1616658 (Closed): Opened 5 years ago, Closed 5 years ago

reduce gunicorn worker restarting

Categories

(Location :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

gunicorn is set to restart workers after every 10k +/- 1k requests. John was looking at logs and saw a worker restarting every 2 minutes.

That's a lot of restarting and seems excessive. This bug covers finding a better value for max_requests.

As a side bonus, it'd be nice to know whether restarting the worker process also picks up a changed maxminddb.
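For reference, a minimal sketch of what the current settings would look like in a gunicorn config file. The file name, worker count, and exact values are assumptions based on the description above, not the actual MLS configuration:

```python
# gunicorn.conf.py -- hypothetical sketch of the settings described above
workers = 4                  # worker concurrency per web node (assumed)
max_requests = 10000         # recycle a worker after roughly 10k requests
max_requests_jitter = 1000   # add up to 1k of random jitter so workers
                             # don't all restart at the same moment
```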

Grabbing this to work on this week.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

According to graphs, MLS gets (on average) 106k requests per minute (pretty sure the graph is saying the interval is 1 minute).

We've got worker concurrency set to 4. We've got 16 web nodes. So that's 64 workers handling 106k requests per minute. Each worker is handling 1,656 requests per minute (106k / 64). With max_requests at 10k +/- 1k, each worker will run for 5.4 to 6.6 minutes before being recycled.
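The arithmetic, written out (numbers are the averages quoted above):

```python
# Back-of-the-envelope check of the worker lifetime estimate above.
requests_per_minute = 106_000          # average MLS traffic from the graphs
workers = 4 * 16                       # concurrency 4 across 16 web nodes

per_worker = requests_per_minute / workers   # ~1,656 requests/min per worker
lifetime_low = 9_000 / per_worker            # ~5.4 minutes (10k minus 1k jitter)
lifetime_high = 11_000 / per_worker          # ~6.6 minutes (10k plus 1k jitter)
print(f"{per_worker:.0f} req/min per worker, "
      f"lifetime {lifetime_low:.1f}-{lifetime_high:.1f} min")
```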

That seems too short to me. I think we should double the max_requests to 20k.

Things we're thinking about here:

  • it takes time to recycle a worker, so the more frequently it happens, the more time wasted
  • I don't know what the performance characteristics of web workers in MLS are, but I'm not seeing indications of memory leaks or other badness in the graphs, so going longer seems like it'd be ok
  • we theorize that when the web worker restarts, it'll pick up the new maxminddb (needs to be verified; see the sketch after this list); if true, we wouldn't want a web worker going for hours before picking up the update
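For context on that last point, a minimal sketch of why a restart would pick up a new database file, assuming MLS opens the maxminddb reader once at worker startup. The path and function here are illustrative, not MLS code:

```python
import maxminddb

# Opened once when the worker process boots. If the .mmdb file on disk is
# replaced afterwards, this in-memory reader keeps serving the old data
# until the worker is recycled and a new process reopens the file.
reader = maxminddb.open_database("/path/to/GeoLite2-City.mmdb")  # path assumed

def geolocate(ip_address):
    # Returns the raw record for the IP, or None if it isn't in the database.
    return reader.get(ip_address)
```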

What other things particular to MLS should we consider here?

Tagging ckolos with a needinfo to weigh in with the ops angle.

Flags: needinfo?(ckolos)

I think we'd be fine doubling (possibly tripling) the number currently in use. Webheads run lean wrt memory utilization, and that would be my greatest concern when cycling workers out. The average memory free across all webheads is ~86%, which leaves us with some room to stretch.

Flags: needinfo?(ckolos)

I don't see a lot of risk here, so let's triple it and see how that goes. I'll do up a pull request now.

ckolos did some experimentation and we're going to use 100k for max_requests:

Connection Limit | Jitter | Start Time | Worker 1 restart | Worker 2 restart | Worker 3 restart | Worker 4 restart | Memory utilization
20000            | 2000   | 15:44      | 15:52            | 15:54            | 15:54            | 15:57            | 2
50000            | 1000   | 16:10      | 16:33            | 16:35            | 16:36            | 16:36            | 2.1
100000           | 1000   | 16:43      | 17:24            | 17:28            | 17:35            | 17:50            | 2.3
_                | _      | _          | 18:10            | 18:28            | 18:34            | 18:34            |

This will go in with the next prod deploy.
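A sketch of what the chosen values would look like in the gunicorn config (the actual change landed in the pull request mentioned above and may differ in form):

```python
# gunicorn.conf.py -- values picked from the experiment above
max_requests = 100_000       # was 10_000; workers now live roughly ten times longer
max_requests_jitter = 1_000  # keep restarts staggered across workers
```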

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED