memory usage on nodes sometimes goes monotonic

Status

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

We put Antenna in -prod yesterday. Since then, the three nodes in the original cluster have each been doing fine for a while and then independently started accruing memory until hitting the maximum.

https://app.datadoghq.com/dash/274773/antenna--prod?live=false&page=0&is_auto=false&from_ts=1492121395638&to_ts=1492182202000&tile_size=m&tpl_var_env=prod#close

When that happens, we get paged. The nodes seem to be ok, but maybe they're not; there's no evidence we're losing crashes during this time.

This bug covers looking into that further and determining:

1. Is this bad?
2. If it's not bad, can we make it look less bad in the graphs?
3. If it is bad, how can we figure out what's going on so we can fix it?
Grabbing this to work on today.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Miles and I talked for a while about this.

Maybe this is a memory leak? Maybe it's a memory leak in an external library? Maybe it's memory fragmentation in the Python process? Antenna is an upload server: it receives big payloads, manipulates them, and then throws them away, which churns through big chunks of memory.

Leaving that aside for a moment, let's think about how to sidestep the problem rather than fix it: set things up so it's moot.

Currently, we don't have Gunicorn's max_requests set, so workers spin up, do a bunch of work, and keep working forever. One of the reasons we left it unset is that we were concerned we'd drop crashes if we recycled workers.

Bug #1337506 covers fixing Antenna to handle shutdown gracefully such that it would prevent the process from disappearing until the crashes had been saved. I had taken a couple of stabs at it, but our implementation probably sucked and I was never able to test it effectively and verify it was working.

I revisited that bug just now and redid the implementation. The new implementation works and I can verify it. (Yay!) Given that, we can now enable Gunicorn's "max_requests" and that'll make the symptoms of our long-running-process memory problem disappear.
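The actual graceful-shutdown work is in bug #1337506, so the code there is authoritative. As a rough illustration of the general pattern only (all names here are hypothetical, not Antenna's), the idea is to flag shutdown on SIGTERM and drain any queued crashes before the process exits:

```python
import signal
import time
from queue import Queue, Empty

# Hypothetical sketch of the graceful-shutdown pattern, not Antenna's
# actual code: crashes awaiting save sit in a queue; SIGTERM sets a
# flag so the request loop can stop, and drain() saves what's queued
# before the process exits.
pending_crashes = Queue()
shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def save_crash(crash):
    # Stand-in: persist the crash report somewhere durable.
    pass

def drain(timeout=10.0):
    """Save everything still queued, up to a deadline; return count."""
    deadline = time.monotonic() + timeout
    saved = 0
    while time.monotonic() < deadline:
        try:
            crash = pending_crashes.get_nowait()
        except Empty:
            break
        save_crash(crash)
        saved += 1
    return saved
```

The deadline matters: a worker being recycled should finish quickly rather than hang forever on a slow storage backend.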

We still don't know what the memory usage problem is, but debugging it is probably a project in itself, especially since we have difficulty reproducing it outside of -prod. We could spend time on it, but using max_requests and recycling workers seems like the better option time-wise right now.

Making this block on bug #1337506. After that lands and works, we can set "max_requests".
Depends on: 1337506
Say the server gets 5000 requests/min. There are 3 nodes, so that's about 1666 requests/min/node. There are 5 workers on each node, so that's about 333 requests per minute per worker.
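As a sanity check on that arithmetic:

```python
requests_per_min = 5000
nodes = 3
workers_per_node = 5

per_node = requests_per_min // nodes        # 1666 requests/min/node
per_worker = per_node // workers_per_node   # 333 requests/min/worker
print(per_node, per_worker)                 # → 1666 333
```

So at a 10,000-request recycle threshold, each worker would recycle roughly every half hour.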

I don't think we want to recycle workers often. Maybe 10,000 requests +/- 1,000 in jitter?
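In Gunicorn terms that maps onto the max_requests and max_requests_jitter settings; the jitter randomizes each worker's threshold so workers don't all recycle in lockstep. A sketch of a config file with those numbers (the actual deployed values live in the -stage/-prod configs and may differ):

```python
# gunicorn.conf.py (sketch; deployed values may differ)

workers = 5

# Recycle each worker after ~10,000 requests, with up to 1,000
# requests of random jitter so restarts are staggered.
max_requests = 10000
max_requests_jitter = 1000
```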

Miles set up -stage and -prod that way. We'll see how that goes.
Memory usage on the nodes hangs out at a balmy 20% and hasn't budged at all. I don't see the problems detailed in the description anymore.

I'm going to mark this FIXED.
Status: ASSIGNED → RESOLVED
Resolution: --- → FIXED