Closed
Bug 1337506
Opened 7 years ago
Closed 7 years ago
antenna needs to shut down gracefully [antenna]
Categories
(Socorro :: Antenna, task)
Socorro
Antenna
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
References
Details
Attachments
(2 files)
Antenna should handle OS signals correctly. Particularly the "shut down, please" one should cause Antenna to stop handling incoming HTTP connections, but flush it's outgoing s3 queue if possible and then shut down cleanly. This bug covers checking whether that's working and if not, implement/fix it.
Assignee | ||
Comment 1•7 years ago
|
||
Gunicorn's docs on signals is here: http://docs.gunicorn.org/en/stable/signals.html I think that indicates that I have to add handling for SIGTERM to gracefully finish the s3 queue. I'll work on that.
Assignee | ||
Comment 2•7 years ago
|
||
I'm going to wait on implementing this. First off, it's nice, but I'm not convinced we actually need it. The common scenario is: 1. Antenna fails a health check or we're doing a deploy or something that causes the autoscaler to kill this node 2. The autoscaler takes the node out of rotation and waits for it to finish handling the HTTP connections it has 3. The autoscaler waits some grace period so Antenna can flush it's s3 queue 4. The autoscaler kills the node I'm pretty sure at no point in that set of steps does something send a SIGTERM to nginx or gunicorn or the gunicorn worker processes. If that's correct, then we don't need to add a bunch of code and testing to make sure Antenna behaves well.
Assignee | ||
Comment 3•7 years ago
|
||
Miles pointed out that if we set up gunicorn to kill Antenna process after max_requests requests, then it sends Antenna a SIGTERM and we'd want ANTENNA to comply in a well-behaved and orderly fashion. Right now, we're not limiting process lifetime with max_requests, so we don't need it, but if we decide to do that because memory creep or other related long-running kinds of issues, we'll want to implement this.
Assignee | ||
Comment 4•7 years ago
|
||
So, Gunicorn starts up Gevent workers and that runs the Antenna app code. The Gevent workers handle various things related to the relationship between the workers and the parent process. Pretty sure this includes signals. I'd be more sure, but my brain has converted itself to applesauce. There's a section of Gevent code that kicks in on "shutdown" and waits until all incoming HTTP connections have been handled and then shuts down. It knows whether there are outstanding incoming HTTP connections that are being handled by looking at a Pool it uses for keeping track of them all. That's great, except it doesn't account for the crashes that are in the s3 save queue. With this bit of wisdom, I set out to modify Antenna to add the heartbeat coroutine to the incoming connection pool. Since the heartbeat coroutine is in the incoming connection pool, it prevents the gevent worker from shutting down until the s3 save queue is cleared. Hooray! Covered in PR: https://github.com/mozilla/antenna/pull/180
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Assignee | ||
Comment 5•7 years ago
|
||
This landed and we pushed it to -stage. It passes some stuff to the logs when it kicks in. I'll keep an eye out for that text over the next couple of days.
Assignee | ||
Comment 6•7 years ago
|
||
I haven't seen it in the logs, yet. I'll wait to the end of the week. It's possible it's such an edge case scenario that it doesn't happen often or we're not doing the things to trigger it.
Assignee | ||
Comment 7•7 years ago
|
||
I still haven't seen any of the messages in the logs. Specifically "work left to do" which indicates the Gunicorn process is shutting down and Antenna is like, "woah--not yet!" If it's the case that Antenna is totally done, then we should see "Heartbeat stopped" since that indicates that all the requirements for Antenna to stay alive are done. I'm not seeing that one, either. So either there's a bug and it's not working or Gunicorn is configured in such a way that it'll never kick in. Maybe in node shutdown, all the processes are SIGKILLd rather than SIGTERMd?
Assignee | ||
Updated•7 years ago
|
Summary: antenna needs to handle signals correctly [antenna] → antenna needs to shut down gracefully [antenna]
Assignee | ||
Comment 8•7 years ago
|
||
Assignee | ||
Comment 9•7 years ago
|
||
mozilla/antenna#189 doesn't fix the issue. I tried to "simulate" the circumstances locally, but can't seem to coerce Gunicorn into shutting down gracefully. I'm not really sure why. Maybe the Gunicorn parent process stops slurping stdout after it's issued a TERM to the subprocess? This is worth looking into further, but shouldn't block moving forward.
Assignee | ||
Comment 10•7 years ago
|
||
Changing this so it doesn't block bug #1315258 anymore. This is still a curiosity, but we can get by without it. If it turns out that the changes in mozilla/antenna#189 don't help, then we should remove those changes because they add some non-trivial complexity that isn't helping.
No longer blocks: 1315258
Assignee | ||
Comment 11•7 years ago
|
||
Switching Antenna bugs to Antenna component.
Component: General → Antenna
Assignee | ||
Comment 12•7 years ago
|
||
Assignee | ||
Comment 13•7 years ago
|
||
This landed in master, went to -dev and Miles pushed it to -stage. I looked at the grephost and it's working perfectly. Marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•