antenna needs to shut down gracefully [antenna]



a year ago
10 months ago


(Reporter: willkg, Assigned: willkg)


Firefox Tracking Flags

(Not tracked)



(2 attachments)

Antenna should handle OS signals correctly. Particularly the "shut down, please" one should cause Antenna to stop handling incoming HTTP connections, but flush it's outgoing s3 queue if possible and then shut down cleanly.

This bug covers checking whether that's working and if not, implement/fix it.
Gunicorn's docs on signals is here:

I think that indicates that I have to add handling for SIGTERM to gracefully finish the s3 queue. I'll work on that.
I'm going to wait on implementing this. First off, it's nice, but I'm not convinced we actually need it. The common scenario is:

1. Antenna fails a health check or we're doing a deploy or something that causes the autoscaler to kill this node
2. The autoscaler takes the node out of rotation and waits for it to finish handling the HTTP connections it has
3. The autoscaler waits some grace period so Antenna can flush it's s3 queue
4. The autoscaler kills the node

I'm pretty sure at no point in that set of steps does something send a SIGTERM to nginx or gunicorn or the gunicorn worker processes.

If that's correct, then we don't need to add a bunch of code and testing to make sure Antenna behaves well.
Miles pointed out that if we set up gunicorn to kill Antenna process after max_requests requests, then it sends Antenna a SIGTERM and we'd want ANTENNA to comply in a well-behaved and orderly fashion.

Right now, we're not limiting process lifetime with max_requests, so we don't need it, but if we decide to do that because memory creep or other related long-running kinds of issues, we'll want to implement this.
So, Gunicorn starts up Gevent workers and that runs the Antenna app code. The Gevent workers handle various things related to the relationship between the workers and the parent process. Pretty sure this includes signals. I'd be more sure, but my brain has converted itself to applesauce.

There's a section of Gevent code that kicks in on "shutdown" and waits until all incoming HTTP connections have been handled and then shuts down. It knows whether there are outstanding incoming HTTP connections that are being handled by looking at a Pool it uses for keeping track of them all. That's great, except it doesn't account for the crashes that are in the s3 save queue.

With this bit of wisdom, I set out to modify Antenna to add the heartbeat coroutine to the incoming connection pool.

Since the heartbeat coroutine is in the incoming connection pool, it prevents the gevent worker from shutting down until the s3 save queue is cleared. Hooray!

Covered in PR:
Assignee: nobody → willkg
This landed and we pushed it to -stage. It passes some stuff to the logs when it kicks in. I'll keep an eye out for that text over the next couple of days.
I haven't seen it in the logs, yet. I'll wait to the end of the week. It's possible it's such an edge case scenario that it doesn't happen often or we're not doing the things to trigger it.
I still haven't seen any of the messages in the logs.

Specifically "work left to do" which indicates the Gunicorn process is shutting down and Antenna is like, "woah--not yet!"

If it's the case that Antenna is totally done, then we should see "Heartbeat stopped" since that indicates that all the requirements for Antenna to stay alive are done. I'm not seeing that one, either.

So either there's a bug and it's not working or Gunicorn is configured in such a way that it'll never kick in. Maybe in node shutdown, all the processes are SIGKILLd rather than SIGTERMd?
Summary: antenna needs to handle signals correctly [antenna] → antenna needs to shut down gracefully [antenna]
mozilla/antenna#189 doesn't fix the issue. I tried to "simulate" the circumstances locally, but can't seem to coerce Gunicorn into shutting down gracefully. I'm not really sure why. Maybe the Gunicorn parent process stops slurping stdout after it's issued a TERM to the subprocess?

This is worth looking into further, but shouldn't block moving forward.
Changing this so it doesn't block bug #1315258 anymore. This is still a curiosity, but we can get by without it.

If it turns out that the changes in mozilla/antenna#189 don't help, then we should remove those changes because they add some non-trivial complexity that isn't helping.
No longer blocks: 1315258
Switching Antenna bugs to Antenna component.
Component: General → Antenna
Blocks: 1356577
This landed in master, went to -dev and Miles pushed it to -stage. I looked at the grephost and it's working perfectly.

Marking this as FIXED.
Last Resolved: 10 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.