Closed Bug 1470702 Opened 7 years ago Closed 6 years ago

Add health check to processor

Categories

(Socorro :: Processor, enhancement, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: osmose, Unassigned)

References

(Blocks 1 open bug)

Details

The webapp has an HTTP endpoint that, when accessed, runs health checks and returns the status of the webapp. Ops uses the endpoint for monitoring. We'd like to add something similar to the processor that checks the health of a single node. The main thing I can think of that we'd want to check is connectivity to the storage backends that crashes are read from and written to. We may have other things we want the check to cover.

There are a few different options for exposing the health check:

- An HTTP endpoint that isn't open to the public.
- An open socket on a specific port.

Miles / Brian: Feel free to correct me or flesh out requirements for this.

I think this is a P2: we don't need it immediately but are interested in working on it ourselves soon.
If we do this, we should do Dockerflow web endpoints since that's the standard we're using for services. This was something we were doing with the Jansky rewrite (https://github.com/mozilla-services/jansky/issues/22). At the time (this was April 2017), we decided not to try to jam this into the existing processor because it's really tricky.

We could reduce the scope to an endpoint that just checks "is there a processor process running in this container?" That's not in Mike's list of health checks, but it is doable without rewriting the processor app in non-trivial ways, and it probably would have helped with the problem we had recently where the node came up, but the Docker container didn't.
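For reference, a minimal sketch of what Dockerflow-style endpoints could look like, using only the Python standard library. Dockerflow specifies /__heartbeat__ (checks backing services) and /__lbheartbeat__ (always 200 if the app is up), plus /__version__, which is omitted here. Everything else is an assumption for illustration: the function names, the response body shape, and the idea of passing checks in as plain callables. None of this is the processor's actual code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks(checks):
    """Run each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results


def heartbeat_response(checks):
    """Return (status_code, body) for a /__heartbeat__ request."""
    results = run_checks(checks)
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "error", "checks": results}
    return (200 if healthy else 500), body


def make_handler(checks):
    """Build a request handler class bound to the given checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/__lbheartbeat__":
                status, body = 200, {}
            elif self.path == "/__heartbeat__":
                status, body = heartbeat_response(checks)
            else:
                status, body = 404, {"error": "not found"}
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(checks, host="127.0.0.1", port=8000):
    """Serve the health endpoints until interrupted (host/port are placeholders)."""
    HTTPServer((host, port), make_handler(checks)).serve_forever()
```

A reduced-scope version would pass a single hypothetical check such as `{"processor": processor_is_running}`; storage and RabbitMQ connectivity checks could be added to the same dict later.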
There are two things we could already implement via Datadog:

1) A check that each processor instance has a running processor container
2) A check that each processor instance has processed > 0 crashes in the last N minutes

So I don't think a new endpoint just to tell us that the processor is running is worth the effort. An endpoint that checks connectivity to storage backends and RabbitMQ would be nice, but it's understandable if we want to put that off because the current architecture makes it hard.
Brian: Oh, I thought that while those things were possible to determine using datadog data, it was really hard and it was hard to tie into infrastructure scaling and such. Hence the need for an HTTP API endpoint. Is that not true?
I don't think it's too hard.

#1 we should be able to do with https://docs.datadoghq.com/integrations/process/#service-checks

#2 we can do with a monitor like "sum(last_10m):sum:processor.save_raw_and_processed{app:socorro,type:processor,env:prod} by {host}.as_count() < 1". To handle the time it takes new instances to provision when scaling up, we can add a delay so it doesn't start evaluating until an instance has been up for 10 minutes. Instances scaling down shouldn't be a problem, since Datadog will see they're gone and stop evaluating the monitor for them.

The only issue I see is that this is going to be noisy when there is a problem that affects all processor instances.
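To make the logic of check #2 concrete, here is a rough sketch in plain Python of what that monitor would evaluate per host: alert when a host has saved fewer than 1 crash in the window, but skip hosts that have been up for less than 10 minutes. This does not call Datadog; the function name, argument names, and input dicts are hypothetical stand-ins for the per-host metric sums and instance uptimes.

```python
def hosts_to_alert(processed_counts, uptime_minutes, min_uptime=10, threshold=1):
    """Return hosts whose processed count over the window is below threshold.

    processed_counts: {host: crashes saved in the last window} (assumed input)
    uptime_minutes:   {host: minutes since the instance came up} (assumed input)
    Hosts younger than min_uptime are skipped, mirroring the evaluation delay.
    Hosts missing from uptime_minutes are treated as gone and not evaluated.
    """
    alerts = []
    for host, uptime in sorted(uptime_minutes.items()):
        if uptime < min_uptime:
            continue  # still provisioning; do not evaluate yet
        if processed_counts.get(host, 0) < threshold:
            alerts.append(host)
    return alerts
```

If every host shows up in the result at once, that is the noisy case mentioned above: the likely cause is a shared dependency rather than any single node.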
If that's the case, I'd love to do this with the existing stuff. While I haven't spent quality time figuring out whether it's possible or not and how to do it, shoehorning a web server into the processor right now with its current architecture feels hard.
Making this part of the processor rewrite and bumping it down to P3.
Blocks: 907277
Priority: P2 → P3

At some point in the last 3 months, I read about using honcho in the Docker container to run two processes:

  1. the process that's doing the work
  2. a webapp process that implements the health-check endpoints by looking at the state of the process doing the work

I think that's totally doable. One thing I haven't looked into is what happens to stdout for the processes honcho is running. Does it pass all stdout for both processes to the Docker host? We'd need it to do that.

I like this model. We could do the same with the crontabber container.
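One possible way the second process could "look at the state" of the worker is a heartbeat file: the worker touches a file after each unit of work, and the health-check webapp reports unhealthy if the file is missing or stale. This is only a sketch of that idea. The heartbeat-file mechanism, the function names, and the 60-second default are all assumptions, not something the processor or honcho provides.

```python
import os
import time


def touch_heartbeat(path):
    """Worker side: create the file if needed and update its mtime to now."""
    with open(path, "a"):
        os.utime(path, None)


def worker_is_healthy(path, max_age_seconds=60, now=None):
    """Health-check side: True if the heartbeat file exists and is fresh."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # no heartbeat file yet, or it is unreadable
    return age <= max_age_seconds
```

The webapp's heartbeat endpoint would then return 200 or 5xx based on `worker_is_healthy(...)`. The same pair of functions would work for the crontabber container, with a longer `max_age_seconds` to match its schedule.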

Making this a P2 to look into.

Type: task → enhancement
Priority: P3 → P2

I talked to Brian about this at the all hands. We decided not to do this. When we move to GCP, things will run differently and we won't need this anymore.

Marking as WONTFIX.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX