Closed Bug 1630646 Opened 5 years ago Closed 4 years ago

Make it so that we don't end up with a whole bunch of codesearch instances lying around

Categories

(Webtools :: Searchfox, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kats, Assigned: kats)

References

Details

Attachments

(1 file)

We seem to often end up in a situation with a lot of codesearch instances running. This seems bad because they eat up a lot of memory, and only one instance can bind to a port anyway, so most of those instances are useless. I looked to see how hard it would be to make codesearch itself self-terminate if it can't bind to the port, but it seems hard. I filed https://github.com/livegrep/livegrep/issues/262 for that.

Another solution is to return the pid of the instance we do start up from here, and then if we go to start a new instance (because we think the old one is not responding or whatever), actually kill the old pid before trying to start the new one. This should handle a wider class of errors than the first approach anyway.

Actually we can just put the pid in the tree_data structure in that file, and have startup_codesearch kill the pid if there's already one. Should be a few lines of code, pretty simple.
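To make that concrete, here is a minimal Python sketch of the idea; tree_data, the tree name, and the codesearch command line are placeholders for illustration, not the actual router.py structures:

import os
import signal
import subprocess

tree_data = {}  # placeholder for router.py's per-tree state

def startup_codesearch(tree_name, codesearch_cmd):
    # Kill any instance we previously started for this tree.
    old_pid = tree_data.get(tree_name, {}).get('codesearch_pid')
    if old_pid is not None:
        try:
            os.kill(old_pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # it already exited
    # Start the new instance and remember its pid for next time.
    proc = subprocess.Popen(codesearch_cmd)
    tree_data.setdefault(tree_name, {})['codesearch_pid'] = proc.pid
    return proc.pid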

Turns out multiple instances can bind to the same port and share the load. But we now have a mechanism to disable that, see the aforelinked github issue.
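For reference, that port-sharing behaviour comes from SO_REUSEPORT; a minimal Python demonstration (not livegrep code, and the port is just the one from the logs below) of why a second instance's bind doesn't fail:

import socket

def make_listener(port=8082):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With SO_REUSEPORT set (Linux), several processes can bind the same
    # port and the kernel spreads incoming connections across them.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(('127.0.0.1', port))
    s.listen()
    return s

first = make_listener()
second = make_listener()  # succeeds; without SO_REUSEPORT this would raise
                          # OSError: [Errno 98] Address already in use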

Looks like the migration to py3 didn't fix this magically (shoot!). I was investigating some 504 reports today and noticed we had a whole bunch of codesearch instances spawned at around 06:50 UTC. The router.log has things like:

2020-06-12 06:50:17.028688/pid=23914 - request(handled by 23915) /mozilla-central/search?case=true&path=1'"&q=&regexp=true
2020-06-12 06:50:17.030954/pid=23915 - QUERY file: "1\'\"",
2020-06-12 06:50:17.194915/pid=23921 - request(handled by 23922) /mozilla-central/search?case=1%00%C0%A7%C0%A2%252527%252522&path=1&q=the&regexp=true
2020-06-12 06:50:17.198508/pid=23922 - QUERY line: "the", file: "1", fold_case: true,
2020-06-12 06:50:17.309310/pid=23915 - Got exception: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1591944617.308874618","description":"Error received from peer ipv4:127.0.0.1:8082","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}"
>
2020-06-12 06:50:17.309383/pid=23915 - Starting codesearch
2020-06-12 06:50:17.310015/pid=23922 - Got exception: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Connection reset by peer"
        debug_error_string = "{"created":"@1591944617.309642520","description":"Error received from peer ipv4:127.0.0.1:8082","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}"
>
2020-06-12 06:50:17.310105/pid=23922 - Starting codesearch

I think eventually we spawned so many of these that it basically locked up the web-server. Even after pointing the load balancer to the old web-server instance, I couldn't SSH in to the affected one and had to force a reboot via the web console.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)

Actually we can just put the pid in the tree_data structure in that file, and have startup_codesearch kill the pid if there's already one. Should be a few lines of code, pretty simple.

This probably won't work, because the main router.py spawns child processes for each request, so changes to tree_data won't get reflected back to other handler processes. Also, just killing the pid means that many handlers running in quick succession will all be stomping on each other's toes and killing the pids started by other handlers.

One possible approach might be to stop having the codesearch.search function try to restart the codesearch instance, and instead propagate the failure back to the parent router.py, which can attempt the restart. I don't know if that's feasible though, since the parent router.py will be stuck in serve_forever(). Regardless, we need some mechanism for the concurrently-running handlers to coordinate and make sure only one of them tries to start the new instance while all the others just wait. Maybe a pidfile written to disk would work better.
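A rough sketch of that pidfile/lockfile coordination, assuming the restart stays in the handler processes; the lock path and start command are illustrative, not anything that actually landed:

import fcntl
import subprocess
import time

LOCK_PATH = '/tmp/codesearch-restart.lock'  # illustrative path

def restart_codesearch_once(start_cmd, wait_secs=5.0):
    # Only one of the concurrently-failing handlers should actually
    # respawn codesearch; the rest just wait for it to come back.
    with open(LOCK_PATH, 'w') as lock_file:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another handler won the race and is restarting it.
            time.sleep(wait_secs)
            return
        subprocess.Popen(start_cmd)
        time.sleep(wait_secs)  # hold the lock while it starts up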

Ah, yeah, it does seem like the re-spawning logic as it stands is more likely to cause the server to fall over than anything else, and it's structurally not trivial to fix. Also, I think we want to move everything in router.py into web-server.rs, so it'd be nice to not invest too much more effort into router.py.

It appears systemd already has support for restarting crashed services (there's a tutorial-ish writeup plus rules about handling repeated restarts, as well as the service config docs), and there's even support for services managed by/associated with a user. This covers the segfault scenario identified in https://bugzilla.mozilla.org/show_bug.cgi?id=1667828#c5 but wouldn't cover the hang case.
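For illustration, a sketch of the relevant unit directives; the unit name, paths, command line, and thresholds are made up, but Restart=, RestartSec=, StartLimitIntervalSec=, and StartLimitBurst= are the systemd knobs in question:

# codesearch.service (hypothetical unit; ExecStart line is a placeholder)
[Unit]
Description=livegrep codesearch backend
# Give up if it crash-loops: more than 5 restarts within 60 seconds.
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/codesearch -grpc localhost:8082 /path/to/config.json
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target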

Per this superuser.com answer and this one, systemd has a specific API, sd_notify, that livegrep could call; alternately, it could periodically send heartbeats over the socket identified by the NOTIFY_SOCKET env var so that systemd can automatically restart it when it becomes unresponsive. Here's an article on making go executables report activity to systemd, if the livegrep daemon part we use is written in go (I'm confused about what "frontend" means for livegrep).
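To show the shape of that protocol, a minimal Python sketch of a watchdog heartbeat over NOTIFY_SOCKET; this is the generic sd_notify datagram protocol rather than livegrep code, and it assumes the unit sets Type=notify and WatchdogSec=:

import os
import socket
import time

def sd_notify(message):
    # Send one datagram to the socket systemd names in NOTIFY_SOCKET.
    addr = os.environ.get('NOTIFY_SOCKET')
    if not addr:
        return  # not running under systemd with notify support
    if addr.startswith('@'):
        addr = '\0' + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.sendall(message.encode())

sd_notify('READY=1')
while True:
    sd_notify('WATCHDOG=1')  # if these stop, systemd can restart the service
    time.sleep(10)           # must be well under WatchdogSec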

There are also more user-level-ish things like monit, which has a somewhat bizarre config file syntax (see its apache example / docs for HTTP checks), that could restart things.

I would lean towards trying to use systemd, especially since our supported modes of operation are 1) a docker image we have complete control over and root access within, and 2) ubuntu server images we have complete control over and whose startup includes root access.

I tried out a short-term fix for router.py by taking advantage of the --noreuseport option added in response to https://github.com/livegrep/livegrep/issues/262, and that seems to work. In summary, if we upgrade our version of livegrep and start codesearch with the --noreuseport option, then when there are multiple attempts to start a given codesearch instance, only the first will succeed and the rest will fail on startup. This seems to solve the problem of multiple codesearch instances all getting kicked off when the base instance dies and causing OOMs. Instead, some of the requests that come in will hit temporary errors, but after a few seconds the restarted codesearch will start handling things again.

Until we migrate the code to rust this seems like an improvement.
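A rough Python sketch of the resulting behaviour; this is not the actual router.py change, and do_query and start_cmd are placeholders:

import subprocess
import time

def search_with_restart(do_query, start_cmd, retries=3, delay=2.0):
    # do_query() issues the gRPC search; start_cmd launches codesearch
    # with --noreuseport, so if several handlers race to restart it only
    # the first bind succeeds and the losing processes exit on their own.
    for _ in range(retries):
        try:
            return do_query()
        except Exception:  # e.g. grpc StatusCode.UNAVAILABLE
            subprocess.Popen(start_cmd + ['--noreuseport'])
            time.sleep(delay)  # give the surviving instance time to load
    return do_query()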

Assignee: nobody → kats

Sounds like a plan! Thanks!

Fix is deployed, hopefully that takes care of it.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED