Make it so that we don't end up with a whole bunch of codesearch instances lying around
Categories
(Webtools :: Searchfox, defect)
Tracking
(Not tracked)
People
(Reporter: kats, Assigned: kats)
References
Details
Attachments
(1 file)
10.69 KB, text/plain
We often seem to end up in a situation with a lot of codesearch instances running. This is bad because they eat up a lot of memory, and only one instance can bind to the port anyway, so most of those instances are useless. I looked into how hard it would be to make codesearch itself self-terminate if it can't bind to the port, but it seems hard; I filed https://github.com/livegrep/livegrep/issues/262 for that.
Another solution is to have the code here return the pid of the instance it starts up, and then, if we go to start a new instance (because we think the old one is not responding or whatever), actually kill the old pid before trying to start the new one. This seems like it should handle a wider class of errors than the first approach anyway.
Assignee
Comment 1•5 years ago
Actually we can just put the pid in the tree_data structure in that file, and have startup_codesearch kill the pid if there's already one. Should be a few lines of code, pretty simple.
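Roughly what I have in mind (a sketch only; everything except tree_data and startup_codesearch is a made-up name):

    import os
    import signal
    import subprocess

    def startup_codesearch(tree_name, tree_data):
        # If we already recorded a pid for this tree, kill it first so we
        # don't accumulate stale codesearch instances.
        old_pid = tree_data.get('codesearch_pid')
        if old_pid:
            try:
                os.kill(old_pid, signal.SIGKILL)
            except OSError:
                pass  # already gone
        # 'codesearch_args' is a placeholder for however we currently build
        # the codesearch command line.
        proc = subprocess.Popen(tree_data['codesearch_args'])
        tree_data['codesearch_pid'] = proc.pid
        return proc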
Assignee
Comment 2•5 years ago
Turns out multiple instances can bind to the same port and share the load. But we now have a mechanism to disable that; see the aforelinked github issue.
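For reference, the port-sharing behavior is what SO_REUSEPORT-style socket options give you (this is my understanding of the mechanism, not livegrep's actual code). A standalone Linux-only snippet showing two listeners binding the same port:

    import socket

    def make_listener(port=8082):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # With SO_REUSEPORT the kernel allows multiple sockets on the same
        # port and spreads incoming connections across them.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("127.0.0.1", port))
        s.listen()
        return s

    a = make_listener()
    b = make_listener()  # also succeeds; without SO_REUSEPORT this raises
                         # "Address already in use"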
Assignee
Comment 3•4 years ago
Looks like the migration to py3 didn't fix this magically (shoot!). I was investigating some 504 reports today and noticed we had a whole bunch of codesearch instances spawned at around 06:50 UTC. The router.log has things like:
2020-06-12 06:50:17.028688/pid=23914 - request(handled by 23915) /mozilla-central/search?case=true&path=1'"&q=&regexp=true
2020-06-12 06:50:17.030954/pid=23915 - QUERY file: "1\'\"",
2020-06-12 06:50:17.194915/pid=23921 - request(handled by 23922) /mozilla-central/search?case=1%00%C0%A7%C0%A2%252527%252522&path=1&q=the&regexp=true
2020-06-12 06:50:17.198508/pid=23922 - QUERY line: "the", file: "1", fold_case: true,
2020-06-12 06:50:17.309310/pid=23915 - Got exception: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1591944617.308874618","description":"Error received from peer ipv4:127.0.0.1:8082","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}"
>
2020-06-12 06:50:17.309383/pid=23915 - Starting codesearch
2020-06-12 06:50:17.310015/pid=23922 - Got exception: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1591944617.309642520","description":"Error received from peer ipv4:127.0.0.1:8082","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}"
>
2020-06-12 06:50:17.310105/pid=23922 - Starting codesearch
I think eventually we spawned so many of these that it basically locked up the web-server. Even after pointing the load balancer to the old web-server instance, I couldn't SSH in to the affected one and had to force a reboot via the web console.
Comment 4•4 years ago
Assignee
Comment 5•4 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)
> Actually we can just put the pid in the tree_data structure in that file, and have startup_codesearch kill the pid if there's already one. Should be a few lines of code, pretty simple.
This probably won't work, because the main router.py spawns child processes for each request, and so changes to tree_data won't get reflected back to other handler processes. Also, just killing the pid means that many handlers running in quick succession will all be stomping on each other's toes and killing the pids started by other handlers.
One possible approach might be to stop having the codesearch.search function try to restart the codesearch instance, and instead propagate the failure back to the parent router.py, which can attempt the restart. I don't know if that's feasible though, since the parent router.py will be stuck in serve_forever(). Regardless, we need some mechanism for the concurrently running handlers to coordinate so that only one of them tries to start the new instance and the rest just wait. Maybe a pidfile written to disk would work better.
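Sketching the pidfile idea (paths and function names here are made up): handler processes would contend for an exclusive lock on a per-tree file, the winner restarts codesearch and records the new pid, and everyone else just waits and retries their query.

    import fcntl

    def restart_codesearch_once(tree_name, do_restart):
        # do_restart() stands in for whatever actually spawns codesearch and
        # returns its pid; the lock file location is also a placeholder.
        lock_path = "/tmp/codesearch-%s.lock" % tree_name
        with open(lock_path, "a+") as f:
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return False  # another handler is already restarting; just wait
            pid = do_restart()
            f.seek(0)
            f.truncate()
            f.write(str(pid))
            return True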
Comment 6•4 years ago
Ah, yeah, it does seem like the re-spawning logic as it stands is more likely to cause the server to fall over than anything else, and it's structurally not trivial to fix. Also, I think we want to move everything in router.py into web-server.rs, so it'd be nice to not invest too much more effort into router.py.
It appears systemd already has support for restarting crashed services (tutorial-ish plus rules about handling repeated restarts, also service config docs). And there's even support for services managed by/associated with a user. This covers the segfault scenario identified in https://bugzilla.mozilla.org/show_bug.cgi?id=1667828#c5 but wouldn't cover the hang case.
Per this superuser.com answer and this one, systemd has a specific API, sd_notify, that livegrep could call; alternatively, it could periodically send heartbeats over the socket identified by the NOTIFY_SOCKET env var so that systemd can automatically restart it when it becomes unresponsive. Here's an article on making Go executables report activity to systemd, if the livegrep daemon part we use is written in Go (I'm confused about what "frontend" means for livegrep).
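To make the watchdog idea concrete, here's a minimal sketch of the notify protocol (not livegrep code; it assumes the unit is configured with Type=notify and a WatchdogSec= setting): the daemon sends READY=1 at startup and then periodic WATCHDOG=1 datagrams over the NOTIFY_SOCKET socket, and systemd restarts it if the heartbeats stop.

    import os
    import socket
    import time

    def notify(message: bytes) -> None:
        addr = os.environ.get("NOTIFY_SOCKET")
        if not addr:
            return  # not running under systemd with Type=notify
        if addr.startswith("@"):
            addr = "\0" + addr[1:]  # abstract-namespace socket
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.connect(addr)
            sock.sendall(message)

    notify(b"READY=1")  # startup finished
    while True:
        # ... do real work / check that the serving thread is still alive ...
        notify(b"WATCHDOG=1")  # heartbeat; must arrive more often than WatchdogSec
        time.sleep(10)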
There are also more user-level-ish things like monit, which has a somewhat bizarre config file syntax (apache example, docs for HTTP checks), that could restart things.
I would lean towards trying to use systemd, especially since our supported modes of operation are 1) a docker image we have complete control over and root access within, and 2) ubuntu server images we have complete control over and whose startup includes root access.
Assignee
Comment 7•4 years ago
I tried out a short-term fix for router.py by taking advantage of the --noreuseport option added in response to https://github.com/livegrep/livegrep/issues/262, and that seems to work. In summary, if we upgrade our version of livegrep and start codesearch with the --noreuseport option, then when there are multiple attempts to start a given codesearch instance, only the first will succeed and the rest will fail on startup. This seems to solve the problem of multiple codesearch instances all getting kicked off when the base instance dies and causing OOMs. Instead, some of the incoming requests will hit temporary errors, but after a few seconds the restarted codesearch will start handling things again.
Until we migrate the code to Rust this seems like an improvement.
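For the record, the router.py side of the change is basically just adding the flag when spawning codesearch; the sketch below uses a placeholder for how the command line actually gets built.

    import subprocess

    def build_codesearch_args(tree_data):
        # Placeholder: stands in for however router.py actually constructs
        # the codesearch command line for a tree.
        return tree_data["codesearch_args"]

    def startup_codesearch(tree_data):
        # With --noreuseport, if another handler already won the race to
        # (re)start codesearch, this child fails to bind the port and exits
        # on its own, leaving the existing instance to keep serving.
        args = build_codesearch_args(tree_data) + ["--noreuseport"]
        return subprocess.Popen(args)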
Comment 8•4 years ago
Sounds like a plan! Thanks!
Assignee
Comment 9•4 years ago
Assignee
Comment 10•4 years ago
Fix is deployed, hopefully that takes care of it.
Assignee
Updated•4 years ago