SUMO chief is returning an Internal Server Error



Infrastructure & Operations
WebOps: Community Platform
4 years ago


(Reporter: rrosario, Assigned: cturra)





Description

4 years ago

+++ This bug was initially created as a clone of Bug #932380 +++

+++ This bug was initially created as a clone of Bug #925080 +++

+++ This bug was initially created as a clone of Bug #917357 +++

When trying to deploy, chief is failing me with an Internal Server Error.

I didn't try prod. Obviously, this blocks us from deploying.

Comment 1

4 years ago
I restarted redis again to unblock you, I will cc chris to see what the status of his script is on here.
Flags: needinfo?(cturra)

Comment 2

4 years ago
Chief is very, very bad at logging anything it does (or attempts to do), so things like this are irritatingly difficult to track down. That said, based on the bugs in the clone list, poking Redis was a good place to start:

[root@supportadm.private.phx1 ~]# service redis-support-updater stop
Shutdown may take a while; redis needs to save the entire database
Shutting down redis-server: (error) ERR max number of clients reached

Something (likely Chief) appears to be opening connections to Redis and then not closing them, which would definitely cause things to stop behaving properly. In bug 932380 :cturra mentions that he has seen this error condition before and that he would apply some sort of script to help deal with it - so far I have not been able to track down exactly what he is referring to (needinfo set).
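For anyone hitting this later: a leak like this shows up in `redis-cli INFO clients` as `connected_clients` creeping up toward `maxclients`. A minimal sketch of the check - the `connected_clients` field name is real, but the canned INFO line and the numbers here are made up for illustration:

```shell
# Hypothetical sketch: spotting a client leak before Redis starts refusing
# connections. A canned line stands in for live `redis-cli INFO clients` output.
info_line="connected_clients:10000"
maxclients=10000   # Redis's configured client cap (example value)

# Strip the field name to get the current client count
connected=${info_line#connected_clients:}

if [ "$connected" -ge "$maxclients" ]; then
    # At this point new clients get "(error) ERR max number of clients reached"
    echo "client limit hit: $connected/$maxclients"
fi
```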

In any case, killing the Redis process and allowing the supervisor to restart it has, as expected, cleared the error condition.  In terms of preventing this from happening again, it is possible that :cturra's script may alleviate the symptoms, but since the issue is very likely with Chief itself, that is where we'd need to start looking.  That said, Chief development is basically nil, and is itself set to be replaced by Captain Shove[1] going forward.

I guess the real discussion is how and when we want to implement CapSho. :P


Comment 3

4 years ago
(In reply to Jason Crowe [:jd] from comment #1)
> I restarted redis again to unblock you, I will cc chris to see what the
> status of his script is on here.

looks like i hadn't applied my redis-memory-check script to the support cluster. it's now in place and actively keeping an eye on the redis process here. 

sorry about not getting this done sooner :r1cky!
Flags: needinfo?(cturra)

Comment 4

4 years ago
Since this is being worked on, dropping priority so it doesn't keep paging.
Severity: major → normal

Comment 5

4 years ago
Bumping up the priority because we are getting 500s again and can't deploy.
Severity: normal → major
Assignee: server-ops-webops → bburton

Comment 6

4 years ago
Severity: major → normal
The redis restart script needed a couple of tweaks; it's been restarted and you should be able to deploy.
Last Resolved: 4 years ago
Resolution: --- → FIXED

Comment 7

4 years ago
we saw this come back again today. working with :mythmon to test. initially, we were seeing redis die during many (all?) of the chief pushes. as a result, i turned off the redis-checker script and put the redis process in debug log mode. it hasn't happened since, so i am going to revert the log levels and review the redis-check script to see if that was the root cause.
Resolution: FIXED → ---

Comment 8

4 years ago
i found the root cause and have applied a patch. 

ultimately, it was a bug in my redis-check script. that script watches the memory consumed by redis and, if it's approaching the server's limits, restarts the redis process. what happened in this case is that it read the memory value, which was reported in KB, and interpreted it as MB - which made it look WAY over the allowable memory limit. so the redis process was stopped, but the script was then unable to start it again because the process name in the script was not correct. two failures :(
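for the record, the unit mix-up works out roughly like this (a hypothetical sketch with made-up numbers, not the actual redis-check script):

```shell
# Hypothetical sketch of the KB-vs-MB bug and its fix.
used_memory_kb=524288   # e.g. 512 MB of redis memory, reported in KB
limit_mb=1024           # server memory limit, expressed in MB

# Buggy comparison: treats the KB figure as if it were MB,
# so 524288 "MB" looks far over a 1024 MB limit -> spurious restart.
buggy_over_limit=$([ "$used_memory_kb" -gt "$limit_mb" ] && echo yes || echo no)

# Fixed comparison: convert KB to MB before comparing.
used_memory_mb=$(( used_memory_kb / 1024 ))
fixed_over_limit=$([ "$used_memory_mb" -gt "$limit_mb" ] && echo yes || echo no)

echo "buggy: $buggy_over_limit, fixed: $fixed_over_limit"
```

with these numbers the buggy check fires (524288 > 1024) while the corrected one correctly stays quiet (512 MB < 1024 MB).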

both have been addressed and tested. sorry for any inconvenience this may have caused!
Assignee: bburton → cturra
Last Resolved: 4 years ago
Resolution: --- → FIXED