939797 - SUMO chief is returning an Internal Server Error

Reporter

Description

12 years ago

(Again)

+++ This bug was initially created as a clone of Bug #932380 +++
+++ This bug was initially created as a clone of Bug #925080 +++
+++ This bug was initially created as a clone of Bug #917357 +++

When trying to deploy, chief is failing me with an Internal Server Error.

http://supportadm.private.phx1.mozilla.com/chief/support.stage

I didn't try prod. Obviously, this blocks us from deploying.

Jason Crowe [:jd]

Comment 1

12 years ago

I restarted redis again to unblock you, I will cc chris to see what the status of his script is on here.

Flags: needinfo?(cturra)

Daniel Maher [:phrawzty]

Comment 2

12 years ago

Chief is very, very bad at logging anything it does (or attempts to do), so things like this are irritatingly difficult to track down. That said, based on the bugs indicated in the clone list, poking Redis was a good place to start:

    [root@supportadm.private.phx1 ~]# service redis-support-updater stop
    Shutdown may take a while; redis needs to save the entire database
    Shutting down redis-server: (error) ERR max number of clients reached
    [FAILED]

Something (likely Chief) appears to be opening connections to Redis, then not closing them, which would definitely cause things to stop behaving properly.

In bug 932380 :cturra mentions that he has seen this error condition before, and that he would apply some sort of script to help deal with the situation - so far I have not been able to track down exactly what he is referring to (needinfo set). In any case, killing the Redis process and allowing the supervisor to restart it has, as expected, cleared the error condition.

In terms of preventing this from happening again, it is possible that :cturra's script may alleviate the symptoms, but since the issue is very likely with Chief itself, that is where we'd need to start looking. That said, Chief development is basically nil, and is itself set to be replaced by Captain Shove[1] going forward. I guess the real discussion is how and when we want to implement CapSho. :P

[1] https://wiki.mozilla.org/Websites/Captain_Shove
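For anyone wanting to catch the "max number of clients reached" condition before it blocks a deploy, here is a minimal Python sketch. It is not the script referred to in this bug. It assumes you can capture the text of `redis-cli INFO clients` and that you know the configured `maxclients` value (for example from `CONFIG GET maxclients`); the function names `parse_info` and `client_headroom` are made up for illustration.

```python
def parse_info(info_text):
    """Parse 'key:value' lines from Redis INFO output.

    Blank lines and '# Section' headers are skipped.
    """
    fields = {}
    for raw in info_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def client_headroom(info_text, maxclients):
    """Return how many more client connections Redis can accept.

    maxclients is supplied by the caller (hypothetical parameter); this
    sketch does not query Redis itself.
    """
    connected = int(parse_info(info_text)["connected_clients"])
    return maxclients - connected


if __name__ == "__main__":
    # Sample text in the shape of the INFO "Clients" section.
    sample = "# Clients\r\nconnected_clients:9998\r\nblocked_clients:0\r\n"
    print(client_headroom(sample, 10000))
```

A check like this only helps if it runs while connections are still available, since a monitoring client is itself refused once the limit is hit.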

Chris Turra [:cturra]

Assignee

Comment 3

12 years ago

(In reply to Jason Crowe [:jd] from comment #1)
> I restarted redis again to unblock you, I will cc chris to see what the
> status of his script is on here.

looks like i hadn't applied my redis-memory-check script to the support cluster. it's now in place and actively keeping an eye on the redis process here. sorry about not getting this done sooner :r1cky!

Flags: needinfo?(cturra)

Peter Radcliffe [:pir]

Comment 4

12 years ago

Since this is being worked on, I'm dropping the priority so it doesn't keep paging.

Severity: major → normal

Ricky Rosario [:rrosario, :r1cky]

Reporter

Comment 5

12 years ago

Bumping up the priority because we are getting 500s again and can't deploy.

Severity: normal → major

Brandon Burton [:solarce]

Updated

12 years ago

Assignee: server-ops-webops → bburton

Severity: major → normal

Brandon Burton [:solarce]

Comment 6

12 years ago

The redis restart script needed a couple of tweaks. Redis has been restarted and you should be able to deploy.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Chris Turra [:cturra]

Assignee

Comment 7

12 years ago

we saw this come back again today. working with :mythmon to test. initially, we were seeing redis die during many (all?) of the chief pushes. as a result, i turned off the redis-checker script and put the redis process in debug log mode. it hasn't happened since, so i am going to revert the log levels and review the redis-check script to see if that was the root cause.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Chris Turra [:cturra]

Assignee

Comment 8

12 years ago

i found the root cause and have applied a patch. ultimately, it was a bug in my redis-check script. that script looks at the memory consumed by redis and, if it's approaching the server limits, restarts the redis process.

what happened in this case is that it read the memory value, which is in kb, and interpreted it as mb, which would have been WAY over the allowable memory limit. so the redis process was stopped, but the script was not able to start it again because the process name in the script was not correct. two failures :(

both have been addressed and tested. sorry for any inconvenience this may have caused!
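To make the kb-versus-mb failure concrete, here is a minimal Python sketch of a memory check that carries the unit along with the number instead of assuming it. This is not the actual redis-check script, whose source is not shown in this bug. It assumes the memory figure comes from a `VmRSS:` line in `/proc/<pid>/status`, and the 0.9 threshold and all function names are hypothetical.

```python
import re

# Byte multipliers for the unit suffixes this sketch understands (binary units).
UNIT_BYTES = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_memory(text):
    """Convert a string such as '204800 kB' into bytes.

    A missing or unknown unit raises ValueError rather than being guessed.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"expected '<number> <unit>', got {text!r}")
    value, unit = int(match.group(1)), match.group(2).lower()
    if unit not in UNIT_BYTES:
        raise ValueError(f"unknown unit {unit!r}")
    return value * UNIT_BYTES[unit]


def rss_from_proc_status(status_text):
    """Return the VmRSS figure from /proc/<pid>/status content, in bytes."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return parse_memory(line.split(":", 1)[1])
    raise ValueError("no VmRSS line found")


def should_restart(used_bytes, limit_bytes, threshold=0.9):
    """True when usage has reached the given fraction of the limit."""
    return used_bytes >= threshold * limit_bytes


if __name__ == "__main__":
    sample = "Name:\tredis-server\nVmRSS:\t  204800 kB\n"
    used = rss_from_proc_status(sample)
    print(used, should_restart(used, parse_memory("4 GB")))
```

With the sample above, 204800 kB is 200 MiB, well under a 4 GB limit. The same number read as mb would be roughly 200 GiB and would trigger a restart, which matches the failure described in this comment.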

Assignee: bburton → cturra

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

BMO Automation

Updated

7 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Tracking

(Not tracked)

People

(Reporter: rrosario, Assigned: cturra)

Security

(public)
