Closed Bug 905587 Opened 6 years ago Closed 6 years ago

Investigate redis hangs

Categories

(Infrastructure & Operations :: CIDuty, task, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Unassigned)

References

Details

Attachments

(3 files, 5 obsolete files)

Bug 905554 and bug 898739 are recent examples of redis01.build.m.o becoming unresponsive, we need to get to the bottom of this and fix.
Timeline for bug 905554:

Last successful background save writing to /var/log/redis/redis.log
[16062] 14 Aug 22:34:00 * Background saving terminated with success

reporter-4hr.log on buildapi01:
Wed Aug 14 22:39:01 -0700
  File "/home/buildapi/src/buildapi/scripts/reporter.py", line 356, in <module>
    report = build_report(R, session, scheduler_db_engine, starttime, endtime)
...
sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away') 'SELECT masters.id AS masters_id, masters.url AS masters_url, masters.name AS masters_name \nFROM masters

strace for the redis process was looping over:
gettimeofday({1376559180, 294528}, NULL) = 0
gettimeofday({1376559180, 294567}, NULL) = 0
epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 10240, 73) = 1
accept(4, 0x7fff4849af90, [12199322303221202960]) = -1 EMFILE (Too many open files)
open("/var/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EMFILE (Too many open files)

https://github.com/antirez/redis/issues/246 says this was fixed in Redis 2.6, we have 2.4.x.
Amy, could you please take a look at the KVM hosts for redis01.build.scl1 and buildapi01.build.scl1 to see if there is anything logged around Aug 14 22:39 ? I suspect either KVM or the network burped around then.
Flags: needinfo?(arich)
There were a large number of connections from signing servers, many more than the usual 20 or so. The breakdown of the number of connections is:

252  mac-signing1.srv.releng.scl3
258  mac-signing2.srv.releng.scl3
221  signing4.srv.releng.scl3
210  signing5.srv.releng.scl3
206  signing6.srv.releng.scl3

  3  mac-signing3.build.scl1
  3  mac-signing4.build.scl1
  6  buildapi01.build.scl1

Taken with the 'MySQL server has gone away' for buildapi01.build.scl1 ---> buildbot-ro-vip.db.scl3, this seems like the scl1-scl3 link had a glitch.
Depends on: 905616
buildapi and redis are not on the same primary kvm servers, though they do share a secondary node.  I don't see anything obvious in the logs on any of the three machines.  The only anomaly around that time was one warning from rngd about a failed block on the primary server for redis (which shouldn't have had anything to do with this).
Flags: needinfo?(arich)
catlee, I'm thinking we should upgrade to 2.6.x. Do you recall where the redis package came from when redis01 was set up ? AFAICT Centos5 doesn't have include it so perhaps a manual compile, but no spec file in the repo. 

See also 
* bug 735293 (make redis less spof)
* bug 735252 (signing servers can hang on connection to redis)
* bug 863268 (migrate buildapi01 and redis01 off kvm)
Flags: needinfo?(catlee)
I don't remember where redis on redis01 came from...We're certainly not using anything fancy from it, so upgrading should be fine.

Unfortunately migrating to a new redis host will require a downtime I think, since we only support a single redis host ATM.

Could we split up the redis instances to have one used for buildapi and another used for signing?
Flags: needinfo?(catlee)
Bug 919848 was another instance of this.
since we are putting instances in this bug, please see the list below

bug 912428
bug 905554

Just so we are all aware, when builds-4hr.js.gz doesnt get updated due to a hang the sheriffs _will_ close _all_ the trees since they can't trust any results in TBPL. We then wait for Releng to fix the issue which can be some time if you arent around :nthomas
RelEng HOWTO:
 connect to:   ssh root@redis01.build.mozilla.org
 diagnosis:    1, get the pid for redis
               2, ls /proc/<pid>/fd | wc -l
               3, If it's ~1000 it's this bug
 resolution:   service redis restart
 confirmation: 1, telnet localhost 6379
               2, say 'MONITOR'
               3, should see a lot of lines fly past once builds-4hrs generation gets going again
               4, if it doesn't root@buildapi01.build.mozilla.org and look at buildapi processes

Deassigning, because I'm not going to have cycles to work on this until mid October between EOQ, PTO, and Summit.
Assignee: nthomas → nobody
Blocks: 926246
catlee suggested a @weekly cron job to restart the redis service, on the basis that we're gradually accreting file descriptors until we get over 1024 and the service fails to get a new one. That seems likely, as I logged an increase in open connections from 202 to 237 in the 11 hours up until now. The signing servers have many connections open, at least according to redis01. eg netstat on redis01 says signing4 has 42 ESTABLISHED connections, while the same on signing4 says only 3. 

We have v2.4.5 of redis-py on the signing-servers and buildapi01, and 
 https://github.com/andymccurdy/redis-py/blob/master/CHANGES
has some references to handling connections better, so we could try that. Or the redis server upgrade I mentioned earlier.
See Also: → 919848
grabbing for patch level fix to improve operations
Assignee: nobody → hwine
Status: NEW → ASSIGNED
Attached file crontab from redis01
PATCH crontab -- no prior crontab
PATCH - cronjob to restart redis weekly to avoid FD leak. Output emailed to release@, pages hwine if anything goes wrong
PATCH -- verify redis running every hour. Attempt restart if not. Email release@ if restart, and page hwine if anything unexpected
Bubblegum now in place -- please remove when doing the real fix.
Assignee: hwine → nobody
Status: ASSIGNED → NEW
fixed syntax error when multiple redis-server processes present

Still emails on any error, pages hwine on outage
Attachment #819219 - Attachment is obsolete: true
fixed syntax error when multiple redis-server processes present

Still emails on any error, pages hwine on outage
Attachment #819220 - Attachment is obsolete: true
tweaked to syntax error & wait for new process to create pid file
Attachment #819893 - Attachment is obsolete: true
So, sometimes redis has 2 processes running, and that confuses the script. And, when it's running 2 processes, restart doesn't work. So don't even try anymore, just page.
Attachment #819894 - Attachment is obsolete: true
Comment on attachment 823601 [details]
hourly_check -- cronjob to ensure redis is running

Nagios alerts now checking for redis-server running:
    Notification Type: PROBLEM

    Service: procs - redis-server
    Host: redis01.build.scl1.mozilla.com
    Address: 10.12.48.24
    State: CRITICAL

    Date/Time: 10-28-2013 14:09:21

    Additional Info:
    PROCS CRITICAL: 0 processes with regex args redis-server

This band-aid can be removed.
Attachment #823601 - Attachment is obsolete: true
The weekly restart still does not function properly, leading to bug 942545.

Removing the restart. Will find a reasonable time to introduce a manual restart.
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Did a manual restart just now; there were 420 file descriptors open.
When I had a look, it seemed that most of the open connections were from stale TCP sessions.  I'm guessing that the server does not set SO_KEEPALIVE.  If it's possible to hack that in somehow, it could probably avoid the need for restarts.
We could also check if the python-redis used on the signing servers (v2.4.5) has a known bugs, as it's always those machines that cause the accumulation of connections. The ones from buildapi01 get torn down properly, but then reporter processes exit completely there too.
Restarted manually - 479 connections open prior to that, last restart Dec 3.
Redis is gone.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.