Closed
Bug 905587
Opened 11 years ago
Closed 10 years ago
Investigate redis hangs
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: nthomas, Unassigned)
References
Details
Attachments
(3 files, 5 obsolete files)
Bug 905554 and bug 898739 are recent examples of redis01.build.m.o becoming unresponsive; we need to get to the bottom of this and fix it.
Comment 1 (Reporter) • 11 years ago
Timeline for bug 905554:

Last successful background save, writing to /var/log/redis/redis.log:
[16062] 14 Aug 22:34:00 * Background saving terminated with success

reporter-4hr.log on buildapi01, Wed Aug 14 22:39:01 -0700:
File "/home/buildapi/src/buildapi/scripts/reporter.py", line 356, in <module>
    report = build_report(R, session, scheduler_db_engine, starttime, endtime)
...
sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away') 'SELECT masters.id AS masters_id, masters.url AS masters_url, masters.name AS masters_name \nFROM masters

strace for the redis process was looping over:
gettimeofday({1376559180, 294528}, NULL) = 0
gettimeofday({1376559180, 294567}, NULL) = 0
epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 10240, 73) = 1
accept(4, 0x7fff4849af90, [12199322303221202960]) = -1 EMFILE (Too many open files)
open("/var/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EMFILE (Too many open files)

https://github.com/antirez/redis/issues/246 says this was fixed in Redis 2.6; we have 2.4.x.
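The EMFILE failures in the strace above mean the process hit its open-file limit (1024 by default). A minimal sketch (Linux-only, stdlib) of checking a process's open-fd count against its soft limit via /proc; pointing it at the redis-server pid instead of the current process is left to the operator:

```python
import os

def fd_usage(pid):
    """Return (open_fds, soft_limit) for a pid, read from /proc."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            # Line looks like: "Max open files  1024  4096  files"
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])
                break
    return open_fds, soft_limit

if __name__ == "__main__":
    pid = os.getpid()  # in practice: the redis-server pid
    fds, limit = fd_usage(pid)
    print(f"pid {pid}: {fds} fds open (soft limit {limit})")
```

When the first number approaches the second, accept() and open() start returning EMFILE, which matches the strace loop above.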
Comment 2 (Reporter) • 11 years ago
Amy, could you please take a look at the KVM hosts for redis01.build.scl1 and buildapi01.build.scl1 to see if there is anything logged around Aug 14 22:39? I suspect either KVM or the network burped around then.
Flags: needinfo?(arich)
Comment 3 (Reporter) • 11 years ago
There were a large number of connections from signing servers, many more than the usual 20 or so. The breakdown of the number of connections is:

252 mac-signing1.srv.releng.scl3
258 mac-signing2.srv.releng.scl3
221 signing4.srv.releng.scl3
210 signing5.srv.releng.scl3
206 signing6.srv.releng.scl3
  3 mac-signing3.build.scl1
  3 mac-signing4.build.scl1
  6 buildapi01.build.scl1

Taken together with the 'MySQL server has gone away' for buildapi01.build.scl1 ---> buildbot-ro-vip.db.scl3, this seems like the scl1-scl3 link had a glitch.
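A sketch of how a per-host breakdown like the one above can be produced: count the remote host of each established connection to the redis port (6379, redis's default, is an assumption). The sample lines below stand in for real `netstat -tn` output:

```python
from collections import Counter

def breakdown(netstat_lines, port=6379):
    """Count ESTABLISHED connections to local :port, keyed by remote host."""
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # netstat -tn layout: proto recv-q send-q local remote state
        if len(fields) >= 6 and fields[3].endswith(f":{port}") \
                and fields[5] == "ESTABLISHED":
            remote_host = fields[4].rsplit(":", 1)[0]
            counts[remote_host] += 1
    return counts

# Hypothetical sample standing in for `netstat -tn` output on redis01:
sample = [
    "tcp 0 0 10.12.48.24:6379 10.26.48.10:40001 ESTABLISHED",
    "tcp 0 0 10.12.48.24:6379 10.26.48.10:40002 ESTABLISHED",
    "tcp 0 0 10.12.48.24:6379 10.26.48.11:40003 ESTABLISHED",
]
for host, n in breakdown(sample).most_common():
    print(n, host)
```

In practice you'd resolve the remote IPs back to the signing-server hostnames to get the table above.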
Comment 4 • 11 years ago
buildapi and redis are not on the same primary kvm servers, though they do share a secondary node. I don't see anything obvious in the logs on any of the three machines. The only anomaly around that time was one warning from rngd about a failed block on the primary server for redis (which shouldn't have had anything to do with this).
Flags: needinfo?(arich)
Comment 5 (Reporter) • 11 years ago
catlee, I'm thinking we should upgrade to 2.6.x. Do you recall where the redis package came from when redis01 was set up? AFAICT CentOS 5 doesn't include it, so perhaps it was a manual compile, but there's no spec file in the repo. See also:
* bug 735293 (make redis less of a SPOF)
* bug 735252 (signing servers can hang on connection to redis)
* bug 863268 (migrate buildapi01 and redis01 off KVM)
Flags: needinfo?(catlee)
Comment 6 • 11 years ago
I don't remember where redis on redis01 came from... We're certainly not using anything fancy from it, so upgrading should be fine. Unfortunately, migrating to a new redis host will require a downtime, I think, since we only support a single redis host ATM. Could we split up the redis instances to have one used for buildapi and another used for signing?
Flags: needinfo?(catlee)
Comment 7 (Reporter) • 11 years ago
Bug 919848 was another instance of this.
Comment 8 • 11 years ago
Since we are collecting instances in this bug, please see the list below:
bug 912428
bug 905554

Just so we are all aware: when builds-4hr.js.gz doesn't get updated due to a hang, the sheriffs _will_ close _all_ the trees, since they can't trust any results in TBPL. We then wait for RelEng to fix the issue, which can be some time if you aren't around, :nthomas.
Comment 9 (Reporter) • 11 years ago
RelEng HOWTO:

Connect:
ssh root@redis01.build.mozilla.org

Diagnosis:
1. Get the pid for redis.
2. ls /proc/<pid>/fd | wc -l
3. If it's ~1000, it's this bug.

Resolution:
service redis restart

Confirmation:
1. telnet localhost 6379
2. Say 'MONITOR'.
3. You should see a lot of lines fly past once builds-4hr generation gets going again.
4. If it doesn't, ssh root@buildapi01.build.mozilla.org and look at the buildapi processes.

Deassigning, because I'm not going to have cycles to work on this until mid October between EOQ, PTO, and Summit.
Assignee: nthomas → nobody
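The diagnosis and resolution steps above can be sketched as a small check script. The ~1000 threshold approximates the 1024 default fd limit; the restart is only printed here (a dry run) rather than actually shelling out to `service redis restart`:

```python
import os

def needs_restart(pid, threshold=1000):
    """True once the process has accreted enough fds to hit this bug."""
    return len(os.listdir(f"/proc/{pid}/fd")) > threshold

if __name__ == "__main__":
    pid = os.getpid()  # in practice: the redis pid from diagnosis step 1
    if needs_restart(pid):
        print(f"pid {pid}: over threshold -- would run: service redis restart")
    else:
        print(f"pid {pid}: under threshold, no action needed")
```

Confirmation afterwards is the same as above: connect to port 6379 and issue MONITOR to watch traffic resume.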
Comment 10 (Reporter) • 11 years ago
catlee suggested a @weekly cron job to restart the redis service, on the basis that we're gradually accreting file descriptors until we get over 1024 and the service fails to get a new one. That seems likely, as I logged an increase in open connections from 202 to 237 in the 11 hours up until now.

The signing servers have many connections open, at least according to redis01. E.g. netstat on redis01 says signing4 has 42 ESTABLISHED connections, while the same on signing4 says only 3.

We have v2.4.5 of redis-py on the signing servers and buildapi01, and https://github.com/andymccurdy/redis-py/blob/master/CHANGES has some references to handling connections better, so we could try that. Or the redis server upgrade I mentioned earlier.
Comment 11 • 11 years ago
Grabbing this bug for a patch-level fix to improve operations.
Assignee: nobody → hwine
Status: NEW → ASSIGNED
Comment 12 • 11 years ago
PATCH crontab -- no prior crontab
Comment 13 • 11 years ago
PATCH - cronjob to restart redis weekly to avoid the FD leak. Output is emailed to release@; pages hwine if anything goes wrong.
Comment 14 • 11 years ago
PATCH -- verify redis is running every hour, and attempt a restart if not. Emails release@ on restart, and pages hwine if anything unexpected happens.
Comment 15 • 11 years ago
Bubblegum now in place -- please remove when doing the real fix.
Assignee: hwine → nobody
Status: ASSIGNED → NEW
Comment 16 • 11 years ago
Fixed a syntax error when multiple redis-server processes are present. Still emails on any error; pages hwine on outage.
Attachment #819219 -
Attachment is obsolete: true
Comment 17 • 11 years ago
Fixed a syntax error when multiple redis-server processes are present. Still emails on any error; pages hwine on outage.
Attachment #819220 -
Attachment is obsolete: true
Comment 18 • 11 years ago
Tweaked to fix the syntax error, and to wait for the new process to create its pid file.
Attachment #819893 -
Attachment is obsolete: true
Comment 19 • 11 years ago
So, sometimes redis has two processes running, and that confuses the script. And when it's running two processes, a restart doesn't work. So don't even try anymore; just page.
Attachment #819894 -
Attachment is obsolete: true
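The ambiguity described in comment 19 comes down to a pid lookup returning multiple matches. A hypothetical sketch of detecting that case by scanning /proc for matching command lines; the refuse-and-page behavior mirrors what the script was changed to do:

```python
import os

def pids_matching(name):
    """Return pids of all processes whose command line contains `name`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited, or cmdline not readable
        if name in cmdline:
            pids.append(int(entry))
    return pids

if __name__ == "__main__":
    matches = pids_matching("redis-server")
    if len(matches) > 1:
        # A restart of an ambiguous set of processes doesn't work reliably,
        # so refuse and escalate instead.
        print(f"multiple redis-server pids {matches}: refusing restart, paging")
```

This is the failure mode a naive `pidof`-style lookup hits: with two redis-server processes, the script must either disambiguate or give up.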
Comment 20 • 11 years ago
Comment on attachment 823601 [details]
hourly_check -- cronjob to ensure redis is running
Nagios alerts now checking for redis-server running:
Notification Type: PROBLEM
Service: procs - redis-server
Host: redis01.build.scl1.mozilla.com
Address: 10.12.48.24
State: CRITICAL
Date/Time: 10-28-2013 14:09:21
Additional Info:
PROCS CRITICAL: 0 processes with regex args redis-server
This band-aid can be removed.
Attachment #823601 -
Attachment is obsolete: true
Comment 21 • 11 years ago
The weekly restart still does not function properly, leading to bug 942545. Removing the restart. Will find a reasonable time to introduce a manual restart.
Updated • 11 years ago
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Comment 22 (Reporter) • 11 years ago
Did a manual restart just now; there were 420 file descriptors open.
Comment 23 • 11 years ago
When I had a look, it seemed that most of the open connections were from stale TCP sessions. I'm guessing that the server does not set SO_KEEPALIVE. If it's possible to hack that in somehow, it could probably avoid the need for restarts.
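A sketch of the SO_KEEPALIVE idea: redis 2.4 doesn't set it, but any client (or a patched server) can enable it per socket. The timing values below are illustrative, and TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific:

```python
import socket

def keepalive_socket():
    """Create a TCP socket with keepalive probing enabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Start probing after 60s idle, probe every 10s, drop after 6 failures,
    # so a dead peer is detected in roughly two minutes instead of never.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)
    return s

s = keepalive_socket()
print("SO_KEEPALIVE:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

With this in place, stale sessions from a flapped scl1-scl3 link would get torn down by the kernel rather than pinning fds until a restart.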
Comment 24 (Reporter) • 11 years ago
We could also check whether the python-redis used on the signing servers (v2.4.5) has any known bugs, as it's always those machines that cause the accumulation of connections. The ones from buildapi01 get torn down properly, but then the reporter processes exit completely there too.
Comment 25 (Reporter) • 11 years ago
Restarted manually - 479 connections open prior to that; last restart was Dec 3.
Comment 26 • 10 years ago
Redis is gone.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Updated • 6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated • 4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard