Closed Bug 905587 Opened 12 years ago Closed 11 years ago

Investigate redis hangs

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: nthomas, Unassigned)

References

Details

Attachments

(3 files, 5 obsolete files)

netstat for redis process 12 years ago Nick Thomas [:nthomas] (UTC+12) 123.53 KB, text/plain		Details
crontab from redis01 12 years ago hwine 100 bytes, text/plain		Details
weekly_restart -- script to restart redis-server 12 years ago hwine 1.17 KB, text/plain		Details
hourly_check -- cronjob to ensure redis is running 12 years ago hwine 1.00 KB, text/plain		Details
weekly_restart -- script to restart redis-server 12 years ago hwine 1.26 KB, text/plain		Details
hourly_check -- cronjob to ensure redis is running 12 years ago hwine 924 bytes, text/plain		Details
weekly_restart -- script to restart redis-server 12 years ago hwine 1.34 KB, text/plain		Details
hourly_check -- cronjob to ensure redis is running 12 years ago hwine 1.00 KB, text/plain		Details

Nick Thomas [:nthomas] (UTC+12)

Reporter

Description

•

12 years ago

Bug 905554 and bug 898739 are recent examples of redis01.build.m.o becoming unresponsive. We need to get to the bottom of this and fix it.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 1

•

12 years ago

Timeline for bug 905554:

Last successful background save writing to /var/log/redis/redis.log:
  [16062] 14 Aug 22:34:00 * Background saving terminated with success

reporter-4hr.log on buildapi01:
  Wed Aug 14 22:39:01 -0700
    File "/home/buildapi/src/buildapi/scripts/reporter.py", line 356, in <module>
      report = build_report(R, session, scheduler_db_engine, starttime, endtime)
  ...
  sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away') 'SELECT masters.id AS masters_id, masters.url AS masters_url, masters.name AS masters_name \nFROM masters

strace for the redis process was looping over:
  gettimeofday({1376559180, 294528}, NULL) = 0
  gettimeofday({1376559180, 294567}, NULL) = 0
  epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 10240, 73) = 1
  accept(4, 0x7fff4849af90, [12199322303221202960]) = -1 EMFILE (Too many open files)
  open("/var/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EMFILE (Too many open files)

https://github.com/antirez/redis/issues/246 says this was fixed in Redis 2.6; we have 2.4.x.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 2

•

12 years ago

Amy, could you please take a look at the KVM hosts for redis01.build.scl1 and buildapi01.build.scl1 to see if there is anything logged around Aug 14 22:39 ? I suspect either KVM or the network burped around then.

Flags: needinfo?(arich)

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 3

•

12 years ago

Attached file netstat for redis process — Details

There were a large number of connections from signing servers, many more than the usual 20 or so. The breakdown of the number of connections is:

  252 mac-signing1.srv.releng.scl3
  258 mac-signing2.srv.releng.scl3
  221 signing4.srv.releng.scl3
  210 signing5.srv.releng.scl3
  206 signing6.srv.releng.scl3
    3 mac-signing3.build.scl1
    3 mac-signing4.build.scl1
    6 buildapi01.build.scl1

Taken with the 'MySQL server has gone away' for buildapi01.build.scl1 ---> buildbot-ro-vip.db.scl3, this seems like the scl1-scl3 link had a glitch.
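A per-host breakdown like the one above can be produced by tallying the foreign-address column of the netstat output. The following is a minimal sketch, not the command actually used here. It assumes the common `netstat -t` column layout (proto, recv-q, send-q, local address, foreign address, state); the function name `count_connections` and the sample lines are hypothetical.

```python
from collections import Counter


def count_connections(netstat_text, port=6379):
    """Count ESTABLISHED connections to the local port, grouped by remote host."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Keep only connections whose local end is the redis port
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        remote_host = foreign.rsplit(":", 1)[0]
        counts[remote_host] += 1
    return counts


# Hypothetical sample input, not taken from the attached netstat file
SAMPLE = """\
tcp 0 0 redis01:6379 signing4:51000 ESTABLISHED
tcp 0 0 redis01:6379 signing4:51001 ESTABLISHED
tcp 0 0 redis01:6379 buildapi01:40000 ESTABLISHED
tcp 0 0 redis01:6379 signing5:42000 TIME_WAIT
"""

if __name__ == "__main__":
    for host, n in count_connections(SAMPLE).most_common():
        print(n, host)
```

Running the same tally on the client side and comparing the two counts is one way to spot connections that only the server still believes are open.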

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

12 years ago

Depends on: 905616

Amy Rich [:arr] [:arich]

Comment 4

•

12 years ago

buildapi and redis are not on the same primary kvm servers, though they do share a secondary node. I don't see anything obvious in the logs on any of the three machines. The only anomaly around that time was one warning from rngd about a failed block on the primary server for redis (which shouldn't have had anything to do with this).

Flags: needinfo?(arich)

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 5

•

12 years ago

catlee, I'm thinking we should upgrade to 2.6.x. Do you recall where the redis package came from when redis01 was set up? AFAICT CentOS 5 doesn't include it, so perhaps it was a manual compile, but there is no spec file in the repo.

See also
* bug 735293 (make redis less spof)
* bug 735252 (signing servers can hang on connection to redis)
* bug 863268 (migrate buildapi01 and redis01 off kvm)

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Comment 6

•

12 years ago

I don't remember where redis on redis01 came from...We're certainly not using anything fancy from it, so upgrading should be fine. Unfortunately migrating to a new redis host will require a downtime I think, since we only support a single redis host ATM. Could we split up the redis instances to have one used for buildapi and another used for signing?

Flags: needinfo?(catlee)

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 7

•

12 years ago

Bug 919848 was another instance of this.

David Burns :automatedtester

Comment 8

•

12 years ago

Since we are putting instances in this bug, please see the list below:

  bug 912428
  bug 905554

Just so we are all aware, when builds-4hr.js.gz doesn't get updated due to a hang, the sheriffs _will_ close _all_ the trees, since they can't trust any results in TBPL. We then wait for Releng to fix the issue, which can be some time if you aren't around :nthomas

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 9

•

12 years ago

RelEng HOWTO:

connect to:
  ssh root@redis01.build.mozilla.org

diagnosis:
  1. get the pid for redis
  2. ls /proc/<pid>/fd | wc -l
  3. if it's ~1000, it's this bug

resolution:
  service redis restart

confirmation:
  1. telnet localhost 6379
  2. say 'MONITOR'
  3. should see a lot of lines fly past once builds-4hrs generation gets going again
  4. if it doesn't, ssh to root@buildapi01.build.mozilla.org and look at buildapi processes

Deassigning, because I'm not going to have cycles to work on this until mid October between EOQ, PTO, and Summit.
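The diagnosis step (counting entries under /proc/<pid>/fd) can also be scripted. This is a rough sketch for Linux only. The 1024 limit and the 95% margin are assumptions chosen to match the "~1000" rule of thumb above, and the helper names are hypothetical.

```python
import os


def open_fd_count(pid):
    """Return the number of open file descriptors for a process (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def near_fd_limit(fd_count, limit=1024, margin=0.95):
    """True when fd_count is within the given margin of the descriptor limit.

    limit=1024 and margin=0.95 are assumed values, not read from the server.
    """
    return fd_count >= limit * margin


if __name__ == "__main__":
    # Demonstrate on this process; on redis01 you would pass the redis pid.
    if os.path.isdir("/proc/self/fd"):
        count = open_fd_count(os.getpid())
        print(count, "open fds; near limit:", near_fd_limit(count))
```

Reading the directory needs the same privileges as the `ls` in the HOWTO, so for another user's process this would have to run as root.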

Assignee: nthomas → nobody

Ed Morley [:emorley]

Updated

•

12 years ago

Blocks: 926246

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 10

•

12 years ago

catlee suggested a @weekly cron job to restart the redis service, on the basis that we're gradually accreting file descriptors until we get over 1024 and the service fails to get a new one. That seems likely, as I logged an increase in open connections from 202 to 237 in the 11 hours up until now. The signing servers have many connections open, at least according to redis01. E.g. netstat on redis01 says signing4 has 42 ESTABLISHED connections, while the same on signing4 says only 3. We have v2.4.5 of redis-py on the signing servers and buildapi01, and https://github.com/andymccurdy/redis-py/blob/master/CHANGES has some references to handling connections better, so we could try that. Or the redis server upgrade I mentioned earlier.
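As a back-of-the-envelope check on the restart interval, the observed growth (202 to 237 connections in 11 hours) can be extrapolated to the 1024 limit. The sketch below assumes the leak is linear, which the thread does not establish. The function name is hypothetical.

```python
def hours_until_limit(start, end, elapsed_hours, limit=1024):
    """Linear extrapolation of the hours left before `end` grows to `limit`.

    Returns None when the count is not growing.
    """
    rate = (end - start) / elapsed_hours  # connections gained per hour
    if rate <= 0:
        return None
    return (limit - end) / rate


if __name__ == "__main__":
    hours = hours_until_limit(202, 237, 11)
    print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

With these figures the estimate is roughly 247 hours, or a little over 10 days, from the 237 mark. Under the linear assumption a weekly restart would land before the limit is reached.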

hwine

Updated

•

12 years ago

Comment 11

•

12 years ago

grabbing for patch level fix to improve operations

Assignee: nobody → hwine

Status: NEW → ASSIGNED

hwine

Comment 12

•

12 years ago

Attached file crontab from redis01 — Details

PATCH crontab -- no prior crontab

hwine

Comment 13

•

12 years ago

Attached file weekly_restart -- script to restart redis-server (obsolete) — Details

PATCH - cronjob to restart redis weekly to avoid FD leak. Output emailed to release@, pages hwine if anything goes wrong

hwine

Comment 14

•

12 years ago

Attached file hourly_check -- cronjob to ensure redis is running (obsolete) — Details

PATCH -- verify redis running every hour. Attempt restart if not. Email release@ if restart, and page hwine if anything unexpected

hwine

Comment 15

•

12 years ago

Bubblegum now in place -- please remove when doing the real fix.

Assignee: hwine → nobody

Status: ASSIGNED → NEW

hwine

Comment 16

•

12 years ago

Attached file weekly_restart -- script to restart redis-server (obsolete) — Details

fixed syntax error when multiple redis-server processes present Still emails on any error, pages hwine on outage

Attachment #819219 - Attachment is obsolete: true

hwine

Comment 17

•

12 years ago

Attached file hourly_check -- cronjob to ensure redis is running (obsolete) — Details

fixed syntax error when multiple redis-server processes present Still emails on any error, pages hwine on outage

Attachment #819220 - Attachment is obsolete: true

hwine

Comment 18

•

12 years ago

Attached file weekly_restart -- script to restart redis-server — Details

Tweaked to fix the syntax error and to wait for the new process to create its pid file.

Attachment #819893 - Attachment is obsolete: true

hwine

Comment 19

•

12 years ago

Attached file hourly_check -- cronjob to ensure redis is running (obsolete) — Details

So, sometimes redis has 2 processes running, and that confuses the script. And, when it's running 2 processes, restart doesn't work. So don't even try anymore, just page.

Attachment #819894 - Attachment is obsolete: true

hwine

Comment 20

•

12 years ago

Comment on attachment 823601 [details] hourly_check -- cronjob to ensure redis is running

Nagios alerts now checking for redis-server running:

  Notification Type: PROBLEM
  Service: procs - redis-server
  Host: redis01.build.scl1.mozilla.com
  Address: 10.12.48.24
  State: CRITICAL
  Date/Time: 10-28-2013 14:09:21
  Additional Info: PROCS CRITICAL: 0 processes with regex args redis-server

This band-aid can be removed.

Attachment #823601 - Attachment is obsolete: true

hwine

Comment 21

•

12 years ago

The weekly restart still does not function properly, leading to bug 942545. Removing the restart. Will find a reasonable time to introduce a manual restart.

Armen [:armenzg]

Updated

•

12 years ago

Component: Buildduty → Platform Support

QA Contact: armenzg → coop

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 22

•

12 years ago

Did a manual restart just now; there were 420 file descriptors open.

Dustin J. Mitchell [:dustin] (he/him)

Comment 23

•

12 years ago

When I had a look, it seemed that most of the open connections were from stale TCP sessions. I'm guessing that the server does not set SO_KEEPALIVE. If it's possible to hack that in somehow, it could probably avoid the need for restarts.
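For reference, this is what enabling SO_KEEPALIVE looks like at the socket level. It is a generic illustration in Python, not how redis-server (which is written in C) handles its sockets. The function name and the timing values are assumptions picked for the example.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive probes so dead peers are eventually detected.

    idle, interval and probes are example values, not tuned recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are platform-specific,
    # so each one is applied only where the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With keepalive on, the kernel probes idle connections and closes those whose peer no longer answers, which frees the descriptor without a service restart.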

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 24

•

12 years ago

We could also check if the python-redis used on the signing servers (v2.4.5) has any known bugs, as it's always those machines that cause the accumulation of connections. The ones from buildapi01 get torn down properly, but then reporter processes exit completely there too.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 25

•

12 years ago

Restarted manually - 479 connections open prior to that, last restart Dec 3.

Chris Cooper [:coop] (he/him)

Comment 26

•

11 years ago

Redis is gone.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

7 years ago

Component: Platform Support → Buildduty

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
