Status

Release Engineering
Buildduty
P1
blocker
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: nthomas, Assigned: nthomas)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

4 years ago
Traced from 
Sat 02:14:12 PDT [4364] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:28:59 ago - 369134 bytes in 0.036 second response time (http://m.allizom.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
to the reporter.py script on buildapi01 not getting any response from redis01.build.mozilla.org (specifically waiting on data from fd 5, a socket).

There was also some bustage in builds trying to get bits signed, since that uses redis too. Helpfully inbound was already closed and it's the weekend.

In /var/log/redis/redis.log:
[1569] 27 Jul 01:39:20 * 10 changes in 300 seconds. Saving...
[1569] 27 Jul 01:39:20 * Background saving started by pid 13848
[13848] 27 Jul 01:39:26 * DB saved on disk
[1569] 27 Jul 01:39:26 * Background saving terminated with success
[1569] 27 Jul 01:44:27 * 10 changes in 300 seconds. Saving...
[1569] 27 Jul 01:44:27 * Background saving started by pid 13943
[13943] 27 Jul 01:44:34 * DB saved on disk
[1569] 27 Jul 01:44:34 * Background saving terminated with success
[14033] 27 Jul 01:49:45 * DB saved on disk
[14128] 27 Jul 01:54:54 * DB saved on disk
[14219] 27 Jul 02:00:07 * DB saved on disk
[14330] 27 Jul 02:05:15 * DB saved on disk
[14421] 27 Jul 02:10:28 * DB saved on disk
[14519] 27 Jul 02:15:35 * DB saved on disk
[14610] 27 Jul 02:20:47 * DB saved on disk
[14705] 27 Jul 02:25:54 * DB saved on disk
[14796] 27 Jul 02:31:03 * DB saved on disk
[14958] 27 Jul 02:39:57 * DB saved on disk
[15059] 27 Jul 02:45:08 * DB saved on disk
[15150] 27 Jul 02:50:16 * DB saved on disk
[15245] 27 Jul 02:55:25 * DB saved on disk
[15335] 27 Jul 03:00:36 * DB saved on disk
[15436] 27 Jul 03:05:47 * DB saved on disk
[15526] 27 Jul 03:11:00 * DB saved on disk
[15637] 27 Jul 03:16:11 * DB saved on disk
[15734] 27 Jul 03:21:20 * DB saved on disk
[15825] 27 Jul 03:26:30 * DB saved on disk
[15920] 27 Jul 03:31:42 * DB saved on disk

strace-ing the redis process yields a lot of this pattern:
gettimeofday({1374921289, 745893}, NULL) = 0
gettimeofday({1374921289, 745915}, NULL) = 0
epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 10240, 57) = 1
accept(4, 0x7fffcffad7b0, [1012113338269696016]) = -1 EMFILE (Too many open files)
open("/var/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EMFILE (Too many open files)

and sure enough 
$ ls -l /proc/1569/fd | wc -l
1025
all sockets (should have tried to trace the other end of those). Some sort of network glitch then lots of clients trying to reconnect ?

Stopped the redis service and restarted it, which cleared the problems with the buildapi reporting.
(Assignee)

Comment 1

4 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=25813483&tree=Mozilla-Central is a signing failure from pre-fix, rebuilt. Can't find anything else on other trees.
Assignee: nobody → nthomas
Severity: normal → blocker
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Assignee)

Comment 2

4 years ago
(In reply to Nick Thomas [:nthomas] from comment #0)
> Some sort of network glitch then lots of clients trying to reconnect ?

We got a separate email from cron on buildapi01 (report-today.sh):
sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away') 'SELECT slaves.id AS slaves_id, slaves.name AS slaves_name \nFROM slaves \nWHERE slaves.id = %s' (4314L,)

That came in at 03:08 and is for a scl1 -> scl3 connection. redis was probably busted since 01:44, and reporter.py is using a scl1 -> sc11 connection. Might be a KVM issue too, I guess, both buildapi01 and redis01 are KVM in scl1.
Product: mozilla.org → Release Engineering

Updated

4 years ago
Blocks: 926246

Updated

4 years ago
QA Contact: other → armenzg
You need to log in before you can comment on or make changes to this bug.