Closed Bug 942545 Opened 6 years ago Closed 6 years ago

builds-4hr.js.gz not updating, all trees closed

Categories

(Infrastructure & Operations :: CIDuty, task)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 11-24-2013 00:16:15

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:17:01 ago - 250539 bytes in 0.045 second response time

and the companion,

Sun 00:21:55 PST [4308] redis01.build.scl1.mozilla.com:procs - redis-server is CRITICAL: PROCS CRITICAL: 0 processes with regex args redis-server

All trees closed.
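For readers unfamiliar with the file-age alert quoted above, here is a rough sketch of how an "H:MM:SS ago" age could be turned into a Nagios-style state. The function names and the 600/900 second thresholds are assumptions for illustration only; the real check's thresholds are not stated in this bug, beyond the fact that 0:17:01 was reported as CRITICAL.

```shell
# Hypothetical helpers, not the real Nagios check.

# Convert an "H:MM:SS" age string to seconds.
age_to_seconds() {
    local h rest m s
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so "08" is not parsed as octal; default to 0.
    h=${h#0}; h=${h:-0}
    m=${m#0}; m=${m:-0}
    s=${s#0}; s=${s:-0}
    echo $(( h * 3600 + m * 60 + s ))
}

# Print OK, WARNING or CRITICAL for an age in seconds.
# warn and crit are thresholds in seconds (assumed values, see above).
classify_age() {
    local age="$1" warn="$2" crit="$3"
    if [ "$age" -ge "$crit" ]; then
        echo CRITICAL
    elif [ "$age" -ge "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

With the assumed thresholds, `classify_age "$(age_to_seconds 0:17:01)" 600 900` would report CRITICAL, in line with the alert.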
Blocks: 926246
Adding release and John to this bug in case someone checks their email today.

Note that the last outage, bug 936878 two weeks ago, also happened on a Sunday, so whatever process is involved (I guess some weekly process) should maybe not run on a Sunday, in case no one is around :)
What's going on here? Why is this not being acted on? John?
Flags: needinfo?(joduinn)
Blocks: 942503
Severity: blocker → critical
Priority: -- → P1
Severity: critical → blocker
Priority: P1 → --
I've restarted redis on redis01.build.m.o, but have to get on a plane now. Will verify when I can get online next.
Seems to have done the trick. We missed most of the things that died failing to get a signing token, since they were more than four hours ago, but a dying PGO build on fx-team showed that we were successfully rebuilding builds-4hr.js.gz. Retriggered nightlies on m-c and aurora; killed the b2g nightlies on m-c, since they don't care about signing and apparently completed fine the first time. Trees reopened.
Severity: blocker → normal
Ok. The output of the weekly cron was:

Found redis running on pid 26858
Open files 316 in /proc, 325 via lsof
Stopping redis-server: [FAILED]
Starting redis-server: [  OK  ]
cat: /var/run/redis/redis.pid: No such file or directory
/root/weekly_restart: line 20: test: : integer expression expected
redis confusion: pid_file=, pgrep=26858
Redis apparantly not running after restart

I believe hwine made some changes last time this happened, so we'll need to look further to see what might be causing this.
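The "integer expression expected" message in the output above is what `test` prints when one side of a numeric comparison is an empty string, which is consistent with `pid_file=` being empty after the failed `cat`. A minimal sketch of a more defensive comparison follows. The function names `read_pid_file` and `pids_agree` are hypothetical, and this is not the actual contents of /root/weekly_restart, which is not shown in this bug.

```shell
# Hypothetical helpers; a sketch, not the real weekly_restart script.

# Print the pid stored in a pid file.
# Returns 1 (and prints nothing) if the file is missing, empty or not numeric.
read_pid_file() {
    local file="$1" pid=""
    [ -r "$file" ] || return 1
    # read returns non-zero if the file lacks a trailing newline; pid is still set.
    read -r pid < "$file" || true
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$pid"
}

# Return 0 only if both values are non-empty integers and are equal.
# Validating first means an empty value never reaches the numeric test.
pids_agree() {
    local pid_file_value="$1" pgrep_value="$2"
    case "$pid_file_value" in ''|*[!0-9]*) return 1 ;; esac
    case "$pgrep_value" in ''|*[!0-9]*) return 1 ;; esac
    [ "$pid_file_value" -eq "$pgrep_value" ]
}
```

With helpers like these, a missing pid file would be reported as "pid file missing" rather than surfacing as a `test` syntax error, and the script could, for example, wait and retry before declaring redis not running.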
No longer blocks: 942503
There has been discussion about redoing the redis service (moving off of kvm, into scl3, managed by webops).  Please see bug 934627 and bug 934593 for proposed future work.
(In reply to Amy Rich [:arich] [:arr] from comment #6)
> There has been discussion about redoing the redis service (moving off of
> kvm, into scl3, managed by webops).  Please see bug 934627 and bug 934593
> for proposed future work.

In addition to what Amy listed, other "make redis more stable" and "have better monitoring on redis" work by both RelEng and IT is being tracked in bug 905587 and bug 905616.
Flags: needinfo?(joduinn)
IIRC we have removed the cron job that did the restart.
Buildduty will be running the restart on Monday mornings until this is a stable process.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations