Closed
Bug 960054
Opened 11 years ago
Closed 11 years ago
builds-4hr.js.gz not updating, all trees closed
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Unassigned)
+++ This bug was initially created as a clone of Bug #942545 +++
Email alert:
[Sheriffs] ** PROBLEM alert - builddata.pub.build.mozilla.org/http file age - /buildjson/builds-4hr.js.gz is CRITICAL **
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-15-2014 05:55:15
Additional Info:
CRITICAL - Socket timeout after 10 seconds
http://m.allizom.org/http%2Bfile%2Bage%2B-%2B/buildjson/builds-4hr.js.gz
Comment 1 • 11 years ago (Reporter)
Has now recovered - trees reopened, but leaving open for diagnosis - can someone take a quick look?
***** Nagios *****
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 01-15-2014 06:15:05
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1673250 bytes in 1.906 second response time
Updated • 11 years ago (Reporter)
Severity: blocker → critical
Comment 2 • 11 years ago
I'm pretty sure this was a false positive. "Socket timeout" means that the Nagios server couldn't actually check the status, and usually isn't indicative of a real issue.
Were you seeing missing jobs on tbpl?
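As a rough illustration of the distinction drawn in comment 2, the sketch below separates "the probe timed out, so the file age is unknown" from "the file really is too old". It is a hypothetical classifier, not the actual Nagios plugin: the function name `classify_check`, the `UNKNOWN` result for timeouts, and the 5-minute threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_check(last_modified: Optional[datetime], now: datetime,
                   max_age: timedelta, timed_out: bool) -> str:
    """Return a Nagios-style state string for a file-age probe (hypothetical)."""
    if timed_out or last_modified is None:
        # The probe never learned the file's age, so staleness is not proven.
        return "UNKNOWN"
    age = now - last_modified
    # Only a file that is demonstrably older than the threshold is CRITICAL.
    return "CRITICAL" if age > max_age else "OK"


if __name__ == "__main__":
    now = datetime(2014, 1, 28, 12, 0, 0, tzinfo=timezone.utc)  # illustrative time
    limit = timedelta(minutes=5)  # assumed threshold, not taken from the bug
    print(classify_check(None, now, limit, timed_out=True))
    print(classify_check(now - timedelta(minutes=8, seconds=33), now, limit,
                         timed_out=False))
```

With a split like this, only the second kind of result would need to close the trees.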
Comment 3 • 11 years ago (Reporter)
I checked http://builddata.pub.build.mozilla.org/buildjson/ and builds-4hr.js hadn't been updated for ~5 mins at that point (a .tmp file was present alongside, so I presume a job was in progress).
Comment 4 • 11 years ago
My email seemed suspiciously timed, with the problem alert, the three wait time reports, then the recovery alert. Is that all one machine or all one data source?
Comment 5 • 11 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> My email seemed suspiciously timed, with the problem alert, the three wait
> time reports, then the recovery alert. Is that all one machine or all one
> data source?
I'm pretty sure they all share at least one resource, so it wouldn't shock me if that's what happened. That could explain why Nagios went off too.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 6 • 11 years ago (Reporter)
And again; all trees closed:
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-16-2014 04:47:16
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Severity: critical → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7 • 11 years ago
Has now recovered - will reopen the trees, but leaving open for diagnosis
Comment 8 • 11 years ago (Reporter)
Trees reopened:
***** Nagios *****
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 01-16-2014 04:57:10
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1348108 bytes in 4.271 second response time
--
Please can we figure this out? :-)
Comment 9 • 11 years ago
catlee and dustin are discussing this in #releng
Comment 10 • 11 years ago (Reporter)
Latest instance and then recovery shortly after:
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-16-2014 20:37:21
Additional Info:
CRITICAL - Socket timeout after 10 seconds
---
***** Nagios *****
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 01-16-2014 20:42:11
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1476632 bytes in 5.573 second response time
Comment 11 • 11 years ago (Reporter)
The Nagios alert has been flapping a lot, even with bug 962089, e.g.:
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-27-2014 20:45:18
Additional Info:
CRITICAL - Socket timeout after 10 seconds
***** Nagios *****
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 01-27-2014 20:50:18
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1435828 bytes in 9.937 second response time
-------
Times (problem alert → recovery alert):
Date/Time: 01-23-2014 10:22:23 → 01-23-2014 10:27:13
Date/Time: 01-23-2014 15:45:23 → 01-23-2014 16:00:23
Date/Time: 01-24-2014 09:41:22 → 01-24-2014 09:46:22
Date/Time: 01-27-2014 15:17:53 → 01-27-2014 15:31:13
Date/Time: 01-27-2014 18:22:18 → 01-27-2014 18:27:08
Date/Time: 01-27-2014 20:45:18 → 01-27-2014 20:50:18
And now the latest problem alert (no recovery alert seen yet):
Date/Time: 01-28-2014 03:08:10
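To put numbers on the flapping, the problem/recovery pairs listed in comment 11 can be turned into durations. This is only a small sketch: the timestamps are copied from the comment, the names `FMT`, `PAIRS` and `alert_durations` are made up for the example, and the gaps measure the time between notifications rather than the true length of any outage.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%m-%d-%Y %H:%M:%S"  # format used in the Nagios "Date/Time" field

# (problem alert, recovery alert) pairs as listed in comment 11.
PAIRS: List[Tuple[str, str]] = [
    ("01-23-2014 10:22:23", "01-23-2014 10:27:13"),
    ("01-23-2014 15:45:23", "01-23-2014 16:00:23"),
    ("01-24-2014 09:41:22", "01-24-2014 09:46:22"),
    ("01-27-2014 15:17:53", "01-27-2014 15:31:13"),
    ("01-27-2014 18:22:18", "01-27-2014 18:27:08"),
    ("01-27-2014 20:45:18", "01-27-2014 20:50:18"),
]


def alert_durations(pairs: List[Tuple[str, str]]) -> List[timedelta]:
    """Time between each problem notification and its recovery notification."""
    return [
        datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        for start, end in pairs
    ]


if __name__ == "__main__":
    durations = alert_durations(PAIRS)
    for (start, _), gap in zip(PAIRS, durations):
        print(f"{start}  alerting for {gap}")
    print("total:", sum(durations, timedelta()))
```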
Comment 12 • 11 years ago
And all trees are closed for comment #11:
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-28-2014 03:08:10
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:08:33 ago - 905009 bytes in 8.155 second response time
Comment 13 • 11 years ago (Reporter)
The recovery alerts also suggest bug 962089's attempt to not download the body didn't work.
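The byte counts in the recovery alerts (over a megabyte each) are what suggest the body is still being fetched. For comparison, here is a minimal sketch of reading a file's age from the `Last-Modified` header of an HTTP HEAD request, which returns headers only. It uses the Python standard library and is not how the Nagios check or bug 962089 is implemented; the helper names and the sample header value are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def build_head_request(url: str) -> Request:
    """A HEAD request asks the server for headers only, with no response body."""
    return Request(url, method="HEAD")


def age_from_last_modified(header_value: str, now: datetime) -> timedelta:
    """Age of the resource given its Last-Modified header value."""
    return now - parsedate_to_datetime(header_value)


def remote_file_age(url: str, timeout: float = 10.0) -> timedelta:
    """Fetch headers for `url` and return how long ago it was modified."""
    with urlopen(build_head_request(url), timeout=timeout) as resp:
        value = resp.headers["Last-Modified"]  # None if the header is absent
    if value is None:
        raise ValueError("server did not send a Last-Modified header")
    return age_from_last_modified(value, datetime.now(timezone.utc))


if __name__ == "__main__":
    # Illustrative values only; no network access is needed for this part.
    now = datetime(2014, 1, 28, 3, 8, 10, tzinfo=timezone.utc)
    print(age_from_last_modified("Tue, 28 Jan 2014 02:59:37 GMT", now))
```

Calling `remote_file_age` with the report's URL would exercise the network path; the pure helper is kept separate so the age arithmetic can be checked offline.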
Comment 14 • 11 years ago
I should reiterate that this particular failure of this alert does not indicate a tree-closing problem, *unless* the underlying file is actually too old.
That said, I'm going to work today on reimplementing this on the new buildapi system. The habit of generating four days' worth of data every minute may not be sustainable - with a cold cache, the task takes 11 minutes!
Comment 15 • 11 years ago (Reporter)
4 days? The file covers the last 4 hours of completed jobs.
Comment 16 • 11 years ago
That's what I meant.. that still means we're wrangling each minute's worth of data 240 times, so basically doing 240x the amount of work we should.
Comment 17 • 11 years ago (Reporter)
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #16)
> That's what I meant..
Just checking, since I was surprised the job takes 11 mins to run.
> that still means we're wrangling each minute's worth
> of data 240 times, so basically doing 240x the amount of work we should.
You mean as opposed to each time just appending the latest minute's worth of jobs to the builds-4hr dataset and expiring those now > 4hrs?
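The 240x figure follows from a 4-hour window regenerated every minute: each minute's jobs are reprocessed in 4 × 60 = 240 consecutive runs. The append-and-expire alternative described in comment 17 could look roughly like the sketch below. This is not the buildapi code; `RollingJobWindow`, the `(endtime, job)` tuple shape, and the assumption that jobs arrive in roughly chronological batches are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Iterable, List, Tuple

Job = Dict[str, object]


class RollingJobWindow:
    """Keeps only jobs that finished within the last `span` (default 4 hours)."""

    def __init__(self, span: timedelta = timedelta(hours=4)) -> None:
        self.span = span
        self._jobs: Deque[Tuple[datetime, Job]] = deque()  # oldest on the left

    def add_minute(self, new_jobs: Iterable[Tuple[datetime, Job]],
                   now: datetime) -> None:
        """Append the latest batch, then expire anything older than the window."""
        for endtime, job in sorted(new_jobs, key=lambda item: item[0]):
            self._jobs.append((endtime, job))
        cutoff = now - self.span
        while self._jobs and self._jobs[0][0] < cutoff:
            self._jobs.popleft()

    def snapshot(self) -> List[Job]:
        """Current contents of the window, oldest first."""
        return [job for _, job in self._jobs]


if __name__ == "__main__":
    start = datetime(2014, 1, 28, 3, 0, 0)  # illustrative time
    window = RollingJobWindow()
    window.add_minute([(start, {"id": 1})], now=start)
    later = start + timedelta(hours=4, minutes=1)
    window.add_minute([(later, {"id": 2})], now=later)
    print(window.snapshot())
```

Each run then touches only the newest batch plus whatever has aged out, instead of rebuilding all four hours.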
Comment 18 • 11 years ago
Yes, or something in between - generate a rolling report up to an even multiple of 10 minutes, then leave that report static and start a new one. That would require frontend changes of course, to download multiple reports. But that seems like a more manageable approach.
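A sketch of the in-between idea from comment 18: reports are keyed by the start of their 10-minute bucket, every bucket except the newest is static, and a frontend downloads the set that covers the last four hours. The function names and the choice to floor timestamps against the Unix epoch are assumptions for the example, not an existing buildapi interface.

```python
from datetime import datetime, timedelta
from typing import List

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts: datetime,
                 bucket: timedelta = timedelta(minutes=10)) -> datetime:
    """Round `ts` down to the nearest multiple of `bucket` since the epoch."""
    whole_buckets = (ts - EPOCH) // bucket  # timedelta // timedelta gives an int
    return EPOCH + whole_buckets * bucket


def buckets_covering(now: datetime,
                     span: timedelta = timedelta(hours=4),
                     bucket: timedelta = timedelta(minutes=10)) -> List[datetime]:
    """Bucket start times whose reports together cover [now - span, now].

    The last entry is the bucket still being written; the rest never change.
    """
    current = bucket_start(now - span, bucket)
    last = bucket_start(now, bucket)
    starts = []
    while current <= last:
        starts.append(current)
        current += bucket
    return starts


if __name__ == "__main__":
    now = datetime(2014, 1, 28, 7, 25, 10)  # illustrative time
    needed = buckets_covering(now)
    print(len(needed), "reports, rolling one starts at", needed[-1])
```

Under this scheme the generator only rewrites the newest report each minute, at the cost of clients fetching several files instead of one.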
Comment 19 • 11 years ago (Reporter)
Ideally longer term, consumers such as treeherder (which has currently had to resort to builds-4hr due to pulse bugs) would very much like to use Pulse so we can avoid this insanity altogether :-)
Comment 20 • 11 years ago (Reporter)
s/insanity/polling insanity/
Comment 21 • 11 years ago (Reporter)
More spam:
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-28-2014 07:25:10
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Comment 22 • 11 years ago (Reporter)
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 01-28-2014 20:43:13
Additional Info:
CRITICAL - Socket timeout after 10 seconds
***** Nagios *****
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 01-28-2014 20:48:13
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1146819 bytes in 8.744 second response time
Comment 23 • 11 years ago
I think it's safe to assume at this point that this check is failing due to bug 964853.
I'm going to downtime it for a day since it's not giving us any useful information at the moment. As far as I can see, the report itself is still up to date (less than a minute old right now).
There's ongoing work to bring the generation of this report out of the releng network, which will fix the timeout errors.
Depends on: 964853
Comment 24 • 11 years ago
We've replaced the entire generation process just now, and anyway I haven't heard any complaints about this in two weeks.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard