Closed Bug 960054 Opened 10 years ago Closed 10 years ago

builds-4hr.js.gz not updating, all trees closed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

+++ This bug was initially created as a clone of Bug #942545 +++

Email alert:

[Sheriffs] ** PROBLEM alert - builddata.pub.build.mozilla.org/http file age - /buildjson/builds-4hr.js.gz is CRITICAL **

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-15-2014 05:55:15

Additional Info:
CRITICAL - Socket timeout after 10 seconds

   

http://m.allizom.org/http%2Bfile%2Bage%2B-%2B/buildjson/builds-4hr.js.gz
This has now recovered and the trees have been reopened, but I'm leaving the bug open for diagnosis - can someone take a quick look?

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-15-2014 06:15:05

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1673250 bytes in 1.906 second response time
Severity: blocker → critical
I'm pretty sure this was a false positive. "socket timeout" means that the nagios server couldn't actually check the status, and usually isn't indicative of a real issue.

Were you seeing missing jobs on tbpl?
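
For illustration, here is a minimal sketch (not the actual Nagios plugin in use here) of what an "http file age" check amounts to, and why a socket timeout is a different failure from a genuinely stale file. The 15-minute threshold is an assumption:

#!/usr/bin/env python3
# Illustrative sketch only: a socket timeout means "the check could not run",
# which is a different failure from "builds-4hr.js.gz is actually stale".
import socket
import sys
import time
from email.utils import parsedate_to_datetime
from urllib.error import URLError
from urllib.request import urlopen

URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz"
MAX_AGE = 15 * 60  # assumed staleness threshold, in seconds

try:
    resp = urlopen(URL, timeout=10)
except (socket.timeout, URLError):
    # The probe itself failed; this says nothing about the file's real age.
    print("CRITICAL - Socket timeout after 10 seconds")
    sys.exit(2)

last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])
age = int(time.time() - last_modified.timestamp())
status = "CRITICAL" if age > MAX_AGE else "OK"
print("%s - builds-4hr.js.gz last modified %d seconds ago" % (status, age))
sys.exit(2 if status == "CRITICAL" else 0)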
I checked http://builddata.pub.build.mozilla.org/buildjson/ and builds-4hr.js hadn't been updated for ~5 mins at that point (with a .tmp file present alongside, so I presume a job was in progress).
My email seemed suspiciously timed, with the problem alert, the three wait time reports, then the recovery alert. Is that all one machine or all one data source?
(In reply to Phil Ringnalda (:philor) from comment #4)
> My email seemed suspiciously timed, with the problem alert, the three wait
> time reports, then the recovery alert. Is that all one machine or all one
> data source?

I'm pretty sure they all share at least one resource...so it wouldn't shock me if that's what happened. That could explain why nagios went off too.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
And again; all trees closed:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-16-2014 04:47:16

Additional Info:
CRITICAL - Socket timeout after 10 seconds
Severity: critical → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This has now recovered - I'll reopen the trees, but am leaving this open for diagnosis.
Trees reopened:

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-16-2014 04:57:10

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1348108 bytes in 4.271 second response time

--

Please can we figure this out? :-)
catlee and dustin are discussing this in #releng
Latest instance and then recovery shortly after:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-16-2014 20:37:21

Additional Info:
CRITICAL - Socket timeout after 10 seconds

---

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-16-2014 20:42:11

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1476632 bytes in 5.573 second response time
Depends on: 962089
The Nagios alert has been flapping a lot, even with bug 962089:

eg:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-27-2014 20:45:18

Additional Info:
CRITICAL - Socket timeout after 10 seconds

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-27-2014 20:50:18

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1435828 bytes in 9.937 second response time

-------

Times (problem alert \n recovery alert):

Date/Time: 01-23-2014 10:22:23
Date/Time: 01-23-2014 10:27:13

Date/Time: 01-23-2014 15:45:23
Date/Time: 01-23-2014 16:00:23

Date/Time: 01-24-2014 09:41:22
Date/Time: 01-24-2014 09:46:22

Date/Time: 01-27-2014 15:17:53
Date/Time: 01-27-2014 15:31:13

Date/Time: 01-27-2014 18:22:18
Date/Time: 01-27-2014 18:27:08

Date/Time: 01-27-2014 20:45:18
Date/Time: 01-27-2014 20:50:18

And now the latest problem alert (no recovery alert seen yet):
Date/Time: 01-28-2014 03:08:10
And all trees are closed for comment #11:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 03:08:10

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:08:33 ago - 905009 bytes in 8.155 second response time
The recovery alerts also suggest bug 962089's attempt to not download the body didn't work.
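
For comparison, a hedged sketch of what "don't download the body" could look like: a HEAD request transfers only the headers, so the check can read Last-Modified without pulling the ~1.5 MB report. This is an illustration, not the actual change made in bug 962089:

# Hedged sketch: read the resource's age from a HEAD request, never the body.
import http.client
import time
from email.utils import parsedate_to_datetime

def file_age_seconds(host, path, timeout=10):
    """Age of an HTTP resource in seconds, without fetching its body."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        last_modified = parsedate_to_datetime(resp.getheader("Last-Modified"))
    finally:
        conn.close()
    return time.time() - last_modified.timestamp()

# e.g. file_age_seconds("builddata.pub.build.mozilla.org",
#                       "/buildjson/builds-4hr.js.gz")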
I should reiterate that this particular failure of this alert does not indicate a tree-closing problem, *unless* the underlying file is actually too old.

That said, I'm going to work today on reimplementing this on the new buildapi system.  The habit of generating four days' worth of data every minute may not be sustainable - with a cold cache, the task takes 11 minutes!
4 days? The file covers the last 4 hours of completed jobs.
That's what I meant.. that still means we're wrangling each minute's worth of data 240 times, so basically doing 240x the amount of work we should.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #16)
> That's what I meant.. 

Just checking, since I was surprised the job takes 11 mins to run.

> that still means we're wrangling each minute's worth
> of data 240 times, so basically doing 240x the amount of work we should.

You mean as opposed to each time just appending the latest minute's worth of jobs to the builds-4hr dataset and expiring those now > 4hrs?
Yes, or something in between - generate a rolling report up to an even multiple of 10 minutes, then leave that report static and start a new one.  That would require frontend changes of course, to download multiple reports.  But that seems like a more manageable approach.
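
As a rough sketch of the append-and-expire variant discussed above (the fetch_builds_since() helper and the "endtime" field are assumptions standing in for the real buildapi query and schema, not actual code), each run would touch only the minute or so of new data instead of rebuilding the full four hours:

import json
import os
import time

WINDOW = 4 * 60 * 60  # four hours, in seconds

def update_rolling_report(report_path, fetch_builds_since, now=None):
    now = now if now is not None else time.time()
    try:
        with open(report_path) as f:
            report = json.load(f)
    except (IOError, ValueError):
        report = {"builds": [], "last_run": now - WINDOW}

    # Append only the builds that completed since the previous run...
    report["builds"].extend(fetch_builds_since(report["last_run"]))
    # ...and expire anything that has aged out of the four-hour window.
    report["builds"] = [b for b in report["builds"]
                        if b["endtime"] >= now - WINDOW]
    report["last_run"] = now

    # Write atomically, matching the .tmp-then-rename pattern seen earlier.
    tmp_path = report_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(report, f)
    os.rename(tmp_path, report_path)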
Ideally, longer term, consumers such as treeherder (which currently has to resort to builds-4hr due to Pulse bugs) would very much like to use Pulse so we can avoid this insanity altogether :-)
s/insanity/polling insanity/
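
On the consumer side, that push-based approach might look roughly like the kombu sketch below; the exchange name, routing key, queue name, and credentials are placeholders/assumptions for illustration, not a confirmed Pulse contract:

# Hedged sketch: subscribe to build messages on Pulse (a RabbitMQ broker)
# instead of re-downloading builds-4hr.js.gz every minute.
from kombu import Connection, Exchange, Queue

exchange = Exchange("exchange/build/", type="topic", passive=True)
queue = Queue("queue/<pulse-user>/builds-demo", exchange=exchange,
              routing_key="#", auto_delete=True)

def on_message(body, message):
    # Each message describes one build event; a consumer can maintain its own
    # rolling view of recent jobs instead of polling the report.
    print(body)
    message.ack()

with Connection("amqps://<pulse-user>:<password>@pulse.mozilla.org:5671//") as conn:
    with conn.Consumer(queue, callbacks=[on_message]):
        while True:
            conn.drain_events()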
More spam:


***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 07:25:10

Additional Info:
CRITICAL - Socket timeout after 10 seconds
***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 20:43:13

Additional Info:
CRITICAL - Socket timeout after 10 seconds

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-28-2014 20:48:13

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1146819 bytes in 8.744 second response time
I think it's safe to assume at this point that this check is failing due to bug 964853.

I'm going to downtime it for a day since it's not giving us any useful information at the moment.  As far as I can see, the report itself is still up to date (less than a minute old right now).

There's ongoing work to bring the generation of this report out of the releng network, which will fix the timeout errors.
Depends on: 964853
We've replaced the entire generation process just now, and anyway I haven't heard any complaints about this in two weeks.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard