Closed Bug 960054 Opened 10 years ago Closed 10 years ago

builds-4hr.js.gz not updating, all trees closed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

+++ This bug was initially created as a clone of Bug #942545 +++

Email alert:

[Sheriffs] ** PROBLEM alert - builddata.pub.build.mozilla.org/http file age - /buildjson/builds-4hr.js.gz is CRITICAL **

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-15-2014 05:55:15

Additional Info:
CRITICAL - Socket timeout after 10 seconds

   

http://m.allizom.org/http%2Bfile%2Bage%2B-%2B/buildjson/builds-4hr.js.gz
This has now recovered and the trees have been reopened, but I'm leaving the bug open for diagnosis - can someone take a quick look?

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-15-2014 06:15:05

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1673250 bytes in 1.906 second response time
Severity: blocker → critical
I'm pretty sure this was a false positive. "socket timeout" means that the nagios server couldn't actually check the status, and usually isn't indicative of a real issue.

Were you seeing missing jobs on tbpl?
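
For illustration, here is a minimal sketch (not the actual Nagios plugin in use here) of what an "http file age" check amounts to, and why a socket timeout is a different failure from a genuinely stale file. The 15-minute threshold is an assumption:

#!/usr/bin/env python3
# Illustrative sketch only: a socket timeout means "the check could not run",
# which is a different failure from "builds-4hr.js.gz is actually stale".
import socket
import sys
import time
from email.utils import parsedate_to_datetime
from urllib.error import URLError
from urllib.request import urlopen

URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz"
MAX_AGE = 15 * 60  # assumed staleness threshold, in seconds

try:
    resp = urlopen(URL, timeout=10)
except (socket.timeout, URLError):
    # The probe itself failed; this says nothing about the file's real age.
    print("CRITICAL - Socket timeout after 10 seconds")
    sys.exit(2)

last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])
age = int(time.time() - last_modified.timestamp())
status = "CRITICAL" if age > MAX_AGE else "OK"
print("%s - builds-4hr.js.gz last modified %d seconds ago" % (status, age))
sys.exit(2 if status == "CRITICAL" else 0)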
I checked http://builddata.pub.build.mozilla.org/buildjson/ and builds-4hr.js hadn't been updated for ~5 mins at that point (with a .tmp file present alongside, so I presume a job was in progress).
My email seemed suspiciously timed, with the problem alert, the three wait time reports, then the recovery alert. Is that all one machine or all one data source?
(In reply to Phil Ringnalda (:philor) from comment #4)
> My email seemed suspiciously timed, with the problem alert, the three wait
> time reports, then the recovery alert. Is that all one machine or all one
> data source?

I'm pretty sure they all share at least one resource...so it wouldn't shock me if that's what happened. That could explain why nagios went off too.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
And again; all trees closed:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-16-2014 04:47:16

Additional Info:
CRITICAL - Socket timeout after 10 seconds
Severity: critical → blocker
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This has now recovered - I'll reopen the trees, but am leaving this open for diagnosis.
Trees reopened:

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-16-2014 04:57:10

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1348108 bytes in 4.271 second response time

--

Please can we figure this out? :-)
catlee and dustin are discussing this in #releng
Latest instance and then recovery shortly after:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-16-2014 20:37:21

Additional Info:
CRITICAL - Socket timeout after 10 seconds

---

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-16-2014 20:42:11

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1476632 bytes in 5.573 second response time
Depends on: 962089
The Nagios alert has been flapping a lot, even with bug 962089:

eg:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-27-2014 20:45:18

Additional Info:
CRITICAL - Socket timeout after 10 seconds

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-27-2014 20:50:18

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1435828 bytes in 9.937 second response time

-------

Times (problem alert \n recovery alert):

Date/Time: 01-23-2014 10:22:23
Date/Time: 01-23-2014 10:27:13

Date/Time: 01-23-2014 15:45:23
Date/Time: 01-23-2014 16:00:23

Date/Time: 01-24-2014 09:41:22
Date/Time: 01-24-2014 09:46:22

Date/Time: 01-27-2014 15:17:53
Date/Time: 01-27-2014 15:31:13

Date/Time: 01-27-2014 18:22:18
Date/Time: 01-27-2014 18:27:08

Date/Time: 01-27-2014 20:45:18
Date/Time: 01-27-2014 20:50:18

And now the latest problem alert (no recovery alert seen yet):
Date/Time: 01-28-2014 03:08:10
And all trees are closed for comment #11:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 03:08:10

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:08:33 ago - 905009 bytes in 8.155 second response time
The recovery alerts also suggest bug 962089's attempt to not download the body didn't work.
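
For comparison, a hedged sketch of what "don't download the body" could look like: a HEAD request transfers only the headers, so the check can read Last-Modified without pulling the ~1.5 MB report. This is an illustration, not the actual change made in bug 962089:

# Hedged sketch: read the resource's age from a HEAD request, never the body.
import http.client
import time
from email.utils import parsedate_to_datetime

def file_age_seconds(host, path, timeout=10):
    """Age of an HTTP resource in seconds, without fetching its body."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        last_modified = parsedate_to_datetime(resp.getheader("Last-Modified"))
    finally:
        conn.close()
    return time.time() - last_modified.timestamp()

# e.g. file_age_seconds("builddata.pub.build.mozilla.org",
#                       "/buildjson/builds-4hr.js.gz")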
I should reiterate that this particular failure of this alert does not indicate a tree-closing problem, *unless* the underlying file is actually too old.

That said, I'm going to work today on reimplementing this on the new buildapi system.  The habit of generating four days' worth of data every minute may not be sustainable - with a cold cache, the task takes 11 minutes!
4 days? The file covers the last 4 hours of completed jobs.
That's what I meant.. that still means we're wrangling each minute's worth of data 240 times, so basically doing 240x the amount of work we should.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #16)
> That's what I meant.. 

Just checking, since I was surprised the job takes 11 mins to run.

> that still means we're wrangling each minute's worth
> of data 240 times, so basically doing 240x the amount of work we should.

You mean as opposed to each time just appending the latest minute's worth of jobs to the builds-4hr dataset and expiring those now > 4hrs?
Yes, or something in between - generate a rolling report up to an even multiple of 10 minutes, then leave that report static and start a new one.  That would require frontend changes of course, to download multiple reports.  But that seems like a more manageable approach.
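
As a rough sketch of the append-and-expire variant discussed above (the fetch_builds_since() helper and the "endtime" field are assumptions standing in for the real buildapi query and schema, not actual code), each run would touch only the minute or so of new data instead of rebuilding the full four hours:

import json
import os
import time

WINDOW = 4 * 60 * 60  # four hours, in seconds

def update_rolling_report(report_path, fetch_builds_since, now=None):
    now = now if now is not None else time.time()
    try:
        with open(report_path) as f:
            report = json.load(f)
    except (IOError, ValueError):
        report = {"builds": [], "last_run": now - WINDOW}

    # Append only the builds that completed since the previous run...
    report["builds"].extend(fetch_builds_since(report["last_run"]))
    # ...and expire anything that has aged out of the four-hour window.
    report["builds"] = [b for b in report["builds"]
                        if b["endtime"] >= now - WINDOW]
    report["last_run"] = now

    # Write atomically, matching the .tmp-then-rename pattern seen earlier.
    tmp_path = report_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(report, f)
    os.rename(tmp_path, report_path)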
Ideally, longer term, consumers such as treeherder (which currently has to resort to builds-4hr due to Pulse bugs) would very much like to use Pulse so we can avoid this insanity altogether :-)
s/insanity/polling insanity/
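
On the consumer side, that push-based approach might look roughly like the kombu sketch below; the exchange name, routing key, queue name, and credentials are placeholders/assumptions for illustration, not a confirmed Pulse contract:

# Hedged sketch: subscribe to build messages on Pulse (a RabbitMQ broker)
# instead of re-downloading builds-4hr.js.gz every minute.
from kombu import Connection, Exchange, Queue

exchange = Exchange("exchange/build/", type="topic", passive=True)
queue = Queue("queue/<pulse-user>/builds-demo", exchange=exchange,
              routing_key="#", auto_delete=True)

def on_message(body, message):
    # Each message describes one build event; a consumer can maintain its own
    # rolling view of recent jobs instead of polling the report.
    print(body)
    message.ack()

with Connection("amqps://<pulse-user>:<password>@pulse.mozilla.org:5671//") as conn:
    with conn.Consumer(queue, callbacks=[on_message]):
        while True:
            conn.drain_events()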
More spam:


***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 07:25:10

Additional Info:
CRITICAL - Socket timeout after 10 seconds
***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 01-28-2014 20:43:13

Additional Info:
CRITICAL - Socket timeout after 10 seconds

***** Nagios  *****

Notification Type: RECOVERY

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK

Date/Time: 01-28-2014 20:48:13

Additional Info:
HTTP OK: HTTP/1.1 200 OK - 1146819 bytes in 8.744 second response time
I think it's safe to assume at this point that this check is failing due to bug 964853.

I'm going to downtime it for a day since it's not giving us any useful information at the moment.  As far as I can see, the report itself is still up to date (less than a minute old right now).

There's ongoing work to bring the generation of this report out of the releng network, which will fix the timeout errors.
Depends on: 964853
We've replaced the entire generation process just now, and anyway I haven't heard any complaints about this in two weeks.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard