builds-4hr.js.gz not updating

RESOLVED FIXED

Status

Release Engineering
Platform Support
--
major
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: RyanVM, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Description

4 years ago
Mon 12:03:06 PDT [4485] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:35 ago - 1245436 bytes in 2.634 second response time (http://m.allizom.org/http+file+age+-+/buildjson/builds-4hr.js.gz)

All trees closed.

Updated

4 years ago
Assignee: nobody → armenzg

Comment 1

4 years ago
I don't know if these are related:

12:18 nagios-releng: Mon 09:18:09 PDT [4294] dc1.ad.mozilla.com:Windows Memory is WARNING: WARNING: physical memory: Total: 4G - Used: 3.45G (86%) - Free: 564M (14%)  warning (http://m.allizom.org/Windows+Memory)
12:19 kmoir is now known as kmoir-afk
12:23 nagios-releng: Mon 09:23:09 PDT [4296] dc1.ad.mozilla.com:Windows Memory is CRITICAL: CRITICAL: physical memory: Total: 4G - Used: 3.69G (92%) - Free: 321M (8%)  critical (http://m.allizom.org/Windows+Memory)
...
First instance:
12:25 nagios-releng: Mon 09:25:49 PDT [4301] buildbot-master67.srv.releng.use1.mozilla.com:Command Queue is WARNING: 60 new items (http://m.allizom.org/Command+Queue)
...
14:44 nagios-releng: Mon 11:44:49 PDT [4444] buildbot-master65.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 5 dead items (http://m.allizom.org/Command+Queue)
14:44 nagios-releng: Mon 11:44:49 PDT [4445] buildbot-master64.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
14:45 nagios-releng: Mon 11:45:49 PDT [4446] buildbot-master58.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
14:45 nagios-releng: Mon 11:45:49 PDT [4447] buildbot-master57.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
14:45 nagios-releng: Mon 11:45:49 PDT [4448] buildbot-master62.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 2 dead items (http://m.allizom.org/Command+Queue)
14:46 nagios-releng: Mon 11:46:49 PDT [4449] buildbot-master67.srv.releng.use1.mozilla.com:Command Queue is OK: Ok (http://m.allizom.org/Command+Queue)
14:50 nagios-releng: Mon 11:50:49 PDT [4480] buildbot-master71.srv.releng.use1.mozilla.com:Command Queue is OK: Ok (http://m.allizom.org/Command+Queue)
14:56 nagios-releng: Mon 11:55:56 PDT [4481] buildbot-master69.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 125 new items (http://m.allizom.org/Command+Queue)
14:56 armenzg_buildduty is now known as armenzg_brb
14:59 nagios-releng: Mon 11:59:46 PDT [4483] buildbot-master44.build.scl1.mozilla.com:Command Queue is CRITICAL: 3 dead items (http://m.allizom.org/Command+Queue)
15:11 nagios-releng: Mon 12:10:56 PDT [4487] buildbot-master61.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 2 dead items (http://m.allizom.org/Command+Queue)
15:11 nagios-releng: Mon 12:10:56 PDT [4488] buildbot-master63.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
15:11 nagios-releng: Mon 12:10:57 PDT [4489] buildbot-master70.srv.releng.use1.mozilla.com:Command Queue is WARNING: 59 new items (http://m.allizom.org/Command+Queue)
15:11 armenzg_brb is now known as armenzg_buildduty
15:13 nagios-releng: Mon 12:12:56 PDT [4490] buildbot-master67.srv.releng.use1.mozilla.com:Command Queue is WARNING: 80 new items (http://m.allizom.org/Command+Queue)
15:15 nagios-releng: Mon 12:14:56 PDT [4492] buildbot-master64.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
15:15 nagios-releng: Mon 12:14:56 PDT [4493] buildbot-master65.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 5 dead items (http://m.allizom.org/Command+Queue)
15:15 nagios-releng: Mon 12:14:56 PDT [4494] buildbot-master66.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 2 dead items (http://m.allizom.org/Command+Queue)
15:16 nagios-releng: Mon 12:15:56 PDT [4495] buildbot-master62.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 2 dead items (http://m.allizom.org/Command+Queue)
15:19 nagios-releng: Mon 12:18:56 PDT [4496] buildbot-master71.srv.releng.use1.mozilla.com:Command Queue is WARNING: 51 new items (http://m.allizom.org/Command+Queue)
15:20 nagios-releng: Mon 12:20:46 PDT [4498] buildbot-master58.srv.releng.usw2.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
15:20 nagios-releng: Mon 12:20:46 PDT [4499] buildbot-master57.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 1 dead item (http://m.allizom.org/Command+Queue)
15:26 nagios-releng: Mon 12:25:56 PDT [4503] buildbot-master69.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 150 new items:oldest item is 1214s old (http://m.allizom.org/Command+Queue)
15:29 nagios-releng: Mon 12:29:46 PDT [4506] buildbot-master44.build.scl1.mozilla.com:Command Queue is CRITICAL: 3 dead items (http://m.allizom.org/Command+Queue)
15:29 nagios-releng: Mon 12:29:46 PDT [4507] buildbot-master51.srv.releng.use1.mozilla.com:Command Queue is WARNING: 73 new items (http://m.allizom.org/Command+Queue)

Comment 2

4 years ago
15:53 RyanVM|sheriffduty: AutomatedTester: catlee: armenzg_buildduty: looks like jobs are showing up again
15:53 RyanVM|sheriffduty: whatever that means
15:54 AutomatedTester: we have a recovery nagios too


[root@buildapi01 buildapi]# ls -lrt builds-4hr.js.gz
-rw-r--r-- 1 buildapi buildapi 1245028 Sep 30 11:49 builds-4hr.js.gz
[root@buildapi01 buildapi]# ls -lrt builds-4hr.js.gz
-rw-r--r-- 1 buildapi buildapi 1408227 Sep 30 12:57 builds-4hr.js.gz
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED

Comment 3

4 years ago
hrmm (@ 3:48pm)
Cron <buildapi@buildapi01> /home/buildapi/bin/report-4hr.sh
/home/buildapi/bin/report-4hr.sh: line 4: 13066 Terminated              /home/buildapi/bin/python /home/buildapi/src/buildapi/scripts/reporter.py -z -o /var/www/buildapi/buildjson/builds-4hr.js.gz --starttime $(date -d 'now - 4 hours' +\%s) >> reporter-4hr.log 2>&1

Comment 4

4 years ago
Another separate email w/o much information.

Subject: "Cron <root@relengwebadm> rsync -r --links --delete syncbld@cruncher.build.mozilla.org:/var/www/html/builds/ /mnt/netapp/relengweb/builddata/reports/"
Duplicate of this bug: 922339
Nagios alerted again this morning:

***** Nagios  *****

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 10-07-2013 01:43:03

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:53 ago - 158707 bytes in 0.027 second response time

This was resolved, but we should investigate these "intermittent" bustages, I guess.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

4 years ago
Component: Buildduty → General Automation
QA Contact: armenzg → catlee

Comment 7

4 years ago
Maybe best to keep on Buildduty.

catlee, any suggestions on what to investigate? (bug 924109 might also be related).
Component: General Automation → Buildduty
Flags: needinfo?(catlee)
QA Contact: catlee → armenzg

Updated

4 years ago
Severity: blocker → major
And it just happened again; trees closed.

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 10-08-2013 03:03:02

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:49 ago - 399096 bytes in 0.030 second response time
It recovered after 10 minutes, when the Nagios recovery mail was received.
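
For anyone wanting to reproduce the "http file age" alert by hand, here is a minimal sketch of the same idea: compare the Last-Modified header of the file against a threshold. The thresholds and function names are assumptions for illustration; the real limits of the Nagios check are not stated in this bug.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

# Hypothetical thresholds; the actual Nagios limits are not given in this bug.
WARN_AFTER = timedelta(minutes=5)
CRIT_AFTER = timedelta(minutes=10)


def fetch_last_modified(url, timeout=10):
    """Return the raw Last-Modified header of `url` using a HEAD request."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
        return resp.headers["Last-Modified"]


def file_age_state(last_modified_header, now=None):
    """Classify a resource as OK/WARNING/CRITICAL by the age of its header."""
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Treat a header without an explicit offset as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    age = now - modified
    if age >= CRIT_AFTER:
        return "CRITICAL", age
    if age >= WARN_AFTER:
        return "WARNING", age
    return "OK", age
```

Something like `file_age_state(fetch_last_modified(url))` could then be run against the builds-4hr.js.gz URL from a few different hosts, to tell a stale file apart from a monitoring-side problem.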

Updated

4 years ago
Assignee: armenzg → nobody
So, from logs and emails I see this:

Oct 6th: several occurrences of "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com'". Actually we've been hitting this most days now. I wonder if DNS is flapping sometimes?

Oct 7th around 04:24 PT: "Can't connect to MySQL server on 'buildbot-ro-vip.db.scl3.mozilla.com'"
Oct 7th around 06:27 PT: "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com'"
Oct 7th from 06:29 PT on: "QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30"

There were several hundred instances of that last message until about 08:03 PT.

We should see if DNS and/or the DB are flapping. I suspect these are somehow the cause. But we also need to look at how to make the DB queue pool more resilient. Perhaps upgrading Pylons or SQLAlchemy would help.
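
To put numbers on the three failure modes quoted above, a rough sketch like the following could tally them from a log file. The patterns are taken from the messages in this comment; the category labels and function name are made up for illustration, and the real log line format is not shown in this bug.

```python
import re
from collections import Counter

# Patterns built from the error messages quoted in this comment.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "dns_failure": re.compile(r"Unknown MySQL server host"),
    "connect_failure": re.compile(r"Can't connect to MySQL server"),
    "pool_exhausted": re.compile(
        r"QueuePool limit of size \d+ overflow \d+ reached"
    ),
}


def tally_errors(lines):
    """Count how many lines match each known failure pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # one category per line
    return counts
```

Feeding it an open log file (for example `tally_errors(open(path))`, with `path` being wherever the buildapi log lives) would show whether the DNS errors consistently precede the pool exhaustion.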
Flags: needinfo?(catlee)

Comment 11

4 years ago
arr, dustin: any ideas on how to investigate if DNS is flapping?
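
One low-tech way to look for flapping, sketched here under assumptions, is to resolve the name repeatedly from the affected host and record failures and address changes. The function names, attempt count, and delay are illustrative only; `socket.gethostbyname` is the only real lookup call used.

```python
import socket
import time


def probe_dns(hostname, attempts=10, delay=1.0,
              resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve `hostname` repeatedly; record (timestamp, address or None)."""
    results = []
    for i in range(attempts):
        try:
            address = resolver(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError; None marks a failure.
            address = None
        results.append((time.time(), address))
        if i < attempts - 1:
            sleep(delay)
    return results


def summarize(results):
    """Reduce probe results to a failure count and the distinct addresses seen."""
    failures = sum(1 for _, addr in results if addr is None)
    addresses = sorted({addr for _, addr in results if addr is not None})
    return {"attempts": len(results), "failures": failures,
            "addresses": addresses}
```

Running `summarize(probe_dns("buildbot-ro-vip.db.scl3.mozilla.com", attempts=600))` from buildapi01 around the times of the alerts would show whether lookups fail intermittently or return more than one address.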
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Funny story: I tried to log in to buildapi01 to take a look at the logs, and ended up at ns1b.infra.scl1. Almost immediately, that screen window froze and I killed it. Re-running the same SSH command got me to buildapi01. So, that's weird.

However, logging into ns1b directly shows

bond0.48  Link encap:Ethernet  HWaddr 00:13:21:AE:52:AC
          inet addr:10.12.48.22  Bcast:10.12.55.255  Mask:255.255.248.0

which is buildapi01's IP.  Similarly, on ns1a:

bond0.48  Link encap:Ethernet  HWaddr AC:16:2D:B2:56:3C  
          inet addr:10.12.48.21  Bcast:10.12.55.255  Mask:255.255.248.0

which is (thankfully) an unassigned IP.

So, clearly ns1b needs to stop squatting buildapi01's IP, but I'm not sure how to do that.  Just taking bond0.48 down may impact DNS service.
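
A duplicate like this can be caught mechanically by comparing configured interface addresses against the DNS/inventory assignments. The sketch below assumes the data has already been collected as (owner, IP) pairs; the function name and input shape are hypothetical.

```python
import ipaddress
from collections import defaultdict


def find_conflicts(assignments):
    """Return {ip: [owners]} for every IP claimed by more than one owner.

    `assignments` is an iterable of (owner, ip_string) pairs, e.g. gathered
    from DNS records and from interface configs on each host.
    """
    owners = defaultdict(set)
    for owner, ip in assignments:
        # Normalize through ipaddress so equal addresses compare equal.
        owners[ipaddress.ip_address(ip)].add(owner)
    return {str(ip): sorted(names)
            for ip, names in owners.items() if len(names) > 1}


# Values taken from the output quoted above.
observed = [
    ("buildapi01", "10.12.48.22"),
    ("ns1b bond0.48", "10.12.48.22"),
    ("ns1a bond0.48", "10.12.48.21"),
]
conflicts = find_conflicts(observed)
```

With the values above, `conflicts` flags 10.12.48.22 as claimed by both buildapi01 and ns1b's bond0.48, while ns1a's address is not reported.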
Flags: needinfo?(bhourigan)
I put ns1a's vlan48 IP into DNS, and selected a new one (10.12.48.16) for ns1b.  Since it's not holding either of the VIPs right now, I'm going to go ahead and change its active and on-disk config to correspond.
That change had no immediate impact, but will hopefully fix this problem.  Please re-open if it does not.  Brian, let me know if this doesn't seem like a good way to solve the problem.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Thank you for (hopefully) tracking down the cause of this! :-D

Updated

4 years ago
Flags: needinfo?(bhourigan)
Blocks: 926246