builds-4hr.js.gz et al aren't updating

Status: RESOLVED FIXED
Product: Release Engineering
Component: Buildduty
Priority: --
Severity: blocker
Reporter: RyanVM
Assignee: Unassigned
Firefox Tracking Flags: (Not tracked)

Description (Reporter, 3 years ago)
All trees closed.

Notification Type: PROBLEM

Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL

Date/Time: 08-21-2014 11:59:47

Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:08 ago - 8191 bytes in 0.008 second response time

etc
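
(For reference, the nagios check can be reproduced by hand; a minimal sketch, assuming plain HTTP access to the public copy of the file:)

curl -sI http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz | grep -i '^Last-Modified'
# if Last-Modified is more than a few minutes old, the reporter is wedged or failing
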
In case it was just hung, I killed it:

[root@relengwebadm.private.scl3 ~]# ps auxwww | grep 4hr
buildapi 19923  0.0  0.0   9224  1108 ?        Ss   11:50   0:00 /bin/sh /var/lib/buildapi/report-4hr.sh
buildapi 19929  0.0  0.0   9224   660 ?        S    11:50   0:00 /bin/sh /var/lib/buildapi/report-4hr.sh
buildapi 19934  0.0  0.3 211772 22080 ?        S    11:50   0:00 python2.7 /data/releng/www/buildapi/run-reporter --config /data/
releng/www/buildapi/reporter.cfg -z -o /mnt/netapp/relengweb/builddata/buildjson/builds-4hr.js.gz --starttime 1408632601
root     23207  0.0  0.0 103244   852 pts/1    S+   12:26   0:00 grep 4hr
[root@relengwebadm.private.scl3 ~]# kill 19934
[root@relengwebadm.private.scl3 ~]# ps auxwww | grep 4hr
buildapi 23239  0.0  0.0   9224  1104 ?        Ss   12:27   0:00 /bin/sh /var/lib/buildapi/report-4hr.sh
buildapi 23242  0.0  0.0   9224   656 ?        S    12:27   0:00 /bin/sh /var/lib/buildapi/report-4hr.sh
buildapi 23247 16.2  4.6 466988 277580 ?       S    12:27   0:04 python2.7 /data/releng/www/buildapi/run-reporter --config /data/
releng/www/buildapi/reporter.cfg -z -o /mnt/netapp/relengweb/builddata/buildjson/builds-4hr.js.gz --starttime 1408634821
root     23268  0.0  0.0 103240   848 pts/1    S+   12:27   0:00 grep 4hr
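
(For next time: before killing the reporter outright, a quick way to tell whether a PID like 19934 is actually wedged rather than just slow; a sketch, assuming root on the host:)

cat /proc/19934/wchan; echo   # kernel function the process is currently sleeping in
strace -p 19934               # attach and watch syscalls; little or no activity suggests a blocked read (Ctrl-C to detach)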

And the alert recovered.

Leaving open in case there's a more serious issue.
I just reopened the trees, since it seems to be okay.
Seems to be holding up fine, so resolving this for now.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We killed another stuck builds-4hr.js.gz reporter, and had to ask webops to restart Apache on the hosts which serve the releng cluster (secure.pub.build.m.o, slavealloc, lots of stuff).
Trees reopened, leaving the bug open for if/when this happens again.
Rough timeline:

First nagios alert - this file should be refreshed every minute by hitting http://buildapi.pvt.build.mozilla.org/buildapi/pending

> Date/Time: 08-21-2014 16:12:48
>** PROBLEM alert - builddata.pub.build.mozilla.org/http file age - /buildjson/builds-pending.js is CRITICAL **
>HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:31 ago - 2848 bytes in 0.008 second response time

Also got alerts for builds-running.js (same details as above), and 
> Thu 16:16:08 PDT
> PROBLEM alert - buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds

so we should have realized at that point that the relengweb cluster wasn't well. The buildbot database wasn't happy either, though, as builds-4hr.js.gz was stale; that reporter only does db + memcached work, and we focused on the db first.
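
(A quick sanity check of the pieces involved, i.e. the buildapi refresh endpoint, the buildbot db, and memcached; a sketch only, with assumed credentials, and the memcached host is a placeholder:)

# the refresh path: hitting the buildapi endpoint is what regenerates builds-pending.js
curl -s -o /dev/null -w '%{http_code}\n' http://buildapi.pvt.build.mozilla.org/buildapi/pending
# db: should come back instantly if the read VIP is healthy (user/password assumed)
mysql -h buildbot-ro-vip.db.scl3 -u buildbot_reader2 -p -e 'SELECT 1'
# memcached: basic liveness check (<memcached-host> is a placeholder)
echo stats | nc -w 2 <memcached-host> 11211 | head -5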

At about 16:20 there were three cron jobs running on relengwebadm: report-4hr.sh, which started at 16:01; report-today.sh, at 16:15; and running.sh, at some time we've lost track of.

At 16:34 we had this in 'show processlist' for buildbot db cluster:
>+----------+------------------+--------------------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
>| Id       | User             | Host               | db       | Command | Time | State                        | Info                                                                                                 |
>+----------+------------------+--------------------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
>| 16364117 | buildbot_reader2 | 10.22.70.209:7378  | buildbot | Query   | 1951 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364122 | buildbot_reader2 | 10.22.70.209:7419  | buildbot | Query   | 1966 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364127 | buildbot_reader2 | 10.22.70.209:7437  | buildbot | Query   | 1961 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364133 | buildbot_reader2 | 10.22.70.209:7464  | buildbot | Query   | 1941 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364138 | buildbot_reader2 | 10.22.70.209:7496  | buildbot | Query   | 1926 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364140 | buildbot_reader2 | 10.22.70.209:7515  | buildbot | Query   | 1971 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364145 | buildbot_reader2 | 10.22.70.209:7571  | buildbot | Query   | 1946 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364150 | buildbot_reader2 | 10.22.70.209:7616  | buildbot | Query   | 1936 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364155 | buildbot_reader2 | 10.22.70.209:7672  | buildbot | Query   | 1956 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364161 | buildbot_reader2 | 10.22.70.209:7714  | buildbot | Query   | 1931 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364166 | buildbot_reader2 | 10.22.70.209:7841  | buildbot | Query   | 1916 | Waiting for table level lock | SELECT builds.id, builders.name AS buildname, builds.buildnumber, builds.starttime, builds.endtime,  |
>| 16364177 | buildbot_reader2 | 10.22.70.209:9110  | buildbot | Query   | 1972 | Sending data                 | SELECT builds.id, builders.name AS builder_name, slaves.name AS slave_name, steps.name AS step_name, |
>| 16364179 | buildbot_reader2 | 10.22.70.209:9421  | buildbot | Query   | 1969 | Waiting for table level lock | SELECT builds.id AS builds_id, builds.buildnumber AS builds_buildnumber, builds.builder_id AS builds |
>| 16364262 | buildbot_reader2 | 10.22.70.209:26134 | buildbot | Query   | 1817 | Waiting for table level lock | SELECT slaves.name as slave_name, builders.name as builder_name, builds.* FROM build                 |
>| 16364662 | buildbot_reader2 | 10.22.70.209:33866 | buildbot | Query   | 1130 | Waiting for table level lock | SELECT builds.id AS builds_id, builds.buildnumber AS builds_buildnumber, builds.builder_id AS builds |
>| 16364663 | buildbot_reader2 | 10.22.70.209:33994 | buildbot | Query   | 1129 | Waiting for table level lock | SELECT builds.id as build_id, masters.name as master, steps.starttime, steps.endtime FROM masters, b |
>| 16365134 | buildbot_reader2 | 10.22.70.209:61013 | buildbot | Query   |  229 | Waiting for table level lock | SELECT builds.id as build_id, masters.name as master, steps.starttime, steps.endtime FROM masters, b |
>| 16365226 | buildbot_reader2 | 10.22.70.209:14014 | buildbot | Query   |    0 | init                         | show processlist                                                                                     |
>+----------+------------------+--------------------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
>18 rows in set (0.00 sec)
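
(For the record, the blocked readers above appear to be queued behind the long-running "Sending data" query, Id 16364177, the oldest in the list. A generic way to spot and, if need be, kill such a blocker from the db side; a sketch with a placeholder admin account and the read VIP hostname from the comment below, not what we actually did in this incident:)

# list queries running longer than 15 minutes on the buildbot cluster
mysql -h buildbot-ro-vip.db.scl3 -u <admin-user> -p -e \
  "SELECT id, user, time, state, LEFT(info, 60) AS info_head FROM information_schema.processlist
   WHERE command = 'Query' AND time > 900 ORDER BY time DESC"
# then, if appropriate:
# mysql -h buildbot-ro-vip.db.scl3 -u <admin-user> -p -e 'KILL 16364177'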

At 16:37 we killed the report-4hr process. The other cron jobs (i.e. the ones already running) then started to work, and future ones were fine too.

The cluster was still busted though:
> [16:48:08] <nagios-releng> Thu 16:48:08 PDT [4533] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
> [16:53:18] <nagios-releng> Thu 16:53:18 PDT [4534] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - Document modification date unknown - 630 bytes in 0.002 second
> [16:57:39] <nagios-releng> Thu 16:57:38 PDT [4536] slavealloc.pvt.build.mozilla.org:http_expect slavealloc.pvt.build is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 500 Internal Server Error

We found jakem on IRC and he looked in Zeus (all these sites are served by releng-zlb.vips.scl3), and found that the backend pool was all red: ports 81 and 82 on both of web[12].releng.webapp.scl3. He restarted Apache on them around 17:00, which resolved the issues. Nagios was happy a few minutes later.
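
(For next time, the recovery step on the web nodes, roughly; the init-script name and exact hostnames are assumptions:)

# on each of web1/web2.releng.webapp.scl3
sudo service httpd restart
# confirm the zeus backend ports answer again
curl -sI http://localhost:81/ | head -1
curl -sI http://localhost:82/ | head -1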

We don't yet know what caused the db and the cluster to hang. Pointers welcome on where to look.
The busted cluster also caused a lot of problems for jobs using talos-remote.pvt.build (android emulators & pandas, apparently, bug 1057153).
We think this was probably fallout from running some queries for bug 1055876, tying up buildbot-ro-vip.db.scl3. Bug 1057184 has been filed to try to protect against that.
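
(One generic safeguard, not necessarily what bug 1057184 ends up implementing: a cron'd watchdog that kills queries once they blow a time budget. A sketch only; the admin account, password, threshold, and user filter are all placeholders and would need tuning:)

# generate and run KILL statements for queries older than 30 minutes on the read VIP
mysql -h buildbot-ro-vip.db.scl3 -u <admin-user> -p'<password>' -N -e \
  "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist
   WHERE command = 'Query' AND time > 1800 AND user != 'system user'" \
  | mysql -h buildbot-ro-vip.db.scl3 -u <admin-user> -p'<password>'
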
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED