Closed Bug 1005342 Opened 6 years ago Closed 6 years ago
Tree closure due to Buildbot DB failing
30 minutes ago this file has stopped getting data from the buildbot databases: http://builddata.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz This causes a tree closure until solved.
[14:48:34] nagios-scl3 Fri 14:48:44 PDT  backup3.db.scl3.mozilla.com:MySQL DB Backups is WARNING: WARNING: buildbot is more than 8 hours behind. (http://m.mozilla.org/MySQL+DB+Backups) [14:50:05] Aj heh [14:50:06] Aj LIES [15:00:54] nagios-scl3 Fri 15:01:04 PDT  buildbot2.db.scl3.mozilla.com:MySQL Replication is WARNING: Replication lag: 896 seconds (http://m.mozilla.org/MySQL+Replication) [15:01:30] Aj almost there... [15:10:54] nagios-scl3 Fri 15:11:03 PDT  buildbot2.db.scl3.mozilla.com:MySQL Replication is CRITICAL: Replication lag: 1496 seconds (http://m.mozilla.org/MySQL+Replication) [15:11:01] Aj meh [15:11:21] =-= nthomas is now known as nthomas|away [15:15:05] =-= pmoore is now known as pmoore|buildduty [15:15:15] Callek Aj: trees closed due to that [15:15:26] Callek (5266) [15:15:30] Aj because of lag ... [15:15:38] Aj that truly makes no sense [15:15:42] Callek yea buildapi itself is "dead" [15:15:48] Callek which uses that DB heavily [15:16:11] Aj nagios-phx1 page cyborgshadow back online? [15:16:11] nagios-phx1 Aj: cyborgshadow has been paged with the message "back online?(Aj)" [15:16:15] Callek armenzg is filing a bug
buildbot2.db.scl3 is fully caught up: mysql -e "show slave status\G" | grep Seconds_Behind_Master Seconds_Behind_Master: 0
[15:32:17] Callek mpressman: any clue "what" got stuck so we can document/try to prevent in the future? [15:33:25] mpressman it could've been my query, I ran one that should've only taken a couple of minutes, but it was still running after I don't know how long and I killed it, then maybe 15 minutes later I noticed the alert in this channel [15:33:46] mpressman I'll do a quick scan of the slow query logs to see if it was mine or if something else pops up
I think the problem query was mine. In an effort to comment on bug 1005264 I ran some queries including this one: select builds.id, changes.when from builds, source_changes, changes, builders where builds.builder_id = builders.id and builds.source_id = source_changes.source_id and source_changes.change_id = changes.id group by builds.id having count(builds.id) > 1 limit 10; It had previously run in ~6 minutes. In order to get a different set of build.id values, I ran it again and went to a different buffer. After checking back I noticed the query had still not returned so I killed it. I then checked the #sysadmins room to see if there were any alerts and I saw that replication lag went critical. I went back to the db and noticed that the lag time started to decrease which leads me to believe that it was my query causing the issue. I further did a brief scan of the mysql slow logs to see if my query had possibly been caught by another and it was just timing that the killing of it coincided with the replication catching up, but didn't notice anything.
Closing to document the buildbot db is now recovered
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I looked at 4:01 pacific, and saw that builds-4hr.js.gz had last been generated at 2:47. This is generated by a crontab that runs every minute, and makes a lot of heavy queries against the DB. On a good day with a cold cache, it's about 15 minutes. With a hot cache it can finish comfortably in one minute. It would seem that Matt's heavy query caused the builds-4hr.js.gz queries to go slowly, and thus caused the file's generation to fall behind. I found the crontask, appropriately dated 2:47, and killed it. Cron started a new one up on the minute, and that completed in something like 45s. I don't know why the original job didn't complete soon after mpressman cancelled his query.
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.