Closed Bug 1005342 Opened 6 years ago Closed 6 years ago

Tree closure due to Buildbot DB failing

Categories

(Data & BI Services Team :: DB: MySQL, task, blocker)

x86_64
Linux
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mpressman)

References

Details

30 minutes ago this file has stopped getting data from the buildbot databases:
http://builddata.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz

This causes a tree closure until solved.
[14:48:34]	nagios-scl3	Fri 14:48:44 PDT [5250] backup3.db.scl3.mozilla.com:MySQL DB Backups is WARNING: WARNING: buildbot is more than 8 hours behind. (http://m.mozilla.org/MySQL+DB+Backups)
[14:50:05]	Aj	heh
[14:50:06]	Aj	LIES
[15:00:54]	nagios-scl3	Fri 15:01:04 PDT [5260] buildbot2.db.scl3.mozilla.com:MySQL Replication is WARNING: Replication lag: 896 seconds (http://m.mozilla.org/MySQL+Replication)
[15:01:30]	Aj	almost there...
[15:10:54]	nagios-scl3	Fri 15:11:03 PDT [5266] buildbot2.db.scl3.mozilla.com:MySQL Replication is CRITICAL: Replication lag: 1496 seconds (http://m.mozilla.org/MySQL+Replication)
[15:11:01]	Aj	meh
[15:11:21]	=-=	nthomas is now known as nthomas|away
[15:15:05]	=-=	pmoore is now known as pmoore|buildduty
[15:15:15]	Callek	Aj: trees closed due to that
[15:15:26]	Callek	(5266)
[15:15:30]	Aj	because of lag ...
[15:15:38]	Aj	that truly makes no sense
[15:15:42]	Callek	yea buildapi itself is "dead"
[15:15:48]	Callek	which uses that DB heavily
[15:16:11]	Aj	nagios-phx1 page cyborgshadow back online?
[15:16:11]	nagios-phx1	Aj: cyborgshadow has been paged with the message "back online?(Aj)"
[15:16:15]	Callek	armenzg is filing a bug
buildbot2.db.scl3 is fully caught up:
 mysql -e "show slave status\G" | grep Seconds_Behind_Master
        Seconds_Behind_Master: 0
[15:32:17]	Callek	mpressman: any clue "what" got stuck so we can document/try to prevent in the future?
[15:33:25]	mpressman	it could've been my query, I ran one that should've only taken a couple of minutes, but it was still running after I don't know how long and I killed it, then maybe 15 minutes later I noticed the alert in this channel
[15:33:46]	mpressman	I'll do a quick scan of the slow query logs to see if it was mine or if something else pops up
I think the problem query was mine. In an effort to comment on bug 1005264 I ran some queries including this one:
select builds.id, changes.when from builds, source_changes, changes, builders where builds.builder_id = builders.id and builds.source_id = source_changes.source_id and source_changes.change_id = changes.id group by builds.id having count(builds.id) > 1 limit 10;

It had previously run in ~6 minutes. In order to get a different set of build.id values, I ran it again and went to a different buffer. After checking back I noticed the query had still not returned so I killed it. I then checked the #sysadmins room to see if there were any alerts and I saw that replication lag went critical. I went back to the db and noticed that the lag time started to decrease which leads me to believe that it was my query causing the issue. I further did a brief scan of the mysql slow logs to see if my query had possibly been caught by another and it was just timing that the killing of it coincided with the replication catching up, but didn't notice anything.
Assignee: server-ops-database → mpressman
Blocks: 1005341
Closing to document the buildbot db is now recovered
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I looked at 4:01 pacific, and saw that builds-4hr.js.gz had last been generated at 2:47.  This is generated by a crontab that runs every minute, and makes a lot of heavy queries against the DB.  On a good day with a cold cache, it's about 15 minutes.  With a hot cache it can finish comfortably in one minute.  It would seem that Matt's heavy query caused the builds-4hr.js.gz queries to go slowly, and thus caused the file's generation to fall behind.

I found the crontask, appropriately dated 2:47, and killed it.  Cron started a new one up on the minute, and that completed in something like 45s.

I don't know why the original job didn't complete soon after mpressman cancelled his query.
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.