This query was being run over and over by mistake on the buildbot master:
update buildrequests set claimed_at=0, claimed_by_name=NULL, claimed_by_incarnation=NULL where id=id
The script was killed, but the queries kept showing up under incrementing process IDs within MySQL.
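For context on why this statement is so expensive: `id=id` is true for every row with a non-NULL id, so each run resets the claim on the entire table. Here is a minimal sketch of that effect, using Python's built-in sqlite3 as a stand-in for MySQL. The cut-down table layout and the sample rows are assumptions for illustration, not the real buildbot schema.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table loosely modelled on buildrequests (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE buildrequests ("
        "id INTEGER PRIMARY KEY, claimed_at INTEGER, "
        "claimed_by_name TEXT, claimed_by_incarnation TEXT)"
    )
    conn.executemany(
        "INSERT INTO buildrequests VALUES (?, ?, ?, ?)",
        [
            (1, 1700000000, "master-a", "pid1"),
            (2, 1700000100, "master-b", "pid2"),
            (3, 0, None, None),
        ],
    )
    conn.commit()
    return conn


def reset_claims(conn):
    """Run the update from this bug and return the number of rows it touched."""
    cur = conn.execute(
        "UPDATE buildrequests SET claimed_at=0, claimed_by_name=NULL, "
        "claimed_by_incarnation=NULL WHERE id=id"
    )
    conn.commit()
    return cur.rowcount


def claimed_count(conn):
    """Count requests that still look claimed (claimed_at is non-zero)."""
    return conn.execute(
        "SELECT COUNT(*) FROM buildrequests WHERE claimed_at <> 0"
    ).fetchone()[0]
```

Calling `reset_claims(make_demo_db())` touches all three demo rows, not one specific request, which is why repeated runs put so much load on the database.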
I tried changing the password for the buildbot2 user, but queries kept coming (I reverted the password and flushed privileges), so we tried restarting MySQL on buildbot1, but that hung. So we failed over. Luckily (we think) buildbot schedulers only reschedule if claimed_at is 0 AND completed=0, so the completed runs don't re-run. We're checking that now.
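The reasoning about rescheduling can be sketched as a query: after the bad update every request has claimed_at=0, but only requests that are also not completed would be picked up again. This is a rough illustration using sqlite3 in place of MySQL. The column names follow the wording of the comment above (`claimed_at`, `completed`), and the scheduler's real query and schema may differ.

```python
import sqlite3


def make_demo_db():
    """In-memory table where every row already has claimed_at=0 (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE buildrequests ("
        "id INTEGER PRIMARY KEY, claimed_at INTEGER, completed INTEGER)"
    )
    conn.executemany(
        "INSERT INTO buildrequests VALUES (?, ?, ?)",
        [(1, 0, 1), (2, 0, 0), (3, 0, 1), (4, 0, 0)],
    )
    conn.commit()
    return conn


def reschedulable_ids(conn):
    """Return ids that are both unclaimed and not completed."""
    rows = conn.execute(
        "SELECT id FROM buildrequests "
        "WHERE claimed_at = 0 AND completed = 0 ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

With the demo data, requests 1 and 3 are completed and are left alone even though their claims were wiped.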
Trees re-opened from an infra standpoint a few hours ago, around 6:30/7pm Eastern, I believe. (They remained closed for non-infra reasons, as they were before this incident happened.) Buildbot had more trouble recovering than expected: a bunch of duplicate jobs caused very long wait times and confusing results. We believe that this is a result of Buildbot masters not coping with a ~1h long downtime of the Buildbot database (the time between the query starting and us failing over). However, things seem much better now: there are no more duplicate jobs, builds are running fine for pushes that happened after the failover, and other than a backlog of jobs, things appear to be back to normal. We suspect that replication from buildbot1 -> buildbot2 (I _think_ I have those names right) wasn't quite working correctly, possibly because of the massive load on buildbot1 while the bad query was running. This is a guess that Catlee and I made while analyzing everything after the failover; it wouldn't shock me if that's wrong. Sheeri, thank you so much for your help with this so late on a Friday. Leaving this bug open for now because I believe we still need to fail back over and put buildbot1 back in as the rw master, and I'm not sure if you want to track that here or elsewhere.
Additional fallout from this: Some buildbot masters lost some of their schedulers. The symptom of this that was noticed was that l10n nightlies couldn't be triggered ("Started no scheduler: Firefox mozilla-central macosx64 l10n nightly"). I reconfiged the build/try/test masters for this (scheduler masters skipped because they had a full restart yesterday after we resolved the database issue). The masters now have their full complement of schedulers as best I can tell, and we've retriggered a Linux64 nightly to make sure things are okay. (We don't need a full set of l10n nightlies over the weekend IMO, so we didn't retrigger everything.)
buildbot1 has been upgraded and defragmented. Let's fail back tomorrow morning between 6 am and 9 am Pacific.
(In reply to Sheeri Cabral [:sheeri] from comment #4)
> buildbot1 has been upgraded and defragmented. Let's fail back tomorrow
> morning between 6 am and 9 am Pacific.
I should be online between these hours. Ping me whenever you're ready so I can keep an eye on things.
Am currently running checksums from buildbot2 to buildbot1 to find where any different/missing data is.
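As a rough illustration of the idea behind a table checksum comparison: hash the rows of each table in id-range chunks on both servers, then look only at the chunks whose digests differ. The sketch below uses sqlite3 connections as stand-ins for buildbot2 and buildbot1. The function names, the chunk size, and the assumption that every table has an integer `id` column are all hypothetical, and the bug does not say which tool was actually used.

```python
import hashlib
import sqlite3


def chunk_checksums(conn, table, chunk_size):
    """Return {chunk_start_id: hex digest} for rows grouped into id ranges.

    The table name is interpolated into the SQL, so pass trusted names only.
    """
    digests = {}
    # Select id first so row[0] is always the id, whatever the column order.
    for row in conn.execute(f"SELECT id, * FROM {table} ORDER BY id"):
        start = (row[0] // chunk_size) * chunk_size
        digest = digests.setdefault(start, hashlib.sha1())
        digest.update(repr(row).encode("utf-8"))
    return {start: digest.hexdigest() for start, digest in digests.items()}


def differing_chunks(source, target, table, chunk_size=1000):
    """Return the sorted chunk start ids whose contents differ between the two databases."""
    a = chunk_checksums(source, table, chunk_size)
    b = chunk_checksums(target, table, chunk_size)
    return sorted(s for s in a.keys() | b.keys() if a.get(s) != b.get(s))
```

Chunking keeps the follow-up work small: only the id ranges reported by `differing_chunks` need a row-by-row comparison.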
Added 653 records from the builders table back into buildbot1, between builder IDs 191000 and 192000.
builds table 28862087 through 29106081
(er, that's the current missing data I'm looking at, there are 101,216 missing rows on buildbot1 that I'll add back in).
Those rows have been added. The build_properties table has 2,696,355 missing rows, adding them back in now.
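Adding rows back amounts to finding the ids in a range that exist on buildbot2 but not on buildbot1, then copying those rows across. Here is a minimal sketch of that step, with sqlite3 connections standing in for the two MySQL servers. The helper name is hypothetical, it assumes both tables have the same column order and an integer `id` column, and it is not necessarily how the rows were restored here.

```python
import sqlite3


def copy_missing_rows(source, target, table, lo, hi):
    """Copy rows with lo <= id <= hi that exist in source but not in target.

    Returns the number of rows inserted. The table name is interpolated
    into the SQL, so pass trusted names only.
    """
    have = {
        row[0]
        for row in target.execute(
            f"SELECT id FROM {table} WHERE id BETWEEN ? AND ?", (lo, hi)
        )
    }
    cur = source.execute(
        f"SELECT id, * FROM {table} WHERE id BETWEEN ? AND ? ORDER BY id",
        (lo, hi),
    )
    # row[0] is the explicit id; row[1:] is the full row in table column order.
    missing = [row[1:] for row in cur if row[0] not in have]
    if missing:
        placeholders = ", ".join("?" for _ in missing[0])
        target.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", missing
        )
        target.commit()
    return len(missing)
```

Running it a second time over the same range copies nothing, which makes it safe to repeat after a fresh checksum pass.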
Added in about 2.7 million rows into the build_properties table. That took a while :D Also, 5,461 missing rows have been added into the changes table.
7,573 missing rows were added into the files table.
96,886 missing rows added back into the file_changes table.
358,394 missing rows added back into the properties table. 106,992 missing rows added back into the schedulerdb_requests table. 10,637 missing rows added back into the sourcestamps table. 11,979 missing rows added back into the source_changes table. Removed 97,020 rows from the steps table (removed on buildbot1, didn't exist on buildbot2).
100,299 missing rows added back into the buildbot_schedulers.buildrequests table. 3,941 missing rows added back into the buildbot_schedulers.buildsets table. 37,878 missing rows added back into the buildbot_schedulers.buildset_properties table. 399,020 missing rows added back into the buildbot_schedulers.change_files table. Running another checksum to see if there are any differences I missed.
20,521 additional missing rows added back into the buildbot_schedulers.buildsets table. 8,366 missing rows added back into the buildbot_schedulers.changes table. Running a checksum again... hopefully it comes out clean!
Checksum came up with no differences. All clear to fail back tomorrow before 9 am Pacific.
And we are failed back. Whee!
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team