Status

RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: scabral, Unassigned)

Tracking

Details

(Reporter)

Description

5 years ago
This query was being run over and over by mistake on the buildbot master:

update buildrequests set claimed_at=0, claimed_by_name=NULL, claimed_by_incarnation=NULL where id=id

The script was killed but the queries kept showing up, under incrementing process ids within MySQL.
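
A note on the quoted statement: `where id=id` compares the column with itself, so the predicate is true for every row with a non-NULL id and the update touches the whole table rather than one request. The bug does not say whether that was intended. Below is a minimal sketch of the difference between that predicate and a bound id. It uses SQLite in place of MySQL, and a hypothetical table that holds only the columns named in the query.

```python
import sqlite3

# Hypothetical stand-in for the buildrequests table (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildrequests ("
    "id INTEGER PRIMARY KEY, claimed_at INTEGER, "
    "claimed_by_name TEXT, claimed_by_incarnation TEXT)"
)
conn.executemany(
    "INSERT INTO buildrequests VALUES (?, ?, ?, ?)",
    [(1, 100, "master-a", "x"), (2, 200, "master-b", "y"), (3, 300, "master-c", "z")],
)
conn.commit()

UNCLAIM = (
    "UPDATE buildrequests SET claimed_at=0, claimed_by_name=NULL, "
    "claimed_by_incarnation=NULL WHERE id={}"
)

# As quoted in the description: the column is compared with itself.
rows_self = conn.execute(UNCLAIM.format("id")).rowcount
conn.rollback()  # undo, so the second statement starts from the same data

# With a bound parameter, only the named request is unclaimed.
rows_bound = conn.execute(UNCLAIM.format("?"), (2,)).rowcount
conn.commit()

still_claimed = conn.execute(
    "SELECT claimed_at FROM buildrequests WHERE id=1"
).fetchone()[0]
print(rows_self, rows_bound, still_claimed)
```

Here `rows_self` counts every row in the table, while `rows_bound` counts only the one request that was named.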
(Reporter)

Comment 1

5 years ago
I tried changing the password for the buildbot2 user, but the queries kept coming (I reversed the password and flushed privileges). We then tried restarting mysql on buildbot1, but that hung.

So we failed over. Luckily (we think) buildbot schedulers only reschedule if claimed_at is 0 AND completed=0, so the completed runs don't re-run. We're checking that now.
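
To illustrate the condition described above, here is a rough sketch of which requests would be eligible for rescheduling once every row has claimed_at=0. It assumes the column names exactly as written in this comment (`claimed_at`, `completed`), uses SQLite rather than MySQL, and the three sample rows are made up. It is not the actual Buildbot scheduler query.

```python
import sqlite3

# Hypothetical table with only the two columns the condition mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildrequests ("
    "id INTEGER PRIMARY KEY, claimed_at INTEGER, completed INTEGER)"
)
conn.executemany(
    "INSERT INTO buildrequests VALUES (?, ?, ?)",
    [
        (1, 1381500000, 1),  # claimed and finished
        (2, 1381500100, 0),  # claimed, still running
        (3, 0, 0),           # never claimed
    ],
)

# Effect of the runaway query: every request looks unclaimed.
conn.execute("UPDATE buildrequests SET claimed_at=0")

# The condition as described: unclaimed AND not completed.
eligible = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM buildrequests "
        "WHERE claimed_at=0 AND completed=0 ORDER BY id"
    )
]
print(eligible)
```

Under these assumptions the finished request (id 1) is skipped, while the request that was still running (id 2) becomes eligible again alongside the one that was never claimed.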

Updated

5 years ago
Blocks: 928514
Trees re-opened from an infra standpoint a few hours ago. Around 6:30/7pm eastern, I believe. (They remained closed for non-infra reasons, as they were before this incident happened.)

Buildbot had more trouble recovering than expected - we had a bunch of duplicate jobs that caused very long wait times and confusing results. We believe that this is a result of Buildbot masters not coping with a ~1h long downtime of the Buildbot database (the time between the query starting and us failing over). However, things seem much better now - there are no more duplicate jobs, builds are running fine for pushes that happened after the failover, and other than a backlog of jobs, things appear to be back to normal.

We suspect that replication from buildbot1 -> buildbot2 (I _think_ I have those names right) wasn't quite working correctly, possibly because of the massive load on buildbot1 while the bad query was running. This is a guess that Catlee and I made while analyzing everything after the failover - it wouldn't shock me if that's wrong.

Sheeri, thank you so much for your help with this so late on a Friday.

Leaving this bug open for now because I believe we still need to fail back over and put buildbot1 back in as the rw master, and I'm not sure if you want to track that here or elsewhere.
Additional fallout from this: Some buildbot masters lost some of their schedulers. The symptom of this that was noticed was that l10n nightlies couldn't be triggered ("Started no scheduler: Firefox mozilla-central macosx64 l10n nightly"). I reconfiged the build/try/test masters for this (scheduler masters skipped because they had a full restart yesterday after we resolved the database issue). The masters now have their full complement of schedulers as best I can tell, and we've retriggered a Linux64 nightly to make sure things are okay. (We don't need a full set of l10n nightlies over the weekend IMO, so we didn't retrigger everything.)
(Reporter)

Comment 4

5 years ago
buildbot1 has been upgraded and defragmented. Let's fail back tomorrow morning between 6 am and 9 am Pacific.
(In reply to Sheeri Cabral [:sheeri] from comment #4)
> buildbot1 has been upgraded and defragmented. Let's fail back tomorrow
> morning between 6 am and 9 am Pacific.

I should be online between these hours. Ping me whenever you're ready so I can keep an eye on things.
(Reporter)

Comment 6

5 years ago
Will do.
(Reporter)

Comment 7

5 years ago
Am currently running checksums from buildbot2 to buildbot1 to find where any different/missing data is.
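
For readers unfamiliar with the approach, this is a minimal sketch of a chunked checksum comparison between two copies of a table. It is not the tool that was actually used here (the comment does not name one). The `name` column, the chunk size, the id ranges and the helper names are all hypothetical, and SQLite stands in for MySQL.

```python
import hashlib
import sqlite3


def chunk_checksums(conn, chunk_size):
    """Map the start of each id range to a digest of the rows inside it."""
    digests = {}
    for row in conn.execute("SELECT id, name FROM builders ORDER BY id"):
        start = (row[0] // chunk_size) * chunk_size
        digests.setdefault(start, hashlib.sha256()).update(repr(row).encode())
    return {start: h.hexdigest() for start, h in digests.items()}


def differing_chunks(source, target, chunk_size=1000):
    """Return the chunk starts whose digests differ or exist on one side only."""
    a = chunk_checksums(source, chunk_size)
    b = chunk_checksums(target, chunk_size)
    return sorted(s for s in a.keys() | b.keys() if a.get(s) != b.get(s))


def make_db(ids):
    """Build an in-memory builders table holding the given ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE builders (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO builders VALUES (?, ?)",
        [(i, f"builder-{i}") for i in ids],
    )
    return conn


buildbot2 = make_db(range(190500, 192000))  # copy with all rows
buildbot1 = make_db(range(190500, 191200))  # copy missing the newest rows
mismatched = differing_chunks(buildbot2, buildbot1)
print(mismatched)
```

Only the id ranges that come back as mismatched need a row-by-row comparison, which keeps the expensive work limited to the ranges where the two servers actually disagree.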

Updated

5 years ago
Blocks: 929039
(Reporter)

Comment 8

5 years ago
Added 653 records from the builders table back into buildbot1, between builder ids 191000 and 192000.
(Reporter)

Comment 9

5 years ago
builds table 28862087 through 29106081
(Reporter)

Comment 10

5 years ago
(er, that's the current missing data I'm looking at, there are 101,216 missing rows on buildbot1 that I'll add back in).
(Reporter)

Comment 11

5 years ago
Those rows have been added. The build_properties table has 2,696,355 missing rows, adding them back in now.
(Reporter)

Comment 12

5 years ago
Added about 2.7 million rows to the build_properties table. That took a while :D

Also, 5,461 missing rows have been added into the changes table.
(Reporter)

Comment 13

5 years ago
7,573 missing rows were added into the files table.
(Reporter)

Comment 14

5 years ago
96,886 missing rows added back into the file_changes table.
(Reporter)

Comment 15

5 years ago
358,394 missing rows added back into the properties table.

106,992 missing rows added back into the schedulerdb_requests table.

10,637 missing rows added back into the sourcestamps table.

11,979 missing rows added back into the source_changes table.

Removed 97,020 rows from the steps table (removed on buildbot1, didn't exist on buildbot2).
(Reporter)

Comment 16

5 years ago
100,299 missing rows added back into the buildbot_schedulers.buildrequests table.

3,941 missing rows added back into the buildbot_schedulers.buildsets table.

37,878 missing rows added back into the buildbot_schedulers.buildset_properties table.

399,020 missing rows added back into the buildbot_schedulers.change_files table.

Running another checksum to see if there are any differences I missed.
(Reporter)

Comment 17

5 years ago
20,521 additional missing rows added back into the buildbot_schedulers.buildsets table.

8,366 missing rows added back into the buildbot_schedulers.changes table.

Running a checksum again... hopefully it comes out clean!
(Reporter)

Comment 18

5 years ago
Checksum came up with no differences. All clear to fail back tomorrow before 9 am Pacific.
(Reporter)

Comment 19

5 years ago
And we are failed back. Whee!
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team