Closed
Bug 1172750
Opened 10 years ago
Closed 10 years ago
Trees closed for db corruption caused by bug 1172666
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: scabral)
Details
(Whiteboard: [vm-delete:1][vm-create:1])
sheeri has dealt with these issues already:
_mysql_exceptions.InternalError: (126, "Incorrect key file for table './buildbot_schedulers/buildrequests.MYI'; try to repair
sqlalchemy.exc.InternalError: (InternalError) (126, "Incorrect key file for table './buildbot/builds.MYI'; try to repair it")
sqlalchemy.exc.IntegrityError: (IntegrityError) (1062, "Duplicate entry '3063515' for key 'ix_build_properties_property_id'") 'INSERT INTO build_properties (property_id, build_id) VALUES (%s, %s)' ((3063515L, 67596054L), (4396974L, 67596054L)) ... and a total of 28 bound parameter sets
But now we're hitting:
sqlalchemy.exc.InternalError: (InternalError) (126, "Incorrect key file for table './buildbot/steps.MYI'; try to repair it")
The trees are closed until we can submit into the buildbot database, because it blocks reporting on treeherder.
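For anyone triaging a similar pile of tracebacks, here is a minimal sketch of how the damaged tables could be pulled out of the error 126 messages above and turned into candidate `CHECK TABLE` / `REPAIR TABLE` statements. The function name `repair_statements` and the regex are illustrative assumptions, not anything that was run for this bug; the statements should be reviewed by a DBA before use, since repairing a large MyISAM table can take hours.

```python
import re

# Error 126 names the damaged MyISAM index file as './<db>/<table>.MYI'.
# This pattern is an assumption based on the messages quoted in this bug.
_KEY_FILE_RE = re.compile(r"Incorrect key file for table '\./([^/']+)/([^/'.]+)\.MYI'")


def repair_statements(log_lines):
    """Return CHECK/REPAIR statements for each distinct table named in the log lines."""
    seen = []
    for line in log_lines:
        match = _KEY_FILE_RE.search(line)
        if match and match.groups() not in seen:
            seen.append(match.groups())

    statements = []
    for db, table in seen:
        # CHECK first so the damage is confirmed before attempting a repair.
        statements.append(f"CHECK TABLE `{db}`.`{table}`;")
        statements.append(f"REPAIR TABLE `{db}`.`{table}`;")
    return statements
```

Lines that do not match (such as the duplicate-entry IntegrityError) are ignored, and a table mentioned several times is only listed once.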
Reporter
Updated•10 years ago
Group: metrics-private
Reporter
Updated•10 years ago
Severity: normal → blocker
Reporter
Comment 2•10 years ago
(from irc) While the steps table was being repaired the db crashed, and the slave is separately 'horked'. Steps is being repaired again, but this could extend for quite a while yet.
Reporter
Comment 3•10 years ago
<sheeri> nthomas: Tomcat|sheriffduty 17g complete. out of 43G db. this will be a while longer, at least 4h more
Comment 4•10 years ago
update from 5:50am Pacific:
05:49 #releng: < sheeri> Tomcat|sheriffduty: now that it's a decent time on the east coast I'll check in every hour, but yeah, it looks like another 4h
Comment 5•10 years ago
We've been talking on Vidyo and the plan we're going with right now is to use the "buildbot" (aka statsudb) database from ~18h ago and retrigger jobs to fill in any missing data. We'll use a myisam version to minimize risk. Sheeri is in the process of making that open, I'll leave it to her to fill in any details.
Another thing to mention is that I had a think-o and misunderstood this problem to be with buildbot_schedulers, which led to me shutting down the masters thinking that we'd start with a clean buildbot_schedulers database. I'm starting them back up now.
Comment 6•10 years ago
Status update: Data has finished copying over to what will be the mysql master (buildbot1, I think). Sheeri is working on getting mysql started on it, but it's crashing.
Buildbot masters are still down because mysql is not running.
Comment 7•10 years ago
Status update: couldn't get buildbot1 to come back up. buildbot2 came up, but the buildbot.steps table was screwed. I told sheeri to delete that and start with a new one.
Separately, sheeri and gcox are working on cloning buildbot2 to a new version of buildbot1 so that we can use it as the ro slave. Once cloned, we'll bring buildbot2 back up and use it as rw and ro until buildbot1 is ready to be made the ro slave.
Comment 8•10 years ago
Status update: We're attempting to bring buildbot back up with both rw and ro pointed at buildbot2 while gcox works on getting the new buildbot1 ready to be the ro slave. Things were a bit sluggish at first, but they seem OK now with one try master + the two scheduler masters running. I'm going to hold off starting more until I see how one job on this try master performs.
Comment 9•10 years ago
Status update: We're still using buildbot2 as the rw+ro master, but most of the buildbot masters are back up and jobs are slowly being processed. Some other systems that use the buildbot database might need a kick still, it's hard to tell. I'm continuing to slowly bring up more buildbot masters, and Ryan has started doing pushes to start the process of greening up the trees.
gcox is still working on bringing up buildbot1 as the new ro slave.
Comment 10•10 years ago
Update: buildbot1 is almost back up and ready to be used as ro slave. However, the buildbot_schedulers.builds table needs to be repaired, which is in progress. In the meantime some parts of buildapi are busted because of it.
Comment 11•10 years ago
Status update: everything is up and running now, and we're waiting for the backlog to clear. buildbot1 has been swapped in as the ro slave again.
A couple of misc. things still to do, not urgent:
* restart buildbot bridge
* fix up any other failing nagios checks
* mark buildrequests from yesterday as complete (lots of jobs that were running when things died yesterday are still showing in https://secure.pub.build.mozilla.org/buildapi/running)
But for the most part, we're just waiting for the new jobs that Ryan triggered to run to completion before the trees can be re-opened.
Updated•10 years ago
Whiteboard: [vm-delete:1][vm-create:1]
Comment 12•10 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Status update: everything is up and running now, and we're waiting for the
> backlog to clear. buildbot1 has been swapped in as the ro slave again.
>
> A couple of misc. things still to do, not urgent:
> * restart buildbot bridge
> * fix up any other failing nagios checks
These are done.
> * mark buildrequests from yesterday as complete (lots of jobs that were
> running when things died yesterday showing in
> https://secure.pub.build.mozilla.org/buildapi/running)
I'm struggling to come up with the correct query to do this, and I don't want to risk accidentally completing builds that are actually running. This isn't critical, so I'm putting it off for when I (or someone else) has more brains.
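One cautious way to approach this kind of cleanup is to split it into a read-only step that lists candidates and a separate step that only touches IDs someone has reviewed. The sketch below illustrates that shape and is not the query that was eventually used. The `buildrequests` columns (`id`, `complete`, `claimed_at`, `complete_at`, `results`) are a simplified, hypothetical schema, the helper names are made up, and it runs against SQLite so it can be tried without touching the real MySQL database.

```python
import sqlite3


def stale_request_ids(conn, cutoff_ts):
    """List incomplete requests claimed before cutoff_ts. Read-only: candidates for review."""
    rows = conn.execute(
        "SELECT id FROM buildrequests "
        "WHERE complete = 0 AND claimed_at != 0 AND claimed_at < ? "
        "ORDER BY id",
        (cutoff_ts,),
    ).fetchall()
    return [row[0] for row in rows]


def mark_complete(conn, ids, now_ts, results):
    """Mark only the given, already reviewed ids as complete, in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.executemany(
            "UPDATE buildrequests "
            "SET complete = 1, results = ?, complete_at = ? "
            "WHERE id = ? AND complete = 0",  # never rewrite an already-complete row
            [(results, now_ts, request_id) for request_id in ids],
        )
```

Choosing `cutoff_ts` as the time the database went down keeps anything claimed after the restart out of the candidate list, which is the main guard against completing builds that are genuinely running. The `results` code is left as a parameter because the appropriate value depends on how the installation wants these requests reported.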
Comment 13•10 years ago
The trees were mostly reopened around 10am PT. The integration branches were set to approval-only for two reasons: a large number of yesterday's pushes lack job information, so we wanted to keep close tabs on results as they came in, and we wanted to avoid a flood of pushes overwhelming our infra. At this point, only mozilla-inbound remains approval-only, and I anticipate that changing relatively soon.
Reopened inbound.
Reporter
Comment 15•10 years ago
I've cleaned up the running jobs. TBH it was almost all jobs from 7 and 15 days ago left, just 1 from yesterday.
There's a lot of test work pending, but I think that's just from opening try and world+dog pushing. Let's resolve this now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED