Bug 572518 - AMO Cluster Needs Rebuild
Status: RESOLVED FIXED
Whiteboard: 08/10/2010 @ 6pm
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All Other
Importance: -- blocker
Target Milestone: ---
Assigned To: timellis
QA Contact: matthew zeier [:mrz]
Mentors:
Duplicates: 587365
Depends on:
Blocks: 560659 581300
Reported: 2010-06-16 13:54 PDT by timellis
Modified: 2015-03-12 08:17 PDT
CC List: 16 users
mzeier: needs‑downtime+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
slow migrations (446 bytes, text/plain)
2010-07-20 10:09 PDT, Jeff Balogh (:jbalogh)
no flags

Description timellis 2010-06-16 13:54:12 PDT
Tried to create a new slave of tm-amo01-master01, got this error:

100616 12:32:32 [ERROR] Slave: Error 'Cannot add or update a child row: a foreign key constraint fails (`addons_remora/addons_collections`, CONSTRAINT `addons_collections_ibfk_2` FOREIGN KEY (`collection_id`) REFERENCES `collections` (`id`))' on query. Default database: 'addons_remora'. Query: 'INSERT INTO addons_collections  (addon_id, user_id, collection_id, added) VALUES     (321, 2787551, 69531, NOW())', Error_code: 1452

Sure enough, collection 69531 doesn't exist on the backups server, but it does on tm-amo01-master01. This means that any slave created from that backups server is broken. This is potentially the entire cluster.

So I need to rebuild the cluster.
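
A check along these lines could confirm which parent rows a copy is missing. This is only a sketch: it uses in-memory SQLite databases as stand-ins for the MySQL master and the backups server, and the helper names `make_db` and `missing_ids` are hypothetical, not part of any tool mentioned in this bug.

```python
import sqlite3


def make_db(ids):
    """Build a stand-in database holding a `collections` table with the given ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE collections (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO collections (id) VALUES (?)", [(i,) for i in ids])
    return conn


def missing_ids(master, replica, table="collections"):
    """Return ids present on the master but absent on the replica, sorted."""
    # `table` is a trusted constant here; never interpolate untrusted input.
    query = f"SELECT id FROM {table}"
    master_ids = {row[0] for row in master.execute(query)}
    replica_ids = {row[0] for row in replica.execute(query)}
    return sorted(master_ids - replica_ids)


# Stand-ins: the master has collection 69531, the backup copy does not.
master = make_db([69529, 69530, 69531])
backup = make_db([69529, 69530])
print(missing_ids(master, backup))
```

Any child row inserted on the master that references one of the reported ids would fail on a slave built from the backup with the foreign key error shown above.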
Comment 1 matthew zeier [:mrz] 2010-06-16 14:34:42 PDT
What about slaves created from the current slaves?  6 hours of AMO downtime is inconvenient and makes me sad.  Others too I bet (and how does this happen?).
Comment 2 timellis 2010-06-16 14:47:08 PDT
I can do this in 4.5 hours of downtime. I've figured out a way. But it will take me personally much longer to complete the job (another 20-25 hours). That is probably a good tradeoff from the Moz POV.

How it happens is unclear. I don't see this sort of "corruption" on systems with good network connectivity and stable hardware. We used to have it at Friendster. I would fix it with a homegrown tablesync script I wrote specifically for this sort of problem.

Since then, a tablesync has been written as part of the Maatkit MySQL toolkit. Eventually I'll install, configure, and run it as part of the routine database maintenance so I can catch this sort of "corruption" earlier. There's no way to be sure how long the problem has existed.
Comment 3 timellis 2010-06-16 14:48:41 PDT
I should clarify my final paragraph there. There's no way right now to tell how long the problem has existed. Once I'm running tablesync scripts on a daily basis, we will know down to a day when a problem occurs, and there will be some hope of fixing the underlying cause of the problem.
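
The idea behind a daily check can be sketched as per-chunk checksums compared between master and slave. This is an illustration only, not Maatkit's implementation: the function names are hypothetical, and the rows are plain tuples (id first) rather than results fetched from MySQL.

```python
import hashlib


def chunk_checksums(rows, chunk_size=1000):
    """Map each id range (id // chunk_size) to a digest of the rows in it."""
    hashers = {}
    for row in sorted(rows):
        bucket = row[0] // chunk_size
        hashers.setdefault(bucket, hashlib.sha1()).update(repr(row).encode("utf-8"))
    return {bucket: h.hexdigest() for bucket, h in hashers.items()}


def differing_chunks(master_rows, slave_rows, chunk_size=1000):
    """Return the id-range buckets whose checksums disagree between the two copies."""
    a = chunk_checksums(master_rows, chunk_size)
    b = chunk_checksums(slave_rows, chunk_size)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Toy data: the slave is missing the row with id 7.
master_rows = [(i, "collection-%d" % i) for i in range(10)]
slave_rows = [row for row in master_rows if row[0] != 7]
print(differing_chunks(master_rows, slave_rows, chunk_size=5))
```

Bucketing by id range rather than by row position keeps a single missing row from shifting every later chunk, so only the affected range is reported and needs a row-by-row comparison.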
Comment 4 Wil Clouser [:clouserw] 2010-06-16 14:49:25 PDT
> How it happens is unclear. I don't see this sort of "corruption" on systems
> with good network connectivity and stable hardware. 

We still see many "Can't connect to mysql" errors from the app every hour.  Perhaps they are related.
Comment 5 Justin Scott [:fligtar] 2010-06-16 14:54:07 PDT
Is there any way we could make AMO read-only for these 4.5 - 6 hours instead of completely down? Have it run from separate db boxes off the current bad data until the normal cluster is finished?
Comment 6 matthew zeier [:mrz] 2010-06-16 14:56:36 PDT
I'd sure love that feature in general - would make failing over to Phoenix for a couple hours way easier if we can go into a degraded read-only state.
Comment 7 Jeff Balogh (:jbalogh) 2010-06-16 15:00:54 PDT
(In reply to comment #6)
> I'd sure love that feature in general - would make failing over to Phoenix for
> a couple hours way easier if we can go into a degraded read-only state.

The only writes zamboni does right now are for session stuff.  If we made a way to turn that off I think we could operate read-only.  All the developer pages for remora would be messed up, but we could serve all the public pages.

zamboni doesn't talk to master unless it's for a write, so I think we could run it on slaves.
Comment 8 Justin Scott [:fligtar] 2010-06-16 16:15:25 PDT
If we disable logging in and log everyone out, that pretty much puts us in read-only mode. We should be able to do a site notice from the admin panel that explains logging in is currently disabled.
Comment 9 Wil Clouser [:clouserw] 2010-06-16 22:05:42 PDT
We've got a push next Tuesday that will let us get some code out to support site notices and such.

I'm not convinced we can do a read-only mode between now and then but we can certainly try, and planning this downtime for next week will give us a bit of time.
Comment 10 matthew zeier [:mrz] 2010-06-17 09:10:08 PDT
So tentatively planned for next Tuesday's window?
Comment 11 Wil Clouser [:clouserw] 2010-06-17 10:58:05 PDT
Code update is at 4pm, so any time after that would be fine with me - sounds like 7pm is a good time.
Comment 12 Nick Nguyen [:osunick] 2010-06-17 13:52:20 PDT
We'll need someone who knows what's up to write a blog post explaining the outage and when it's happening.  I also presume we can put up a static page that points to this blog post?
Comment 13 timellis 2010-06-17 14:17:29 PDT
I will coordinate with Dave Miller on writing the blog post. Unsure about what static page goes up. I assumed we had a standard procedure for AMO outage static page...?

Addt'l info: plan is to be done by 10pm, but might take until 11pm or so.
Comment 14 matthew zeier [:mrz] 2010-06-17 14:18:52 PDT
Tim's talking about the normal IT downtime blog, not the blog Nick's talking about.
Comment 15 Justin Scott [:fligtar] 2010-06-21 17:03:01 PDT
I am told it's too late for us to look into read-only mode now and that we'll have to be completely down for the whole time. What's the status of the blog post? I'd like to cross-post on the AMO blog as soon as possible.

Also, can we modify the outage page to link to the post?
Comment 16 Justin Scott [:fligtar] 2010-06-21 20:04:37 PDT
I think this needs to be rescheduled for another day for two reasons:

1. same-day notice to users and developers is not acceptable for this level of downtime

2. Firefox 3.6.4 is being released at 1pm tomorrow. I don't think we can take AMO down during any Firefox release, and certainly not one that will get as much traffic/promotion as this one.
Comment 17 Wil Clouser [:clouserw] 2010-06-22 10:29:58 PDT
comment 16 sounds pretty reasonable.  IT?
Comment 18 Wil Clouser [:clouserw] 2010-06-22 15:26:19 PDT
Alright, this is off for tonight.
Comment 19 timellis 2010-06-22 16:30:15 PDT
I concur. Just let me know the next window you think is opportune and I'll work it in.
Comment 20 Wil Clouser [:clouserw] 2010-06-23 23:58:31 PDT
Next AMO upgrade is next Tuesday, although we aren't planning any downtime.  Perhaps a weekend would be best instead of trying to coordinate with our upgrades?
Comment 21 Wil Clouser [:clouserw] 2010-06-25 14:43:07 PDT
We've got a rudimentary "read only" mode finished in bug 574024.  Once it's QA'd it will be on the live site next Tuesday.  This downtime can be its maiden voyage.
Comment 22 Justin Scott [:fligtar] 2010-07-19 10:47:15 PDT
Does this draft blog post cover everything?
--
This Saturday, July 24 at noon, addons.mozilla.org will undergo a 4.5 hour maintenance period. While we will try to make many pages available as read-only, please plan for the site to be completely unavailable for the full 4.5 hour window.

For more information, please see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=572518">bug 572518</a>.
Comment 24 Jeff Balogh (:jbalogh) 2010-07-20 10:09:06 PDT
Created attachment 458700
slow migrations

Here's some slow migrations we'd like to run while the site is offline.  The index in addons_collections took 27 minutes for me to run locally.

Please run these when you feel the timing is right.
Comment 25 timellis 2010-07-22 10:55:12 PDT
(In reply to comment #24)
> Here's some slow migrations we'd like to run while the site is offline.  The
> index in addons_collections took 27 minutes for me to run locally.
> 
> Please run these when you feel the timing is right.

There is a very high probability there won't be time in that 4.5-hour window for running any slow migrations. We'll need to take another outage for that, or lengthen Saturday's outage.
Comment 26 matthew zeier [:mrz] 2010-07-23 09:13:18 PDT
Do we still think we're okay doing this tomorrow?  Still okay with a possible Fx 3.6.8 release today?
Comment 27 Justin Scott [:fligtar] 2010-07-23 10:46:17 PDT
If 3.6.8 goes out today or tomorrow (which is likely) we probably shouldn't do this. Though with BlackHat and Defcon at the end of next week, we may see another release then as well.

I guess it depends on how badly this needs to be fixed. I'd really like to do it on a weekend.
Comment 28 timellis 2010-07-24 11:43:25 PDT
Wait. What? Are we cancelling this or not? I have set aside 4.5 hours today to do this outage.
Comment 29 timellis 2010-07-24 11:51:04 PDT
Caught up on email, read blog posts. I see outage is definitely cancelled. Will schedule it up, perhaps 7 Aug 2010.
Comment 30 timellis 2010-08-02 14:41:14 PDT
Fligtar et al: When is good to have this outage? I can't do this coming weekend, so I'm hoping sometime this week during the week? After 11pm (Pacific) I could do it most days of the week.
Comment 31 timellis 2010-08-03 17:30:06 PDT
mrz notes since we have a read-only mode we can do this during the next normal outage window. How is Thursday at 7pm? Please respond ASAP. This cluster is in a dangerous state. We can't ignore it.
Comment 32 Wil Clouser [:clouserw] 2010-08-03 22:12:38 PDT
no objections from me
Comment 33 timellis 2010-08-04 11:20:22 PDT
(In reply to comment #32)
> no objections from me

MRZ objects. He says there's an AMO content push during tomorrow's outage, and that you probably don't want databases in read-only mode for ~5 hours during that time.

Is that true? If not, let's schedule the AMO rebuild for the same moment. I can do it tomorrow night if you really are on-board with it.
Comment 34 Wil Clouser [:clouserw] 2010-08-04 11:23:37 PDT
content push is at 4pm.  I expect to be done well before 7, but if we want to be safe we can do it another night.  /me is flexible!
Comment 35 matthew zeier [:mrz] 2010-08-04 11:28:19 PDT
Concerned about doing this after a content push.  Next Tuesday?
Comment 36 Wil Clouser [:clouserw] 2010-08-04 12:43:47 PDT
(In reply to comment #35)
> Concerned about doing this after a content push.  Next Tuesday?

wfm
Comment 37 Justin Scott [:fligtar] 2010-08-09 13:06:46 PDT
Is this on for tomorrow night?
Comment 38 timellis 2010-08-09 13:12:02 PDT
Works for me: Tuesday August 10, 2010. Can I start at 6pm?
Comment 39 Justin Scott [:fligtar] 2010-08-09 13:14:05 PDT
Fine with me. I'll do another blog post. Is 6pm - 10:30pm in read-only mode correct?
Comment 40 timellis 2010-08-09 13:56:59 PDT
Correct. 6pm-10:30pm Tuesday August 10th. Does anyone on this bug know how to put AMO into and take it out of read-only mode? (ie: an actual document about how to do it in the production environment)
Comment 41 timellis 2010-08-09 14:05:11 PDT
jbalogh	‣ # Turn on read-only mode in settings_local.py by putting this line
jbalogh	‣ # at the VERY BOTTOM: read_only_mode(globals())
jbalogh	‣ then you have to push that out to all the webheads
jbalogh	‣ https://intranet.mozilla.org/UpdateAMO#Pushing_content

Pushing content

Since zamboni is hosted in a git repo, it conflicts with IT's git content push system. To work around the problem, these are the steps for pushing zamboni content live:

Make changes @ mradm02:/data/amo_python/src e.g., preview is @ mradm02:/data/amo_python/src/preview/zamboni

Run /data/bin/omg_push_zamboni_live.sh

This will basically:
rsync src/ to www/ excluding .git and overwrite everything in www/.

commit all changes to our push system
Comment 42 Jeff Balogh (:jbalogh) 2010-08-09 14:14:01 PDT
https://intranet.mozilla.org/UpdateAMO#Read-only_Mode
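
For readers without access to that page, here is a rough sketch of what a `read_only_mode(globals())` call at the bottom of settings_local.py might do. The real function lives in zamboni and may differ; the setting names `READ_ONLY`, `DATABASES`, and `SLAVE_DATABASES` and the host names below are assumptions made for illustration.

```python
def read_only_mode(env):
    """Hypothetical sketch: adjust a settings namespace so the app avoids writes."""
    env["READ_ONLY"] = True
    slaves = env.get("SLAVE_DATABASES")
    if not slaves:
        raise ValueError("read-only mode needs at least one slave database")
    # Point the default connection at a slave so no query reaches the master.
    env["DATABASES"]["default"] = env["DATABASES"][slaves[0]]
    return env


# Stand-in for the module namespace that settings_local.py would pass via globals().
settings = {
    "DATABASES": {
        "default": {"HOST": "master.example"},
        "slave1": {"HOST": "slave.example"},
    },
    "SLAVE_DATABASES": ["slave1"],
}
read_only_mode(settings)
print(settings["DATABASES"]["default"]["HOST"])
```

Because the function receives the module's own namespace, calling it at the very bottom presumably ensures it sees the final values of every setting defined above it.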
Comment 43 Daniel Einspanjer [:dre] [:deinspanjer] 2010-08-10 11:55:48 PDT
FYI, because metrics attempts to insert data into the AMO master during our end-of-day procedure, we'll have to suspend that process for tonight and run it tomorrow instead.

That means that AMO developers won't see an update to their stats dashboards until tomorrow evening.
Comment 44 Jeff Balogh (:jbalogh) 2010-08-10 11:59:15 PDT
Please disable amo's crontab during the outage: https://intranet.mozilla.org/UpdateAMO#Cronjobs
Comment 45 Justin Scott [:fligtar] 2010-08-10 12:55:42 PDT
For tonight's outage, the read-only mode will only affect python pages, which don't include services like update check and blocklist. Is there a way to get Zeus to serve cached pages for the duration of the outage to minimize that?
Comment 46 Justin Scott [:fligtar] 2010-08-10 16:31:59 PDT
(In reply to comment #45)
> For tonight's outage, the read-only mode will only affect python pages, which
> don't include services like update check and blocklist. Is there a way to get
> Zeus to serve cached pages for the duration of the outage to minimize that?

dmoore says this isn't possible for tonight.
Comment 47 Michael Morgan [:morgamic] 2010-08-10 17:45:36 PDT
I'm okay with versioncheck posting a blank update snippet for 5 hours.  The blocklist service should not have downtime if it's avoidable.  It'd be worth delaying this outage window if we can find a way to avoid downtime for two core Firefox services.

Mostly, I'd like to know what's possible, what's not and what we could do to help with an interim solution. The versioncheck and blocklist stuff could be solved by serving static XML files in both cases or pre-populating a cache based off of replayed logs then serving directly out of memory for a specified block of time.

Can someone explain why it isn't possible?  Is Zeus going down?  What's the constraint?
Comment 48 Michael Morgan [:morgamic] 2010-08-10 17:57:32 PDT
Probably late for this, but zeus can serve files itself statically:
http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver

In both cases the service needs just a static file to avoid serving 404s.

Versioncheck:
http://people.mozilla.org/~morgamic/versioncheck.xml

Blocklist:
http://people.mozilla.org/~morgamic/blocklist.xml

Maybe something for next time.
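
The static-file fallback can be sketched as a tiny WSGI app that always returns a fixed XML body with a long cache lifetime. This is not a Zeus configuration and not AMO code: the function names, the placeholder XML body, and the five-hour lifetime are all assumptions for illustration.

```python
# Placeholder body; a real fallback would serve the actual static XML file.
PLACEHOLDER_XML = b'<?xml version="1.0" encoding="UTF-8"?>\n<placeholder/>\n'


def make_static_app(body, max_age_seconds):
    """Return a WSGI app that serves `body` for every request with a cache lifetime."""
    def app(environ, start_response):
        headers = [
            ("Content-Type", "text/xml; charset=utf-8"),
            ("Content-Length", str(len(body))),
            ("Cache-Control", "public, max-age=%d" % max_age_seconds),
        ]
        start_response("200 OK", headers)
        return [body]
    return app


def call(app, path="/"):
    """Invoke a WSGI app directly and return (status, headers dict, body bytes)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body


app = make_static_app(PLACEHOLDER_XML, max_age_seconds=5 * 60 * 60)
status, headers, body = call(app)
print(status, headers["Cache-Control"])
```

Serving a fixed body this way means the service answers with a 200 instead of a 404 while the database is unavailable, and upstream caches can absorb repeat requests for the length of the window.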
Comment 49 Wil Clouser [:clouserw] 2010-08-10 18:13:41 PDT
I don't think this has started yet - the site is still r/w.
Comment 50 timellis 2010-08-10 22:05:33 PDT
These hosts rebuilt from tm-amo01-master01:

tm-amo02-master01
tm-amo01-slave01
tm-amo01-slave02
tm-amo01-slave03
tm-amo01-slave04

These need to be done still:

tm-amo02-slave*: need to create proper my.cnf first so these errors won't happen anymore:

100810 20:36:26 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.

And others. The slave threads won't start.

tm-backup02: need to figure out why there isn't enough disk space:

innodb.db
 17804034048  79%   15.77MB/s    0:04:43
rsync: write failed on "/data/amo01-innodb/innodb.db": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
rsync: connection unexpectedly closed (96 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]

[root@tm-backup02 22:00:42 /data]
:) l
total 28
drwxr-xr-x 6 mysql mysql 4096 Aug 10 21:57 amo01/
drwxr-xr-x 2 mysql mysql 4096 Aug 10 21:51 amo01-innodb/
[root@tm-backup02 22:00:43 /data]
:) df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                         77247     55956     17305  77% /

17GB isn't even close to enough disk space.
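
The shortfall can be estimated from the rsync progress line and the df output above. This is a rough calculation: rsync's percentage is rounded, and it is not clear from the output whether the partial file still occupied space when df was run.

```python
BYTES_PER_MIB = 1024 * 1024


def estimated_total_bytes(bytes_done, percent_done):
    """Extrapolate the full file size from rsync's progress figures."""
    return bytes_done * 100 / percent_done


transferred_bytes = 17804034048  # bytes written when rsync failed
percent_done = 79                # progress reported by rsync at that point
available_mib = 17305            # "Available" column from `df -m .`

total_mib = estimated_total_bytes(transferred_bytes, percent_done) / BYTES_PER_MIB
shortfall_mib = total_mib - available_mib

print("estimated innodb.db size: %.0f MiB" % total_mib)
print("estimated shortfall: %.0f MiB" % shortfall_mib)
```

By this estimate innodb.db alone is roughly 21 GiB, a few GiB more than the free space on the volume, before counting anything else that has to be copied.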
Comment 51 matthew zeier [:mrz] 2010-08-10 22:50:35 PDT
> Can someone explain why it isn't possible?  Is Zeus going down?  What's the
> constraint?

Time.  Lack of time to spin up whatever alternative Zeus config, test, and be ready.  Should be something we do for next time (or even just because).
Comment 52 matthew zeier [:mrz] 2010-08-10 23:00:27 PDT
(In reply to comment #48)
> Probably late for this, but zeus can serve files itself statically:
> http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver
> 
> In both cases the service needs just a static file to avoid serving 404s.

I like this idea mostly from a load shed perspective.  If we needed to shed DB load, this feels like a win.  Can you invent a knob that makes AMO spit out a static page with a cache life of days?  Post load shed, we purge that object and turn the knob back.

I'll buy you a cookie.
Comment 53 Michael Morgan [:morgamic] 2010-08-10 23:34:00 PDT
Sure, but if we could force it with Zeus it'd work for non-AMO stuff, which would be helpful if we had to do it for another site.
Comment 54 Mark Banner (:standard8) 2010-08-11 01:13:46 PDT
So is the downtime complete (for now at least)? Is AMO back up and running?

We're going to be unthrottling a major update for Thunderbird today, so we need AMO to be working for ensuring add-ons are up to date.
Comment 55 Wil Clouser [:clouserw] 2010-08-11 09:51:16 PDT
Production AMO has been having a lot of trouble connecting to MySQL recently, but it looks like Nagios is complaining about it so I assume it's being worked on?  ->blocker
Comment 56 timellis 2010-08-11 10:13:28 PDT
Also need to rebuild the entirety of Phoenix AMO. Am doing the backups slave now to facilitate that.

@Mark/Wil: the outage is finished and was done on time from your POV. Let me know if you see any evidence to the contrary.
Comment 57 Wil Clouser [:clouserw] 2010-08-11 10:22:33 PDT
(In reply to comment #56)
> @Mark/Wil: the outage is finished and was done on time from your POV. Let me
> know if you see some evidence contrary.

Assuming we're not using the phoenix cluster at all, I assume this is causing our troubles:

tm-amo01-slave04:MySQL is CRITICAL: Cant connect to MySQL server on 10.2.70.15
Comment 58 timellis 2010-08-16 09:10:19 PDT
*** Bug 587365 has been marked as a duplicate of this bug. ***
Comment 59 timellis 2010-08-17 11:03:23 PDT
Done rebuilding Phx.
