Closed Bug 572518 Opened 14 years ago Closed 14 years ago

AMO Cluster Needs Rebuild

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tellis, Assigned: tellis)

References

Details

(Whiteboard: 08/10/2010 @ 6pm)

Attachments

(1 file)

Tried to create a new slave of tm-amo01-master01, got this error:

100616 12:32:32 [ERROR] Slave: Error 'Cannot add or update a child row: a foreign key constraint fails (`addons_remora/addons_collections`, CONSTRAINT `addons_collections_ibfk_2` FOREIGN KEY (`collection_id`) REFERENCES `collections` (`id`))' on query. Default database: 'addons_remora'. Query: 'INSERT INTO addons_collections  (addon_id, user_id, collection_id, added) VALUES     (321, 2787551, 69531, NOW())', Error_code: 1452

Sure enough, collection 69531 doesn't exist on the backups server, but does on tm-amo01-master01. This means that any slave created from that backups server is broken. This is potentially the entire cluster.

So I need to rebuild the cluster.
Assignee: server-ops → tellis
Blocks: 560659
Group: infra
What about slaves created from the current slaves?  6 hours of AMO downtime is inconvenient and makes me sad.  Others too I bet (and how does this happen?).
I can do this in 4.5 hours of downtime. I've figured out a way. But it will take me personally much longer to complete the job (another 20-25 hours). That is probably a good tradeoff from the Moz POV.

How it happens is unclear. I don't see this sort of "corruption" on systems with good network connectivity and stable hardware. We used to have it at Friendster. I would fix it with a homegrown tablesync script I wrote specifically for this sort of problem.

Since then, a tablesync has been written as part of the Maatkit MySQL toolkit. Eventually I'll install, configure, and run it as part of the routine database maintenance so I can catch this sort of "corruption" earlier. There's no way to be sure how long the problem has existed.
I should clarify my final paragraph there. There's no way right now to tell how long the problem has existed. Once I'm running tablesync scripts on a daily basis, we will know down to a day when a problem occurs, and there will be some hope of fixing the underlying cause of the problem.
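For anyone following along, here is a rough sketch of the idea behind a daily table comparison: checksum each table in primary-key chunks on the master and on a slave, then report the chunks that disagree. This is only an illustration and not how the Maatkit tool is implemented; the function names, the chunk size, and the in-memory row lists (tuples with the integer primary key first) are all assumptions made for the example.

```python
import hashlib


def chunk_checksums(rows, chunk_size):
    """Map each chunk's starting key to a checksum of the rows in that chunk.

    `rows` is an iterable of tuples whose first element is an integer
    primary key (a hypothetical stand-in for a real table scan).
    """
    buckets = {}
    for row in sorted(rows):
        start = (row[0] // chunk_size) * chunk_size
        buckets.setdefault(start, hashlib.sha1()).update(repr(row).encode("utf-8"))
    return {start: digest.hexdigest() for start, digest in buckets.items()}


def differing_chunks(master_rows, replica_rows, chunk_size=1000):
    """Return the starting keys of chunks whose checksums differ."""
    master = chunk_checksums(master_rows, chunk_size)
    replica = chunk_checksums(replica_rows, chunk_size)
    all_starts = master.keys() | replica.keys()
    return sorted(k for k in all_starts if master.get(k) != replica.get(k))


# Example: the slave is missing the row with id 69531.
master_rows = [(69530, "a"), (69531, "b"), (70500, "c")]
replica_rows = [(69530, "a"), (70500, "c")]
print(differing_chunks(master_rows, replica_rows))
```

Run daily, a report like this narrows a divergence down to a chunk of keys and a one-day window, which is the property described above.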
> How it happens is unclear. I don't see this sort of "corruption" on systems
> with good network connectivity and stable hardware. 

We still see many "Can't connect to mysql" errors from the app every hour.  Perhaps they are related.
Is there any way we could make AMO read-only for these 4.5 - 6 hours instead of completely down? Have it run from separate db boxes off the current bad data until the normal cluster is finished?
I'd sure love that feature in general - would make failing over to Phoenix for a couple hours way easier if we can go into a degraded read-only state.
(In reply to comment #6)
> I'd sure love that feature in general - would make failing over to Phoenix for
> a couple hours way easier if we can go into a degraded read-only state.

The only writes zamboni does right now are for session stuff.  If we made a way to turn that off I think we could operate read-only.  All the developer pages for remora would be messed up, but we could serve all the public pages.

zamboni doesn't talk to master unless it's for a write, so I think we could run it on slaves.
Whiteboard: needs outage
Flags: needs-downtime+
If we disable logging in and log everyone out, that pretty much puts us in read-only mode. We should be able to do a site notice from the admin panel that explains logging in is currently disabled.
We've got a push next Tuesday that will let us get some code out to support site notices and such.

I'm not convinced we can do a read-only mode between now and then but we can certainly try, and planning this downtime for next week will give us a bit of time.
So tentatively planned for next Tuesday's window?
Whiteboard: needs outage → 06/22/2010 @ 7pm (?)
Code update is at 4pm, so any time after that would be fine with me - sounds like 7pm is a good time.
Severity: normal → critical
Whiteboard: 06/22/2010 @ 7pm (?) → 06/22/2010 @ 7pm
We'll need someone who knows what's up to write a blog post explaining the outage and when it's happening.  I also presume we can put up a static page that points to this blog post?
I will coordinate with Dave Miller on writing the blog post. Unsure about what static page goes up. I assumed we had a standard procedure for AMO outage static page...?

Addt'l info: plan is to be done by 10pm, but might take until 11pm or so.
Tim's talking about the normal IT downtime blog, not the blog Nick's talking about.
I am told it's too late for us to look into read-only mode now and that we'll have to be completely down for the whole time. What's the status of the blog post? I'd like to cross-post on the AMO blog as soon as possible.

Also, can we modify the outage page to link to the post?
I think this needs to be rescheduled for another day for two reasons:

1. same-day notice to users and developers is not acceptable for this level of downtime

2. Firefox 3.6.4 is being released at 1pm tomorrow. I don't think we can take AMO down during any Firefox release, and certainly not one that will get as much traffic/promotion as this one.
comment 16 sounds pretty reasonable.  IT?
Alright, this is off for tonight.
I concur. Just let me know the next window you think is opportune and I'll work it in.
Severity: critical → enhancement
Whiteboard: 06/22/2010 @ 7pm → [Waiting on good downtime]
Next AMO upgrade is next Tuesday, although we aren't planning any downtime.  Perhaps a weekend would be best instead of trying to coordinate with our upgrades?
We've got a rudimentary "read only" mode finished in bug 574024.  Once it's QA'd it will be on the live site next Tuesday.  This downtime can be its maiden voyage.
Severity: enhancement → critical
Whiteboard: [Waiting on good downtime] → Needs outage: 24 July 2010, NOON.
Does this draft blog post cover everything?
--
This Saturday, July 24 at noon, addons.mozilla.org will undergo a 4.5 hour maintenance period. While we will try to make many pages available as read-only, please plan for the site to be completely unavailable for the full 4.5 hour window.

For more information, please see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=572518">bug 572518</a>.
Attached file slow migrations
Here's some slow migrations we'd like to run while the site is offline.  The index in addons_collections took 27 minutes for me to run locally.

Please run these when you feel the timing is right.
(In reply to comment #24)
> Here's some slow migrations we'd like to run while the site is offline.  The
> index in addons_collections took 27 minutes for me to run locally.
> 
> Please run these when you feel the timing is right.

There is a very high probability there won't be time in that 4.5-hour window for running any slow migrations. We'll need to take another outage for that, or lengthen Saturday's outage.
Do we still think we're okay doing this tomorrow?  Still okay with a possible Fx 3.6.8 release today?
If 3.6.8 goes out today or tomorrow (which is likely) we probably shouldn't do this. Though with BlackHat and Defcon at the end of next week, we may see another release then as well.

I guess it depends on how badly this needs to be fixed. I'd really like to do it on a weekend.
Whiteboard: Needs outage: 24 July 2010, NOON. → postponted
Wait. What? Are we cancelling this or not? I have set aside 4.5 hours today to do this outage.
Caught up on email, read blog posts. I see outage is definitely cancelled. Will schedule it up, perhaps 7 Aug 2010.
Fligtar et al: When is good to have this outage? I can't do this coming weekend, so I'm hoping sometime this week during the week? After 11pm (Pacific) I could do it most days of the week.
mrz notes since we have a read-only mode we can do this during the next normal outage window. How is Thursday at 7pm? Please respond ASAP. This cluster is in a dangerous state. We can't ignore it.
no objections from me
(In reply to comment #32)
> no objections from me

MRZ objects. He says there's an AMO content push during tomorrow's outage, and that you probably don't want databases in read-only mode for ~5 hours during that time.

Is that true? If not, let's schedule the AMO rebuild for the same moment. I can do it tomorrow night if you really are on-board with it.
content push is at 4pm.  I expect to be done well before 7, but if we want to be safe we can do it another night.  /me is flexible!
Concerned about doing this after a content push.  Next Tuesday?
(In reply to comment #35)
> Concerned about doing this after a content push.  Next Tuesday?

wfm
Is this on for tomorrow night?
Works for me: Tuesday August 10, 2010. Can I start at 6pm?
Fine with me. I'll do another blog post. Is 6pm - 10:30pm in read-only mode correct?
Correct. 6pm-10:30pm Tuesday August 10th. Does anyone on this bug know how to put AMO into and take it out of read-only mode? (ie: an actual document about how to do it in the production environment)
jbalogh	‣ # Turn on read-only mode in settings_local.py by putting this line
jbalogh	‣ # at the VERY BOTTOM: read_only_mode(globals())
jbalogh	‣ then you have to push that out to all the webheads
jbalogh	‣ https://intranet.mozilla.org/UpdateAMO#Pushing_content

Pushing content

Since zamboni is hosted in a .git repo it conflicts with IT's git content push system. To resolve the problem, these are the steps for pushing zamboni content live:

Make changes @ mradm02:/data/amo_python/src e.g., preview is @ mradm02:/data/amo_python/src/preview/zamboni

Run /data/bin/omg_push_zamboni_live.sh

This will basically:
rsync src/ to www/ excluding .git and overwrite everything in www/.

commit all changes to our push system
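To illustrate what a call like read_only_mode(globals()) at the bottom of a settings file might do, here is a minimal sketch. It is not the actual zamboni function: the name read_only_mode_sketch and the READ_ONLY and SLAVE_DATABASES keys are assumptions for the example. The general idea is that the function mutates the settings namespace after everything else has been defined, which is presumably why the call has to come last.

```python
def read_only_mode_sketch(env):
    """Hypothetical helper: rewrite a settings namespace for read-only operation.

    `env` is a dict such as the one returned by globals() in a settings
    module. All key names here are assumptions for illustration.
    """
    env["READ_ONLY"] = True
    slaves = env.get("SLAVE_DATABASES") or []
    if not slaves:
        raise ValueError("read-only mode needs at least one slave database")
    # Point the default connection at the first slave so that no query
    # is sent to the master while it is being rebuilt.
    env["DATABASES"]["default"] = env["DATABASES"][slaves[0]]
    return env


# Example with a plain dict standing in for the settings module namespace.
settings = {
    "DATABASES": {
        "default": {"HOST": "master.example"},
        "slave1": {"HOST": "slave.example"},
    },
    "SLAVE_DATABASES": ["slave1"],
}
read_only_mode_sketch(settings)
```

Because the helper overwrites whatever "default" was set to earlier in the file, any settings defined after the call would not be covered by it.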
Whiteboard: postponted → postponed
Whiteboard: postponed → 08/10/2010 @ 6pm
FYI, because metrics attempts to insert data into the AMO master during our end-of-day procedure, we'll have to suspend that process for tonight and run it tomorrow instead.

That means that AMO developers won't see an update to their stats dashboards until tomorrow evening.
Please disable amo's crontab during the outage: https://intranet.mozilla.org/UpdateAMO#Cronjobs
For tonight's outage, the read-only mode will only affect python pages, which don't include services like update check and blocklist. Is there a way to get Zeus to serve cached pages for the duration of the outage to minimize that?
(In reply to comment #45)
> For tonight's outage, the read-only mode will only affect python pages, which
> don't include services like update check and blocklist. Is there a way to get
> Zeus to serve cached pages for the duration of the outage to minimize that?

dmoore says this isn't possible for tonight.
I'm okay with versioncheck posting a blank update snippet for 5 hours.  The blocklist service should not have downtime if it's avoidable.  It'd be worth delaying this outage window if we can find a way to avoid downtime for two core Firefox services.

Mostly, I'd like to know what's possible, what's not and what we could do to help with an interim solution. The versioncheck and blocklist stuff could be solved by serving static XML files in both cases or pre-populating a cache based off of replayed logs then serving directly out of memory for a specified block of time.

Can someone explain why it isn't possible?  Is Zeus going down?  What's the constraint?
Probably late for this, but zeus can serve files itself statically:
http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver

In both cases the service needs just a static file to avoid serving 404s.

Versioncheck:
http://people.mozilla.org/~morgamic/versioncheck.xml

Blocklist:
http://people.mozilla.org/~morgamic/blocklist.xml

Maybe something for next time.
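As a rough illustration of the static-file fallback, the same effect could be had from a tiny WSGI app that always returns one fixed document with a cache lifetime. This is a sketch only and is not Zeus configuration; the function name, the placeholder XML body, and the five-hour lifetime are assumptions for the example.

```python
def make_static_app(body, max_age_seconds, content_type="text/xml"):
    """Build a WSGI app that always answers with `body` and a cache lifetime."""
    payload = body.encode("utf-8")
    headers = [
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
        ("Cache-Control", "public, max-age=%d" % max_age_seconds),
    ]

    def app(environ, start_response):
        # Ignore the request path entirely: every request gets the same file.
        start_response("200 OK", headers)
        return [payload]

    return app


# Hypothetical usage: serve a placeholder document cacheable for 5 hours.
fallback_app = make_static_app("<placeholder/>", max_age_seconds=5 * 60 * 60)
```

An app like this can be run under any WSGI server (for example wsgiref.simple_server from the standard library) and needs no database connection at all.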
I don't think this has started yet - the site is still r/w.
These hosts rebuilt from tm-amo01-master01:

tm-amo02-master01
tm-amo01-slave01
tm-amo01-slave02
tm-amo01-slave03
tm-amo01-slave04

These need to be done still:

tm-amo02-slave*: need to create proper my.cnf first so these errors won't happen anymore:

100810 20:36:26 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.

And others. The slave threads won't start.

tm-backup02: need to figure out why there isn't enough disk space:

innodb.db
 17804034048  79%   15.77MB/s    0:04:43
rsync: write failed on "/data/amo01-innodb/innodb.db": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
rsync: connection unexpectedly closed (96 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]

[root@tm-backup02 22:00:42 /data]
:) l
total 28
drwxr-xr-x 6 mysql mysql 4096 Aug 10 21:57 amo01/
drwxr-xr-x 2 mysql mysql 4096 Aug 10 21:51 amo01-innodb/
[root@tm-backup02 22:00:43 /data]
:) df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                         77247     55956     17305  77% /

17GB isn't even close to enough disk space.
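A pre-flight check along these lines would have caught this before the copy started. It is a minimal sketch: the function names and the 10% safety margin are assumptions, and the required size below is only an estimate derived from the rsync progress line above (17804034048 bytes at 79%).

```python
import shutil


def has_room(required_bytes, free_bytes, margin=0.10):
    """True if free space covers the required size plus a safety margin."""
    return free_bytes >= required_bytes * (1 + margin)


def check_target(path, required_bytes, margin=0.10):
    """Check the filesystem holding `path` before starting a large copy."""
    free = shutil.disk_usage(path).free
    return has_room(required_bytes, free, margin)


# Estimate of the full innodb.db size from the rsync progress output.
estimated_size = int(17804034048 / 0.79)
# Available space reported by df -m, converted from 1M-blocks to bytes.
available = 17305 * 1024 * 1024
print(has_room(estimated_size, available))
```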
> Can someone explain why it isn't possible?  Is Zeus going down?  What's the
> constraint?

Time.  Lack of time to spin up whatever alternative Zeus config, test, and be ready.  Should be something we do for next time (or even just because).
(In reply to comment #48)
> Probably late for this, but zeus can serve files itself statically:
> http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver
> 
> In both cases the service needs just a static file to avoid serving 404s.

I like this idea mostly from a load shed perspective.  If we needed to shed DB load, this feels like a win.  Can you invent a knob that makes AMO spit out a static page with a cache life of days?  Post load shed, we purge that object and turn the knob back.

I'll buy you a cookie.
Sure, but if we could force it with Zeus it'd work for non-AMO stuff, which would be helpful if we had to do it for another site.
So is the downtime complete (for now at least)? Is AMO back up and running?

We're going to be unthrottling a major update for Thunderbird today, so we need AMO to be working for ensuring add-ons are up to date.
Production AMO has been having a lot of trouble connecting to MySQL recently, but it looks like Nagios is complaining about it so I assume it's being worked on?  ->blocker
Severity: critical → blocker
Also need to rebuild the entirety of Phoenix AMO. Am doing the backups slave now to facilitate that.

@Mark/Wil: the outage is finished and was done on time from your POV. Let me know if you see any evidence to the contrary.
(In reply to comment #56)
> @Mark/Wil: the outage is finished and was done on time from your POV. Let me
> know if you see some evidence contrary.

Assuming we're not using the phoenix cluster at all, I assume this is causing our troubles:

tm-amo01-slave04:MySQL is CRITICAL: Cant connect to MySQL server on 10.2.70.15
Done rebuilding Phx.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard