572518 - AMO Cluster Needs Rebuild

Assignee

Description

•

14 years ago

Tried to create a new slave of tm-amo01-master01, got this error:

100616 12:32:32 [ERROR] Slave: Error 'Cannot add or update a child row: a foreign key constraint fails (`addons_remora/addons_collections`, CONSTRAINT `addons_collections_ibfk_2` FOREIGN KEY (`collection_id`) REFERENCES `collections` (`id`))' on query. Default database: 'addons_remora'. Query: 'INSERT INTO addons_collections  (addon_id, user_id, collection_id, added) VALUES     (321, 2787551, 69531, NOW())', Error_code: 1452

Sure enough, collections 69531 doesn't exist on the backups server, but does on tm-amo01-master01. This means that any slave created from that backups server is broken. This is potentially the entire cluster.

So I need to rebuild the cluster.
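
A minimal sketch of the check described above: list the primary keys that exist on the master but not on the copy a slave was built from. It uses two in-memory SQLite databases as stand-ins for the MySQL hosts; the function name `missing_ids` and the demo data are assumptions for illustration, and only the `collections` table name and id 69531 come from the error message.

```python
import sqlite3


def missing_ids(master, replica, table, key="id"):
    """Return keys present in `table` on master but absent on replica.

    `table` and `key` are interpolated into SQL, so pass trusted names only.
    """
    query = f"SELECT {key} FROM {table}"
    master_ids = {row[0] for row in master.execute(query)}
    replica_ids = {row[0] for row in replica.execute(query)}
    return sorted(master_ids - replica_ids)


def demo():
    # Stand-ins for the master and the backups server (hypothetical data).
    master = sqlite3.connect(":memory:")
    replica = sqlite3.connect(":memory:")
    for conn in (master, replica):
        conn.execute("CREATE TABLE collections (id INTEGER PRIMARY KEY)")
        conn.execute("INSERT INTO collections (id) VALUES (1)")
    # The parent row exists only on the master, as in the report.
    master.execute("INSERT INTO collections (id) VALUES (69531)")
    return missing_ids(master, replica, "collections")
```

Any id this returns is a parent row that a replicated child INSERT would fail against on the slave, which matches error 1452 above.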

timellis

Assignee

Updated

•

14 years ago

Assignee: server-ops → tellis

timellis

Assignee

Updated

•

14 years ago

Blocks: 560659

timellis

Assignee

Updated

•

14 years ago

Group: infra

matthew zeier [:mrz]

Comment 1

•

14 years ago

What about slaves created from the current slaves?  6 hours of AMO downtime is inconvenient and makes me sad.  Others too I bet (and how does this happen?).

timellis

Assignee

Comment 2

•

14 years ago

I can do this in 4.5 hours of downtime. I've figured out a way. But it will take me personally much longer to complete the job (another 20-25 hours). That is probably a good tradeoff from the Moz POV.

How it happens is unclear. I don't see this sort of "corruption" on systems with good network connectivity and stable hardware. We used to have it at Friendster. I would fix it with a homegrown tablesync script I wrote specifically for this sort of problem.

Since then, a tablesync has been written as part of the Maatkit MySQL toolkit. Eventually I'll install, configure, and run it as part of the routine database maintenance so I can catch this sort of "corruption" earlier. There's no way to be sure how long the problem has existed.
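
To illustrate the general idea behind a table comparison of this kind (this is not how the Maatkit tool works internally, just a sketch under stated assumptions): checksum rows in primary-key buckets on each host, then compare the per-bucket digests so that only the differing key ranges need a closer look. Rows are assumed to be tuples whose first element is an integer primary key; all names here are hypothetical.

```python
import hashlib


def bucket_checksums(rows, bucket_size=1000):
    """Map each primary-key bucket to a digest of the rows that fall in it."""
    hashers = {}
    for row in sorted(rows):
        bucket = row[0] // bucket_size
        hasher = hashers.setdefault(bucket, hashlib.sha1())
        hasher.update(repr(row).encode("utf-8"))
    return {bucket: h.hexdigest() for bucket, h in hashers.items()}


def differing_buckets(master_rows, replica_rows, bucket_size=1000):
    """Return the buckets whose contents differ between the two hosts."""
    master = bucket_checksums(master_rows, bucket_size)
    replica = bucket_checksums(replica_rows, bucket_size)
    return sorted(
        bucket
        for bucket in master.keys() | replica.keys()
        if master.get(bucket) != replica.get(bucket)
    )
```

Bucketing by key range rather than by row count keeps the buckets aligned even when one side is missing rows. Run daily, a comparison like this would narrow the onset of a discrepancy to a one-day window.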

timellis

Assignee

Comment 3

•

14 years ago

I should clarify my final paragraph there. There's no way right now to tell how long the problem has existed. Once I'm running tablesync scripts on a daily basis, we will know down to a day when a problem occurs, and there will be some hope of fixing the underlying cause of the problem.

Wil Clouser [:clouserw]

Comment 4

•

14 years ago

> How it happens is unclear. I don't see this sort of "corruption" on systems
> with good network connectivity and stable hardware. 

We still see many "Can't connect to mysql" errors from the app every hour.  Perhaps they are related.

Justin Scott [:fligtar]

Comment 5

•

14 years ago

Is there any way we could make AMO read-only for these 4.5 - 6 hours instead of completely down? Have it run from separate db boxes off the current bad data until the normal cluster is finished?

matthew zeier [:mrz]

Comment 6

•

14 years ago

I'd sure love that feature in general - would make failing over to Phoenix for a couple hours way easier if we can go into a degraded read-only state.

Jeff Balogh (:jbalogh)

Comment 7

•

14 years ago

(In reply to comment #6)
> I'd sure love that feature in general - would make failing over to Phoenix for
> a couple hours way easier if we can go into a degraded read-only state.

The only writes zamboni does right now are for session stuff.  If we made a way to turn that off I think we could operate read-only.  All the developer pages for remora would be messed up, but we could serve all the public pages.

zamboni doesn't talk to master unless it's for a write, so I think we could run it on slaves.
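
One way the "reads go to slaves, writes go to master" split can be expressed in Django is a database router. The thread does not show how zamboni actually does it, so the class below is only a sketch: `db_for_read` and `db_for_write` are Django's router hook names, while the class name and the `slave1` alias are assumptions.

```python
import random


class MasterSlaveRouter:
    """Send reads to a slave alias and writes to the master alias."""

    def __init__(self, master="default", slaves=("slave1",)):
        self.master = master
        self.slaves = tuple(slaves)

    def db_for_read(self, model, **hints):
        # Fall back to the master if no slave alias is configured.
        return random.choice(self.slaves) if self.slaves else self.master

    def db_for_write(self, model, **hints):
        return self.master
```

With routing like this, the master is only contacted when something writes, which is why turning off the session writes would be enough to run from slaves alone.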

timellis

Assignee

Updated

•

14 years ago

Whiteboard: needs outage

matthew zeier [:mrz]

Updated

•

14 years ago

Flags: needs-downtime+

Justin Scott [:fligtar]

Comment 8

•

14 years ago

If we disable logging in and log everyone out, that pretty much puts us in read-only mode. We should be able to do a site notice from the admin panel that explains logging in is currently disabled.

Wil Clouser [:clouserw]

Comment 9

•

14 years ago

We've got a push next Tuesday that will let us get some code out to support site notices and such.

I'm not convinced we can do a read-only mode between now and then but we can certainly try, and planning this downtime for next week will give us a bit of time.

matthew zeier [:mrz]

Comment 10

•

14 years ago

So tentatively planned for next Tuesday's window?

Whiteboard: needs outage → 06/22/2010 @ 7pm (?)

Wil Clouser [:clouserw]

Comment 11

•

14 years ago

Code update is at 4pm, so any time after that would be fine with me - sounds like 7pm is a good time.

timellis

Assignee

Updated

•

14 years ago

Severity: normal → critical

Whiteboard: 06/22/2010 @ 7pm (?) → 06/22/2010 @ 7pm

Nick Nguyen [:osunick]

Comment 12

•

14 years ago

We'll need someone who knows what's up to write a blog post explaining the outage and when it's happening.  I also presume we can put up a static page that points to this blog post?

timellis

Assignee

Comment 13

•

14 years ago

I will coordinate with Dave Miller on writing the blog post. Unsure about what static page goes up. I assumed we had a standard procedure for AMO outage static page...?

Addt'l info: plan is to be done by 10pm, but might take until 11pm or so.

matthew zeier [:mrz]

Comment 14

•

14 years ago

Tim's talking about the normal IT downtime blog, not the blog Nick's talking about.

Justin Scott [:fligtar]

Comment 15

•

14 years ago

I am told it's too late for us to look into read-only mode now and that we'll have to be completely down for the whole time. What's the status of the blog post? I'd like to cross-post on the AMO blog as soon as possible.

Also, can we modify the outage page to link to the post?

Justin Scott [:fligtar]

Comment 16

•

14 years ago

I think this needs to be rescheduled for another day for two reasons:

1. same-day notice to users and developers is not acceptable for this level of downtime

2. Firefox 3.6.4 is being released at 1pm tomorrow. I don't think we can take AMO down during any Firefox release, and certainly not one that will get as much traffic/promotion as this one.

Wil Clouser [:clouserw]

Comment 17

•

14 years ago

comment 16 sounds pretty reasonable.  IT?

Wil Clouser [:clouserw]

Comment 18

•

14 years ago

Alright, this is off for tonight.

timellis

Assignee

Comment 19

•

14 years ago

I concur. Just let me know the next window you think is opportune and I'll work it in.

timellis

Assignee

Updated

•

14 years ago

Severity: critical → enhancement

Whiteboard: 06/22/2010 @ 7pm → [Waiting on good downtime]

Wil Clouser [:clouserw]

Comment 20

•

14 years ago

Next AMO upgrade is next Tuesday, although we aren't planning any downtime.  Perhaps a weekend would be best instead of trying to coordinate with our upgrades?

Wil Clouser [:clouserw]

Comment 21

•

14 years ago

We've got a rudimentary "read only" mode finished in bug 574024.  Once it's QA'd it will be on the live site next Tuesday.  This downtime can be its maiden voyage.

matthew zeier [:mrz]

Updated

•

14 years ago

Severity: enhancement → critical

timellis

Assignee

Updated

•

14 years ago

Whiteboard: [Waiting on good downtime] → Needs outage: 24 July 2010, NOON.

Justin Scott [:fligtar]

Comment 22

•

14 years ago

Does this draft blog post cover everything?
--
This Saturday, July 24 at noon, addons.mozilla.org will undergo a 4.5 hour maintenance period. While we will try to make many pages available as read-only, please plan for the site to be completely unavailable for the full 4.5 hour window.

For more information, please see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=572518">bug 572518</a>.

Justin Scott [:fligtar]

Comment 23

•

14 years ago

Posted. http://blog.mozilla.com/addons/2010/07/19/addons-mozilla-org-planned-downtime/

Jeff Balogh (:jbalogh)

Comment 24

•

14 years ago

Attached file: slow migrations

Here's some slow migrations we'd like to run while the site is offline.  The index in addons_collections took 27 minutes for me to run locally.

Please run these when you feel the timing is right.

timellis

Assignee

Comment 25

•

14 years ago

(In reply to comment #24)
> Here's some slow migrations we'd like to run while the site is offline.  The
> index in addons_collections took 27 minutes for me to run locally.
> 
> Please run these when you feel the timing is right.

There is a very high probability there won't be time in that 4.5-hour window for running any slow migrations. We'll need to take another outage for that, or lengthen Saturday's outage.

matthew zeier [:mrz]

Comment 26

•

14 years ago

Do we still think we're okay doing this tomorrow?  Still okay with a possible Fx 3.6.8 release today?

Justin Scott [:fligtar]

Comment 27

•

14 years ago

If 3.6.8 goes out today or tomorrow (which is likely) we probably shouldn't do this. Though with BlackHat and Defcon at the end of next week, we may see another release then as well.

I guess it depends on how badly this needs to be fixed. I'd really like to do it on a weekend.

matthew zeier [:mrz]

Updated

•

14 years ago

Whiteboard: Needs outage: 24 July 2010, NOON. → postponted

timellis

Assignee

Comment 28

•

14 years ago

Wait. What? Are we cancelling this or not? I have set aside 4.5 hours today to do this outage.

timellis

Assignee

Comment 29

•

14 years ago

Caught up on email, read blog posts. I see outage is definitely cancelled. Will schedule it up, perhaps 7 Aug 2010.

timellis

Assignee

Comment 30

•

14 years ago

Fligtar et al: When is good to have this outage? I can't do this coming weekend, so I'm hoping sometime this week during the week? After 11pm (Pacific) I could do it most days of the week.

timellis

Assignee

Comment 31

•

14 years ago

mrz notes since we have a read-only mode we can do this during the next normal outage window. How is Thursday at 7pm? Please respond ASAP. This cluster is in a dangerous state. We can't ignore it.

Wil Clouser [:clouserw]

Comment 32

•

14 years ago

no objections from me

timellis

Assignee

Comment 33

•

14 years ago

(In reply to comment #32)
> no objections from me

MRZ objects. He says there's an AMO content push during tomorrow's outage, and that you probably don't want databases in read-only mode for ~5 hours during that time.

Is that true? If not, let's schedule the AMO rebuild for the same moment. I can do it tomorrow night if you really are on-board with it.

Wil Clouser [:clouserw]

Comment 34

•

14 years ago

content push is at 4pm.  I expect to be done well before 7, but if we want to be safe we can do it another night.  /me is flexible!

matthew zeier [:mrz]

Comment 35

•

14 years ago

Concerned about doing this after a content push.  Next Tuesday?

Wil Clouser [:clouserw]

Comment 36

•

14 years ago

(In reply to comment #35)
> Concerned about doing this after a content push.  Next Tuesday?

wfm

Justin Scott [:fligtar]

Comment 37

•

14 years ago

Is this on for tomorrow night?

timellis

Assignee

Comment 38

•

14 years ago

Works for me: Tuesday August 10, 2010. Can I start at 6pm?

Justin Scott [:fligtar]

Comment 39

•

14 years ago

Fine with me. I'll do another blog post. Is 6pm - 10:30pm in read-only mode correct?

timellis

Assignee

Comment 40

•

14 years ago

Correct. 6pm-10:30pm Tuesday August 10th. Does anyone on this bug know how to put AMO into and take it out of read-only mode? (ie: an actual document about how to do it in the production environment)

timellis

Assignee

Comment 41

•

14 years ago

jbalogh	‣ # Turn on read-only mode in settings_local.py by putting this line
jbalogh	‣ # at the VERY BOTTOM: read_only_mode(globals())
jbalogh	‣ then you have to push that out to all the webheads
jbalogh	‣ https://intranet.mozilla.org/UpdateAMO#Pushing_content
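
For context on the `read_only_mode(globals())` line quoted above: only the call itself appears in this bug, so the body below is an assumption about what such a helper might do, not zamboni's code. It takes the settings namespace as a dict and repoints the default connection at a slave. `READ_ONLY` and `SLAVE_DATABASES` are hypothetical setting names; `DATABASES` is Django's standard setting.

```python
def read_only_mode(env):
    """Mutate a settings namespace so the app never needs the master."""
    env["READ_ONLY"] = True

    slaves = env.get("SLAVE_DATABASES") or []
    if not slaves:
        raise ValueError("read-only mode needs at least one slave database")

    # Point the default (write) connection at the first slave.
    databases = dict(env["DATABASES"])
    databases["default"] = databases[slaves[0]]
    env["DATABASES"] = databases
    return env
```

A helper of this shape would have to be called at the very bottom of the settings file, presumably so that it sees every setting defined earlier in the file.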

Pushing content

Since zamboni is hosted in a .git repo, it conflicts with IT's git content push system. To work around the problem, follow these steps to push zamboni content live:

Make changes @ mradm02:/data/amo_python/src e.g., preview is @ mradm02:/data/amo_python/src/preview/zamboni

Run /data/bin/omg_push_zamboni_live.sh

This will basically:
rsync src/ to www/ excluding .git and overwrite everything in www/.

commit all changes to our push system
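
The contents of the push script are not shown here, so as a rough Python stand-in for the "rsync src/ to www/ excluding .git" step only (the function name and demo paths are assumptions, and `dirs_exist_ok` needs Python 3.8 or later):

```python
import os
import shutil
import tempfile


def push_tree(src, dest):
    """Copy src into dest, overwriting existing files and skipping .git."""
    shutil.copytree(
        src,
        dest,
        ignore=shutil.ignore_patterns(".git"),
        dirs_exist_ok=True,
    )


def demo():
    # Build a throwaway src/ with a .git directory and one file, then push it.
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dest = os.path.join(root, "www")
        os.makedirs(os.path.join(src, ".git"))
        os.makedirs(dest)
        with open(os.path.join(src, "settings.py"), "w") as handle:
            handle.write("# placeholder\n")
        push_tree(src, dest)
        return sorted(os.listdir(dest))
```

Unlike an rsync run with a delete option, this copy leaves files in the destination that no longer exist in the source.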

Jeff Balogh (:jbalogh)

Comment 42

•

14 years ago

https://intranet.mozilla.org/UpdateAMO#Read-only_Mode

Jeff Balogh (:jbalogh)

Updated

•

14 years ago

Whiteboard: postponted → postponed

matthew zeier [:mrz]

Updated

•

14 years ago

Whiteboard: postponed → 08/10/2010 @ 6pm

Daniel Einspanjer [:dre] [:deinspanjer]

Comment 43

•

14 years ago

FYI, because metrics attempts to insert data into the AMO master during our end-of-day procedure, we'll have to suspend that process for tonight and run it tomorrow instead.

That means that AMO developers won't see an update to their stats dashboards until tomorrow evening.

Jeff Balogh (:jbalogh)

Comment 44

•

14 years ago

Please disable amo's crontab during the outage: https://intranet.mozilla.org/UpdateAMO#Cronjobs

Justin Scott [:fligtar]

Comment 45

•

14 years ago

For tonight's outage, the read-only mode will only affect python pages, which don't include services like update check and blocklist. Is there a way to get Zeus to serve cached pages for the duration of the outage to minimize that?

Justin Scott [:fligtar]

Comment 46

•

14 years ago

(In reply to comment #45)
> For tonight's outage, the read-only mode will only affect python pages, which
> don't include services like update check and blocklist. Is there a way to get
> Zeus to serve cached pages for the duration of the outage to minimize that?

dmoore says this isn't possible for tonight.

Michael Morgan [:morgamic]

Comment 47

•

14 years ago

I'm okay with versioncheck posting a blank update snippet for 5 hours.  The blocklist service should not have downtime if it's avoidable.  It'd be worth delaying this outage window if we can find a way to avoid downtime for two core Firefox services.

Mostly, I'd like to know what's possible, what's not and what we could do to help with an interim solution. The versioncheck and blocklist stuff could be solved by serving static XML files in both cases or pre-populating a cache based off of replayed logs then serving directly out of memory for a specified block of time.

Can someone explain why it isn't possible?  Is Zeus going down?  What's the constraint?

Michael Morgan [:morgamic]

Comment 48

•

14 years ago

Probably late for this, but zeus can serve files itself statically:
http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver

In both cases the service needs just a static file to avoid serving 404s.

Versioncheck:
http://people.mozilla.org/~morgamic/versioncheck.xml

Blocklist:
http://people.mozilla.org/~morgamic/blocklist.xml

Maybe something for next time.
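
The static-file idea is not specific to Zeus. As an illustration only (this is not Zeus configuration, and the XML body passed in would be whatever static document the service should return), a tiny WSGI app can answer every request with one fixed payload and a cache lifetime, without touching the database. The function names and the default `max_age` are assumptions.

```python
def make_static_app(body, content_type="text/xml", max_age=300):
    """Build a WSGI app that always serves the same body."""
    payload = body.encode("utf-8")
    headers = [
        ("Content-Type", content_type + "; charset=utf-8"),
        ("Content-Length", str(len(payload))),
        ("Cache-Control", "public, max-age=%d" % max_age),
    ]

    def app(environ, start_response):
        start_response("200 OK", headers)
        return [payload]

    return app


def fetch(app):
    """Call a WSGI app once and return (status, headers dict, body bytes)."""
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status
        captured["headers"] = dict(response_headers)

    body = b"".join(app({}, start_response))
    return captured["status"], captured["headers"], body
```

Serving the payload from memory means clients get a 200 instead of a 404 while the database is unavailable.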

Wil Clouser [:clouserw]

Comment 49

•

14 years ago

I don't think this has started yet - the site is still r/w.

timellis

Assignee

Comment 50

•

14 years ago

These hosts rebuilt from tm-amo01-master01:

tm-amo02-master01
tm-amo01-slave01
tm-amo01-slave02
tm-amo01-slave03
tm-amo01-slave04

These need to be done still:

tm-amo02-slave*: need to create proper my.cnf first so these errors won't happen anymore:

100810 20:36:26 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.

And others. The slave threads won't start.

tm-backup02: need to figure out why there isn't enough disk space:

innodb.db
 17804034048  79%   15.77MB/s    0:04:43
rsync: write failed on "/data/amo01-innodb/innodb.db": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
rsync: connection unexpectedly closed (96 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]

[root@tm-backup02 22:00:42 /data]
:) l
total 28
drwxr-xr-x 6 mysql mysql 4096 Aug 10 21:57 amo01/
drwxr-xr-x 2 mysql mysql 4096 Aug 10 21:51 amo01-innodb/
[root@tm-backup02 22:00:43 /data]
:) df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                         77247     55956     17305  77% /

17GB isn't even close to enough disk space.
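
A rough pre-flight check for a copy like this, using the numbers from the output above: rsync had sent 17804034048 bytes at 79%, which puts innodb.db alone at roughly 22.5 GB, against 17305 MiB available. This is a sketch with hypothetical function names; the 10% headroom is an assumption, the rsync percentage is rounded so the estimate is approximate, and the real requirement also includes the other files being copied.

```python
import shutil


def estimate_total_bytes(bytes_sent, percent_done):
    """Estimate a file's full size from rsync's progress line."""
    return int(bytes_sent * 100 / percent_done)


def fits(required_bytes, available_bytes, headroom=0.10):
    """True if the copy fits with some spare room left over."""
    return required_bytes * (1 + headroom) <= available_bytes


def free_bytes(path):
    """Free space on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


# Numbers taken from the rsync and df output above.
innodb_estimate = estimate_total_bytes(17804034048, 79)
available = 17305 * 1024 * 1024
enough_room = fits(innodb_estimate, available)
```

Checking `fits(required, free_bytes("/data"))` before starting would catch this kind of shortfall before minutes of the outage window are spent on a transfer that cannot finish.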

matthew zeier [:mrz]

Comment 51

•

14 years ago

> Can someone explain why it isn't possible?  Is Zeus going down?  What's the
> constraint?

Time.  Lack of time to spin up whatever alternative Zeus config, test, and be ready.  Should be something we do for next time (or even just because).

matthew zeier [:mrz]

Comment 52

•

14 years ago

(In reply to comment #48)
> Probably late for this, but zeus can serve files itself statically:
> http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver
> 
> In both cases the service needs just a static file to avoid serving 404s.

I like this idea mostly from a load shed perspective.  If we needed to shed DB load, this feels like a win.  Can you invent a knob that makes AMO spit out a static page with a cache life of days?  Post load shed, we purge that object and turn the knob back.

I'll buy you a cookie.

Michael Morgan [:morgamic]

Comment 53

•

14 years ago

Sure, but if we could force it with Zeus it'd work for non-AMO stuff, which would be helpful if we had to do it for another site.

Mark Banner (:standard8)

Comment 54

•

14 years ago

So is the downtime complete (for now at least)? Is AMO back up and running?

We're going to be unthrottling a major update for Thunderbird today, so we need AMO to be working for ensuring add-ons are up to date.

Wil Clouser [:clouserw]

Comment 55

•

14 years ago

Production AMO has been having a lot of trouble connecting to MySQL recently, but it looks like Nagios is complaining about it so I assume it's being worked on?  ->blocker

Severity: critical → blocker

timellis

Assignee

Comment 56

•

14 years ago

Also need to rebuild the entirety of Phoenix AMO. Am doing the backups slave now to facilitate that.

@Mark/Wil: the outage is finished and was done on time from your POV. Let me know if you see any evidence to the contrary.

Wil Clouser [:clouserw]

Comment 57

•

14 years ago

(In reply to comment #56)
> @Mark/Wil: the outage is finished and was done on time from your POV. Let me
> know if you see some evidence contrary.

Assuming we're not using the phoenix cluster at all, I assume this is causing our troubles:

tm-amo01-slave04:MySQL is CRITICAL: Cant connect to MySQL server on 10.2.70.15

timellis

Assignee

Comment 59

•

14 years ago

Done rebuilding Phx.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard