Closed
Bug 572518
Opened 14 years ago
Closed 14 years ago
AMO Cluster Needs Rebuild
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tellis, Assigned: tellis)
References
Details
(Whiteboard: 08/10/2010 @ 6pm)
Attachments
(1 file)
446 bytes,
text/plain
Tried to create a new slave of tm-amo01-master01, got this error:
100616 12:32:32 [ERROR] Slave: Error 'Cannot add or update a child row: a foreign key constraint fails (`addons_remora/addons_collections`, CONSTRAINT `addons_collections_ibfk_2` FOREIGN KEY (`collection_id`) REFERENCES `collections` (`id`))' on query. Default database: 'addons_remora'. Query: 'INSERT INTO addons_collections (addon_id, user_id, collection_id, added) VALUES (321, 2787551, 69531, NOW())', Error_code: 1452
Sure enough, collection 69531 doesn't exist on the backups server, but does on tm-amo01-master01. This means that any slave created from that backups server is broken. This is potentially the entire cluster. So I need to rebuild the cluster.
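The check described above (a row that exists on the master but not on the copy) can be sketched roughly as follows. This is only an illustration: it uses two in-memory SQLite databases as stand-ins for tm-amo01-master01 and the backups server, and the `missing_ids` and `make_db` helpers and the sample ids other than 69531 are hypothetical.

```python
import sqlite3

def missing_ids(master, replica, table, key="id"):
    """Return key values present on master but absent from replica."""
    # table and key are trusted constants in this sketch, not user input.
    query = f"SELECT {key} FROM {table}"
    master_ids = {row[0] for row in master.execute(query)}
    replica_ids = {row[0] for row in replica.execute(query)}
    return sorted(master_ids - replica_ids)

def make_db(ids):
    """Build an in-memory stand-in for one server's collections table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE collections (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO collections (id) VALUES (?)", [(i,) for i in ids])
    return conn

master = make_db([69529, 69530, 69531])   # stand-in for the master
backup = make_db([69529, 69530])          # stand-in for the backups server
drift = missing_ids(master, backup, "collections")
print(drift)
```

On a real cluster the two connections would point at MySQL hosts and the comparison would need to be chunked rather than loading every id into memory.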
Comment 1•14 years ago
What about slaves created from the current slaves? 6 hours of AMO downtime is inconvenient and makes me sad. Others too I bet (and how does this happen?).
I can do this in 4.5 hours of downtime. I've figured out a way. But it will take me personally much longer to complete the job (another 20-25 hours). That is probably a good tradeoff from the Moz POV. How it happens is unclear. I don't see this sort of "corruption" on systems with good network connectivity and stable hardware. We used to have it at Friendster. I would fix it with a homegrown tablesync script I wrote specifically for this sort of problem. Since then, a tablesync has been written as part of the Maatkit MySQL toolkit. Eventually I'll install, configure, and run it as part of the routine database maintenance so I can catch this sort of "corruption" earlier. There's no way to be sure how long the problem has existed.
I should clarify my final paragraph there. There's no way right now to tell how long the problem has existed. Once I'm running tablesync scripts on a daily basis, we will know down to a day when a problem occurs, and there will be some hope of fixing the underlying cause of the problem.
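For anyone unfamiliar with the idea behind a daily table comparison, here is a simplified sketch. It is not how the Maatkit tool works internally; it just shows the general approach of checksumming rows in primary-key buckets on both sides and reporting the buckets that differ. The function names, bucket size, and sample rows are all made up for illustration.

```python
import hashlib

def bucket_checksums(rows, bucket_size=1000):
    """Digest rows per primary-key bucket (pk // bucket_size).

    rows: iterable of tuples whose first element is an integer primary key.
    """
    buckets = {}
    for row in sorted(rows):
        digest = buckets.setdefault(row[0] // bucket_size, hashlib.sha1())
        digest.update(repr(row).encode("utf-8"))
    return {bucket: digest.hexdigest() for bucket, digest in buckets.items()}

def differing_buckets(master_rows, replica_rows, bucket_size=1000):
    """Return the bucket numbers whose checksums disagree between the copies."""
    m = bucket_checksums(master_rows, bucket_size)
    r = bucket_checksums(replica_rows, bucket_size)
    return sorted(b for b in m.keys() | r.keys() if m.get(b) != r.get(b))

master_rows = [(69529, "a"), (69530, "b"), (69531, "c"), (70500, "d")]
replica_rows = [(69529, "a"), (69530, "b"), (70500, "d")]  # 69531 is missing
differing = differing_buckets(master_rows, replica_rows)
print(differing)
```

Bucketing by key range rather than by row position means a single missing row only flags its own bucket, so a daily run can narrow a problem down to a small range of ids.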
Comment 4•14 years ago
> How it happens is unclear. I don't see this sort of "corruption" on systems
> with good network connectivity and stable hardware.
We still see many "Can't connect to mysql" errors from the app every hour. Perhaps they are related.
Comment 5•14 years ago
Is there any way we could make AMO read-only for these 4.5 - 6 hours instead of completely down? Have it run from separate db boxes off the current bad data until the normal cluster is finished?
Comment 6•14 years ago
I'd sure love that feature in general - would make failing over to Phoenix for a couple hours way easier if we can go into a degraded read-only state.
Comment 7•14 years ago
(In reply to comment #6)
> I'd sure love that feature in general - would make failing over to Phoenix for
> a couple hours way easier if we can go into a degraded read-only state.
The only writes zamboni does right now are for session stuff. If we made a way to turn that off I think we could operate read-only. All the developer pages for remora would be messed up, but we could serve all the public pages. zamboni doesn't talk to master unless it's for a write, so I think we could run it on slaves.
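The "only talks to master for a write" behaviour is the usual read/write split. A minimal sketch of that pattern, shaped like a Django database router (`db_for_read` / `db_for_write`), is below. It is not zamboni's actual router; the class name and the connection aliases are assumptions for illustration.

```python
import random

class MasterSlaveRouter:
    """Send reads to a slave and writes to the master (hypothetical sketch)."""

    def __init__(self, slaves, master="default"):
        self.slaves = list(slaves)
        self.master = master

    def db_for_read(self, model, **hints):
        # Fall back to the master only if no slaves are configured.
        return random.choice(self.slaves) if self.slaves else self.master

    def db_for_write(self, model, **hints):
        return self.master

router = MasterSlaveRouter(slaves=["slave01", "slave02"])
read_db = router.db_for_read(None)
write_db = router.db_for_write(None)
print(read_db, write_db)
```

With routing like this, a site that performs no writes never opens a connection to the master, which is what would make running from slaves alone plausible.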
Updated•14 years ago
Flags: needs-downtime+
Comment 8•14 years ago
If we disable logging in and log everyone out, that pretty much puts us in read-only mode. We should be able to do a site notice from the admin panel that explains logging in is currently disabled.
Comment 9•14 years ago
We've got a push next Tuesday that will let us get some code out to support site notices and such. I'm not convinced we can do a read-only mode between now and then but we can certainly try, and planning this downtime for next week will give us a bit of time.
Comment 10•14 years ago
So tentatively planned for next Tuesday's window?
Whiteboard: needs outage → 06/22/2010 @ 7pm (?)
Comment 11•14 years ago
Code update is at 4pm, so any time after that would be fine with me - sounds like 7pm is a good time.
Severity: normal → critical
Whiteboard: 06/22/2010 @ 7pm (?) → 06/22/2010 @ 7pm
Comment 12•14 years ago
We'll need someone who knows what's up to write a blog post explaining the outage and when it's happening. I also presume we can put up a static page that points to this blog post?
Assignee
Comment 13•14 years ago
I will coordinate with Dave Miller on writing the blog post. Unsure about what static page goes up. I assumed we had a standard procedure for AMO outage static page...? Addt'l info: plan is to be done by 10pm, but might take until 11pm or so.
Comment 14•14 years ago
Tim's talking about the normal IT downtime blog, not the blog Nick's talking about.
Comment 15•14 years ago
I am told it's too late for us to look into read-only mode now and that we'll have to be completely down for the whole time. What's the status of the blog post? I'd like to cross-post on the AMO blog as soon as possible. Also, can we modify the outage page to link to the post?
Comment 16•14 years ago
I think this needs to be rescheduled for another day for two reasons:
1. same-day notice to users and developers is not acceptable for this level of downtime
2. Firefox 3.6.4 is being released at 1pm tomorrow. I don't think we can take AMO down during any Firefox release, and certainly not one that will get as much traffic/promotion as this one.
Comment 17•14 years ago
comment 16 sounds pretty reasonable. IT?
Comment 18•14 years ago
Alright, this is off for tonight.
Assignee
Comment 19•14 years ago
I concur. Just let me know the next window you think is opportune and I'll work it in.
Severity: critical → enhancement
Whiteboard: 06/22/2010 @ 7pm → [Waiting on good downtime]
Comment 20•14 years ago
Next AMO upgrade is next Tuesday, although we aren't planning any downtime. Perhaps a weekend would be best instead of trying to coordinate with our upgrades?
Comment 21•14 years ago
We've got a rudimentary "read only" mode finished in bug 574024. Once it's QA'd it will be on the live site next Tuesday. This downtime can be its maiden voyage.
Updated•14 years ago
Severity: enhancement → critical
Whiteboard: [Waiting on good downtime] → Needs outage: 24 July 2010, NOON.
Comment 22•14 years ago
Does this draft blog post cover everything? -- This Saturday, July 24 at noon, addons.mozilla.org will undergo a 4.5 hour maintenance period. While we will try to make many pages available as read-only, please plan for the site to be completely unavailable for the full 4.5 hour window. For more information, please see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=572518">bug 572518</a>.
Comment 23•14 years ago
Posted. http://blog.mozilla.com/addons/2010/07/19/addons-mozilla-org-planned-downtime/
Comment 24•14 years ago
Here's some slow migrations we'd like to run while the site is offline. The index in addons_collections took 27 minutes for me to run locally. Please run these when you feel the timing is right.
Assignee
Comment 25•14 years ago
(In reply to comment #24)
> Here's some slow migrations we'd like to run while the site is offline. The
> index in addons_collections took 27 minutes for me to run locally.
>
> Please run these when you feel the timing is right.
There is a very high probability there won't be time in that 4.5-hour window for running any slow migrations. We'll need to take another outage for that, or lengthen Saturday's outage.
Comment 26•14 years ago
Do we still think we're okay doing this tomorrow? Still okay with a possible Fx 3.6.8 release today?
Comment 27•14 years ago
If 3.6.8 goes out today or tomorrow (which is likely) we probably shouldn't do this. Though with BlackHat and Defcon at the end of next week, we may see another release then as well. I guess it depends on how badly this needs to be fixed. I'd really like to do it on a weekend.
Updated•14 years ago
Whiteboard: Needs outage: 24 July 2010, NOON. → postponted
Assignee
Comment 28•14 years ago
Wait. What? Are we cancelling this or not? I have set aside 4.5 hours today to do this outage.
Assignee
Comment 29•14 years ago
Caught up on email, read blog posts. I see outage is definitely cancelled. Will schedule it up, perhaps 7 Aug 2010.
Assignee
Comment 30•14 years ago
Fligtar et al: When is good to have this outage? I can't do this coming weekend, so I'm hoping sometime this week during the week? After 11pm (Pacific) I could do it most days of the week.
Assignee
Comment 31•14 years ago
mrz notes since we have a read-only mode we can do this during the next normal outage window. How is Thursday at 7pm? Please respond ASAP. This cluster is in a dangerous state. We can't ignore it.
Comment 32•14 years ago
no objections from me
Assignee
Comment 33•14 years ago
(In reply to comment #32)
> no objections from me
MRZ objects. He says there's an AMO content push during tomorrow's outage, and that you probably don't want databases in read-only mode for ~5 hours during that time. Is that true? If not, let's schedule the AMO rebuild for the same moment. I can do it tomorrow night if you really are on-board with it.
Comment 34•14 years ago
content push is at 4pm. I expect to be done well before 7, but if we want to be safe we can do it another night. /me is flexible!
Comment 35•14 years ago
Concerned about doing this after a content push. Next Tuesday?
Comment 36•14 years ago
(In reply to comment #35)
> Concerned about doing this after a content push. Next Tuesday?
wfm
Comment 37•14 years ago
Is this on for tomorrow night?
Assignee
Comment 38•14 years ago
Works for me: Tuesday August 10, 2010. Can I start at 6pm?
Comment 39•14 years ago
Fine with me. I'll do another blog post. Is 6pm - 10:30pm in read-only mode correct?
Assignee
Comment 40•14 years ago
Correct. 6pm-10:30pm Tuesday August 10th. Does anyone on this bug know how to put AMO into and take it out of read-only mode? (ie: an actual document about how to do it in the production environment)
Assignee
Comment 41•14 years ago
jbalogh ‣ # Turn on read-only mode in settings_local.py by putting this line
jbalogh ‣ # at the VERY BOTTOM: read_only_mode(globals())
jbalogh ‣ then you have to push that out to all the webheads
jbalogh ‣ https://intranet.mozilla.org/UpdateAMO#Pushing_content
Pushing content
Since zamboni is hosted in a .git repo it conflicts with IT's git content push system. To resolve the problem, these are the steps for pushing zamboni content live:
Make changes @ mradm02:/data/amo_python/src
e.g., preview is @ mradm02:/data/amo_python/src/preview/zamboni
Run /data/bin/omg_push_zamboni_live.sh
This will basically:
rsync src/ to www/ excluding .git and overwrite everything in www/.
commit all changes to our push system
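To give a feel for why the `read_only_mode(globals())` line has to go at the very bottom of settings_local.py: a helper like that presumably rewrites settings that were defined earlier in the file. The sketch below is a guess at the general shape, not zamboni's actual implementation. The `READ_ONLY`, `DATABASES`, and `SLAVE_DATABASES` names and the hostnames are assumptions for illustration.

```python
def read_only_mode(env):
    """Hypothetical sketch: rewrite a settings namespace so nothing uses master."""
    slaves = env.get("SLAVE_DATABASES") or []
    if not slaves:
        raise RuntimeError("read-only mode needs at least one slave database")
    env["READ_ONLY"] = True
    # Point the default connection at a slave so the app never talks to master.
    env["DATABASES"]["default"] = env["DATABASES"][slaves[0]]
    return env

# Stand-in for the module globals of a settings file (made-up values).
settings = {
    "DATABASES": {
        "default": {"HOST": "master.example"},
        "slave01": {"HOST": "slave01.example"},
    },
    "SLAVE_DATABASES": ["slave01"],
}
read_only_mode(settings)
print(settings["DATABASES"]["default"]["HOST"])
```

Because the helper mutates whatever is already in the namespace, any setting defined after the call would not be covered, hence the "VERY BOTTOM" instruction.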
Comment 42•14 years ago
https://intranet.mozilla.org/UpdateAMO#Read-only_Mode
Updated•14 years ago
Whiteboard: postponted → postponed
Updated•14 years ago
Whiteboard: postponed → 08/10/2010 @ 6pm
Comment 43•14 years ago
FYI, because metrics attempts to insert data into the AMO master during our end-of-day procedure, we'll have to suspend that process for tonight and run it tomorrow instead. That means that AMO developers won't see an update to their stats dashboards until tomorrow evening.
Comment 44•14 years ago
Please disable amo's crontab during the outage: https://intranet.mozilla.org/UpdateAMO#Cronjobs
Comment 45•14 years ago
For tonight's outage, the read-only mode will only affect python pages, which don't include services like update check and blocklist. Is there a way to get Zeus to serve cached pages for the duration of the outage to minimize that?
Comment 46•14 years ago
(In reply to comment #45)
> For tonight's outage, the read-only mode will only affect python pages, which
> don't include services like update check and blocklist. Is there a way to get
> Zeus to serve cached pages for the duration of the outage to minimize that?
dmoore says this isn't possible for tonight.
Comment 47•14 years ago
I'm okay with versioncheck posting a blank update snippet for 5 hours. The blocklist service should not have downtime if it's avoidable. It'd be worth delaying this outage window if we can find a way to avoid downtime for two core Firefox services. Mostly, I'd like to know what's possible, what's not and what we could do to help with an interim solution. The versioncheck and blocklist stuff could be solved by serving static XML files in both cases or pre-populating a cache based off of replayed logs then serving directly out of memory for a specified block of time. Can someone explain why it isn't possible? Is Zeus going down? What's the constraint?
Comment 48•14 years ago
Probably late for this, but zeus can serve files itself statically:
http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver
In both cases the service needs just a static file to avoid serving 404s.
Versioncheck: http://people.mozilla.org/~morgamic/versioncheck.xml
Blocklist: http://people.mozilla.org/~morgamic/blocklist.xml
Maybe something for next time.
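As a rough illustration of the "just serve a static file" idea, here is a tiny WSGI app that answers every request with the same XML body and a long cache lifetime. It is a Python stand-in, not Zeus configuration. The function name, the one-day `max_age` default, and the placeholder body are assumptions, and the placeholder is not the real blocklist format.

```python
def make_static_app(body, max_age=86400):
    """Return a WSGI app that answers every request with the same XML body."""
    payload = body.encode("utf-8")
    headers = [
        ("Content-Type", "text/xml; charset=utf-8"),
        ("Content-Length", str(len(payload))),
        ("Cache-Control", f"public, max-age={max_age}"),
    ]

    def app(environ, start_response):
        start_response("200 OK", headers)
        return [payload]

    return app

# Placeholder body; a real deployment would serve the actual file contents.
blocklist_app = make_static_app("<blocklist/>")

# Call the app directly with a fake request to show what a client would get.
captured = {}

def fake_start_response(status, response_headers):
    captured["status"] = status
    captured["headers"] = dict(response_headers)

body = b"".join(blocklist_app({"PATH_INFO": "/blocklist"}, fake_start_response))
print(captured["status"], captured["headers"]["Cache-Control"])
```

The point of the `Cache-Control` header is that anything caching in front of the app can keep answering without touching the database at all for the duration of an outage.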
Comment 49•14 years ago
I don't think this has started yet - the site is still r/w.
Assignee
Comment 50•14 years ago
These hosts rebuilt from tm-amo01-master01:
tm-amo02-master01
tm-amo01-slave01
tm-amo01-slave02
tm-amo01-slave03
tm-amo01-slave04
These need to be done still:
tm-amo02-slave*: need to create proper my.cnf first so these errors won't happen anymore:
100810 20:36:26 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.
And others. The slave threads won't start.
tm-backup02: need to figure out why there isn't enough disk space:
innodb.db 17804034048 79% 15.77MB/s 0:04:43
rsync: write failed on "/data/amo01-innodb/innodb.db": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
rsync: connection unexpectedly closed (96 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]
[root@tm-backup02 22:00:42 /data] :) l
total 28
drwxr-xr-x 6 mysql mysql 4096 Aug 10 21:57 amo01/
drwxr-xr-x 2 mysql mysql 4096 Aug 10 21:51 amo01-innodb/
[root@tm-backup02 22:00:43 /data] :) df -m .
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00 77247 55956 17305 77% /
17GB isn't even close to enough disk space.
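The shortfall can be estimated from the numbers in the output above. The sketch below works out the approximate size of innodb.db from the rsync progress line (17804034048 bytes at 79%) and compares it with the "Available" column from `df -m`, assuming the partial file had already been cleaned up when df ran. The `has_room` helper and its 10% headroom are hypothetical, just one way to check free space before starting a copy.

```python
import shutil

def has_room(dest_dir, bytes_needed, headroom=0.10):
    """True if dest_dir's filesystem has bytes_needed free plus a safety margin."""
    free = shutil.disk_usage(dest_dir).free
    return free >= bytes_needed * (1 + headroom)

# rsync had transferred 17804034048 bytes when it reported 79% complete.
estimated_size = round(17804034048 / 0.79)

# "Available" column from df -m is in 1 MiB blocks.
available = 17305 * 1024 * 1024

fits = available >= estimated_size
print(estimated_size, available, fits)
```

That is roughly 22.5 GB needed for this one file against about 18 GB free, before counting the rest of the data directory.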
Comment 51•14 years ago
> Can someone explain why it isn't possible? Is Zeus going down? What's the
> constraint?
Time. Lack of time to spin up whatever alternative Zeus config, test, and be ready. Should be something we do for next time (or even just because).
Comment 52•14 years ago
(In reply to comment #48)
> Probably late for this, but zeus can serve files itself statically:
> http://knowledgehub.zeus.com/articles/2009/04/27/using_zxtm_as_a_webserver
>
> In both cases the service needs just a static file to avoid serving 404s.
I like this idea mostly from a load shed perspective. If we needed to shed DB load, this feels like a win. Can you invent a knob that makes AMO spit out a static page with a cache life of days? Post load shed, we purge that object and turn the knob back. I'll buy you a cookie.
Comment 53•14 years ago
Sure, but if we could force it with Zeus it'd work for non-AMO stuff, which would be helpful if we had to do it for another site.
Comment 54•14 years ago
So is the downtime complete (for now at least)? Is AMO back up and running? We're going to be unthrottling a major update for Thunderbird today, so we need AMO to be working for ensuring add-ons are up to date.
Comment 55•14 years ago
Production AMO has been having a lot of trouble connecting to MySQL recently, but it looks like Nagios is complaining about it so I assume it's being worked on? ->blocker
Severity: critical → blocker
Assignee
Comment 56•14 years ago
Also need to rebuild the entirety of Phoenix AMO. Am doing the backups slave now to facilitate that.
@Mark/Wil: the outage is finished and was done on time from your POV. Let me know if you see some evidence to the contrary.
Comment 57•14 years ago
(In reply to comment #56)
> @Mark/Wil: the outage is finished and was done on time from your POV. Let me
> know if you see some evidence contrary.
Assuming we're not using the phoenix cluster at all, I assume this is causing our troubles:
tm-amo01-slave04:MySQL is CRITICAL: Cant connect to MySQL server on 10.2.70.15
Assignee
Comment 59•14 years ago
Done rebuilding Phx.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard