Closed
Bug 935002
Opened 12 years ago
Closed 12 years ago
Prepare MDN for SCL3 outage
Categories
(developer.mozilla.org Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: groovecoder, Unassigned)
References
Details
(Whiteboard: [kb=1175877] )
Attachments
(1 file)
2.84 MB, application/gzip
1. Create a site-wide notification to display leading up to the outage
2. Create a static site status page to display during the outage
| Reporter | ||
Comment 1•12 years ago
Ali - can you decide who will write the site-wide notification copy (very soon), and who will create the site status page to display? And I'm okay if both of them are me, so long as we decide soon.
Flags: needinfo?(aspivak)
Comment 2•12 years ago
(In reply to Luke Crouch [:groovecoder] from comment #1)
> Ali - can you decide who will write the site-wide notification copy (very
> soon), and who will create the site status page to display? And I'm okay if
> both of them are me, so long as we decide soon.
I can write the text if you like. Just let me know any details I need about the timing and duration.
Comment 3•12 years ago
The MDN site will be partly or totally unavailable for approximately 6 hours (16:00-22:00 UTC; 8am-2pm Pacific) on November 16th.
Flags: needinfo?(aspivak)
Comment 4•12 years ago
I've created the message. It works on the old look and feel but doesn't appear on the new; looks like a redesign bug. I'm filing.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 5•12 years ago
I closed this prematurely; we don't have the static page up yet.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•12 years ago
A quick comment regarding the nature of the maintenance page, which is to say the actual page that will be displayed during the SCL3 outage : this page and all of its assets must be self-contained and serve-able from a simple static endpoint. This page cannot depend on any pre-existing assets, content, or functionality normally available from MDN because MDN will not be available. I realise that this might have been obvious, but I figure better safe than sorry. :)
tl;dr Tarball me a simple maintenance page and I'll serve it from a bare-bones Apache server.
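One way to check the "self-contained" requirement before building the tarball is to scan the page for assets that point at another host. The following is only a minimal sketch, not a tool used in this bug: the function and class names are hypothetical, and it assumes that only `src` attributes and `<link href>` count as assets (ordinary `<a href>` links to other sites are allowed).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AssetCollector(HTMLParser):
    """Collect URLs of assets the browser would fetch to render the page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            # src on any tag (img, script, ...) and href on <link> (stylesheets, icons).
            if name == "src" or (name == "href" and tag == "link"):
                self.urls.append(value)


def external_references(html_text):
    """Return asset URLs that need a host other than the static server itself."""
    collector = AssetCollector()
    collector.feed(html_text)
    # Absolute (https://host/...) and protocol-relative (//host/...) URLs carry a netloc;
    # relative and root-relative paths do not.
    return [url for url in collector.urls if urlparse(url).netloc]
```

An empty result for every HTML file in the package suggests that nothing depends on MDN or its CDN. CSS files would need a similar check for `url(...)` references, which this sketch does not cover.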
Comment 7•12 years ago
For the satisfaction of the interested parties and, eventually, the edification of the future chroniclers of our great era, it may be worth commenting on some of the reasons for which a maintenance page is to be used instead of guaranteeing some level of service delivery.
First off, it must be stated that a full-scale replication of the entire MDN platform at PHX1 is - given the complexity of the task and the time frame in which it would need to be implemented - simply not feasible. That said, this is not in itself a bad idea, and I would very much like to open this discussion point up going forward. There are no *technical* reasons why we can't do this - it's just a matter of time and resource allocation.
A partial replication of some MDN functionality is possible, and was discussed at some length between :phrawzty, :groovecoder, and other interested parties. The elements that would need to be replicated in order to attempt to deliver basic MDN services in a read-only capacity are :
* Zeus (load balancer) configuration
* Front-end web servers
* Netapp shares and contents thereof
* MySQL database(s)
* Memcached
* Network flows (ACLs)
Notably absent from the above list :
* Elasticsearch
* Celery
* Admin node
* Dev and Stage web nodes
Concerning the items which would need to be replicated, the Zeus configuration would be relatively straightforward, so that's not a big deal. Furthermore, it would likely be possible to use existing Memcached servers at PHX1, which further reduces the expected work load. Unfortunately, the remaining items do pose some problems from both a real-work perspective and an infrastructural resource allocation standpoint.
The front-end web nodes would need to be either virtual machines or actual hardware. In the latter case, that hardware would need to be - at worst - ordered, purchased, shipped, racked, set up, and provisioned internally (inventory, DNS, etc). If the hardware is already available (as spares) at the data centre, it would still need to be turned on, configured, provisioned internally, and so forth. While there might be enough time to physically accomplish these goals, there would not afterwards be enough time to actually migrate and test anything - so that's a non-starter.
In the case of virtual machines, these are, as a general rule, much more easily provisioned - though they do still require some time. Unfortunately, the VM infrastructure at PHX1 is currently overly taxed, notably as regards allocatable RAM. This is an issue since each actual webhead currently in production has 16 GB of RAM at roughly 66 % utilisation, leading to a minimum 12 GB allocation for each VM, which is a dicey proposition for the time being. Technical limitations are the blocker in this case.
The production netapp share currently has a 30GB allocation at, again, roughly 66 % utilisation. This is not particularly rough, but it does imply that a new netapp share and associated configuration bits and pieces would need to be set up at PHX1. The data would also need to be copied over well ahead of time. Human time constraints are the blocker in this case.
Concerning the database(s), to be perfectly honest, I'm not sure whether we would be able to use an existing database cluster at PHX1, or whether we'd have to set one up temporarily. In the former case, this step involves the relatively straightforward task of dumping and importing; in the latter case we're faced with the same dilemmas as with provisioning machines to act as webheads.
Finally, as regards the network configuration, while these are not particularly time- or resource-heavy tasks, they cannot be undertaken until all of the actual infrastructure has been put into place. This adds yet another layer to the proverbial cake (which, in this case, is unfortunately not a lie).
So, to sum up, for an outage that *may* last as little as 2 hours (1+1d8 is the current estimate), we would need to build new infrastructure, involve at least 5 separate teams, and devote technical resources which may or may not actually be available.
That's why I'm formally suggesting that a simple maintenance page (see comment 6) is the best course of action at this time.
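For readers unfamiliar with dice notation, the "1+1d8" estimate above can be unpacked with a few lines of arithmetic. This is only a sketch that assumes the figure is in hours and that the d8 is a fair eight-sided die; the function name is hypothetical.

```python
from fractions import Fraction


def dice_estimate(constant, count, sides):
    """Return (minimum, maximum, mean) for `constant + <count>d<sides>`."""
    minimum = constant + count           # every die shows 1
    maximum = constant + count * sides   # every die shows its highest face
    # The mean of one fair die with faces 1..sides is (sides + 1) / 2.
    mean = constant + count * Fraction(sides + 1, 2)
    return minimum, maximum, mean


low, high, mean = dice_estimate(constant=1, count=1, sides=8)
print(low, high, float(mean))  # 2 9 5.5
```

Under those assumptions the outage would last between 2 and 9 hours, with an average of 5.5 hours, which is in line with the approximately 6-hour window given in comment 3.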
Comment 8•12 years ago
>So, to sum up, for an outage that *may* last as little as 2 hours (1+1d8 is the current estimate)
For what it's worth, that is the most awesome way of stating an estimate of how long an outage will last that I've ever seen.
Comment 9•12 years ago
Have we started to design the static page? When done well, static pages can be very valuable and even fun.
Here are my ideas:
1. The following note somewhere on the page: "MDN is temporarily offline. In
the meantime, why don't you try creating your first Firefox OS app? If you
already have a web app, you'll probably finish before our maintenance does!"
Below that, static documentation on building Firefox OS apps.
2. Tumbeasts somewhere on the page
Comment 10•12 years ago
Another idea:
3. In the meantime, see dochub.io and Dash for copies of our documentation.
| Reporter | ||
Comment 11•12 years ago
:openjck - I think you just volunteered to work on the static page?
Ali - who will work with :openjck on the content for the static page?
Flags: needinfo?(aspivak)
Comment 12•12 years ago
(In reply to Luke Crouch [:groovecoder] from comment #11)
> :openjck - I think you just volunteered to work on the static page?
>
> Ali - who will work with :openjck on the content for the static page?
Yeah, I would love to do the static page. Slightly concerned about timing with PR merging and the CSS migration, but we can talk about that offline.
| Reporter | ||
Updated•12 years ago
Whiteboard: [kb=1175877]
Comment 14•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #6)
> A quick comment regarding the nature of the maintenance page, which is to
> say the actual page that will be displayed during the SCL3 outage : this
> page and all of its assets must be self-contained and serve-able from a
> simple static endpoint. This page cannot depend on any pre-existing assets,
> content, or functionality normally available from MDN because MDN will not
> be available. I realise that this might have been obvious, but I figure
> better safe than sorry. :)
Does this include the CDN? In other words, will assets served from the CDN also be unavailable?
Flags: needinfo?(dmaher)
Comment 15•12 years ago
(In reply to John Karahalis [:openjck] from comment #14)
> Does this include the CDN? In other words, will assets served from the CDN
> also be unavailable?
The maintenance will not affect the CDN directly; however, since DNS for MDN will be redirected to PHX1 and all web requests will be rewritten to show only the maintenance page, the net result is that the assets on the CDN will not be available.
Flags: needinfo?(dmaher)
Comment 16•12 years ago
I don't want to alarm anybody, but we're getting close to the outage, and I'd like to have the maintenance page set up and ready to go ahead of time. How close are we to having that prepared ?
Flags: needinfo?(lcrouch)
| Reporter | ||
Comment 17•12 years ago
We have a PR in for it, though I think it assumes CDN assets are available.
https://github.com/mozilla/kuma/pull/1646
Flags: needinfo?(lcrouch)
Comment 18•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #16)
> I don't want to alarm anybody, but we're getting close to the outage, and
> I'd like to have the maintenance page set up and ready to go ahead of time.
> How close are we to having that prepared ?
No problem, we can make that work. Will multiple assets (like index.html, main.css, logo.png, etc.) work out, or would you prefer we inline everything in the HTML?
Flags: needinfo?(dmaher)
Comment 19•12 years ago
(In reply to Luke Crouch [:groovecoder] from comment #17)
> We have a PR in for it, though I think it assumes CDN assets are available.
>
> https://github.com/mozilla/kuma/pull/1646
As I mentioned in comment #6, the entire page you wish to display to the end user, including every single image, CSS file, HTML file, or any other entity whatsoever, must be available from the temporary webserver. I apologise if I wasn't clear enough on this point. :(
(In reply to John Karahalis [:openjck] from comment #18)
> No problem, we can make that work. Will multiple assets (like index.html,
> main.css, logo.png, etc.) work out, or would you prefer we inline everything
> in the HTML?
Please feel free to organise the assets as you like (i.e. making everything inline is not a requirement); just be aware that they will be served from a simple webserver, so everything must be static.
Please don't hesitate to ask if you have any more questions or concerns.
Flags: needinfo?(dmaher)
| Reporter | ||
Comment 20•12 years ago
:phrawzty - will CDN assets really be unavailable during the outage? A local traceroute from my box to developer.cdn.mozilla.net sent me straight from Cox (my ISP) to Akamai - never seemed to hit SCL3.
Flags: needinfo?(dmaher)
Comment 21•12 years ago
:phrawzty, could you confirm that all MDN pages will be redirected with a 302 to this static page? Thank you.
Comment 22•12 years ago
(In reply to Luke Crouch [:groovecoder] from comment #20)
> :phrawzty - will CDN assets really be unavailable during the outage? A local
> traceroute from my box to developer.cdn.mozilla.net sent me straight from
> Cox (my ISP) to Akamai - never seemed to hit SCL3.
The way that CDNs work is that they cache content for a certain amount of time. When the cache for a given object expires, the CDN will attempt to re-obtain the content from the source - in this case, webservers at SCL3. If the cache expires, and the source is unavailable, then the object in question will also be unavailable.
You're welcome to roll the dice and hope that your content doesn't expire during the window, though that wouldn't be my suggestion.
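The expiry behaviour described above can be modelled in a few lines. This is a deliberately simplified sketch, not a description of how Akamai is configured for MDN: the class and exception names are hypothetical, the TTL is a single fixed value, and a real CDN may be set up to serve stale content when the origin is down, which this model does not do.

```python
class OriginUnavailable(Exception):
    """Raised when the origin webservers cannot be reached."""


class EdgeCache:
    """Toy CDN edge: serves cached objects until their TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._store = {}  # path -> (body, expires_at)

    def get(self, path, now):
        entry = self._store.get(path)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: the origin is not contacted
        # Expired or never cached: the origin has to answer.
        body = self._fetch(path)  # may raise OriginUnavailable
        self._store[path] = (body, now + self._ttl)
        return body
```

In this model, an object cached shortly before the outage keeps being served only until its TTL runs out; the first request after that fails for as long as the origin stays unreachable.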
(In reply to Jean-Yves Perrier [:teoli] from comment #21)
> :phrawzty, could you confirm that all MDN pages will be redirected with a
> 302 to this static page? Thank you.
Confirmed as per comment #15. If a different behaviour is desired / expected, please let me know ASAP.
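The routing rule confirmed here (every MDN URL answers with a 302 to the static page) can be expressed as a small decision function. The sketch below is an illustration in Python rather than the Apache configuration actually deployed, which is not shown in this bug; `route`, `MAINTENANCE_PATH` and the asset set are hypothetical, with file names borrowed from the examples in comment 18.

```python
from urllib.parse import urlsplit

MAINTENANCE_PATH = "/index.html"  # hypothetical location of the maintenance page
STATIC_ASSETS = {"/index.html", "/main.css", "/logo.png"}  # example names from comment 18


def route(request_target):
    """Return (status, location) for a request during the outage."""
    path = urlsplit(request_target).path  # ignore any query string
    if path in STATIC_ASSETS:
        return 200, None  # part of the maintenance package: serve it directly
    # Anything else is a normal MDN URL: redirect temporarily.
    return 302, MAINTENANCE_PATH
```

A 302 is a temporary redirect, so clients and crawlers are expected to keep the original URLs and try them again after the maintenance window.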
Flags: needinfo?(dmaher)
Comment 23•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #22)
> (In reply to Jean-Yves Perrier [:teoli] from comment #21)
> > :phrawzty, could you confirm that all MDN pages will be redirected with a
> > 302 to this static page? Thank you.
>
> Confirmed as per comment #15. If a different behaviour is desired /
> expected, please let me know ASAP.
No, that's fine for me (with my SEO-hat on)! Google will not de-index us that way.
Thanks!
| Reporter | ||
Comment 24•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #22)
I left a comment on the static page bug (https://bugzilla.mozilla.org/show_bug.cgi?id=936452#c5) ... we may merge what we have now so we have *something* in place while we continue to work on copying the static assets down to the maintenance package.
Comment 25•12 years ago
Maintenance assets are attached. Should look like this when all set up:
http://i.imgur.com/PDVDPOU.png
Comment 27•12 years ago
That zip works fine. I'll set it up on the Static cluster now.
This bug is effectively satisfied. For further details on the temporary infra during the outage, please see bug 938672.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Flags: needinfo?(dmaher)
Resolution: --- → FIXED
| Reporter | ||
Comment 28•12 years ago
The new routing isn't temporarily bouncing all MDN url's to this outage page?
e.g., I clicked https://developer.mozilla.org/en-US/docs/User:wbamberg/Add-on_SDK and got a plain 404.
Severity: normal → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 29•12 years ago
Good catch Luke. I can confirm this.
Comment 30•12 years ago
Maintenance has terminated.
(In reply to Luke Crouch [:groovecoder] from comment #28)
> The new routing isn't temporarily bouncing all MDN url's to this outage page?
>
> e.g., I clicked
> https://developer.mozilla.org/en-US/docs/User:wbamberg/Add-on_SDK and got a
> plain 404.
Resolved by bug 938666. tl;dr CDN caching issue which actually happened *after* the maintenance was complete. Force-refreshing (or emptying the browser cache) clears the errant behaviour.
Severity: critical → normal
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard