935002 - Prepare MDN for SCL3 outage

Reporter

Description

•

12 years ago

1. Create a site-wide notification to display leading up to the outage
2. Create a static site status page to display during the outage

Luke Crouch [:groovecoder]

Reporter

Comment 1

•

12 years ago

Ali - can you decide who will write the site-wide notification copy (very soon), and who will create the site status page to display? And I'm okay if both of them are me, so long as we decide soon.

Flags: needinfo?(aspivak)

Eric Shepherd [:sheppy]

Comment 2

•

12 years ago

(In reply to Luke Crouch [:groovecoder] from comment #1)
> Ali - can you decide who will write the site-wide notification copy (very
> soon), and who will create the site status page to display? And I'm okay if
> both of them are me, so long as we decide soon.

I can write the text if you like. Just let me know any details I need about the timing and duration.

ali spivak

Comment 3

•

12 years ago

The MDN site will be partly or totally unavailable for approximately 6 hours (16:00-22:00 UTC; 8am-2pm Pacific) on November 16th

Flags: needinfo?(aspivak)

Eric Shepherd [:sheppy]

Comment 4

•

12 years ago

I've created the message. It works on the old look and feel but doesn't appear on the new; looks like a redesign bug. I'm filing.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Eric Shepherd [:sheppy]

Comment 5

•

12 years ago

I closed this prematurely; we don't have the static page up yet.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Daniel Maher [:phrawzty]

Comment 6

•

12 years ago

A quick comment regarding the nature of the maintenance page, which is to say the actual page that will be displayed during the SCL3 outage : this page and all of its assets must be self-contained and serve-able from a simple static endpoint. This page cannot depend on any pre-existing assets, content, or functionality normally available from MDN because MDN will not be available. I realise that this might have been obvious, but I figure better safe than sorry. :)

tl;dr Tarball me a simple maintenance page and I'll serve it from a bare-bones Apache server.

Daniel Maher [:phrawzty]

Comment 7

•

12 years ago

For the satisfaction of the interested parties and, eventually, the edification of the future chroniclers of our great era, it may be worth commenting on some of the reasons for which a maintenance page is to be used instead of guaranteeing some level of service delivery.

First off, it must be stated that a full-scale replication of the entire MDN platform at PHX1 is - given the complexity of the task and the time frame in which it would need to be implemented - simply not feasible. That said, this is not in itself a bad idea, and I would very much like to open this discussion point up going forward. There are no *technical* reasons why we can't do this - it's just a matter of time and resource allocation.

A partial replication of some MDN functionality is possible, and was discussed at some length between :phrawzty, :groovecoder, and other interested parties. The elements that would need to be replicated in order to attempt to deliver basic MDN services in a read-only capacity are :

* Zeus (load balancer) configuration
* Front-end web servers
* Netapp shares and contents thereof
* MySQL database(s)
* Memcached
* Network flows (ACLs)

Notably absent from the above list :

* Elasticsearch
* Celery
* Admin node
* Dev and Stage web nodes

Concerning the items which would need to be replicated, the Zeus configuration would be relatively straightforward, so that's not a big deal. Furthermore, it would likely be possible to use existing Memcached servers at PHX1, which further reduces the expected work load. Unfortunately, the remaining items do pose some problems from both a real-work perspective, as well as an infrastructural resource allocation standpoint.

The front-end web nodes would need to be either virtual machines or actual hardware. In the latter case, that hardware would need to be - at worst - ordered, purchased, shipped, racked, set up, and provisioned internally (inventory, DNS, etc). If the hardware is already available (as spares) at the data centre, it would still need to be turned on, configured, provisioned internally, and so forth. While there might be enough time to physically accomplish these goals, there would not afterwards be enough time to actually migrate and test anything - so that's a non-starter.

In the case of virtual machines, these are, as a general rule, much more easily provisioned - though they do still require some time. Unfortunately, the VM infrastructure at PHX1 is currently overly taxed, notably as regards allocatable RAM. This is an issue since each actual webhead currently in production has 16 GB of RAM at roughly 66 % utilisation, leading to a minimum 12 GB allocation for each VM, which is a dicey proposition for the time being. Technical limitations are the blocker in this case.

The production netapp share currently has a 30 GB allocation at, again, roughly 66 % utilisation. This is not particularly rough, but it does imply that a new netapp share and associated configuration bits and pieces would need to be set up at PHX1. The data would also need to be copied over well ahead of time. Human time constraints are the blocker in this case.

Concerning the database(s), to be perfectly honest, I'm not sure whether we would be able to use an existing database cluster at PHX1, or whether we'd have to set one up temporarily. In the former case, this step involves the relatively straightforward task of dumping and importing; in the latter case we're faced with the same dilemmas as with provisioning machines to act as webheads.

Finally, as regards the network configuration, while these are not particularly time or resource-heavy tasks, they cannot be undertaken until all of the actual infrastructure has been put into place. This adds yet another layer to the proverbial cake (which, in this case, is unfortunately not a lie).

So, to sum up, for an outage that *may* last as little as 2 hours (1+1d8 is the current estimate), we would need to build new infrastructure, involve at least 5 separate teams, and devote technical resources which may or may not actually be available. That's why I'm formally suggesting that a simple maintenance page (see comment 6) is the best course of action at this time.
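
For readers unfamiliar with dice notation, the "1+1d8" estimate can be unpacked with a short sketch. It assumes the usual tabletop reading (a constant plus the roll of one eight-sided die) and that the unit is hours, as the "as little as 2 hours" wording suggests; the function name is made up for illustration.

```python
from fractions import Fraction


def dice_estimate(constant, count, sides):
    """Return (minimum, maximum, mean) of constant + (count)d(sides)."""
    minimum = constant + count            # every die shows 1
    maximum = constant + count * sides    # every die shows its highest face
    mean = constant + count * Fraction(sides + 1, 2)  # a fair die averages (sides + 1) / 2
    return minimum, maximum, mean


low, high, mean = dice_estimate(constant=1, count=1, sides=8)
print(f"outage estimate: {low} to {high} hours, mean {float(mean)} hours")
```

Under that reading the estimate spans 2 to 9 hours, which brackets the roughly 6-hour window announced in comment 3.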

Eric Shepherd [:sheppy]

Comment 8

•

12 years ago

>So, to sum up, for an outage that *may* last as little as 2 hours (1+1d8 is the current estimate) For what it's worth, that is the most awesome way of stating an estimate of how long an outage will last that I've ever seen.

John Karahalis [:openjck]

Comment 9

•

12 years ago

Have we started to design the static page? When done well, static pages can be very valuable and even fun. Here are my ideas:

1. The following note somewhere on the page: "MDN is temporarily offline. In the meantime, why don't you try creating your first Firefox OS app? If you already have a web app, you'll probably finish before our maintenance does!" Below that, static documentation on building Firefox OS apps.
2. Tumbeasts somewhere on the page

John Karahalis [:openjck]

Comment 10

•

12 years ago

Another idea:

3. In the meantime, see dochub.io and Dash for copies of our documentation.

Luke Crouch [:groovecoder]

Reporter

Comment 11

•

12 years ago

:openjck - I think you just volunteered to work on the static page?

Ali - who will work with :openjck on the content for the static page?

Flags: needinfo?(aspivak)

John Karahalis [:openjck]

Comment 12

•

12 years ago

(In reply to Luke Crouch [:groovecoder] from comment #11)
> :openjck - I think you just volunteered to work on the static page?
>
> Ali - who will work with :openjck on the content for the static page?

Yeah, I would love to do the static page. Slightly concerned about timing with PR merging and the CSS migration, but we can talk about that offline.

Luke Crouch [:groovecoder]

Reporter

Updated

•

12 years ago

Depends on: 936452

Luke Crouch [:groovecoder]

Reporter

Updated

•

12 years ago

Whiteboard: [kb=1175877]

ali spivak

Comment 13

•

12 years ago

Sheppy will provide the content.

Flags: needinfo?(aspivak)

John Karahalis [:openjck]

Comment 14

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #6)
> A quick comment regarding the nature of the maintenance page, which is to
> say the actual page that will be displayed during the SCL3 outage : this
> page and all of its assets must be self-contained and serve-able from a
> simple static endpoint. This page cannot depend on any pre-existing assets,
> content, or functionality normally available from MDN because MDN will not
> be available. I realise that this might have been obvious, but I figure
> better safe than sorry. :)

Does this include the CDN? In other words, will assets served from the CDN also be unavailable?

Flags: needinfo?(dmaher)

Daniel Maher [:phrawzty]

Comment 15

•

12 years ago

(In reply to John Karahalis [:openjck] from comment #14)
> Does this include the CDN? In other words, will assets served from the CDN
> also be unavailable?

The maintenance will not affect the CDN directly; however, since DNS for MDN will be re-directed to PHX1, and all web requests will be re-written to show only the maintenance page, the net result is that the assets on the CDN will not be available.

Flags: needinfo?(dmaher)

Daniel Maher [:phrawzty]

Comment 16

•

12 years ago

I don't want to alarm anybody, but we're getting close to the outage, and I'd like to have the maintenance page set up and ready to go ahead of time. How close are we to having that prepared ?

Flags: needinfo?(lcrouch)

Luke Crouch [:groovecoder]

Reporter

Comment 17

•

12 years ago

We have a PR in for it, though I think it assumes CDN assets are available. https://github.com/mozilla/kuma/pull/1646

Flags: needinfo?(lcrouch)

John Karahalis [:openjck]

Comment 18

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #16)
> I don't want to alarm anybody, but we're getting close to the outage, and
> I'd like to have the maintenance page set up and ready to go ahead of time.
> How close are we to having that prepared ?

No problem, we can make that work. Will multiple assets (like index.html, main.css, logo.png, etc.) work out, or would you prefer we inline everything in the HTML?

Flags: needinfo?(dmaher)

Daniel Maher [:phrawzty]

Comment 19

•

12 years ago

(In reply to Luke Crouch [:groovecoder] from comment #17)
> We have a PR in for it, though I think it assumes CDN assets are available.
>
> https://github.com/mozilla/kuma/pull/1646

As I mentioned in comment #6, the entire page you wish to display to the end user, including every single image, CSS file, HTML file, or any other entity whatsoever, must be available from the temporary webserver. I apologise if I wasn't clear enough on this point. :(

(In reply to John Karahalis [:openjck] from comment #18)
> No problem, we can make that work. Will multiple assets (like index.html,
> main.css, logo.png, etc.) work out, or would you prefer we inline everything
> in the HTML?

Please feel free to organise the assets as you like (i.e. making everything inline is not a requirement); just be aware that they will be served from a simple webserver, so everything must be static. Please don't hesitate to ask if you have any more questions or concerns.
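
One way to check the self-contained requirement before tarballing the page is to scan the HTML for asset references that point at another host. The following is only a sketch under stated assumptions: the sample markup and the CDN paths in it are hypothetical, ordinary `<a>` hyperlinks are deliberately ignored because they are not page assets, and references inside CSS files (`url(...)`) are not inspected.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AssetCollector(HTMLParser):
    """Collect every (tag, value) pair for src/href attributes."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.references.append((tag, value))


def external_references(html_text):
    """Return asset URLs that would be fetched from another host."""
    collector = AssetCollector()
    collector.feed(html_text)
    external = []
    for tag, value in collector.references:
        if tag == "a":
            continue  # links the reader may click are not assets of the page
        parsed = urlparse(value)
        # Absolute (http/https) or protocol-relative (//host/...) references.
        if parsed.scheme in ("http", "https") or parsed.netloc:
            external.append(value)
    return external


# Hypothetical maintenance page; the file names and CDN paths are made up.
sample = """
<html><head>
<link rel="stylesheet" href="main.css">
<link rel="stylesheet" href="https://developer.cdn.mozilla.net/media/css/mdn.css">
</head><body>
<img src="logo.png">
<script src="//developer.cdn.mozilla.net/media/js/main.js"></script>
<a href="http://dochub.io/">DocHub</a>
</body></html>
"""

for url in external_references(sample):
    print("not self-contained:", url)
```

An empty result means every stylesheet, script, and image reference in the HTML is relative and can be shipped inside the maintenance package.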

Flags: needinfo?(dmaher)

Luke Crouch [:groovecoder]

Reporter

Comment 20

•

12 years ago

:phrawzty - will CDN assets really be unavailable during the outage? A local traceroute from my box to developer.cdn.mozilla.net sent me straight from Cox (my ISP) to Akamai - never seemed to hit SCL3.

Flags: needinfo?(dmaher)

Jean-Yves Perrier [:teoli]

Comment 21

•

12 years ago

:phrawzty, could you confirm that all MDN pages will be redirected with a 302 to this static page? Thank you.

Daniel Maher [:phrawzty]

Comment 22

•

12 years ago

(In reply to Luke Crouch [:groovecoder] from comment #20)
> :phrawzty - will CDN assets really be unavailable during the outage? A local
> traceroute from my box to developer.cdn.mozilla.net sent me straight from
> Cox (my ISP) to Akamai - never seemed to hit SCL3.

The way that CDNs work is that they cache content for a certain amount of time. When the cache for a given object expires, the CDN will attempt to re-obtain the content from the source - in this case, webservers at SCL3. If the cache expires, and the source is unavailable, then the object in question will also be unavailable. You're welcome to roll the dice and hope that your content doesn't expire during the window, though that wouldn't be my suggestion.

(In reply to Jean-Yves Perrier [:teoli] from comment #21)
> :phrawzty, could you confirm that all MDN pages will be redirected with a
> 302 to this static page? Thank you.

Confirmed as per comment #15. If a different behaviour is desired / expected, please let me know ASAP.
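
The cache-expiry behaviour described above can be modelled with a toy edge cache. This is a simplified illustration, not a description of how Akamai is actually configured: the class names, the 300-second TTL, and the asset path are assumptions, and a real CDN may be set up to keep serving stale objects when the origin is down.

```python
class Origin:
    """Stand-in for the SCL3 web servers; can be switched off."""

    def __init__(self):
        self.available = True

    def fetch(self, path):
        if not self.available:
            raise ConnectionError("origin unavailable")
        return f"contents of {path}"


class EdgeCache:
    """Toy CDN edge: serves cached objects until their TTL runs out."""

    def __init__(self, fetch_from_origin, ttl_seconds):
        self.fetch_from_origin = fetch_from_origin
        self.ttl_seconds = ttl_seconds
        self.entries = {}  # path -> (body, expiry timestamp)

    def get(self, path, now):
        cached = self.entries.get(path)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh, origin is not contacted
        # Expired or never cached: this raises if the origin is down.
        body = self.fetch_from_origin(path)
        self.entries[path] = (body, now + self.ttl_seconds)
        return body


origin = Origin()
cache = EdgeCache(origin.fetch, ttl_seconds=300)  # TTL value is an assumption

cache.get("/media/css/mdn.css", now=0)    # origin up: object gets cached
origin.available = False                   # the outage begins
cache.get("/media/css/mdn.css", now=100)  # fresh copy is still served
try:
    cache.get("/media/css/mdn.css", now=400)  # TTL expired, origin down
except ConnectionError as error:
    print("asset unavailable:", error)
```

This is why a traceroute that ends at the CDN says little about availability: the edge answers on its own only while its cached copy is still fresh.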

Flags: needinfo?(dmaher)

Jean-Yves Perrier [:teoli]

Comment 23

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #22)
> (In reply to Jean-Yves Perrier [:teoli] from comment #21)
> > :phrawzty, could you confirm that all MDN pages will be redirected with a
> > 302 to this static page? Thank you.
>
> Confirmed as per comment #15. If a different behaviour is desired /
> expected, please let me know ASAP.

No, that's fine for me (with my SEO-hat on)! Google will not de-index us that way. Thanks!
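
The redirect behaviour confirmed here can be summarised as a small routing rule: anything that is not part of the maintenance package gets a temporary (302) redirect to the static page. The sketch below is illustrative only; the `/maintenance/` paths and the function name are hypothetical, and the real setup was an Apache server whose configuration is not shown in this bug.

```python
MAINTENANCE_PREFIX = "/maintenance/"                   # hypothetical asset directory
MAINTENANCE_PAGE = MAINTENANCE_PREFIX + "index.html"   # hypothetical page location


def route(path):
    """Return (status, headers) for a request made during the outage."""
    if path.startswith(MAINTENANCE_PREFIX):
        # The page and its bundled assets are served directly.
        return 200, {}
    # Everything else is redirected temporarily, so the original URL
    # stays the canonical one once the outage is over.
    return 302, {"Location": MAINTENANCE_PAGE}


print(route("/en-US/docs/Web/HTML"))
print(route("/maintenance/main.css"))
```

Exempting the maintenance assets from the rule avoids a redirect loop when the page loads its own CSS and images.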

Luke Crouch [:groovecoder]

Reporter

Comment 24

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #22) I left a comment on the static page bug (https://bugzilla.mozilla.org/show_bug.cgi?id=936452#c5) ... we may merge what we have now so we have *something* in place while we continue to work on copying the static assets down to the maintenance package.

John Karahalis [:openjck]

Comment 25

•

12 years ago

Attached file: Maintenance page

Maintenance assets are attached. Should look like this when all set up: http://i.imgur.com/PDVDPOU.png

Luke Crouch [:groovecoder]

Reporter

Comment 26

•

12 years ago

Daniel - does this .zip work for you?

Flags: needinfo?(dmaher)

Daniel Maher [:phrawzty]

Comment 27

•

12 years ago

That zip works fine. I'll set it up on the Static cluster now. This bug is effectively satisfied. For further details on the temporary infra during the outage, please see bug 938672.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Flags: needinfo?(dmaher)

Resolution: --- → FIXED

Luke Crouch [:groovecoder]

Reporter

Comment 28

•

12 years ago

The new routing isn't temporarily bouncing all MDN URLs to this outage page?

e.g., I clicked https://developer.mozilla.org/en-US/docs/User:wbamberg/Add-on_SDK and got a plain 404.

Severity: normal → critical

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

John Karahalis [:openjck]

Comment 29

•

12 years ago

Good catch Luke. I can confirm this.

Daniel Maher [:phrawzty]

Comment 30

•

12 years ago

Maintenance has terminated.

(In reply to Luke Crouch [:groovecoder] from comment #28)
> The new routing isn't temporarily bouncing all MDN url's to this outage page?
>
> e.g., I clicked
> https://developer.mozilla.org/en-US/docs/User:wbamberg/Add-on_SDK and got a
> plain 404.

Resolved by bug 938666.

tl;dr CDN caching issue which actually happened *after* the maintenance was complete. Force-refreshing (or emptying the browser cache) clears the errant behaviour.

Severity: critical → normal

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: developer.mozilla.org → developer.mozilla.org Graveyard