Closed Bug 1142639 Opened 11 years ago Closed 10 years ago

MDN Planned down-time for RabbitMQ P2V

Categories

(developer.mozilla.org Graveyard :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Unassigned)


We are taking MDN down while we perform the physical-to-virtual (P2V) migration of the RabbitMQ servers: https://bugzilla.mozilla.org/show_bug.cgi?id=1078752

As part of this, we need to:

1. Create site-wide notices of the planned down-time
2. Update the maintenance page (bug 1141314)
3. Redirect all traffic to the maintenance page
:cyliang - can you give an approximate start for the down-time so we can add it to our site-wide notices?

:cyliang - can you help us make sure the redirect will go to *our* down-time page (which includes helpful links for visitors to get MDN docs offline), and not just the generic mozilla hard-hat page?
Flags: needinfo?(cliang)
* The last CAB meeting rearranged the work during the tree-closing window so that the P2V work will take place *before* the PHX1 networking work. We're slated to do the P2V of the rabbit servers at 9 AM EDT / 7 AM PDT.
* If the maintenance page has a specific URL, I can handle enabling / disabling a redirect to that page from the load balancer. I just need to know what the maintenance URL should be. =)
Flags: needinfo?(cliang)
:cyliang - I just merged the updated maintenance page to our repo here: https://github.com/mozilla/kuma/tree/master/maintenance Can you grab that and just serve it statically from another web host/server? Or could/should we do it with our own Apache server? Are we taking the entire Apache server down during the RabbitMQ downtime?
Flags: needinfo?(cliang)
In addition to showing the maintenance page when a user visits the homepage, can we also show the page for all other paths? In other words, can developer.mozilla.org/* either directly load or temporarily redirect to the page?
@groovecoder: I wasn't planning on taking down the MDN Apache web server. If it's easier for me to host the static page on another server, I believe I can do so. If it's easy enough for you to host it in some corner of the MDN web servers, I can point there just as well.

@openjck: Any HTTP / HTTPS request attempting to go to the developer.mozilla.org IP address will be rewritten to go to the maintenance URL, so it *should* catch all paths.
Flags: needinfo?(cliang)
I just submitted https://github.com/mozilla/kuma/pull/3116 which puts the maintenance site into our own media/ directory. That would allow us to redirect all MDN traffic to: https://developer.mozilla.org/media/maintenance/ which will be served by Apache as a static page/site.
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/3421193a134052e002dc7bef80ffcdb547a3610b
bug 1142639 - move maintenance/ to media/ for static serving

https://github.com/mozilla/kuma/commit/260f0290572f56bc33e0545b748be5c61ba9933b
Merge pull request #3116 from groovecoder/move-maintenance-to-media-1142639

bug 1142639 - move maintenance/ to media/ for static serving
https://developer.mozilla.org/media/maintenance/ is live, so we can plan to redirect all MDN traffic there and let Apache serve it as a static page.
Right now, I have a rule on the load balancer for MDN that says:

    If header "bunnies" == "true" then redirect to maintenance page

This appears to be working. (See below.) For the outage, I'll just remove the header check; once the outage is over, I'll just make the rule inactive.

cliang-07757:~ cliang$ curl -kI https://developer.mozilla.org/
HTTP/1.1 301 MOVED PERMANENTLY
Server: Apache
Vary: Accept-Language, Accept-Encoding
X-Backend-Server: developer1.webapp.scl3.mozilla.com
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Credentials: false
Date: Fri, 13 Mar 2015 20:26:43 GMT
Location: https://developer.mozilla.org/en-US/
Transfer-Encoding: chunked
Access-Control-Allow-Origin: *
X-Frame-Options: DENY
Access-Control-Allow-Methods: GET
Connection: Keep-Alive
X-Cache-Info: cached

cliang-07757:~ cliang$ curl -kI https://developer.mozilla.org/en-US/Firefox/
HTTP/1.1 301 MOVED PERMANENTLY
Server: Apache
Vary: Cookie, Accept-Encoding
X-Backend-Server: developer1.webapp.scl3.mozilla.com
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Credentials: false
Date: Fri, 13 Mar 2015 20:33:54 GMT
Location: https://developer.mozilla.org/en-US/Firefox
Transfer-Encoding: chunked
Access-Control-Allow-Origin: *
Connection: Keep-Alive
X-Frame-Options: DENY
Access-Control-Allow-Methods: GET
X-Cache-Info: caching

cliang-07757:~ cliang$ curl -H "bunnies: true" -kI https://developer.mozilla.org/
HTTP/1.1 302 Moved Temporarily
Content-Type: text/html
Date: Fri, 13 Mar 2015 20:34:05 GMT
Location: https://developer.mozilla.org/media/maintenance/
Connection: Keep-Alive
Content-Length: 0

cliang-07757:~ cliang$ curl -H "bunnies: true" -kI https://developer.mozilla.org/en-US/Firefox/
HTTP/1.1 302 Moved Temporarily
Content-Type: text/html
Date: Fri, 13 Mar 2015 20:34:12 GMT
Location: https://developer.mozilla.org/media/maintenance/
Connection: Keep-Alive
Content-Length: 0
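For anyone reading this later, the three states of the rule described above (header-gated for testing, applied to all traffic during the outage, inactive afterwards) can be modelled roughly as follows. This is only an illustrative Python sketch, not the actual load balancer configuration, which is not shown in this bug. The names `RuleMode` and `maintenance_redirect` are hypothetical, and the exemption for the maintenance path itself is an assumption added here to avoid a redirect loop; the thread does not say how the real rule handled that.

```python
from enum import Enum
from typing import Mapping, Optional, Tuple

# URL taken from the thread; everything else in this sketch is hypothetical.
MAINTENANCE_URL = "https://developer.mozilla.org/media/maintenance/"
MAINTENANCE_PATH = "/media/maintenance/"


class RuleMode(Enum):
    HEADER_GATED = "header"  # before the outage: only requests carrying the test header
    ALL_TRAFFIC = "all"      # during the outage: header check removed
    INACTIVE = "inactive"    # after the outage: rule disabled


def maintenance_redirect(
    path: str, headers: Mapping[str, str], mode: RuleMode
) -> Optional[Tuple[int, str]]:
    """Return (status, location) if the request should be redirected, else None."""
    if mode is RuleMode.INACTIVE:
        return None

    # Assumption (not stated in the thread): requests for the maintenance
    # page itself pass through, otherwise the redirect would loop.
    if path.startswith(MAINTENANCE_PATH):
        return None

    if mode is RuleMode.HEADER_GATED:
        # HTTP header names are case-insensitive, so normalize before lookup.
        normalized = {name.lower(): value for name, value in headers.items()}
        if normalized.get("bunnies") != "true":
            return None

    # 302 matches the "Moved Temporarily" responses in the curl output above.
    return (302, MAINTENANCE_URL)
```

With this model, `maintenance_redirect("/en-US/Firefox/", {"bunnies": "true"}, RuleMode.HEADER_GATED)` corresponds to the last curl test above, and switching to `RuleMode.ALL_TRAFFIC` corresponds to removing the header check for the outage.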
Can we close this?
Flags: needinfo?(lcrouch)
Yup. Here are the emails from the start and end of the down-time ...

Start
=====
We have started the RabbitMQ P2V as part of this maintenance window. MDN is redirecting to our updated maintenance page. We have stopped the celery processes on the production cluster.

We see that the stage celery processes use the same RabbitMQ cluster as production :( so we're getting lots of "connection closed unexpectedly" emails from the site re-render task that's going on there. For now, we're going to let it run, expecting the P2V can finish in 10-15 minutes and the broker will be back up for those tasks. If the cluster down-time stretches past 15 minutes, we may kill the stage re-render process. I'll update this thread.

End
===
The RabbitMQ P2V is done. (Thanks cyliang!) MDN is back up. The RabbitMQ queues (prod & stage) are active again. Production celery tasks are starting and completing. The stage errors have stopped. Total time from first error to last was 25m.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(lcrouch)
Resolution: --- → FIXED
Product: developer.mozilla.org → developer.mozilla.org Graveyard