Closed Bug 765670 Opened 13 years ago Closed 13 years ago

3crowd-hosted websites are unavailable due to DNS issues

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: nmaul)

Details

(Whiteboard: [outage][treeclosure])

Websites that use 3crowd for DNS are returning 404. 3crowd's DNS is returning SERVFAILs:

~ $ dig bugzilla.3crowd.mozilla.net

; <<>> DiG 9.7.3-P3 <<>> bugzilla.3crowd.mozilla.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13995
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;bugzilla.3crowd.mozilla.net. IN A

;; Query time: 1516 msec
;; SERVER: 203.0.178.191#53(203.0.178.191)
;; WHEN: Mon Jun 18 16:25:04 2012
;; MSG SIZE rcvd: 45

The current list of affected sites includes:
* www.mozilla.org
* mxr.mozilla.org
* bugzilla.mozilla.org
* developer.mozilla.org

I've opened a support request with 3crowd but haven't heard back yet.
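For anyone wanting to repeat the dig check across several hostnames at once, here is a minimal sketch in Python. It is not the monitoring that was actually used for this outage. The function name `check_resolution` and the injectable `resolver` parameter are hypothetical, and the check only reports whether the local resolver returned an address; it does not distinguish SERVFAIL from NXDOMAIN.

```python
import socket
from typing import Callable, Dict, Iterable


def check_resolution(
    hostnames: Iterable[str],
    resolver: Callable = socket.getaddrinfo,
) -> Dict[str, bool]:
    """Return {hostname: True/False} depending on whether the name resolves.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    function can be exercised without touching the network (an assumption of
    this sketch, not something from the bug).
    """
    results: Dict[str, bool] = {}
    for name in hostnames:
        try:
            # Port 80 is arbitrary; only the name lookup matters here.
            resolver(name, 80)
            results[name] = True
        except socket.gaierror:
            # Raised for any lookup failure, including an upstream SERVFAIL.
            results[name] = False
    return results


if __name__ == "__main__":
    # Hostname taken from the dig output above.
    for host, ok in check_resolution(["bugzilla.3crowd.mozilla.net"]).items():
        print(f"{host}: {'resolves' if ok else 'FAILED'}")
```

Running it from more than one network location is worthwhile, since a geo-balanced DNS service can fail in one region while answering normally in another.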
Escalated within XDN, email cc'd infra.
Assignee: server-ops → ashish
sumo is "affected" only because it fetches some css/js objects from www.mozilla.org - http://www.mozilla.org/tabzilla/media/css/tabzilla.css and http://www.mozilla.org/tabzilla/media/js/tabzilla.js (from catchpoint)
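To spot this kind of indirect dependency for other sites, one could filter a page's asset URLs by host. The following is a rough sketch only; the function name `assets_on_hosts` is hypothetical, and the list of asset URLs is assumed to come from some external source such as a monitoring report.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def assets_on_hosts(
    asset_urls: Iterable[str],
    affected_hosts: Iterable[str],
) -> Dict[str, List[str]]:
    """Group asset URLs by host, keeping only hosts in `affected_hosts`."""
    affected = {h.lower() for h in affected_hosts}
    hits: Dict[str, List[str]] = {}
    for url in asset_urls:
        host = (urlparse(url).hostname or "").lower()
        if host in affected:
            hits.setdefault(host, []).append(url)
    return hits


if __name__ == "__main__":
    # The two URLs are the ones named in the comment above.
    urls = [
        "http://www.mozilla.org/tabzilla/media/css/tabzilla.css",
        "http://www.mozilla.org/tabzilla/media/js/tabzilla.js",
    ]
    print(assets_on_hosts(urls, ["www.mozilla.org"]))
```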
XDN are still investigating:
03:32:54 < alan-3crowd> | We are seeeing some issues wtih our london servers, investigating.
03:37:02 < alan-3crowd> | it looks likely that they are in a poor state
03:37:08 < alan-3crowd> | will continue to investigate and report back
Got the first set of all recoveries after XDN made some changes to their EMEA GSLB. Going to stand by for at least 30 mins before giving an all-clear.
Recoveries from all the failed tests. XDN will send out an RFO after they're done diagnosing related issues. I'll keep this bug open until then.
Status: NEW → ASSIGNED
There needs to be a quicker fix to issues like this. Why not pull out of XDN and self-host until XDN has resolved? Isn't that a better state than being down?
(In reply to matthew zeier [:mrz] from comment #6)
> There needs to be a quicker fix to issues like this. Why not pull out of
> XDN and self-host until XDN has resolved? Isn't that a better state than
> being down?
We considered this before but held back considering the different services that are hosted off geodns. 3crowd have escalated this all to their VP Engineering. Waiting for an update from them before we take a call on this.
Catchpoint has recovered from all the alerts. I'm reassigning this back to the queue for tracking.
Assignee: ashish → server-ops
Severity: critical → normal
An update is missing here, probably because oncall cannot resolve bugzilla. This is an issue again, now in North America. 3crowd is working on it from their end, and we are working on moving away from them ASAP.
:jakem had been working on the issue before :cshields' update.
Group: infra
Currently the following sites should be working:
1. www.mozilla.org in all regions
2. ftp.mozilla.org
3. support.mozilla.org
We'll continue to update as further progress is made.
Trees were closed from 1:30pm Pacific (but there was bustage before that), reopened at 2:35pm.
Assignee: server-ops → bburton
The following sites should now also be working:
1. bugzilla.mozilla.org
2. *.bugzilla.mozilla.org (attachments)
3. download.mozilla.org
The following site should also be working:
1. start.mozilla.org
At this time all major sites have been transitioned to Dynect. :jakem will update as he finishes a few smaller sites.
Assignee: bburton → nmaul
The original issue here is resolved... many sites are moved, and 3crowd has restored service to normal. Please see bug 765745 for information on future migrations.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [outage][treeclosure]
Product: mozilla.org → mozilla.org Graveyard