Closed Bug 562171 Opened 14 years ago Closed 14 years ago

Drumbeat.org web site experiencing intermittent outages

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mthompson2000, Assigned: justdave)

Details

* The Drumbeat.org site is presently down, as of approx 5:10 pm ET.
* The site was also down earlier today from approx 1:10 pm ET to 1:20pm ET. 
* Checked with developers at Trellon -- they said it looks like a server issue on our side.
* It now appears to be back up as of 5:20pm ET
* We appear to be experiencing temporary outages today. Let us know if there's any testing, maintenance, or other potential cause.
Severity: critical → normal
Summary: Drumbeat.org web site is down → Drumbeat.org web site experiencing intermittent outages
* Drumbeat.org is down again at 10:15am ET
* Confirmed on http://downforeveryoneorjustme.com/drumbeat.org
* Trellon developers say this isn't an issue on their side -- allege it's a server issue
* This is obviously a big deal for us. Status report appreciated.
Severity: normal → critical
Assignee: server-ops → shyam
Seems like there's something overloading the box. I can't even log in.

It's happened at around the same time over the last 2 days and it recovers on its own. Do you guys have a cron job or something running around this time?
Also, it seems like pm-drumbeat01 is a VM. What kind of traffic have you been seeing on the site of late?
Shyam: stats are here:
https://metrics.mozilla.com/awstats/bin/awstats.pl?config=drumbeat

Traffic went through the roof on the 23rd of April. We're now getting up to 600,000 hits a day. That, not coincidentally, is the day we started linking to drumbeat.org from a Firefox start page snippet.

Gerv
That level of exposure makes the site going down a blocker, IMO.

Gerv
Severity: critical → blocker
No wonder.

I'm seeing massive spikes on the CPU/Memory graphs on the VM.

Does the site have searchable stuff? How scalable is that? It looks to me like someone is able to search for stuff and hammer the site/DB and bring the VM to its knees.  

Also, how scalable is the app? I don't think we can keep running this on a single VM for long if we are seeing legitimate traffic.
08:20:39 up 45 days, 17:15,  1 user,  load average: 213.68, 218.22, 220.95

Mem:   2075392k total,  2025504k used,    49888k free,     1996k buffers
Swap:  2097144k total,  1977336k used,   119808k free,   117276k cached

I'm going to have to reboot to recover from this. The machine's used up all its RAM and swap and is practically dead.
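
For reference, the Mem and Swap lines above can be turned into utilization percentages. The following is a minimal sketch that parses only those two pasted lines; the function name `parse_usage` is illustrative and not part of any tool used in this bug.

```python
import re

# The two lines pasted from the VM above, copied verbatim.
TOP_OUTPUT = """\
Mem:   2075392k total,  2025504k used,    49888k free,     1996k buffers
Swap:  2097144k total,  1977336k used,   119808k free,   117276k cached
"""


def parse_usage(text):
    """Return {label: (total_k, used_k, percent_used)} for each Mem/Swap line."""
    pattern = re.compile(r"^(Mem|Swap):\s+(\d+)k total,\s+(\d+)k used", re.MULTILINE)
    usage = {}
    for label, total, used in pattern.findall(text):
        total, used = int(total), int(used)
        usage[label] = (total, used, 100.0 * used / total)
    return usage


usage = parse_usage(TOP_OUTPUT)
for label, (total, used, pct) in usage.items():
    print(f"{label}: {used}k of {total}k used ({pct:.1f}%)")
```

Both figures come out well above 90%, which fits the description of a box that has used up its RAM and swap.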
Rebooted the VM, site is back online.

Looking to see what the problem could have been on IRC.
pingers is looking into this. They're planning to apply the 6.16 update to Drupal and hope that will fix the issue here.

Since they're taking point on this, nothing more from IT's side at this point.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Site is down again as of 12:10pm ET
Severity: blocker → critical
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #5)
> Shyam: stats are here:
> https://metrics.mozilla.com/awstats/bin/awstats.pl?config=drumbeat
> 
> Traffic went through the roof on the 23rd of April. We're now getting up to
> 600,000 hits a day. That, not coincidentally, is the day we started linking to
> drumbeat.org from a Firefox start page snippet.

Why would we do that without notification to IT?  Who approved doing that?

drumbeat's not on any infrastructure that can handle load like that.  It's a single server, no redundancy.

This has to be pulled from the start page now.
MRZ: good point and my fault as I authorized. Clearly I didn't think through the limitations or downstream implications. 

In terms of solutions, my understanding is we can't easily pull from the start page -- takes days or weeks for google to react to our requests.

My proposal is that we create a page w/ similar content on www.mozilla.org and then redirect there. We've put snippets at mozilla.org in the past and it's been fine.

Does this solve the problem from IT's perspective? Matt and David, possible to get something up relatively quickly?

FWIW, it should be easy to avoid mistakes like this once Paul starts next week. We'll have someone on our team to push risks and limitations like this in front of my face before pulling the trigger.

In any case, sorry for creating this firedrill for you guys.
(In reply to comment #13)
> In terms of solutions, my understanding is we can't easily pull from the start
> page -- takes days or weeks for google to react to our requests.

In my understanding, we have direct control over these because they're on our server and Google syncs them relatively frequently.  jslater or pascalc would know for sure.
David Boswell and I are working on creating that page for re-direct now. Should have an ETA shortly.
We have control over snippet creation, since snippets are stored in our SVN repo, but little control over syncing with Google's servers. We can have snippets changed only once a month. I can change the snippet in our SVN repo in a matter of hours, but somebody in MV will have to make a special request to Google to have it published, since we already used our April update.

Another solution would be to redirect the requested page to a temporary static html page, that would probably remove a lot of the load (no mysql, no php).
* We are creating a temporary page at mozilla.org/causes/subtitles now
* ETA for having that page live is 1:30pm PT
* Will need a re-direct from 
drumbeat.org/universal-subtitles to 
mozilla.org/causes/subtitles
Severity: critical → blocker
Status: REOPENED → NEW
* We have some temporary placeholder text up now at: 
http://www.mozilla.org/causes/subtitles.html
"Universal Subtitles coming soon! please check back later."

* Please enable the re-direct now. Will have proper HTML page up there soon.

re-direct from:
drumbeat.org/project/universal-subtitles
to: 
mozilla.org/causes/subtitles
Severity: blocker → critical
Done.

trinity:~ shyam$ curl -L -I -H "Host: www.drumbeat.org" http://pm-drumbeat01.mozilla.org/project/universal-subtitles
HTTP/1.1 302 Found
Date: Wed, 28 Apr 2010 19:31:52 GMT
Server: Apache
Location: http://www.mozilla.org/causes/subtitles
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 28 Apr 2010 19:31:53 GMT
Server: Apache
Content-Location: subtitles.html
Vary: negotiate
TCN: choice
X-Powered-By: PHP/5.2.9
Cache-Control: max-age=900
Expires: Wed, 28 Apr 2010 19:46:53 GMT
X-Backend-Server: pm-web01
Content-Type: text/html; charset=UTF-8
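
The same check can be scripted against saved header output. This is a rough sketch that works offline on an abridged copy of the `curl -L -I` output above; the helper name `parse_hops` is made up for illustration, and the expected Location value is taken from the redirect request in this bug.

```python
# Abridged copy of the header output above (one block per HTTP response).
CURL_OUTPUT = """\
HTTP/1.1 302 Found
Date: Wed, 28 Apr 2010 19:31:52 GMT
Server: Apache
Location: http://www.mozilla.org/causes/subtitles
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 28 Apr 2010 19:31:53 GMT
Server: Apache
Content-Location: subtitles.html
X-Backend-Server: pm-web01
Content-Type: text/html; charset=UTF-8
"""


def parse_hops(text):
    """Split header output into (status_code, headers) pairs, one per response."""
    hops = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        status = int(lines[0].split()[1])  # "HTTP/1.1 302 Found" -> 302
        headers = {}
        for line in lines[1:]:
            # Split on the first colon only, so URLs and times stay intact.
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        hops.append((status, headers))
    return hops


hops = parse_hops(CURL_OUTPUT)
first_status, first_headers = hops[0]
final_status, _ = hops[-1]

redirect_ok = (
    first_status == 302
    and first_headers.get("location") == "http://www.mozilla.org/causes/subtitles"
    and final_status == 200
)
print("redirect in place" if redirect_ok else "redirect NOT in place")
```

Note that this only confirms what one backend returned at the time of the capture. It says nothing about whether the change will persist.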
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
If you need anything else, either file a new bug or reopen and re-assign to server-ops. Past 0330 here and I'm going to crash.
(In reply to comment #14)
> (In reply to comment #13)
> > In terms of solutions, my understanding is we can't easily pull from the start
> > page -- takes days or weeks for google to react to our requests.
> 
> In my understanding, we have direct control over these because they're on our
> server and Google syncs them relatively frequently.  jslater or pascalc would
> know for sure.

http://www.google.com/firefox is most certainly *not* on our servers; were you thinking of our first run or what's new pages?
FWIW, I'm not seeing the redirect yet.  Not sure if it takes a while to propagate to a bunch of servers?
* The re-direct seemed to be working earlier, but no longer appears to be working.
* When I click on the link in the start page snippet, still takes me to
http://www.drumbeat.org/project/universal-subtitles/
* Still need the snippet to re-direct to 
http://www.mozilla.org/causes/subtitles
Assignee: shyam → server-ops
Severity: critical → normal
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #21)
> (In reply to comment #14)
> > In my understanding, we have direct control over these because they're on
> > our server and Google syncs them relatively frequently.  jslater or pascalc
> > would know for sure.
> 
> http://www.google.com/firefox is most certainly *not* on our servers; were you
> thinking of our first run or what's new pages?

The directory that Google syncs them from is on our servers, and is what I was referring to.  But it's moot, because Google still has to be told to sync them if it's not staged for the beginning of the month.

(In reply to comment #23)
> * The re-direct seemed to be working earlier, but no longer appears to be
> working.

Looks like he changed it locally on the box instead of in puppet, and puppet changed it back on the next pass.
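
To illustrate why a hand edit on the box disappears, here is a toy model of a configuration-management pass. It is not Puppet's API or manifest language; `config_pass`, the dictionaries, and the use of `None` to mean "no redirect" are all assumptions made for the sketch.

```python
def config_pass(managed, live):
    """One toy agent run: force every managed key on the host back to its managed value.

    Returns the keys whose live values were overwritten.
    """
    overwritten = []
    for key, desired in managed.items():
        if live.get(key) != desired:
            live[key] = desired
            overwritten.append(key)
    return overwritten


PATH = "/project/universal-subtitles"
TARGET = "http://www.mozilla.org/causes/subtitles"

# The central (managed) definition and the host's live config both start with no redirect.
managed = {PATH: None}
live = {PATH: None}

# Fix applied by hand on the box only: it works until the next pass.
live[PATH] = TARGET
reverted = config_pass(managed, live)
after_local_edit = live[PATH]  # the managed value (no redirect) wins

# Fix applied to the managed definition: the next pass installs it, later passes keep it.
managed[PATH] = TARGET
config_pass(managed, live)
config_pass(managed, live)
after_managed_edit = live[PATH]

print("after local-only edit:", after_local_edit)
print("after managed edit:", after_managed_edit)
```

In this model the local-only change is undone on the first pass, while the change made in the managed definition survives repeated passes.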
Assignee: server-ops → justdave
OK, redirect should be working again now (pushed via puppet this time).
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Matt: is this fixed?  Have you noticed any subsequent outages?  Netcraft doesn't have uptime availability for drumbeat.org.  Thanks!
No outages since the re-direct took effect. Thanks guys! :)
Verified FIXED per comment 27.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard