Closed Bug 662194 Opened 13 years ago Closed 13 years ago

[stage] Staging env returns a 500 error

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mbrandt, Assigned: nmaul)

References

()

Details

(Whiteboard: [stage])

Attachments

(1 file)

Attached image Fail message
The staging environment appears to be down.

Steps to reproduce:
1. Go to http://input.stage.mozilla.com/

Actual: The site displays a "Service Unavailable" message.
Blocks: 661594
Assignee: nobody → server-ops
Component: Backend → Server Operations
Product: Input → mozilla.org
QA Contact: backend → mrz
Version: unspecified → other
Assignee: server-ops → phong
mrapp-stage02 is not responding (nor is its out of band). Phong is en route to check on it.
Assignee: phong → nmaul
This server is responding again, and we're looking into what happened. One thing I can tell you, however, is that there appears to be a problem with http://input.stage.mozilla.com/en-US/. When I attempt to visit that page, the Apache process handling my request shoots up to 100% CPU usage and hangs there.
Dropping priority since the server is working again and this is now "just" a problem with input.stage, plus some investigative work.
Severity: blocker → major
Priority: P1 → --
Let me know if it's something strange on our end.
I do believe it is some type of coding issue, but I can't pinpoint it more specifically than just the URL: http://input.stage.mozilla.com/en-US/

This feels like it has stuck the server in some type of infinite loop. By that I mean, when someone hits that page, I can watch an Apache worker suddenly jump up to 100% CPU usage and stay there until someone kills it. In a browser, you'll get a Zeus 500 ISE error after 30 seconds or so, but the Apache worker on the server keeps going for as long as 5 minutes (the longest I've seen one run before it was killed manually).

The main Apache error_log has nothing, and the same goes for input.stage.mozilla.com's error_log. I doubt it's emailing you stack traces or anything, either.

I believe what happened is that enough of these got loaded up at once that the server simply became entirely unresponsive, until the kernel "Out of Memory Killer" killed off one of the offending httpd processes. This happened twice yesterday morning, twice this morning, and once this afternoon (1:59:05pm MV time).

Note that other languages behave the same way (http://input.stage.mozilla.com/es/), but other pages do work (http://input.stage.mozilla.com/en-US/feedback).
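For anyone who wants to watch for this without sitting on the box, here is a rough sketch of how the stuck workers described above could be picked out of a process listing. This is not what was actually run on mrapp-stage02; it assumes the listing comes from something like `ps -eo pid,pcpu,etimes,comm` (the `etimes` column needs a reasonably recent procps), and the names `Worker`, `parse_ps`, and `runaway_workers` plus the thresholds are made up for illustration. Note that `ps` reports CPU averaged over the life of the process, so a worker that sat idle for a long time before hanging may read lower than what `top` shows.

```python
from typing import List, NamedTuple


class Worker(NamedTuple):
    pid: int
    cpu_percent: float
    elapsed_seconds: int
    command: str


def parse_ps(output: str) -> List[Worker]:
    """Parse text in the shape of `ps -eo pid,pcpu,etimes,comm` output.

    Lines that do not start with a numeric PID (such as the header row)
    are skipped.
    """
    workers = []
    for line in output.splitlines():
        fields = line.split(None, 3)
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        pid, pcpu, etimes, comm = fields
        workers.append(Worker(int(pid), float(pcpu), int(etimes), comm.strip()))
    return workers


def runaway_workers(
    workers: List[Worker],
    command: str = "httpd",      # assumed process name for Apache workers
    min_cpu: float = 90.0,       # hypothetical threshold, tune to taste
    min_elapsed: int = 60,       # hypothetical threshold, in seconds
) -> List[Worker]:
    """Return workers of the given command that look pegged on CPU."""
    return [
        w for w in workers
        if w.command == command
        and w.cpu_percent >= min_cpu
        and w.elapsed_seconds >= min_elapsed
    ]
```

Feeding the output of the `ps` command through `parse_ps` and then `runaway_workers` gives a list of candidate PIDs to inspect (or kill) before enough of them pile up to exhaust memory.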
Blocks: 662423
This happened again at ~0815 PDT: the host ran out of RAM and swap, and we had to reboot it to bring it back up. It was extremely busy OOM-killing processes and took about 20-30 minutes to reboot.
Considering this a solved code issue: the features on input.stage.mozilla.com suspected of causing this server-wide issue have been removed and will not be re-enabled unless they are reimplemented some other way. Thanks all!
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Thanks for the attention. Stage seems to be fantastic again; it is no longer going down with ISEs.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard