Closed Bug 804984 Opened 12 years ago Closed 12 years ago

stage.m.o returns a lot of HTTP 500 errors

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: bhourigan)

References

Details

(Whiteboard: [reit-ops])

I'm getting a lot of 500 errors while running the update verification automation, which downloads files from stage.m.o. It happened yesterday at least, between roughly 19:00 and 23:00. I'm re-running the automation at the moment.
:rail, are you still seeing this as of this morning?
All of the failed verification steps pass now.
Peter, any idea what happened here? 

Rail, your bug report isn't clear on what the exact issues were or what you saw. Can you provide some more information, please? Logs? Machines involved, etc.?

I'm dropping this to normal so as not to page oncall, since there is no apparent outage/impact at the moment.
Severity: major → normal
No longer blocks: 796961
Rail, 

Are these stage URLs persistent? I guess not, because they all 404 now?
I've no ideas beyond bug 804119. First I've seen of this issue.
(In reply to Shyam Mani [:fox2mike] from comment #5)
> Rail, 
> 
> Are these stage URLs persistent? I guess not, because they all 404 now?

That's really strange. The files should be on ftp for a long time before we remove them...
stage and ftp aren't the same, at least from what I can see. Tossing this over to infra.
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: shyam → jdow
(In reply to Chris AtLee [:catlee] from comment #7)
> Saw a bunch of failures over night that broke 10.0.10 update verification:
> 
> http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/10.0.10esr-candidates/build1/logs/release-mozilla-esr10-macosx64_update_verify_2-bm12-build1-build0.txt.gz

Do you have a time for when that happened? This could all be related to the ongoing netapp issues in SCL3.
From the log:

Trying to get http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/10.0.10esr-candidates/build1/update/mac/lv/firefox-10.0.10esr.complete.mar:

21:26:22 ERROR 500: Internal Server Error.
21:26:54 ERROR 500: Internal Server Error.
21:27:28 ERROR 500: Internal Server Error.
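
The log shows three attempts about half a minute apart, all answered with a 500. For reference, a minimal sketch of that kind of retry-on-5xx download loop in Python. This is not the actual update verification code; the function name, the `opener` and `sleep` parameters, and the default attempt count and delay are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, delay=30.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url, retrying on HTTP 5xx responses (hypothetical helper).

    opener and sleep are injectable so the loop can be exercised offline.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # 4xx responses (e.g. 404) will not improve on retry.
            if err.code < 500:
                raise
            last_error = err
            if attempt < attempts:
                sleep(delay)
    # Every attempt returned a 5xx; surface the last one.
    raise last_error
```

Retrying only on 5xx keeps a genuine 404 from being masked as a transient failure.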
All connections to stage.mozilla.org on tcp/80 point to the ftp cluster on tcp/80. I combed through the apache logs on ftp* and all requests for firefox-10.0.10esr.complete.mar were satisfied with a response code of 200. So I don't think this was attributable to the netapp, unless the servers were so loaded that the health checks failed and shut down the pool.

Zeus logs do confirm numerous 500 errors. I found that 29 errors occurred between 24/Oct/2012:21:21:45 and 24/Oct/2012:21:32:43. Detailed logging isn't enabled for this VIP, and the general Zeus error logs don't provide any helpful information.

It is worth noting that the 500 errors were only served from zlb1.ops.scl3; the remainder of the zeus nodes did not log any errors.
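
For anyone repeating this kind of check, a minimal sketch of counting 500s per node inside a time window. The record layout (node, timestamp, status) is an assumption for illustration and not the actual Zeus log format; only the timestamp style matches the one quoted above. The month abbreviation is parsed with `%b`, which assumes an English/C locale.

```python
from collections import Counter
from datetime import datetime

# Timestamp style as quoted in this bug, e.g. 24/Oct/2012:21:21:45
TS_FORMAT = "%d/%b/%Y:%H:%M:%S"


def count_errors_by_node(records, start, end, status=500):
    """Count responses with the given status per node within [start, end].

    records is an iterable of (node, timestamp_string, status_code)
    tuples; this layout is hypothetical.
    """
    lo = datetime.strptime(start, TS_FORMAT)
    hi = datetime.strptime(end, TS_FORMAT)
    counts = Counter()
    for node, ts, code in records:
        when = datetime.strptime(ts, TS_FORMAT)
        if code == status and lo <= when <= hi:
            counts[node] += 1
    return counts


# Made-up sample rows; "zlb-other.example" is a placeholder node name.
sample = [
    ("zlb1.ops.scl3", "24/Oct/2012:21:26:22", 500),
    ("zlb1.ops.scl3", "24/Oct/2012:21:26:54", 500),
    ("zlb-other.example", "24/Oct/2012:21:27:00", 200),
    ("zlb1.ops.scl3", "24/Oct/2012:22:00:00", 500),  # outside the window
]
per_node = count_errors_by_node(
    sample, "24/Oct/2012:21:21:45", "24/Oct/2012:21:32:43"
)
```

If a single node accounts for every error in the window, that points at the load balancer node rather than the backends.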
(In reply to Brian Hourigan [:digi] from comment #12)

> It is worth noting that the 500 errors were only served from zlb1.ops.scl3,
> the remainder of the zeus nodes did not log any errors.

And zlb1 and 6 are the ones that host ftp...
What's the next step here?
Whiteboard: [reit-ops]
philor, did you mention there is an ongoing problem with scattered 500 responses when test slaves make requests to http://ftp.m.o/, probably for files in 
 /pub/mozilla.org/firefox/tinderbox-builds/
 /pub/mozilla.org/mobile/tinderbox-builds/
There was an ongoing problem with them; it feels like it's been since around September. There haven't been any for several days, though, so those may have been the netapp issues.
Closing this one on the grounds that the netapp issue was probably at fault and we haven't seen any further issues in a few weeks. Reopen if it happens again.
Assignee: server-ops-infra → bhourigan
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations