Closed Bug 753728 Opened 12 years ago Closed 12 years ago

browserid.org/verify is on occasion throwing 500 errors (being caught on the load balancer)

Categories

(Cloud Services :: Operations: Miscellaneous, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: boozeniges, Assigned: petef)

Details

(Whiteboard: [qa+])

On both the mozillaignite (https://github.com/rossbruniges/mozilla-ignite/tree/stage) and webmaker (https://github.com/rossbruniges/make.mozilla.org) projects we've been seeing intermittent login failures since around 2pm GMT yesterday.

After a bit of digging and adding extra logging, we found the following error being thrown:

django_browserid.base:INFO Verification URL: https://browserid.org/verify :/projects/mozilla/ignite/mozilla-ignite/vendor-local/src/django-browserid/django_browserid/base.py:118
django_browserid.base:DEBUG Failed to decode JSON. Resp: 500, Content: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 {
  font-family: Verdana, Arial, Helvetica, sans-serif;
}
h2 {
  font-family: Arial, Helvetica, sans-serif;
  color: #b10b29;
}
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>

projects/mozilla/ignite/mozilla-ignite/vendor-local/src/django-browserid/django_browserid/base.py:107

Both have recently been updated to use the new playdoh, so that django can be deployed out of it, fixing this issue (https://github.com/mozilla/playdoh/issues/107).

Setting as major since both affected projects are hoping to be deployed next week, and since this may also be an issue on other sites using browserID and django-browserID.
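
The log above shows the JSON decode failing because the reply was an HTML "Service Unavailable" page with status 500, not a verifier result. As a rough sketch of how a caller might tell that case apart from a genuine verification failure, something like the following could work. The helper name `parse_verifier_response` and the "unavailable" status value are assumptions made for illustration; this is not django-browserid's actual code.

```python
import json


def parse_verifier_response(status_code, body):
    """Classify a verifier reply (hypothetical helper, not django-browserid API).

    Returns the decoded JSON object when the reply looks like a verifier
    result, otherwise a dict whose "status" is "unavailable".
    """
    unavailable = {"status": "unavailable", "http_status": status_code}
    # Anything other than 200 (such as the 500 in the log) is treated as
    # a service problem rather than a rejected assertion.
    if status_code != 200:
        return unavailable
    try:
        data = json.loads(body)
    except ValueError:
        # A 200 carrying a non-JSON body (for example an HTML error page).
        return unavailable
    if not isinstance(data, dict):
        return unavailable
    return data
```

With this split, a site could show a "try again later" message for "unavailable" instead of reporting a failed login.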
browserid.org is Mozilla Services. Moving...
Assignee: server-ops-infra → nobody
Component: Server Operations: Infrastructure → Operations
Product: mozilla.org → Mozilla Services
QA Contact: jdow → operations
Version: other → unspecified
Thanks Shyam - that was the one thing that I wasn't sure about :)
You're welcome, Ross. I've poked the services ops folks on IRC as well; if this is incorrect, they'll move it to the right place and look at it.
I think this is probably fixed now.  The verifier pool in scl2 had all backends marked as draining for some odd reason, so when GSLB gave you scl2's IP address for browserid.org, /verify calls would fail.

We need to start QAing the verifier service before we undrain a datacenter during a push.
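
The failure described above (every backend in a pool marked as draining) is the kind of thing a pre-undrain check could catch mechanically. A minimal sketch follows, assuming the pool state has already been fetched into a plain dictionary. The function name, the data shape, and the state strings "active" and "draining" are all hypothetical; how the state is read from the load balancer is deliberately left out.

```python
def fully_drained_pools(pools):
    """Return the names of pools that have no active backend.

    `pools` maps a pool name to a {node: state} dict. The state strings
    "active" and "draining" are assumptions made for this sketch.
    """
    drained = []
    for name, nodes in sorted(pools.items()):
        # A pool with no nodes is as unusable as one that is fully draining.
        if not any(state == "active" for state in nodes.values()):
            drained.append(name)
    return drained


# Hypothetical snapshot: node names and the second pool are made up.
snapshot = {
    "scl2-verifier": {"node1": "draining", "node2": "draining"},
    "other-verifier": {"node1": "active", "node2": "draining"},
}
print(fully_drained_pools(snapshot))
```

A datacenter would only be undrained when this list is empty.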
Assignee: nobody → petef
Status: NEW → ASSIGNED
Yes. I agree.
Seems like I need to update our Test Plan to add a section for Prod push specific stuff.
Probably something that could be automated...
Whiteboard: [qa+]
Filed bug 753828 to monitor zeus pools so we'll catch this before undraining a datacenter in the future.
Should be covered by a script that :jrgm has.
Also, running it once per colo per Prod push should be enough to verify everything is working as expected (getting a 200 back rather than a 500).
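
A per-colo check along these lines could be sketched as below. This is not the script mentioned above, only an illustration under stated assumptions: the per-colo URLs are placeholders, and the request body that makes /verify answer 200 is not given in this bug, so the probe is passed in as a parameter.

```python
import urllib.error
import urllib.request


def http_status(url, data=b"", timeout=10):
    """POST `data` to `url`; return the HTTP status code, or None if unreachable."""
    request = urllib.request.Request(url, data=data, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still the result we want.
        return error.code
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return None


def check_colos(colo_urls, probe=http_status):
    """Return {colo: status} for every colo that did not answer 200."""
    failures = {}
    for colo, url in colo_urls.items():
        status = probe(url)
        if status != 200:
            failures[colo] = status
    return failures
```

Run once per Prod push, an empty result from `check_colos` would mean every colo answered 200; any entry names the colo to keep drained.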
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Marking as Verified since we appear to have everything in place, including tests for Prod.
Status: RESOLVED → VERIFIED