Closed Bug 1189717 Opened 10 years ago Closed 10 years ago

Basket server unresponsive to requests

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jrgm, Assigned: rwatson)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1492] )

https://basket.mozilla.com/news has been timing out and returning errors for the past hour. Please have a look.
From #basket:
[07-31 02:51:35] vectorvictor NR_ALERT: Alert opened for basket.mozilla.org -- Triggered by: Error rate > 10.0% -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 02:55:42] vectorvictor NR_ALERT: Alert ended for basket.mozilla.org -- Triggered by: Error rate > 10.0% -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 02:56:23] vectorvictor NR_ALERT: Alert escalated to downtime for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:01:33] vectorvictor NR_ALERT: Alert downtime recovered for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:02:23] vectorvictor NR_ALERT: Alert escalated to downtime for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:11:36] vectorvictor NR_ALERT: Alert opened for basket.mozilla.org -- Triggered by: Error rate > 10.0% -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:12:26] vectorvictor NR_ALERT: Alert downtime recovered for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:14:32] vectorvictor NR_ALERT: Alert escalated to downtime for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:24:28] vectorvictor NR_ALERT: Alert downtime recovered for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16656911
[07-31 03:35:18] vectorvictor NR_ALERT: Alert opened for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16657467
[07-31 03:45:24] vectorvictor NR_ALERT: Alert downtime recovered for basket.mozilla.org -- Triggered by: unable to ping basket.mozilla.org -- Apps currently involved: basket.mozilla.org. https://rpm.newrelic.com/accounts/263620/incidents/16657467
[Fri Jul 31 11:41:54 2015] [error] [client 52.24.177.182] (11)Resource temporarily unavailable: mod_wsgi (pid=26043): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
[Fri Jul 31 11:42:07 2015] [error] [client 63.245.214.162] (11)Resource temporarily unavailable: mod_wsgi (pid=25729): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
[Fri Jul 31 11:42:20 2015] [error] [client 63.245.214.162] (11)Resource temporarily unavailable: mod_wsgi (pid=26039): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
[Fri Jul 31 11:42:32 2015] [error] [client 52.27.217.70] (11)Resource temporarily unavailable: mod_wsgi (pid=27972): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
[Fri Jul 31 11:42:53 2015] [error] [client 63.245.214.162] (11)Resource temporarily unavailable: mod_wsgi (pid=27907): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
[Fri Jul 31 11:43:03 2015] [error] [client 63.245.214.162] (11)Resource temporarily unavailable: mod_wsgi (pid=26088): Unable to connect to WSGI daemon process 'basket-ssl' on '/var/run/wsgi.1440.7.5.sock'.
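For anyone triaging a similar flood of mod_wsgi errors, here is a minimal Python sketch that tallies the "Unable to connect to WSGI daemon process" lines by daemon group and by client IP. The function name `count_wsgi_connect_failures` and the regular expression are illustrative assumptions, not anything from the basket codebase; the two sample lines are copied from the log excerpt above.

```python
import re
from collections import Counter

# Matches the mod_wsgi connection-failure lines quoted in this bug.
# The pattern is an assumption based only on the excerpt above.
FAILURE_RE = re.compile(
    r"\[client (?P<client>[\d.]+)\] .*?"
    r"Unable to connect to WSGI daemon process '(?P<group>[^']+)'"
)


def count_wsgi_connect_failures(lines):
    """Return (failures per daemon group, failures per client IP)."""
    by_group, by_client = Counter(), Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            by_group[match.group("group")] += 1
            by_client[match.group("client")] += 1
    return by_group, by_client


# Two lines taken verbatim from the error log in this bug.
SAMPLE_LINES = [
    "[Fri Jul 31 11:41:54 2015] [error] [client 52.24.177.182] "
    "(11)Resource temporarily unavailable: mod_wsgi (pid=26043): "
    "Unable to connect to WSGI daemon process 'basket-ssl' on "
    "'/var/run/wsgi.1440.7.5.sock'.",
    "[Fri Jul 31 11:42:07 2015] [error] [client 63.245.214.162] "
    "(11)Resource temporarily unavailable: mod_wsgi (pid=25729): "
    "Unable to connect to WSGI daemon process 'basket-ssl' on "
    "'/var/run/wsgi.1440.7.5.sock'.",
]

if __name__ == "__main__":
    groups, clients = count_wsgi_connect_failures(SAMPLE_LINES)
    print(groups.most_common())
    print(clients.most_common())
```

In practice the lines would come from the Apache error log file rather than an inline list; the log path is site-specific and not stated in this bug.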
I've restarted httpd on generic to see if it would solve the problem. If the problem still occurs, please move this to Infra:webops ....
Assignee: nobody → server-ops-webops
Component: Basket → WebOps: Product Delivery
Product: Websites → Infrastructure & Operations
QA Contact: smani
Version: unspecified → other
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1492]
It came back for a bit after the restart, but I'm seeing timeouts again.
Severity: normal → critical
Assignee: server-ops-webops → rwatson
Basket is largely functional again, although some jobs still appear to be entering a scheduled state without being acknowledged. The root cause was that the password for the basket-prod rabbit user was accidentally changed. This meant that the celery and web processes were unable to authenticate to rabbit, which undoubtedly caused all sorts of failures. After changing the rabbit password and restarting both the celery and Apache processes, there were unacknowledged messages in the celery queue. These persisted even after shutting down the celery processes running on the new python cluster. The number of unacknowledged messages has continued to climb slowly through the day (we're currently at 30 messages); these correspond to the messages shown if you do a 'celeryctl inspect scheduled'.
I see 60+ unacknowledged requests in the queues this morning, but judging by the collected data this appears to be somewhat "normal" (https://graphite-phx1.mozilla.org/dashboard/#basket-prod-rabbitmq).
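As a rough way to keep an eye on the unacknowledged counts discussed above without the dashboard, the sketch below parses the tab-separated output of `rabbitmqctl list_queues name messages_ready messages_unacknowledged` and reports queues above a threshold. The helper names, the sample output, the queue names, and the threshold of 25 are all hypothetical; the real queue names and vhost for basket-prod are not given in this bug.

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues name messages_ready messages_unacknowledged`.

    Returns {queue_name: (ready, unacked)}. Banner or footer lines that are
    not three tab-separated fields with integer counts are skipped.
    """
    rows = {}
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        name, ready, unacked = fields
        if not (ready.isdigit() and unacked.isdigit()):
            continue
        rows[name] = (int(ready), int(unacked))
    return rows


def queues_over_threshold(rows, max_unacked):
    """Return sorted names of queues with more than max_unacked unacked messages."""
    return sorted(
        name for name, (_ready, unacked) in rows.items() if unacked > max_unacked
    )


# Hypothetical output for illustration only; not captured from basket-prod.
SAMPLE_OUTPUT = (
    "Listing queues ...\n"
    "celery\t0\t30\n"
    "other_queue\t0\t0\n"
)

if __name__ == "__main__":
    rows = parse_list_queues(SAMPLE_OUTPUT)
    # 25 is an arbitrary example threshold, not an agreed alerting level.
    print(queues_over_threshold(rows, max_unacked=25))
```

Since 60+ unacknowledged messages looked "normal" in the collected data, any real threshold would need to be chosen from that history rather than from this example.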
Resolving for now. If this becomes an issue, feel free to re-open.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard