Bug 966354 (Closed) - Opened 11 years ago, Closed 11 years ago

SUMO staging is down/throwing an HTTP 500 error

Categories

(Data & BI Services Team :: DB: MySQL, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1033504

People

(Reporter: stephend, Assigned: scabral)


Details

(Whiteboard: [fromAutomation] [data: consultative])

Attachments

(1 file)

https://support.allizom.org/ is returning a 500, and has been for over an hour :-(
https://errormill.mozilla.org/support/sumo-stage/group/160276/

OperationalError: (2006, 'MySQL server has gone away')

Stacktrace (most recent call last):
  File "django/core/handlers/base.py", line 187, in get_response
    response = middleware_method(request, response)
  File "newrelic/api/object_wrapper.py", line 237, in __call__
    self._nr_instance, args, kwargs, **self._nr_kwargs)
  File "newrelic/hooks/framework_django.py", line 307, in wrapper
    return wrapped(*args, **kwargs)
  File "django/middleware/transaction.py", line 42, in process_response
    transaction.rollback()
  File "django/db/transaction.py", line 161, in rollback
    connection.rollback()
  File "django/db/backends/__init__.py", line 249, in rollback
    self._rollback()
  File "django/db/backends/mysql/base.py", line 421, in _rollback
    BaseDatabaseWrapper._rollback(self)
  File "django/db/backends/__init__.py", line 59, in _rollback
    return self.connection.rollback()
  File "newrelic/hooks/database_dbapi2.py", line 84, in rollback
    return self._nr_connection.rollback()

Maybe MySQL for that environment is hosed?
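For context, error 2006 usually means the client is reusing a connection that the server, or something between them such as a load balancer, has already dropped; the next call on the stale handle (here, the rollback) is what blows up. A minimal sketch of the failure mode and a guard, assuming the MySQLdb driver this Django stack uses; the host and credentials are placeholders:

import MySQLdb

# Placeholder host/credentials, for illustration only.
conn = MySQLdb.connect(host="dev1.db.phx1.example", user="support_stage",
                       passwd="placeholder", db="support_allizom_org")

# ... the connection sits idle past some timeout, or is killed server-side ...

try:
    conn.rollback()  # the same call that fails in the traceback above
except MySQLdb.OperationalError as e:
    if e.args[0] == 2006:  # 'MySQL server has gone away'
        # Don't keep using the stale handle: ping(True) asks the driver
        # to reconnect if the link is dead.
        conn.ping(True)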
Sending over to the DBAs.
Assignee: nobody → server-ops-database
Component: Code Quality → Server Operations: Database
Product: support.mozilla.org → mozilla.org
QA Contact: scabral
Version: unspecified → other
Severity: major → critical
On it.
https://rpm.newrelic.com/accounts/263620/dashboard/3110396

Looks like things are OK in general... that's for the SUMO master db. One of the slaves: https://rpm.newrelic.com/accounts/263620/dashboard/3110397
Er, that's production; let me check out stage. Dev and stage are on the same db server (the dev cluster in phx, shared with other dev/stage instances):

https://rpm.newrelic.com/accounts/263620/dashboard/3110372 - dev1, current master
https://rpm.newrelic.com/accounts/263620/dashboard/3110373 - dev2, current slave

Will log on to the machines and see what's up.
master stage db has been up since 12/28, slave up since 12/12, so it's not a crash.
Both master and slave are quiet, so it's not something like high load or max connections.
Sanity check on the db import: we import the db monthly, on the first of the month. This isn't that; I double-checked that it runs only on the 1st day of the month. It looks like *some* processes can get through, as kpi_metric and django_session were updated in the last hour. Does that ring any bells?

[root@dev1.db.phx1 support_allizom_org]# date
Fri Jan 31 18:45:14 UTC 2014
[root@dev1.db.phx1 support_allizom_org]# ls -rlth | tail -10
-rw-rw---- 1 mysql mysql  15M Jan 31 09:09 wiki_documentlink.ibd
-rw-rw---- 1 mysql mysql  44M Jan 31 09:09 dashboards_wikimetric.ibd
-rw-rw---- 1 mysql mysql 704K Jan 31 10:54 users_registrationprofile.ibd
-rw-rw---- 1 mysql mysql 112M Jan 31 10:54 auth_user.ibd
-rw-rw---- 1 mysql mysql  64M Jan 31 10:54 users_profile.ibd
-rw-rw---- 1 mysql mysql  17M Jan 31 10:54 auth_user_groups.ibd
-rw-rw---- 1 mysql mysql 188M Jan 31 12:01 questions_question.ibd
-rw-rw---- 1 mysql mysql 128K Jan 31 12:47 karma_title_users.ibd
-rw-rw---- 1 mysql mysql 656K Jan 31 18:26 kpi_metric.ibd
-rw-rw---- 1 mysql mysql 128M Jan 31 18:40 django_session.ibd
Connectivity is definitely happening, through the load balancer (that's the HOST); this query shows that in 3 seconds, the support_stage user connected 3 times. I can do more to try to see what it did; maybe there's one table that's giving a problem.

mysql> select *,now() from accounts where user like 'support_stage'\G
*************************** 1. row ***************************
               USER: support_stage
               HOST: 10.8.70.202
CURRENT_CONNECTIONS: 0
  TOTAL_CONNECTIONS: 1570437
              now(): 2014-01-31 18:52:04
1 row in set (0.00 sec)

mysql> select *,now() from accounts where user like 'support_stage'\G
*************************** 1. row ***************************
               USER: support_stage
               HOST: 10.8.70.202
CURRENT_CONNECTIONS: 0
  TOTAL_CONNECTIONS: 1570440
              now(): 2014-01-31 18:52:07
1 row in set (0.00 sec)
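The same check can be scripted rather than eyeballed; a sketch (again MySQLdb with placeholder host/credentials, reading the same performance_schema.accounts row as above) that samples the counter twice and reports the connection rate:

import time
import MySQLdb

QUERY = ("SELECT total_connections FROM performance_schema.accounts "
         "WHERE user = 'support_stage'")

# Placeholder host/credentials, for illustration only.
conn = MySQLdb.connect(host="dev1.db.phx1.example", user="root",
                       passwd="placeholder")
cur = conn.cursor()

cur.execute(QUERY)
before = cur.fetchone()[0]
time.sleep(3)  # the same 3-second window as the two queries above
cur.execute(QUERY)
after = cur.fetchone()[0]
print "%d new connections in 3 seconds" % (after - before)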
Nothing in the MySQL error logs (other than the usual "unsafe statement written to binary log")
Assignee: server-ops-database → scabral
Attached file support_stage.log
Hrm, I turned on the general log for a few seconds and got 2 connections from the support_stage user logged, so it's intermittent...
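For reference, the general log can be switched on and off at runtime, which is how a capture of just a few seconds works; a minimal sketch (assumes SUPER privileges and placeholder host/credentials):

import time
import MySQLdb

# Placeholder host/credentials; needs SUPER to set global variables.
conn = MySQLdb.connect(host="dev1.db.phx1.example", user="root",
                       passwd="placeholder")
cur = conn.cursor()

cur.execute("SET GLOBAL general_log = 'ON'")
time.sleep(5)  # capture a few seconds of traffic, then turn it back off
cur.execute("SET GLOBAL general_log = 'OFF'")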
11:01 < stephend> qatestbot: build sumo.stage.saucelabs
11:01 -qatestbot:#sumodev- stephend: job sumo.stage.saucelabs build scheduled now
11:01 < stephend> vamos a ver [let's see]
11:02 <@r1cky> dale wey [go for it, dude]
11:03 < stephend> heh
11:05 -qatestbot:#sumodev- Project sumo.stage.saucelabs build #1005: ABORTED in 4 min 2 sec: http://qa-selenium.mv.mozilla.com:8080/job/sumo.stage.saucelabs/1005/

So the 300-second Zeus timeout isn't in play here; neither is any MySQL query killer, nor the 10-minute idle timeout for MySQL.
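Those MySQL-side timeouts can also be ruled out directly; a sketch (MySQLdb again, placeholder host/credentials) that dumps the variables usually behind server-side disconnects of idle or slow clients:

import MySQLdb

# Placeholder host/credentials, for illustration only.
conn = MySQLdb.connect(host="dev1.db.phx1.example", user="root",
                       passwd="placeholder")
cur = conn.cursor()
cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
            "('wait_timeout', 'interactive_timeout', "
            "'net_read_timeout', 'net_write_timeout')")
for name, value in cur.fetchall():
    print "%s = %s" % (name, value)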
13:05 < sheeri> wait, is it still happening or not?
13:06 < sheeri> stephend: I think solarce restarted apache
13:06 < stephend> not at the moment, but this /behavior/ happens
13:06 < stephend> every few days
13:06 < sheeri> ah
13:08 < sheeri> I have told this info to webops
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_error

From Friday, we need to run an strace while it's happening....
(In reply to Sheeri Cabral [:sheeri] from comment #15)
> https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_error
>
> From Friday, we need to run an strace while it's happening....

The URL was missing a trailing "s", so it should be:
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_errors, or, more specifically,
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_errors/1249918770.
* strace on the mod_wsgi processes just showed them "blocked":

[root@support1.stage.webapp.phx1 kitsune]# ps waux | grep kitsune
apache   10811  2.9 13.8 1491036 265932 ?      Sl   07:57   7:57 kitsune-ssl
apache   11999  2.8 14.3 1493088 275192 ?      Sl   08:02   7:28 kitsune-ssl
apache   22027  0.3  7.2 1032292 140148 ?      Sl   Feb02   3:36 kitsune
apache   22416  0.3  7.7 1032300 149748 ?      Sl   Feb02   3:32 kitsune
root     32406  0.0  0.0  103256    836 pts/0  S+   12:25   0:00 grep kitsune
[root@support1.stage.webapp.phx1 kitsune]# strace -p 10811
Process 10811 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>q^C <unfinished ...>
Process 10811 detached

* Logs show the "server has gone away" but manage.py is able to connect:

[root@support1.stage.webapp.phx1 kitsune]# tail -f /var/log/httpd/support.mozilla.org/error_log_2014-02-03-20
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] connection.abort()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 97, in abort
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]     self._rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 421, in _rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]     BaseDatabaseWrapper._rollback(self)
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 59, in _rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]     return self.connection.rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]   File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/hooks/database_dbapi2.py", line 84, in rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216]     return self._nr_connection.rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] OperationalError: (2006, 'MySQL server has gone away')
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] mod_wsgi (pid=11999): Exception occurred processing WSGI script '/data/www/support.allizom.org/kitsune/wsgi/kitsune.wsgi'.
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] Traceback (most recent call last):
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/api/web_transaction.py", line 584, in close
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     self.generator.close()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/http/response.py", line 236, in close
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     signals.request_finished.send(sender=self._handler_class)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/dispatch/dispatcher.py", line 170, in send
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     response = receiver(signal=self, sender=sender, **named)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/__init__.py", line 51, in close_connection
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     transaction.abort(conn)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/transaction.py", line 40, in abort
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     connection.abort()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 97, in abort
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     self._rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 421, in _rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     BaseDatabaseWrapper._rollback(self)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 59, in _rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     return self.connection.rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]   File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/hooks/database_dbapi2.py", line 84, in rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216]     return self._nr_connection.rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] OperationalError: (2006, 'MySQL server has gone away')

[root@support1.stage.webapp.phx1 kitsune]# ./manage.py dbshell
/data/www/support.allizom.org/kitsune/vendor/src/django/django/core/management/__init__.py:409: DeprecationWarning: The 'setup_environ' function is deprecated, you likely need to update your 'manage.py'; please see the Django 1.4 release notes (https://docs.djangoproject.com/en/dev/releases/1.4/).
  DeprecationWarning)
/data/www/support.allizom.org/kitsune/vendor/src/django/django/core/management/__init__.py:465: DeprecationWarning: The 'execute_manager' function is deprecated, you likely need to update your 'manage.py'; please see the Django 1.4 release notes (https://docs.djangoproject.com/en/dev/releases/1.4/).
  DeprecationWarning)
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4992892
Server version: 5.6.12-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show tables;
+-------------------------------+
| Tables_in_support_allizom_org |
+-------------------------------+
| announcements_announcement    |
| auth_group                    |
| auth_group_permissions        |
| auth_message                  |
| auth_permission               |
| auth_user                     |

* This is probably related to the ongoing issues in prod; see bug 934601.
It was relayed to the SUMO team that they can "self-service" an Apache restart by doing a code push (the restart happens near the end).
Still plaguing us (Web QA, at least), to the point where I've disabled tests, because they just randomly hit a host with a 500 and either fail or, even worse, hang (due to a Selenium issue). :-(
I think this is something we can monitor for, as per bug 1033504 - resolving as a "dupe" but if it's not related, please reopen.
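For what it's worth, a monitor like the one suggested could be quite small; this is not the actual check from bug 1033504, just a hypothetical Python 2 sketch (matching the stack seen in the tracebacks above) of alerting on a 5xx from staging:

import urllib2

def check(url="https://support.allizom.org/"):
    # A 2xx/3xx response returns normally; 4xx/5xx raises HTTPError.
    try:
        urllib2.urlopen(url, timeout=30)
        print "OK: %s" % url
        return True
    except urllib2.HTTPError as e:
        print "CRITICAL: %s returned HTTP %d" % (url, e.code)
        return False
    except urllib2.URLError as e:
        print "CRITICAL: %s unreachable (%s)" % (url, e.reason)
        return False

if __name__ == "__main__":
    check()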
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Whiteboard: [fromAutomation] → [fromAutomation] [consultative]
Whiteboard: [fromAutomation] [consultative] → [fromAutomation] [data: consultative]
Product: mozilla.org → Data & BI Services Team