Closed
Bug 966354
Opened 11 years ago
Closed 11 years ago
SUMO staging is down/throwing an HTTP 500 error
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED DUPLICATE of bug 1033504
People
(Reporter: stephend, Assigned: scabral)
References
()
Details
(Whiteboard: [fromAutomation] [data: consultative])
Attachments
(1 file)
15.19 KB, text/x-log
Description
https://support.allizom.org/ is returning a 500, and has been for over an hour :-(
Comment 1 • 11 years ago
https://errormill.mozilla.org/support/sumo-stage/group/160276/
OperationalError: (2006, 'MySQL server has gone away')
Stacktrace (most recent call last):
File "django/core/handlers/base.py", line 187, in get_response
response = middleware_method(request, response)
File "newrelic/api/object_wrapper.py", line 237, in __call__
self._nr_instance, args, kwargs, **self._nr_kwargs)
File "newrelic/hooks/framework_django.py", line 307, in wrapper
return wrapped(*args, **kwargs)
File "django/middleware/transaction.py", line 42, in process_response
transaction.rollback()
File "django/db/transaction.py", line 161, in rollback
connection.rollback()
File "django/db/backends/__init__.py", line 249, in rollback
self._rollback()
File "django/db/backends/mysql/base.py", line 421, in _rollback
BaseDatabaseWrapper._rollback(self)
File "django/db/backends/__init__.py", line 59, in _rollback
return self.connection.rollback()
File "newrelic/hooks/database_dbapi2.py", line 84, in rollback
return self._nr_connection.rollback()
Maybe mysql for that environment is hosed?
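For context on the error above: MySQL error 2006 ("server has gone away") means the client tried to reuse a connection the server had already dropped, typically after an idle timeout, a server restart, or an oversized packet. Below is a minimal sketch of how a client can detect and recover from it, assuming the MySQLdb driver this Django stack uses; the host, user, and database values are placeholders, not the real stage credentials.

# Minimal sketch: detect and recover from MySQL error 2006 ("server has gone away").
# Assumes the MySQLdb driver; host/user/passwd/db values are placeholders.
import MySQLdb

SERVER_GONE_AWAY = 2006

def get_live_connection(conn=None):
    """Return a usable connection, reconnecting if the old one has gone away."""
    if conn is not None:
        try:
            conn.ping()  # raises OperationalError if the server dropped the connection
            return conn
        except MySQLdb.OperationalError as exc:
            if exc.args[0] != SERVER_GONE_AWAY:
                raise  # a different failure; don't mask it
    return MySQLdb.connect(host="dev1.db.example.com", user="support_stage",
                           passwd="...", db="support_allizom_org")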
Comment 2 • 11 years ago
Sending over to the DBAs.
Assignee: nobody → server-ops-database
Component: Code Quality → Server Operations: Database
Product: support.mozilla.org → mozilla.org
QA Contact: scabral
Version: unspecified → other
Comment 3 (Reporter) • 11 years ago
Yeah, MySQL has "gone away":
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_errors/1239782229
Updated (Reporter) • 11 years ago
Severity: major → critical
Comment 4 (Assignee) • 11 years ago
On it.
Comment 5 (Assignee) • 11 years ago
https://rpm.newrelic.com/accounts/263620/dashboard/3110396 - looks like things are OK in general; that's the SUMO master db. One of the slaves: https://rpm.newrelic.com/accounts/263620/dashboard/3110397
Comment 6 (Assignee) • 11 years ago
Er, that's production, let me check out stage:
dev+stage are on the same db server, dev cluster in phx, shared with other dev/stage instances.
https://rpm.newrelic.com/accounts/263620/dashboard/3110372 - dev1, current master
https://rpm.newrelic.com/accounts/263620/dashboard/3110373 - dev2, current slave
Will log on to the machines and see what's up.
Comment 7 (Assignee) • 11 years ago
master stage db has been up since 12/28, slave up since 12/12, so it's not a crash.
Comment 8 (Assignee) • 11 years ago
Both master and slave are quiet, so it's not something like high load or max connections.
Comment 9 (Assignee) • 11 years ago
Sanity check on the db import: we import the db monthly, on the first of the month. This isn't that; I double-checked that it runs only on the 1st.
It looks like *some* processes can get through, as kpi_metric and django_session were updated in the last hour. Does that ring any bells?
[root@dev1.db.phx1 support_allizom_org]# date
Fri Jan 31 18:45:14 UTC 2014
[root@dev1.db.phx1 support_allizom_org]# ls -rlth | tail -10
-rw-rw---- 1 mysql mysql 15M Jan 31 09:09 wiki_documentlink.ibd
-rw-rw---- 1 mysql mysql 44M Jan 31 09:09 dashboards_wikimetric.ibd
-rw-rw---- 1 mysql mysql 704K Jan 31 10:54 users_registrationprofile.ibd
-rw-rw---- 1 mysql mysql 112M Jan 31 10:54 auth_user.ibd
-rw-rw---- 1 mysql mysql 64M Jan 31 10:54 users_profile.ibd
-rw-rw---- 1 mysql mysql 17M Jan 31 10:54 auth_user_groups.ibd
-rw-rw---- 1 mysql mysql 188M Jan 31 12:01 questions_question.ibd
-rw-rw---- 1 mysql mysql 128K Jan 31 12:47 karma_title_users.ibd
-rw-rw---- 1 mysql mysql 656K Jan 31 18:26 kpi_metric.ibd
-rw-rw---- 1 mysql mysql 128M Jan 31 18:40 django_session.ibd
Comment 10 (Assignee) • 11 years ago
Connectivity is definitely happening through the load balancer (that's the HOST); this query shows that in 3 seconds, the support_stage user connected 3 times.
I can dig further to see what it did; maybe there's one table that's causing a problem.
mysql> select *,now() from accounts where user like 'support_stage'\G
*************************** 1. row ***************************
USER: support_stage
HOST: 10.8.70.202
CURRENT_CONNECTIONS: 0
TOTAL_CONNECTIONS: 1570437
now(): 2014-01-31 18:52:04
1 row in set (0.00 sec)
mysql> select *,now() from accounts where user like 'support_stage'\G
*************************** 1. row ***************************
USER: support_stage
HOST: 10.8.70.202
CURRENT_CONNECTIONS: 0
TOTAL_CONNECTIONS: 1570440
now(): 2014-01-31 18:52:07
1 row in set (0.00 sec)
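As a side note, the two manual queries above can be turned into a short polling loop so the growth of TOTAL_CONNECTIONS is easier to watch over time. A sketch, assuming the table being queried is performance_schema.accounts and that local admin credentials exist; all connection details are placeholders.

# Sketch: poll performance_schema.accounts and report how quickly an account
# is opening new connections. Connection details are placeholders.
import time
import MySQLdb

def watch_connection_rate(user="support_stage", interval=3, samples=20):
    conn = MySQLdb.connect(host="localhost", user="root", passwd="...")
    cur = conn.cursor()
    last = None
    for _ in range(samples):
        cur.execute("SELECT COALESCE(SUM(total_connections), 0) "
                    "FROM performance_schema.accounts WHERE user = %s", (user,))
        total = cur.fetchone()[0]
        if last is not None:
            print("%d new connections in the last %d seconds" % (total - last, interval))
        last = total
        time.sleep(interval)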
Comment 11 (Assignee) • 11 years ago
Nothing in the MySQL error logs (other than the usual "unsafe statement written to binary log")
Updated (Assignee) • 11 years ago
Assignee: server-ops-database → scabral
Comment 12 (Assignee) • 11 years ago
Hrm, I turned on the general log for a few seconds and got 2 connections from the support_stage user logged, so it's intermittent...
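For reference, a sketch of that kind of short general-log capture: enable the general query log, wait a few seconds, then switch it back off. It assumes SUPER privilege on the server and uses a placeholder log file path.

# Sketch: capture a few seconds of the MySQL general query log.
# Assumes SUPER privilege; connection details and log path are placeholders.
import time
import MySQLdb

def capture_general_log(seconds=5):
    conn = MySQLdb.connect(host="localhost", user="root", passwd="...")
    cur = conn.cursor()
    cur.execute("SET GLOBAL general_log_file = '/tmp/general-capture.log'")
    cur.execute("SET GLOBAL general_log = 'ON'")
    try:
        time.sleep(seconds)  # everything the server receives is logged meanwhile
    finally:
        cur.execute("SET GLOBAL general_log = 'OFF'")  # don't leave it running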
Comment 13 (Assignee) • 11 years ago
11:01 < stephend> qatestbot: build sumo.stage.saucelabs
11:01 -qatestbot:#sumodev- stephend: job sumo.stage.saucelabs build scheduled now
11:01 < stephend> vamos a ver
11:02 <@r1cky> dale wey
11:03 < stephend> heh
11:05 -qatestbot:#sumodev- Project sumo.stage.saucelabs build #1005: ABORTED in 4 min 2 sec: http://qa-selenium.mv.mozilla.com:8080/job/sumo.stage.saucelabs/1005/
So the 300-second Zeus timeout isn't in play here, nor is any MySQL query killer, nor the 10-minute idle timeout for MySQL.
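The other server-side settings that commonly produce error 2006 are wait_timeout, interactive_timeout, net_read_timeout/net_write_timeout, and max_allowed_packet. A sketch of dumping them for comparison against the timeouts ruled out above; the connection details are placeholders.

# Sketch: print the MySQL variables that most often cause error 2006.
# Connection details are placeholders.
import MySQLdb

SUSPECTS = ("wait_timeout", "interactive_timeout",
            "net_read_timeout", "net_write_timeout", "max_allowed_packet")

def dump_timeout_variables():
    conn = MySQLdb.connect(host="localhost", user="root", passwd="...")
    cur = conn.cursor()
    for name in SUSPECTS:
        cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
        row = cur.fetchone()
        if row:
            print("%s = %s" % row)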
Comment 14 (Assignee) • 11 years ago
13:05 < sheeri> wait, is it still happening or not?
13:06 < sheeri> stephend: I think solarce restarted apache
13:06 < stephend> not at the moment, but this /behavior/ happens
13:06 < stephend> every few days
13:06 < sheeri> ah
13:08 < sheeri> I have told this info to webops
Comment 15 (Assignee) • 11 years ago
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_error
From Friday, we need to run an strace while it's happening....
Comment 16 (Reporter) • 11 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #15)
> https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_error
>
> From Friday, we need to run an strace while it's happening....
The URL was missing a trailing "s"; it should be:
https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_errors, or https://rpm.newrelic.com/accounts/263620/applications/2779107/traced_errors/1249918770, more specifically.
Comment 17 • 11 years ago
* strace on the mod_wsgi processes just showed them "blocked"
[root@support1.stage.webapp.phx1 kitsune]# ps waux | grep kitsune
apache 10811 2.9 13.8 1491036 265932 ? Sl 07:57 7:57 kitsune-ssl
apache 11999 2.8 14.3 1493088 275192 ? Sl 08:02 7:28 kitsune-ssl
apache 22027 0.3 7.2 1032292 140148 ? Sl Feb02 3:36 kitsune
apache 22416 0.3 7.7 1032300 149748 ? Sl Feb02 3:32 kitsune
root 32406 0.0 0.0 103256 836 pts/0 S+ 12:25 0:00 grep kitsune
[root@support1.stage.webapp.phx1 kitsune]# strace -p 10811
Process 10811 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>q^C <unfinished ...>
Process 10811 detached
* Logs show the "server has gone away" but manage.py is able to connect
[root@support1.stage.webapp.phx1 kitsune]# tail -f /var/log/httpd/support.mozilla.org/error_log_2014-02-03-20
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] connection.abort()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 97, in abort
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] self._rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 421, in _rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] BaseDatabaseWrapper._rollback(self)
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 59, in _rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] return self.connection.rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/hooks/database_dbapi2.py", line 84, in rollback
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] return self._nr_connection.rollback()
[Mon Feb 03 12:29:08 2014] [error] [client 10.8.81.216] OperationalError: (2006, 'MySQL server has gone away')
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] mod_wsgi (pid=11999): Exception occurred processing WSGI script '/data/www/support.allizom.org/kitsune/wsgi/kitsune.wsgi'.
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] Traceback (most recent call last):
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/api/web_transaction.py", line 584, in close
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] self.generator.close()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/http/response.py", line 236, in close
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] signals.request_finished.send(sender=self._handler_class)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/dispatch/dispatcher.py", line 170, in send
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] response = receiver(signal=self, sender=sender, **named)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/__init__.py", line 51, in close_connection
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] transaction.abort(conn)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/transaction.py", line 40, in abort
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] connection.abort()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 97, in abort
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] self._rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 421, in _rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] BaseDatabaseWrapper._rollback(self)
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/data/www/support.allizom.org/kitsune/vendor/src/django/django/db/backends/__init__.py", line 59, in _rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] return self.connection.rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] File "/usr/lib64/python2.6/site-packages/newrelic-1.13.1.31/newrelic/hooks/database_dbapi2.py", line 84, in rollback
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] return self._nr_connection.rollback()
[Mon Feb 03 12:29:28 2014] [error] [client 10.8.81.216] OperationalError: (2006, 'MySQL server has gone away')
[root@support1.stage.webapp.phx1 kitsune]# ./manage.py dbshell
/data/www/support.allizom.org/kitsune/vendor/src/django/django/core/management/__init__.py:409: DeprecationWarning: The 'setup_environ' function is deprecated, you likely need to update your 'manage.py'; please see the Django 1.4 release notes (https://docs.djangoproject.com/en/dev/releases/1.4/).
DeprecationWarning)
/data/www/support.allizom.org/kitsune/vendor/src/django/django/core/management/__init__.py:465: DeprecationWarning: The 'execute_manager' function is deprecated, you likely need to update your 'manage.py'; please see the Django 1.4 release notes (https://docs.djangoproject.com/en/dev/releases/1.4/).
DeprecationWarning)
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 4992892
Server version: 5.6.12-log MySQL Community Server (GPL)
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
mysql> show tables;
+-------------------------------+
| Tables_in_support_allizom_org |
+-------------------------------+
| announcements_announcement |
| auth_group |
| auth_group_permissions |
| auth_message |
| auth_permission |
| auth_user |
* This is probably related to the ongoing issues in prod, see bug 934601
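One common mitigation for long-lived mod_wsgi workers that hit error 2006 on an idle connection is to check the connection before each request and drop it if it is dead, so the ORM reconnects instead of failing the request. Below is a sketch of such a middleware against the Django 1.4-era API this codebase uses (MySQLdb underneath); it is only an illustration, not the fix that was applied here, and it would need to be added to MIDDLEWARE_CLASSES to take effect.

# Sketch: Django middleware that discards a dead MySQL connection before the
# view runs, so the ORM reconnects instead of raising "server has gone away".
# Written against the Django 1.4-era API; not the actual fix used in this bug.
from django.db import connection

class PingDatabaseMiddleware(object):
    def process_request(self, request):
        try:
            if connection.connection is not None:
                connection.connection.ping()  # MySQLdb-level liveness check
        except Exception:
            connection.close()  # force Django to reconnect lazily on next query
        return None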
Comment 18 (Assignee) • 11 years ago
It was relayed to the sumo team that they can "self-service" an apache restart by doing a code push (the restart happens near the end).
Comment 19 (Reporter) • 11 years ago
Still plaguing us (Web QA, at least), to the point where I've disabled tests, because they just randomly hit a host with a 500, and either fail -- or even worse -- hang (due to a Selenium issue). :-(
Comment 20 (Assignee) • 11 years ago
I think this is something we can monitor for, as per bug 1033504 - resolving as a "dupe" but if it's not related, please reopen.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
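For what it's worth, the kind of external check bug 1033504 asks for can be as simple as polling the staging URL and flagging sustained 5xx responses. A sketch follows; the URL, interval, and failure threshold are placeholders, and the real monitoring setup is tracked in that bug.

# Sketch: poll a URL and report when it keeps returning 5xx responses.
# URL, interval, and threshold are placeholders, not the real monitor config.
import time
import urllib2

def monitor(url="https://support.allizom.org/", interval=60, threshold=3):
    failures = 0
    while True:
        try:
            urllib2.urlopen(url, timeout=30)
            failures = 0
        except urllib2.HTTPError as exc:
            failures = failures + 1 if exc.code >= 500 else 0
        except urllib2.URLError:
            failures += 1
        if failures >= threshold:
            print("%s has returned errors %d times in a row" % (url, failures))
        time.sleep(interval)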
Updated (Assignee) • 11 years ago
Whiteboard: [fromAutomation] → [fromAutomation] [consultative]
Updated (Assignee) • 11 years ago
Whiteboard: [fromAutomation] [consultative] → [fromAutomation] [data: consultative]
Updated • 11 years ago
Product: mozilla.org → Data & BI Services Team