Closed Bug 698314 Opened 8 years ago Closed 8 years ago

ganglia claims that l10n VMs are down, though they're not

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: bkero)

Details

https://ganglia.mozilla.org/sjc1/?c=Localization&m=load_one&r=hour&s=descending&hc=4&mc=2 reports that bm-l10n-dashboard01 and bm-l10n-db1 are down, though they're up.

There has been a DB connection error from bm-l10n-dashboard01 to bm-l10n-db1, though. The relevant snippet from my logs:

2011-10-30 18:30:15+0000 [-] Unhandled error in Deferred:
2011-10-30 18:30:15+0000 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run
	    self.mainLoop()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
	    self.runUntilCurrent()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 194, in __call__
	    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
	--- <exception caught here> ---
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
	    result = f(*args, **kw)
	  File "/usr/local/lib/python2.6/dist-packages/Django-1.1-py2.6.egg/django/db/transaction.py", line 265, in _commit_manually
	    return func(*args, **kw)
	  File "/home/dashboard/site/locale-inspector/l10ninsp/changes.py", line 40, in poll
	    transaction.commit()
	  File "/usr/local/lib/python2.6/dist-packages/Django-1.1-py2.6.egg/django/db/transaction.py", line 167, in commit
	    connection._commit()
	  File "/usr/local/lib/python2.6/dist-packages/Django-1.1-py2.6.egg/django/db/backends/__init__.py", line 38, in _commit
	    return self.connection.commit()
	_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')

... which matches the "5 hours ago" I see in ganglia, time-wise.
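
For anyone wanting to repeat that time comparison, here is a minimal sketch that parses the log timestamp format shown in the traceback and computes the gap to the moment ganglia was checked. The observation time used below is a hypothetical value chosen for illustration (it is not recorded in this bug), and `hours_between` is an illustrative helper name, not part of any tool mentioned here.

```python
from datetime import datetime

# Timestamp layout used by the log lines above, e.g. "2011-10-30 18:30:15+0000".
LOG_FORMAT = "%Y-%m-%d %H:%M:%S%z"


def hours_between(log_stamp, observed_stamp):
    """Return the number of hours from log_stamp to observed_stamp.

    Both arguments are strings in LOG_FORMAT. The UTC offsets are honoured,
    so the two stamps may come from different time zones.
    """
    logged = datetime.strptime(log_stamp, LOG_FORMAT)
    observed = datetime.strptime(observed_stamp, LOG_FORMAT)
    return (observed - logged).total_seconds() / 3600.0


# Time of the MySQL error, taken from the log snippet.
error_time = "2011-10-30 18:30:15+0000"
# Hypothetical time at which the ganglia page was viewed (assumption).
observed_time = "2011-10-30 23:30:15+0000"

print(hours_between(error_time, observed_time))
```

If the result lines up with the "last heard from" age that ganglia shows, the outage in reporting and the lost DB connection likely started at about the same moment, which suggests they share a cause.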

Filing this for tracking and investigation. If there's a need to stop either of the VMs, let's coordinate on that with a real downtime.
Any update on this? It'd be nice to get ganglia to report on these again.
Assignee: server-ops → bkero
This started reporting again Friday afternoon, without any action from the IT team.

I've been searching through the logs on both machines, but I can't find any mention of ganglia complaining about a network failure. If the error arises again, please reopen and I'll look at it straight away.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard