[traceback] crontabber node: [Errno 28] No space left on device

RESOLVED FIXED

Status

Socorro
Infra
RESOLVED FIXED
9 months ago
9 months ago

People

(Reporter: willkg, Assigned: miles)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

We're seeing errors on the crontabber node:

"""
IOError: [Errno 28] No space left on device
  File "crontabber/app.py", line 975, in _run_one
    for last_success in self._run_job(job_class, config, info):
  File "crontabber/base.py", line 189, in main
    function()
  File "crontabber/base.py", line 259, in _run_proxy
    return self.run(*args, **kwargs)
  File "socorro/cron/jobs/elasticsearch_cleanup.py", line 27, in run
    cleaner.delete_old_indices()
  File "socorro/external/es/index_cleaner.py", line 48, in delete_old_indices
    index_client = es_class.indices_client()
  File "socorro/external/es/connection_context.py", line 129, in indices_client
    elasticsearch.client.IndicesClient(self.connection())
  File "socorro/external/es/connection_context.py", line 117, in connection
    elasticsearch.connection.RequestsHttpConnection
  File "elasticsearch/client/__init__.py", line 110, in __init__
    self.transport = transport_class(_normalize_hosts(hosts), **kwargs)
  File "elasticsearch/client/__init__.py", line 38, in _normalize_hosts
    host
  File "logging/__init__.py", line 1171, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "logging/__init__.py", line 1278, in _log
    self.handle(record)
  File "logging/__init__.py", line 1288, in handle
    self.callHandlers(record)
  File "logging/__init__.py", line 1335, in callHandlers
    " \"%s\"\n" % self.name)
"""

Looks like the disk is full.

This issue covers fixing the immediate problem (no space left) and looking into why it ran out of space. Does it need a logrotate thinger? Does it need spiritual guidance?
Sentry issue: https://sentry.prod.mozaws.net/operations/socorro-prod/issues/378838/

Looks like it might have been happening for a long while? Either that or I don't know how to read the graphs and charts and all that technical stuff.
It started last Friday, so this has been going on for about 4 days. 

I'm confused by the fact that it happens only during the elasticsearch-cleanup job.
We have a crontabber.log file in /var/log/socorro/ that takes up 5.2G of the 8G disk. I suppose that's a problem, but I'm not sure how best to solve it. I don't want to just delete that file.
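
For anyone repeating this kind of check, here is a minimal sketch of how the space hog could be located from Python. It is not what was actually run on the admin node. The helper names `largest_files` and `percent_used` are made up for illustration, and `shutil.disk_usage` assumes Python 3.3 or later, while the traceback above comes from a Python 2 process.

```python
import os
import shutil


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    log_dir = "/var/log/socorro"  # directory named in this bug
    if os.path.isdir(log_dir):
        print("%.1f%% of the disk is in use" % percent_used(log_dir))
        for size, path in largest_files(log_dir):
            print("%10.1f MiB  %s" % (size / 2.0 ** 20, path))
```

Run on the affected host, this would print the overall usage followed by the biggest files under the log directory.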

Deferring to Miles.
Assignee: nobody → miles
Flags: needinfo?(miles)
Note: this is on the production admin server.
(Assignee)

Comment 5

9 months ago
Gah! Lost the bugmail associated with this.
I think deleting the current log, putting logrotate in place, and setting up Datadog alerts going forward is the way to handle this. A 5.2G log file makes no one happy.
Flags: needinfo?(miles)
(Assignee)

Comment 6

9 months ago
I deleted the 5.2G log yesterday and put logrotate in place. I have also set up Datadog disk space alerts for all hosts tagged Environment={stage,prod}. The caveat is that not all of these hosts (including the prod admin node) have the Datadog Agent installed with the correct configuration, so some are not reporting. I am remedying this and will be making these changes using Ansible.
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → FIXED
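
The fix applied here was logrotate, which rotates the file from outside the process. As a point of comparison only, the sketch below shows a different technique: capping log size from inside a Python process with the standard library's `RotatingFileHandler`. This is not how crontabber is configured. The function name `build_rotating_logger`, the logger name, and the size limits are all assumptions chosen for illustration.

```python
import logging
import logging.handlers


def build_rotating_logger(path, max_bytes=50 * 1024 * 1024, backup_count=5,
                          name="rotation_sketch"):
    """Return a logger whose file is rolled over before it reaches `max_bytes`.

    At most `backup_count` old files (path.1, path.2, ...) are kept, so the
    total space used stays near max_bytes * (backup_count + 1), provided no
    single message is itself larger than max_bytes.
    """
    logger = logging.getLogger(name)  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Either approach bounds the space the log can take. logrotate has the advantage of needing no change to the application, which is presumably why it was the choice here.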