ES reported being out of disk space for a time period

RESOLVED WORKSFORME

Status

Socorro
Infra
RESOLVED WORKSFORME
Opened 4 years ago
Closed 3 years ago

People

(Reporter: lars, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
On 2013-08-28, the Socorro processors reported that Elasticsearch had run out of disk space:

    "Non-OK response returned (500): u'IndexFailedEngineException[[socorro201334][2] Index failed for [crash_reports#b03ec07b-744c-4933-a7cf-104d22130828]]; nested: IOException[No space left on device]"

This did not happen for all submissions; most succeeded.

The failures occurred during the following time periods:
  01:53:11 to 01:54:01
  09:14:06 to 12:21:20

This affected approximately 3.5K crashes per processor, for a total of about 35K submissions.

The problem is not recurring today, 2013-08-29, nor did it happen on any previous day. We need to know what caused (and resolved?) the problem.

Do we need to resubmit those crashes to ES?

Comment 1

I strongly suspect the reason lies in bug 907673. We are currently in the process of adding new machines. I am a bit sad to see that Elasticsearch is not smart enough to index data on a different machine when one is full, though...

As a quick fix, I am going to delete some of the oldest data that we store, until we add new machines to the cluster. Then we should resubmit the failed crashes. Lars, how would we do that? Do we have a list of the failed crashes? Or should we simply resubmit everything for that day?
(Reporter)

Comment 2

4 years ago
There are two methods to resubmit these crashes to ES:
  1) Provision a temporary crashmover, configured with ES as its sole crashstorage destination, that reads the processor logs and extracts the crash_ids.
  2) Reprocess all the crashes that were involved.

The latter is likely easier, though computationally more expensive. The first step is to extract the crash_ids from the ten processor logs.  Then take those crash_ids and insert them into the priority jobs queue table in postgres.  The processors will handle everything from there.
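
A minimal sketch of the log-extraction step for the reprocessing route, assuming the processor logs contain the same message text as the error quoted in the description. The function names are hypothetical, and the `priorityjobs` table and `uuid` column are assumptions about the Postgres schema that this bug does not confirm. The sketch only builds the statements; running them is left to whoever has database access.

```python
import re

# Matches the crash_id inside the ES failure message quoted in the description,
# e.g. "Index failed for [crash_reports#b03ec07b-744c-4933-a7cf-104d22130828]".
FAILED_INDEX_RE = re.compile(
    r"Index failed for \[crash_reports#"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\]"
)


def extract_failed_crash_ids(log_lines):
    """Return unique crash_ids from out-of-disk index failures, in first-seen order."""
    seen = set()
    crash_ids = []
    for line in log_lines:
        # Only keep failures caused by the full disk, not other index errors.
        if "No space left on device" not in line:
            continue
        for crash_id in FAILED_INDEX_RE.findall(line):
            if crash_id not in seen:
                seen.add(crash_id)
                crash_ids.append(crash_id)
    return crash_ids


def priority_job_statements(crash_ids):
    """Build (sql, params) pairs suitable for a DB-API cursor.execute call.

    ASSUMPTION: the table name `priorityjobs` and column `uuid` are guesses.
    """
    sql = "INSERT INTO priorityjobs (uuid) VALUES (%s)"
    return [(sql, (crash_id,)) for crash_id in crash_ids]
```

Running `extract_failed_crash_ids` over each of the ten processor logs and merging the results would give the list of crashes to queue. Deduplicating matters because a processor may log the same failure more than once.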

The former would be a fun exercise, but would require coding and access to running an app on each of the processor boxes.

Comment 3

Just for fun, here's an exploration of how indexing on a different node is actually messier than it sounds. ES divides documents into shards based on (by default) a hash of the doc ID. To index a doc onto a machine that has free space, it would have to migrate that shard, and all other replicas of it, each to a unique machine with free space. Then you'd probably come along and index another doc which would need to go in a different shard, so the process would have to be repeated. You can see how this would quickly lead to a lot of shard-moving thrashing. Cheers!
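
To make the routing point concrete, here is a rough sketch of hash-based shard selection. CRC32 is used purely as a stand-in hash and is not the function Elasticsearch uses; the function names and the shard-to-node mapping are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from the doc ID alone.

    ASSUMPTION: CRC32 stands in for the real routing hash.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def node_for(doc_id, num_shards, shard_to_node):
    """Look up the node holding the doc's shard; free disk space is never consulted."""
    return shard_to_node[shard_for(doc_id, num_shards)]
```

Because the shard depends only on the ID, a document cannot simply be steered to whichever node has room. Changing where it lands means moving the whole shard that the ID hashes to.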
(Reporter)

Comment 4

3 years ago
Since this is more than a year old, I'm assuming it is resolved. If it is not, feel free to reopen.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME