Closed Bug 910753 Opened 11 years ago Closed 10 years ago

ES reported being out of disk space for a time period

Categories

(Socorro :: Infra, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: lars, Unassigned)

Details

On 2013-08-28, the Socorro processors reported that Elasticsearch was out of disk space:

    "Non-OK response returned (500): u'IndexFailedEngineException[[socorro201334][2] Index failed for [crash_reports#b03ec07b-744c-4933-a7cf-104d22130828]]; nested: IOException[No space left on device]"

This did not happen for all submissions; most succeeded.

The failures occurred during the following time periods:
  01:53:11 to 01:54:01
  09:14:06 to 12:21:20

This affected approximately 3.5K crashes per processor, for a total of roughly 35K submissions across the ten processors.

The problem has not recurred today (2013-08-29), nor did it happen on any previous day. We need to know what caused (and resolved?) it.

Do we need to resubmit those crashes to ES?
I strongly suspect the reason lies in bug 907673. We are currently in the process of adding new machines. I am a bit sad to see that Elasticsearch is not smart enough to index data on a different machine when one is full, though...

As a quick fix, I am going to delete some of the oldest data that we store, until we add new machines to the cluster. Then we should resubmit them. Lars, how would we do that? Do we have a list of the failed crashes? Or should we simply resubmit everything for that day?
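For the record, a minimal sketch of what dropping an old weekly index could look like. The socorroYYYYWW index naming is taken from the error above; the host, the index week chosen, and the use of the requests library are assumptions:

    import requests  # assumption: the requests library is available

    ES_URL = "http://localhost:9200"   # assumption: ES reachable at this address
    OLD_INDEX = "socorro201301"        # assumption: an old weekly index to drop

    # Socorro names its weekly indexes socorroYYYYWW (e.g. socorro201334 in
    # the error above). Deleting the oldest week frees disk on the nodes
    # holding its shards.
    response = requests.delete(ES_URL + "/" + OLD_INDEX)
    response.raise_for_status()
    print("deleted", OLD_INDEX, response.json())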
There are two methods to resubmit these crashes to ES:
  1) provision a temporary crashmover that can read the processor logs and extract the crash_ids and set ES as the sole crashstorage destination.
  2) reprocess all the crashes that were involved.

The latter is likely easier, though computationally more expensive. The first step is to extract the crash_ids from the ten processor logs. Then insert those crash_ids into the priority jobs queue table in Postgres; the processors will handle everything from there.
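A rough sketch of what that could look like, assuming the failed crash_ids appear on the error lines of the processor logs and that the priority queue is a plain Postgres table. The priorityjobs table name, the uuid column, the log paths, and the DSN are all assumptions:

    import glob
    import re

    import psycopg2  # assumption: psycopg2 is available where this runs

    # Crash IDs look like b03ec07b-744c-4933-a7cf-104d22130828: UUID-shaped,
    # with the submission date encoded in the last six digits.
    CRASH_ID_RE = re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{6}\d{6}"
    )

    def extract_crash_ids(log_paths):
        """Collect crash_ids from processor log lines that hit the ES error."""
        crash_ids = set()
        for path in log_paths:
            with open(path) as log:
                for line in log:
                    if "No space left on device" in line:
                        crash_ids.update(CRASH_ID_RE.findall(line))
        return crash_ids

    def enqueue_priority_jobs(crash_ids, dsn):
        """Insert each crash_id into the priority jobs queue table; the
        processors pick the work up from there."""
        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cursor:
            for crash_id in crash_ids:
                # Table and column names are assumptions, not Socorro's
                # confirmed schema.
                cursor.execute(
                    "INSERT INTO priorityjobs (uuid) VALUES (%s)",
                    (crash_id,),
                )
        conn.close()

    if __name__ == "__main__":
        ids = extract_crash_ids(glob.glob("/var/log/socorro/processor*.log"))
        enqueue_priority_jobs(ids, "dbname=breakpad user=postgres")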

The former would be a fun exercise, but it would require coding and access to run an app on each of the processor boxes.
Just for fun, here's an exploration of how indexing on a different node is actually messier than it sounds. ES divides documents into shards based on (by default) a hash of the doc ID. To index a doc onto a machine that has free space, it would have to migrate that shard (and all other replicas of it) each to a unique space-having machine. Then you'd probably come along and index another doc which would need to go in a different shard, so the process would have to be repeated. You can see how this would quickly lead to a lot of shard-moving thrashing. Cheers!
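To make that concrete, a toy sketch of hash-based routing; crc32 stands in for Elasticsearch's internal hash function:

    import zlib

    def shard_for(doc_id, num_shards):
        # Default-style routing: the shard is a pure function of the doc ID,
        # not of which node currently has free disk, so a full node cannot
        # simply be skipped for the next document.
        return zlib.crc32(doc_id.encode("utf-8")) % num_shards

    # Two crash IDs will usually route to different shards, so relocating one
    # shard to a space-having node does not help the next indexing request.
    print(shard_for("b03ec07b-744c-4933-a7cf-104d22130828", 5))
    print(shard_for("a1b2c3d4-0000-4933-a7cf-104d22130828", 5))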
Since this is more than a year old, I'm assuming it is a resolved issue. If it is not, feel free to reopen...
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME