Closed Bug 845826 Opened 11 years ago Closed 11 years ago

ElasticSearch failures on mozillians.allizom.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: sancus, Assigned: bburton)

Details

http://qa-selenium.mv.mozilla.com:8080/view/Mozillians/job/mozillians.stage/745/HTML_Report/?

Something on stage regarding ElasticSearch is breaking intermittently and requiring a manual re-index. I don't see any tracebacks from the server, so I'm not really sure what's going on here.
Looking into it; possibly a bad index
Assignee: server-ops-webops → bburton
Priority: -- → P3
So far in my investigation I can't find any issues with the index in ES for mozillians_stage:

{
  "cluster_name" : "phxdev",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "indices" : {
    "mozillians_stage" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "1" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "2" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "3" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "4" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    }
  }
}
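
For reference, a shard-level health report like the one above can be pulled programmatically. This is a minimal sketch using the official elasticsearch-py client; the node address is a placeholder, not the real phxdev host:

# Minimal sketch: fetch shard-level health for the mozillians_stage index.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

health = es.cluster.health(index="mozillians_stage", level="shards")
print(health["status"])             # expect "green" when all shards are assigned
print(health["unassigned_shards"])  # anything > 0 would point at an index problem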

The index is updated once a day by a cron job:

[root@genericadm.private.phx1 cron.d]# cat mozillians.allizom.org
# This file is managed by puppet
MAILTO="webops-cron@mozilla.com,cron-mozillians@mozilla.com"
11 1 * * * root /usr/bin/flock -w 10 /var/lock/mozillians-stage /usr/bin/python -W ignore /data/genericrhel6-stage/src/mozillians.allizom.org/mozillians/manage.py cron index_all_profiles > /dev/null
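
As an aside, the flock -w 10 wrapper only ensures that overlapping runs don't pile up: it waits up to 10 seconds for the lock and gives up otherwise. A rough Python equivalent, purely illustrative and not what manage.py actually does:

# Illustrative only: approximate the `flock -w 10` guard around the reindex command.
import fcntl
import subprocess
import sys
import time

LOCK_PATH = "/var/lock/mozillians-stage"
MANAGE = "/data/genericrhel6-stage/src/mozillians.allizom.org/mozillians/manage.py"

with open(LOCK_PATH, "w") as lock:
    deadline = time.time() + 10  # roughly matches flock's -w 10
    while True:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            break
        except (IOError, OSError):
            if time.time() >= deadline:
                sys.exit(1)  # another reindex is still holding the lock; bail out
            time.sleep(0.5)
    subprocess.check_call(["/usr/bin/python", "-W", "ignore", MANAGE, "cron", "index_all_profiles"])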

The only thing I can really do is delete the index entirely and run a fresh indexing pass.

Let me know if I should proceed with that, or if we should wait for another error and then check the index health.
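
If we do go that route, the delete-and-rebuild is conceptually just the following. This is a sketch with the elasticsearch-py client; the node address and load_all_profiles() are placeholders, and the real job runs through manage.py cron index_all_profiles:

# Sketch of a full "delete the index and reindex" pass; names are placeholders.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

es.indices.delete(index="mozillians_stage", ignore=[404])  # drop the existing index if present
es.indices.create(index="mozillians_stage")                # recreate with default settings/mapping

def profile_docs():
    # Placeholder: in the real job, profiles come from the Django ORM.
    for profile in load_all_profiles():
        yield {"_index": "mozillians_stage", "_id": profile["id"], "_source": profile}

bulk(es, profile_docs())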

Thanks
Status: NEW → ASSIGNED
Flags: needinfo?(sancus)
So if I'm reading that right, the daily indexing happens at 1:11am (PST, I assume?). I haven't seen any incidents of the QA tests beginning to fail on staging since last week.

@mbrandt: How frequently do those tests run? Could the daily reindexing be breaking somehow, periodically?

Perhaps we should be keeping the output from index_all_profiles instead of dumping it to /dev/null? I'm wondering if there's some way to get notice of a broken index other than the staging tests.
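
On getting notice of a broken index without waiting for the staging suite: one cheap option, sketched below with the elasticsearch-py client, is a small check that fails loudly when the index is missing, not green, or suspiciously empty. The node address, the alert() hook, and the document-count floor are all made up for the example.

# Sketch of a post-reindex sanity check; alert() and the threshold are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address
INDEX = "mozillians_stage"
MIN_EXPECTED_DOCS = 1000  # made-up floor; pick something below the normal profile count

def alert(message):
    # Placeholder: mail webops-cron@mozilla.com, page someone, etc.
    raise SystemExit(message)

if not es.indices.exists(index=INDEX):
    alert("index %s is missing" % INDEX)

health = es.cluster.health(index=INDEX)
if health["status"] != "green":
    alert("index %s is %s" % (INDEX, health["status"]))

count = es.count(index=INDEX)["count"]
if count < MIN_EXPECTED_DOCS:
    alert("index %s only has %d docs" % (INDEX, count))
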
Flags: needinfo?(sancus)
Oh, and if it wasn't clear, I think we should wait :)
(In reply to Andrei Hajdukewycz [:sancus] from comment #3)
> So if I'm reading that right, the daily indexing happens at 1:11am (PST, I
> assume?). I haven't seen any incidents of the QA tests beginning to fail on
> staging since last week.
> 

Correct, 1:11AM PST each day.

> @mbrandt: How frequently do those tests run? Could it be the daily
> reindexing breaking somehow, periodically?
> 
> Perhaps we should be keeping the output from index_all_profiles instead of
> dumping it to /dev/null? I'm wondering if there's some way to get notice of
> a broken index other than the staging tests.

I can remove the '> /dev/null' so each day's output goes to "webops-cron@mozilla.com,cron-mozillians@mozilla.com"; I'm fine running this way for a few days.
(In reply to Brandon Burton [:solarce] from comment #5) 
> I can remove the '> /dev/null' and each day's output would go to
> "webops-cron@mozilla.com,cron-mozillians@mozilla.com" , I am fine running
> this way for a few days.

Okay, let's do that for this week. If our fishing expedition turns up nothing, we'll go home and wait for better conditions, I guess!
Committed; this will go live in 30-60 minutes.

bburton@ironbars [05:28:39] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "removing pipe to devnull for mozillians-stage index rebuild for a few days, bug 845826"
Sending        trunk/modules/webapp/files/genericrhel6/admin/etc-cron.d/mozillians.allizom.org
Transmitting file data .
Committed revision 60015.

Will watch the inbox :)
Just a note:

Re-indexing deletes the current index, meaning that the site does not return (all) results for a few minutes. Maybe that's causing the failing tests.

That being said, I believe the cron job is not required anymore: all update-profile -> update-index operations now go through Celery, and we also have a button on the Admin panel to re-index on demand. So I would go for removing it altogether.
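
For what it's worth, that few-minutes-of-missing-results window is the usual argument for indexing into a fresh index and then swapping an alias, instead of deleting in place. A sketch with elasticsearch-py follows; the alias and index names are made up, and this is not how mozillians reindexes today:

# Sketch: build a new index, then atomically repoint the alias so readers never
# see an empty index. Names are illustrative, not mozillians' actual setup.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

ALIAS = "mozillians_profiles"
NEW_INDEX = "mozillians_profiles_20130301"  # e.g. date-stamped build
OLD_INDEX = "mozillians_profiles_20130228"

es.indices.create(index=NEW_INDEX)
# ... bulk-index all profiles into NEW_INDEX here ...

es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": ALIAS}},
    ]
})
es.indices.delete(index=OLD_INDEX)  # drop the stale build once the alias has moved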

@solarce, can you please check whether such a cron job exists on prod too?

Thanks!
The test failures seem to happen quite some time after the reindexing was supposed to take place, though, not mere minutes. They persisted for many hours and were finally fixed when I did a manual reindex from the Admin panel, so I don't think it's a momentary break.
This hasn't recurred since the initial report, so I think we're done for now; if it happens again we can look at it. Hopefully New Relic will let us monitor ES a little better as well.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard