Closed Bug 845826 Opened 11 years ago Closed 11 years ago

ElasticSearch failures on mozillians.allizom.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: sancus, Assigned: bburton)

Details

http://qa-selenium.mv.mozilla.com:8080/view/Mozillians/job/mozillians.stage/745/HTML_Report/?

Something on stage regarding ElasticSearch is breaking intermittently and requiring a manual re-index. I don't see any tracebacks from the server, so I'm not really sure what's going on here.
Looking into it; possibly a bad index
Assignee: server-ops-webops → bburton
Priority: -- → P3
So far in my investigation I can't find any issues with the index in ES for mozillians_stage:

{
  "cluster_name" : "phxdev",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "indices" : {
    "mozillians_stage" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "1" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "2" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "3" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "4" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    }
  }
}
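
For reference, a shard-level health report like the one above can be pulled programmatically. This is a minimal sketch using the official elasticsearch-py client; the node address is a placeholder, not the real phxdev host:

# Minimal sketch: fetch shard-level health for the mozillians_stage index.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

health = es.cluster.health(index="mozillians_stage", level="shards")
print(health["status"])             # expect "green" when all shards are assigned
print(health["unassigned_shards"])  # anything > 0 would point at an index problem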

The index is updated once a day by a cron job:

[root@genericadm.private.phx1 cron.d]# cat mozillians.allizom.org
# This file is managed by puppet
MAILTO="webops-cron@mozilla.com,cron-mozillians@mozilla.com"
11 1 * * * root /usr/bin/flock -w 10 /var/lock/mozillians-stage /usr/bin/python -W ignore /data/genericrhel6-stage/src/mozillians.allizom.org/mozillians/manage.py cron index_all_profiles > /dev/null
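
As an aside, the flock -w 10 wrapper only ensures that overlapping runs don't pile up: it waits up to 10 seconds for the lock and gives up otherwise. A rough Python equivalent, purely illustrative and not what manage.py actually does:

# Illustrative only: approximate the `flock -w 10` guard around the reindex command.
import fcntl
import subprocess
import sys
import time

LOCK_PATH = "/var/lock/mozillians-stage"
MANAGE = "/data/genericrhel6-stage/src/mozillians.allizom.org/mozillians/manage.py"

with open(LOCK_PATH, "w") as lock:
    deadline = time.time() + 10  # roughly matches flock's -w 10
    while True:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            break
        except (IOError, OSError):
            if time.time() >= deadline:
                sys.exit(1)  # another reindex is still holding the lock; bail out
            time.sleep(0.5)
    subprocess.check_call(["/usr/bin/python", "-W", "ignore", MANAGE, "cron", "index_all_profiles"])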

The only thing I can really do is delete the index entirely and run a fresh indexing pass.

Let me know if I should proceed with that, or if we should wait for another error and then check the index health.
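
If we do go that route, the delete-and-rebuild is conceptually just the following. This is a sketch with the elasticsearch-py client; the node address and load_all_profiles() are placeholders, and the real job runs through manage.py cron index_all_profiles:

# Sketch of a full "delete the index and reindex" pass; names are placeholders.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

es.indices.delete(index="mozillians_stage", ignore=[404])  # drop the existing index if present
es.indices.create(index="mozillians_stage")                # recreate with default settings/mapping

def profile_docs():
    # Placeholder: in the real job, profiles come from the Django ORM.
    for profile in load_all_profiles():
        yield {"_index": "mozillians_stage", "_id": profile["id"], "_source": profile}

bulk(es, profile_docs())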

Thanks
Status: NEW → ASSIGNED
Flags: needinfo?(sancus)
So if I'm reading that right, the daily indexing happens at 1:11am (PST, I assume?). I haven't seen any incidents of the QA tests beginning to fail on staging since last week.

@mbrandt: How frequently do those tests run? Could the daily reindexing be breaking somehow, periodically?

Perhaps we should be keeping the output from index_all_profiles instead of dumping it to /dev/null? I'm wondering if there's some way to get notice of a broken index other than the staging tests.
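
On getting notice of a broken index without waiting for the staging suite: one cheap option, sketched below with the elasticsearch-py client, is a small check that fails loudly when the index is missing, not green, or suspiciously empty. The node address, the alert() hook, and the document-count floor are all made up for the example.

# Sketch of a post-reindex sanity check; alert() and the threshold are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address
INDEX = "mozillians_stage"
MIN_EXPECTED_DOCS = 1000  # made-up floor; pick something below the normal profile count

def alert(message):
    # Placeholder: mail webops-cron@mozilla.com, page someone, etc.
    raise SystemExit(message)

if not es.indices.exists(index=INDEX):
    alert("index %s is missing" % INDEX)

health = es.cluster.health(index=INDEX)
if health["status"] != "green":
    alert("index %s is %s" % (INDEX, health["status"]))

count = es.count(index=INDEX)["count"]
if count < MIN_EXPECTED_DOCS:
    alert("index %s only has %d docs" % (INDEX, count))
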
Flags: needinfo?(sancus)
Oh, and if it wasn't clear, I think we should wait :)
(In reply to Andrei Hajdukewycz [:sancus] from comment #3)
> So if I'm reading that right, the daily indexing happens at 1:11am (PST, I
> assume?). I haven't seen any incidents of the QA tests beginning to fail on
> staging since last week.
> 

Correct, 1:11AM PST each day.

> @mbrandt: How frequently do those tests run? Could it be the daily
> reindexing breaking somehow, periodically?
> 
> Perhaps we should be keeping the output from index_all_profiles instead of
> dumping it to /dev/null? I'm wondering if there's some way to get notice of
> a broken index other than the staging tests.

I can remove the '> /dev/null' so each day's output goes to "webops-cron@mozilla.com,cron-mozillians@mozilla.com"; I'm fine running this way for a few days.
(In reply to Brandon Burton [:solarce] from comment #5) 
> I can remove the '> /dev/null' and each day's output would go to
> "webops-cron@mozilla.com,cron-mozillians@mozilla.com" , I am fine running
> this way for a few days.

Okay, let's do that for this week. If our fishing expedition turns up nothing, we'll go home and wait for better conditions, I guess!
Committed; this will go live in 30-60 minutes.

bburton@ironbars [05:28:39] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "removing pipe to devnull for mozillians-stage index rebuild for a few days, bug 845826"
Sending        trunk/modules/webapp/files/genericrhel6/admin/etc-cron.d/mozillians.allizom.org
Transmitting file data .
Committed revision 60015.

Will watch the inbox :)
Just a note:

Re-indexing deletes the current index, meaning that the site does not return (all) results for a few minutes. Maybe that's causing the failing tests.

That being said, I believe the cron job is not required anymore: all update-profile -> update-index operations now go through Celery, and we also have a button on the Admin panel to re-index on demand. So I would go for removing it altogether.
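
For what it's worth, that few-minutes-of-missing-results window is the usual argument for indexing into a fresh index and then swapping an alias, instead of deleting in place. A sketch with elasticsearch-py follows; the alias and index names are made up, and this is not how mozillians reindexes today:

# Sketch: build a new index, then atomically repoint the alias so readers never
# see an empty index. Names are illustrative, not mozillians' actual setup.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node.phxdev.example:9200"])  # hypothetical node address

ALIAS = "mozillians_profiles"
NEW_INDEX = "mozillians_profiles_20130301"  # e.g. date-stamped build
OLD_INDEX = "mozillians_profiles_20130228"

es.indices.create(index=NEW_INDEX)
# ... bulk-index all profiles into NEW_INDEX here ...

es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": ALIAS}},
    ]
})
es.indices.delete(index=OLD_INDEX)  # drop the stale build once the alias has moved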

@solarce, can you please check whether such a cron job exists on prod too?

Thanks!
The test failures seem to happen quite some time after the reindexing was supposed to take place, though, not mere minutes. They persisted for many hours and were finally fixed when I did a manual reindex from the Admin panel, so I don't think it's a momentary break.
This hasn't recurred since the initial report, so I think we're done for now; if it happens again we can look at it. Hopefully New Relic will let us monitor ES a little better as well.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard