
ElasticSearch failures on mozillians.allizom.org

RESOLVED WORKSFORME

Status

Product: Infrastructure & Operations
Component: WebOps: Other
Priority: P3
Severity: normal
Resolution: RESOLVED WORKSFORME
Opened: 6 years ago
Last modified: 5 years ago

People

(Reporter: sancus, Assigned: solarce)

Tracking

Details

(Reporter)

Description

6 years ago
http://qa-selenium.mv.mozilla.com:8080/view/Mozillians/job/mozillians.stage/745/HTML_Report/?

Something ElasticSearch-related on stage is breaking intermittently and requires a manual re-index. I don't see any tracebacks from the server, so I'm not sure what's going on here.
(Assignee)

Comment 1

5 years ago
Looking into it; possibly a bad index.
Assignee: server-ops-webops → bburton
Priority: -- → P3
(Assignee)

Comment 2

5 years ago
So far in my investigation I can't find any issues with the ES index for mozillians_stage:

{
  "cluster_name" : "phxdev",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "indices" : {
    "mozillians_stage" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "1" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "2" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "3" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "4" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    }
  }
}
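A check like the one above can be scripted rather than eyeballed. A minimal sketch (not part of the Mozillians tooling; it assumes the JSON comes from ES's `_cluster/health?level=indices` endpoint, fetched however you like) that flags anything short of an all-green index:

```python
import json

def index_health_problems(health_json, index_name):
    """Return a list of problems found in an ES cluster-health response
    (level=indices), or an empty list if the index looks healthy."""
    health = json.loads(health_json)
    problems = []
    if health.get("status") != "green":
        problems.append("cluster status is %s" % health.get("status"))
    idx = health.get("indices", {}).get(index_name)
    if idx is None:
        problems.append("index %s missing from health report" % index_name)
    else:
        if idx.get("status") != "green":
            problems.append("index status is %s" % idx.get("status"))
        if idx.get("unassigned_shards", 0) > 0:
            problems.append("%d unassigned shards" % idx["unassigned_shards"])
    return problems

# Trimmed-down copy of the response above:
sample = json.dumps({
    "cluster_name": "phxdev",
    "status": "green",
    "indices": {"mozillians_stage": {"status": "green", "unassigned_shards": 0}},
})
print(index_health_problems(sample, "mozillians_stage"))  # -> []
```

Wiring something like this into monitoring would surface a broken index without waiting for the staging tests to fail.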

The index is updated once a day by a cron job:

[root@genericadm.private.phx1 cron.d]# cat mozillians.allizom.org
# This file is managed by puppet
MAILTO="webops-cron@mozilla.com,cron-mozillians@mozilla.com"
11 1 * * * root /usr/bin/flock -w 10 /var/lock/mozillians-stage /usr/bin/python -W ignore /data/genericrhel6-stage/src/mozillians.allizom.org/mozillians/manage.py cron index_all_profiles > /dev/null
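For reference, the `flock -w 10` wrapper waits up to 10 seconds for the lock file, so a run that overlaps a still-running index job gives up quickly instead of piling up. A Python sketch of the same pattern (using `fcntl.flock` on an arbitrary lock path; this is an illustration, not the actual cron mechanism):

```python
import fcntl
import time

def run_exclusively(lock_path, job, wait_seconds=10):
    """Run `job` while holding an exclusive lock on lock_path.
    Like `flock -w 10`: retry for up to wait_seconds, then give up."""
    with open(lock_path, "w") as lock_file:
        deadline = time.time() + wait_seconds
        while True:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break  # lock acquired
            except (IOError, OSError):
                if time.time() >= deadline:
                    return False  # another run still holds the lock
                time.sleep(0.5)
        try:
            job()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```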

The only thing I can really do is delete the index entirely and run a fresh indexing.

Let me know if I should proceed with that, or if we should wait for another error and then check the index health.

Thanks
Status: NEW → ASSIGNED
Flags: needinfo?(sancus)
(Reporter)

Comment 3

5 years ago
So if I'm reading that right, the daily indexing happens at 1:11am (PST, I assume?). I haven't seen any incidents of the QA tests beginning to fail on staging since last week.

@mbrandt: How frequently do those tests run? Could it be the daily reindexing breaking somehow, periodically?

Perhaps we should be keeping the output from index_all_profiles instead of dumping it to /dev/null? I'm wondering if there's some way to get notice of a broken index other than the staging tests.
Flags: needinfo?(sancus)
(Reporter)

Comment 4

5 years ago
Oh, and if it wasn't clear, I think we should wait :)
(Assignee)

Comment 5

5 years ago
(In reply to Andrei Hajdukewycz [:sancus] from comment #3)
> So if I'm reading that right, the daily indexing happens at 1:11am (PST, I
> assume?). I haven't seen any incidents of the QA tests beginning to fail on
> staging since last week.
> 

Correct, 1:11AM PST each day.

> @mbrandt: How frequently do those tests run? Could it be the daily
> reindexing breaking somehow, periodically?
> 
> Perhaps we should be keeping the output from index_all_profiles instead of
> dumping it to /dev/null? I'm wondering if there's some way to get notice of
> a broken index other than the staging tests.

I can remove the '> /dev/null', and each day's output will go to "webops-cron@mozilla.com,cron-mozillians@mozilla.com". I'm fine running it this way for a few days.
(Reporter)

Comment 6

5 years ago
(In reply to Brandon Burton [:solarce] from comment #5) 
> I can remove the '> /dev/null' and each day's output would go to
> "webops-cron@mozilla.com,cron-mozillians@mozilla.com" , I am fine running
> this way for a few days.

Okay, let's do that for this week. If our fishing expedition turns up nothing, we'll go home and wait for better conditions, I guess!
(Assignee)

Comment 7

5 years ago
Committed; it will go live in 30-60 minutes.

bburton@ironbars [05:28:39] [~/code/mozilla/sysadmins/puppet/trunk]
-> % svn ci -m "removing pipe to devnull for mozillians-stage index rebuild for a few days, bug 845826"
Sending        trunk/modules/webapp/files/genericrhel6/admin/etc-cron.d/mozillians.allizom.org
Transmitting file data .
Committed revision 60015.

Will watch the inbox :)
Comment 8

5 years ago
Just a note:

Re-indexing deletes the current index, meaning that the site does not return (all) results for a few minutes. Maybe that's causing the failing tests.

That being said, I believe the cron job is no longer required: all update-profile->update-index commands go through celery, and we now also have a button on the admin panel to re-index on demand. So I would go for removing it altogether.
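The gap described above (no results while the index is rebuilt) is usually avoided with an alias swap: build the new index under a versioned name, then repoint the alias the application queries. A sketch of the idea with plain dicts standing in for ES (the real thing uses the ES aliases API, which makes the switch atomic; none of this is the actual Mozillians code):

```python
def reindex_with_alias_swap(store, aliases, alias, build_new_index):
    """Rebuild without a window of missing results: populate a new index
    under a fresh name, then repoint the alias, then drop the old index."""
    old_name = aliases.get(alias)
    new_name = alias + ("_v2" if old_name != alias + "_v2" else "_v3")
    store[new_name] = build_new_index()   # readers still hit the old index
    aliases[alias] = new_name             # the "atomic" switch
    if old_name:
        del store[old_name]               # drop the stale index afterwards

# Usage: readers always look up store[aliases["mozillians_stage"]]
store = {"mozillians_stage_v1": ["old docs"]}
aliases = {"mozillians_stage": "mozillians_stage_v1"}
reindex_with_alias_swap(store, aliases, "mozillians_stage",
                        lambda: ["fresh docs"])
print(store[aliases["mozillians_stage"]])  # -> ['fresh docs']
```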

@solarce can you please check if such a cronjob exists on prod too?

Thanks!
(Reporter)

Comment 9

5 years ago
The test failures happened quite some time after the reindexing was supposed to take place, though, not mere minutes. They persisted for many hours and were finally fixed when I did a manual reindex from the admin panel. So I don't think it's a momentary break.
(Reporter)

Comment 10

5 years ago
This hasn't recurred since the initial report, so I think we're done for now; if it happens again we can look at it. Hopefully New Relic will also let us monitor ES a little better.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations