Closed Bug 925532 Opened 11 years ago Closed 11 years ago

Elasticsearch indexing failing on mozillians.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hoosteeno, Assigned: bburton)

Details

Our production push usually triggers a reindex, but that has failed in the last two releases. Indexing seems to get partway and stop.

The first time it happened (last week), we created a second push with some trivial change and after pushing that change, indexing worked. This time that workaround hasn't helped. We've also tried rolling back and rolling forward, but neither has helped.

We also cannot force indexing to finish using the button (in /admin) that we normally use if we need to trigger a reindex.

We're not getting any traceback emails, and the mozilliansprodpush bot isn't showing any errors.

The practical outcome of this is that there are very few search results in Mozillians, which approaches the urgency of a system down, since search is the primary function.

This search should return lots of results:
https://mozillians.org/en-US/search/?q=a&limit=
(In reply to Justin Crawford [:hoosteeno] from comment #0)
> Our production push usually triggers a reindex, but that has failed in the
> last two releases. Indexing seems to get partway and stop.
>
> The first time it happened (last week), we created a second push with some
> trivial change and after pushing that change, indexing worked. This time
> that workaround hasn't helped. We've also tried rolling back, rolling
> forward, but neither has helped.
>
> We also cannot force indexing to finish using the button (in /admin) that we
> normally use if we need to trigger a reindex.
>
> We're not getting any traceback emails, and the mozilliansprodpush bot isn't
> showing any errors.
>
> The practical outcome of this is that there are very few search results in
> Mozillians, which approaches the urgency of a system down, since search is
> the primary function.
>
> This search should return lots of results:
> https://mozillians.org/en-US/search/?q=a&limit=

I am investigating.
Assignee: server-ops-webops → bburton
Status: NEW → ASSIGNED
The main problem was that generic-celery1.webapp.phx1 was in a state where it wasn't doing work but also wasn't triggering any monitoring alerts.

A forced restart of generic-celery1 returned it to service, after which it and generic-celery2 seemed to get through the indexing work quickly. I suspect that generic-celery2 by itself wasn't completing the jobs before hitting some timeout.

I'll be filing additional bugs to add more monitoring and will look into getting New Relic for the celery side of things.
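For the follow-up monitoring work, the failure mode here (a celery worker that is up but not doing any work, without tripping an alert) could be caught by checking how long ago each worker last finished a task. The snippet below is only a rough sketch of that idea, not what was deployed: the function name `find_stalled_workers`, the 10 minute threshold, and the sample timestamps are all made up for illustration, and how the "last finished" times would actually be collected from celery is left open.

```python
from datetime import datetime, timedelta, timezone


def find_stalled_workers(last_task_finished, now, max_idle=timedelta(minutes=10)):
    """Return sorted names of workers that have not finished a task recently.

    last_task_finished maps a worker name to the datetime of its most recent
    completed task, or None if it has never reported one. The 10 minute
    default for max_idle is an arbitrary, hypothetical threshold.
    """
    stalled = []
    for worker, finished_at in last_task_finished.items():
        # A worker that never reported anything is treated as stalled too.
        if finished_at is None or now - finished_at > max_idle:
            stalled.append(worker)
    return sorted(stalled)


if __name__ == "__main__":
    # Illustrative data only; these timestamps are not from the incident.
    now = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
    seen = {
        "generic-celery1": now - timedelta(hours=2),
        "generic-celery2": now - timedelta(minutes=1),
    }
    for name in find_stalled_workers(seen, now):
        print("ALERT: no completed tasks recently on", name)
```

A check like this alerts on the absence of work rather than on an error, which is the gap in this case: the worker process was alive, so no failure was ever reported.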
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard