Closed Bug 1128833 Opened 9 years ago Closed 9 years ago

Elmo-specific ES cluster doesn't index new documents

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Unassigned)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/402] )

Attachments

(1 file)

The Elasticsearch cluster for elmo from bug #1020349 seems to be silently ignoring newly indexed docs.

I've tried the mass-insert command we have in elmo, and it doesn't report any errors, but the number of indexed documents stays constant.

The last doc inserted seems to be the one for 455411; we're currently above 462280.
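
A rough way to confirm the symptom described above (inserts report no errors, but the document count does not move) is to read the count before and after a mass-insert. The sketch below assumes the cluster exposes the standard `_count` endpoint; the index name passed in and the helper names `doc_count` and `indexing_stalled` are hypothetical and not part of elmo.

```python
import json
from urllib.request import urlopen


def doc_count(base_url, index, opener=urlopen):
    """Return the document count reported by GET <base_url>/<index>/_count.

    `index` is whatever index the caller wants to check (an assumption here;
    the real index names are not given in this bug).
    """
    with opener(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]


def indexing_stalled(count_before, count_after, docs_sent):
    """True when documents were sent but the index did not grow."""
    return docs_sent > 0 and count_after <= count_before
```

Calling `doc_count` once before and once after the mass-insert, then passing both values to `indexing_stalled`, gives a yes/no answer that doesn't depend on the insert command reporting its own failures. Updates to existing documents do not change the count, so this only covers new documents.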
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/402]
http://elmo-elasticsearch-zlb.webapp.scl3.mozilla.com:9200/_plugin/head/ is the cluster
Summary: Create Elmo-specific ES cluster → Elmo-specific ES cluster doesn't index new documents
Attached file elmo_es_error.log
Looking in the logs for master, I see an error about handling client HTTP traffic from around an hour before the bug was filed ("java.lang.IllegalArgumentException: empty text"). I've attached the traceback from the logs.

Otherwise, if it is safe for me to trigger the mass-insert command "at will", could you tell me what the command is so I can run it while tailing the Elasticsearch logs?
There isn't anything easy or safe, sadly. The automation is trying to index new docs when they happen now, so basically all the logs for that cluster should be from attempts to index docs that fail.

I'm traveling this week with lots of meetups, so it's hard to carve out some time to investigate what we're sending in detail. Should be easier to do that next week.
@Pike:  Is Elmo ES still silently ignoring new docs indexed?

The last error I can find is similar to the one that I posted in comment #2.  It occurred on February 5th at 07:05:08AM UTC.  Afterwards, I did do a rolling restart of the cluster in order to apply glibc patches (Bug 1130133) and haven't seen anything since.
I was just typing, too. I looked into this again:

Some commands work now.

The mass-insert works; the on-the-fly indexing still doesn't.

Before I go in to investigate the latter, I wonder if there's log research you want to do while I run the mass inserts to see if there's anything good or bad?

Once that's out of the way, I should update the log index completely (I'm still a few thousand docs behind).
I was wondering if we could take a brief outage of the production Elmo ES cluster so I can do a full reboot of ES before we continue to do more testing.


I looked at the logs to see if there were any errors from your mass inserts today. I didn't find any, but I do see some evidence that something is wrong. There is a script that regularly creates a snapshot of the ES indexes and, after completing, deletes older ones. It looks like the snapshots are being created but the script is aborting before the snapshot cleanup begins. Running the script (python), I get an error when the ES index snapshot is being created -- even though there are no other snapshots running and the snapshot I've requested completes. [1]

We're supposed to keep about 7 days of backups, but the oldest snapshot on the server is from January 30th.



[1] elasticsearch.exceptions.TransportError: TransportError(503, u'ConcurrentSnapshotExecutionException[[snapstore:201502181815-elmo-comparisons] a snapshot is already running]')
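
For reference, the cleanup step that the script never reaches could look roughly like the sketch below. It is not the actual script. It assumes snapshot names always start with a `YYYYMMDDHHMM` timestamp followed by a dash, as in the one name visible in the error above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def snapshot_time(name):
    """Parse the leading YYYYMMDDHHMM stamp of a snapshot name.

    Assumes names shaped like '201502181815-elmo-comparisons'.
    """
    stamp, _, _ = name.partition("-")
    return datetime.strptime(stamp, "%Y%m%d%H%M")


def expired_snapshots(names, now, keep_days=7):
    """Return the snapshot names older than the retention window, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(n for n in names if snapshot_time(n) < cutoff)
```

The returned names are the candidates for deletion. Whether cleanup should still run when the create call raises the 503 above, given that the requested snapshot does complete, is a separate decision for whoever owns the script.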
Cylia is dealing with the snapshot stuff.

The indexing problem was actually a code bug in bug 958067; I've landed a bunch of changesets to iterate towards a fix in https://github.com/Pike/locale-inspector/compare/d069fb63a8aded701c6fb54bb2c67454e31f06b1...05d912f49599e01d45350c17d24b749829dc4f18

Marking FIXED, as ES now updates docs as expected.

I wonder if we can have a decent query to delete documents without a run, but that's candy more than anything, I think.
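
One possible shape for such a query, as a sketch only: it assumes "without a run" means documents with no value in a field literally named `run`, which this bug does not confirm. The body uses the `exists` query inside `bool`/`must_not` as found in current Elasticsearch; query syntax differs between Elasticsearch versions, so it may need adapting for the version this cluster runs. The function name is hypothetical.

```python
import json


def docs_missing_field_query(field="run"):
    """Build a query body matching documents that have no value for `field`.

    The default field name 'run' is an assumption, not taken from elmo's mapping.
    """
    return {"query": {"bool": {"must_not": {"exists": {"field": field}}}}}


if __name__ == "__main__":
    # Print the body so it can be reviewed, or tried as a search first,
    # before being used for any deletion.
    print(json.dumps(docs_missing_field_query(), indent=2))
```

Running the same body as a plain search first shows which documents would be affected before anything is deleted.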
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED