Closed Bug 1170333 Opened 10 years ago Closed 10 years ago

switch from elasticutils to elasticsearch-dsl

Categories

(Input Graveyard :: Backend, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

(Whiteboard: u=dev c=backend p=5 s=input.2015q3)

Now that we're using Elasticsearch 1.2.4, we can ditch ElasticUtils, which doesn't support any of the new Elasticsearch features and has ended as a project. This bug covers researching the options and switching to something better.
I've been toying with things for a while. I think there are two libraries that are interesting:
1. pyelasticsearch, maintained by Erik Rose (with some help from me at various points), which has a much better API for Elasticsearch things than the elasticsearch-py library. One particularly interesting feature is a very nice bulk indexing API that's fast. It'd be interesting to see if it's fast for Input as well. That would help development and also reindexing the site. A better API likely leads us to fewer development bugs.
2. elasticsearch-dsl, which takes some of the ideas from ElasticUtils, but fleshes them out in nice ways and supports many of Elasticsearch's newer features. This library is still young, but it has a similar API to ElasticUtils, so it'd be helpful for building searches incrementally.
I think either elasticsearch-py/elasticsearch-dsl or pyelasticsearch/elasticsearch-dsl would be a good option. This needs more research.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
There are a couple of issues, but I've got a good first pass done now. Going to go with just elasticsearch-dsl and not use pyelasticsearch at all. It'd be interesting to see if we could make bulk indexing faster, but right now it doesn't help us a lot: it only really helps with development and when we have to wipe-and-reindex the cluster, which doesn't happen very often. Given that, there isn't a compelling reason to add pyelasticsearch to the mix right now. I'll clean things up and post a PR next week.
I did a bunch of cleanup. I still have a couple of outstanding issues:
1. MLT needs to be implemented
2. empty string values aren't working for some reason, where "aren't working" is defined as "aren't searchable and don't show up in facets"
I think item 1 is straightforward and I just need to do it. The crux of the problem for item 2 is eluding me. I think I'll have to switch gears and write a test script to fiddle with it.
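For reference, MLT here is Elasticsearch's more-like-this query. The following is a minimal sketch of the kind of request body such a query takes in the 1.x query DSL, built as a plain dict; it is not the code that landed in Fjord. The helper name `more_like_this_body`, the `description` field, and the threshold values are assumptions chosen for illustration.

```python
import json


def more_like_this_body(text, fields, size=10):
    """Build a search body that finds documents similar to `text`.

    Uses the `like_text` form of the more_like_this query; the
    thresholds below are illustrative, not tuned values.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": text,
                # Low thresholds so short feedback text still matches.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
        "size": size,
    }


# Hypothetical usage: find responses similar to one piece of feedback.
body = more_like_this_body("video playback stutters", ["description"], size=5)
print(json.dumps(body, indent=2))
```

The resulting dict could be passed as the search body to whichever client is in use.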
Summary: switch from elasticutils to pyelasticsearch/elasticsearch-dsl or elasticsearch-dsl by itself → switch from elasticutils to elasticsearch-dsl
Whiteboard: u=dev c=backend p= s=input.2015q2 → u=dev c=backend p=3 s=input.2015q2
I traced item 2 down. Looks like elasticsearch-dsl is dropping falsey values. So the '' never makes it to Elasticsearch. I wrote up an issue for it: https://github.com/elastic/elasticsearch-dsl-py/issues/202
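The failure mode can be reproduced in isolation. This is a minimal sketch, not elasticsearch-dsl's actual code: `serialize_truthy` and `serialize_not_none` are hypothetical helpers that contrast filtering a document on truthiness with filtering only on `None`, and the field names are made up.

```python
def serialize_truthy(doc):
    """Keep only truthy values; this also drops '', 0, False and []."""
    return {key: value for key, value in doc.items() if value}


def serialize_not_none(doc):
    """Drop only missing values (None); falsey values such as '' survive."""
    return {key: value for key, value in doc.items() if value is not None}


# Hypothetical feedback document with an empty-string field.
doc = {"id": 1, "description": "crashes on start", "version": "", "url": None}

dropped = serialize_truthy(doc)
kept = serialize_not_none(doc)

# Report whether the empty-string field survived each serializer.
print("version" in dropped, "version" in kept)
```

With the truthiness check, the empty string is removed before the document is sent, so there is nothing for Elasticsearch to index, search, or facet on.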
I implemented the MLT code and then redid indexing to speed it up (the new code creates fewer objects and has fewer critical loops) and also to get around the problem where elasticsearch-dsl drops falsey values like the empty string. In a PR: https://github.com/mozilla/fjord/pull/618
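One way to index with few intermediate objects while preserving empty strings is to build the bulk request body directly from plain dicts. The sketch below illustrates that idea under stated assumptions; it is not the code in the PR. The helpers `chunked` and `bulk_body`, the index name `feedback`, the type name `response`, and the document fields are all hypothetical.

```python
import json


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def bulk_body(docs, index, doc_type):
    """Build a newline-delimited bulk request body from plain dicts.

    Each document becomes an action line followed by a source line.
    Values are serialized as-is, so empty strings are kept.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents; note the empty-string description.
docs = [{"id": i, "description": "", "happy": bool(i % 2)} for i in range(5)]
bodies = [bulk_body(chunk, "feedback", "response") for chunk in chunked(docs, 2)]
```

Each entry in `bodies` is one bulk request; the chunk size trades request count against request size.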
I did the work in 2015q3 and this is like a 5 point project, so fixing the metadata as such.
Whiteboard: u=dev c=backend p=3 s=input.2015q2 → u=dev c=backend p=5 s=input.2015q3
Landed in https://github.com/mozilla/fjord/commit/7a1d434b23c40941584348514fe41706a2cf9e10
Next step is to push to -stage and test all the parts:
1. front page dashboard search, filters, date_start, date_end, query
2. response view, mlt
3. feedback GET API
4. top-sekret feedback aggregations GET API
5. submitting feedback and having it show up on the dashboard
6. analyzer search, filters, date_start, date_end, query
I think that's everything.
Pushed it to stage. Hit a problem with indexing from the admin. Fixed it in: https://github.com/mozilla/fjord/commit/d9f247fcb04765688c6b594483b11fd38a1cb0e8
Tested it on stage using the smoketest suite plus some manual testing:
1. PASS: front page dashboard search, filters, date_start, date_end, query
2. PASS: response view, mlt
3. PASS: feedback GET API
4. PASS: top-sekret feedback aggregations GET API
5. PASS: submitting feedback and having it show up on the dashboard
6. PASS: analyzer search, filters, date_start, date_end, query
Everything looks great. Pushed to prod just now. Marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Input → Input Graveyard