Open Bug 1276967 Opened 8 years ago Updated 2 years ago

ElasticSearch autoclassification is too slow

Categories

(Tree Management :: Treeherder, defect, P3)


Tracking

(Not tracked)

People

(Reporter: jgraham, Unassigned)

Details

On Heroku, autoclassification has become very slow and the queues are hugely backed up. This is primarily due to the time taken by Elasticsearch: whilst looking up an autoclassification in the database takes on the order of 10ms, looking one up in ES takes on the order of 100ms (spiking over 1000ms when queries coincide with GC pauses), so we spend about 10x as long in ES as in the database. This is more or less a worst-case scenario; on a tree that was actively sheriffed we would find a match in the db far more often, and the ES backend wouldn't be hit at all in those cases. But stage will always be in the Heroku situation, so this has to work well.

I have several ideas about how to improve the situation, but I'm not sure exactly what will be effective:

1) Run multiple queries in parallel. AFAICT there is no ES API for this, but I can at least make the http requests in parallel to cut a little overhead there (see the first sketch after this list).

2) Consider adding more shards and routing by test name. This will add overhead since there is one Lucene index per shard, but it should cut the amount of data that has to be searched for each query (but see below). There's a rough sketch of the routing idea after this list.

3) Store less data, or increase the available RAM capacity, or figure out some other way to prevent long (1s) GC pauses. I need to read up a bit to understand if our usage patterns are making this worse and if we could batch more changes or similar to help here.

4) Investigate if the custom tokenizer we use is particularly slow.

5) ? I need more ideas. ekyle: you have far more experience running ES than any of us. Are there optimisations we should be looking at to improve the performance here? I can provide you with more background about the type of data being stored out of band.
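
For (1), a rough sketch of the kind of parallelism I have in mind, using elasticsearch-py with a thread pool; the client URL, index name, and `build_query` helper here are placeholders rather than the real Treeherder code:

```
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder URL

def build_query(failure_line):
    # Hypothetical helper: a stand-in for whatever query body we
    # actually build for a single failure line.
    return {"query": {"term": {"test": failure_line["test"]}}}

def search_one(failure_line):
    return es.search(index="failure-lines", body=build_query(failure_line))

def search_parallel(failure_lines, max_workers=4):
    # Fire the per-line searches concurrently; each request still pays
    # the full ES latency, but the wall-clock time for a batch shrinks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_one, failure_lines))
```

And for (2), roughly what routing by test name would look like, again with made-up index and field names:

```
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder URL

doc = {"test": "some/test/path.html", "message": "intermittent timeout"}

# Route by test name at index time so all lines for one test land in
# the same shard...
es.index(index="failure-lines", doc_type="failure_line", body=doc,
         routing=doc["test"])

# ...and pass the same routing value at query time so only that shard
# is searched instead of all of them.
es.search(index="failure-lines", routing=doc["test"],
          body={"query": {"term": {"test": doc["test"]}}})
```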
:jgraham,

1) Yes, run your queries in parallel.
2) Increasing the number of shards must correspond to an increase in machines (or machine cores); otherwise search speed will be slower.
3) The amount of data should not matter; ES is columnar, so any data that is not needed for the search does not consume much memory or resources. However, if you are retrieving large JSON documents, network transmission time will be your bottleneck: turn on HTTP compression.
4) The tokenizer will only impact your bulk indexing rate, not your search speed. That said, good tokenization for the given search task is important so you can use the fastest `term` filters (see the sketch below).
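
For example, a rough sketch with made-up index and field names, using elasticsearch_dsl as Treeherder already does: if the field you match on is `not_analyzed`, an exact lookup can be a cheap, cacheable `term` filter instead of an analyzed `match` query.

```
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(["http://localhost:9200"])  # placeholder URL

# Exact lookup against a not_analyzed field: a term filter runs in
# filter context, skips scoring, and can be cached.
s = Search(using=es, index="failure-lines").filter(
    "term", test="some/test/path.html")

# A full-text match, by contrast, analyzes the input and scores every
# candidate document, which is more work for an exact lookup.
# s = Search(using=es, index="failure-lines").query(
#     "match", message="intermittent timeout")

response = s.execute()
```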


Please:

a) Could you add a link to the ES query being sent (or to the code that composes it)?
b) What version of ES are you running?
c) Send me a link to the ES config file.
d) What does the schema look like? e.g. `curl http://localhost:9200/_mapping/twitter`
e) What are the CPU/memory stats of the machine running ES?

Long GC pauses usually indicate a lack of memory. Using `regexp` expressions will do that to you. For versions of ES < 2.0, you can save some memory by setting `{"doc_values": true}` on your properties.
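
For example, roughly (index, type, and field names made up):

```
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder URL

# In ES < 2.0 doc_values must be requested explicitly; it keeps field
# data on disk instead of the Java heap, which helps avoid long GC pauses.
es.indices.put_mapping(
    index="failure-lines",
    doc_type="failure_line",
    body={
        "failure_line": {
            "properties": {
                "test": {"type": "string", "index": "not_analyzed",
                         "doc_values": True}
            }
        }
    },
)
```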

Thanks!
ekyle: Thanks, this is super-helpful.

The code that generates the ES queries is at [1]. It uses the elasticsearch_dsl library, but hopefully it's not too hard to grasp what's going on from the high-level code.

Elasticsearch is version 2.3.2. It's a hosted instance running on Heroku [2] (dachs plan). It is not entirely clear to me how many nodes you get.

I think that implicitly covers most of your points. I can follow up with detailed schema if it's not clear from the code.

[1] https://github.com/mozilla/treeherder/blob/94ebeca251089ad374242f68be52f045e101d99e/treeherder/model/search.py
[2] https://elements.heroku.com/addons/foundelasticsearch
jgraham:

The 1 GB memory limit is probably your problem: ES is a memory hog. You could try 4 GB.

Looking at the code, I am not familiar with a "phrase" query, so I cannot say whether it could be causing additional slowness. Every other part of the query looks fine, but I am concerned the `elasticsearch_dsl` query builder may be generating a pathologically complicated query. Could you capture the body of the HTTP POST request from an example search?
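
If it helps, since the query is built with elasticsearch_dsl, I believe you can dump the exact body before it is sent; the names below are illustrative, not your actual query:

```
import json

from elasticsearch_dsl import Search

# Build (or grab) the Search object the same way the matcher does,
# then print the JSON body that would be POSTed to ES.
s = Search(index="failure-lines").filter("term", test="some/test/path.html")
print(json.dumps(s.to_dict(), indent=2))
```

I believe elasticsearch-py's `elasticsearch.trace` logger can also log each request as a reproducible curl command, if that is easier.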

You can also consider using the ActiveData ES cluster directly, which has multiple machines with 30 GB each. These machines trade low latency for low cost, so they may not give you the 10ms response time you are looking for, but we could try.

Thanks!
Component: Treeherder → Treeherder: Log Parsing & Classification
Priority: -- → P3
Component: Treeherder: Log Parsing & Classification → TreeHerder