Closed
Bug 1150947
Opened 10 years ago
Closed 5 years ago
Optimize ES memory use
Categories
(Webtools Graveyard :: DXR, defect)
Tracking
(firefox40 affected)
RESOLVED
WONTFIX
| | Tracking | Status |
|---|---|---|
| firefox40 | --- | affected |
People
(Reporter: erik, Unassigned)
Details
We use ES in an unusual way, indexing every line of code as an undivided term so we can get adequate performance on our script-filter-based accelerated regex searching. This uses a lot of field data cache, leading to errors like this under the default circuit breaker settings:
ElasticHttpError: (500, u'SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed; shardFailures {[9zil0pkNSVqtiJhbxo1eBA][dxr_hot_11_mozilla-central_72dc7738-d98c-11e4-8081-441ea14ffe94][0]: RemoteTransportException[[node58_scl3][inet[/10.22.78.122:9300]][indices:data/read/search[phase/query+fetch]]]; nested: QueryPhaseExecutionException[[dxr_hot_11_mozilla-central_72dc7738-d98c-11e4-8081-441ea14ffe94][0]: query[filtered(ConstantScore(+QueryWrapperFilter(content.trigrams:"ban ana nan ana")))->cache(_type:line)],from[0],size[100],sort[<custom:"path": org.elasticsearch.index.fielddata.fieldcomparator.BytesRefFieldComparatorSource@46b42c96>,<custom:"number": org.elasticsearch.index.fielddata.fieldcomparator.LongValuesComparatorSource@5edb501d>]: Query Failed [Failed to execute main query]]; nested: ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [number] would be larger than limit of [622775500/593.9mb]]; nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA] Data too large, data for [number] would be larger than limit of [622775500/593.9mb]]; nested: CircuitBreakingException[[FIELDDATA] Data too large, data for [number] would be larger than limit of [622775500/593.9mb]]; }]')
We raised that breaker limit in staging, but it would be nice to explore better solutions. For one thing, we shouldn't need norm values on file paths (in either doctype) or on the "content" field (on lines); dropping them should save roughly a byte per field per doc. Also, I wonder whether sharding more would divide the memory use.
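For concreteness, a rough, untested sketch of the three knobs mentioned above, using plain `requests` against a placeholder endpoint. The breaker value, shard count, index name, and the "file" doctype are illustrative guesses (only "line", "path", "content", and "number" appear in the error above), and the norms syntax assumes the ES 1.x mapping format:

```python
import requests

ES = "http://localhost:9200"  # placeholder; substitute the real cluster endpoint

# Stopgap (what we did in staging): raise the fielddata breaker above its 60% default.
# 75% is only an example value, not a recommendation.
requests.put(ES + "/_cluster/settings",
             json={"transient": {"indices.breaker.fielddata.limit": "75%"}})

# Longer term: build the index with norms off for fields we never relevance-score,
# and with more shards so fielddata is spread across them (changing the shard
# count means reindexing).
index_body = {
    "settings": {"number_of_shards": 5},  # example count, up from whatever we use now
    "mappings": {
        "file": {   # doctype name is a guess
            "properties": {
                "path": {"type": "string", "norms": {"enabled": False}},
            }
        },
        "line": {
            "properties": {
                "path": {"type": "string", "norms": {"enabled": False}},
                "content": {"type": "string", "norms": {"enabled": False}},
                "number": {"type": "long"},
            }
        },
    },
}
requests.put(ES + "/dxr_hot_new", json=index_body)
```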
Reporter
Comment 1•10 years ago
Another option with different tradeoffs would be to use https://github.com/wikimedia/search-extra/blob/master/docs/source_regex.md, which uses Postgres's version of the trigram acceleration (great stuff), is very configurable, appears well supported by the Wikimedia Foundation, and would let me jettison my own custom trigram extractor (but bind me to their own regex dialect).
We'd have to measure the RAM use. It works against either _source or stored fields. If perf is adequate and RAM use is acceptable with either of those, we could quit keeping "content" as an unanalyzed field, and fielddata cache use would go through the floor. I'd also want to have a closer look at the code.
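To make the tradeoff concrete, here is roughly what a search through that plugin might look like, wrapped the same way DXR's existing trigram filter appears in the error above. This is an untested sketch: the parameter names (regex, field, ngram_field, gram_size) reflect my reading of the linked docs and should be checked against the plugin version we'd actually deploy, and the index and field names are placeholders:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

query = {
    "query": {
        "filtered": {
            "filter": {
                "source_regex": {
                    "regex": "banana",                  # user's regex, in the plugin's dialect
                    "field": "content",                 # stored/_source field holding line text
                    "ngram_field": "content.trigrams",  # hypothetical trigram subfield
                    "gram_size": 3,
                }
            }
        }
    },
    "sort": ["path", "number"],
    "size": 100,
}

resp = requests.post(ES + "/dxr_hot_new/line/_search", json=query)
print(resp.json()["hits"]["total"])
```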
Comment 2•5 years ago
DXR is no longer available; Searchfox has replaced it.
See meta bug 1669906 & https://groups.google.com/g/mozilla.dev.platform/c/jDRjrq3l-CY for more details.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Updated•5 years ago
Product: Webtools → Webtools Graveyard