stop indexing items from raw_crash
Categories
(Socorro :: Processor, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Attachments
(2 files)
The super_search_fields.py file specifies which data to index from the raw and processed crash. The raw crash contains the original crash annotation values. The processed crash contains normalized values, inferred values, and calculated values.
One of the problems we have on a regular-enough-to-frustrate-me basis is discovering we need to normalize a value that's being indexed that comes from the raw crash.
Because of the way the Elasticsearch crash storage thing indexes things, the items from the raw crash are put in a raw_crash namespace. So there's no way to start with indexing a value from the raw crash, discover you need to normalize it, and then index it from the processed crash. What a pain.
I want to switch to indexing only things from the processed crash. In order to not make users sad, we need to migrate from where we're at now to where we want to be. This bug covers that work.
Comment 1 (Assignee) • 3 years ago
I think what we want to do is something like:
- add processor rules to copy data from the raw crash to the processed crash for all items we're indexing from the raw crash
- add items to super_search_fields.py to index these fields in the processed_crash namespace, but don't expose them as search fields
- wait X months (at most 6, but maybe we can get away with 4?)
- remove the raw crash versions from super_search_fields.py and switch the search fields to search the processed crash versions
This idea needs testing. Also, the sooner we do this, the better.
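For illustration, the first step in that list (a processor rule that copies values over) might look roughly like this. It's a minimal sketch, not Socorro's actual rule API: the function name, the FIELDS_TO_COPY mapping, and the choice to leave already-set processed crash values alone are all assumptions.

```python
# Hypothetical sketch of a "copy from raw crash" rule. The names here
# (FIELDS_TO_COPY, copy_from_raw_crash) are illustrative, not Socorro's API.
FIELDS_TO_COPY = {
    # raw crash key -> processed crash key (assumed to share names here)
    "collector_notes": "collector_notes",
    "submitted_timestamp": "submitted_timestamp",
}


def copy_from_raw_crash(raw_crash, processed_crash, fields=FIELDS_TO_COPY):
    """Copy selected values from the raw crash into the processed crash."""
    for raw_key, processed_key in fields.items():
        if raw_key in raw_crash:
            # setdefault keeps a value that an earlier rule already set
            # (for example, a normalized one) instead of overwriting it.
            processed_crash.setdefault(processed_key, raw_crash[raw_key])
    return processed_crash
```

Once a value lives in the processed crash, a later rule can normalize it without changing where it's indexed from.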
Comment 2 (Assignee) • 3 years ago
I had a better plan: we enhance super search fields definitions to allow us to do migrations by specifying source, destination, and search keys. This would solve a long-standing problem where it's currently hard to seamlessly migrate data.
While working on that new plan, I decided to rewrite indexing. Currently, the code:
- takes a raw crash and a processed crash
- copies them (iterate over entire raw and processed crash)
- removes fields that either aren't in the current schema or aren't in the schema for the index the document is going into (iterate over allowed keys and copy raw and processed crash)
- fixes the data values (iterate over super search fields multiple times: once each for strings, keywords, integers, longs, and booleans)
- builds a document to index
The new version builds the document by:
- traverse search fields (iterate over super search fields)
- if it's not an allowed key, continue to the next field (allowed keys is a set, so this check is O(1))
- extract data to index from source key
- fix data depending on data type
- populate destination keys in new document
This reduces the number of passes we do through things, and the use of source, destination, and search keys allows us to migrate data from one place to another in the indexed document without affecting search.
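The single-pass version described above could be sketched like this. It's a rough illustration under stated assumptions, not the code that landed: the field definition keys (source_key, destination_keys, search_key, type), the "uptime" field, and the helper names are all hypothetical.

```python
# Hypothetical field definitions. The key names and the "uptime" field are
# assumptions for illustration, not the real super_search_fields.py contents.
FIELDS = {
    "submitted_timestamp": {
        "source_key": "processed_crash.submitted_timestamp",
        # Mid-migration: write to both the old and new locations.
        "destination_keys": [
            "raw_crash.submitted_timestamp",
            "processed_crash.submitted_timestamp",
        ],
        # Search keeps using the old location until old indexes expire.
        # Not used while indexing; shown only to complete the picture.
        "search_key": "raw_crash.submitted_timestamp",
        "type": "string",
    },
    "uptime": {
        "source_key": "processed_crash.uptime",
        "destination_keys": ["processed_crash.uptime"],
        "search_key": "processed_crash.uptime",
        "type": "integer",
    },
}


def get_path(data, dotted_key):
    """Return the value at a dotted path, or None if any part is missing."""
    for part in dotted_key.split("."):
        if not isinstance(data, dict) or part not in data:
            return None
        data = data[part]
    return data


def set_path(data, dotted_key, value):
    """Set the value at a dotted path, creating nested dicts as needed."""
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        data = data.setdefault(part, {})
    data[parts[-1]] = value


def fix_value(value, field_type):
    """Coerce a value to the field's type; return None if it can't be fixed."""
    if field_type in ("integer", "long"):
        try:
            return int(value)
        except (TypeError, ValueError):
            return None
    if field_type == "boolean":
        return str(value).lower() in ("1", "true")
    return str(value)


def build_document(raw_crash, processed_crash, fields=FIELDS, allowed_keys=None):
    """Build the document to index in one pass over the field definitions."""
    crash = {"raw_crash": raw_crash, "processed_crash": processed_crash}
    allowed = set(fields) if allowed_keys is None else allowed_keys
    document = {}
    for name, field in fields.items():
        if name not in allowed:  # set membership, so O(1)
            continue
        value = get_path(crash, field["source_key"])
        if value is None:
            continue
        value = fix_value(value, field["type"])
        if value is None:
            continue
        for destination in field["destination_keys"]:
            set_path(document, destination, value)
    return document
```

With definitions like these, moving a field means changing its destination_keys first and its search_key later, and the data being searched never has to move in the same step.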
Comment 3 (Assignee) • 3 years ago
Comment 4 (Assignee) • 3 years ago
willkg merged PR #6008: "bug 1753521: refactor indexing and start migrating some raw crash fields" in 118eba6.
This needs to hang out on stage until a new index is created, and we should make sure querying works across indexes with the old and new mappings. Having said that, I'm feeling pretty good about this.
There's some follow-up work that needs to happen, but it doesn't need to happen all at once--it can happen in stages.
Just landing this stage alone gives us the ability to migrate data around in the index over time which is a huge win.
Comment 5 (Assignee) • 3 years ago
I tested searching on stage yesterday and everything looked fine as far as I could tell.
I deployed this just now in bug #1756845.
There's still a bunch more fields to migrate, so leaving this open for now.
Comment 6 (Assignee) • 3 years ago
Bug #1755528 covered fixing flag/boolean fields.
After that, we have two more:
- collector_notes
- submitted_timestamp
I'll do those next.
Comment 7 (Assignee) • 3 years ago
Comment 8 (Assignee) • 3 years ago
willkg merged PR #6041: "bug 1753521: fix collector_notes" in e40d76a.
That covers everything. We still have the cleanup step from the migration, which we can do in August 2022 or thereabouts. Otherwise, we're done here.
Comment 9 (Assignee) • 3 years ago
Everything so far went to production in bug #1763234 just now. Followup work will be done in bug #1763264. Keeping this open to verify everything is working on Monday after we've created a new Elasticsearch index.
Comment 10 (Assignee) • 2 years ago
I looked at some of the affected fields and they are getting copied to the processed crash and indexed from there. Marking as FIXED.