Closed
Bug 1382727
Opened 7 years ago
Closed 7 years ago
Elasticsearch should remove invalid keys before saving
Categories
(Socorro :: General, task)
Socorro
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
In bug #1380442, one of the problems we're having with migrating data from the ES 1.x cluster to the ES 5.x cluster is that some of the documents contain keys that are empty strings. This bug covers fixing the Elasticsearch crash storage save methods to either rename empty-string keys to EMPTYSTRING or remove them altogether or some third option.
Assignee | ||
Comment 1•7 years ago
|
||
The case for renaming them is that we can do analytics later to see how often we remove things and maybe alleviate some issues elsewhere in the system that's causing the empty string keys. The case for removing them altogether is that they're probably not searchable, so they're "dead weight". I made this block #1322630 because we're going to need to fix this. However, I think we need to wait until we re-land the ES 5.x related changes.
Assignee | ||
Comment 2•7 years ago
|
||
Expanding this to fix/remove all invalid keys: 1. empty strings 2. keys with utf-8 characters in them Example: https://crash-stats.mozilla.com/report/index/8f317aad-0119-4ec7-b2d5-f95ba2170130#tab-metadata 3. keys that are mappings Example: https://crash-stats.mozilla.com/report/index/b346db90-70c8-4af0-a950-8145f2170210#tab-metadata 4. keys that start with a period
Summary: Elasticsearch should remove empty-string keys before saving → Elasticsearch should remove invalid keys before saving
Comment 3•7 years ago
|
||
Opened https://crash-stats.mozilla.com/report/index/8f317aad-0119-4ec7-b2d5-f95ba2170130#tab-metadata and signed in so no filtering of keys on the Metadata tab.
Assignee | ||
Comment 4•7 years ago
|
||
After we do the next cutover, but before we do any code changes here, we should send crash ids for problematic crashes to -stage. We can use that script that adds those crash ids to the socorro.submitter queue. This will let us check to see if the existing code handles these specific indexing errors ok. If that works out, then we won't need to do anything additional here.
Assignee | ||
Comment 5•7 years ago
|
||
Adrian is now unsure that the error we'll get back from Elasticsearch will trigger this code. I want to push off thinking about this until we do the cutover again. Peter's got a list of known-bad crash ids, so when we do the next cutover attempt, we'll: 1. cut over -stage to ES 5.x 2. process a known bad crash in -stage 3. see what happens If it fails, when we'll write some key cleaning code that removes keys from raw_crash that aren't a-zA-Z0-9_-. That's the only problematic part of the json document.
Comment 6•7 years ago
|
||
This file contains all crash IDs from prod's all indices where the document contains any key that is true as per this function:: non_ascii_regex = re.compile('[^a-zA-Z0-9-_]') def bad_key(key): key = key.replace(' ', '').strip() return ( not key or key.startswith('.') or non_ascii_regex.findall(key) ) The way to to read it is "crash ID [TAB] Python repr of the key".
Comment 7•7 years ago
|
||
I'm changing the dependency of this but to reflect what we discussed. So we'll wait till we land the new ES5 based code. Then we'll take some of these crash IDs (previous comment with attachment) and try to process it and make sure ES 5 doesn't choke on saving it. There's also the option that with certain keys ES 5 might actually be "OK" with saving it but *perhaps* we ought to clean up the keys that aren't ascii'ish.
Assignee | ||
Comment 8•7 years ago
|
||
We don't want this to depend on bug #1322630 getting finished because that would imply that we've cut -prod over, too, and we need to do this *before* we cut -prod over. Given that, I'm switching this back to blocking bug #1322630. If you want, you can additionally make this depend on the bug for updating the code to support ES 5.x. That's bug #1342081.
Assignee | ||
Comment 9•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Comment 10•7 years ago
|
||
Commit pushed to master at https://github.com/mozilla-services/socorro https://github.com/mozilla-services/socorro/commit/c3c839512be20240f98350c6aeaf4a65d7c3bffb fixes bug 1382727 - remove bad keys from ES documents (#3912) * fixes bug 1382727 - remove bad keys from ES documents This adds a pass to remove bad keys from the raw_crash part of the document before indexing it into Elasticsearch. Good keys contain one or more ascii alpha-numeric characters plus underscore and hyphen. Everything else is invalid and gets removed. * Fix names and add comments
Updated•7 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 11•7 years ago
|
||
Figured I'd just nip this in the bud now to save us time later.
You need to log in
before you can comment on or make changes to this bug.
Description
•