Closed Bug 1382727 Opened 7 years ago Closed 7 years ago

Elasticsearch should remove invalid keys before saving

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

In bug #1380442, one of the problems we're having with migrating data from the ES 1.x cluster to the ES 5.x cluster is that some of the documents contain keys that are empty strings.

This bug covers fixing the Elasticsearch crash storage save methods to either rename empty-string keys to EMPTYSTRING or remove them altogether or some third option.
The case for renaming them is that we can do analytics later to see how often we remove things and maybe alleviate some issues elsewhere in the system that's causing the empty string keys.

The case for removing them altogether is that they're probably not searchable, so they're "dead weight".

I made this block #1322630 because we're going to need to fix this. However, I think we need to wait until we re-land the ES 5.x related changes.
Expanding this to fix/remove all invalid keys:

1. empty strings
2. keys with utf-8 characters in them

Example: https://crash-stats.mozilla.com/report/index/8f317aad-0119-4ec7-b2d5-f95ba2170130#tab-metadata

3. keys that are mappings

Example: https://crash-stats.mozilla.com/report/index/b346db90-70c8-4af0-a950-8145f2170210#tab-metadata

4. keys that start with a period
Summary: Elasticsearch should remove empty-string keys before saving → Elasticsearch should remove invalid keys before saving
After we do the next cutover, but before we do any code changes here, we should send crash ids for problematic crashes to -stage. We can use that script that adds those crash ids to the socorro.submitter queue.

This will let us check to see if the existing code handles these specific indexing errors ok.

If that works out, then we won't need to do anything additional here.
Adrian is now unsure that the error we'll get back from Elasticsearch will trigger this code.

I want to push off thinking about this until we do the cutover again.

Peter's got a list of known-bad crash ids, so when we do the next cutover attempt, we'll:

1. cut over -stage to ES 5.x
2. process a known bad crash in -stage
3. see what happens

If it fails, when we'll write some key cleaning code that removes keys from raw_crash that aren't a-zA-Z0-9_-. That's the only problematic part of the json document.
Attached file crazy-crash-ids.log
This file contains all crash IDs from prod's all indices where the document contains any key that is true as per this function::

  non_ascii_regex = re.compile('[^a-zA-Z0-9-_]')


  def bad_key(key):
      key = key.replace(' ', '').strip()
      return (
          not key or
          key.startswith('.') or
          non_ascii_regex.findall(key)
      )

The way to to read it is "crash ID [TAB] Python repr of the key".
I'm changing the dependency of this but to reflect what we discussed. 
So we'll wait till we land the new ES5 based code. Then we'll take some of these crash IDs (previous comment with attachment) and try to process it and make sure ES 5 doesn't choke on saving it. 

There's also the option that with certain keys ES 5 might actually be "OK" with saving it but *perhaps* we ought to clean up the keys that aren't ascii'ish.
No longer blocks: 1322630
Depends on: 1322630
We don't want this to depend on bug #1322630 getting finished because that would imply that we've cut -prod over, too, and we need to do this *before* we cut -prod over.

Given that, I'm switching this back to blocking bug #1322630. If you want, you can additionally make this depend on the bug for updating the code to support ES 5.x. That's bug #1342081.
Blocks: 1322630
No longer depends on: 1322630
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/c3c839512be20240f98350c6aeaf4a65d7c3bffb
fixes bug 1382727 - remove bad keys from ES documents (#3912)

* fixes bug 1382727 - remove bad keys from ES documents

This adds a pass to remove bad keys from the raw_crash part of the document
before indexing it into Elasticsearch.

Good keys contain one or more ascii alpha-numeric characters plus underscore and
hyphen. Everything else is invalid and gets removed.

* Fix names and add comments
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Figured I'd just nip this in the bud now to save us time later.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: