Closed Bug 1296621 Opened 5 years ago Closed 5 years ago

Send unmutated crash report to Telemetry

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: peterbe)

References

Details

Attachments

(1 file)

At the moment we have the following poly crash storage configuration (at the time of writing, only on stage):

1) socorro.external.postgresql.crashstorage.PostgreSQLCrashStorage
2) socorro.external.boto.crashstorage.BotoS3CrashStorage
3) socorro.external.es.crashstorage.ESCrashStorageRedactedJsonDump
4) socorro.external.statsd.crashstorage.StatsdCrashStorage
5) socorro.external.boto.crashstorage.TelemetryBotoS3CrashStorage

The problem is that the 3rd one (ESCrashStorageRedactedJsonDump) mutates the [processed] crash for the sake of Elasticsearch. So by the time it reaches the 5th one (TelemetryBotoS3CrashStorage), the crash is different. In particular, the json_dump content has been heavily redacted or removed.

We need to make sure TelemetryBotoS3CrashStorage gets the processed crash without it being mutated by the ESCrashStorageRedactedJsonDump crash storage.
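To make the ordering hazard concrete, here is a minimal, self-contained sketch (toy classes standing in for the real socorro storage classes, not their actual API) of how one store mutating the shared processed crash breaks a later store:

    class RedactingStore:
        """Stands in for ESCrashStorageRedactedJsonDump."""
        def save_processed_crash(self, crash):
            crash.pop('json_dump', None)  # mutates the shared dict in place

    class TelemetryStore:
        """Stands in for TelemetryBotoS3CrashStorage."""
        def save_processed_crash(self, crash):
            # Expects the full dump, but a store earlier in the chain
            # already stripped it from the very same dict object.
            assert 'json_dump' in crash, 'json_dump was redacted upstream!'

    crash = {'uuid': 'abc123', 'json_dump': {'threads': []}}
    for store in (RedactingStore(), TelemetryStore()):
        store.save_processed_crash(crash)  # AssertionError in TelemetryStore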
Adrian, 
What does it mean that StatsdCrashStorage is BEFORE TelemetryBotoS3CrashStorage? Does that mean it's not sending metrics to Datadog about what TelemetryBotoS3CrashStorage does?

It wouldn't be hard to prepare a bash script that defines ALL the new `consulate kv set` commands. But if we run that (e.g. `bash rearrange-storagekeys.sh`) it'd have to contain 10 `consulate kv set` commands and thus restart things 10 times. Kinda scary to do on prod.
Blocks: 1295559
The more I think about it, I really don't like the idea of being a "slave" to the numbering system.
I'd much rather attack the mutation problem. That feels more robust. We can eat the cost of memory with $$$ if we have to. The scary thing is other mutation bugs.

One possible solution is to use https://pypi.python.org/pypi/frozendict but it's quite annoying:
1. You can't do things like `json.dumps(myfrozendict)`. You have to do `json.dumps(dict(myfrozendict))`, which creates another copy in memory.
2. You can still delete keys from one of these frozendict objects.
3. You can still effectively mutate it. `myfrozendict['key'] = 'other'` is blocked, but nested dicts within can be mutated: `myfrozendict['inner']['key'] = 'different'`.
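A rough illustration of points 1 and 3, assuming the pypi frozendict package (behavior varies across versions — recent releases subclass dict and serialize directly, and point 2 about key deletion also seems version-dependent):

    import json
    from frozendict import frozendict

    fd = frozendict({'inner': {'key': 'value'}})

    # 1) Possibly not directly JSON-serializable; the workaround copies the data.
    try:
        json.dumps(fd)
    except TypeError:
        json.dumps(dict(fd))  # works, but builds a second dict in memory

    # 3) Top-level assignment is blocked...
    try:
        fd['key'] = 'other'
    except TypeError:
        pass
    # ...but the freeze is shallow: nested plain dicts stay mutable.
    fd['inner']['key'] = 'different'  # silently succeeds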
I think we have two good choices:

1) Re-write PolyCrashStorage so it does something like this:

    for wrapped_crash_storage_class in self.config.wrapped_crash_storage_classes:
        wrapped_crash_storage_class().save_processed_crash(copy.deepcopy(crash))

2) Just re-write ESCrashStorageRedactedJsonDump so it doesn't mutate the shared crash. E.g.:

    def save_processed_crash(self, crash):  # in ESCrashStorageRedactedJsonDump
        crash = copy.deepcopy(crash)
        ...proceed as normal...
Does the "crash" contain anything that can't effectively be copied? E.g., file pointers don't copy well.
I believe we pass by reference to avoid the expense of the deep copy. We haven't needed it until now.
I think that means we want to do some form of number 2.

Maybe we should take this opportunity to spend some research points on developing a copy-on-write DotDict? Something like the sketch below.
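Purely as a thought experiment, a copy-on-write wrapper might look something like this (a plain MutableMapping here; socorro's actual DotDict with attribute access is not reproduced). Note the catch flagged in the comments: reads still hand out shared nested objects, which is part of why the per-store deepcopy decided on below is the safer bet.

    import copy
    from collections.abc import MutableMapping

    class CopyOnWriteDict(MutableMapping):
        """Delegates reads to a shared dict; the first write takes a
        private deep copy so the original is never mutated."""

        def __init__(self, shared):
            self._data = shared
            self._owned = False

        def _own(self):
            if not self._owned:
                self._data = copy.deepcopy(self._data)
                self._owned = True

        def __getitem__(self, key):
            # Caveat: nested dicts returned here are still the shared
            # objects, so callers can mutate them behind our back.
            return self._data[key]

        def __setitem__(self, key, value):
            self._own()
            self._data[key] = value

        def __delitem__(self, key):
            self._own()
            del self._data[key]

        def __iter__(self):
            return iter(self._data)

        def __len__(self):
            return len(self._data)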
Adrian and I talked about it and decided that the best way to proceed is to run copy.deepcopy() on each iteration of the loop in the PolyCrashStorage class.
Assignee: nobody → peterbe
We haven't needed to do that until now because we relied on the order of the crash stores in the PolyCrashStorage config. The Elasticsearch one is destructive, and thus was the "last" one on the list. But if you don't know that, or forget about it, you end up with painful bugs. I am also afraid that some of the other crash stores may be changing the processed crash in ways I don't know about. Anything can happen in these classes and we might never really notice it.

I am strongly in favor of using a deepcopy for every sub-store. We have plenty of available memory on the processors at the moment (out of 64 GB, only 4 or 5 were in use last time Peter and I checked).
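In outline, the agreed fix looks something like this (a simplified sketch; the constructor and method names are assumptions, not the exact socorro API):

    import copy

    class PolyCrashStorage:
        def __init__(self, substores):
            self.substores = substores

        def save_processed_crash(self, processed_crash):
            for store in self.substores:
                # Every sub-store gets its own deep copy, so no store can
                # mutate what a later store sees, whatever the ordering.
                store.save_processed_crash(copy.deepcopy(processed_crash))

Copying inside the ES store alone (option 2 above) would also fix this particular bug, but copying per sub-store guards against every mutating store, known and unknown.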
Blocks: 1304106
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1e180913c464bdfc77602fc26a9cb8bea0c47712
Bug 1296621 let polycrashstorage process cloned objects each time (#3459)

* fixes bug 1296621 - let PolyCrashStorage process cloned objects each time

* cleanup

* with tests

* pep8 fixup
This has gone into stage now. 
I randomly downloaded one of the crashes sent to the telemetry bucket and it had the full json_dump in there.
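That spot check can be scripted roughly like this (the bucket name and key layout here are made up for illustration; the real telemetry bucket uses its own naming scheme):

    import json
    import boto3

    s3 = boto3.client('s3')
    obj = s3.get_object(
        Bucket='example-telemetry-crashes',  # hypothetical bucket name
        Key='v1/crash_report/abc123',        # hypothetical key layout
    )
    crash = json.loads(obj['Body'].read())
    assert 'json_dump' in crash  # the full dump survived the pipeline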
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Blocks: 1311522