Closed Bug 907499 Opened 11 years ago Closed 9 years ago

Checksum minidumps

Categories

(Socorro :: Backend, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: selenamarie)

Details

(Whiteboard: [StabilityWeek2013])

Let's create a checksum for the minidump. I'll leave the details for everyone else to fill in.
This is for de-duplication.
The question is when to compute the checksum. I suspect that the collector will be the best candidate to do the calculation.

The collector has the dump in memory for a moment before it sends it out to the file system, so that would be a good opportunity, though burdening the collectors with that might not be a good idea. The checksum doesn't have to be cryptographically strong, so we could choose an algorithm that isn't too CPU-intensive. Once the collector calculates the checksum, it can save it directly to the raw_crash JSON.
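A minimal sketch of the collector-side idea, assuming the dump is available as bytes and the raw crash is a plain dict. The helper name `add_dump_checksum` and the `dump_checksum` key are hypothetical, not names taken from Socorro, and MD5 is used only as a placeholder algorithm.

```python
import hashlib
import json


def add_dump_checksum(raw_crash, dump):
    """Store a hex digest of the dump bytes in the raw crash dict.

    Hypothetical helper: the "dump_checksum" key is an assumption.
    """
    raw_crash["dump_checksum"] = hashlib.md5(dump).hexdigest()
    return raw_crash


# The dump bytes here are fake; a real minidump would come from the submission.
raw_crash = add_dump_checksum({"ProductName": "Firefox"}, b"MDMP fake dump bytes")
print(json.dumps(raw_crash, sort_keys=True))
```

Because the digest is an ordinary string, it serializes with the rest of the raw crash without any extra handling.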

The next opportunity to create the checksum would be the crash mover.  That seems to be out-of-band for that application, though.  The mover has one task and adding something unrelated seems wrong.

The processor doesn't really have the opportunity to do the checksum. It doesn't actually load the dumps; it just tells the HBase crash storage to write the dump as a file and then points MDSW at it. The code that writes the dump to /tmp, of course, has the dump in memory as it writes it out. However, that code is deep in the load/save path of the HBase code. Adding the checksum there would be a "sidecar" addition to the HBase code, and tying the checksum to a crash storage resource just seems wrong.

When a crash has more than one dump, would we want to checksum them all?

What would be a good checksum algorithm to use? MD5? Something simpler?
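To make the two open questions concrete, here is a rough sketch that checksums every dump in a crash and lets the algorithm be swapped. It assumes the dumps arrive as a mapping of dump name to bytes; the function name `checksum_dumps` is hypothetical. CRC-32 from `zlib` stands in for "something simpler", and any name accepted by `hashlib.new` (such as `md5`) covers the heavier option.

```python
import hashlib
import zlib


def checksum_dumps(dumps, algorithm="md5"):
    """Return {dump_name: checksum string} for every dump in the mapping."""
    results = {}
    for name, data in dumps.items():
        if algorithm == "crc32":
            # CRC-32 is cheap but only 32 bits wide, so collisions are more likely.
            results[name] = format(zlib.crc32(data) & 0xFFFFFFFF, "08x")
        else:
            # Any algorithm name known to hashlib, e.g. "md5" or "sha1".
            results[name] = hashlib.new(algorithm, data).hexdigest()
    return results
```

Whether a crash counts as a duplicate when only some of its dumps match is a policy decision this sketch leaves open.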
selena and I have a new proposal: some external key-value store wrapped with the CrashStorage interface, implementing the save_raw_crash method. It is used as a destination for the crashmover, as the condition in a FilteringCrashStorage container. The ChecksumCrashStore will calculate the checksum and save it to its key-value store iff that checksum doesn't already exist there. If the checksum already exists, save_raw_crash fails, and the FilteringCrashStorage container declines to save the crash to its destination crash storage (HBase and RabbitMQ). If the checksum did not already exist in the key-value store, save_raw_crash succeeds and the FilteringCrashStorage container proceeds to save to HBase and RabbitMQ.
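The proposal can be sketched roughly as follows. Everything here is an illustration under assumptions: the classes are standalone stand-ins rather than Socorro's real CrashStorage classes, the `save_raw_crash(raw_crash, dumps, crash_id)` signature is assumed, a dict stands in for the external key-value store, `ListCrashStore` stands in for the HBase and RabbitMQ destination, and only the dump named `upload_file_minidump` is checksummed.

```python
import hashlib


class DuplicateDump(Exception):
    """Raised when a dump's checksum is already in the key-value store."""


class ChecksumCrashStore:
    def __init__(self, kv_store=None):
        # A plain dict stands in for the external key-value store.
        self.kv_store = {} if kv_store is None else kv_store

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        checksum = hashlib.md5(dumps["upload_file_minidump"]).hexdigest()
        if checksum in self.kv_store:
            # Failing the save is the signal that this dump is a duplicate.
            raise DuplicateDump(checksum)
        self.kv_store[checksum] = crash_id


class ListCrashStore:
    """Stand-in destination that just records the crash ids it receives."""

    def __init__(self):
        self.saved = []

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        self.saved.append(crash_id)


class FilteringCrashStorage:
    def __init__(self, condition, destination):
        self.condition = condition
        self.destination = destination

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        try:
            self.condition.save_raw_crash(raw_crash, dumps, crash_id)
        except DuplicateDump:
            return False  # duplicate: decline to save to the destination
        self.destination.save_raw_crash(raw_crash, dumps, crash_id)
        return True


destination = ListCrashStore()
storage = FilteringCrashStorage(ChecksumCrashStore(), destination)
first = storage.save_raw_crash({}, {"upload_file_minidump": b"abc"}, "id-1")
second = storage.save_raw_crash({}, {"upload_file_minidump": b"abc"}, "id-2")
print(first, second, destination.saved)
```

The check-then-set on a dict is not atomic. With several crashmovers writing concurrently, a real key-value store would need an atomic set-if-absent operation to avoid two identical dumps both passing the filter.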
Whiteboard: [StabilityWeek2013]
Assignee: nobody → sdeckelmann
Hey lars,

Do you have a branch somewhere I could take a look at? Or would you like to open a PR for discussion about this? I'd love to deprecate our duplicates mechanism in Postgres.
Flags: needinfo?(lars)
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/34af8ab0bd649cd3ab26555f3281a9b25148164b
Merge pull request #2612 from twobraids/checksum_collector

Fixes Bug 907499 - added checksum to collector
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(lars)