Closed Bug 472358 Opened 16 years ago Closed 15 years ago

support duplicate crash report counting without keeping GUID data

Categories

(Socorro :: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: beltzner, Unassigned)

Details

(Whiteboard: cloud)

Bug 444351 is rightly looking to reduce personally identifiable information by not sending client GUIDs with crash reports.

A common request for data analysis, though, is to understand how many instances of a crash came from a single user. I think there's a way to do this by adding a "duplicateCount" variable to each crash report instance. When the client sends a crash report, it should also send the IDs of the last N crash reports that were sent from that client. Socorro can then compare the reports, and if the new report is deemed to be a duplicate of an old report, then:

 - ID of new report is set to redirect to ID of older report
 - duplicateCount on older report is incremented
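A minimal sketch of what that server-side check could look like (the report structure, field names, and in-memory stores here are illustrative assumptions, not Socorro's actual schema):

    # Illustrative sketch only: dict-based stand-ins for the crash store and
    # the ID-redirect table; a "signature" match is one possible duplicate test.
    def deduplicate(new_report, recent_ids, reports, redirects):
        """File the new report, returning the ID it ends up stored under."""
        for old_id in recent_ids:
            old = reports.get(old_id)
            if old is not None and old["signature"] == new_report["signature"]:
                # Duplicate: redirect the new ID to the older report, bump its
                # counter, and drop the new report itself.
                redirects[new_report["id"]] = old_id
                old["duplicateCount"] = old.get("duplicateCount", 0) + 1
                return old_id
        # No match among the last N reports: store it as a normal report.
        new_report["duplicateCount"] = 0
        reports[new_report["id"]] = new_report
        return new_report["id"]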
Oh, forgot to mention, once the analysis is done, the set of crash report IDs is discarded.
Hardware: x86 → All
Oh, forgot to mention that the new crash report would be discarded entirely, too.
Hardware: All → x86
comments by Lars the Wet Blanket:

It seems server-side throttling may interfere with this scheme.  How important
are accurate counts?  If throttling is set to 10% and a given user crashes 10
times at the same place, the odds are that only one of those crashes will have
made it into Socorro's database.  So rather than a count of 10, the count will
only be 1.

When comparing reports, is the signature alone sufficient to determine
equivalence?

The ramifications of this scheme for statistics reporting would have to be
studied.  Suddenly a single crash report potentially no longer represents a
single crash.  For example, this could adversely affect MTBF stats because
we've lost the MTBF information for the subsequent "duplicate" crashes.  If we
always treat a report as a single crash for stats purposes, then we are
effectively throttling duplicate crashes at a higher rate than non-duplicated
crashes.

There would also be performance costs on the Socorro processor.  The 'N'
previous crash reports would translate into N more database queries in a search
for duplicates.  I'd imagine the more common case would be for these queries to
find no duplicates, so the expense of the extra queries wouldn't have bought us
much.
think about what we are saying:

from a set of crash reports, determine whether some subset of the crashes is from the same user.

there are two ways to do this:

- one is to assign a unique identifier

- the other is to try to back out unique identification through analysis of the other data, determine which sets of reports are from the same user, and create some kind of proxy unique identifier.


both of these approaches have privacy implications for users that are in the pool of crash reports being analyzed, if there is any data there considered to be a risk to personal identification or tracking. 

the disadvantage of the second approach is that it spends lots of computational and analysis cycles to "back out" and associate a collection of reports with the same user, while the first approach makes the assignment explicit, efficient, and exact.

trying to take a single stack signature for a single release version and figure out how many users are contributing to that crash might be feasible and might support analysis on the micro level.

on this level a person doing crash analysis sees a top crash shortly after the release of a build for testing and suspects it's coming from one or a few testers.  the tools would need to operate on a few hundred crash reports with the same stack signature.


on the macro crash analysis level the goal is to help calibrate the "overall crashiness" of a release by comparing ADUs to crash reports submitted.  That means trying to look over hundreds of thousands or millions of users and crash reports to determine the kind of stuff we had with talkback. ( see: http://talkback-public.mozilla.org/reports/firefox/FF20020/smart-analysis.all )

the goal here is to monitor release to release with stats that show session-ending crash ratios.  Something like:

               unique users submitting crashes / ADUs = crash ratio

fx3b2 11/15/08:  2,012 / 107,045 = 0.0187
fx3b2 12/15/08:  2,230 / 157,200 = 0.0141

Producing numbers like this and watching them has provided an early warning system for betas, RCs, and final releases in the past, and we should try to start doing it again.  When we see a crash ratio that looks out of whack, it would be time to throw more investigative resources at finding out why.

Wide knowledge of the crash ratios gets us out of a lot of the anecdotal second-guessing about whether a recent release is stable or not.  We have spent lots of cycles on this kind of discussion in the last year.

My fear is that that kind of more global analysis of the data to determine the number of unique users submitting crashes for any one release is just going to be too expensive to produce now that we have removed the UUID and have to recalculate a proxy for a UUID in the backend processing of the data.
How about doing the duplication analysis on the client side? That way if a user crashes in the same place the client can just tell Socorro "I've crashed the same way again" instead of "Here's a new crash".
that sounds like it would involve feeding the client symbols so it could construct the stacktrace and then compare the stacktraces to determine if it's a dupe.

maybe there is some way to just do a checksum and compare the contents of the "blackbox or mini-dump" associated with each crash report, and that would get us close enough to figure out if it's a dup of any previous crash the user has had. maybe there is a way to play around with a set of crash reports to see if that approach is feasible.  I'm not sure that this solution would get us the "macro analysis number" to compute the crash ratio suggested in comment #4, but it does get us the "micro" analysis ability.
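A rough sketch of that checksum idea, assuming the client simply hashes the raw minidump bytes and keeps a small set of previous hashes (deliberately naive; register values and stack noise in the dump would likely make exact-match hashing miss many real dupes):

    import hashlib

    def is_probable_duplicate(dump_path, seen_hashes):
        """Hash the raw minidump and compare it against hashes of earlier dumps.

        `seen_hashes` is a set the client would persist between crashes.
        """
        with open(dump_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen_hashes:
            return True, digest
        seen_hashes.add(digest)
        return False, digest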
(In reply to comment #6)
> that sounds like it would involve feeding the client symbols so it could
> construct the stacktrace and then compare the stacktraces to determine if it's
> a dupe.

And doing all the data processing client-side could be expensive for users of older systems (or, really, users of any systems, potentially). We should see what that processing hit would be, sure, as well as what would be involved with feeding client symbols to every Firefox user (which sounds extra expensive).

That being said, I hadn't considered MTBF numbers being off when this idea was first proposed. We need to do this in a way that doesn't affect other statistics; i.e., we need to keep the crash reports for some stats and remove them for others, without saying "they belong to the same user"...
well... if we're really clever we skip the symbols entirely and do it by raw addresses (those should match anyway).

if you try to do it w/ symbols, the first penalty is libxul, which is on average 70 MB.

that's expensive for people w/ metered bandwidth.
(In reply to comment #4)
>                unique users submitting crashes / ADUs = crash ratio

We can easily compute this without keeping GUIDs associated with crashIDs. We could send the GUID as a separate data parameter (as opposed to including it in the crash report) and simply keep a running count of how many unique GUIDs we've seen.

The basic gist of the proposal in comment 0 is that there are ways of getting this data without associating the GUID with the crash report, by stripping apart the association and blinding the data. We need to think of a crash report like a vote: the user should be able to track it through our system, but we should not - without some information the user holds - be able to link it back to any specific user. We can do that by discarding original reports and keeping aggregate data wherever possible.
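One way the running count could work without ever tying a GUID to a crash ID (a sketch under assumed names, not an existing Socorro feature): hash each submitted GUID with a per-day value, keep the hashes only long enough to count the day's distinct submitters, then record just the aggregate.

    import hashlib

    # Illustrative only: count distinct submitters per day without linking the
    # GUID to any crash report. The set is cleared when the day rolls over, so
    # only the aggregate count survives.
    seen_today = set()

    def note_submission(guid, daily_salt):
        seen_today.add(hashlib.sha256((daily_salt + guid).encode()).hexdigest())

    def end_of_day():
        unique_submitters = len(seen_today)
        seen_today.clear()
        return unique_submitters  # unique_submitters / ADUs = crash ratio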
(In reply to comment #6)
> that sounds like it would involve feeding the client symbols so it could
> construct the stacktrace and then compare the stacktraces to determine if it's
> a dupe.

No you certainly wouldn't want to have to get the symbols when you crash. You can often unwind the stack without symbols but that won't necessarily give the same results on platforms (e.g. Windows) where we use unwind information when processing the minidumps.

> maybe there is some way to just do a checksum and compare the contents of the
> "blackbox or mini-dump" associated with each crash report, and that would get
> us close enough to figure out if it's a dup of any previous crash the user has
> had. maybe there is a way to play around with a set of crash reports to see if
> that approach is feasible.


A decent way to hash the mini-dump would be to go through the stack data and only hash addresses on the stack that correspond to code areas in memory. This method is less likely to give false positives than trying to do an unwind on the client, and it is also much simpler. I think it should do a reasonably good job of identifying crashes that were the same, but we'd need to check whether there'd be too many false negatives.
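A sketch of that hashing scheme, assuming the crashed thread's stack has already been read out as a list of word-sized integers and the minidump's module list is available as (name, base, size) ranges; the minidump parsing itself is omitted. Normalizing to module-relative offsets keeps the hash stable across ASLR.

    import hashlib
    import struct

    def stack_hash(stack_words, modules):
        """Hash only the stack values that land inside a loaded module's range."""
        h = hashlib.sha256()
        for value in stack_words:
            for name, base, size in modules:
                if base <= value < base + size:
                    # Hash the module name plus the module-relative offset so
                    # the result doesn't change with address-space layout.
                    h.update(name.encode())
                    h.update(struct.pack("<Q", value - base))
                    break
        return h.hexdigest()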
If we just wanted to have the client answer "does this crash look exactly like one I've previously hit", comparing the module-relative address of the top of the crashed stack is probably good enough. Yes, if you had two crashes at the same place with different stacks it would be a false positive.

Another way to work around this that was floated on IRC/teleconference was to hash the remote IP + some value that would change per-day, so you could tell if a number of crashes submitted in the same day were the same user (to a close enough approximation). Not sure if that solves the overall problem if a user is crashing frequently, but not multiple times daily.
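A sketch of that salted-hash idea (the per-day salt and the function name here are assumptions): attach hash(daily_salt + remote_ip) to each submission as an opaque same-day grouping key, and discard the salt at the end of the day so keys can't be reversed or correlated across days.

    import hashlib

    def same_day_key(remote_ip, daily_salt):
        """Opaque key stable for one submitter for one day only.

        `daily_salt` is a random value rotated every 24 hours and then thrown
        away, so keys from different days cannot be linked to each other or
        traced back to the IP.
        """
        return hashlib.sha256((daily_salt + remote_ip).encode()).hexdigest()

Same-day reports sharing a key could then be counted as one user, which covers the "same user, many crashes today" case but, as noted, not a user who crashes frequently yet less than daily.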
The macro crash analysis is an entirely different issue and should be covered in a different bug. The way talkback calculated MTBF was statistical nonsense because it ignored users who never crashed, who are a significant percentage of our userbase (we hope!).

For the micro analysis of "I have 83 crashes here, are they all from the same person?", I think the solution proposed in comment 0 is too complex. We can get numbers that are "good enough", I think, by correlating the DLL list and version numbers. You'll get a pretty good sense of whether there were 65 users or 1-4 users submitting crashes by correlating these. This information is already public, and you could do this analysis today without any access to the database. In fact, if someone has a current crash they'd like to get data for, I'd be happy to write a script to do it and see if it gets useful results.
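A sketch of what such a script might do (the per-report `modules` field is an assumption about how the public module list would be fed in, not Socorro's export format): fingerprint each report by its sorted (module, version) list and count the distinct fingerprints; a handful of fingerprints across 83 reports suggests a handful of users.

    from collections import Counter

    def estimate_submitters(reports):
        """Rough guess at how many distinct machines submitted these reports.

        Each report is assumed to carry a `modules` list of (name, version)
        pairs; identical fingerprints are treated as probably the same machine.
        """
        fingerprints = Counter()
        for report in reports:
            fp = tuple(sorted(tuple(m) for m in report["modules"]))
            fingerprints[fp] += 1
        return len(fingerprints), fingerprints.most_common(5)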
>  The way talkback calculated MTBF was statistical nonsense because it ignored
> users who never crashed, which are a significant percentage
> of our userbase (we hope!).

Statistical nonsense or not, there was a great deal of value in knowing

 "of the users that crashed, how long did they run before crashing"

and

 "of all the users that ran a given release of firefox how many ended the session with a crash"

that's what the talkback data allowed us to view over the life of a release, and release to release.  that is a key piece of release management data that we are still missing and spend hours discussing and speculating about since we don't have the data.
>  "of the users that crashed, how long did they run before crashing"

We collect this data. Why aren't you using it?

>  "of all the users that ran a given release of firefox how many ended the
> session with a crash"

What is the value in this number? You can't really compare it to much of anything. It isn't a proxy for "how crashy is this version".
(In reply to comment #14)
> >  "of the users that crashed, how long did they run before crashing"
> 
> We collect this data. Why aren't you using it?
> 
> >  "of all the users that ran a given release of firefox how many ended the
> > session with a crash"
> 
> What is the value in this number? You can't really compare it to much of
> anything. It isn't a proxy for "how crashy is this version".

And if we figured out what the number meant, we'd probably want server-side throttling to trust it more.

(Perhaps we should always report version/OS on crash, but throttle away the crash dump processing.)
(In reply to comment #2)
> Oh, forgot to mention that the new crash report would be discarded entirely,
> too.

(In reply to comment #9)
[...]
> The basic gist of the proposal in comment 0 is that there are ways of getting
> this data without associating the GUID with the crash report, by stripping
> apart the association and blinding the data. We need to think of a crash report
> like a vote: the user should be able to track it through our system, but we
> should not - without some information the user holds - be able to link it back
> to any specific user. We can do that by discarding original reports and keeping
> aggregate data wherever possible.

Not sure I understand what this whole bug is about -- but: if a user clicks on a link in about:crashes, and that crash was originally "discarded as a dupe", will he unknowingly get data collected for a _different_ crash? Or would the GUID force processing of the exact crash which happened at the date-time shown on that line of about:crashes?
Whiteboard: cloud
Closing as WONTFIX as per comment 12, which seems more sensible.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Component: Socorro → General
Product: Webtools → Socorro