Closed Bug 629088 Opened 12 years ago Closed 11 years ago

Mark suspected duplicates

Categories

(Socorro :: General, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: rhelmer)

References

Details

Attachments

(3 files)

Need:
- new db column
- cron job to calculate

Criteria for dupes are as follows.

Where reports have:
- same datetime
- same uptime (or slightly increasing, within an hour)
- same OS
- same build
- same address
- installAge is a better indicator, via chofmann.  This is in the CSV.

Need to backfill last two months.
Assignee: nobody → rhelmer
(In reply to comment #0)
> Need:
> new db column 

What would the new column be and on which table?

Column "duplicate" (or maybe "suspected_duplicate") on table "reports"?
Robert,

Actually, the column would be "duplicate_of" and would be the UUID for the "first" example of that group of suspected duplicates.

I need to hash out whether it makes more sense to put the duplication information in the reports table or to give it its own table(s).  The fact that we're planning to drop all duplicates once we feel like we can detect them accurately (and possibly revise the algorithm and rebuild the duplicates list a couple of times in the process) would argue that we should use a separate table.
I guess the other argument for a separate table is that it saves monkeying with the reports table...not a bad thing given its size.  We may want to try running a few different dupe detection queries as we tune that too.

So:
duplicates(ooid, duplicate_of)
?
Assignee: rhelmer → josh
So, here's the initial algorithm we're going to try:

Reports submitted within 1 hour of each other, in which the following elements are identical:

OS
Build
Product
Version
Signature

And where the following values are either identical, or are within a reasonable window (1 hour) of each other and progress together:

client_crash_date
install_age
uptime

And where the data in optional fields is not different:

url
cpu_name
cpu_info
reason
address
email
user comments
app_notes

... is all of the above correct?
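As a sanity check, the matching rule above can be sketched as a predicate in Python. This is an illustration only, not Socorro code: the field names, the dict representation, and the "progress together" rule (all deltas share a sign) are my own assumptions.

```python
# Sketch of the proposed matching rule; reports are plain dicts
# with hypothetical field names, time-like values in seconds.
EXACT = ["os", "build", "product", "version", "signature"]
FUZZY = ["client_crash_date", "install_age", "uptime"]
OPTIONAL = ["url", "cpu_name", "cpu_info", "reason",
            "address", "email", "user_comments", "app_notes"]
WINDOW = 3600  # the 1-hour window

def is_suspected_dup(a, b):
    # required fields must be identical
    if any(a[f] != b[f] for f in EXACT):
        return False
    # time-like fields must be within the window and progress
    # together (all deltas share the same sign)
    deltas = [b[f] - a[f] for f in FUZZY]
    if any(abs(d) > WINDOW for d in deltas):
        return False
    if not (all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas)):
        return False
    # optional fields must not differ when both are present
    return all(a[f] is None or b[f] is None or a[f] == b[f]
               for f in OPTIONAL)
```

Treating a missing optional value (None) as "not different" keeps reports with sparse metadata from being excluded outright.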
Assignee: josh → rhelmer
Yeah, looks good.  I'd say give this a whirl and let's see what it turns up.

We do see some dups that seem to "suspend" for a while before another batch comes in, so two batches that are outside the 1-hour limit of each other won't get connected, but I think that's probably OK.  It will keep us overcounting the same event, but by what I think will be a small number.

maybe run a report like I did in 
https://bugzilla.mozilla.org/show_bug.cgi?id=574711#c17

and let's see if it turns up a list similar to the attachment there.  

We might need some conversion/rounding on the time values to get them into the same units (minutes?).
BTW, if we need to start optimizing, the first value to look at might be "uptime."

For my report in bug 574711 I first searched for reports on the same signature that had an uptime of less than 10 seconds, and that filter surfaced a good set of candidate dups to do the additional comparison checks on.

10 seconds is probably more restrictive than we need or want, but an uptime of an hour is probably more than we need or want too.
Chris:

I guess I need some clarity on what we consider a duplicate, so I know what I'm looking for.  Here are the two cases:

CASE A: 
-- User is using Firefox
-- Firefox crashes
-- Due to a bug in FF or Breakpad, it submits the crash report several times.

CASE B:
-- User is using Firefox
-- Firefox crashes
-- User restarts Firefox right away
-- Firefox crashes again with the same signature

Searching for CASE A is significantly different from searching for CASE B.   So which one am I looking for?
In general I believe we're looking for case A, a genuine dupe.  Case B is actually a separate crash.

However, there are times when you won't be able to tell the difference, if time since startup is consistently low.

Try it and see.  I'd like some results: we can iterate on the algorithm.
If we can, we should try to find and mark both kinds.

Let's call A "Duplicated Crashes" and B "Replicated Crashes".

It's possible that we might need to break down the replicated crashes as well once we start looking at samples of that data.
The long-term plan should be to get rid of the duplicate crashes, and to keep marking and maybe consolidating the replicated crashes if we are pretty certain that the user is able to reproduce/repeat the same crash.

Once replicated crashes are identified we can put special emphasis on trying to diagnose those, since they might be easier to reproduce, or at least they are reproducible for one or more people.
Group: mozilla-corporation-confidential
I've uploaded some duplicate search results under a separate, mozilla-confidential bug:  https://bugzilla.mozilla.org/show_bug.cgi?id=634445

Please examine these results and suggest ways to reduce the number of false positives.  Currently I'm getting a 10% hit rate, and that's probably way too high.
Here's the first version of the duplicate-finding query:

create table possible_dups as
select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from (
    select uuid,
        install_age,
        uptime,
        first_value(uuid) over (
            partition by
                product,
                version,
                build,
                signature,
                cpu_name,
                cpu_info,
                os_name,
                os_version,
                address,
                topmost_filenames,
                reason
            order by
                client_crash_date,
                uuid
        ) as leader_uuid
    from reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN (
    select uuid, install_age, uptime
    FROM reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as leader
    ON follower.leader_uuid = leader.uuid
WHERE ( ( leader.uptime - follower.uptime BETWEEN -300 AND 900 )
        OR follower.uptime < 300 )
  AND follower.uuid <> leader.uuid;
What that query says is:

For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
            product,
            version,
            build,
            signature,
            cpu_name,
            cpu_info,
            os_name,
            os_version,
            address,
            topmost_filenames,
            reason
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the followers.

Further filter by making sure that the leader's uptime minus the follower's uptime is between -5 min and +15 min, or that the follower's uptime is less than 5 min.

The query is hard to read, but actually runs pretty fast, which is why I wrote it that way.
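For illustration, the leader/follower selection that `first_value()` performs can be sketched in Python. The record shape is a hypothetical minimal stand-in (a precomputed `key` tuple for the identical-match columns), not the production data model; the window values mirror the query above.

```python
# Sketch of the leader/follower logic done by first_value()
# OVER (PARTITION BY ... ORDER BY client_crash_date, uuid).
from collections import defaultdict

def find_possible_dups(reports):
    """reports: list of dicts with 'uuid', 'client_crash_date',
    'uptime', and a 'key' tuple of the identical-match columns."""
    groups = defaultdict(list)
    for r in reports:
        groups[r["key"]].append(r)
    pairs = []
    for members in groups.values():
        # leader = lowest (client_crash_date, uuid), like first_value()
        leader = min(members, key=lambda r: (r["client_crash_date"], r["uuid"]))
        for f in members:
            if f["uuid"] == leader["uuid"]:
                continue
            # follower uptime within the window of the leader,
            # or a very short session
            delta = leader["uptime"] - f["uptime"]
            if -300 <= delta <= 900 or f["uptime"] < 300:
                pairs.append((leader["uuid"], f["uuid"]))
    return pairs
```

The SQL version does the same grouping in a single pass over one hour of reports, which is why the window function formulation is fast despite being hard to read.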
Here's our new query:

select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from (
    select uuid,
        install_age,
        uptime,
        client_crash_date,
        first_value(uuid) over (
            partition by
                product,
                version,
                build,
                signature,
                cpu_name,
                cpu_info,
                os_name,
                os_version,
                address,
                topmost_filenames,
                reason,
                app_notes,
                url
            order by
                client_crash_date,
                uuid
        ) as leader_uuid
    from reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN (
    select uuid, install_age, uptime, client_crash_date
    FROM reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as leader
    ON follower.leader_uuid = leader.uuid
WHERE ( same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
                        leader.uptime, follower.uptime)
        OR follower.uptime < 60
      )
  AND same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
                      leader.install_age, follower.install_age)
  AND follower.uuid <> leader.uuid;

same_time_fuzzy is a SQL function I added to check deltas on install_age and uptime:

create or replace function same_time_fuzzy(
	date1 timestamptz, date2 timestamptz,
	interval_secs1 int, interval_secs2 int
) returns boolean
language sql as $f$
SELECT
-- return true if either interval is null
-- so we don't exclude crashes missing data
CASE WHEN $3 IS NULL THEN
	TRUE
WHEN $4 IS NULL THEN
	TRUE
-- otherwise check that the two timestamp deltas
-- and the two interval deltas are within 60 sec
-- of each other
ELSE
	(
		extract ('epoch' from ( $2 - $1 ) ) -
		( $4 - $3 ) 
	) BETWEEN -60 AND 60
END;
$f$;
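As a cross-check, here is a Python equivalent of the function's logic; a sketch for illustration only, with timestamps as epoch seconds rather than timestamptz values.

```python
def same_time_fuzzy(date1, date2, secs1, secs2):
    # Mirror of the SQL function: date1/date2 as epoch seconds,
    # secs1/secs2 as interval lengths in seconds.
    # A NULL (None) interval passes, so crashes that are missing
    # data are not excluded.
    if secs1 is None or secs2 is None:
        return True
    # the timestamp delta and the interval delta must agree
    # to within 60 seconds
    return -60 <= (date2 - date1) - (secs2 - secs1) <= 60
```

For example, two reports whose crash times are 100 seconds apart match only if their uptimes (or install_ages) also grew by roughly 100 seconds.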

Effectively, here's the translation of the new query:

For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
            product,
            version,
            build,
            signature,
            cpu_name,
            cpu_info,
            os_name,
            os_version,
            address,
            topmost_filenames,
            reason,
            app_notes,
            url
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the followers.

Further filter by making sure that the delta between leader and follower uptimes and install_ages is the same, within a 60-second margin of error, OR that uptime is less than 60 seconds.

This narrows things down to 412 suspected duplicates (of 250 original reports), or a possible duplicate rate of 2.5%.  I think this may be narrow enough to go forward with if people think the sample results are good enough.

Please see the other bug to examine the sample results. https://bugzilla.mozilla.org/show_bug.cgi?id=634445
Depends on: 636283
Attached is an SQL file of the database changes which need to go into schema.py to support the new duplicate-finding feature.  Over to you, Rob.
Ran in dev; it seems to create the desired table and functions.
Attachment #514913 - Flags: review?(laura)
Attachment #514913 - Flags: review?(lars)
Comment on attachment 514913 [details] [diff] [review]
add schema changes to schema.py

this really brings into sharp focus how inadequate the schema.py system is for maintaining the database schema.  The gap between how this system was conceived to work and what it has become is now huge.

In the original spec, each table was to be defined with its own class.  Embedding a table definition within another, like 'productdims_version_sort' is defined within 'productdims', was not part of how this was to work.  The intent was to follow very roughly how SQLAlchemy defines table objects, with the idea that we'd leave the door open to returning to that standard sometime in the future.

I'm going to r+ this just to keep us moving with the caveat that we MUST refactor this code to a better compromise very soon.  In fact, I'm going to file a bug about this issue right now.
Attachment #514913 - Flags: review?(lars) → review+
Here's the patch as landed. No changes to the ReportsDuplicatesTable class; I just removed noise and line offsets introduced by other schema changes and fixed unit test failures.

Committed revision 2986.
(In reply to comment #18)
> I'm going to r+ this just to keep us moving with the caveat that we MUST
> refactor this code to a better compromise very soon.  In fact, I'm going to
> file a bug about this issue right now.

Agreed, thanks for filing the followup.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Attachment #514913 - Flags: review?(laura)
Blocks: 652829
Blocks: 651543
Component: Socorro → General
Product: Webtools → Socorro