Closed Bug 629088 Opened 12 years ago Closed 11 years ago

Mark suspected duplicates

Categories

(Socorro :: General, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: rhelmer)

References

Details

Attachments

(3 files)

Need:
- new db column
- cron job to calculate

Criteria for dupes are as follows.

Where reports have:
- same datetime
- same uptime (or slightly increasing, within an hour)
- same OS
- same build
- same address
- installAge is a better indicator, via chofmann.  This is in the CSV.

Need to backfill last two months.
Assignee: nobody → rhelmer
(In reply to comment #0)
> Need:
> new db column 

What would the new column be and on which table?

Column "duplicate" (or maybe "suspected_duplicate") on table "reports"?
Robert,

Actually, the column would be "duplicate_of" and would be the UUID for the "first" example of that group of suspected duplicates.

I need to hash out whether it makes more sense to put the duplication information in the reports table or to give it its own table(s).  The fact that we're planning to drop all duplicates once we feel like we can detect them accurately (and possibly revise the algorithm and rebuild the duplicates list a couple of times in the process) would argue that we should use a separate table.
I guess the other argument for a separate table is that it saves monkeying with the reports table...not a bad thing given its size.  We may want to try running a few different dupe detection queries as we tune that too.

So:
duplicates(ooid, duplicate_of)
?
Assignee: rhelmer → josh
So, here's the initial algorithm we're going to try:

Reports submitted within 1 hour of each other, in which the following elements are identical:

OS
Build
Product
Version
Signature

And where the following values are either identical, or are within a reasonable window (1 hour) of each other and progress together:

client_crash_date
install_age
uptime

And where the data in optional fields is not different:

url
cpu_name
cpu_info
reason
address
email
user comments
app_notes

... is all of the above correct?
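As a sanity check, the matching rule above can be sketched as a predicate in Python. This is an illustration only, not Socorro code: the field names, the dict representation, and the "progress together" rule (all deltas share a sign) are my own assumptions.

```python
# Sketch of the proposed matching rule; reports are plain dicts
# with hypothetical field names, time-like values in seconds.
EXACT = ["os", "build", "product", "version", "signature"]
FUZZY = ["client_crash_date", "install_age", "uptime"]
OPTIONAL = ["url", "cpu_name", "cpu_info", "reason",
            "address", "email", "user_comments", "app_notes"]
WINDOW = 3600  # the 1-hour window

def is_suspected_dup(a, b):
    # required fields must be identical
    if any(a[f] != b[f] for f in EXACT):
        return False
    # time-like fields must be within the window and progress
    # together (all deltas share the same sign)
    deltas = [b[f] - a[f] for f in FUZZY]
    if any(abs(d) > WINDOW for d in deltas):
        return False
    if not (all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas)):
        return False
    # optional fields must not differ when both are present
    return all(a[f] is None or b[f] is None or a[f] == b[f]
               for f in OPTIONAL)
```

Treating a missing optional value (None) as "not different" keeps reports with sparse metadata from being excluded outright.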
Assignee: josh → rhelmer
Yeah, looks good.  I'd say give this a whirl and let's see what it turns up.

We do see some dups that seem to "suspend" for a while before another batch comes in, so two batches that are outside the 1-hour limit of each other won't get connected, but I think that's probably OK.  It will keep us overcounting the same event, but by what I think will be a small number.

maybe run a report like I did in 
https://bugzilla.mozilla.org/show_bug.cgi?id=574711#c17

and let's see if it turns up a list similar to the attachment there.  

We might need some conversion/rounding on the time values to get them into the same units (minutes?).
BTW, if we need to start optimizing, the first value to look at might be "uptime."

For my report in bug 574711 I first searched for reports on the same signature that had an uptime of less than 10 seconds, and that filter surfaced a good set of candidate dups to do the additional comparison checks on.

10 seconds is probably more restrictive than we need or want, but an uptime of an hour is probably more than we need or want too.
Chris:

I guess I need some clarity on what we consider a duplicate, so I know what I'm looking for.  Here are the two cases:

CASE A: 
-- User is using Firefox
-- Firefox crashes
-- Due to a bug in FF or Breakpad, it submits the crash report several times.

CASE B:
-- User is using Firefox
-- Firefox crashes
-- User restarts Firefox right away
-- Firefox crashes again with the same signature

Searching for CASE A is significantly different from searching for CASE B.   So which one am I looking for?
In general I believe we're looking for case A, a genuine dupe.  Case B is actually a separate crash.

However, there are times when you won't be able to tell the difference, if time since startup is consistently low.

Try it and see.  I'd like some results: we can iterate on the algorithm.
If we can, we should try to find and mark both kinds.

Let's call A "Duplicated Crashes" and B "Replicated Crashes".

It's possible that we might need to break down the replicated crashes as well once we start looking at samples of that data.
The long-term plan should be to get rid of the duplicate crashes, and to keep marking and maybe consolidating the replicated crashes if we are pretty certain that the user is able to reproduce/repeat the same crash.

Once replicated crashes are identified we can put special emphasis on trying to diagnose those, since they might be easier to reproduce, or at least they are reproducible for one or more people.
Group: mozilla-corporation-confidential
I've uploaded some duplicate search results under a separate, mozilla-confidential bug:  https://bugzilla.mozilla.org/show_bug.cgi?id=634445

Please examine these results and suggest ways to reduce the number of false positives.  Currently I'm getting a 10% hit rate, and that's probably way too high.
Here's the first version of the duplicate-finding query:

create table possible_dups as
select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from (
    select uuid,
        install_age,
        uptime,
        first_value(uuid) over (
            partition by
                product,
                version,
                build,
                signature,
                cpu_name,
                cpu_info,
                os_name,
                os_version,
                address,
                topmost_filenames,
                reason
            order by
                client_crash_date,
                uuid
        ) as leader_uuid
    from reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN (
    select uuid, install_age, uptime
    FROM reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as leader
    ON follower.leader_uuid = leader.uuid
WHERE ( ( leader.uptime - follower.uptime BETWEEN -300 AND 900 )
        OR follower.uptime < 300 )
  AND follower.uuid <> leader.uuid;
What that query says is:

For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
            product,
            version,
            build,
            signature,
            cpu_name,
            cpu_info,
            os_name,
            os_version,
            address,
            topmost_filenames,
            reason
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the followers.

Further filter by making sure that the leader's uptime minus the follower's uptime is between -5 min and +15 min, or that the follower's uptime is less than 5 min.

The query is hard to read, but actually runs pretty fast, which is why I wrote it that way.
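For illustration, the leader/follower selection that `first_value()` performs can be sketched in Python. The record shape is a hypothetical minimal stand-in (a precomputed `key` tuple for the identical-match columns), not the production data model; the window values mirror the query above.

```python
# Sketch of the leader/follower logic done by first_value()
# OVER (PARTITION BY ... ORDER BY client_crash_date, uuid).
from collections import defaultdict

def find_possible_dups(reports):
    """reports: list of dicts with 'uuid', 'client_crash_date',
    'uptime', and a 'key' tuple of the identical-match columns."""
    groups = defaultdict(list)
    for r in reports:
        groups[r["key"]].append(r)
    pairs = []
    for members in groups.values():
        # leader = lowest (client_crash_date, uuid), like first_value()
        leader = min(members, key=lambda r: (r["client_crash_date"], r["uuid"]))
        for f in members:
            if f["uuid"] == leader["uuid"]:
                continue
            # follower uptime within the window of the leader,
            # or a very short session
            delta = leader["uptime"] - f["uptime"]
            if -300 <= delta <= 900 or f["uptime"] < 300:
                pairs.append((leader["uuid"], f["uuid"]))
    return pairs
```

The SQL version does the same grouping in a single pass over one hour of reports, which is why the window function formulation is fast despite being hard to read.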
Here's our new query:

select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from (
    select uuid,
        install_age,
        uptime,
        client_crash_date,
        first_value(uuid) over (
            partition by
                product,
                version,
                build,
                signature,
                cpu_name,
                cpu_info,
                os_name,
                os_version,
                address,
                topmost_filenames,
                reason,
                app_notes,
                url
            order by
                client_crash_date,
                uuid
        ) as leader_uuid
    from reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN (
    select uuid, install_age, uptime, client_crash_date
    FROM reports
    where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as leader
    ON follower.leader_uuid = leader.uuid
WHERE ( same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
                        leader.uptime, follower.uptime)
        OR follower.uptime < 60
      )
  AND same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
                      leader.install_age, follower.install_age)
  AND follower.uuid <> leader.uuid;

same_time_fuzzy is a SQL function I added to check deltas on install_age and uptime:

create or replace function same_time_fuzzy(
	date1 timestamptz, date2 timestamptz,
	interval_secs1 int, interval_secs2 int
) returns boolean
language sql as $f$
SELECT
-- return true if either interval is null
-- so we don't exclude crashes missing data
CASE WHEN $3 IS NULL THEN
	TRUE
WHEN $4 IS NULL THEN
	TRUE
-- otherwise check that the two timestamp deltas
-- and the two interval deltas are within 60 sec
-- of each other
ELSE
	(
		extract ('epoch' from ( $2 - $1 ) ) -
		( $4 - $3 ) 
	) BETWEEN -60 AND 60
END;
$f$;
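As a cross-check, here is a Python equivalent of the function's logic; a sketch for illustration only, with timestamps as epoch seconds rather than timestamptz values.

```python
def same_time_fuzzy(date1, date2, secs1, secs2):
    # Mirror of the SQL function: date1/date2 as epoch seconds,
    # secs1/secs2 as interval lengths in seconds.
    # A NULL (None) interval passes, so crashes that are missing
    # data are not excluded.
    if secs1 is None or secs2 is None:
        return True
    # the timestamp delta and the interval delta must agree
    # to within 60 seconds
    return -60 <= (date2 - date1) - (secs2 - secs1) <= 60
```

For example, two reports whose crash times are 100 seconds apart match only if their uptimes (or install_ages) also grew by roughly 100 seconds.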

Effectively, here's the translation of the new query:

For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
            product,
            version,
            build,
            signature,
            cpu_name,
            cpu_info,
            os_name,
            os_version,
            address,
            topmost_filenames,
            reason,
            app_notes,
            url
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the followers.

Further filter by making sure that the delta between leader and follower uptimes and install_ages is the same, within a 60-second margin of error, OR that uptime is less than 60 seconds.

This narrows things down to 412 suspected duplicates (of 250 original reports), or a possible duplicate rate of 2.5%.  I think this may be narrow enough to go forward with if people think the sample results are good enough.

Please see the other bug to examine the sample results. https://bugzilla.mozilla.org/show_bug.cgi?id=634445
Depends on: 636283
Attached is an SQL file of the database changes which need to go into schema.py to support the new duplicate-finding feature.  Over to you, Rob.
Ran in dev; it seems to create the desired table and functions.
Attachment #514913 - Flags: review?(laura)
Attachment #514913 - Flags: review?(lars)
Comment on attachment 514913 [details] [diff] [review]
add schema changes to schema.py

this really brings into sharp focus how inadequate the schema.py system is for maintaining the database schema.  The gap between how this system was conceived to work and what it has become is now huge.

In the original spec, each table was to be defined with its own class.  Embedding a table definition within another, like 'productdims_version_sort' is defined within 'productdims', was not part of how this was to work.  The intent was to follow very roughly how SQLAlchemy defines table objects, with the idea that we'd leave the door open to returning to that standard sometime in the future.

I'm going to r+ this just to keep us moving with the caveat that we MUST refactor this code to a better compromise very soon.  In fact, I'm going to file a bug about this issue right now.
Attachment #514913 - Flags: review?(lars) → review+
Here's the patch as landed. No changes to the ReportsDuplicatesTable class; I just removed noise and line offsets introduced by other schema changes and fixed unit test failures.

Committed revision 2986.
(In reply to comment #18)
> I'm going to r+ this just to keep us moving with the caveat that we MUST
> refactor this code to a better compromise very soon.  In fact, I'm going to
> file a bug about this issue right now.

Agreed, thanks for filing the followup.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Attachment #514913 - Flags: review?(laura)
Blocks: 652829
Blocks: 651543
Component: Socorro → General
Product: Webtools → Socorro