Bug 629088: Mark suspected duplicates
Status: Closed (RESOLVED FIXED)
Opened 14 years ago • Closed 14 years ago
Categories: Socorro :: General, task
Tracking: (Not tracked)
Target Milestone: 1.7.7
People: (Reporter: laura, Assigned: rhelmer)
Attachments (3 files):
- 3.25 KB, text/plain
- 26.46 KB, patch (lars: review+)
- 10.03 KB, patch
Need:
new db column
cron job to calculate
Criteria for dupes are as follows.
Where reports have:
- same datetime
- same uptime (or slightly increasing, within an hour)
- same OS
- same build
- same address
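A minimal SQL sketch of these criteria, assuming column names from the Socorro reports table (illustrative only, not the final algorithm):

select r1.uuid as first_uuid, r2.uuid as suspected_dup
from reports r1
join reports r2
  on  r2.client_crash_date = r1.client_crash_date  -- same datetime
  and r2.os_name = r1.os_name                      -- same OS
  and r2.build   = r1.build                        -- same build
  and r2.address = r1.address                      -- same address
  -- same uptime, or slightly increasing within an hour
  and r2.uptime between r1.uptime and r1.uptime + 3600
where r1.uuid < r2.uuid;  -- avoid pairing a report with itself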
Reporter
Comment 1•14 years ago
- installAge is a better indicator, via chofmann. This is in the CSV.
Need to backfill last two months.
Assignee: nobody → rhelmer
Assignee
Comment 2•14 years ago
(In reply to comment #0)
> Need:
> new db column
What would the new column be and on which table?
Column "duplicate" (or maybe "suspected_duplicate") on table "reports"?
Comment 3•14 years ago
Robert,
Actually, the column would be "duplicate_of" and would be the UUID for the "first" example of that group of suspected duplicates.
I need to hash out whether it makes more sense to put the duplication information in the reports table or to give it its own table(s). The fact that we're planning to drop all duplicates once we feel like we can detect them accurately (and possibly revise the algorithm and rebuild the duplicates list a couple of times in the process) would argue that we should use a separate table.
Reporter
Comment 4•14 years ago
I guess the other argument for a separate table is that it saves monkeying with the reports table...not a bad thing given its size. We may want to try running a few different dupe detection queries as we tune that too.
So:
duplicates(ooid, duplicate_of)
?
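A sketch of what that separate table might look like (names and types here are assumptions, not a final schema):

create table duplicates (
    ooid text not null primary key,  -- the suspected duplicate report
    duplicate_of text not null       -- uuid of the "first" report in the group
);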
Reporter
Updated•14 years ago
Assignee: rhelmer → josh
Comment 5•14 years ago
So, here's the initial algorithm we're going to try:
Reports submitted within 1 hour of each other, in which the following elements are identical:
OS
Build
Product
Version
Signature
And where the following values are either identical, or are within a reasonable window (1 hour) of each other and progress together:
client_crash_date
install_age
uptime
And where the data in optional fields is not different:
url
cpu_name
cpu_info
reason
address
email
user comments
app_notes
... is all of the above correct?
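To illustrate the "progress together" idea with a hypothetical example: two reports whose client_crash_dates differ by 40 seconds and whose uptimes also differ by about 40 seconds look like one crash submitted twice:

select abs( extract(epoch from (timestamptz '2011-02-01 12:00:40'
                              - timestamptz '2011-02-01 12:00:00'))
            - (340 - 300) ) <= 60 as progresses_together;
-- true: the 40 s crash-date delta matches the 40 s uptime delta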
Assignee: josh → rhelmer
Comment 6•14 years ago
Yeah, looks good. I'd say give this a whirl and let's see what it turns up.
We do see some dups that seem to "suspend" for a while before another batch comes in, so two batches that fall outside the one-hour limit of each other won't get connected, but I think that's probably OK. It will keep us overcounting the same event, but by what I think will be a small number.
Maybe run a report like I did in
https://bugzilla.mozilla.org/show_bug.cgi?id=574711#c17
and let's see if it turns up a list similar to the attachment there.
We might need some conversion/rounding on the time values to get them into the same units (minutes?).
Comment 7•14 years ago
BTW, if we need to start optimizing, the first value to look at might be "uptime."
For my report in bug 574711, I first searched for reports on the same signature that had an uptime of less than 10 seconds, and that filter surfaced a good set of candidate dups to run the additional comparison checks on.
10 seconds is probably more restrictive than we need or want, but an uptime of an hour is probably more than we need or want, too.
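As a hypothetical sketch of that kind of pre-filter (the signature value is a placeholder; columns are assumed from the reports table):

-- surface candidate dups for one signature by filtering on low uptime
select uuid, client_crash_date, uptime, install_age
from reports
where signature = 'SomeSignature'  -- placeholder value
  and uptime < 10
order by client_crash_date;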
Comment 8•14 years ago
Chris:
I guess I need some clarity on what we consider a duplicate, so I know what I'm looking for. Here are the two cases:
CASE A:
-- User is using Firefox
-- Firefox crashes
-- Due to a bug in FF or Breakpad, it submits the crash report several times.
CASE B:
-- User is using Firefox
-- Firefox crashes
-- User restarts Firefox right away
-- Firefox crashes again with the same signature
Searching for CASE A is significantly different from searching for CASE B. So which one am I looking for?
Reporter
Comment 9•14 years ago
In general I believe we're looking for case A, a genuine dupe. Case B is actually a separate crash.
However, there are times when you won't be able to tell the difference, if time since startup is consistently low.
Try it and see. I'd like some results: we can iterate on the algorithm.
Comment 10•14 years ago
If we can, we should try to find and mark both kinds.
Let's call A "Duplicated Crashes" and B "Replicated Crashes".
It's possible that we might need to break down the replicated crashes as well once we start looking at samples of that data.
Comment 11•14 years ago
The long-term plan should be to get rid of the duplicated crashes, and to keep marking and maybe consolidating the replicated crashes if we are pretty certain that the user is able to reproduce/repeat the same crash.
Once replicated crashes are identified, we can put special emphasis on trying to diagnose them, since they might be easier to reproduce, or are at least reproducible for one or more people.
Assignee
Updated•14 years ago
Group: mozilla-corporation-confidential
Assignee
Updated•14 years ago
Group: mozilla-corporation-confidential
Comment 12•14 years ago
I've uploaded some duplicate search results under a separate, mozilla-confidential bug: https://bugzilla.mozilla.org/show_bug.cgi?id=634445
Please examine these results and suggest ways to reduce the number of false positives. Currently I'm getting a 10% hit rate, and that's probably way too high.
Comment 13•14 years ago
Here's the first version of the duplicate-finding query:
create table possible_dups as
select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from
(
-- inner query: within each group of reports whose key fields all match,
-- pick the first (client_crash_date, uuid) as that group's leader
select uuid,
install_age,
uptime,
first_value(uuid)
over ( partition by
product,
version,
build,
signature,
cpu_name,
cpu_info,
os_name,
os_version,
address,
topmost_filenames,
reason
order by
client_crash_date,
uuid
) as leader_uuid
from reports
where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN
-- join each report back to its leader's row to compare uptimes
( select uuid, install_age, uptime
FROM reports
where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00' ) as leader
ON follower.leader_uuid = leader.uuid
-- keep reports whose uptime is close to the leader's, or very low
WHERE ( ( leader.uptime - follower.uptime BETWEEN -300 AND 900 )
OR follower.uptime < 300 )
AND follower.uuid <> leader.uuid;
Comment 14•14 years ago
What that query says is:
For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
product,
version,
build,
signature,
cpu_name,
cpu_info,
os_name,
os_version,
address,
topmost_filenames,
reason
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the "followers".
Further filter by making sure that the leader's uptime minus the follower's uptime is between -5 min and +15 min, or that the follower's uptime is less than 5 min.
The query is hard to read, but actually runs pretty fast, which is why I wrote it that way.
Comment 15•14 years ago
Here's our new query:
select leader.uuid as leader_uuid, follower.uuid as follower_uuid
from
(
select uuid,
install_age,
uptime,
client_crash_date,
first_value(uuid)
over ( partition by
product,
version,
build,
signature,
cpu_name,
cpu_info,
os_name,
os_version,
address,
topmost_filenames,
reason,
app_notes,
url
order by
client_crash_date,
uuid
) as leader_uuid
from reports
where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00'
) as follower
JOIN
( select uuid, install_age, uptime, client_crash_date
FROM reports
where date_processed BETWEEN '2011-02-01 12:00:00' AND '2011-02-01 13:00:00' ) as leader
ON follower.leader_uuid = leader.uuid
WHERE ( same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
leader.uptime, follower.uptime)
OR follower.uptime < 60
)
AND
same_time_fuzzy(leader.client_crash_date, follower.client_crash_date,
leader.install_age, follower.install_age)
AND follower.uuid <> leader.uuid;
same_time_fuzzy is a SQL function I added to check deltas on install_age and uptime:
create or replace function same_time_fuzzy(
date1 timestamptz, date2 timestamptz,
interval_secs1 int, interval_secs2 int
) returns boolean
language sql as $f$
SELECT
-- return true if either interval is null
-- so we don't exclude crashes missing data
CASE WHEN $3 IS NULL THEN
TRUE
WHEN $4 IS NULL THEN
TRUE
-- otherwise check that the two timestamp deltas
-- and the two interval deltas are within 60 sec
-- of each other
ELSE
(
extract ('epoch' from ( $2 - $1 ) ) -
( $4 - $3 )
) BETWEEN -60 AND 60
END;
$f$;
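A quick worked example with hypothetical values: two reports 30 seconds apart whose install_ages differ by 25 seconds pass the check, because the two deltas agree within 60 seconds:

select same_time_fuzzy(
    timestamptz '2011-02-01 12:00:05',  -- leader client_crash_date
    timestamptz '2011-02-01 12:00:35',  -- follower client_crash_date
    120,                                -- leader install_age, in seconds
    145                                 -- follower install_age, in seconds
);
-- true: (30 s timestamp delta) - (25 s interval delta) = 5 s, within ±60 s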
Effectively, here's the translation of the new query:
For one hour of reports (noon-1pm, Feb 1), take all reports for which the following fields are identical:
product,
version,
build,
signature,
cpu_name,
cpu_info,
os_name,
os_version,
address,
topmost_filenames,
reason,
app_notes,
url
Arbitrarily designate the report with the lowest (client_crash_date, UUID) as the "leader" and all others as the "followers".
Further filter by making sure that the delta between leader and follower uptimes and install_ages is the same, within a 60-second margin of error, OR that uptime is less than 60 seconds.
This narrows things down to 412 suspected duplicates (of 250 original reports), or a possible duplicate rate of 2.5%. I think this may be narrow enough to go forward with, if people think the sample results are good enough.
Please see the other bug to examine the sample results. https://bugzilla.mozilla.org/show_bug.cgi?id=634445
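If these results hold up, the resulting pairs could then be recorded in the duplicates table discussed in comment 4; a minimal sketch, assuming the possible_dups table from comment 13 holds the query output:

insert into duplicates (ooid, duplicate_of)
select follower_uuid, leader_uuid
from possible_dups;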
Comment 16•14 years ago
Attached is an SQL file of the database changes which need to go into schema.py to support the new duplicate-finding feature. Over to you, Rob.
Assignee
Comment 17•14 years ago
Ran in dev; it seems to create the desired table and functions.
Attachment #514913 - Flags: review?(laura)
Assignee
Updated•14 years ago
Attachment #514913 - Flags: review?(lars)
Comment 18•14 years ago
Comment on attachment 514913 [details] [diff] [review]
add schema changes to schema.py
This really brings into sharp focus how inadequate the schema.py system is for maintaining the database schema. The gap between how this system was conceived to work and what it has become is now huge.
In the original spec, each table was to be defined with its own class. Embedding a table definition within another, like 'productdims_version_sort' is defined within 'productdims', was not part of how this was to work. The intent was to follow very roughly how SQLAlchemy defines table objects, with the idea that we'd leave the door open to returning to that standard sometime in the future.
I'm going to r+ this just to keep us moving with the caveat that we MUST refactor this code to a better compromise very soon. In fact, I'm going to file a bug about this issue right now.
Attachment #514913 - Flags: review?(lars) → review+
Assignee
Comment 19•14 years ago
Here's the patch as landed. No changes to the ReportsDuplicatesTable class; it just removes noise and line offsets introduced by other schema changes and fixes unit test failures.
Committed revision 2986.
Assignee
Comment 20•14 years ago
(In reply to comment #18)
> Comment on attachment 514913 [details] [diff] [review]
> add schema changes to schema.py
>
> This really brings into sharp focus how inadequate the schema.py system is
> for maintaining the database schema. The gap between how this system was
> conceived to work and what it has become is now huge.
>
> In the original spec, each table was to be defined with its own class.
> Embedding a table definition within another, like 'productdims_version_sort' is
> defined within 'productdims', was not part of how this was to work. The intent
> was to follow very roughly how SQLAlchemy defines table objects, with the idea
> that we'd leave the door open to returning to that standard sometime in the
> future.
>
> I'm going to r+ this just to keep us moving with the caveat that we MUST
> refactor this code to a better compromise very soon. In fact, I'm going to
> file a bug about this issue right now.
Agreed, thanks for filing the followup.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Reporter
Updated•14 years ago
Attachment #514913 - Flags: review?(laura)
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro