22353 - Automatic duplicate bug detection

Reporter

Description

•

26 years ago

It would be good for someone interested in AI to try and determine whether a bug report has duplicate bugs in the database. I'm thinking you could look at reports for common fragments, a lot of words in common, and possibly os/platform/product/component. Firstly, it could run in the background and generate a report on existing bugs for someone to look at (obviously it wouldn't generate the same combination again if it could, but this probably implies upper limit n squared space). Secondly, it could be used at report time to prevent the bugs being reported. It would display a list of bugs that could possibly be duplicates. The false positive count would have to be low though, such that most of the time you wouldn't get this screen. This is to prevent annoyance, as well as any "I don't need to search, I'll just let the dupe checker find it for me" attitude. Obviously the second option would need to be quite quick, which probably involves precomputed information. While being developed, possible duplicates on the entered bug report could be logged rather than displayed to the user, so the false positive level could be monitored while the checker is tweaked to give an acceptable false positive ratio.

Eli Goldberg

Updated

•

26 years ago

QA Contact: elig

Eli Goldberg

Comment 1

•

26 years ago

This is IMHO a great idea that's been suggested by many other folks --- but I doubt anyone here has the time to implement it. So, Matty, please go implement it... ;)

Paul MacQuiddy

Comment 2

•

26 years ago

funny, valeski mentioned this yesterday, cc:ing him

Terry Weissman

Updated

•

26 years ago

Status: NEW → ASSIGNED

Jesse Ruderman

Comment 3

•

26 years ago

I don't know if AI is even needed. At least one duplicate bug in the system is one I made accidentally when there was a lot of latency between me and the server, by pressing "commit" twice. Bugzilla should at the very least block bugs from the same submitter from having the same exact title.

Terry Weissman

Comment 4

•

26 years ago

tara@tequilarista.org is the new owner of Bugzilla and Bonsai. (For details, see my posting in netscape.public.mozilla.webtools, news://news.mozilla.org/38F5D90D.F40E8C1A%40geocast.com .)

Assignee: terry → tara

Status: ASSIGNED → NEW

Chris Yeh

Comment 5

•

25 years ago

this would be really really really hard. first, you'd have to take a pass on creating a decent ai. secondly, the ai has to work on a dataset. building a dataset would have to occur during bugzilla slow times, and the dataset would need to be fairly massive in order for the ai to work properly. can you think of some fields that an ai might use?

Judson Valeski

Comment 6

•

25 years ago

We'd catch the 99% case by just searching the summary field for duplicate tokens. If matches are found, they're presented to the user.
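A minimal sketch of the summary-token check suggested here, assuming summaries are available as a plain mapping from bug id to text. The stopword list, the `min_shared` threshold, and the bug ids are illustrative assumptions, not anything Bugzilla defines.

```python
import re

# Illustrative stopword list (an assumption); a real one would be longer.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "when", "with", "and"}

def tokens(summary):
    """Lowercase the summary and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", summary.lower()) if t not in STOPWORDS}

def candidate_duplicates(new_summary, existing, min_shared=2):
    """Return (bug_id, shared_token_count) pairs, most shared tokens first."""
    new_tokens = tokens(new_summary)
    hits = []
    for bug_id, summary in existing.items():
        shared = len(new_tokens & tokens(summary))
        if shared >= min_shared:
            hits.append((bug_id, shared))
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical bug ids and summaries, for illustration only.
existing = {
    101: "Crash when opening bookmarks menu",
    102: "Download manager shows wrong file size",
    103: "Bookmarks menu crash on startup",
}
matches = candidate_duplicates("Browser crash in bookmarks menu", existing)
```

Raising `min_shared` trades missed duplicates for fewer false positives, which is the knob the original report asks to be tuned before anything is shown to the user.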

Jesse Ruderman

Comment 7

•

25 years ago

See bug 46843 for simply comparing the summary of new bugs with the summary of the last bug submitted (or the last bug submitted by the same user).

Blake Ross

Updated

•

25 years ago

QA Contact: elig → matty

Matthew Tuck [:CodeMachine]

Reporter

Updated

•

25 years ago

Whiteboard: Future-Target

Stephan Niemz

Comment 8

•

25 years ago

moving to real milestones...

Target Milestone: --- → Future

Matthew Tuck [:CodeMachine]

Reporter

Comment 9

•

24 years ago

If someone is willing to put in the cycles to implement an appropriate neural interface I'd be willing to volunteer my brain ...

Matthew Tuck [:CodeMachine]

Reporter

Updated

•

24 years ago

Component: Bugzilla → Creating/Changing Bugs

Product: Webtools → Bugzilla

Whiteboard: Future-Target

Version: other → unspecified

Matthew Tuck [:CodeMachine]

Reporter

Comment 10

•

24 years ago

-> component owner

Assignee: tara → myk

Dave Miller [:justdave]

Comment 11

•

24 years ago

*** Bug 110626 has been marked as a duplicate of this bug. ***

Asa Dotzler [:asa]

Comment 12

•

24 years ago

Here is an idea I brought up to shiva and endico for dealing with bugs that collect _a_lot_ of dupes on b.m.o. I think that the majority of dupes aren't in these huge dupe chains but they are probably an easy place to start. We could poll through all the bugs in a dupe chain as soon as that chain has more than, say, 10 dupes and build a list of the 10 or 15 words most commonly used to describe that problem. Bugzilla could append those "key" words to the summary of the bug at the end of the dupe chain. This would probably go a long way to getting that bug into query results. Then we could run queries for unconfirmed and new bugs reported after the oldest bug in the dupe chain which contain any 3 of those 10 or 15 "key" words and generate reports of "likely duplicates" for triage teams to read through and resolve where appropriate.

Another idea we discussed was, for all bugs with a URL in the URL field or the long description, to trim URLs down to the base domain and generate a report that lists bugs and their summaries grouped by URL. Preliminary tests show there are a couple of quick dupes that can be knocked off with this technique.
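A rough sketch of both ideas, under some loud assumptions: bug text is available as plain strings, the stopword list is a tiny stand-in, and "base domain" is approximated by the URL's hostname (a real report would probably want to strip `www.` and subdomains too). All function names and bug ids here are hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Small illustrative stopword list (an assumption); a real one would be longer.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "when", "with", "while"}

def words(text):
    """Return the distinct lowercase non-stopword words in a piece of text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def chain_keywords(chain_texts, top_n=10):
    """Pick the words that appear in the most bugs of one dupe chain."""
    counts = Counter()
    for text in chain_texts:
        counts.update(words(text))  # each word counts once per bug
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]

def likely_duplicates(keywords, candidates, min_hits=3):
    """Return ids of candidate bugs containing at least min_hits key words."""
    key_set = set(keywords)
    return sorted(b for b, text in candidates.items()
                  if len(key_set & words(text)) >= min_hits)

def group_by_host(bug_urls):
    """Group bug ids by the hostname of their URL (stand-in for base domain)."""
    groups = defaultdict(list)
    for bug_id, url in sorted(bug_urls.items()):
        host = urlsplit(url).hostname
        if host:
            groups[host].append(bug_id)
    return dict(groups)

# Hypothetical data, for illustration only.
chain = ["Crash printing page with frames",
         "Printing a framed page crashes browser",
         "crash when printing frames"]
keywords = chain_keywords(chain, top_n=4)
candidates = {201: "Browser crash while printing frames",
              202: "Bookmarks do not sort",
              203: "Printing page is slow"}
suspects = likely_duplicates(keywords, candidates)
by_host = group_by_host({301: "http://www.example.com/a",
                         302: "http://www.example.com/b?x=1",
                         303: "http://example.org/"})
```

Note that "crash" and "crashes" count as different words here; some form of stemming would likely be needed before the key word lists are useful on real reports.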

mental

Comment 13

•

24 years ago

I've been considering taking this eventually if I can nail down a good approach, and when I get the time. ;) I've been playing with ideas similar to Asa's, but as applies to all bugs. See also bug 129459.

mental

Comment 14

•

24 years ago

*** Bug 131120 has been marked as a duplicate of this bug. ***

Andrew Lin

Comment 15

•

24 years ago

Sigh... I'm the one responsible for the (second) ironic dupe of this bug. Anyway, now that I've drifted to the legitimate place to talk about these things, I'm first going to point out that my bug 131120 talks about a different, simpler idea, which attacks the bugs described in comment 12, so the interested are directed there ;). An independent idea: consider how effective Google usually is at finding things. Yet Google doesn't (as far as I know) use some n-th generation AI; in its basic form it's a smart keyword search with a very smart ranking engine. It also preindexes keywords to avoid having to grind through the page database on every search, which is part of how it searches over 2 billion pages in 100 milliseconds (http://www-db.stanford.edu/~backrub/google.html). Can we bring one or both of these ideas to this problem? (I'm not suggesting we actually crib the Google engine itself. But maybe the ideas of computing something like Google's PageRank, and of pre-indexing keywords, could find application to mental's work.)
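To make the pre-indexing idea concrete, here is a small sketch of an inverted index over bug text. It assumes everything fits in memory, and the ranking is simply the number of matching query words, which is a stand-in and nothing like PageRank. The class name and bug ids are made up for illustration.

```python
import re
from collections import Counter, defaultdict

def _words(text):
    """Return the distinct lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class SummaryIndex:
    """Maps each word to the ids of the bugs containing it (built ahead of time)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, bug_id, text):
        """Index one bug; done once at submit time, not on every search."""
        for word in _words(text):
            self._postings[word].add(bug_id)

    def search(self, query):
        """Return bug ids ranked by how many query words they contain."""
        scores = Counter()
        for word in _words(query):
            for bug_id in self._postings.get(word, ()):
                scores[bug_id] += 1
        return sorted(scores, key=lambda b: (-scores[b], b))

# Hypothetical bugs, for illustration only.
index = SummaryIndex()
index.add(1, "Crash when printing")
index.add(2, "Printing is slow")
index.add(3, "Bookmarks lost on upgrade")
results = index.search("printing crash")
```

A search only touches the posting lists for the words in the query, so its cost depends on how common those words are rather than on the total number of bugs.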

Matthew Cline

Comment 16

•

24 years ago

There are some things that could be done that wouldn't require an AI (but there's still the problem of the data and processing time needed):

- Look for similarities between summaries.
- Look for similarities between URLs (both in the URL field and the description).
- Extract console messages and stack traces from the description and compare them.
- Extract plain old warning and error messages from the description and compare them.

Of course, these things aren't trivial to do, but a lot easier than making an AI to do it.
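As one example of the stack trace item, here is a hedged sketch that pulls function names out of a description and compares the top few frames. It assumes traces are pasted in a gdb-like `#N 0xADDR in function (...)` layout, which is only one of many formats that show up in real reports, and the function names in the sample data are hypothetical.

```python
import re

# Matches frame lines such as "#0  0x0804 in nsFrame::Paint (this=0x1)"
# or "#2  main ()". The frame layout is an assumption, not a Bugzilla format.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([\w:~]+)", re.MULTILINE)

def stack_signature(description, depth=3):
    """Return the function names of the top `depth` frames found in the text."""
    return tuple(FRAME_RE.findall(description)[:depth])

def same_crash(desc_a, desc_b, depth=3):
    """True if both descriptions contain a trace with identical top frames."""
    sig_a = stack_signature(desc_a, depth)
    return bool(sig_a) and sig_a == stack_signature(desc_b, depth)

# Hypothetical descriptions, for illustration only.
report_a = ("Browser died.\n"
            "#0  0x0804 in nsFrame::Paint (this=0x1)\n"
            "#1  0x0805 in nsView::Refresh ()\n"
            "#2  main ()\n")
report_b = ("Different words, same crash\n"
            "#0  0x0999 in nsFrame::Paint (this=0x2)\n"
            "#1  0x0aaa in nsView::Refresh ()\n"
            "#2  main ()\n")
report_c = "#0  0x0804 in nsSocket::Read ()\n#1  main ()\n"
```

Addresses and arguments are ignored on purpose, since they vary between runs of the same crash; only the function names are compared.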

Daniel Naber

Comment 17

•

23 years ago

KDE's bugzilla has quite an advanced duplicate detection which does a search with the bug's summary as keywords. You can get the files (simple_search*) from here: http://webcvs.kde.org/cgi-bin/cvsweb.cgi/bugs/bugz/ However, this requires a full-text index and thus is MySQL specific, so it will probably not make it into Bugzilla.

Myk Melez [:myk] [@mykmelez]

Comment 18

•

23 years ago

I wouldn't necessarily say that. We can have MySQL-specific features in Bugzilla, we just can't make them mandatory (and we should examine the DB-independent options before going with the DB-specific solution).

Dave Miller [:justdave]

Comment 19

•

23 years ago

*** Bug 178574 has been marked as a duplicate of this bug. ***

Dave Miller [:justdave]

Comment 20

•

22 years ago

*** Bug 214622 has been marked as a duplicate of this bug. ***

Christian Reis

Comment 21

•

22 years ago

*** Bug 221295 has been marked as a duplicate of this bug. ***

Eyal Rozenberg

Comment 22

•

22 years ago

repeating my words from the recent dupe: "It may help reduce the number of duplicates if, when a new bug is submitted, it is not accepted immediately, but rather some heuristic search finds 'top ten bugs most probably related to yours', which the user can survey and hopefully have a better chance finding possible duplicates among them than searching on his/her own. He/she can then choose to re-approve his bug entry if he/she has not found a duplicate."

Brant Gurganus

Comment 23

•

22 years ago

*** Bug 188815 has been marked as a duplicate of this bug. ***

Travis Chase

Comment 24

•

22 years ago

*** Bug 236037 has been marked as a duplicate of this bug. ***

Ben Fowler

Comment 25

•

21 years ago

Except that I have seen it with my own eyes, it seems incredible that this bug should have collected so many dups (each person who submitted a dup probably used the old fashioned human intelligence to look for this very bug). The solution is simple: use Bayesian analysis against Bugs that are resolved DUPLICATE and Bugs that are not, and then give a probability that the bug is a duplicate. (You could also check NEW bugs against the corpora for each component, Product, keyword et cetera, and propose changes/additions.) We do this in our heads whenever we see a new bug that deals with Frontpage or Download manager and Menus to determine whether it is worth doing a search for a dup; a machine could do the same job far faster and more reliably and give instant (non-punitive) feedback to the Reporter, and the lesser accuracy probably would not be missed.
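For what it's worth, the Bayesian idea can be sketched as a tiny naive Bayes classifier with add-one smoothing. This is only an illustration under assumptions: the class name, the training texts, and the choice of naive Bayes itself are mine, and a classifier like this estimates whether a report looks like past duplicates, not which bug it duplicates.

```python
import math
import re
from collections import Counter

def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class DupeClassifier:
    """Naive Bayes over two classes: resolved DUPLICATE (True) or not (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, is_duplicate):
        self.word_counts[is_duplicate].update(words(text))
        self.doc_counts[is_duplicate] += 1

    def probability_duplicate(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(vocab) + 1  # +1 leaves room for unseen words
            for w in words(text):
                score += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long reports.
        top = max(log_scores.values())
        odds = {k: math.exp(v - top) for k, v in log_scores.items()}
        return odds[True] / (odds[True] + odds[False])

# Hypothetical training data, for illustration only.
clf = DupeClassifier()
clf.train("download manager menu does not open", True)
clf.train("download manager menu is empty", True)
clf.train("typo in release notes", False)
clf.train("request new toolbar icon", False)
p_likely = clf.probability_duplicate("download manager menu broken")
p_unlikely = clf.probability_duplicate("typo in toolbar icon")
```

With a real corpus the same structure could be trained per component or per keyword, as the comment suggests, by keeping one classifier per label.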

Frédéric Buclin

Comment 26

•

20 years ago

*** Bug 303536 has been marked as a duplicate of this bug. ***

Jon B

Comment 27

•

20 years ago

(In reply to comment #1)
> This is IMHO a great idea that's been suggested by many other folks --- but I
> doubt anyone here has the time to implement it.

But if you did implement it, you would have *more* time, from not wasting so much dealing with dupes... :-) It's probably in your best interests to fix things like this first, and then you wouldn't have as much noise to deal with later. (Though maybe tagging dupes is like a therapeutic relief? Like knocking things off a list without having to do any work? But that's beside the point...)

I think a good implementation would, after the reporter has submitted their bug (but still encouraging them to search first), provide them with a list of bugs that have similar words, phrases, whatever, and ask, "is your bug the same as any of these?" (This could also be good for comments *within* bugs.) The comparison against the full report would have more words and content to search with than the regular search.

"But that will annoy people who post bugs all the time and just want to press post and move on!" So make it a user preference that newbs won't find immediately.

Csaba Gabor

Comment 28

•

20 years ago

http://bugs.php.net has quite a good implementation of duplicate testing. Once you submit a bug report you get back 5 previous bug reports and they tease you saying approximately: Are you sure your report isn't a dup of one of these - we think it looks close. The amazing thing is that those reports have been quite well correlated to what I submitted. The point is, someone has gotten this duplicate detection business to work. I don't know how they do it, but if I was going to code this up I would do:

1. For each distinct word in bugzilla, have a count of how many bug reports it appears in.
2. When a new bug report is submitted, rank order each word in the report according to the counts in step 1.
3. Keep adding the least frequent words into a "pot" until a match of bugs against that pot yields less than 5 matches.
4. Take the last word out, repeat the query, and take 5 matches from it to present to the user.

Incorrect spellings will skew the results, so a list of common html/internet/computer/mozilla terms should evolve corresponding to correctly spelled words. Of course a 'dictionary words' table should also be around. If a word comes along that doesn't match the dictionary words, and is also not a computer term, then try to identify the most common 'one byte different' match and if one exists, it is assumed that's what the word was meant to be (for counting purposes).

Csaba Gabor from Vienna
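The four steps above can be sketched roughly as follows. This is one reading of the comment, not how bugs.php.net actually works: a "match against the pot" is taken to mean bugs containing every word in the pot, words never seen before are skipped, and if the rarest known word alone already matches fewer than the limit, its matches are returned as they are. The spelling correction part is left out, and all names and bug ids are hypothetical.

```python
import re
from collections import defaultdict

def words(text):
    """Return the distinct lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(reports):
    """Step 1: map each word to the set of bug ids whose text contains it."""
    index = defaultdict(set)
    for bug_id, text in reports.items():
        for w in words(text):
            index[w].add(bug_id)
    return index

def suggest_duplicates(new_text, index, limit=5):
    """Steps 2 to 4: add rare words to the pot while at least `limit` bugs match."""
    # Step 2: rank the report's known words, rarest first.
    known = sorted((w for w in words(new_text) if w in index),
                   key=lambda w: (len(index[w]), w))
    matches = None
    for w in known:
        # Step 3: bugs matching the pot must contain every word in it.
        narrowed = index[w] if matches is None else matches & index[w]
        if len(narrowed) < limit:
            if matches is None:
                matches = narrowed  # assumption: keep the rarest word's matches
            break  # step 4: the word that went too far is left out of the pot
        matches = narrowed
    return sorted(matches or [])[:limit]

# Hypothetical reports, for illustration only.
reports = {
    1: "crash printing frames",
    2: "crash printing tables",
    3: "crash on startup",
    4: "printing is slow",
    5: "bookmarks lost",
}
index = build_index(reports)
suggestions = suggest_duplicates("crash while printing", index, limit=2)
```

One thing the sketch makes visible is that a single very rare word (a misspelling, say) ends the search immediately, which is presumably why the comment goes on to worry about spelling.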

Frédéric Buclin

Comment 29

•

20 years ago

*** Bug 336573 has been marked as a duplicate of this bug. ***

Olav Vitters

Updated

•

19 years ago

QA Contact: mattyt-bugzilla → default-qa

victory <never@receive.bug.mails.i.hate.spammer>

Updated

•

19 years ago

Target Milestone: Future → ---

Mik Kersten

Comment 30

•

19 years ago

Just noticed this bug. Note that the Mylar rich client for Bugzilla has duplicate detection that uses Java stack trace heuristics for finding matches (assuming that you are running it in an Eclipse with the Java IDE parts installed, refer to "Automatic Duplicate Detection" on http://www.eclipse.org/mylar/doc/new.php) We enable automatic submission of bugs from error log events, so it's key that we run the duplicate detection first. At some point we would like to extend this to textual similarity along the lines of comment#28. If anyone comes up with a specification for that it would be great to see it posted here. Otherwise, if this bug gets attention we would appreciate the duplicate detection working similarly to other queries, so that we could invoke it and get the results back in RDF.

Frédéric Buclin

Comment 31

•

19 years ago

*** Bug 36525 has been marked as a duplicate of this bug. ***

Frédéric Buclin

Updated

•

19 years ago

Assignee: myk → create-and-change

Proof of Concept 16 years ago Max Kanat-Alexander 6.78 KB, patch
Proof of Concept 2 16 years ago Max Kanat-Alexander 9.13 KB, patch
v1 15 years ago Max Kanat-Alexander 167.25 KB, patch (glob : review-)
v2 15 years ago Max Kanat-Alexander 25.30 KB, patch
v3 15 years ago Max Kanat-Alexander 25.32 KB, patch (glob : review+)