Closed Bug 22353 Opened 22 years ago Closed 11 years ago
Automatic duplicate bug detection
It would be good for someone interested in AI to try and determine whether a bug report has duplicate bugs in the database. I'm thinking you could look at reports for common fragments, a lot of words in common, and possibly os/platform/product/component. Firstly, it could run in the background and generate a report on existing bugs for someone to look at (obviously it wouldn't generate the same combination again if it could, but this probably implies upper limit n squared space). Secondly, it could be used at report time to prevent the bugs being reported. It would display a list of bugs that could possibly be duplicates. The false positive count would have to be low though, such that most of the time you wouldn't get this screen. This is to prevent annoyance, as well as any "I don't need to search, I'll just let the dupe checker find it for me" attitude. Obviously the second option would need to be quite quick, which probably involves precomputed information. While being developed, possible duplicates on the entered bug report could be logged rather than displayed to the user, so the false positive level could be monitored while the checker is tweaked to give an acceptable false positive ratio.
This is IMHO a great idea that's been suggested by many other folks --- but I doubt anyone here has the time to implement it. So, Matty, please go implement it... ;)
funny, valeski mentioned this yesterday, cc:ing him
I don't know if AI is even needed. At least one duplicate bug in the system is one I made accidentally when there was a lot of latency between me and the server, by pressing "commit" twice. Bugzilla should at the very least block bugs from the same submitter from having the same exact title.
firstname.lastname@example.org is the new owner of Bugzilla and Bonsai. (For details, see my posting in netscape.public.mozilla.webtools, news://news.mozilla.org/38F5D90D.F40E8C1A%40geocast.com .)
Assignee: terry → tara
Status: ASSIGNED → NEW
this would be really really really hard. first, you'd have to take a pass on creating a decent ai. secondly, the ai has to work on a dataset. building a dataset would have to occur during bugzilla slow times, and would need to be fairly massive in order for the ai to work properly. can you think of some fields that an ai might use?
We'd catch the 99% case by just searching the summary field for duplicate tokens. If matches are found, they're presented to the user.
See bug 46843 for simply comparing the summary of new bugs with the summary of the last bug submitted (or the last bug submitted by the same user).
moving to real milestones...
Target Milestone: --- → Future
If someone is willing to put in the cycles to implement an appropriate neural interface I'd be willing to volunteer my brain ...
Component: Bugzilla → Creating/Changing Bugs
Product: Webtools → Bugzilla
Version: other → unspecified
-> component owner
Assignee: tara → myk
*** Bug 110626 has been marked as a duplicate of this bug. ***
Here is an idea i brought up to shiva and endico for dealing with bugs that collect _a_lot_ of dupes on b.m.o. I think that the majority of dupes aren't in these huge dupe chains but they are probably an easy place to start. We could poll through all the bugs in a dupe chain as soon as that chain has more than say 10 dupes and build a list of the most common 10 or 15 words used to describe that problem. Bugzilla could append those "key" words to the summary of the bug that is the end of the dupechain. This would probably go a long way to getting that bug into query results. Then we could run queries of unconfirmed and new bugs reported after the oldest bug in the dupechain which contain any 3 of those 10 or 15 "key" words and generate reports of "likely duplicates" for triage teams to read through and resolve where appropriate. Another idea we discussed was, for all bugs with a URL in the URL field or the long description, to trim URLs down to the base domain and generate a report that lists bugs and their summaries grouped by URL. Preliminary tests show there are a couple of quick dupes that can be knocked off with this technique.
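A minimal sketch of the dupe-chain "key" word idea above, assuming summaries are available as plain strings. The function names, the tiny stop-word list and the tokenizer are hypothetical illustrations, not part of Bugzilla.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "when", "with"}


def tokenize(text):
    """Lowercase a summary and return its alphanumeric words, minus stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def chain_key_words(summaries, top_n=15):
    """Return up to top_n words that appear in the most summaries of a dupe chain."""
    counts = Counter()
    for summary in summaries:
        counts.update(set(tokenize(summary)))  # count each word once per bug
    return [word for word, _ in counts.most_common(top_n)]


def likely_duplicates(candidates, key_words, min_shared=3):
    """Yield ids of candidate bugs whose summary shares at least min_shared key words."""
    keys = set(key_words)
    for bug_id, summary in candidates.items():
        if len(keys & set(tokenize(summary))) >= min_shared:
            yield bug_id
```

Here `candidates` would be the unconfirmed and new bugs filed after the oldest bug in the chain, mapped from bug id to summary; the output is only a report for triagers to read, not an automatic resolution.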
I've been considering taking this eventually if I can nail down a good approach, and when I get the time. ;) I've been playing with ideas similar to Asa's, but as applies to all bugs. See also bug 129459.
*** Bug 131120 has been marked as a duplicate of this bug. ***
Sigh... I'm the one responsible for the (second) ironic dupe of this bug. Anyway, now that I've drifted to the legitimate place to talk about these things, I'm first going to point out that my bug 131120 talks about a different, simpler idea, which attacks the bugs described in comment 12, so the interested are directed there ;). An independent idea: consider how effective Google usually is at finding things. Yet Google doesn't (as far as I know) use some n-th generation AI; in its basic form it's a smart keyword search with a very smart ranking engine. It also preindexes keywords to avoid having to grind through the page database on every search, which is part of how it searches over 2 billion pages in 100 milliseconds (http://www-db.stanford.edu/~backrub/google.html). Can we bring one or both of these ideas to this problem? (I'm not suggesting we actually crib the Google engine itself. But maybe the ideas of computing something like Google's PageRank, and of pre-indexing keywords, could find application to mental's work.)
There are some things that could be done that wouldn't require an AI (but there's still the problem of the data and processing time needed): - Look for similarities between summaries. - Look for similarities between URLs (both in the URL field and the description). - Extract console messages and stack traces from the description and compare them. - Extract plain old warning and error messages from the description and compare them. Of course, these things aren't trivial to do, but a lot easier than making an AI to do it.
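As a rough illustration of the URL comparison suggested above, the sketch below pulls URLs out of free text and groups bugs by host. The regular expression and the "drop a leading www." normalisation are simplifying assumptions; trimming to a true base domain would need a public-suffix list.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Simplified URL matcher (an assumption, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def hosts_in(text):
    """Return the set of host names for URLs found in free text."""
    hosts = set()
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            # Crude normalisation: drop a leading "www." only.
            hosts.add(host[4:] if host.startswith("www.") else host)
    return hosts


def group_by_host(bugs):
    """Map host -> list of bug ids mentioning it; bugs maps bug id -> text."""
    groups = defaultdict(list)
    for bug_id, text in sorted(bugs.items()):
        for host in hosts_in(text):
            groups[host].append(bug_id)
    # Only hosts shared by two or more bugs are interesting for dupe hunting.
    return {host: ids for host, ids in groups.items() if len(ids) > 1}
```

The same shape would work for console messages or stack traces, with a different extractor in place of `hosts_in`.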
KDE's bugzilla has quite an advanced duplicate detection which does a search with the bug's summary as keywords. You can get the files (simple_search*) from here: http://webcvs.kde.org/cgi-bin/cvsweb.cgi/bugs/bugz/ However, this requires a full-text index and thus is MySQL specific, so it will probably not make it into Bugzilla.
I wouldn't necessarily say that. We can have MySQL-specific features in Bugzilla, we just can't make them mandatory (and we should examine the DB-independent options before going with the DB-specific solution).
*** Bug 178574 has been marked as a duplicate of this bug. ***
*** Bug 214622 has been marked as a duplicate of this bug. ***
*** Bug 221295 has been marked as a duplicate of this bug. ***
repeating my words from the recent dupe: "It may help reduce the number of duplicates if, when a new bug is submitted, it is not accepted immediately, but rather some heuristic search finds 'top ten bugs most probably related to yours', which the user can survey and hopefully have a better chance finding possible duplicates among them than searching on his/her own. He/she can then choose to re-approve his bug entry if he/she has not found a duplicate."
*** Bug 188815 has been marked as a duplicate of this bug. ***
*** Bug 236037 has been marked as a duplicate of this bug. ***
Had I not seen it with my own eyes, it would seem incredible that this bug should have collected so many dups (each person who submitted a dup probably used the old fashioned human intelligence to look for this very bug). The solution is simple: use Bayesian analysis against Bugs that are resolved DUPLICATE and Bugs that are not, and then give a probability that the bug is a duplicate. (You could also check NEW bugs against the corpora for each component, Product keyword et cetera, and propose changes/additions) We do this in our heads whenever we see a new bug that deals with Frontpage or Download manager and Menus to determine whether it is worth doing a search for a dup; a machine could do the same job far faster and more reliably and give instant (non-punitive) feedback to the Reporter, and the lesser accuracy probably would not be missed.
*** Bug 303536 has been marked as a duplicate of this bug. ***
(In reply to comment #1) > This is IMHO a great idea that's been suggested by many other folks --- but I > doubt anyone here has the time to implement it. But if you did implement it, you would have *more* time, from not wasting so much dealing with dupes... :-) It's probably in your best interests to fix things like this first, and then you wouldn't have as much noise to deal with later. (Though maybe tagging dupes is like a therapeutic relief? Like knocking things off a list without having to do any work? But that's beside the point...) I think a good implementation would, after the reporter has submitted their bug (but still encourage them to search first), provide them with a list of bugs that have similar words, phrases, whatever, and say, "is your bug the same as any of these"? (This could also be good for comments *within* bugs.) The summary comparison would have more words and content to search with than the regular search. "But that will annoy people who post bugs all the time and just want to press post and move on!" So make it a user preference that newbs won't find immediately.
http://bugs.php.net has quite a good implementation of duplicate testing. Once you submit a bug report you get back 5 previous bug reports and they tease you saying approximately: Are you sure your report isn't a dup of one of these - we think it looks close. The amazing thing is that those reports have been quite well correlated to what I submitted. The point is, someone has gotten this duplicate detection business to work. I don't know how they do it, but if I was going to code this up I would do:
1. For each distinct word in bugzilla, have a count of how many bug reports it appears in.
2. When a new bug report is submitted, rank order each word in the report according to the counts in step 1.
3. Keep adding the least frequent words into a "pot" until a match of bugs against that pot yields fewer than 5 matches.
4. Take the last word out, repeat the query, and take 5 matches from it to present to the user.
Incorrect spellings will skew the results, so a list of common html/internet/computer/mozilla terms should evolve corresponding to correctly spelled words. Of course a 'dictionary words' table should also be around. If a word comes along that doesn't match the dictionary words, and is also not a computer term, then try to identify the most common 'one byte different' match and if one exists, it is assumed that's what the word was meant to be (for counting purposes).
Csaba Gabor from Vienna
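The four steps above can be sketched roughly as follows. This is one reading of the proposal, under the assumption that "a match against the pot" means a bug must contain every word in the pot; the spelling-correction part is left out, and all names are hypothetical.

```python
import re
from collections import defaultdict


def words(text):
    """Return the set of lowercase alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(bugs):
    """Step 1: map each word to the set of bug ids whose text contains it."""
    index = defaultdict(set)
    for bug_id, text in bugs.items():
        for w in words(text):
            index[w].add(bug_id)
    return index


def suggest(index, report, limit=5):
    """Steps 2-4: return up to `limit` bug ids matching the rarest words of the report."""
    # Step 2: rank the report's known words, rarest first (ties broken alphabetically).
    ranked = sorted((w for w in words(report) if w in index),
                    key=lambda w: (len(index[w]), w))
    matches = None
    for w in ranked:
        # Step 3: add the next word to the pot; bugs must contain every word in it.
        narrowed = index[w] if matches is None else matches & index[w]
        if len(narrowed) < limit:
            if matches is None:
                matches = narrowed  # even the rarest word alone is below the limit
            # Step 4: otherwise drop that last word and keep the previous result.
            break
        matches = narrowed
    return sorted(matches or [])[:limit]
```

Picking the lowest bug ids as the final 5 is arbitrary here; a real implementation would presumably rank the surviving matches by some relevance score.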
*** Bug 336573 has been marked as a duplicate of this bug. ***
Target Milestone: Future → ---
Just noticed this bug. Note that the Mylar rich client for Bugzilla has duplicate detection that uses Java stack trace heuristics for finding matches (assuming that you are running it in an Eclipse with the Java IDE parts installed, refer to "Automatic Duplicate Detection" on http://www.eclipse.org/mylar/doc/new.php) We enable automatic submission of bugs from error log events, so it's key that we run the duplicate detection first. At some point we would like to extend this to textual similarity along the lines of comment#28. If anyone comes up with a specification for that it would be great to see it posted here. Otherwise, if this bug gets attention we would appreciate the duplicate detection working similarly to other queries, so that we could invoke it and get the results back in RDF.
*** Bug 36525 has been marked as a duplicate of this bug. ***
I talked a bit with kiko about how Launchpad deals with this. My understanding is that essentially, it does a fulltext search against a continuously smaller set of words from the new bug's summary to determine the most likely matches. It seems to work quite well.
Summary: Duplicate bug detection. → Duplicate bug detection
Great discussion on duplicate detection. At UBC, we have been experimenting with duplicate detection based on (essentially) information retrieval ideas (actually ideas from news event chain detection). The approach is based on comparing titles/summaries of bugs and forming clusters of similar bugs. When a new bug comes in, it's possible to ask which existing bugs it might be a duplicate of. The clusters help ensure the search results returned do not contain all bugs within a duplicate chain. We've run experiments with this detector over historical data (essentially building a model with our system out of bugs to date X and then one by one adding in the next Y bugs to look for duplicates and correcting the model as we go). This detector achieves about 70% recall - i.e., if a duplicate exists, it can find it 70% of the time within 7 recommendations returned. This 70% holds across Firefox 2.0, Eclipse Platform 3.3, Apache 2.0 and Mylyn bug repositories. Others have achieved similar results. A group at Ericsson achieved 48% recall on a corporate database and triagers found it useful. A group at North Carolina led by Tao Xie has added in execution information and achieved higher recall rates (but at the disadvantage of more involvement of the triager or reporter). If there is interest, we'd be interested in having our system run for Bugzilla and collecting data about whether it's useful. We'd love user feedback to help tune the model and build a good UI (for triagers or reporters or both). (Right now we have it linked in experimentally into the Mylyn rich client.)
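For readers unfamiliar with the metric, "recall within 7 recommendations" can be computed along these lines. This is a generic sketch of the measurement, not the UBC group's evaluation code; the input shapes are assumptions.

```python
def recall_at_k(recommendations, known_duplicates, k=7):
    """Fraction of known duplicates whose true target is among the top k suggestions.

    recommendations: dict mapping bug id -> ranked list of suggested bug ids
    known_duplicates: dict mapping bug id -> id of the bug it duplicates
    """
    if not known_duplicates:
        return 0.0
    hits = sum(
        1
        for bug_id, target in known_duplicates.items()
        if target in recommendations.get(bug_id, [])[:k]
    )
    return hits / len(known_duplicates)
```

A value of 0.7 with k=7 would correspond to the "70% of the time within 7 recommendations" figure quoted above.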
Hey Gail. Sure, it'd at least be interesting to see how useful the system is. As far as running it against *this* Bugzilla, you'd have to talk to one of the bugzilla.mozilla.org admins, probably justdave on irc.mozilla.org in #bmo.
I'll try to follow up on setting up against bugzilla/bugzilla in early April.
In MySQL, we could probably just detect duplicates by OR'ing all the summary words together in a boolean fulltext search.
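A hedged sketch of what OR'ing the summary words might look like. In MySQL boolean mode, words with no operator are optional, so a space-separated list behaves like an OR. The table and column names (`bugs_fulltext`, `short_desc`) and the `%s` placeholder style are assumptions about the schema and driver, and the helper itself is hypothetical.

```python
import re


def boolean_or_query(summary, table="bugs_fulltext", column="short_desc", limit=7):
    """Return (sql, params) for an OR-style boolean fulltext search on a summary."""
    # Keep only word characters so boolean-mode operators typed by the
    # reporter (+ - * " and so on) cannot change the meaning of the query.
    terms = re.findall(r"\w+", summary.lower())
    against = " ".join(dict.fromkeys(terms))  # de-duplicate, keep first-seen order
    sql = (
        f"SELECT bug_id, MATCH({column}) AGAINST (%s IN BOOLEAN MODE) AS relevance "
        f"FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s IN BOOLEAN MODE) "
        f"ORDER BY relevance DESC LIMIT {int(limit)}"
    )
    return sql, (against, against)
```

Note that the fulltext index ignores very short words and stopwords by default, so a summary made only of such words would return nothing.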
Summary: Duplicate bug detection → Automatic duplicate bug detection
Here's a proof of concept, with the algorithm and a little test script called check-dup.pl. The first argument should be the summary you want to check for a dup (you'll have to use quotes if you want to use multiple words) and then all later arguments are products to limit the check to. Right now, the algorithm is very slow, but it seems to be very accurate. I have some ideas for how to improve performance. The question now is where do we do this? I was thinking perhaps it could be an AJAX call that's hooked in as an onchange after somebody stops typing for a bit in the summary field (or perhaps an onblur for when they leave the field).
Assignee: create-and-change → mkanat
Status: NEW → ASSIGNED
Okay, here's a high-performance version of the duplicate detector. Even on my local hard drive (which is going to be way slower than the bmo hardware) a search of the Bugzilla product for a duplicate takes only 1 second, with this code.
Attachment #442609 - Attachment is obsolete: true
Okay, the proof of concept patch has all of the backend code that's actually needed in order to implement this feature. That is, we now have automatic duplicate detection, we just don't have a UI. So now we need to talk about how we want the UI to work. For sure, we want a list of bugs and a link to those bugs, plus a button for "add me to the cc list" for each one. Probably we just need the bug number and the summary. From my experiments, showing the top 7 possible duplicates should be all we need to do. The question is: where and when? In Launchpad, they give you a summary box before having you choose anything else, which I think is nice. But that would require a bit of re-work for the bug-filing form. We could also just have the list appear as an onblur handler for the Summary field, but then there's a chance that people would miss it if it didn't return before they filed the bug. (We'd show a spinner while we were doing the search, of course--something like "Looking for possible duplicates...")
Thanks Max -- I like that you are actually pushing ahead at this. Action speaks louder than words :) For the UI and implementation I suggest thinking of user scenarios, and then potential UIs. For example I can think of three scenarios: (a) checking for potential duplicates when a bug is being submitted; (b) reviewing an existing bug and checking for duplicates of it; and (c) doing a review of the entire DB and checking for all potential duplicates. Comment 12, comment 22 and comment 28 discuss possible models around these. For (a) I can imagine the following scenario: 1. User enters a text description of a bug (like today) 2. Bugzilla returns a list of potentially duplicate (or related?) bugs, and asks the user to review these to see if this is a duplicate. 3. user reviews the list, and marks their bug as a duplicate of one of the previous entries (or not, of course) The challenge is that many bug submitters will likely skip step 3 ;-( I don't think a good UI can solve this: the challenge is that you need to read and understand the descriptions to know if they are dups or not, and that can be non-trivial. But my own opinion is that it's worth a try! Users are a surprising lot, and many of them (us!) will work really hard to Do the Right Thing - all we need to do is give them tools for doing so! Some more thinking: For (b) I can imagine the following scenario: 1. User opens a bug for review. 2. User selects to 'search for potential duplicates' 3. Bugzilla provides a list of potential duplicate (or related?) bugs 4. User reviews the potential duplicates and marks the opened bug as duplicate of these. Some other thoughts: I don't know what to do when multiple bugs in the returned list are dups of the one you started with. Mark them all as dups? And how do you ensure these dups are properly reviewed/confirmed? Also, I don't see how you determine which bug is the 'root' of the duplicate list? Last, would you also search bugs already marked as dups?
Might be a good idea: if, for example, bugzilla already has marked bugs X, Y, and Z as dups of Q, then if the new bug is detected as a potential dup of Q, X and Y then it's far likelier to be a dup than if it was detected as a dup of Q alone.
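One simple way to express that intuition is to let every matched duplicate "vote" for the bug it was duped to. The sketch below assumes a single level of duplication and takes the matched ids from some unspecified text search; the function name is hypothetical.

```python
from collections import Counter


def rank_with_dupe_votes(matched_ids, dup_of):
    """Rank candidate bugs, letting each matched duplicate vote for its target.

    matched_ids: bug ids returned by some text search (hypothetical input)
    dup_of: dict mapping a bug marked DUPLICATE to the bug it was duped to
    """
    votes = Counter(dup_of.get(bug_id, bug_id) for bug_id in matched_ids)
    # Highest vote count first; ties broken by bug id for stable output.
    return sorted(votes, key=lambda bug_id: (-votes[bug_id], bug_id))
```

With X, Y and Z marked as dups of Q, a search that matches Q, X, Y and an unrelated bug R gives Q three votes and R one, so Q is listed first.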
In case it affects the path forward, I spent a bit of today re-learning python and mocked up a keyword search of all summaries of open bugs in the core/firefox/toolkit components of b.m.o, using redis as the backingstore. 47K bugs, 470K words, 47K unique words. On my ubuntu VM on my windows desktop, I get the index loaded into redis in 40 seconds. I am doing the simplest possible thing to load the index, which is O(N) on the (word, bug) tuples, so I'm sure I can make it faster. A search on OR(tracemonkey, silverlight) returns 30 bugs in 2ms (sometimes spiking to 7ms). A search on AND("error", "message", "and", "print") returns 2 bugs in about the same. That's fast enough that you could do the search as the user typed, once per character, as long as the network could keep up. (Websockets? :-) ) No stemming yet, or removal of punctuation, and I don't maintain the structure to make it easy to do incremental updating of the index, but I don't expect them to add much time. If people are interested, I can post the code here when I get it shaped up -- it's about 90 lines of python, probably be the same in perl.
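The code from that mock-up is not posted in this bug, so the following is only an illustrative stand-in: a plain in-memory inverted index with OR and AND search, using Python sets where the mock-up used redis. The class and method names are hypothetical.

```python
from collections import defaultdict


class SummaryIndex:
    """In-memory inverted index: word -> set of bug ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, bug_id, summary):
        # No stemming or punctuation stripping, as in the mock-up described above.
        for word in summary.lower().split():
            self._postings[word].add(bug_id)

    def search_or(self, *words):
        """Bugs containing at least one of the words."""
        return set().union(*(self._postings.get(w.lower(), set()) for w in words))

    def search_and(self, *words):
        """Bugs containing every one of the words."""
        sets = [self._postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()
```

Redis sets offer analogous union and intersection operations (SUNION, SINTER), which is presumably what makes the lookups described above so cheap.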
(In reply to comment #42) > The challenge is that many bug submitters will likely skip step 3 ;-( I don't > think a good UI can solve this: Perhaps some will. But I think the UI that Launchpad uses for this is very effective, personally. It's very easy to see the dup list and hard to just skip it. > Some more thinking: For (b) I can imagine the following scenario: > [snip] I like that. I hadn't even thought of that scenario. That's probably something we'd add after this patch. First we'd do it on submission only, and then we'd add it to other places in the UI where it was useful. > Some other thoughts: I don't know what to do when multiple bugs in the returned > list are dups of the one you started with. Mark them all as dups? And how do > you ensure these dups are properly reviewed/confirmed? Also, I don't see how > you determine which bug is the 'root' of the duplicate list? Any bug marked DUPLICATE isn't returned in the list, actually. They're all changed to the bugs they're ultimately a duplicate of. > Last, would you also search bugs already marked as dups? The attached code does include searching duplicates and using them in an intelligent way.
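To illustrate what "changed to the bugs they're ultimately a duplicate of" could mean in practice, here is a small sketch that follows DUPLICATE links to the end of the chain. It is not the code from the attached patch; `dup_of` is an assumed mapping from a duplicate's id to its target's id.

```python
def ultimate_target(bug_id, dup_of):
    """Follow DUPLICATE links until reaching a bug that is not itself a duplicate."""
    seen = {bug_id}
    while bug_id in dup_of:
        bug_id = dup_of[bug_id]
        if bug_id in seen:  # defensive: stop if bad data forms a cycle
            break
        seen.add(bug_id)
    return bug_id


def collapse_matches(matched_ids, dup_of):
    """Replace each matched bug by its ultimate target, dropping repeats, keeping order."""
    return list(dict.fromkeys(ultimate_target(b, dup_of) for b in matched_ids))
```

With bug 3 duped to 2 and 2 duped to 1, a match on bug 3 is shown to the reporter as bug 1, and no bug resolved DUPLICATE appears in the list.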
(In reply to comment #43) > In case it affects the path forward, I spent a bit of today re-learning python > and mocked up a keyword search of all summaries of open bugs in the > core/firefox/toolkit components of b.m.o, using redis as the backingstore. 47K > bugs, 470K words, 47K unique words. > [snip] Except that it's using the backend database's fulltext index, the patch I've attached is actually doing the exact same thing. > A search on OR(tracemonkey, silverlight) returns 30 bugs in 2ms [snip] That's definitely way faster than the index that I have above. However, the system I've got above currently doesn't require any additional software to be installed by people using Bugzilla, which is a big deal. So perhaps after we have the basic system worked out, we could work out another system using Lucene or Sphinx or the indexing system you're using for your script.
redis barely needs installation (it really is terrifyingly simple to use and tiny -- wget & make & ./redis-server), but you know more about the characteristics of bugzilla installation goals than I do for sure. :-)
(In reply to comment #44) > (In reply to comment #42) > > The challenge is that many bug submitters will likely skip step 3 ;-( I don't > > think a good UI can solve this: > > Perhaps some will. But I think the UI that Launchpad uses for this is very > effective, personally. It's very easy to see the dup list and hard to just skip > it. Makes sense to me - just getting something up and running, and testing the approach, will be a bug success :) > > Some more thinking: For (b) I can imagine the following scenario: > > [snip] > > I like that. I hadn't even thought of that scenario. That's probably > something we'd add after this patch. First we'd do it on submission only, and > then we'd add it to other places in the UI where it was useful. Agreed - I too saw this as a follow-on 'clean-up' scenario. > > Some other thoughts: I don't know what to do when multiple bugs in the returned > > list are dups of the one you started with. Mark them all as dups? And how do > > you ensure these dups are properly reviewed/confirmed? Also, I don't see how > > you determine which bug is the 'root' of the duplicate list? > > Any bug marked DUPLICATE isn't returned in the list, actually. They're all > changed to the bugs they're ultimately a duplicate of. I don't quite understand (I think I asked too many questions at the same time ;-/). I'm thinking of the case where upon submitting the bug, bugzilla returns a list of potential dups, say A, B, and C. But what would the user do if their bug is in fact a dup of both A and C - i.e. A and C are actually dups of each other, but are not yet marked as such? In hindsight, and not having seen Launchpad, I assume for now you'd just ignore that case, let the user pick one dup, and then resolve any other potential existing duplicates in the traditional way.
Okay, here is a complete patch that includes a UI. What it does is use an "onblur" on the short_desc field on enter_bug, and displays a table of possible duplicates. It's slightly annoying to have the UI move around like that, but I don't see anywhere else to put the table. There is an installation running this code, here: https://landfill.bugzilla.org/duplicates/enter_bug.cgi
in IE 8 you get: "'key' is null or not an object" at bz_autocomplete_bundle.js line 11. the style of the "possible duplicates" header and the table itself is very different from the rest of bugzilla, and should be restyled to match.
(In reply to comment #48) > There is an installation running this code, here: > > https://landfill.bugzilla.org/duplicates/enter_bug.cgi Is it only enabled for the FoodReplicator product? I have old bugs in the Sam's Widget product, and I tried filing a new bug with both the same and a similar summary there, but the duplicates UI never appeared. I could trigger the UI in the FoodReplicator product, and although it found some possible duplicates, it didn't find my bugs in Sam's Widget (not sure if that is by design or not). Also, within the FoodReplicator product, cloning an existing FoodReplicator bug triggers the UI; I can see it both ways, but I think cloning should not trigger the duplicates UI, since the user is explicitly stating he/she wants a semi-duplicate.
Attachment #451420 - Flags: review?(dkl) → review?(bugzilla)
Comment on attachment 451420 [details] [diff] [review] v1 the IE and style issues need to be addressed
Attachment #451420 - Flags: review?(bugzilla) → review-
Okay, thanks for the review! I've fixed the style of the table, and I've made it a table row in the form instead of just a separate div in the table, so that solves the style of the header. The IE 8 problem was an extra comma in the "fields" element of the DataSource. Also, it turns out that I don't even need to specify "fields", so I got rid of it. Smokey: It's enabled for all products; there was a bug where I'd forgotten to JS-escape the string, so "Sam's" was breaking it. I've updated the test installation with the new patch.
Oh, and Smokey is right about cloning; I still have to fix that. Should be a pretty simple fix.
Okay, here it is with the cloned bug thing fixed. (So the table doesn't appear at all anymore, if you're cloning a bug.)
Looks great to me! One possible enhancement might be to highlight / decorate the duplicate row the user selects to view/preview, so that when they go back to the list they can see which one they were previewing. That way the user doesn't have to remember which item they clicked on... Is there any sense of how this performs on a large database (like bugzilla.mozilla.org), or how well it works when the text / pattern matching includes things like patches, etc, or when there are lots of possible duplicates? I guess that's good for the next set of tests :)
(In reply to comment #55) > Is there any sense of how this performs on a large database (like > bugzilla.mozilla.org)? Well, I have a copy of bmo's database locally, but I don't have the RAM or the general performance of bmo's database servers, so I can't really test it. Most likely, performance will be slightly worse than a single-word search within an individual product using the "Find a Specific Bug" search is now.
Comment on attachment 452466 [details] [diff] [review] v3 r=glob excellent.
Attachment #452466 - Flags: review?(bugzilla) → review+
Woohooo!! :-) Thanks for the review, glob! :-)
Committing to: bzr+ssh://bzr.mozilla.org/bugzilla/trunk/
modified .bzrignore
modified Bugzilla/Bug.pm
modified Bugzilla/Constants.pm
modified Bugzilla/DB.pm
modified Bugzilla/User.pm
modified Bugzilla/DB/Mysql.pm
modified Bugzilla/DB/Oracle.pm
modified Bugzilla/WebService/Bug.pm
added js/bug.js
added skins/standard/enter_bug.css
modified skins/standard/global.css
modified template/en/default/bug/create/create.html.tmpl
modified template/en/default/global/header.html.tmpl
Committed revision 7219.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
How one can disable this feature? I've searched through all the Administration menu, but with no luck. The only solution I've found was to edit template/en/default/bug/create/create.html.tmpl. Is there any other way?
(In reply to Julia Romanenkova from comment #62) > How one can disable this feature? You cannot. Not sure why one would want to disable this feature.
(In reply to Frédéric Buclin from comment #63) > (In reply to Julia Romanenkova from comment #62) > > How one can disable this feature? > > You cannot. Not sure why one would want to disable this feature. it's a totally reasonable request. in bugzilla 5.0 you'll be able to disable this feature - see bug 669535.