Open Bug 24957 (dupesinsearch) Opened 25 years ago Updated 10 years ago

include duplicates in search, but return only their originals (-> better search results, less duplicates)

Categories

(Bugzilla :: Query/Bug List, enhancement)

enhancement
Not set
normal

People

(Reporter: derikson3, Unassigned)

References

Details

(Whiteboard: [relations:dupl])

Attachments

(3 files)

It's sometimes very difficult to check the database for a particular bug
before filing a new bug so that a duplicate isn't created.  Sometimes the
text of the bug is too technical to match up with the search, but
bugs marked as duplicates of that bug may contain the text searched on.
A search on RESOLVED and VERIFIED and DUPLICATE might find that text, but it
will also return duplicates of fixed bugs and such.  For these reasons,
I think it'd be very beneficial to have a checkbox for including duplicates
of the bugs in the search of summary and description fields.

For example, searching for NEW or ASSIGNED or REOPENED bugs in the "bugzilla"
component with description containing "duplicate", will select all those
NEW or ASSIGNED or REOPENED bugs in the "bugzilla" component, and then
search the description of those bugs as well as their duplicates for
the text "duplicate".  If the query matches against the duplicate, then
the bug that it duplicates is returned.
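
A minimal sketch of the search described above, using Python's sqlite3 module so it can be run anywhere. The table layout is an assumption made for illustration: a `bugs` table with a `description` column and a `duplicates(dupe_of, dupe)` table. The function name `search_with_duplicates` and the sample rows are hypothetical.

```python
import sqlite3

def search_with_duplicates(conn, statuses, component, text):
    """Return ids of bugs with the given status/component whose own description,
    or the description of a bug marked as their duplicate, contains `text`."""
    marks = ",".join("?" * len(statuses))
    sql = f"""
        SELECT DISTINCT b.bug_id
        FROM bugs AS b
        LEFT JOIN duplicates AS d ON d.dupe_of = b.bug_id
        LEFT JOIN bugs AS dup ON dup.bug_id = d.dupe
        WHERE b.bug_status IN ({marks})
          AND b.component = ?
          AND (b.description LIKE ? OR dup.description LIKE ?)
        ORDER BY b.bug_id
    """
    pattern = f"%{text}%"
    rows = conn.execute(sql, [*statuses, component, pattern, pattern])
    return [row[0] for row in rows]

# Hypothetical sample data: bug 2 is a resolved duplicate of open bug 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, bug_status TEXT,
                       component TEXT, description TEXT);
    CREATE TABLE duplicates (dupe_of INTEGER, dupe INTEGER PRIMARY KEY);
    INSERT INTO bugs VALUES
        (1, 'NEW', 'bugzilla', 'query form should search related reports'),
        (2, 'RESOLVED', 'bugzilla', 'cannot find duplicate bugs when searching'),
        (3, 'NEW', 'bugzilla', 'typo on the login page'),
        (4, 'RESOLVED', 'bugzilla', 'duplicate of a fixed bug');
    INSERT INTO duplicates VALUES (1, 2);
""")
found = search_with_duplicates(
    conn, ["NEW", "ASSIGNED", "REOPENED"], "bugzilla", "duplicate")
```

Here only bug 1 comes back: its own description does not contain the word, but its duplicate's does, and the resolved bugs themselves are never listed.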

There may be a better way of achieving the same results, but something
similar may help reduce the number of duplicates submitted.
Duplicate bugs have never been represented in the database very well.
Status: NEW → ASSIGNED
tara@tequilarista.org is the new owner of Bugzilla and Bonsai.  (For details,
see my posting in netscape.public.mozilla.webtools,
news://news.mozilla.org/38F5D90D.F40E8C1A%40geocast.com .)
Assignee: terry → tara
Status: ASSIGNED → NEW
As far as I can see this should be possible by searching on all statuses with
resolution of --- or DUPLICATE.
maybe what you want is to do something clever like "if this bug is marked as a 
duplicate, save the summary off into a separate searchable table"

comment?
QA Contact: matty
Whiteboard: Future-Target
moving to real milestones...
Target Milestone: --- → Future
-> Bugzilla product, Query component, reassigning.
Assignee: tara → endico
Component: Bugzilla → Query/Bug List
Product: Webtools → Bugzilla
Version: other → unspecified
Whiteboard: Future-Target → [relations:dupl]
See also bug 105295, query should show closed duplicate bugs that have the 
main/parent bug open.
*** Bug 105295 has been marked as a duplicate of this bug. ***
I'm the reporter of the 105295 dupe.

8)

I searched for dupes, but as I searched with the words "query" and "find",
not with "search", I didn't find it...

A good example of why a solution for this is needed.

Thanks to Jesse Ruderman for pointing this out.

Below is the text of my submission:

The summary is usually limited to a few words, and there are many different
ways of naming bugs.

One of the main reasons that people post dupes is that they can't find the open
bug, because they are using the "wrong" words that aren't in the summary.

One option, turned on by default, would search the dupe bugs (all states?!) too,
but instead of showing all dupe bugs, it would show only dupes whose
main/parent bug is still open.
So if one bug has many descriptions, dupes will pop up, but they will stop once
the main bug and its dupes cover all the word combinations in their summaries.
If reporters can't find the main bug because of the "wrong" words, they will
find it via the closed dupes of open bugs, and the dupe will point to the main
open bug.
*** Bug 107982 has been marked as a duplicate of this bug. ***
NOTE: I'm not familiar with the bugzilla code nor the database structure it uses.

If the user does not explicitly select to search DUPLICATEs or if we add a
special checkbox that says something like "Find matches by resolving duplicate
targets" that is on by default (Note: the text would have to be far less
confusing than what I have =), then we do the following:

(I'm assuming that bugzilla has a separate relational table that relates bugs
and their duplicates)
We perform a JOIN between the bug number in the search results and the bug
number in the duplicates table so that the result set has a column with a bug
number of a duplicate.
(I don't know how much of a performance hit the JOIN would be in mysql)

With bugs that are duplicates, we display the target bugs in their place.
(Obviously we'd need to filter out multiple bugs that point to the same bug and
only display one in their place.)
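
A rough sketch of that JOIN-and-replace step, run through Python's sqlite3 module rather than MySQL. It assumes the duplicates table looks like `duplicates(dupe_of, dupe)`; the function name `resolve_to_targets` and the sample rows are made up for illustration.

```python
import sqlite3

def resolve_to_targets(conn, matched_ids):
    """Replace each matched duplicate with the bug it points to, listing
    every target only once. Non-duplicates are kept as they are."""
    marks = ",".join("?" * len(matched_ids))
    sql = f"""
        SELECT DISTINCT COALESCE(d.dupe_of, b.bug_id) AS shown_id
        FROM bugs AS b
        LEFT JOIN duplicates AS d ON d.dupe = b.bug_id
        WHERE b.bug_id IN ({marks})
        ORDER BY shown_id
    """
    return [row[0] for row in conn.execute(sql, list(matched_ids))]

# Hypothetical data: bugs 11 and 12 are both duplicates of bug 10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, short_desc TEXT);
    CREATE TABLE duplicates (dupe_of INTEGER, dupe INTEGER PRIMARY KEY);
    INSERT INTO bugs VALUES (10, 'original'), (11, 'dupe a'),
                            (12, 'dupe b'), (13, 'unrelated');
    INSERT INTO duplicates VALUES (10, 11), (10, 12);
""")
targets = resolve_to_targets(conn, [11, 12, 13])
```

The `DISTINCT` is what filters out multiple duplicates that point to the same bug. Whether the extra JOIN is affordable on a large installation is not something this sketch can answer.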
Ironically I logged a dupe of this bug too.

I think the problem is that Bugzilla equates bug reports with bugs and this is
not the case. Bug reports should exist separately from bugs. For a given bug
there may be many reports, so it's a one to many relationship. In fact it's a
many to many relationship as a report may involve several different bugs,
although this is usually because it's a badly written report, so I would say
make it one to many and allow reports to be broken up if they need it.

Having a one-one relationship leads to the whole DUPLICATE problem. DUPLICATE is
not a valid resolution, the bug may or may not be resolved. You are actually
using the resolution field to flag that this row has some sort of parent-child
relationship with another row.

This relationship should be made explicit in the schema, with a table for bugs
and a table for reports and each report is linked to a bug and each bug is
linked to 1 or more reports.

Ideally, reports would come in and be assigned a report number, an expert can
then assess each report and either create a new bug or simply attach the report
to an existing bug (including a special type of bug called the "non-bug").
Reports should be mobile. If after some research it turns out that bug A is just
bug B in disguise then you could drop bug A and assign all its reports to bug B.
Not only do you have 1 less bug but you also have all the reported information
available from a central point.

Keyword searches are done by searching in the individual reports which then lead
back to bugs.

This eliminates DUPLICATE and now all the resolutions actually are resolutions. 

Under this scheme, bugs have a status, resolution, priority, owner etc. Reports
have a reporter, a parent bug, platform, OS, build ID and by examining the
linked reports, Bugzilla can figure out things like what platforms are affected
by the bug.

This allows you to record multiple build IDs and OS versions against a single
bug (one in each report). It also allows the possibility of other types of
reports like success reports (confirmation that the bug does not affect a given
platform). By combining bug reports and success reports, a bug can know which
platforms are and which aren't affected by the bug. Success reports were
previously entered as comments, now you can use the drop downs.
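
To make the proposal concrete, here is a small sketch of what such a schema could look like, run through Python's sqlite3 module. Every table name, column name and sample row below is an assumption made for illustration; none of it is Bugzilla's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: one row per bug, many reports per bug.
    CREATE TABLE bugs (
        bug_id INTEGER PRIMARY KEY, status TEXT, resolution TEXT,
        priority TEXT, owner TEXT);
    CREATE TABLE reports (
        report_id INTEGER PRIMARY KEY,
        bug_id INTEGER NOT NULL REFERENCES bugs(bug_id),
        kind TEXT NOT NULL,  -- 'failure' or 'success'
        reporter TEXT, platform TEXT, os TEXT, build_id TEXT,
        description TEXT);
    INSERT INTO bugs VALUES (1, 'NEW', '', 'P3', 'nobody'),
                            (2, 'NEW', '', 'P3', 'nobody');
    INSERT INTO reports
        (bug_id, kind, reporter, platform, os, build_id, description) VALUES
        (1, 'failure', 'ann', 'PC', 'Linux', 'build-a', 'crash when printing'),
        (1, 'failure', 'bob', 'Mac', 'Mac OS', 'build-b', 'print dialog hangs the browser'),
        (1, 'success', 'cat', 'Sun', 'Solaris', 'build-b', 'printing works fine here'),
        (2, 'failure', 'dan', 'PC', 'Windows', 'build-a', 'bookmarks are lost on restart');
""")

def find_bugs(conn, keyword):
    """Keyword search runs over the reports and leads back to bugs."""
    rows = conn.execute(
        "SELECT DISTINCT bug_id FROM reports "
        "WHERE description LIKE ? ORDER BY bug_id",
        (f"%{keyword}%",))
    return [row[0] for row in rows]

def platforms(conn, bug_id, kind):
    """Platforms named in the reports of one kind attached to a bug."""
    rows = conn.execute(
        "SELECT DISTINCT platform FROM reports "
        "WHERE bug_id = ? AND kind = ? ORDER BY platform",
        (bug_id, kind))
    return [row[0] for row in rows]

affected = platforms(conn, 1, "failure")
unaffected = platforms(conn, 1, "success")
```

With this layout, `find_bugs(conn, "hangs")` finds bug 1 through bob's report even though no other report uses that word, which is the point of searching reports instead of bugs.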

Yes, I know this is heaps of work and I'm not sure how (or even if) you can
migrate from one scheme to the other, I'm just throwing in my .02 euro.
*** Bug 151964 has been marked as a duplicate of this bug. ***
Is there some sort of hack to fix this on bmo ? By a rough estimate, this is
causing around 20% of the dupes. That translates into thousands of the UNCO bugs
sitting there.
Alias: dupesinsearch
duplicates should be searched for by default!
(the proposed checkbox should start checked.)
this would significantly decrease the number of dupes.
think of a dup as a pointer or symlink; 
it should be treated as an aide in searching and indexing.
...isn't much other use for them ;^)

perhaps we need some better way of identifying where they point to on buglists,
like having the buglist link point to the comment of the dup'ed bug (which
should probably mirror the dup's summary and comment count).
...and why is the cool <a ... title=bugname> link-hover feature on the buglist?
I started to work today on a patch for this. The essential idea is:
1) Add an 'include duplicates' checkbox under the 'Status' field of the query
   form.
2) If it's checked, then add the following to the query:
   i) in the tables list, add:
         LEFT JOIN duplicates ON duplicates.dupe = bugs.bug_id
         LEFT JOIN bugs parentbug ON parentbug.bug_id = duplicates.dupe_of
   ii) in the where clause, change
         (bugs.bug_status IN (...))
       to
         (bugs.bug_status IN (...) OR parentbug.bug_status IN (...))

So the meaning for the user is, I believe, very simple and straightforward: if
they select NEW, ASSIGNED, and REOPENED and check "include duplicates", it means
"search for bugs that are NEW, ASSIGNED, REOPENED, or duplicates of bugs that
are NEW, ASSIGNED, or REOPENED".
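
The query change outlined above can be tried out in isolation. The following is only a sketch under assumptions, not the attached patch: it uses Python's sqlite3 module in place of MySQL, a cut-down `bugs` table, and a hypothetical helper `build_status_query`.

```python
import sqlite3

def build_status_query(statuses, include_duplicates):
    """Build the status query, optionally extended as in steps i) and ii)."""
    marks = ", ".join("?" for _ in statuses)
    joins = ""
    where = f"bugs.bug_status IN ({marks})"
    params = list(statuses)
    if include_duplicates:
        # i) join each bug to the bug it is a duplicate of, if any
        joins = (
            " LEFT JOIN duplicates ON duplicates.dupe = bugs.bug_id"
            " LEFT JOIN bugs AS parentbug"
            " ON parentbug.bug_id = duplicates.dupe_of"
        )
        # ii) accept a bug if it, or the bug it duplicates, has a wanted status
        where = f"({where} OR parentbug.bug_status IN ({marks}))"
        params += list(statuses)
    sql = f"SELECT bugs.bug_id FROM bugs{joins} WHERE {where} ORDER BY bugs.bug_id"
    return sql, params

# Hypothetical data: bug 2 duplicates open bug 1, bug 3 duplicates closed bug 4.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, bug_status TEXT);
    CREATE TABLE duplicates (dupe_of INTEGER, dupe INTEGER PRIMARY KEY);
    INSERT INTO bugs VALUES (1, 'NEW'), (2, 'RESOLVED'),
                            (3, 'RESOLVED'), (4, 'RESOLVED');
    INSERT INTO duplicates VALUES (1, 2), (4, 3);
""")

def run(include_duplicates):
    sql, params = build_status_query(
        ["NEW", "ASSIGNED", "REOPENED"], include_duplicates)
    return [row[0] for row in conn.execute(sql, params)]

without_dupes = run(False)
with_dupes = run(True)
```

With the box unchecked only bug 1 is returned. With it checked, bug 2 is added because its parent is NEW, while bug 3 stays out because its parent is RESOLVED.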
OS -> All
was Linux
OS: Linux → All
Attached patch: Patch
Here it is. I have the following things to note about it:

First, that's my first larger-than-a-line Bugzilla hack. I'm far from fully
understanding Bugzilla::Search->init, Template::Toolkit, or from having a clear
general picture of Bugzilla. So, although I believe the patch works, I
understand that it may well be a bad hack. If it is, give me some clues so that
I can improve it.

Second, I can't test it adequately. On my test Bugzilla installation with six
bugs :-), it seems to work alright. I don't know how such patches are tested in
real conditions, but we may be confident that, if the new field is unchecked,
Bugzilla will work OK; except that the joins and "where" expressions may be in
a different order, the query will be exactly the same.
As far as I can tell from reading the description, this will only add immediate
dupes to the search - dupes which are one or more levels removed won't get
included. I have issues with the UI as well, and I think this will also
negatively impact performance. You are also all thinking about the search page
as a tool for bug filers and QA, and this is not its only use.

So I'm not convinced that this is a good idea...

Gerv
I don't think database performance should be too much of a consideration, the
goal should be making the database more effective (whatever that means).
Reporters and QA people find it hard to find dupes because the interface sucks.
My standard way of searching is to do a limited search for open bugs in the
right component, and if that fails, then to search open and resolved bugs in all
components using keywords in the summary. I'm doing the performance-hurting
queries anyway because I know how. Reporters that don't know how will just fail
to find anything (as they do now) and file dupes.

but this seems a rather kludgy way of solving a specific instance of a general
problem... we don't actually need to find the dupes, we just want to find the
original bug based on summary/component info from the dupes. it would make sense
 for the database to facilitate that, so it's not necessary to trawl the whole
database each time.
screenshot says 'search for duplicates' ...
as I see it, the main use of this will not be to find duplicates, 
but rather to find bugs those dups point to.  
that said, I think the text should be something like 
'search duplicates' or 'include duplicates in search'

I would also like to repeat my request that this be checked by default.
Comment 20: That's right, it does not search duplicates recursively. Not only
would this be difficult to implement and bad for performance, but, I believe,
unnecessary as well. You have bug A with 15 duplicates, and one day it is
discovered that A is a duplicate of B. I have a feeling that this happens late
in the lifetime of B, when filing duplicates (which is what we are trying to
avoid here) is not so important. But it might be good practice to change the 15
dups so that they are dups of B instead of A. (I have no opinion on the more
general remarks.)

Comment 22: The default for bugzilla.mozilla.org is the administrator's
decision; the default query is specified in the bugzilla operating parameters
page. What _is_ the developers' decision is the default for new Bugzilla
installations, which I believe should be off; that's why I did not change
defparams.pl.
> I don't think database performance should be too much of a consideration

Believe me, in the real world, database performance is very much a
consideration. Or are you happy with the speed of bugzilla.mozilla.org? :-|

Gerv
The problem is the structure of the database. Right now Bugzilla thinks a bug 
and a bug report are the same thing. They're not. Some reports are not bugs at 
all and some bugs have many reports.

If there was a table for bugs and table for reports and a 1 to many 
relationship, that'd remove the need for recursive queries.

A new report comes in, someone who knows what they're doing attaches it to an 
existing bug or creates a new one for it.

Obviously dups will still occur by accident but now you can just transfer all 
the reports from the dup to the original bug and close the dup.

It's a big change but it seems to me that the dup issue is getting bigger all 
the time and distinguishing between bugs and reports of bugs corresponds better 
to the reality that bugs and bug reports are 2 very different things.

For even more about this see comment number 12 earlier.
You're right, the two concepts should be separated. By the way, this would also
solve bug 121805 ... But I guess migrating existing bugs into that new structure
would be a little bit tricky.
Re #26. I think the migration would require careful work but is kinda 
straightforward. As things stand every (or almost every) field which is filled in when 
creating a new bug is really a bug report field - OS, version, description of 
symptoms, how to reproduce etc.

All other fields belong to the bug, like status, votes, assigned, resolution.

For migration:

1: Go through all the non-dup bugs and add a row in the BUGS table and the 
REPORTS table, filling the fields in the new tables with data from the old 
table. Link the row in the REPORTS table to the row in the BUGS table.

2: Go through all dup bugs and add a row in the REPORTS table for each one. Do 
not create an entry in the BUGS table. Find the "master bug" that this is a 
duplicate of and link the report to the master bug in the BUGS table.

When converting dups, you will have to throw away some information, basically 
anything that doesn't fit into the bug_reports table but if the fields have 
been chosen well then nothing important will be lost. For some info, like 
votes, it can be added to the master bug.

This scheme also allows for new kinds of reports like success reports, 
analysis, extra symptoms etc, solving #121805 also.
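
The two migration steps above could look roughly like this. It is a sketch under several assumptions: the table names `old_bugs`, `new_bugs` and `reports` are invented, only a few columns are carried over, and every duplicate is assumed to point directly at a non-duplicate master bug (chains of duplicates are not handled).

```python
import sqlite3

def migrate(conn):
    """Split old_bugs into new_bugs and reports, following steps 1 and 2."""
    old = conn.execute("""
        SELECT o.bug_id, o.status, o.votes, o.os, o.description, d.dupe_of
        FROM old_bugs AS o
        LEFT JOIN duplicates AS d ON d.dupe = o.bug_id
        ORDER BY o.bug_id
    """).fetchall()
    # Step 1: every non-dup bug gets a row in new_bugs plus one linked report.
    for bug_id, status, votes, os_name, description, dupe_of in old:
        if dupe_of is None:
            conn.execute("INSERT INTO new_bugs VALUES (?, ?, ?)",
                         (bug_id, status, votes))
            conn.execute(
                "INSERT INTO reports (bug_id, os, description) VALUES (?, ?, ?)",
                (bug_id, os_name, description))
    # Step 2: every dup becomes only a report on its master bug; its votes
    # are added to the master. The dup's own status is thrown away.
    for bug_id, status, votes, os_name, description, dupe_of in old:
        if dupe_of is not None:
            conn.execute(
                "INSERT INTO reports (bug_id, os, description) VALUES (?, ?, ?)",
                (dupe_of, os_name, description))
            conn.execute(
                "UPDATE new_bugs SET votes = votes + ? WHERE bug_id = ?",
                (votes, dupe_of))
    conn.commit()

# Hypothetical data: bug 2 is a duplicate of bug 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_bugs (bug_id INTEGER PRIMARY KEY, status TEXT,
                           votes INTEGER, os TEXT, description TEXT);
    CREATE TABLE duplicates (dupe_of INTEGER, dupe INTEGER PRIMARY KEY);
    CREATE TABLE new_bugs (bug_id INTEGER PRIMARY KEY, status TEXT,
                           votes INTEGER);
    CREATE TABLE reports (report_id INTEGER PRIMARY KEY, bug_id INTEGER,
                          os TEXT, description TEXT);
    INSERT INTO old_bugs VALUES
        (1, 'NEW', 2, 'Linux', 'search misses the bug'),
        (2, 'RESOLVED', 3, 'Windows', 'cannot find the bug via query'),
        (3, 'NEW', 0, 'All', 'typo on the login page');
    INSERT INTO duplicates VALUES (1, 2);
""")
migrate(conn)
migrated_bugs = conn.execute(
    "SELECT bug_id, status, votes FROM new_bugs ORDER BY bug_id").fetchall()
reports_on_bug_1 = conn.execute(
    "SELECT COUNT(*) FROM reports WHERE bug_id = 1").fetchone()[0]
```

After the run, bug 1 carries both reports and the combined votes, and bug 2 no longer exists as a bug of its own.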
Strictly speaking, the discussion about separating bugs from bug reports is
off-topic. It is true that this bug, and bug 121805, and bug 145588, and the
idea about separating reports from bugs, are four different attempts to solve
the same problem: "Do something to reduce duplicates." I think I'll start a
discussion on netscape.public.mozilla.webtools about this. I briefly searched
for such a discussion and found
http://groups.google.com/groups?hl=el&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=aa4c53a552207cba&seekm=an26vi%24dl0%241%40news5.svr.pol.co.uk#link1,
which the people new to this bug may want to read.

Meanwhile, I think it would be nice if we focused this discussion on including
duplicates in search. Most of the on-topic discussion has been skepticism on
performance impact. The thing is that it is impossible to know the performance
impact just by looking at the query. Even if someone can tell that turning on
the duplicates option will take three times more machine time, what does this
tell us? That machine response time would become nine times longer? For all we
know, the option might help people find what they're searching for with fewer
queries, thus actually improving performance.

So, if we agree on the user interface, if we agree that the feature may be
worthwhile, and our only problem is performance, let's put it there, have it off by
default, mark it experimental, discourage its use, and try it. See how much
longer it takes, how much better the returned results are, and make some
better-informed guesses as to its implications.
Re #28 - off-topicness. I disagree. Separating bugs and bug reports allows you 
to eliminate the whole concept of duplicates. With no dups, there's no need to 
include them in searches.

The triagers now deal with reports. When the report is for a previously 
unknown problem they create a new bug which will have the report attached. When 
someone reports an already known problem, the triager just attaches the report 
to the existing bug. This second action is what used to lead to a dup but now 
just leads to an extra report attached to the bug.

The only problem is if 2 existing bugs A and B later turn out to be different 
aspects of a single bug or if a triager creates bug B from a report when they 
should have attached it to bug A. Then we need to merge A and B together. The 
merging process would attach all the reports from bug B to bug A and delete 
bug B. Maybe some other data needs to be merged also, like votes and CCs. Also 
whoever is merging may have to choose a status and who to assign it to. This 
merging process should be fairly rare if bug reports are assigned correctly in 
the first place.

Keyword searches are done on fields in the bug reports and since all the 
reports about bug A are attached to bug A there's no need to worry about dups. 
The performance should be no worse than before and now no one needs to even 
think about dups when doing their query.
Attachment #108780 - Flags: review?
Re: searching duplicates recursively, that's what bug 68611 would fix (by
storing the end of the duplicate chain in addition to the next dupe in the
chain), at which point this fix could switch over to using that "end of chain
ID" instead of the "dupe_of" ID.  See also bug 204209 about simplifying the
duplicates schema, since the current schema is unnecessarily complex.
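
For illustration, here is how the "end of chain ID" for a bug could be computed from the existing dupe/dupe_of pairs. This is a sketch, not what bug 68611 specifies: the mapping `dupe_of` (duplicate id to the id it was marked a duplicate of) and the function name `chain_end` are assumptions.

```python
def chain_end(dupe_of, bug_id):
    """Follow dupe_of links from bug_id until reaching a bug that is not
    itself marked as a duplicate, and return that bug's id."""
    seen = {bug_id}
    while bug_id in dupe_of:
        bug_id = dupe_of[bug_id]
        if bug_id in seen:
            # Guard against bugs marked as duplicates of each other.
            raise ValueError("cycle in duplicate chain")
        seen.add(bug_id)
    return bug_id

# Hypothetical chain: 3 is a dupe of 2, which is a dupe of 1.
links = {3: 2, 2: 1}
end_for_3 = chain_end(links, 3)
```

Storing this value alongside each duplicate would let a search jump straight to the final original with a single join instead of walking the chain at query time.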

Re: performance, performance matters.  This feature needs to be disablable for
installations that can't afford it (f.e. b.m.o at the moment; I hope that b.m.o
becomes an installation that can afford it, but that depends on us getting new
machines, how fast they are, and how much our traffic grows).

Re: the UI, I haven't had a chance to look at it closely yet, but in general I
agree with Michael that the goal of this fix should be to find the original bugs
to which the duplicates refer, not the duplicates themselves, and the UI should
reflect that.  Ideally the user shouldn't have to select an option at all,
Bugzilla should just do the duplicates search and return the matching original
bugs, but we would have to be careful to limit this to a known set of fields for
which it doesn't matter if we find duplicates (i.e. if someone searches for open
bugs assigned to a particular person, we don't want to find open bugs assigned
to someone else but which have duplicates assigned to the person being searched
for).  Perhaps this means that we should limit this feature to searches where
it's clear that the user is looking for a specific bug via non-specific criteria
(i.e. key words) ala bug 145588.

Re: the implementation, again, I haven't looked closely yet, but MySQL 4.0
provides UNIONs, which would probably make this much easier (especially for
returning originals instead of duplicates) and more performant.  We will
probably start requiring MySQL 4.0+ soon after Bugzilla 2.18 ships (bug 204217),
which is likely to happen in the next few months, so it's worthwhile looking
into doing this with UNION, i.e.:

(SELECT <columns> FROM bugs WHERE <conditions>)
UNION
(SELECT <columns> FROM bugs
   INNER JOIN duplicates ON bugs.bug_id = duplicates.dupe
   INNER JOIN bugs AS originals ON duplicates.dupe_of = originals.bug_id
 WHERE <conditions>)
ORDER BY <order columns>;
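
A filled-in version of that UNION, as a sketch only. It runs against SQLite through Python's sqlite3 module rather than MySQL 4.0, so it is written without the parentheses around each SELECT, which SQLite's compound SELECT syntax does not accept. The columns, the `short_desc` keyword condition and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

UNION_SQL = """
    SELECT bugs.bug_id AS bug_id, bugs.short_desc AS short_desc
    FROM bugs
    WHERE bugs.bug_status IN ('NEW', 'ASSIGNED', 'REOPENED')
      AND bugs.short_desc LIKE :pattern
    UNION
    SELECT originals.bug_id, originals.short_desc
    FROM bugs
    INNER JOIN duplicates ON bugs.bug_id = duplicates.dupe
    INNER JOIN bugs AS originals ON duplicates.dupe_of = originals.bug_id
    WHERE originals.bug_status IN ('NEW', 'ASSIGNED', 'REOPENED')
      AND bugs.short_desc LIKE :pattern
    ORDER BY bug_id
"""

def search_originals(conn, keyword):
    """Open bugs matching the keyword, plus open originals whose duplicates
    match it. UNION removes originals found by both halves."""
    rows = conn.execute(UNION_SQL, {"pattern": f"%{keyword}%"})
    return [row[0] for row in rows]

# Hypothetical data: 2 duplicates open bug 1, 4 duplicates closed bug 5.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, bug_status TEXT,
                       short_desc TEXT);
    CREATE TABLE duplicates (dupe_of INTEGER, dupe INTEGER PRIMARY KEY);
    INSERT INTO bugs VALUES
        (1, 'NEW', 'search misses closed reports'),
        (2, 'RESOLVED', 'query cannot find dupes'),
        (3, 'NEW', 'query page is slow'),
        (4, 'RESOLVED', 'query crash'),
        (5, 'RESOLVED', 'crash');
    INSERT INTO duplicates VALUES (1, 2), (5, 4);
""")
ids = search_originals(conn, "query")
```

The second SELECT takes its columns from `originals`, so the result lists bug 1 (found through its duplicate) and bug 3 (a direct match), but neither the duplicates themselves nor the closed bug 5.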
Hardware: PC → All
Comment on attachment 108780
Patch

Bitrotten, although I'm surprised how little.
Attachment #108780 - Flags: review? → review-
Assignee: endico → nobody
Depends on: 204217
Attached patch: work in progress
Here's a work in progress that returns duplicates for fulltext searches,
attempting to aggregate the relevances of the duplicate bugs so that bugs with
more duplicates show up higher in the list.  A test of this functionality is
available on b.m.o by running a fulltext search and then changing "buglist.cgi"
in the URL to "buglist-bm.cgi".
i submitted a duplicate of this bug :O

Basically, if we did a search of the duplicates, and then ranked the bugs that
the duplicates pointed to by the number of duplicates matching the user's
query, we'd have a nice system..
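
A tiny sketch of that ranking, assuming the search has already produced a list of matching bug ids and that `dupe_of` maps each duplicate to its original. Both names, and the choice to count a direct match on the original as one hit, are assumptions for illustration.

```python
from collections import Counter

def rank_originals(matched_ids, dupe_of):
    """Map every matched bug to its original and rank the originals by how
    many matches point at them (ties broken by bug id)."""
    hits = Counter(dupe_of.get(bug_id, bug_id) for bug_id in matched_ids)
    return sorted(hits, key=lambda bug_id: (-hits[bug_id], bug_id))

# Hypothetical data: 11, 12, 13 duplicate bug 10; 21 duplicates bug 20.
dupe_of = {11: 10, 12: 10, 13: 10, 21: 20}
ranking = rank_originals([21, 20, 11, 12, 13, 30], dupe_of)
```

Bug 10 comes first with three matching duplicates, then bug 20 with two hits, then the lone match 30.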

another way of dealing with this would be to take the summaries of duplicate
bugs, and attach their extra words to the summary of the original bug.

I don't know which one would be easier. But I'm overwhelmed by the numbers of
duplicates and i'm only an end user!
*** Bug 249372 has been marked as a duplicate of this bug. ***
This bug seems to be the closest match for my suggestion so I'll start here. I
think the easiest way to include dups is to add a checkbox to the Quicksearch:

Enter a bug # or some search terms:
__________________ [ Show ] [Help]
[ ] Include RESOLVED, VERIFIED, and CLOSED bugs.


This doesn't resolve the whole issue of tracking and backreferencing dups in a
search, but it'd be an easy fix since it'd include all bugs (including WONTFIX,
INVALID, DUPLICATE, etc.).
*** Bug 216360 has been marked as a duplicate of this bug. ***
QA Contact: mattyt-bugzilla → default-qa
Target Milestone: Future → ---
Assignee: nobody → query-and-buglist
Priority: P3 → --
The benefit of this long-standing bug is immediately obvious:
-> easier to find just the right bug as we exploit the potential of duplicate bugs' data to find their originals (while not cluttering results with duplicates)
-> less duplicates will be filed

Therefore, I'd suggest changing the following flags:
Priority: P1 or P2
Target Milestone: something as near as possible
Summary: include duplicates in search → include duplicates in search, but return only their originals (-> better search results, less duplicates)