145588 - Search query formation requires expert a priori knowledge of reported bugs - should support fulltext indexes

Reporter

Description

•

23 years ago

When searching the database for existing bug reports, a user has to formulate their searches such that all possible variants of their search terms are included. Whether searching for errors with table presentation under Show All Tags or for "animated gif", the user has to enter prefixes rather than natural language terms. For example, with "animated" the user is forced to remember that this must be searched for as "animat" in case someone entered "animation" instead of "animated". These are simple cases, and they have led to duplicate entries being made. Bugzilla requires a thesaurus or controlled vocabulary feature of the kind provided by the better text retrieval systems. Then, when the user enters "animated", a search is made for "animated or animation or animate", depending upon what Use For/Related terms are entered for animated/animation.
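The expansion requested above could look roughly like the following Python sketch. The `THESAURUS` table, its entries, and the function names `expand_term` and `to_query` are hypothetical and exist only for illustration; nothing here reflects an actual Bugzilla feature.

```python
# Hypothetical thesaurus: each key maps to its Use For/Related terms.
# The entries are illustrative only, not taken from a real vocabulary.
THESAURUS = {
    "animated": ["animation", "animate"],
    "animation": ["animated", "animate"],
}


def expand_term(term):
    """Return the term followed by its related terms, without duplicates."""
    expanded = [term]
    for related in THESAURUS.get(term.lower(), []):
        if related not in expanded:
            expanded.append(related)
    return expanded


def to_query(term):
    """Join the expanded terms into an OR expression for display."""
    return " or ".join(expand_term(term))


print(to_query("animated"))  # animated or animation or animate
```

A term with no thesaurus entry passes through unchanged, so the searcher loses nothing when the vocabulary is incomplete.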

Jouni Heikniemi

Updated

•

23 years ago

Severity: normal → enhancement

OS: Linux → All

Hardware: PC → All

Summary: Search query formation requires expert a priori knowledge of reported bugs → Search query formation requires expert a priori knowledge of reported bugs

Matthew Tuck [:CodeMachine]

Comment 1

•

23 years ago

Hmm, was just thinking of this tonight with regard to customised/customized, and then there's all sorts of synonyms. We obviously can't do a perfect job here (words with multiple meanings, anyone?), but we can possibly make things a bit better. There's no problem with converting words into the canonical form for their word group, but we need what we search to be in canonical form too, so we either need special MySQL support (and I doubt there is any in MySQL, let alone in all the databases we intend to support eventually), or we maintain a copy of every comment and status in canonicalised form, which is a large waste of space. Then there's the fact that this could theoretically bring up bugs where the reason they matched is not obvious, and this will result in bug reports if it happens. Possible ways of dealing with this are smaller groups (which could result in a reduction of usefulness), requiring this feature to be turned on, or highlighting the found words in the text like some search engines do. If you have some implementation knowledge from other systems, by all means I'd like to hear it.
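To illustrate the highlighting idea, here is a minimal Python sketch that marks every word belonging to the same word group as the search term, so the reason for a match is visible. The `CANONICAL` table, the bracket markers, and the function names are assumptions made for this example.

```python
import re

# Hypothetical word groups: every spelling maps to one canonical form.
CANONICAL = {
    "customised": "customize",
    "customized": "customize",
    "customise": "customize",
    "customize": "customize",
}


def canonical(word):
    """Map a word to its group's canonical form (itself if ungrouped)."""
    word = word.lower()
    return CANONICAL.get(word, word)


def highlight(text, query_word):
    """Bracket every word in text whose canonical form matches the query's."""
    target = canonical(query_word)

    def mark(match):
        word = match.group(0)
        return "[" + word + "]" if canonical(word) == target else word

    return re.sub(r"[A-Za-z]+", mark, text)


print(highlight("Customised toolbar is lost; customized menus survive.", "customize"))
```

Both the query word and the words in the text go through the same `canonical` function, which is the requirement stated in the comment: what is searched must be in canonical form too.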

Priority: -- → P3

Target Milestone: --- → Future

Trevor Jenkins

Reporter

Comment 2

•

23 years ago

The specific problem is with the use of search terms. I don't believe that any "canonical form" is either necessary or desirable. Because the user base is drawn from a very wide range of experience and languages, some form of normalisation would be useful. The conventional way of doing this is by creating a thesaurus. (See ANSI/NISO Z39.19 Monolingual Thesaurus Creation Standard at http://www.niso.org/standards/resources/Z39-19.html for specifics.) Bug reporters would continue to enter their reports in whatever fashion suits them best. However, when someone (else) searches for these reports, searches are not made for terms directly but are routed via a thesaurus, where synonyms (and possibly broader terms) can be added to the search criteria. Some users might like to switch this feature off because it will be slower. Rather than using MySQL (or any other system), a faster way would be to use an inverted file system. Careful selection of techniques can provide an almost O(1) search of the inverted list. As to whether I have implementation knowledge ... well, after 25 years in the text retrieval market I know a thing or two.
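For readers unfamiliar with the term, an inverted file maps each word to the list of documents that contain it. The Python sketch below is one simple way to build such a structure. It uses a hash table (a Python `dict`), which gives average-case constant-time lookup per term; the comment does not say which techniques are meant, so this choice is an assumption. The sample comments and function names are made up.

```python
import re
from collections import defaultdict


def build_inverted_index(comments):
    """Map each lowercased word to the set of bug ids whose text contains it."""
    index = defaultdict(set)
    for bug_id, text in comments.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(bug_id)
    return index


def search_any(index, terms):
    """Return bug ids containing any of the terms (one dict lookup per term)."""
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    return sorted(hits)


# Made-up sample data for illustration.
comments = {
    1: "Animated GIF stops after first frame",
    2: "Animation flickers in table cells",
    3: "Table borders missing under Show All Tags",
}
index = build_inverted_index(comments)
print(search_any(index, ["animated", "animation", "animate"]))  # [1, 2]
```

The original comment text is stored once and left untouched; only the word-to-bug mapping is added, and synonyms supplied by a thesaurus simply become extra lookups.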

Matthew Tuck [:CodeMachine]

Comment 3

•

23 years ago

What you say about adding search terms is true, but I didn't mention it because I was thinking it would only work for "any words" searches. After thinking about how MySQL does this, though, it would just be ORs nested within ANDs for "all words" searches. I suppose we could put a copy of every comment on a file system or something, but that's a hefty amount of disk space and a radical departure from the way Bugzilla currently works. And I have no idea what an inverted file system is. If you mean a full text index, MySQL has this feature I think, although I don't think Bugzilla uses it, and our comment searching is pretty slow because of it. Unfortunately MySQL doesn't handle ORs very well in its optimiser (as evidenced by past slow queries), so this sort of thing is potentially really slow even if we do use full text indexes, which will likely get promptly ignored in the presence of ORs. This really is the sort of thing you'd want to be on by default, but I can't see that happening yet. Quicksearch on the front page already puts a strain on the server apparently. For this to be useful the word groups probably have to be a reasonable size, and it feels like the search terms list would be about 4 times bigger.
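The "ORs nested within ANDs" shape for an "all words" search can be sketched as follows in Python. The table name `comments`, the column name `comment_text`, the synonym groups, and the helper `all_words_where` are all hypothetical, and an in-memory SQLite database stands in for MySQL purely so the example runs on its own. It says nothing about how MySQL's optimiser would treat the query.

```python
import sqlite3


def all_words_where(groups, column="comment_text"):
    """Build a WHERE clause: AND across word groups, OR within each group."""
    clauses, params = [], []
    for group in groups:
        ors = " OR ".join(f"{column} LIKE ?" for _ in group)
        clauses.append(f"({ors})")
        params.extend(f"%{word}%" for word in group)
    return " AND ".join(clauses), params


# Each inner list is one user-entered word plus its hypothetical synonyms.
groups = [["animated", "animation"], ["gif", "image"]]
where, params = all_words_where(groups)

# In-memory SQLite stands in for the real database in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (bug_id INTEGER, comment_text TEXT)")
db.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        (1, "animated gif freezes"),
        (2, "animation of image is slow"),
        (3, "gif will not load"),
    ],
)
rows = db.execute(
    f"SELECT bug_id FROM comments WHERE {where} ORDER BY bug_id", params
).fetchall()
print(where)
print([r[0] for r in rows])  # [1, 2]
```

With two words and two terms per group the clause already holds four `LIKE` conditions, which gives a feel for how quickly the search terms list grows with group size.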

Andreas Franke (gone)

Comment 4

•

23 years ago

We should not rely on a full text index provided by MySQL if we have any plans to support other databases. Full text indexing using Porter stemmers or something like this is not too hard; I have done that as an exercise for our information retrieval lecture, though not in Perl. A full text index only increases the database space used by a factor of two, at maximum.

> Quicksearch on the front page already puts a strain on the server apparently.

I would like to see some facts backing up this claim.
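As a rough illustration of a database-independent full text index built on stemming, the Python sketch below indexes bug ids by word stem. The stemmer here is a deliberately crude suffix stripper and is not the Porter algorithm, which uses several ordered rule phases with conditions on the stem. The suffix list, sample data, and function names are assumptions for the example.

```python
import re
from collections import defaultdict

# Crude suffix list for illustration only. This is NOT the Porter algorithm.
SUFFIXES = ("ations", "ation", "ated", "ates", "ate", "ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(comments):
    """Index bug ids by stem so that word variants share one entry."""
    index = defaultdict(set)
    for bug_id, text in comments.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[crude_stem(word)].add(bug_id)
    return index


# Made-up sample data for illustration.
comments = {
    1: "Animated GIF loops forever",
    2: "Animation stalls",
    3: "GIF is corrupt",
}
index = build_stem_index(comments)
# The query word is stemmed the same way as the indexed text.
print(sorted(index[crude_stem("animate")]))  # [1, 2]
```

Because the index stores stems and bug ids rather than copies of the comments, a search for "animate" finds reports that say "animated" or "animation" without the user entering a prefix.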

Matthew Tuck [:CodeMachine]

Comment 5

•

23 years ago

> I would like to see some facts backing up this claim.

Well, I don't know; I thought I heard that. Adding Myk. Given there are no indexes that I can see, I'd be surprised if it wasn't true.

work in progress | 22 years ago | Myk Melez [:myk] [@mykmelez] | 5.97 KB, patch
work in progress #2 | 22 years ago | Myk Melez [:myk] [@mykmelez] | 9.16 KB, patch
patch v1: implementation | 22 years ago | Myk Melez [:myk] [@mykmelez] | 11.65 KB, patch
patch v2: fix for syntax error | 22 years ago | Myk Melez [:myk] [@mykmelez] | 11.65 KB, patch
patch v3: fixes for issues raised and others | 22 years ago | Myk Melez [:myk] [@mykmelez] | 20.43 KB, patch | bbaetz: review-
patch v4: review updates | 22 years ago | Myk Melez [:myk] [@mykmelez] | 21.12 KB, patch | bbaetz: review+
patch v5: review updates | 22 years ago | Myk Melez [:myk] [@mykmelez] | 21.16 KB, patch | myk: review+