Open Bug 537856 Opened 15 years ago Updated 5 years ago

[faceted search] UI should convey that results are limited to 400 messages (may be less when duplicates present)

Categories

(Thunderbird :: Search, defect)

x86
Windows XP
defect
Not set
major

Tracking

(Not tracked)

People

(Reporter: lester, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [datalossy])

User-Agent:       Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091204 Thunderbird/3.0

Summarizing my report: gloda is mysterious and--most important for the production environment--unreliable.

Reproducible: Always

Steps to Reproduce:
1.Start using Eudora in 1994.  Accumulate ~240,000 messages in ~2300 mailboxes.  Use POP exclusively. 99% English
2.In March, 2009, import my Eudora messages to Thunderbird
3.Verify that classic Thunderbird edit / search or X1 indexes and retrieves these messages.
4.Verify that gloda indexes and retrieves messages dating back to the very beginning.  Therefore, rule out a massive mail format/importing problem.

5.Search for a rare letter combination like "extraneous".  Retrieve the correct number of messages (26), dating back to 1995.
6.Search for a common word or letter combination, such as "john"
7.Note the gloda sidebar.  It says "People . . . list all 1280"
8.note the faceted search bar.  It goes back only a year or so.
9.Note the search results line.  It says "10 of 349".

10.Repeat with another common phrase like "ee".  Note that the search results line always says "10 of nn", where ~200 < nn < ~400, and that the retrieved messages are always the most recent hits.
***IMHO, the gloda interface--not the database--has a previously undiscovered limit***
   
11.  Use another search engine.  Note that the correct retrieval for "ee" is ~ 2,000

12. Read the error console.  No errors.  But Note that every search produces 5 to 20 copies of the warning,
Warning: Cannot specify value for internal property.  Error in parsing value for '-x-system-font'.  Declaration dropped.
Source File: chrome://messenger/content/glodaFacetView.xhtml
Line: 0
13.  Note the following params, all set to their default values


mailnews.database.global.indexer.enabled;true
mailnews.database.global.logging.console;true
mailnews.database.global.logging.dump;false
mailnews.database.global.logging.net;false
mailnews.database.summary.dontPreserveOnCopy;account msgOffset threadParent msgThreadId statusOfset flags size numLines ProtoThreadFlags label gloda-id gloda-dirty
mailnews.database.summary.dontPreserveOnMove;account msgOffset threadParent msgThreadId statusOfset flags size numLines ProtoThreadFlags label


14. Delete the global database.  Wait overnight while it rebuilds (900 MB).  Retry.  Get the same inadequate search results.
15. Wonder whether "messages mentioning john" means "messages mentioning john in the body but not the headers"
16. wonder about many other undocumented, uncommented weirdnesses of gloda.
clarification: 
a) I should have said that results returned in 8 and 9 are incorrect. Correct retrieval would be several thousand, dating back to 1995.
I have clarified this report, after becoming more familiar with the gloda interface.  It's at Bug 539173.  Um . . . er . . . this is a rather serious shortcoming.  Maybe the team should follow up.
I'd suspect this is fixed but asuth can better speak to that.

Note: eudora milestones unfortunately are always months behind in picking up big thunderbird fixes. e.g. if you are using the eudora just released, it is based on thunderbird 3.0b4, and thus missing major fixes that are already in thunderbird 3.0.
Component: General → Search
QA Contact: general → search
Version: unspecified → 3.0
(In reply to comment #4)
> I'd suspect this is fixed but asuth can better speak to that.
> Note: eudora milestones unfortunately are always months behind in picking up
> big thunderbird fixes. e.g. if you are using the eudora just released, it is
> based on thunderbird 3.0b4, and thus missing major fixes that are already in
> thunderbird 3.0.

This is most definitely **not** fixed in Thunderbird 3.0.  That's why I first reported this as a Thunderbird 3.0 bug, not simply a Eudora bug.

Because the blockers linked at the Thunderbird Status Meeting "have not been made available to you" (marked in red) I can't tell whether this bug has become a blocker.  IMO, it should be; but I can't tell.

To repeat--the bug remains in Thunderbird 3.0

Because the clearer description has been marked as a duplicate, I'll repeat the clearer description here:
**Any** gloda search for "messages mentioning . . ." retrieves at most 400 hits, even if there are actually a few thousand hits.  In practice, a few less than 400.  Always the latest messages.

My email database, which has 240,000 messages, occasionally turns up a "scaling-up" bug that others have missed.  This bug seems to be such a case.
For performance reasons, the number of results retrieved by the faceted search query is intentionally limited to 400.  This is controlled in msg_search.js.  The actionable problem is that there is no UI that conveys that this limit was artificially introduced and the de-duplication code makes it very likely that the number we display is not the big-round 400, but something arbitrarily less.
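
A minimal sketch of why the displayed total tends to land below 400, assuming the limit is applied to the query before duplicates are removed. The names `RESULT_LIMIT`, `fetchLimited`, `dedupe`, and the `headerMessageId` key are illustrative assumptions, not the actual gloda code:

```javascript
// Illustrative only: names and the duplicate key are assumptions, not gloda's API.
const RESULT_LIMIT = 400;

// Keep the first `limit` hits of an already-ranked result list.
function fetchLimited(rankedHits, limit = RESULT_LIMIT) {
  return rankedHits.slice(0, limit);
}

// Collapse hits that share a message id (e.g. copies of one message in several folders).
function dedupe(hits) {
  const seen = new Set();
  return hits.filter((hit) => {
    if (seen.has(hit.headerMessageId)) return false;
    seen.add(hit.headerMessageId);
    return true;
  });
}

// 5,000 matching messages, where every 8th one is a copy of its predecessor.
const allHits = Array.from({ length: 5000 }, (_, i) => ({
  headerMessageId: `msg-${i % 8 === 7 ? i - 1 : i}`,
}));

const limited = fetchLimited(allHits);
const shown = dedupe(limited);
console.log(`${limited.length} fetched, ${shown.length} shown`);
```

Because the cut-off happens first, the count the user sees depends on how many duplicates happened to fall inside the first 400 rows, which is why it looks arbitrary.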

The annoying "-x-system-font" errors are tracked on bug 515775.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: gloda retrieves too few messages → [faceted search] UI should convey that results are limited to 400 messages (may be less when duplicates present)
Since we're bound by performance goals here I can see two possible changes in the UI that might help remove the mystery.

1) When Gloda returns 400 (the limit) hits before duplicate removal we assume that there weren't exactly 400 hits and indicate in the UI "1 of hundreds" instead of a precise number.  This would remove the real number from the UI such that we are a bit more mysterious intentionally.  However we'll also have to balance this rule with the number we get after duplicates are removed.  Seems tricky.

2) Give a small explanation of the limit at the last page of results indicating that you've paged through a couple hundred results and it's likely you should try a new search or some facets.

Google in their search results for "gloda" gives me 1 of 1,390,000 but when you start paging through you can't get past 478, the end number changes at some point during your paging.  However they do give this message at the bottom:

"In order to show you the most relevant results, we have omitted some entries very similar to the 478 already displayed"

Each of these solutions, and likely any others, are going to require new strings so this will have to be targeted at 3.1 and not 3.0
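
Option 1 could look roughly like the sketch below. It assumes the UI knows both the raw hit count (before duplicate removal) and the count after; the function name, the strings, and the `RESULT_LIMIT` constant are hypothetical and any real strings would need localization:

```javascript
// Illustrative only: function name, strings, and the limit constant are assumptions.
const RESULT_LIMIT = 400;

// rawCount: hits returned by the query before de-duplication.
// shownCount: hits left after de-duplication.
function resultSummary(pageSize, rawCount, shownCount, limit = RESULT_LIMIT) {
  const visible = Math.min(pageSize, shownCount);
  if (rawCount >= limit) {
    // The query was cut off at the limit, so the true total is unknown.
    return `${visible} of hundreds`;
  }
  // The query returned everything that matched, so the count is exact.
  return `${visible} of ${shownCount}`;
}

console.log(resultSummary(10, 400, 349));
console.log(resultSummary(10, 26, 26));
```

Keying the decision on the raw count rather than the de-duplicated count sidesteps the balancing problem mentioned in option 1: a truncated query is flagged as truncated no matter how many duplicates were dropped afterwards.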
a.  I recommend stating the precise result: "10 of the most recent 400"

b.  Now that you told me where to look, I changed it to 1000.  No performance issues until I clicked "show as list".  This increased the handle count markedly (perhaps one handle per folder with a hit, as in the classic search function?), but had no other performance issues.  We might assume that anybody desperate enough to use a global search function is willing to use a few resources to get his result.

c.  Possible to access this parameter w/ config editor?
(In reply to comment #8)
> b.  Now that you told me where to look, I changed it to 1000.  No performance
> issues until I clicked "show as list".  This increased the handle count
> markedly (perhaps one handle per folder with a hit, as in the classic search
> function?), but had no other performance issues.  We might assume that anybody
> desperate enough to use a global search function is willing to use a few
> resources to get his result.

Yes to the handle count theory.  However, the opening of the msf indices is also performed synchronously and could result in noticeable periods of unresponsiveness for very large folders or in the aggregate.  However, we aren't really optimizing for the "show as list" case, so that's not the concern.

The goal is that, like google, the interesting search results are going to be near the top of your list and you are going to refine your search terms if you are several orders of magnitude off in the size of the results.  Ranking is informed by messages that are starred/tagged and from messages involving starred contacts, so this is not completely wishful thinking.  Sadly, some UI refinement did not make 3.0.

The rationale for the limit (arbitrarily chosen, admittedly) was that if your search is not what you want at all, you would want to find out sooner rather than later.  Loading 10,000 messages just so you can refine your search is sub-optimal.  Primary costs incurred are in deserialization, additional database retrievals required for related metadata and memory usage with serious concerns about garbage collection overhead.  The cost of the message database query is almost fully incurred regardless of the limit in the current implementation, but there are various cost reductions possible where the limit does start to become a serious boon.

> c.  Possible to access this parameter w/ config editor?

Not currently, but a patch to change this would likely be accepted since it is an arbitrary value and does not complicate testing.  (It could complicate localization somewhat, but we would probably want the localized string(s) be capable of adjusting to changes in the limit anyways.)
Thanks for the explanation.  I'll stick with 1000, but I now advise the user community to await documentation for gloda before following my fly-off-the-handle example.  I continue to discover powerful aspects of the UI.
Whiteboard: [datalossy]
Version: 3.0 → Trunk
I stumbled over that restriction today, too. 

Some additional observations:

1.

(In reply to lester from comment #8)
> a.  I recommend stating the precise result: "10 of the most recent 400"

The displayed < 400 messages did not include some recent mail which should match the filter. Are there any other criteria (e.g. to define the importance of a mail) which could cause this?


2. 
Although I understand the reasons for the restriction, it is non-obvious that when refining the search using the date/time selection on top of the page (to select a year, a month, ...), the search is not re-executed for that time interval in order to present again 400 messages that match both the search terms and the time interval. Instead, only the subset of the original 400 messages that fall in the new time interval is displayed.
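
The difference can be sketched as follows. This is a simplified model that ranks by year only and uses made-up data; `query`, the message shape, and `RESULT_LIMIT` are assumptions for illustration, not gloda's implementation:

```javascript
// Illustrative only: data shapes and function names are assumptions.
const RESULT_LIMIT = 400;

// Stand-in for the database query: newest matches first, cut off at the limit.
function query(messages, term, limit = RESULT_LIMIT, year = null) {
  return messages
    .filter((m) => m.body.includes(term) && (year === null || m.year === year))
    .sort((a, b) => b.year - a.year)
    .slice(0, limit);
}

// 100 matching messages per year, 1995 through 2009.
const messages = [];
for (let year = 1995; year <= 2009; year++) {
  for (let i = 0; i < 100; i++) messages.push({ year, body: "hello john" });
}

// Observed behavior: the facet narrows the rows that were already fetched.
const fetched = query(messages, "john");
const facetedSubset = fetched.filter((m) => m.year === 2000);

// Behavior a user might expect: the query is re-run for the selected year.
const requeried = query(messages, "john", RESULT_LIMIT, 2000);

console.log(facetedSubset.length, requeried.length);
```

In this model the first 400 rows are all from 2006 to 2009, so selecting the year 2000 on the facet shows nothing even though 100 messages from that year match.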

3.
The global search is an incredibly helpful tool for dealing with a large amount of mail, especially since it spans multiple mail accounts. All other search methods (quick search, "Search Messages" dialog) are restricted to one single account. So the global search restriction to 400 mails is quite unfortunate.
(In reply to Christian Krause from comment #12)
> The displayed < 400 messages did not include some recent mail which should
> match the filter. Is there any other criteria (e.g. to define the importance
> of a mail) which could cause this?

Yes.  Older messages are made to seem 'more recent' based on various static and dynamic scoring factors.  Please see the comments/code at:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/msg_search.js#14
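
As a rough illustration of how a score bonus can push an older message above a newer one, the sketch below assumes each point of relevance or notability is worth one week of recency. That weighting, the field names, and the sample data are assumptions made for this example; the linked source is the authority on the real formula:

```javascript
// Illustrative only: the weighting below is an assumption, not the formula in msg_search.js.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Each point of relevance or notability counts as one week of extra "recency".
function score(msg) {
  return (msg.relevance + msg.notability) * WEEK_MS + msg.dateMs;
}

const now = Date.UTC(2010, 0, 1);
const messages = [
  // One week old, no star, tag, or starred contact.
  { id: "recent-plain", dateMs: now - 1 * WEEK_MS, relevance: 1, notability: 0 },
  // Ten weeks old, but with a large notability bonus.
  { id: "old-starred", dateMs: now - 10 * WEEK_MS, relevance: 1, notability: 12 },
];

// Highest score first; the limit would then keep only the top of this list.
const ranked = [...messages].sort((a, b) => score(b) - score(a));
console.log(ranked.map((m) => m.id));
```

Under a scheme like this, a plain recent message can fall outside the limit when enough older messages carry a bonus.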
See Also: → 1022228