Open Bug 1133269 Opened 9 years ago Updated 2 years ago

Improve order of matches in recipient autocomplete address field drop-down list: Searching for "foo bar" (with space) should also prioritize results having initial string "foo.bar" (with dot)

Categories

(Thunderbird :: Message Compose Window, enhancement)

31 Branch
x86
Windows XP
enhancement

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: hpvpp, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [X-Ref Bug 382415, "Frecency"])

Attachments

(1 file)

Attached image Capture_4.jpg
User Agent: Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0
Build ID: 20150122214805

Steps to reproduce:

typed: "roger p"


Actual results:

the address I wanted was second on the drop-down list


Expected results:

"roger p" *does* match roger.pamplet
it does *not* match roger kirkwood
it also does *not* match rkirkwood@penguins.org.au

Your sorting algorithm is wrong
Component: Untriaged → Message Compose Window
Summary: incorrect order of potential matches in address field drop-down list → incorrect order of potential matches in autocomplete address field drop-down list
That searches for matches including roger OR p.
On trunk (but not 31.x) you can quote it to keep as exact string match.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
(In reply to Magnus Melin from comment #1)
> That searches for matches including roger OR p.

Your interpretation is contrary to the standard definition as, for example, given by http://dictionary.reference.com/browse/Autocomplete :
"a feature of a word processor, email program, web browser, etc., that automatically predicts the *remaining characters* in a word or phrase based on what has been typed or input before" (my bolding).

The only way an operator is justified is when there is no string "roger p".  And then I would expect an interpretation of "roger AND p", because the default assumption of a list is that it provides *additional* detail.  For example, when I am ordering an ice cream and I say "chocolate vanilla please", I am expecting one scoop of chocolate and one scoop of vanilla and *not* two scoops of chocolate OR two scoops of vanilla.

I would like to see your justification for your non-standard interpretation.

> On trunk (but not 31.x) you can quote it to keep as exact string match.

That would *not* have the desired effect if the target was "roger.pa".
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Humphrey, thanks for sharing your user experience, which is valuable to us.

I see your point to some extent, but it would be very hard if not impossible to address that.
Moreover, some of your claims are factually wrong.

(In reply to Humphrey P van Polanen Petel from comment #0)
> Created attachment 8564629 [details]
> Capture_4.jpg
> 
> User Agent: Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0
> Build ID: 20150122214805
> 
> Steps to reproduce:
> 
> typed: "roger p"
> 
> 
> Actual results:
> 
> the address I wanted was second on the drop-down list
> 
> 
> Expected results:
> 
> "roger p" *does* match roger.pamplet

Yes, it matches roger.pamplet and that's why it's included in the result set.
However, it is NOT an exact string match because the email address has "roger.p" (with dot) but you're searching for "roger p" (with space).

> it does *not* match roger kirkwood
> it also does *not* match rkirkwood@penguins.org.au

This is where you really start contradicting yourself in a technical sense.
If you actually want search phrase "roger p" to match "roger.p" (which is NOT an exact match), then it can only work with an algorithm like we currently have (NOT using exact full string match):
> Contains "roger" AND contains "p".
In other words, split up your search string and match each "word" of it separately (there are more details involved, but that's the basics).

(In reply to Magnus Melin from comment #1)
> That searches for matches including roger OR p.
> On trunk (but not 31.x) you can quote it to keep as exact string match.

Magnus, we're matching *Roger* AND *p*, not OR.
The AND operator is required so that incremental searches narrow down result sets efficiently.

So according to that powerful/versatile matching pattern, "roger p" DOES match
Roger Kirkwood <rkirkwood((at))penguins.org.au> because
Roger - match on display name
p - match on penguins.org.au
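
For illustration only, the "contains every word" matching described above can be sketched in Python. This is not Thunderbird's actual code: the function name `matches_all_tokens`, the `example.org` address and the third contact are made up for the example, and the real matcher involves more details.

```python
def matches_all_tokens(query: str, recipient: str) -> bool:
    """Return True if every whitespace-separated word of the query
    occurs somewhere in the recipient string (case-insensitive)."""
    haystack = recipient.lower()
    tokens = query.lower().split()
    # An empty query matches nothing in this sketch.
    return bool(tokens) and all(token in haystack for token in tokens)


# Hypothetical address book entries (the example.org addresses are invented).
contacts = [
    "Roger Kirkwood <rkirkwood@penguins.org.au>",
    "roger.pamphlett@example.org",
    "Peter Smith <peter@example.org>",
]

# "roger" AND "p": both Rogers match, Peter does not (no "roger").
hits = [c for c in contacts if matches_all_tokens("roger p", c)]
```

With an AND over the words, Roger Kirkwood is a hit because "p" is found in "penguins", which is exactly the case debated in this bug. The sketch only decides membership in the result set; it says nothing about the order of the results.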

In fact, for many users this will be the desired result because it allows them to build up highly efficient incremental searches:
start searching for "Roger"...
note that you have too many contacts matching "Roger" in your AB
narrow down search as-you-type by adding " p" ("Roger p") to find the one and only roger @ penguins.
That's an ingeniously efficient way of reducing result sets.

> Your sorting algorithm is wrong

The fact that it doesn't match your expectations doesn't mean it's wrong; perhaps there might be room for further improvements.
We've recently done a lot to improve the sorting algorithm (some improvements might not have landed yet for release).

Given that your search string does NOT match EITHER of the two results exactly (full string), the sort order is based on other factors. Most likely, over the entire time that you've used this profile, you've written to Roger Kirkw. more often than to the other one (in total). We have a bug on record that popularityIndex does not age, so recency is not considered yet; it's a dull, everlasting counter.

Solution for your scenario:

Add a display name "Roger Pamphlett" to his card. Then it should work, because we DO prioritize exact full string matches of full search string against the beginning of the visible recipient string.

Your other alternative is to define explicit and sufficiently unique nicknames in the Nickname field of the card properties, for example:

Name: Roger Pamphlett
Nickname: RP#

Name: Roger Kirkwood
Nickname: RK#

Then, just typing "rp#" or "rk#" into recipient field will give you the right person, 100% of the time if it's unique enough.

Beyond that, in theory we could try prioritizing "roger.p...@" when searching for "Roger p" but it's a lot of coding work and runtime resources (search takes longer) for special-casing the dot case.
Plus I don't think it's wanted because it will further undermine the more sustainable approach of Bug 382415 (implement "frecency", so that TB dynamically learns over time that when you type "Roger p", you actually mean that contact having roger.pamphlett as email address; but when your preferred contacts and use patterns change, we'll also dynamically re-adapt to that).
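
To make the "special-casing the dot" idea concrete, here is a rough Python sketch. It is an assumption about how such a check could look, not existing Thunderbird code: `normalize`, `is_initial_match` and the choice of separator characters (whitespace, dot, underscore, hyphen) are all hypothetical.

```python
import re

# Hypothetical choice: treat whitespace, dots, underscores and hyphens alike.
_SEPARATORS = re.compile(r"[\s._-]+")


def normalize(text: str) -> str:
    """Lower-case the text and collapse each run of separators into one space."""
    return _SEPARATORS.sub(" ", text.lower()).strip()


def is_initial_match(query: str, recipient: str) -> bool:
    """True if the recipient starts with the query once separators are normalized."""
    needle = normalize(query)
    return bool(needle) and normalize(recipient).startswith(needle)


# "roger p" now lines up with the start of "roger.pamphlett@...".
dot_case = is_initial_match("roger p", "roger.pamphlett@example.org")
# It does not line up with the start of "Roger Kirkwood <...>".
penguin_case = is_initial_match("roger p", "Roger Kirkwood <rkirkwood@penguins.org.au>")
```

Such a check would have to run against every candidate on every keystroke, which is the extra runtime cost mentioned above, and any result it uplifts would outrank popularity.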

So I'm not sure if this bug has merits.
At present it's certainly invalid, otherwise incomplete.
It would need to be completed with a detailed description of how you imagine a better sorting algorithm to work in a one-for-all sustainable approach (compatible with Bug 382415); lacking that, it might still end up "wontfix".
Severity: normal → enhancement
Summary: incorrect order of potential matches in autocomplete address field drop-down list → Improve order of matches in recipient autocomplete address field drop-down list: Searching for "foo bar" (with space) should also prioritize results having initial string "foo.bar" (with dot)
Thomas, thank you for your considerate reply.  I have been slack in formulating my complaint and I apologize for that, but let me try to put that straight.

Firstly, I was assuming that TB when searching in its address book(s) operates like Google.  That, however, must be a reasonable assumption, because Google searches occur more often than TB address book searches and users are likely to be conditioned into assuming that TB searches work like Google's.  Therefore, I assumed (without thinking) that TB ignores whitespace characters which makes "roger p" equivalent to "roger.p", albeit fuzzily.  

Secondly, I did not really contradict myself, because if I had wanted "roger p" and not "roger.p" then I would have quoted my search string which I would have reported here as "\"roger p\"" which would then not be equivalent to "roger.p".

Thirdly, I expect "roger p" to be sorted before "roger.p" simply because the human mind prefers exact matches over fuzzy matches.

Having said that, would it help if you people build a regex string based on what the user types?  And, as concerns the sorting, would it help if you prioritize the personal address book over the collected address book by increasing the weighting of the personal address book matches by adding the maximum frequency value of the collected address book matches?

Best,
Humphrey
(In reply to Humphrey P van Polanen Petel from comment #4)
> Thomas, thank you for your considerate reply.  I have been slack in
> formulating my complaint and I apologize for that, but let me try to put
> that straight.
> 
> Firstly, I was assuming that TB when searching in its address book(s)
> operates like Google.  That, however, must be a reasonable assumption,
> because Google searches occur more often than TB address book searches and
> users are likely to be conditioned into assuming that TB searches work like
> Google's.  Therefore, I assumed (without thinking) that TB ignores
> whitespace characters which makes "roger p" equivalent to "roger.p", albeit
> fuzzily.

I believe we do *match* things like Google, allowing "fuzzy" matches (which is what I fought for!), but (probably like Google again), the final sorting of results will involve popularity of result contacts.

> Secondly, I did not really contradict myself, because if I had wanted "roger
> p" and not "roger.p" then I would have quoted my search string which I would
> have reported here as "\"roger p\"" which would then not be equivalent to
> "roger.p".

Point taken. All things being equal, {roger.p...@...} is a better match for [Roger p] than {Roger Kirk... @penguin}.

> Thirdly, I expect "roger p" to be sorted before "roger.p" simply because the
> human mind prefers exact matches over fuzzy matches.

I'm not sure if I understand that one. Is it quoted or not?

> Having said that, would it help if you people build a regex string based on
> what the user types?

We already use regex strings for the *sorting* algorithm, where we uplift results matching your *full (exact) search string* against the beginning of word-like strings anywhere in the full recipient string as seen in autocomplete results.
Such matches are uplifted; when their scores are equal, we use popularity.
Roughly speaking, initial and word-initial matches on the displayed recipient outscore popularity.
For details, consult Bug 970456.
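
As a rough illustration of that ordering (a sketch under stated assumptions, not the code from Bug 970456): the score values 2/1/0, the names `Contact`, `match_score` and `rank`, the popularity numbers and the `example.org` addresses are all made up here.

```python
import re
from typing import NamedTuple


class Contact(NamedTuple):
    display: str     # recipient string as shown in the autocomplete list
    popularity: int  # hypothetical stand-in for popularityIndex


def match_score(query: str, display: str) -> int:
    """2 = full query at the very start, 1 = at the start of a later word, 0 = neither."""
    q = query.lower()
    text = display.lower()
    if text.startswith(q):
        return 2
    if re.search(r"\b" + re.escape(q), text):
        return 1
    return 0


def rank(query: str, contacts: list[Contact]) -> list[Contact]:
    """Sort by match score first; popularity only breaks ties."""
    return sorted(contacts, key=lambda c: (-match_score(query, c.display), -c.popularity))


contacts = [
    Contact("Roger Kirkwood <rkirkwood@penguins.org.au>", 40),
    Contact("roger.pamphlett@example.org", 5),
    Contact("Roger Pamphlett <roger.pamphlett@example.org>", 5),
]
ranked = rank("roger p", contacts)
```

In this sketch the card with the display name "Roger Pamphlett" comes first because the full search string matches its beginning. The bare `roger.pamphlett@...` address scores 0 (dot, not space), so it falls back to popularity and lands below Roger Kirkwood, which mirrors the behaviour reported in comment 0.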

I think your request here is that we should break up the search string not only for *finding* results, but also for *sorting* results, so that fuzzy matches against the displayed recipient will rank higher.
I admit this looks appealing and common sense at first sight.

However, there's a catch with any sort algorithm whose highest priority is not based on "frecency", namely that it will practically prevent "frecency" from working (note that we currently just have frequency, not recency in our algorithm).

The more exceptions we create now which outscore "frecency", the harder it will be to land "frecency" later on. But "frecency", as evidenced in FF location bar, is very powerful because it dynamically adapts to the user's needs and intentions, instead of trying to enforce one-for-all inflexible algorithms. Same applies for your next suggestion below:

> And, as concerns the sorting, would it help if you
> prioritize the personal address book over the collected address book by
> increasing the weighting of the personal address book matches by adding the
> maximum frequency value of the collected address book matches?

I strongly discourage that approach (discussed in Bug 1114751), because we have absolutely no way of safely predicting for every user that all contacts from collected address book are necessarily less relevant than any matches from personal AB. Nothing can stop users from using "collected AB" as their "main" AB, in fact it's quite easy to get there by just writing messages to new users, or replying to messages you received, and Thunderbird will collect all the addresses for you when sending. Perhaps I imported a bunch of addresses into Personal AB, but does that *automatically* mean they are more important? How would we know? Perhaps it's the archive for less-used addresses. Or perhaps I manually entered some addresses there just to keep them but they are *less* frequently used than those I actually write to and which thus got collected.
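
A small Python sketch of the quoted proposal shows the problem. It is one possible reading of the suggestion, with invented names (`boost_personal`), addresses and counts; nothing here is taken from Thunderbird's code.

```python
def boost_personal(personal: dict[str, int], collected: dict[str, int]) -> dict[str, int]:
    """Add the highest collected-AB frequency to every personal-AB frequency,
    so that no collected contact can outrank a personal one."""
    offset = max(collected.values(), default=0)
    scores = dict(collected)
    scores.update({address: count + offset for address, count in personal.items()})
    return scores


# Hypothetical usage counts: a rarely used personal card versus a collected
# address that the user writes to all the time.
personal = {"archived.contact@example.org": 1}
collected = {"daily.colleague@example.org": 250, "one.off@example.org": 2}

scores = boost_personal(personal, collected)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the card used once (1 + 250 = 251) is listed above the collected address used 250 times, purely because of the address book it lives in.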

If we take that road, we'll end up where we once started before introducing popularityIndex:
A very complex, inflexible sorting algorithm which tries to get it right in a one-for-all approach, which can easily end up wrong for all. We are already getting the first complaints that prioritizing based on initial matches against the recipient result string effectively undermines popularity, thus uplifting the wrong results, with no way of ever getting it right (Bug 970456, comment 163). By comparison, the planned "frecency" algorithm (more so if linked with usage data linking the actual input to the ultimately chosen result, as in FF) is much more flexible and will actually (over time) learn your personal habits, i.e. which contacts you *really* want when you type that search phrase. More specifically, it will work for you, but it will also work for that other user who expects "Roger p" to find that Roger with "penguin" in the domain. Or if you change your communication patterns, and the penguin becomes more relevant, TB will also dynamically adjust to that.
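
To give a feel for the difference from a plain counter, here is one simple way to combine frequency and recency, sketched in Python. It is an assumption for illustration only: it is neither Firefox's actual frecency formula nor anything planned in Bug 382415, and the 30-day half-life and the name `frecency` as a function are made up.

```python
HALF_LIFE_DAYS = 30.0  # hypothetical: a use loses half its weight every 30 days


def frecency(use_ages_days: list[float], half_life: float = HALF_LIFE_DAYS) -> float:
    """Sum of decayed weights: each past use counts 1.0 when fresh
    and half as much for every half_life days that have passed."""
    return sum(0.5 ** (age / half_life) for age in use_ages_days)


# Written to 40 times, but all of it about two years ago.
old_favourite = frecency([730.0] * 40)
# Written to only 3 times, all within the last week.
recent_contact = frecency([1.0, 3.0, 6.0])
```

A plain counter like popularityIndex would rank the first contact far ahead (40 versus 3); with decay, the recently used contact scores higher, and the ranking shifts again if usage patterns change.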

I suspect that all attempts at inflexible sorting are just stop-gaps until we introduce the more powerful and versatile "frecency" algorithm including the "what you've typed and what you chose" data thing.

Perhaps there's merit to this RFE in this interim phase. I don't fully know. This stuff is also very complex, involving a large number of factors both from scenarios and net effects of the algorithm depending on those scenarios.

> Best,
> Humphrey
Whiteboard: [X-Ref Bug 382415, "Frecency"]
(In reply to Thomas D. from comment #5)

> Point taken. All things being equal, {roger.p...@...} is a better match for
> [Roger p] than {Roger Kirk... @penguin}.

FTR, this should be rephrased as:

Point taken.  *At first sight*, and all things being equal, {roger.p...@...} *appears* to be a "better" match for [Roger p] than {Roger Kirk... @penguin}. But it really depends on your personal expectations. Other users will expect the penguin toplisted from that search, if that's what they use more often.

> I admit this looks appealing and common sense *at first sight*.
> 
> However, there's a catch with any sort algorithm whose highest priority is
> not based on "frecency", namely that it will practically prevent "frecency"
> from working (note that we currently just have frequency, not recency in our
> algorithm).
Observation: searching from the address field of an email produces different results from searching from the search field in the address book.

Suggestion: do away with all the various address books and have just one in which you indicate in a column whether it was a user created or a collected address.  
Also, for a to-be-collected address, put a bar above the message pane (like TB does with a scam suspect) and allow the user to choose whether to keep the address or not and update it as need be.
Also, when searching (whether from the address field of an email or from the search field in the address book) let <left-arrow> hide and <right-arrow> show collected addresses as in contracting and expanding the search results.  (I understand that would require a rethink of key-use, but it seems rather intuitive.)
Severity: normal → S3