Closed Bug 393678 Opened 17 years ago Closed 16 years ago

location bar autocomplete should take word boundaries in account

Categories

(Firefox :: Address Bar, defect, P2)

defect

Tracking

()

RESOLVED FIXED
Firefox 3 beta5

People

(Reporter: asqueella, Assigned: Mardak)

References

Details

Attachments

(7 files, 7 obsolete files)

Since bug 389491 was fixed, typing in the URL bar results in a case-insensitive search in the URL and title is done and the results are sorted by visit_count, visit_date.

Here's a problem with this approach:

There are two often-visited places in my history, among others:
1. http://developer.mozilla.org/en/docs/Special:Recentchanges, with a huge visit_count, and the title="Recent changes - MDC".
2. http://en.wikipedia.org/wiki/Special:Watchlist, with a large visit_count, which is much smaller than the first place's visit_count.

Before the patch, I could type "e" or "en" to get the autocomplete popup with the wikipedia page being first. Now "en" matches "changes" in the MDC page's title and it's the first. The first time I saw this, it took me a minute to figure out why the MDC page was even matched, it certainly looked broken.

Another example: I used to have one letter shortcuts for many of the pages I visit regularly (say, 't' for firefox tinderbox), now most of them bring up https://bugzilla.mozilla.org/process_bug.cgi

I suggest the algorithm should be tweaked to avoid matching in the middle of a word like this. So for example 'en' shouldn't match 'Recent changes', or xen.org, but should match 'en.wikipedia.org', 'subdomain.english.com' (?), 'Learn English in 21 days'.
Perhaps the 'traditional' results (that have their URL starting with the typed prefix) should have more weight too?
Blocks: 389491
No longer depends on: 389491
> 1. http://developer.mozilla.org/en/docs/Special:Recentchanges, with a huge

note there's a \ben at /en in that url so the prescription in the summary wouldn't change that result.
Flags: blocking-firefox3?
OS: Windows XP → All
Hardware: PC → All
The position at where the match was found should be considered. Match that follows '://' should have extra weight added.
No longer blocks: 389491
Depends on: 389491
I think autocomplete shouldn't search protocols at all
…but if I type “https://stat”, and have previously visited “http://stat…”, but not “https://stat…”, it should let me go to https://.
I don't think that we should prevent matching within words, as a lot of URIs tend to mash words together (ie: thisplacetoronto.com). However, I do agree that priority should be given to results that match against the start of a word.
Flags: blocking-firefox3? → blocking-firefox3+
Mike Beltzner wrote:
> I don't think that we should prevent matching within words, as a lot of URIs
> tend to mash words together (ie: thisplacetoronto.com).

Yeah, matching within domain name might be ok, but should we perhaps prevent matching within words in the page's title?

Also, it seems we should highlight the matching substring at least in such cases (since otherwise the results look very confusing). I don't see a bug filed about this, although I recall seeing the highlighting in faaborg's mock-ups.
Target Milestone: --- → Firefox 3 M10
Priority: -- → P2
(In reply to comment #7)
> I don't think that we should prevent matching within words, as a lot of URIs
> tend to mash words together (ie: thisplacetoronto.com).

Although I agree it's a valid point, FF2 didn't match that either. New matching is an interesting concept but I don't think it needs to go too far. As long as it doesn't regress basic functionality it's good.

Now, about regressing basic functionality: in FF2, if I wnated to go to my router (http://192.168......) I just needed to write "1" or, at most "19" and have the result at first place.

Now, observe: http://members.iinet.net.au/~syskin/19.png

It's just Not Fun At All, and this bug (the way it's defined in comment #0 minus the fact that "en" doesn't actually match "changes") would solve this particular problem.
Blocks: 404261
And things have been made much worse in the recent nightlies, as now each match  takes up an enormous amount of vertical space, while the new search matches an enormous amount of (irrelevant) items. So this leads to a very poor results. As I see from most of the comments here, people tend to search for URIs (because that's what people type in the address bar). While the new search very much favors title matches (at least that's how it is displayed). I also do think that only matches along word boundaries should be taken into account and matches at the start should get higher weight, because that reflects the way an address is typed. 
Attached patch v1 (obsolete) — Splinter Review
If bug 395739 lands, we can adaptively deal with the word boundary issue by giving extra weight to partial matches of learned inputs.

Example:
1) user types in "recent" and selects devmo/RecentChanges
2) user types in "en.wiki" and selects wikipedia/Watchlist
3) user types in "en" which matches the beginning of a previously typed input "en.wiki", so it shows wikipedia/Watchlist before devmo/RecentChanges

This is probably slightly different from what comment #0 is asking for which is..
1) user types in "en" so it should show wikipedia/Watchlist first


This patch to do partial starting matches of previously typed inputs is probably much easier to implement than a generic word boundary match for whitespace and special characters for both urls and titles. And most likely the common use case is covered with the user previously typing something and selecting it again.
There's a mac try-server build of this patch and some other adaptive stuff.. there might be a linux build finishing up, but the windows one didn't make it.

https://build.mozilla.org/tryserver-builds/2007-12-12_01:04-edward.lee@engineering.uiuc.edu-boundary/

And just to clarify, attachment 292738 [details] [diff] [review] makes it so previously typed inputs are matched when looking for "boundary" -> "starts with". It doesn't help with the initial finding of a history item.

1) type "192.168" and select "http://192.168...."
2) type "1" and select the first result being the one from (1)

Notice that the first time you need to type out more to skip past matching "19.png", but next time, it'll find what you wanted.
If I understood your comment correctly, this bug is _very much_ different.

Your fix seems like a great improvement over the situation when the previously selected items are given a boost any time the typed string is a substring of a previous input (e.g. typing "i" boosts en.wikipedia after step two in your example; did I read your comment and the code in patch for bug 395739 correectly? I didn't see a wiki page with the description/discussion of the algorithm)

My problem is the uselessness of non-adaptive results in certain cases. This is a serious problem that bites the user when the adaptive algorithm did not have enough feedback from the user or when it doesn't work.

An example of what I'm talking about: if I understand correctly (haven't looked very close at your work), current algorithm only gives boost to the specific place that was selected by the user. So if I have lots of developer.m.o URLs in history, typing "d" will show the results from the adaptive algorithm followed by a long and absolutely random list of URLs (that have "d" somewhere). In this random list there will be other developer.m.o results I didn't select previously.

The point of this bug is to reduce this randomness based on common sense and without any feedback from the user. I didn't think about implementation, but on a high level this seems to be doable.
Question is if we want to try to keep all this ordering/selecting logic in the SQL statement. If so, it seems like it'll be difficult to get perfect word boundaries without regular expressions, which aren't implemented by default in sqlite.

But as the high order bit, perhaps we could reuse the rank and give boosts for titles that match at the start or after a space.. or urls that match a dot or slash before the input.

// ?5 Search string for "LIKE%"
// ?6 Search string for "% LIKE%"
// ?7 Search string for "%.LIKE%"
// ?8 Search string for "%/LIKE%"
MAX(
// adaptive stuff..
(i.input LIKE ?3) * i.use_count...
// title boundary: start and space (0,1,2)
+ h.title LIKE ?5 + h.title LIKE ?6
// url boundary: dot and slash (0,1,2)
+ h.url LIKE ?7 + h.url LIKE ?8
) rank .. ORDER BY rank DESC

It's not perfect, but should work for most cases.. or maybe I just can't think of a good way to do much better with _just_ sqlite.
An alternative is adding extra columns or a whole new table that just tracks all words in the title and url split at word boundaries. Then place results that match more boundaries higher.

SELECT SUM(wb.word LIKE 'input%') rank
FROM moz_places h
JOIN word_boundaries wb ON wb.place_id = h.id
GROUP BY h.id // we'll overcount if we JOIN other stuff
ORDER BY rank DESC

But it seems difficult to keep the queries fast JOINing on a huge table as well as populating/updating the table with the words separated by boundaries as pages are added or titles change.
I didn't think about it for too long, but is filtering "bad" results out of the query results a bad idea? That is, keep the SQL query as is, but add a custom step, that throws away results not matching the query if you take the word boundaries in account, between getting the query results and populating the autocomplete list.
What do you mean by "bad"? We don't want to lose "okay" results just because it matched in the middle ("okay" is better than no results).

The issue is with ordering, not so much of selection. Maybe push it lower in the list? But that seems tricky with chunking.. perhaps that'll be addressed with bug 394038.
I kind of agree with Nickolay. You may want to make this a matter of selection instead. Especially for titles, I don't see much value in results where the typed string matched in the middle of a word. Excluding such results would also help reduce the number of matching entries, thereby also presumably improving sorting time. 
Do we want to remove that substring matching functionality? The location auto-complete was redesigned to help users find stuff, so removing potentially useful entries from the autocomplete seems counter-intuitive.

Right now, I can type "complete" and get bugs that are "auto-complete" and "autocomplete". But then again, when this bug gets fixed, titles that have the word "complete" will jump to the top which isn't quite what I would want either... :(
"bad" means not matching at the boundaries (possibly with the exception of domain names). Normal people don't search for "recent" by typing "en"[*], that was my point in comment 0.

If we still want to show those results (per comment 7), we could do this: remember the filtered out results in a separate list and append that list after all "normal" results.

This will still find "thisplacetoronto.com" when the user types "toronto.com", but will make irrelevant results much less annoying, whatever frecency they have. I assume here that the only use case when a user would start searching from a middle of a "word" is when he doesn't remember the beginning. When I remember the (domain) name, I'll type the first few letters, not a substring in the middle.

[* Some wikis like trac use CamelCase in URLs, but we could treat the change of case as a word boundary.]
(In reply to comment #21)
> Right now, I can type "complete" and get bugs that are "auto-complete" and
> "autocomplete".
I bet you'd love the regular expression searches in the Location Bar.

/auto-?complete/ :)
Target Milestone: Firefox 3 M10 → Firefox 3 M11
Priority: P2 → P4
Attached patch v2 (obsolete) — Splinter Review
This forces matches to be on a word boundary: the start of the title/url/tag, ' ', '.', '/', '-', '_'.

I don't think there needs to be a preference to allow middle-of-word matching or even showing a "these matched but in the middle" list. Tags can be added to create new word boundaries or just bookmark with a more appropriate title.

dietrich: The patch is on top of a bunch of others.. but review-wise, it's just making sure I implemented the FindOnBoundary correctly. I started by looking at the FindInReadable implementation and optimized it for word boundary matches which helps simplify some code.
Assignee: nobody → edilee
Attachment #292738 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #308969 - Flags: review?(dietrich)
Attached image screenshot of trunk
Attached image screenshot of v2
I cleared my adaptive learning for these. With fewer rows, the word boundary screenshot shows just as good results as trunk.
Perhaps instead of explicitly only allowing <space> . / - _ as word boundaries.. how about anything in the ASCII ranges:

0x20-0x2f:  !"#$%&'()*+,-./
0x3a-0x40: :;<=>?@
0x5b-0x60: [\]^_`
0x7b-0x7e: {|}~

For people to try:

https://build.mozilla.org/tryserver-builds/2008-03-12_14:58-edward.lee@engineering.uiuc.edu-optimize.wordboundary/
Comment on attachment 308969 [details] [diff] [review]
v2

Hehe. Oops. Just realized something with so many possible word boundary characters.. you wouldn't be able to match a word boundary character unless it came after a word boundary.

E.g., ".com" won't match "foo.com" but will match "it's a .com"

I wonder if there's an easy way to check those ascii ranges in UTF16.. or maybe just switch them to UTF8.. hrmm
Attachment #308969 - Flags: review?(dietrich) → review-
Attached patch v2.1 (obsolete) — Splinter Review
Treats more things as word boundaries and passes all existing testcases now. I'll write a testcase specifically for this, but the changes I made to existing ones works pretty well already.
Attachment #308969 - Attachment is obsolete: true
Attachment #309037 - Flags: review?(dietrich)
I was using an awesome-build and tried to find my "using the awesomebar" post and found a bunch of unrelated pages. schEDule and bUS?

This build contains v2.1 and the screenshot on top shows its awesomeness.

https://build.mozilla.org/tryserver-builds/2008-03-12_21:35-edward.lee@engineering.uiuc.edu-better.boundary/
mconnor: I'm tentatively moving this to P2 unless you feel it should be back at P4.

<dietrich> Mardak: make a case for blocking on the word boundaries bug, that gets mentioned so often by beta users

There's a few dups to this bug and various numbers of voters on bugs related to this issue. Even with Beta 4, users are complaining about not being able to find results because it doesn't show pages they're expecting.

"I don't want to search somewhere in the middle of an URL"
"“ebay” should never match “thepiratebay”"
"the results are unpredictable and the non-trivial amount of time that it needs to find contents for the dropdown box is frustrating"
"Half the time it seems to pick the letters that are within a word"

"Bug 404261 – Autocomplete favours impractical items with the first one to two letters are typed"
"Bug 393705 – New location bar autocomplete never guesses what I want"
"Bug 404625 – Addressbar autocomplete suggests bizarre completions"
"Bug 407437 – New completion in address bar not an improvement"

For those following along.. probably no need to add additional comments on why this should be fixed. ;)

But you can play around with this build if you like:
https://build.mozilla.org/tryserver-builds/2008-03-13_14:29-edward.lee@engineering.uiuc.edu-better.win.boundary/
Priority: P4 → P2
Target Milestone: Firefox 3 beta3 → Firefox 3 beta5
Attached patch v2.2 (obsolete) — Splinter Review
Moved the patch to the top of my stack and unbitrotted from bug 422491. Also added a new testcase just for word boundary stuff.

While writing the testcase.. I started thinking about other languages...
http://unicode.org/reports/tr29/#Word_Boundaries
Attachment #309037 - Attachment is obsolete: true
Attachment #309349 - Flags: review?(dietrich)
Attachment #309037 - Flags: review?(dietrich)
Attached patch v2.3 (obsolete) — Splinter Review
Instead of specifying exactly which things must be word boundaries, we can be conservative and specify things that we know can't be word boundaries and anything that we miss will be treated as a word boundary.

This means if we didn't include "abc" as "not a word boundary", we would be able to match words starting with a, b, or c in the middle of a word just like the functionality we have right now.

This means we can treat ideographs like chinese characters as individual words without explicitly specifying that they're words.

I've explicitly enumerated a-z and hiragana/katakana for now. Probably want to include cyrillic and others. But if we don't, cyrillic characters could be matched in the middle of a word which isn't too bad as a fallback.
Attachment #309349 - Attachment is obsolete: true
Attachment #309529 - Flags: review?(dietrich)
Attachment #309349 - Flags: review?(dietrich)
Attached image counter-screenshot
illustrating comment 21 somewhat -- with the patch applied, I can't find autocomplete and auto-complete by typing those two words :| Also, what about the AnnoyingCamelCaseWikiSilliness mentioned in comment 22?

I guess I sympathize with not breaking existing fx2 habits, but not at the cost of losing awesomeness in slightly hairy cases like that, where some parts have unknown variance that you can try to take into consideration in your search strings -- helpful for finding stuff lost to the deep bowels of history.

re comment 31, said existing fx2 habits -- 
bug 404261: o for osnews.com
bug 393705: en for en.wikipedia
bug 404625: n for www.newsarama.com, co for community.livejournal
bug 407437: "expect to get matches for the URL I'm typing"
bug 393678 (this one): en for en.wikipedia, t for tinderbox.mozilla

...none of those comment 0's at a glance complain about subword matches in general, but only about breaking a quick-access habit or in a more general sense "url completion". The suggestion of comment 1, prioritizing matches to url prefix (I suppose with some www. skip magic), would fix those without risk to the awesomeness. (or comment 7, weighting wordstart matches in general, in case that should be less complicated)
For what little it's worth, as the comment 0 in bug 404625, my problems have been adequately solved already.  How much of that is improvements to the auto-complete and how much is just adjusting to Firefox's new behavior I can't say for sure, although both of the examples I gave in my bug now give the "correct" auto-completion.  (And, as I recall, did so as of Beta 3.)

About the only minor annoyance I run into now is the auto-complete matching in the ".net" or ".com" parts of the URL, but that's always in results after the first few and would probably be a different bug...
Attached patch v2.4 (obsolete) — Splinter Review
browser.urlbar.matchOnWordBoundary defaults to true

This should be the correct default as it makes it more intuitive for new users and still provides awesomeness functionality. For urls like "thisplaceintoronto", the title most likely is "This Place In Toronto". Similarly for CamelCaseWikis, the title probably has the correctly spaced title.

If we were to only match the beginning of the url, we would be only matching the beginning of the url and miss out on matching anywhere else.

Word boundaries are being matched in beta 4 indirectly with the adaptive learning because users type words at the beginning of words. Adaptive learning does prefix matching which will match the words on the word boundary just because the user's input was already word boundary.
Attachment #309529 - Attachment is obsolete: true
Attachment #309570 - Flags: review?(dietrich)
Attachment #309529 - Flags: review?(dietrich)
> "thisplaceintoronto", the title most likely is "This Place In Toronto".

Likely, yes.

> Similarly for CamelCaseWikis, the title probably has the correctly spaced
> title.

http://wiki.debian.org/DebianPackageConfiguration
http://wiki.rubyonrails.com/rails/pages/TipSheetForBeginners
http://www.intertwingly.net/wiki/pie/XmlNamespaceConformanceTests

(3/20 top ghits for "wiki")
Attached image screenshot of v2.5
Attached patch v2.5 (obsolete) — Splinter Review
This makes the word boundary check even more conservative by allowing CamelCase. Basically, it treats upper-case as a word boundary, and it just like before, we can match on a word boundary.
Attachment #309570 - Attachment is obsolete: true
Attachment #309709 - Flags: review?(dietrich)
Attachment #309570 - Flags: review?(dietrich)
If you want to try it out:

https://build.mozilla.org/tryserver-builds/2008-03-15_16:00-edward.lee@engineering.uiuc.edu-camelBoundary.commaTags.colonKey/

Bug 393678 - location bar autocomplete should take word boundaries in account
Bug 407946 - emphasize all matching text in title and url, not just the first match in title and url
Bug 415403 - Show matches for all search words for location bar autocomplete
Bug 418257 - Show what part of which tags match for urlbar autocomplete
Bug 392143 - show keywords as url bar autocomplete choices
Bug 249468 - Add all bookmark keywords to location bar autocomplete drop-down list
Bug 395161 - make it possible to restrict the url bar autocomplete results to bookmarks, tags, or history entries
Bug 407204 - adjust the title and url text sizes
Bug 406257 - reduce the number of rows in url bar autocomplete from 10 to 6
Bug 420437 - Search and emphasize quoted strings with spaces
Bug 414326 - Use DownloadUtils for software update downloads
Bug 422491 - Optimize AwesomeBar if it finished searching and found fewer than maxResults
Attached patch v2.6Splinter Review
Remove the katakana/hiragana checks because those words can be mixed with english words and they still don't have spaces. E.g., <k><k><k>hello<h><h><h> would all be treated as one word. But using the conservative word boundary check will still help out with many urls and titles.. and at worst, we have the same functionality as now with match-in-the-middle.
Attachment #309709 - Attachment is obsolete: true
Attachment #310256 - Flags: review?(dietrich)
Attachment #309709 - Flags: review?(dietrich)
Comment on attachment 310256 [details] [diff] [review]
v2.6

r=me. thanks for iterating on this so much, trying to find the right balance. this conservative non-word-boundary-list approach seems fine. a further enhancement might be to make that list into a pref, so that users can tweak it.
Attachment #310256 - Flags: review?(dietrich) → review+
Checking in browser/app/profile/firefox.js;
/cvsroot/mozilla/browser/app/profile/firefox.js,v  <--  firefox.js
new revision: 1.311; previous revision: 1.310
done
Checking in toolkit/components/places/src/nsNavHistory.cpp;
/cvsroot/mozilla/toolkit/components/places/src/nsNavHistory.cpp,v  <--  nsNavHistory.cpp
new revision: 1.276; previous revision: 1.275
done
Checking in toolkit/components/places/src/nsNavHistory.h;
/cvsroot/mozilla/toolkit/components/places/src/nsNavHistory.h,v  <--  nsNavHistory.h
new revision: 1.149; previous revision: 1.148
done
Checking in toolkit/components/places/src/nsNavHistoryAutoComplete.cpp;
/cvsroot/mozilla/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp,v  <--  nsNavHistoryAutoComplete.cpp
new revision: 1.49; previous revision: 1.48
done
Checking in toolkit/components/places/tests/unit/test_history_autocomplete_tags.js;
/cvsroot/mozilla/toolkit/components/places/tests/unit/test_history_autocomplete_tags.js,v  <--  test_history_autocomplete_tags.js
new revision: 1.6; previous revision: 1.5
done
RCS file: /cvsroot/mozilla/toolkit/components/places/tests/unit/test_word_boundary_search.js,v
done
Checking in toolkit/components/places/tests/unit/test_word_boundary_search.js;
/cvsroot/mozilla/toolkit/components/places/tests/unit/test_word_boundary_search.js,v  <--  test_word_boundary_search.js
initial revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Blocks: 409023
I'm seeing matches like gfxF_ont_utils and nsC_ont_ent for ont. That looks like arbitrary substring matches but isn't; presumably the logic is that of wanting to match bar in FOObarGOAT. How's a user supposed to figure that out and not expect substring matches, and not be surprised when substrings sometimes match and sometimes don't? I think this example shows that boundary matching is a misguided effort to make the search more informed than it can be without making its model too hard to figure out from the results such that the results can appear predictable to the user.

Arguably matching any random substring for a search word could be called a kind of brokenness (or, say, weakness) but it's a simple, predictable, and understandable kind of brokenness; matching only some random substrings like this is a flavor of brokenness that's harder to figure out and use predictably.
Blocks: 429498
(In reply to comment #44)
> I'm seeing matches like gfxF_ont_utils and nsC_ont_ent for ont.
See bug 429498 or try this build:
https://build.mozilla.org/tryserver-builds/2008-04-17_09:47-edward.lee@engineering.uiuc.edu-nocamel.1after/

Searching for "inef" on trunk matches a bunch of Minefield pages but with that build, I only find one page that matches "inefficiencies".
Blocks: 429531
yes, nocamel.1after seems to take care of gfxF_ont_utils and nsC_ont_ent for ont, but I am seeing matches all over uppercase words, e.g. rnet matches INTERNET but not internet, which is weird.
(...which issue is pre-existing to that change)
(In reply to comment #7)
> I don't think that we should prevent matching within words, as a lot of URIs
> tend to mash words together (ie: thisplacetoronto.com). However, I do agree
> that priority should be given to results that match against the start of a
> word.
See bug 429531. This build has the patch there for the attached screenshot.

https://build.mozilla.org/tryserver-builds/2008-04-17_17:55-edward.lee@engineering.uiuc.edu-boundary.anywhere/
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: