409271 - Weight the main website address higher than sub-pages

Reporter

Description

•

18 years ago

I think that we should add extra weight to root pages of websites. All other factors being equal, "www.goats.com" should be ranked higher in search results than sub-pages. This way, if you visit this website once and then search for "goats", you will be more likely to be presented with the main page than with some random, obscure subpage like "www.goats.com/aspx?FamilyID=ad724ae0-e72d-4f54-9ab3-75b8eb148356&displaylang=en". Of course, it would be just one of the factors - other factors could still cause the sub-page to be ranked higher. I believe this is how the Firefox2-Places autocomplete used to work and as far as I recall it was much praised.

Ed Lee :Mardak

Comment 1

•

18 years ago

Bug 394066 comment #6 briefly mentioned something along these lines.. "url length (shorter is better?)" Just a data point.. SELECT MAX(LENGTH(url)), MIN(LENGTH(url)) FROM moz_places 32116, 13 where length 13 is place:sort=4& and http://cs598/ (14 has javascript:NaN, place:folder=4, place:folder=3)

Ed Lee :Mardak

Comment 2

•

18 years ago

Attached patch v1 — Details — Splinter Review

Give weight to shorter urls by grouping them into buckets by length (divide and round to 1 decimal)

Adam Kowalczyk

Reporter

Comment 3

•

18 years ago

Is checking length used as a hack for detecting website roots, or is it because you think that shorter URLs in general should be ranked higher? The most correct way to detect website root would be to create an nIURI and check the path. But that's probably out of the question for performance reasons, even if done using a custom SQL function, am I right? The length hack might be good enough but it may also generate some undesired results in some cases.

Ed Lee :Mardak

Comment 4

•

18 years ago

Right. This is to estimate the root because.. 1) callback and creating a nsIURI for *every* row in the moz_places table will probably be slow 2) figuring out what's exactly a root and not isn't always easy -- / vs /index.html ?

Ed Lee :Mardak

Comment 5

•

18 years ago

Attached image bug (short) — Details

Here's typing in "bug" on trunk. I've cleared out my adaptive searching history and fixed all the pages to be "recent".

Ed Lee :Mardak

Comment 6

•

18 years ago

Attached image bug (trunk) — Details

Prefer shorter urls for the same "bug"

Ed Lee :Mardak

Comment 7

•

18 years ago

Attached image moz (short) — Details

Ed Lee :Mardak

Comment 8

•

18 years ago

Attached image moz (trunk) — Details

Note that I never visit forums.mozillazine.org -- only to the firefox builds page directly, so it doesn't show up here, but it's something the adaptive bar will bubble up.

Ed Lee :Mardak

Updated

•

18 years ago

Attachment #294160 - Attachment description: bug (trunk) → bug (short)

Ed Lee :Mardak

Updated

•

18 years ago

Attachment #294161 - Attachment description: bug (short) → bug (trunk)

Attachment #294161 - Attachment filename: bug short.png → bug trunk.png

Ed Lee :Mardak

Updated

•

18 years ago

Attachment #294162 - Attachment description: moz (trunk) → moz (short)

Attachment #294162 - Attachment filename: moz trunk.png → moz short.png

Ed Lee :Mardak

Updated

•

18 years ago

Attachment #294163 - Attachment description: moz (short) → moz (trunk)

Attachment #294163 - Attachment filename: moz short.png → moz trunk.png

Ed Lee :Mardak

Comment 9

•

18 years ago

Whoops.. named the files the wrong way ;) As a side note.. this also includes the patch to include typed in the rank (bug 352923), and some pages like wiki.mozilla.org and bonsai.mozilla.org actually redirect to another page. Should we put weight for the targets of redirects?

Ed Lee :Mardak

Comment 10

•

18 years ago

Adam: Can you try these builds to see if it does what you want? (Only the mac build is finished right now) https://build.mozilla.org/tryserver-builds/2007-12-20_19:46-edward.lee@engineering.uiuc.edu-awbar.weight.dlmgr.faster/ It's got a bunch of patches, but you should be able to see the effect of weighting the main website over sub-pages.. (note that if you start selecting stuff, the adaptive will kick in) Bug 395735 - Instrument the location bar auto-complete to report accuracy Bug 406359 - Unify the logic for url bar searches and drop down items Bug 406358 - Location bar drop down ignores frequency of visits Bug 395739 - adaptive learning (match entered text to selected item) in url bar autocomplete Bug 393678 - location bar autocomplete should take word boundaries in account Bug 406422 - Globally decay adaptive input history to allow for new entries Bug 406425 - Limit how many adaptive input history entries to track Bug 406427 - Penalize skipped results to allow adaptive learning of shifting preferences Bug 352923 - Autocomplete should prefer typed URLs to URLs that were merely visited Bug 394066 - use "if bookmark" to weight url autocomplete results Bug 409271 - Weight the main website address higher than sub-pages Bug 408696 - Don't show leading 0 for download dates Bug 409311 - I see beach ball instead of download manager Bug 408745 - Download Manager clobbers the UI thread and is slow to respond when updating its list Bug 409105 - Just finished download shows start time instead of completed time Bug 409314 - Default download manager size should be golden (485x300) Bug 409317 - Download date/time should show the full date and time as a tooltip Bug 409326 - Complement eTLD+1 status with a full host tooltip Bug 394516 - Figure out a remaining-time rounding scheme for minutes -> hours/days Bug 406857 - Don't clear referrer when restarting downloads Bug 401316 - Open-with downloads are readonly Just to be clear, the full query is.. // ?1 Starting chunk time // ?2 Ending chunk time // ?3 Search string for %LIKE% // ?4 Plain search string "SELECT h.url, h.title, f.url, b.id, b.parent, " // Find the best rank for a given id/url. "MAX(" // Adaptive ranking relies on inputhistory which can be NULL "IFNULL(0 " // adaptive: Factor in use_count if the input matches (0 to 10) "+ (i.input LIKE ?3 ESCAPE '/') * i.use_count " // adaptive: Give extra weight for exact matches (0 or 5) "+ (i.input = ?4) * 5 " // adaptive: Check "word boundary" for starting matches (0 or 5) "+ (SUBSTR(i.input, 1, LENGTH(?4)) = ?4) * 5 " ", 0) " // visit type: Give extra weight for typed results (0 or 1) "+ h.typed " // visit type: Give extra weight for bookmark visits (0 or 1) "+ (v.visit_type = 3) " // url size: Shorter urls (main page) over longer (sub page) (0 to .5) "+ ROUND(5.0 / LENGTH(h.url), 1) " ") rank " "FROM moz_places h " "JOIN moz_historyvisits v ON h.id = v.place_id " "LEFT OUTER JOIN moz_bookmarks b ON b.fk = h.id " "LEFT OUTER JOIN moz_favicons f ON h.favicon_id = f.id " "LEFT OUTER JOIN moz_inputhistory i ON i.place_id = h.id " "WHERE v.visit_date >= ?1 AND v.visit_date <= ?2 AND h.hidden <> 1 AND " " v.visit_type <> 0 AND v.visit_type <> 4 AND " "(h.title LIKE ?3 ESCAPE '/' OR h.url LIKE ?3 ESCAPE '/') " "GROUP BY h.id " "ORDER BY rank DESC, h.visit_count DESC, MAX(v.visit_date) DESC;");

Marco Bonardo [:mak]

Comment 11

•

18 years ago

why using visit_type=3 instead of b.id NOT NULL? this query could be probably an ideal query to use triggers: add a trigger that updates rank on every visit insert and add a rank column to moz_places... should really be effective and fast

Ed Lee :Mardak

Comment 12

•

18 years ago

Marco, you seem to better understand the performance aspect better.. I was thinking about changing the above query into sub queries instead of JOINing on moz_historyvisits and moz_inputhistory. Both of those together will effectively cross join on both of those tables for each moz_place.id. So this query here is changed to the 2 sub-queries per moz_place.id SELECT COUNT(*) FROM ( SELECT h.url, h.title, h.visit_count, (SELECT MAX(v.visit_date) FROM moz_historyvisits v WHERE v.place_id = h.id AND v.visit_type <> 0 AND v.visit_type <> 4) visit_date, IFNULL((SELECT MAX((i.input LIKE '%%' ESCAPE '/') * i.use_count + (i.input = '') * 5 + (SUBSTR(i.input, 1, LENGTH('')) = '') * 5) FROM moz_inputhistory i WHERE i.place_id = h.id), 0) + h.typed + NOT b.id ISNULL + ROUND(5.0 / LENGTH(h.url), 1) rank FROM moz_places h LEFT OUTER JOIN moz_bookmarks b ON b.fk = h.id LEFT OUTER JOIN moz_favicons f ON h.favicon_id = f.id WHERE h.hidden <> 1 AND (h.title LIKE '%%' ESCAPE '/' OR h.url LIKE '%%' ESCAPE '/') GROUP BY h.id ORDER BY rank DESC, h.visit_count DESC, visit_date DESC ) And the original query.. SELECT COUNT(*) FROM ( SELECT h.url, h.title, h.visit_count, MAX(v.visit_date) visit_date, MAX(IFNULL((i.input LIKE '%%' ESCAPE '/') * i.use_count + (i.input = '') * 5 + (SUBSTR(i.input, 1, LENGTH('')) = '') * 5, 0) + h.typed + NOT b.id ISNULL + ROUND(5.0 / LENGTH(h.url), 1)) rank FROM moz_places h JOIN moz_historyvisits v ON v.place_id = h.id LEFT OUTER JOIN moz_bookmarks b ON b.fk = h.id LEFT OUTER JOIN moz_favicons f ON h.favicon_id = f.id LEFT OUTER JOIN moz_inputhistory i ON i.place_id = h.id WHERE h.hidden <> 1 AND v.visit_type <> 0 AND v.visit_type <> 4 AND (h.title LIKE '%%' ESCAPE '/' OR h.url LIKE '%%' ESCAPE '/') GROUP BY h.id ORDER BY rank DESC, h.visit_count DESC, visit_date DESC ) Both the queries perform about the same for my places, history and inputhistory. I suppose the former would be better if there are many historyvisits for a place id that also has many inputhistory for that same place id. But most likely there'll be at most a couple inputhistory per place id and that table is small to begin with. (In reply to comment #11) > updates rank on every visit insert and add a rank column to moz_places... I believe that's what Seth is doing for bug 394038. Essentially all the static rank stuff can be pre-computed (v.visit_count, MAX(v.visit_date), h.typed, b.id ISNULL, LENGTH(h.url)) but the adaptive stuff that depends on the user's input depend on whatever the user types in.

Marco Bonardo [:mak]

Comment 13

•

18 years ago

subqueries should be faster than joining on big tables, but this needs some testing, i think however that the first query is better, i'd move the rank calculation directly into the order by clause, but again this should be tested to have timings, this could prevent calculating the rank for visits that will be discarded by the group by clause for max visit_date see bug 392399, with a new index and the new fragment (limited order by) it will perform faster, but for now using max is ok i'll try to take some timings on the queries

Ed Lee :Mardak

Comment 14

•

18 years ago

(In reply to comment #13) > i'd move the rank > calculation directly into the order by clause, but again this should be tested > to have timings, this could prevent calculating the rank for visits that will > be discarded by the group by clause Would that do the right thing? Like for the MAX(visit_date), the MAX is aggregating over all the GROUPed place.id. Similarly, the rank is computed to be the best out of the GROUPed place.id, so the GROUP BY can't really discard computations.

Marco Bonardo [:mak]

Comment 15

•

18 years ago

i'm not sure if sqlite makes the computation before or after the group by... i think that you end up having h.id | v.id | rank 12 32 5 12 34 5 12 56 5 and then the group unify the rows, while putting in the order by shoul be h.id | v.id 12 32 12 34 12 56 then group by unify the rows and then calls order by that do the calculation once per place_id

Marco Bonardo [:mak]

Comment 16

•

18 years ago

mmh wait that is true when joining on moz_historyvisits, but since you don't join anymore i'm not sure the group by is needed at all...

Marco Bonardo [:mak]

Comment 17

•

18 years ago

the only things affected by group by are b.id and f.id, for a single place_id you have only one favicon, so that does not require grouping... while if a place_id has more than one bookmark (is this possibile now? i don't think) how can you choose the bookmark.id? So, if you don't join on moz_historyvisits you don't need to group by at all... so the only think affecting moving rank to order by is some memory saving when returning the results from sqlite to places (you don't return rank), but since rank is an int will be really a little saving

Ed Lee :Mardak

Comment 18

•

18 years ago

moz_inputhistory can have multiple entries for a given h.id 'mozilla' and 'moz' can both lead to the same place.id If the user types in 'moz', the 'moz' entry will get extra weight for a perfect match so it's rank would be higher than the 'mozilla' entry. And rank is a float in this case.

Marco Bonardo [:mak]

Comment 19

•

18 years ago

yes, but since you are not joining on moz_inputhistory, but doing the work on a subselect, the group id of the main query is not used, it applies only to joined tables (favicon and bookmarks). the only thing that the group by can change in rank calculation is the b.id, but if a place_id cannot have different bookmarks it does nothing. FROM moz_places h LEFT OUTER JOIN moz_bookmarks b ON b.fk = h.id LEFT OUTER JOIN moz_favicons f ON h.favicon_id = f.id GROUP BY h.id the group by will catch in if a place_id has more than one entry in bookmarks or more than one entry in favicons... subselected columns cannot add rows to resultset, while joins do

(not reading, please use seth@sspitzer.org instead)

Comment 20

•

18 years ago

> this query could be probably an ideal query to use triggers: add a trigger that > updates rank on every visit insert and add a rank column to moz_places... > should really be effective and fast a drive by thought on triggers: while visits happen one (or a few, think about opening n tabs at once) at a time, visit removal happens in bulk (think clear private data). I'm concerned that if we do calculation on a trigger, we could impact performace. another drive by thought: in bug #394038, the general approach is to pay the cost of the frecency calculation on visit time, and not on autocomplete query time. a visit happens one at a time, but we do several queries per key stroke (because of chunking) when typing in the url bar. I haven't read all the recent comments and thoughts from Edward, I'll go do that so I better understand what he is doing. I'll go write up the approach in bug #394038 in more details (so you don't have to read the patch to figure it out).

Ed Lee :Mardak

Comment 21

•

18 years ago

(In reply to comment #19) > yes, but since you are not joining on moz_inputhistory, but doing the work on a > subselect, the group id of the main query is not used, it applies only to > joined tables (favicon and bookmarks). Yup. Thought you were referring to the first query I pasted that joined on historyvisits and inputhistory. > the group by will catch in if a place_id has more than one entry in bookmarks > or more than one entry in favicons I don't see there being more than one favicon for a url, but there's potentially multiple bookmarks. But would we want to show multiple bookmark entries? They could have different titles as well. Seth: There's 2 separate aspects to my stack of patches: the adaptive feedback and a query-time ranking. The query-time ranking is similar to the frecency stuff in that it ranks results, but bug 394038 will do things at visit time. I'm assuming eventually there'll be a split of ranks that we can pre-compute and weights that can only be applied with inputs.

Roman R.

Comment 23

•

18 years ago

I just typed "gateway" into the location bar, and instead of http://gateway.2wire.net/, which is what I wanted, I got http://www.cvs.com/CVSApp/cvs/gateway/cvsstores on the top of the list.

Ed Lee :Mardak

Comment 24

•

18 years ago

That's a separate issue. This bug is about pages from the same domain. You're talking more about bug 409895. That should be addressed with this build.. https://build.mozilla.org/tryserver-builds/2007-12-26_15:26-edward.lee@engineering.uiuc.edu-awesom.domain/

Adam Kowalczyk

Reporter

Comment 25

•

17 years ago

Mardak, are you going to try to get this in for Firefox 3 by any chance?

Ed Lee :Mardak

Comment 26

•

17 years ago

This shouldn't be as much as a problem anymore now that adaptive learning is in. There are only so many root sites that you can visit compared to the subpages, and you're likely to select the root site multiple times causing it to be adapted. I doubt that we'll do ranking of results when querying they've gotten in the DB for perf reasons. Seth: I suppose it would be possible to tweak the frecency to give higher rank to "main websites", but as I've pointed out before, it's difficult to determine what's a root. It's not just http://site/ because it could be http://site/MainPage.html. Or what I did before with an earlier patch.. just prefer shorter urls.

Seth Spitzer

Comment 27

•

17 years ago

> Seth: I suppose it would be possible to tweak the frecency to give higher rank > to "main websites" I haven't thought as long or as hard about this as you, but it seems to me that your adaptive feature would be enough to solve this problem, as over time, that should improve results to give users what they want. or is this bug more about improving the quality of results before we have have adaptive data? either way, I wouldn't block fx 3 for this one.

Adam Kowalczyk

Reporter

Comment 28

•

17 years ago

(In reply to comment #27) > > Seth: I suppose it would be possible to tweak the frecency to give higher rank > > to "main websites" > > I haven't thought as long or as hard about this as you, but it seems to me that > your adaptive feature would be enough to solve this problem, as over time, that > should improve results to give users what they want. > > or is this bug more about improving the quality of results before we have have > adaptive data? > > either way, I wouldn't block fx 3 for this one. > Adaptive learning can compensate any kind shortcomings in the frecency algorithm, because the initial ordering of the results is going be gradually changed to match the user's behaviour. However, this shouldn't be a reason not to refine the initial results. So yes, this bug is about improving the quality of results before we have have adaptive data. I don't think it should block Firefox 3 either, but it would be a nice-to-have.

Vova Olar

Comment 30

•

13 years ago

Bug 240397 is a duplicate

Dão Gottwald [:dao]

Updated

•

13 years ago

No longer blocks: 405745

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → DUPLICATE

Adam Kowalczyk

Reporter

Comment 32

•

13 years ago

Dao, I'm reopening for reconsideration. This bug is probably more useful because it contains research and discussion from Mardak and Marco, whereas bug 240397 is old and predates Awesomebar. Feel free to mark as duplicate again if you don't agree, though.

Status: RESOLVED → REOPENED

Resolution: DUPLICATE → ---

BMO Automation

Comment 33

•

7 years ago

Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.

Status: REOPENED → RESOLVED

Closed: 13 years ago → 7 years ago

Resolution: --- → INACTIVE

v1 18 years ago Ed Lee :Mardak 1.10 KB, patch		Details \| Diff \| Splinter Review
bug (short) 18 years ago Ed Lee :Mardak 138.19 KB, image/png		Details
bug (trunk) 18 years ago Ed Lee :Mardak 153.65 KB, image/png		Details
moz (short) 18 years ago Ed Lee :Mardak 101.72 KB, image/png		Details
moz (trunk) 18 years ago Ed Lee :Mardak 129.21 KB, image/png		Details