Closed Bug 287066 Opened 17 years ago Closed 5 years ago

History automatic indexing (full text indexing of pages content)

Categories

(Firefox :: Bookmarks & History, enhancement)

enhancement
Not set
normal

RESOLVED WONTFIX

People

(Reporter: arnaud.legout, Unassigned)

References

(Depends on 1 open bug)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1

History is useful to find pages you consulted some time ago. However, when you
browse a site, the title of each page is often the same. The history results in
tens of links with the same title and no way to know which page is relevant to you. 

Also, when you browse the Web, you can go through a very large number of sites
and do not remember at all the title/name/URL of the site. You simply remember
some keywords related to the content of the page (e.g. firefox, block, image for
http://adblock.mozdev.org/)

A very nice functionality would be automatic indexing of the history (as performed
by a web crawler). When you access a web page, it is added to the
history and automatically indexed for future search. When you do a fast history
search, the search is performed on the title and also on the index database. 
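
To make the idea concrete, here is a minimal sketch of such an index in Python. It is only an illustration, not how Firefox stores history: the in-memory dictionaries, the function names (`add_to_history`, `search_history`) and the sample page text are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical in-memory stores (a real implementation would persist them).
index = defaultdict(set)   # word -> set of URLs whose title or text contains it
titles = {}                # URL -> page title


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_to_history(url, title, body):
    """Record a visited page and index the words of its title and body."""
    titles[url] = title
    for word in tokenize(title) + tokenize(body):
        index[word].add(url)


def search_history(query):
    """Return the URLs whose indexed text contains every query keyword."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# The page text below is invented for the example.
add_to_history("http://adblock.mozdev.org/", "Adblock",
               "A Firefox extension to block image ads")
print(search_history("firefox block image"))
```

With this sketch, a search on remembered keywords finds the page even though none of them appears in the title.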

Related to https://bugzilla.mozilla.org/show_bug.cgi?id=286544
which is on bookmarks

This bug https://bugzilla.mozilla.org/show_bug.cgi?id=126621 presents a solution
based on caches for mozilla. However, this solution has many drawbacks
(limitation in the number of pages cached, solution slower than a keyword
indexing, cache can be flushed, etc.)

Reproducible: Always
I have an argument similar to the one of https://bugzilla.mozilla.org/show_bug.cgi?id=286544

The first way to find a site is to use Google. However, the first result pages do not always contain the links you are looking for. Reasons are:
-the site is not well ranked
-you do not remember the important keywords

One solution would be to automatically create your own database of sites and to 
perform the search in this database. 

A unified database of bookmarks and history links would be an elegant solution. 
Bookmarks would be history links with a given rating.
Status: RESOLVED → UNCONFIRMED
Resolution: EXPIRED → ---
History and bookmarks are being reworked for 2.0
http://wiki.mozilla.org/Places

However, I don't think full text indexing is being implemented.
Depends on: 342913
Assignee: bugs → nobody
QA Contact: mozilla → history
Resorting this enhancement into places...
Component: History → Places
QA Contact: history → places
Version: unspecified → Trunk
Very interesting, at least as an extension.

I'm wondering how large the database would be, though.
Let's guess for English:
5000 words is supposed to cover 98.5% of the words.
The average English word length is said to be 5 characters.
We have to add 1 byte for varchar, 4 bytes for (integer type) count number and 4 bytes for the page id (an integer too).

So we have 5000 x (5+1+4+4) = 70 000 bytes ~= 68 KiB.
Admittedly 5000 is very few, so if we do it with 500 000 words (a little less than all the English words), we end up with a database of less than 7 MiB, so the size is not a problem.
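
The estimate can be checked with a few lines of Python. This is just the arithmetic from this comment written out; the constant and function names are made up for the sketch, and the per-row byte counts are the guesses stated above, not measured SQLite figures.

```python
# Per-row guesses taken from the comment above (not measured values).
AVG_WORD_LENGTH = 5    # characters in an average English word
VARCHAR_OVERHEAD = 1   # extra byte for the varchar
COUNT_BYTES = 4        # integer count
ID_BYTES = 4           # integer id


def words_table_bytes(n_words):
    """Estimated size in bytes of a words table holding n_words rows."""
    per_row = AVG_WORD_LENGTH + VARCHAR_OVERHEAD + COUNT_BYTES + ID_BYTES
    return n_words * per_row


for n in (5_000, 500_000):
    size = words_table_bytes(n)
    print(f"{n} words: {size} bytes = {size / 2**20:.2f} MiB")
```

Each row costs 14 bytes under these guesses, so 500 000 words come to 7 000 000 bytes, which is indeed a little under 7 MiB.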
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: History automatic indexing → History automatic indexing (full text indexing of pages content)
Read word id instead of page id.
But the size estimation is wrong, since SQLite uses variable-size integers.
It would be more like 5+1+3+3 since word id and page id aren't likely to be greater than 8 388 608.

The words table could be smaller than 7 MiB, but what would be big is the table which stores page/word pairs.
The structure would be id - word id - page id - word count
The size per record would be 4+3+3+2 = 12 bytes.
But now how many records would we have?

Let's say that the mean number of unique words per page would be 1000.
For the maximum of pages allowed by default (40 000), we would get 1000 x 40 000 x 12 = 480 000 000 bytes, which is a little more than 450 MiB.
And a little more than the words table... not counting the indexes.
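
The same kind of back-of-the-envelope check for the page/word table, as a Python sketch. The names are invented for the example, and the 1000 unique words per page is the guess from this comment, not a measurement.

```python
# Guessed sizes from the comment above: id, word id, page id, word count.
RECORD_BYTES = 4 + 3 + 3 + 2
WORDS_PER_PAGE = 1_000   # assumed mean number of unique words per page
MAX_PAGES = 40_000       # default maximum number of pages quoted above


def pairs_table_bytes(pages=MAX_PAGES, words_per_page=WORDS_PER_PAGE):
    """Estimated size in bytes of the page/word table, indexes excluded."""
    return pages * words_per_page * RECORD_BYTES


size = pairs_table_bytes()
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

The result scales linearly with the words-per-page guess, so halving that guess halves the table.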
Places received great feedback from FF3 beta users because it partially solves
a very old problem with bookmarks: we bookmark a lot but do not remember
accurately enough what we bookmarked in order to find it again. 

With full page indexing of bookmarks and history, you bring to users 
a Google-like service tailored to their needs. This is really a killer 
feature. 
Indeed, a lot of people use Google as a start page, and this is nonsense, as 
what you want is very often something you already looked for, so something
you already have in your bookmarks or history. 

If you tell me it takes 450 MiB to have this feature for 40 000 pages (which sounds
like a lot), I would say it is fine. We all already have much more for IMAP 
offline support and various indexing services.

Moreover, simple heuristics can reduce the size of the indexes:


1) Limit the number of words indexed per page (for instance to 100), 
indexing the most frequent words first (not taking into account words like a, 
the, etc.). Also, words in the title and HTML headers should always be indexed. 
2) Implement the long-awaited bookmark sanity check that checks all bookmarks on demand 
and proposes to remove bookmarks that return a 404 error. 
3) Give a UI preference to index bookmarks only or bookmarks and history. 
4) Set a hard limit on the size of the index and then use a FIFO policy. FIFO is probably
not the best policy in that case, as pages browsed a long time ago
will be the ones that really need support to be found (as you are unlikely to remember where the bookmark is).
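
Heuristic 1 could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the function name `words_to_index`, the tiny stop-word list, and the choice to filter stop words out of titles and headers as well.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def words_to_index(title, headers, body, limit=100):
    """Pick the words to index for one page.

    Title and header words are always kept; from the body, only the
    `limit` most frequent non-stop words are kept.
    """
    always = {w for w in tokenize(" ".join([title, *headers]))
              if w not in STOP_WORDS}
    counts = Counter(w for w in tokenize(body) if w not in STOP_WORDS)
    frequent = {word for word, _ in counts.most_common(limit)}
    return always | frequent
```

Note that the returned set can hold more than `limit` words, since title and header words are added on top of the body words.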

I strongly believe this is a must-have feature that everybody wants to have,
as it will dramatically improve the usability of bookmarks and history. 


I just stumbled across Breadcrumbs(1), which does exactly what this bug is about.

It has not been updated since 10.02.2007, but it works perfectly even under the soon-to-be-released Firefox 3 with Nightly Tester Tools(2).

It looks like the author needs a little bit of motivation, so feel free to tell him how much you like his add-on by adding a review on the add-on's page.


(1) https://addons.mozilla.org/fr/firefox/addon/2954
(2) https://addons.mozilla.org/fr/firefox/addon/6543
We're not going to index content of the pages in Places (it may happen in Activity Stream or elsewhere, but for sure not in Places).
Status: NEW → RESOLVED
Closed: 16 years ago → 5 years ago
Resolution: --- → WONTFIX