moz_places_url_uniqueindex takes up to 1/3 of places.sqlite

RESOLVED FIXED in Firefox 50

Status

Product: Toolkit
Component: Places
Priority: P2
Severity: normal
Status: RESOLVED FIXED
Reported: 5 years ago
Last modified: 9 months ago

People

(Reporter: (dormant account), Assigned: mak)

Tracking

(Blocks: 1 bug, {perf})

Version: unspecified
Target Milestone: mozilla50
Platform: x86_64 Windows 8
Points: 5
Bug Flags: firefox-backlog+, qe-verify-
Firefox Tracking Flags: firefox50 fixed
Whiteboard: [fxsearch]


Attachments

(1 attachment, 2 obsolete attachments)

So, we need the url index for 2 reasons:
1. uniqueness check (internal functionality relies on this precondition)
2. actual queries doing WHERE url = some_url (many APIs rely on this)

However, we don't need an index that replicates the whole url (which is what SQLite does); we just need the two points above.
A good hashing algorithm (the mfbt one, CityHash, or MurmurHash) could easily provide the same benefits, taking just 64 bits per entry in the ideal case (to be verified).

This could reduce the database size by 15-20%.

So the plan of action here may be:
1. Figure out a good hashing algorithm. We could add telemetry probes to the current method that adds URIs to the db and report collisions in Nightly/Aurora (this requires some changes to the db, but if we plan the change that's fine, we can just add the url_hash column earlier).
2. Remove moz_places_url_uniqueindex and instead add a unique index on the url_hash column, then measure the db size win. The hashing could either be done at the Places level or (better) implemented as a SQLite function.
3. Fix queries to actually compare hashes instead of plain urls.

Note: I said 64 bits because we'd need a 32-bit integer column for the hash and an index replicating it.
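The plan above can be sketched with Python's sqlite3 module. This is only an illustration, not the Places code: 32-bit FNV-1a stands in for whichever algorithm step 1 would pick (mfbt, CityHash, Murmur), the table is a toy version of moz_places, and add_page is a hypothetical helper.

```python
import sqlite3

def fnv1a_32(s):
    # 32-bit FNV-1a, standing in for the mfbt hash; the real choice of
    # algorithm is exactly what step 1 (telemetry) would decide.
    h = 0x811C9DC5
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

db = sqlite3.connect(":memory:")
# Expose the hash to SQL, as suggested above (a SQLite function).
db.create_function("hash", 1, fnv1a_32)

db.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, url_hash INTEGER);
    -- A small integer index instead of moz_places_url_uniqueindex.
    CREATE INDEX moz_places_url_hashindex ON moz_places(url_hash);
""")

def add_page(url):
    # Uniqueness check via the hash: fetch the few candidates sharing the
    # hash, then compare the full url only for those rows.
    rows = db.execute(
        "SELECT url FROM moz_places WHERE url_hash = hash(?1)", (url,)
    ).fetchall()
    if any(r[0] == url for r in rows):
        return False  # already present
    db.execute(
        "INSERT INTO moz_places(url, url_hash) VALUES (?1, hash(?1))", (url,)
    )
    return True

add_page("https://www.mozilla.org/")
add_page("https://bugzilla.mozilla.org/")
print(add_page("https://www.mozilla.org/"))  # False: duplicate caught via the hash
```

Note the sketch keeps the index non-unique and enforces uniqueness by comparing the full url among rows sharing a hash, since two distinct urls may legitimately collide; a strictly unique index on url_hash alone would reject such pairs.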
(Reporter)

Comment 3

5 years ago

(In reply to Marco Bonardo [:mak] from comment #2)
> The hashing may either be done at the Places level or (better) implemented as a SQLite function.

> note: I said 64bits, cause we'd need a 32 bits integer column for the hash
> and an index replicating it.

We could also ask Dr. Hipp to implement function indexes... That might not be more complicated than changing all of the callsites to deal with a new column, and it would have the advantage of not wasting extra space.
(In reply to Taras Glek (:taras) from comment #3)
> We could also ask Dr Hipp to implement function indexes...That might not be
> more complicated than changing all of the callsites to deal with a new
> column and have the advantage of not wasting extra space

See Hash Indexes on this page:
http://www2.sqlite.org/cvstrac/wiki?p=ProposedIncompatibleChanges

That said, the given reasoning is weak for us, in the sense that we may not care about in-order retrieval; in Places we order by URI only in a very specific and not very interesting case (clicking the Library url column).
Though that may be a valid reason why a hash index can't become the default choice in a generic storage like IndexedDB.
I am under the impression we can't use 32 bits, considering the number of urls in the world (http://www.worldwidewebsize.com/ says about 50 billion pages indexed by Google; we can ignore the rest for a rough estimate). With 64 bits we get 1 collision every 4 billion hashes; to get 1 collision out of 70 billion hashes we need at least 72 bits.
(Reporter)

Comment 7

5 years ago
(In reply to Marco Bonardo [:mak] from comment #6)
> I am under the impression we can't use 32 bits, considered number of urls in
> the world (http://www.worldwidewebsize.com/ says about 50 billions indexed
> pages on Google, we may ignore the rest for a rough estimate), with 64 bits
> we get 1 collision every 4 billions hashes, to have 1 collision out of 70
> billions hashes we need at least 72 bits.

We'll just need a little extra logic to compare the actual url. Based on my history (42K urls), a 16-bit index would do.
Comparing the urls is still one (or more, depending on how many collisions we have) additional string compare we could avoid, though.
It's the usual size vs. performance trade-off; we can decide either direction.

FWIW, I have 100k urls, and my urls are likely different from yours; each user owns a subgroup of all the urls on the web, and it's hard to predict collisions within a subgroup.

16 bits would give us 1 collision every 256 urls, which doesn't sound too good even for your 40k pages. Telemetry will definitely help tell us the collision rate; we could start with 32 bits (likely to collide once in an average db) and see.
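The collision figures being traded back and forth here follow from the standard birthday approximation. A quick sketch (assuming a uniformly distributed hash; the function name is mine, not from any patch):

```python
def expected_collisions(n_urls, hash_bits):
    """Expected number of colliding pairs when hashing n_urls uniformly
    into 2**hash_bits values (birthday approximation: n*(n-1)/2 / 2**bits)."""
    return n_urls * (n_urls - 1) / 2 / 2 ** hash_bits

# The database sizes discussed above:
for bits in (16, 32, 64):
    for n in (42_000, 100_000):
        print(f"{bits:2d} bits, {n:7,} urls -> "
              f"~{expected_collisions(n, bits):,.2f} expected collisions")
```

This reproduces the intuitions above: 16 bits would give thousands of collisions for a 42k-url history, 32 bits roughly one collision in a 100k-url db, and 64 bits essentially none.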
(Reporter)

Comment 9

5 years ago
(In reply to Marco Bonardo [:mak] from comment #8)
> 16 bits would give us 1 collision every 256 urls, doesn't sound too good
> even for your 40k pages. Definitely telemetry will help telling us
> collisions rate, we could start with 32 (likely to collide once in an
> average db) and see.

I think you are worrying too much about collisions. Even with a 1/256 collision rate, we are doing 256x less IO.

I think we should just pick a hashing algo without worrying too much about collision perf.
I'm worrying because on collision we will have to compare urls; that's fine, except that the search becomes linear at that point.
If I simply calculate 100k (pages in my db) / 256 (possible collisions) = 390, that means in the worst case SQLite will have to do a linear scan and string compare across 390 entries, plus it will have to actually fetch those 390 entries for the scan, versus a linear scan and fetch of 2 or 3 entries.
Created attachment 814619 [details] [diff] [review]
wip

This implements a hash() SQL function based on what we already have in mfbt (we use it for internal hash tables and atoms). I ran it on a couple of profiles; on mine (106k entries) it returns 2 hash conflicts, so it appears to be quite good, and we know it's fast per the analysis done in bug 729952.
See also http://mxr.mozilla.org/mozilla-central/source/mfbt/HashFunctions.h

I think it may be a good starting point for an experimental patch.
Assignee: nobody → mak77
Blocks: 950073
Keywords: perf
Whiteboard: [triage]

Updated

4 years ago
Whiteboard: [triage]

Updated

4 years ago
Whiteboard: p=0
I investigated using the new WITHOUT ROWID feature of SQLite. Unfortunately it's not usable for our use case: while it makes selecting by url faster and avoids duplicating the url into its own index, every other index in moz_places ends up duplicating the primary key value, which is now the url. So my 70MB database becomes a 150MB one :( The use cases for WITHOUT ROWID look fewer than I expected.

Though we can still use WITHOUT ROWID to save space and increase the speed of other tables, where we currently have compound indices. I will file bugs for specific cases once I have some numbers.

I guess at this point the only plausible solution for this specific issue is the hash. This will break some use cases:
- uriIsPrefix queries, basically queries getting all pages beginning with a given string, will become extremely slow. The feature will have to be dropped (it would end up scanning hundreds of thousands of urls one by one).
- for basically the same reason, we won't be able to query entries by protocol with tricks like "url BETWEEN 'file://' AND 'file:/~'" or "url BETWEEN 'place:' AND 'place;'"; we'll need a further column to filter these.

For the latter case we could make the hash be "protocol:hash", though it would take a bit more space.
Blocks: 977149
Blocks: 981530
No longer blocks: 950073

Updated

4 years ago
Blocks: 950073
No longer blocks: 981530

Updated

4 years ago
No longer blocks: 950073
Flags: firefox-backlog+

Updated

4 years ago
Summary: moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong

Updated

4 years ago
Whiteboard: p=0 → p=3

Updated

4 years ago
Whiteboard: p=3 → p=3 [qa-]

Updated

4 years ago
Points: --- → 3
Flags: qe-verify-
Whiteboard: p=3 [qa-]
Likely 5 points rather than 3.
Points: 3 → 5
Priority: -- → P1
Blocks: 1214723
Priority: P1 → P2
Whiteboard: [fxsearch]
Depends on: 1158511
These are the cases we need to address, since they won't work anymore in the new index-less world; or rather, they will keep working but will be slow as hell.

/toolkit/components/places/nsNavHistory.cpp (View Hg log or Hg annotations)

    line 1030 -- "WHEN url BETWEEN 'place:' AND 'place;' "
    line 1824 -- "AND h.url BETWEEN 'file://' AND 'file:/~' "
    line 3373 -- clause.Condition("h.url >= ").Param(":uri")
    line 3374 -- .Condition("h.url <= ").Param(":uri_upper");
    line 3464 -- clause.Condition("NOT h.url BETWEEN 'place:' AND 'place;'");

/toolkit/components/places/History.jsm (View Hg log or Hg annotations)

    line 481 -- WHEN url BETWEEN 'place:' AND 'place;'

/toolkit/components/places/SQLFunctions.cpp (View Hg log or Hg annotations)

    line 491 -- "(url > 'place:' AND url < 'place;') "

/dom/push/PushRecord.jsm (View Hg log or Hg annotations)

    line 154 -- p.url >= :urlLowerBound AND p.url <= :urlUpperBound AND

To fix the place: and file: checks, I'll likely need to add a prefix to the hashes, but that should be fine since it's just a few entries.

The other history checks are due to the uriIsPrefix query option, which we never actually used in the product, though it looks like the addon-sdk exposes it through createQuery:
http://mxr.mozilla.org/mozilla-central/source/addon-sdk/source/lib/sdk/places/utils.js#143
I'll file a separate bug to remove uriIsPrefix and verify the impact there.

The most problematic one, because I'm not sure what it is doing, is dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I think we can restrict the records using the rev_host column and then it should be fine. Again, this needs a separate bug.
Depends on: 1243778
Depends on: 1243779
Summary: Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → moz_places_url_uniqueindex takes up to 1/3 of places.sqlite
(In reply to Marco Bonardo [::mak] from comment #14)
> The most problematic thing cause I'm not sure what is doing is
> dom/push/PushRecord.jsm. Looks like it's trying to query by origin, I think
> we can restrict the records using the rev_host column and then it should be
> fine. Again this needs a separate bug.

Yup, it's querying the last visit to a given origin. The push code uses this to calculate the background message quota, which changes based on the number of days the user last visited the site.

Would it be sufficient to replace `p.url` with `rev_host` and `PlacesUtils.getReversedHost`? Would we need to restrict the query to account for the scheme and port?

Thanks for filing the bug!
(In reply to Kit Cambridge [:kitcambridge] from comment #15)
> Yup, it's querying the last visit to a given origin. The push code uses this
> to calculate the background message quota, which changes based on the number
> of days the user last visited the site.
> 
> Would it be sufficient to replace `p.url` with `rev_host` and
> `PlacesUtils.getReversedHost`? Would we need to restrict the query to
> account for the scheme and port?

The thing that will be slow in the "new world" will be queries doing "url >", "url <", "url BETWEEN", so I suppose that would do. But whether it should be restricted by scheme and port depends on the push design requirements. Btw, let's keep the discussion in the specific Push bug; I will try to look into it soon and we can evaluate what to do.
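As a sketch of the rev_host idea (not the actual Push fix: the schema is reduced to the relevant columns, and rev_host here is a hypothetical stand-in for PlacesUtils.getReversedHost), querying the last visit for an origin via an equality match on rev_host avoids the url range scan entirely:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE moz_places (
        id INTEGER PRIMARY KEY,
        url TEXT,
        rev_host TEXT,          -- host reversed, e.g. 'gro.allizom.www.'
        last_visit_date INTEGER
    );
""")
db.executemany(
    "INSERT INTO moz_places(url, rev_host, last_visit_date) VALUES (?, ?, ?)",
    [("https://www.mozilla.org/a", "gro.allizom.www.", 100),
     ("https://www.mozilla.org/b", "gro.allizom.www.", 300),
     ("https://example.com/",     "moc.elpmaxe.",     200)])

def rev_host(host):
    # Hypothetical stand-in for PlacesUtils.getReversedHost.
    return host[::-1] + "."

# Last visit for an origin via an equality match on rev_host,
# instead of the slow "p.url >= :urlLowerBound AND p.url <= :urlUpperBound".
row = db.execute(
    "SELECT MAX(last_visit_date) FROM moz_places WHERE rev_host = ?",
    (rev_host("www.mozilla.org"),)).fetchone()
print(row[0])  # 300
```

As noted above, whether the query additionally needs to match scheme and port is a Push design question; this equality match covers the host only.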
Created attachment 8749808 [details]
Bug 889561 - moz_places_url_uniqueindex takes up too much database space

Review commit: https://reviewboard.mozilla.org/r/51163/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/51163/
Attachment #814619 - Attachment is obsolete: true
Comment on attachment 8749808 [details]
Bug 889561 - moz_places_url_uniqueindex takes up too much database space

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/1-2/
Comment on attachment 8749808 [details]
Bug 889561 - moz_places_url_uniqueindex takes up too much database space

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/2-3/
Attachment #8749808 - Attachment description: MozReview Request: Bug 889561 - moz_places_url_uniqueindex takes up too much database space → Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Created attachment 8763819 [details]
Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places.

Review commit: https://reviewboard.mozilla.org/r/59938/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/59938/
Attachment #8763819 - Flags: review?(adw)
Attachment #8749808 - Attachment is obsolete: true
The patch adds a hash() SQL function that accepts up to 2 arguments: the first is the string to hash, the second (optional) is a hashing mode (currently prefix_hi or prefix_lo, for querying schemes).
Maintenance will take care of any misses caused by add-ons.
It also changes expiration to adapt better to each individual db, while trying to limit the db size to 60MiB (the previous 160MiB limit was unrealistic). We should take a close look at telemetry after the change. When we move favicons out of the db we can likely improve this further.
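A rough model of that two-argument hash() function (assuming the layout described later in review: a 16-bit prefix hash in the high bits and a 32-bit string hash in the low bits, with FNV-1a standing in for the real mfbt hash; the actual SQLFunctions.cpp implementation differs in detail):

```python
import sqlite3

def str_hash32(s):
    # 32-bit FNV-1a, standing in for the real mfbt string hash.
    h = 0x811C9DC5
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def places_hash(value, mode=""):
    # 48-bit layout: top 16 bits hash the prefix before ':', low 32 bits
    # hash the whole string. prefix_lo/prefix_hi return the bounds of the
    # range covering every url with that prefix.
    prefix = value.split(":", 1)[0]
    hi16 = str_hash32(prefix) & 0xFFFF
    if mode == "prefix_lo":
        return hi16 << 32
    if mode == "prefix_hi":
        return (hi16 << 32) | 0xFFFFFFFF
    return (hi16 << 32) | str_hash32(value)

db = sqlite3.connect(":memory:")
db.create_function("hash", 1, places_hash)
db.create_function("hash", 2, places_hash)
db.execute("CREATE TABLE moz_places (url TEXT, url_hash INTEGER)")
for u in ("place:folder=TOOLBAR", "place:sort=1", "https://mozilla.org/"):
    db.execute("INSERT INTO moz_places VALUES (?1, hash(?1))", (u,))

# Scheme filtering without the old "url BETWEEN 'place:' AND 'place;'" trick:
n = db.execute(
    "SELECT count(*) FROM moz_places WHERE url_hash BETWEEN "
    "hash('place', 'prefix_lo') AND hash('place', 'prefix_hi')").fetchone()[0]
print(n)  # the two place: urls (barring an unlucky 16-bit prefix collision)
```

The point of the prefix modes is that a BETWEEN on the integer hash can still use the index, restoring the scheme-range queries that the plain string index used to serve.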
Comment on attachment 8763819 [details]
Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59938/diff/1-2/

Comment 23

2 years ago
Comment on attachment 8763819 [details]
Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places.

https://reviewboard.mozilla.org/r/59938/#review57468

This looks great, I don't see any problems.  Nice work!

::: toolkit/components/places/SQLFunctions.cpp:882
(Diff revision 2)
> +
> +    RefPtr<nsVariant> result = new nsVariant();
> +    if (mode.IsEmpty()) {
> +      // URI-like strings (having a prefix before a colon), are handled specially,
> +      // as a 48 bit hash, where first 16 bits are the prefix hash, while the
> +      // other 48 are the string hash.

Typo: should be "other 32" right?
Attachment #8763819 - Flags: review?(adw) → review+
(In reply to Drew Willcoxon :adw from comment #23)
> Typo: should be "other 32" right?

Yep. I'll fix that as soon as I can recover a working tree; some recent pull broke it in a dom subfolder I never touched...

Comment 25

2 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/ceff61c9fc5a
Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw

Comment 26

2 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/07e1f152fe92
followup - Missing changes to Bookmarks.jsm. r=post-facto
This caused frequent failures like https://treeherder.mozilla.org/logviewer.html#?job_id=10160078&repo=fx-team on pgo, so I had to back it out, sorry.

Updated

2 years ago
Flags: needinfo?(mak77)

Comment 28

2 years ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/3fa6d81f720a
Backed out changeset 07e1f152fe92 
https://hg.mozilla.org/integration/fx-team/rev/cda8c83f6a57
Backed out changeset ceff61c9fc5a for frequent test failures on pgo in /bookmarks/test_
Testing a fix on Try.
If I'm right, the problem is that history tries to start expiration very late on shutdown, and now that fails because at expiration startup we try to query the database, which is already closed at that point, so we bail out with an exception.
This happens mostly in tests; it's unlikely to happen in reality.
The fix only requires fixing our AsyncShutdown blockers to properly handle the exception we throw when we can't open the db because it's too late.
Flags: needinfo?(mak77)

Comment 31

2 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/21585b3e4814
Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw

Comment 32

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/21585b3e4814
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50

Comment 33

2 years ago
After this change landed, an alert was raised for the telemetry probe PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems the change in nsPlacesExpiration's DATABASE_MAX_SIZE resulted in a (presumably temporary) change in how long uris stay in the database before being expired?
(In reply to Chris H-C :chutten from comment #33)
> After this change landed, an alert for the telemetry probe
> PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS was raised [1]. Looking at this patch,
> it seems as though maybe the change in nsPlacesExpiration's
> DATABASE_MAX_SIZE resulted in a (presumably-temporary) change in how long
> uris are in the database before being expired?

Yes, we are constantly checking that probe to find the right compromise for our performance work. The scope of the probe is to verify that the expiration changes are not causing large data loss.
The only meaningful regression would be if the value became *much* lower than it is now; then we would actually be losing a meaningful amount of the user's history.

I'm sure there will be further changes in the next months, since we are trying to make places.sqlite smaller.
Blocks: 1296958
Depends on: 1318650
Blocks: 977053