Bug 669410 - Use a PrefixSet to reduce amount of times SQLite DB must be queried
Status: RESOLVED FIXED
Product: Toolkit
Classification: Components
Component: Safe Browsing
Version: unspecified
Platform: All All
Importance: -- minor with 2 votes
Target Milestone: Firefox 9
Assigned To: Gian-Carlo Pascutto [:gcp]
Mentors:
Depends on:
Blocks: 650649 669407
 
Reported: 2011-07-05 12:14 PDT by Gian-Carlo Pascutto [:gcp]
Modified: 2014-05-27 12:25 PDT
CC: 29 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch v1 1. Add the PrefixTree datastructure (14.80 KB, patch)
2011-07-14 08:37 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v1 2. Use the Prefix Tree to short-circuit lookups (22.61 KB, patch)
2011-07-14 08:38 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v1 3. Load the PrefixTree from the DB in chunks (12.60 KB, patch)
2011-07-14 08:39 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v1 4. Delay SQLite loading until fragments are checked vs the PrefixTree (7.64 KB, patch)
2011-07-14 08:39 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v2 1. Add the PrefixTree datastructure (23.35 KB, patch)
2011-07-22 01:58 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v2 2. Use the Prefix Tree to short-circuit lookups (25.59 KB, patch)
2011-07-22 01:58 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v2 3. Probe the PrefixSet from the main thread (43.21 KB, patch)
2011-07-22 01:59 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v2 4. Reduce SQLite caches when not updating (9.59 KB, patch)
2011-07-22 02:00 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Details | Diff | Review
Patch v3 1. Add the PrefixTree datastructure (23.15 KB, patch)
2011-07-25 22:45 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Details | Diff | Review
Patch v3 2. Use the Prefix Tree to short-circuit lookups (26.76 KB, patch)
2011-07-25 22:45 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: feedback+
Details | Diff | Review
Patch v3 3. Probe the PrefixSet from the main thread (44.66 KB, patch)
2011-07-25 22:46 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: feedback+
Details | Diff | Review
Patch v3 4. Reduce SQLite caches when not updating (12.44 KB, patch)
2011-07-25 22:47 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Details | Diff | Review
Patch v3 5. Make the PrefixSet persistent + consistency fix (23.02 KB, patch)
2011-07-27 22:53 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v3 6. Remove the old fragment caches (3.55 KB, patch)
2011-07-28 03:03 PDT, Gian-Carlo Pascutto [:gcp]
dcamp: review+
Details | Diff | Review
Patch v4 1. Add the PrefixTree datastructure (23.36 KB, patch)
2011-08-03 03:08 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v4 2. Use the Prefix Tree to short-circuit lookups (26.83 KB, patch)
2011-08-03 03:09 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: feedback+
Details | Diff | Review
Patch v4 3. Probe the PrefixSet from the main thread (43.51 KB, patch)
2011-08-03 03:12 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: feedback+
Details | Diff | Review
Patch v4 4. Reduce SQLite caches when not updating (12.45 KB, patch)
2011-08-03 03:13 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v4 5. Make the PrefixSet persistent + consistency fix (23.00 KB, patch)
2011-08-03 03:14 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v4 6. Remove the old fragment caches (3.55 KB, patch)
2011-08-03 03:15 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v4 7. Remove storage of hostkeys (23.65 KB, patch)
2011-08-03 03:18 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v5 1. Add the PrefixTree datastructure (23.35 KB, patch)
2011-08-29 02:21 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v5 2. Use the Prefix Tree to short-circuit lookups (26.82 KB, patch)
2011-08-29 02:23 PDT, Gian-Carlo Pascutto [:gcp]
tony: review+
gpascutto: feedback+
Details | Diff | Review
Patch v5 3. Probe the PrefixSet from the main thread (43.56 KB, patch)
2011-08-29 02:25 PDT, Gian-Carlo Pascutto [:gcp]
tony: review+
gpascutto: feedback+
Details | Diff | Review
Patch v5 4. Reduce SQLite caches when not updating (12.46 KB, patch)
2011-08-29 02:27 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v5 5. Make the PrefixSet persistent + consistency fix (22.94 KB, patch)
2011-08-29 02:28 PDT, Gian-Carlo Pascutto [:gcp]
tony: review-
Details | Diff | Review
Patch v5 6. Remove the old fragment caches (3.56 KB, patch)
2011-08-29 02:29 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v5 7. Remove storage of hostkeys (26.78 KB, patch)
2011-08-29 02:32 PDT, Gian-Carlo Pascutto [:gcp]
tony: review+
Details | Diff | Review
Patch v5 8. Rehash with a per-user key to prevent collisions (24.98 KB, patch)
2011-08-29 02:33 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v5. 9. Add telemetry for new PrefixSet operations (4.22 KB, patch)
2011-08-29 02:34 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v5. 9. Add telemetry for new PrefixSet operations (4.22 KB, patch)
2011-08-29 02:59 PDT, Gian-Carlo Pascutto [:gcp]
taras.mozilla: review+
Details | Diff | Review
Patch v6 1. Add the PrefixTree datastructure (23.35 KB, patch)
2011-09-06 13:12 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 2. Use the Prefix Tree to short-circuit lookups (26.89 KB, patch)
2011-09-06 13:13 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 3. Probe the PrefixSet from the main thread (43.62 KB, patch)
2011-09-06 13:14 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 4. Reduce SQLite caches when not updating (12.46 KB, patch)
2011-09-06 13:15 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 5. Make the PrefixSet persistent + consistency fix (20.10 KB, patch)
2011-09-06 13:16 PDT, Gian-Carlo Pascutto [:gcp]
tony: review+
Details | Diff | Review
Patch v6 6. Remove the old fragment caches (3.56 KB, patch)
2011-09-06 13:17 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 7. Remove storage of hostkeys (26.83 KB, patch)
2011-09-06 13:18 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v6 8. Rehash with a per-user key to prevent collisions (22.51 KB, patch)
2011-09-06 13:19 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review
Patch v6. 9. Add telemetry for new PrefixSet operations (4.22 KB, patch)
2011-09-06 13:20 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v7 8. Rehash with a per-user key to prevent collisions (23.07 KB, patch)
2011-09-06 15:57 PDT, Gian-Carlo Pascutto [:gcp]
tony: review+
Details | Diff | Review
Patch v7 5. Make the PrefixSet persistent + consistency fix (20.02 KB, patch)
2011-09-08 07:50 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v8 8. Rehash with a per-user key to prevent collisions (23.06 KB, patch)
2011-09-08 07:54 PDT, Gian-Carlo Pascutto [:gcp]
gpascutto: review+
Details | Diff | Review
Patch v1 9. Add missing copyright header (2.47 KB, patch)
2011-09-16 14:29 PDT, Gian-Carlo Pascutto [:gcp]
no flags Details | Diff | Review

Description Gian-Carlo Pascutto [:gcp] 2011-07-05 12:14:03 PDT
Currently, on each site visited, Firefox will query a roughly 50 MB SQLite database to gather the list of SHA-256 prefixes of hashed URLs with malware or phishing links on that domain. This list is potentially empty if there is no known malware on that domain.

Because the hashes are distributed randomly, browsing at a reasonable speed effectively requires this DB to be cached in memory. There is some code already present to speed up the most common cases (browsing links within a domain, revisiting a clean domain).

With limited changes and using some stuff we already have, it's possible to avoid hitting the disk in almost all cases. This is a first phase in getting rid of the SQLite DB entirely.
Comment 1 Gian-Carlo Pascutto [:gcp] 2011-07-05 12:25:22 PDT
It's possible to build a prefix tree of known hashes and domains on startup and each time the SQLite DB gets an update. Mehdi Mulani already wrote a JavaScript version of that datastructure including the needed tests.

To get this working, the following is still required:

1) JavaScript can only run on the main thread, whereas the Safe Browsing code is using a worker thread in the background. I managed to (painfully, due to inexperience) work around this by using NS_GetProxyForObject. Need to investigate if this can be replaced by nsIRunnable.

2) There are about 1M prefixes in the DB right now. The JavaScript to construct the tree takes some time and blocks the main thread. We probably need to construct it incrementally to avoid blocking the UI too much.

3) The DB stores domain hashes, prefix hashes, and full hashes. We can ignore the full hashes. As long as we keep the SQLite DB, there is no disadvantage. If it disappears, we will send a few more queries to Google when a user revisits a known malware or phishing site (not a common usage scenario!!). I managed to store both the domain hashes and prefix hashes in the same tree by XOR-ing them with magic constants, but this still needs some debugging.
Comment 2 Brian Ryner 2011-07-07 14:25:19 PDT
If you're talking about building a prefix tree of all hashes that are in the client database, I think this will probably be pretty slow -- we were doing something similar in C++ on the server side and it took about 5 minutes to build the data structure (and you'd have to repeat this on every update).  You might want to investigate using a Bloom filter like Chrome is doing.

Also, just to clarify, the host keys are provided as a hint, but it's not required that you use them if your data structure works better without them (as is the case in Chrome's implementation).
Comment 3 Gian-Carlo Pascutto [:gcp] 2011-07-07 22:59:01 PDT
>If you're talking about building a prefix tree of all hashes that are in the 
>client database, I think this will probably be pretty slow -- we were doing 
>something similar in C++ on the server side and it took about 5 minutes to 
>build the data structure (and you'd have to repeat this on every update).

Hrm, we have this working already, and it's taking about 5s, in debug mode, with the Prefix Tree code in JavaScript. The problem is that we currently block the main thread during this 5s.

When starting up it can be deserialized from disk, so it should be even faster.

The tricky part will be handling updates. I'd like to update the tree in-place, but there are many nasty cases, like deleting a key that causes the 2 neighboring keys to be more than 2^16 away from each other. So it remains to be seen if this is really feasible. Reconstructing the tree on each update would be annoying, especially on mobile where it will eat more than 5s CPU time.

>You might want to investigate using a Bloom filter like Chrome is doing.

I understood they are moving away from that. There are some disadvantages in that approach.

The main problem is that you can never have it as your main data-structure, because it has collisions. So you need the bloom filter *and* the database. Chrome reduced the overhead by storing the database more efficiently than SQLite does, but it's still overhead.

It also has false positives, which would cause us to leak where the user is browsing. 

The advantage of the Bloom filter is that you can control the size of the in-memory part.

>Also, just to clarify, the host keys are provided as a hint, but it's not 
>required that you use them if your data structure works better without them 
>(as is the case in Chrome's implementation).

I've left them in because they allow taking an early exit and not running the entire fragmentation and lookup of all the generated URLs. They take up about half of the size of the tree, though, so I might want to revisit this after measuring the speed impact. You certainly could be right that they aren't worth it.
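
For readers following along, here is a minimal sketch of the kind of delta-encoded prefix storage being discussed, with hypothetical names (this is not the actual JS or later C++ implementation). It shows why a delete that leaves two neighbouring keys more than 2^16 apart cannot be absorbed by a single 16-bit delta and forces restructuring:

#include <cstdint>
#include <vector>

// Sorted 32-bit prefixes are split into runs: each run stores one full
// 32-bit "index" prefix plus 16-bit deltas to the following prefixes.
struct DeltaSet {
  std::vector<uint32_t> indexPrefixes;  // first prefix of each run
  std::vector<size_t>   indexStarts;    // where each run begins in |deltas|
  std::vector<uint16_t> deltas;         // gaps within a run, all < 2^16
};

DeltaSet Encode(const std::vector<uint32_t>& sortedPrefixes) {
  DeltaSet set;
  if (sortedPrefixes.empty()) return set;

  set.indexPrefixes.push_back(sortedPrefixes[0]);
  set.indexStarts.push_back(0);
  uint32_t previous = sortedPrefixes[0];

  for (size_t i = 1; i < sortedPrefixes.size(); ++i) {
    uint32_t gap = sortedPrefixes[i] - previous;
    if (gap >= (1u << 16)) {
      // Gap no longer fits in 16 bits: start a new run with a full prefix.
      // Deleting a key can create exactly this situation between its former
      // neighbours, forcing a restructure rather than a local edit.
      set.indexPrefixes.push_back(sortedPrefixes[i]);
      set.indexStarts.push_back(set.deltas.size());
    } else {
      set.deltas.push_back(static_cast<uint16_t>(gap));
    }
    previous = sortedPrefixes[i];
  }
  return set;
}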
Comment 4 Brian Ryner 2011-07-08 00:35:33 PDT
(In reply to comment #3)
> Hrm, we have this working already, and it's taking about 5s, in debug mode,
> with the Prefix Tree code in JavaScript. The problem is that we currently
> block the main thread during this 5s.
> 
> When starting up it can be serialized from disk, so it should be even faster.

Ah, ok great.  Could be that the implementation we were using was just particularly slow to build the tree.

Any stats on the memory usage of the prefix tree once constructed?

> >You might want to investigate using a Bloom filter like Chrome is doing.
> 
> I understood they are moving away from that. There are some disadvantages in
> that approach.
> 
> The main problem is that you can never have it as your main data-structure,
> because it has collisions. So you need the bloom filter *and* the database.
> Chrome reduced the overhead by storing the database more efficiently than
> SQLite does, but it's still overhead.

Ah, sounds like you're more up to date on the Chrome implementation than I am.  Agreed on your points about needing a database to back the Bloom filter, as well as the FP's (hash prefixes don't uniquely identify a URL, but it's understandable if you want to avoid unneeded pingbacks).

One other thing I was wondering about the prefix tree approach -- how would you handle chunk deletes? (ad: and sd:)

> I've left them in because they allow taking an early exit and not run the
> entire fragmentation and lookup of all the generated URLs. They take about
> half of the size of the tree though, so I might want to revisit this after
> measuring the speed impact. You certainly could be right that they aren't
> worth it.

Sure, sounds reasonable.
Comment 5 Brian Ryner 2011-07-08 00:40:13 PDT
(Just to be clear, I meant how would you handle chunk deletes if you were to get rid of the SQLite database, as it sounds like you want to do eventually)
Comment 6 Gian-Carlo Pascutto [:gcp] 2011-07-08 01:14:02 PDT
>Any stats on the memory usage of the prefix tree once constructed?

Currently it's 2.2M, but this is including the domain prefixes. Without them, it would be about 1.2M.

I will have some patches for this up as soon as (1) and (2) in comment 1 are OK.

>One other thing I was wondering about the prefix tree approach -- how would 
>you handle chunk deletes? (ad: and sd:)

The tree needs to be augmented with the chunk information, which will double its size (which should still be acceptable afterwards). Then we'll either have to reconstruct it (easier/slower) or delete entries from it (hairy/faster).

There are tricky aspects here. The chunk numbers are bigger than 2^16, but there are only about 12000 distinct ones in use, so we preferably want to make use of that redundancy (extra indirection).

I hope I didn't miss anything. The current update code is more complicated than I would expect from the protocol description. I'm not entirely sure what the moz_subs table we have now does, for example.
Comment 7 Gian-Carlo Pascutto [:gcp] 2011-07-14 08:37:50 PDT
Created attachment 545920 [details] [diff] [review]
Patch v1 1. Add the PrefixTree datastructure
Comment 8 Gian-Carlo Pascutto [:gcp] 2011-07-14 08:38:35 PDT
Created attachment 545923 [details] [diff] [review]
Patch v1 2. Use the Prefix Tree to short-circuit lookups
Comment 9 Gian-Carlo Pascutto [:gcp] 2011-07-14 08:39:14 PDT
Created attachment 545924 [details] [diff] [review]
Patch v1 3. Load the PrefixTree from the DB in chunks
Comment 10 Gian-Carlo Pascutto [:gcp] 2011-07-14 08:39:55 PDT
Created attachment 545925 [details] [diff] [review]
Patch v1 4. Delay SQLite loading until fragments are checked vs the PrefixTree
Comment 11 Gian-Carlo Pascutto [:gcp] 2011-07-14 09:04:55 PDT
Here's a first bunch of patches. This is a work in progress; there are still the following issues, and I'm putting it here so other people can start commenting on it.

1) Doug requested that the usage of Proxy objects be replaced by nsIRunnables. I have a patch for this, but it causes errors in an unrelated part of the code handling updates. I have no idea why; either I messed up and my patch is buggy, there are issues mixing Proxies with Runnables, I'm exposing an existing bug, ... That patch is here:

http://hg.mozilla.org/users/gpascutto_mozilla.com/patches/file/41fcc8537504/remove-prefix-proxy

2) The list of prefixes is now sent to JavaScript in chunks. The idea was that this would prevent the main thread from blocking too long while handling the tree construction. However, this causes a performance drop: the tree construction has to reallocate memory when dealing with each chunk, and because this is using ctypes.ArrayTypes, I don't believe there's a nice JS way to quickly copy everything over, so it's using a simple loop right now. If it's not possible to do that in a nicer way, the code can probably be adjusted to preallocate the memory. But it won't be as nice.

What's worse is that as far as I can tell the responsiveness of the UI is still bad during the tree construction.

3) To make (2) worse, the tree construction seems to be called much more often than expected. It's called on every SQLite db open/close and on every update. Need to instrument where all those calls are coming from, and why it's done so much.

4) I didn't include a patch yet to reduce the SQLite caches. After discussing with dcamp, it likely needs to be smart and increase the caches again when processing an update. The code for this is already present, but disabled. So this will need some (performance) testing. But that doesn't make much sense until the above issues are dealt with.

5) The code right now checks a small cache of host keys in the main thread before firing off an (asynchronous) lookup to the worker thread. With the tree present, that cache can be removed and much work can be postponed until the tree (which lives in the main thread!) is checked. Unfortunately it's also nontrivial, as all the database access currently lives in the worker thread, and should stay there so I/O doesn't block the main thread. I have a partial patch here:

http://hg.mozilla.org/users/gpascutto_mozilla.com/patches/file/41fcc8537504/probe-in-main-thread
Comment 12 Gian-Carlo Pascutto [:gcp] 2011-07-22 01:58:20 PDT
Created attachment 547634 [details] [diff] [review]
Patch v2 1. Add the PrefixTree datastructure
Comment 13 Gian-Carlo Pascutto [:gcp] 2011-07-22 01:58:56 PDT
Created attachment 547635 [details] [diff] [review]
Patch v2 2. Use the Prefix Tree to short-circuit lookups
Comment 14 Gian-Carlo Pascutto [:gcp] 2011-07-22 01:59:40 PDT
Created attachment 547636 [details] [diff] [review]
Patch v2 3. Probe the PrefixSet from the main thread
Comment 15 Gian-Carlo Pascutto [:gcp] 2011-07-22 02:00:17 PDT
Created attachment 547637 [details] [diff] [review]
Patch v2 4. Reduce SQLite caches when not updating
Comment 16 Gian-Carlo Pascutto [:gcp] 2011-07-22 02:22:53 PDT
Second revision. Notes:

1) The PrefixSet is now in C++ instead of JS. This is way faster, removes a lot of thread juggling, makes the chunked loading unnecessary (the prefixset construction is almost instantaneous), and the code is simpler, too. Some bugs for corner cases have been fixed.

2) Patch to reduce the caches, and increase upon updates, is included. But see comment 40 in bug 650649 for some strange stuff.

3) In the main thread, we now do a probe inside the PrefixSet and avoid posting anything to the background thread if we get a clean result. This allows removing all of the old internal caching (hmm, we can probably remove some support code in ClassifierUtils too!). If the probe indicates that the prefixset is still loading, we use the old path.

4) I kept the test code for the old caches that no longer exist (test_cleankeycache.js) but adapted it to the new behavior. This served a bit as a sanity check to see if things were happening as expected. Maybe these tests can go.

5) The code assumes that there is at least one entry in the prefixset. Some of the code in the tests is adapted to ensure this, and in case the urlclassifier store is empty, a sentinel is added. 
I believed that the extra URLs we add (http://www.mozilla.org/firefox/its-a-trap.html) should make an empty prefixset in real use impossible, but that led to: http://tbpl.mozilla.org/?tree=Try&rev=81f95515a021 
I'm not sure I fully understand why that happened.

6) This code keeps the host keys together with the fragment keys, leading to about 2.1M of memory usage. This can be reduced to 1.2M if we don't use the host keys, and hence drop the fast path and opportunistic probe. But I would like to see a performance measurement of the impact. What's the best way to measure this accurately?

These patches alone aren't enough to get SafeBrowsing on mobile, but they do mean that the backing store for the prefixes can be much slower or more compact than before.
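
As a rough illustration of the opportunistic probe in point 3, here is a self-contained sketch with assumed names; the real code uses nsIUrlClassifierPrefixSet, mPrefixTreeLock and XPCOM types rather than this standalone class:

#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for nsUrlClassifierPrefixSet, reduced to the two
// operations the main-thread fast path needs.
class PrefixSet {
 public:
  void SetPrefixes(std::vector<uint32_t> sorted) {
    std::lock_guard<std::mutex> lock(mLock);
    mPrefixes = std::move(sorted);
    mReady = true;
  }

  // Never touches the disk: reports whether the set is usable yet and, if
  // so, whether |aPrefix| might be listed.  A caller that sees
  // aReady == false simply falls back to the old worker-thread lookup.
  bool Probe(uint32_t aPrefix, bool* aReady) {
    std::lock_guard<std::mutex> lock(mLock);
    *aReady = mReady;
    if (!mReady) {
      return false;
    }
    return std::binary_search(mPrefixes.begin(), mPrefixes.end(), aPrefix);
  }

 private:
  std::mutex mLock;
  bool mReady = false;
  std::vector<uint32_t> mPrefixes;  // sorted 32-bit prefixes
};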
Comment 17 Dave Camp (:dcamp) 2011-07-22 10:11:29 PDT
Will try to review this in detail quickly, but I'm pretty concerned about reading through that entire sqlite database at startup before we can classify anything.

It shouldn't be too hard to serialize the 2.1meg prefixset to disk and read that in one fell swoop.  I suspect that'd be a heck of a lot quicker than paging through the entire unnecessarily huge sqlite db at startup?
Comment 18 Gian-Carlo Pascutto [:gcp] 2011-07-22 12:21:36 PDT
Sure, but the above patches need to be working first or we won't have anything to bootstrap the prefixset and serialize :-)

I'm going to start working on that and storing chunks + add/remove now.
Comment 19 Dave Camp (:dcamp) 2011-07-22 12:58:46 PDT
Comment on attachment 547637 [details] [diff] [review]
Patch v2 4. Reduce SQLite caches when not updating


> nsresult
>+nsUrlClassifierDBServiceWorker::SetCacheSize(
>+  nsCOMPtr<mozIStorageConnection> aConnection, PRInt32 aCacheSize)

You can just take a mozIStorageConnection *, the caller is holding a reference for the lifetime of the call.
Comment 20 Dave Camp (:dcamp) 2011-07-22 14:46:01 PDT
Comment on attachment 547634 [details] [diff] [review]
Patch v2 1. Add the PrefixTree datastructure

Review of attachment 547634 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
@@ +105,5 @@
> +  mIndexStarts.AppendElement(mDeltas.Length());
> +
> +  PRUint32 numOfDeltas = 0;
> +  PRUint32 currentItem = prefixes[0];
> +  for (int i = 1; i < aLength; i++) {

Should probably match aLength's data type here.

@@ +146,5 @@
> +  } else if (value > target) {
> +    return BinSearch(start, i-1, target);
> +  }
> +  return i;
> +}

An iterative implementation would be better here.

@@ +181,5 @@
> +
> +  if (mIndexPrefixes[i] > target ||
> +      (i < mIndexPrefixes.Length() - 2 &&
> +       mIndexPrefixes[i+2] < target)) {
> +    LOG(("XXX: this should not happen"));

NS_WARNING(), or maybe even NS_ASSERTION() should be used here.
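
For reference, a minimal sketch of the iterative binary search being requested, written against an assumed plain-vector layout rather than the patch's actual members:

#include <cstdint>
#include <vector>

// Returns the index of the last element <= target, or 0 if target is
// smaller than everything; callers still handle the "not found" cases.
static uint32_t BinSearch(const std::vector<uint32_t>& values,
                          uint32_t target) {
  uint32_t lo = 0;
  uint32_t hi = static_cast<uint32_t>(values.size());
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (values[mid] <= target) {
      lo = mid + 1;   // values[mid] is not past target, search the right half
    } else {
      hi = mid;       // values[mid] is past target, search the left half
    }
  }
  // lo is now the first index whose value is > target.
  return lo > 0 ? lo - 1 : 0;
}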
Comment 21 Dave Camp (:dcamp) 2011-07-22 15:41:56 PDT
Comment on attachment 547635 [details] [diff] [review]
Patch v2 2. Use the Prefix Tree to short-circuit lookups

Review of attachment 547635 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
@@ +12,2 @@
>    boolean contains(in unsigned long aPrefix);
> +  PRUint32 estimateSize();

(Usually would need to update uuid while updating an interface, but this will presumably land at the same time, so not a big deal).

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +3957,5 @@
>    NS_ENSURE_SUCCESS(rv, rv);
>  
> +  mPrefixSet = do_CreateInstance("@mozilla.org/url-classifier/prefixset;1", &rv);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +

I'd just use "new nsUrlClassifierPrefixSet()", no need to get xpcom involved here.  You can use nsRefPtr rather than nsCOMPtr to deal with the C++ class instead of the nsIGarbage.

Sad that we need to make prefixset an xpcom object at all, but it does make testing more convenient.
Comment 22 Dave Camp (:dcamp) 2011-07-22 15:48:58 PDT
Comment on attachment 547636 [details] [diff] [review]
Patch v2 3. Probe the PrefixSet from the main thread

Review of attachment 547636 [details] [diff] [review]:
-----------------------------------------------------------------

This patch blocks the main UI for the amount of time it takes to build the prefixset.  That's a somewhat large operation, are you confident it'll never get near the 50ms max we try to avoid blocking?

It might be better (though with some memory consumption tradeoffs) to build up a new prefixset without holding the lock, and lock just to swap it in.

::: toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
@@ +9,5 @@
>                     in unsigned long aLength);
>    void addPrefixes([const, array, size_is(aLength)] in unsigned long aPrefixes,
>                     in unsigned long aLength);
>    boolean contains(in unsigned long aPrefix);
> +  boolean probe(in unsigned long aPrefix, inout boolean aReady);

When adding methods to the idl, you should generate a new uuid for the interface.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1415,5 @@
>  
> +  *clean = PR_TRUE;
> +
> +  for (int i = 0; i < lookupHosts.Length(); i++) {
> +    nsUrlClassifierDomainHash hostKeyHash;

Match iterator and length data types here please
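
A sketch of the "build without the lock, swap under the lock" approach suggested at the top of this review, using hypothetical standard-library types rather than the patch's nsTArrays and mutex wrapper:

#include <algorithm>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

class PrefixSet {
 public:
  void SetPrefixes(std::vector<uint32_t> sortedPrefixes) {
    // All the expensive construction happens with no lock held, so
    // main-thread probes keep running against the previous data.
    auto fresh = std::make_shared<const std::vector<uint32_t>>(
        std::move(sortedPrefixes));

    std::lock_guard<std::mutex> lock(mLock);
    mData = std::move(fresh);  // the critical section is a pointer swap
  }

  bool Contains(uint32_t aPrefix) {
    std::shared_ptr<const std::vector<uint32_t>> data;
    {
      std::lock_guard<std::mutex> lock(mLock);
      data = mData;            // grab a reference, then release the lock
    }
    if (!data) {
      return false;
    }
    return std::binary_search(data->begin(), data->end(), aPrefix);
  }

 private:
  std::mutex mLock;
  std::shared_ptr<const std::vector<uint32_t>> mData;
};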
Comment 23 Gian-Carlo Pascutto [:gcp] 2011-07-25 22:45:15 PDT
Created attachment 548369 [details] [diff] [review]
Patch v3 1. Add the PrefixTree datastructure
Comment 24 Gian-Carlo Pascutto [:gcp] 2011-07-25 22:45:49 PDT
Created attachment 548370 [details] [diff] [review]
Patch v3 2. Use the Prefix Tree to short-circuit lookups
Comment 25 Gian-Carlo Pascutto [:gcp] 2011-07-25 22:46:25 PDT
Created attachment 548371 [details] [diff] [review]
Patch v3 3. Probe the PrefixSet from the main thread
Comment 26 Gian-Carlo Pascutto [:gcp] 2011-07-25 22:47:22 PDT
Created attachment 548372 [details] [diff] [review]
Patch v3 4. Reduce SQLite caches when not updating
Comment 27 Gian-Carlo Pascutto [:gcp] 2011-07-27 22:53:20 PDT
Created attachment 549026 [details] [diff] [review]
Patch v3 5. Make the PrefixSet persistent + consistency fix

This also fixes the naming to consistently use PrefixSet instead of PrefixTree.
Comment 28 Gian-Carlo Pascutto [:gcp] 2011-07-28 03:03:29 PDT
Created attachment 549051 [details] [diff] [review]
Patch v3 6. Remove the old fragment caches
Comment 29 Dave Camp (:dcamp) 2011-08-01 15:28:31 PDT
Comment on attachment 548369 [details] [diff] [review]
Patch v3 1. Add the PrefixTree datastructure

Review of attachment 548369 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
@@ +99,5 @@
> +  }
> +
> +  // XXX: Probably shouldn't assume that |aPrefixes| doesn't contain
> +  // duplicates..
> +

Let's either fix this now or have a followup bug filed/mentioned.

@@ +171,5 @@
> +
> +  // XXX: |binsearch| does not necessarily return the correct index (when the
> +  // target is not found) but rather it returns an index at least one away
> +  // from the correct index.
> +  // Because of this, we need to check if the target lies before the beginning

This isn't an XXX in the "fixme" sense, is it?  Just an explanation of what the code is doing?
Comment 30 Dave Camp (:dcamp) 2011-08-01 15:47:01 PDT
Comment on attachment 548371 [details] [diff] [review]
Patch v3 3. Probe the PrefixSet from the main thread

Review of attachment 548371 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1053,5 @@
> +nsresult GetHostKeys(const nsACString &spec,
> +                     nsTArray<nsCString> &hostKeys);
> +
> +// Check for a canonicalized IP address.
> +PRBool IsCanonicalizedIP(const nsACString& host);

Should these two methods be static?

@@ +1413,5 @@
>    NS_ENSURE_SUCCESS(rv, rv);
>  
> +  *clean = PR_TRUE;
> +
> +  for (nsAutoTArray<nsCString, 2>::size_type i = 0;

hrm, that's probably more correct, but PRUint32 is much easier to read (and that particular size_type is a simple typedef).

@@ +1420,5 @@
> +    hostKeyHash.FromPlaintext(lookupHosts[i], mHash);
> +
> +    // First probe the Prefix Tree for presence
> +    PRUint32 domainkey = hostKeyHash.ToUint32() ^ ENCODE_DOMAIN_MAGIC;
> +

Do we have a followup bug to test the perf/memory tradeoffs of caring about domain keys?

@@ +1425,5 @@
> +    PRBool found;
> +    PRBool ready = PR_FALSE;  /* opportunistic probe */
> +     rv = mPrefixSet->Probe(domainkey, &ready, &found);
> +     NS_ENSURE_SUCCESS(rv, rv);
> +     LOG(("CheckCleanHost Probed %X ready: %d found: %d ",

Indentation change here.
Comment 31 Dave Camp (:dcamp) 2011-08-01 15:53:17 PDT
Overall this seems like a good thing to do, and is an important step toward replacing sqlite.

I think we need to quantify the performance changes from the previous implementation before landing, though - in particular:

* The impact of walking the sqlite db to rebuild that prefixtree during an update.
* The memory savings of dropping the sqlite cache aggressively vs. the extra usage of the prefixtree.
* Startup impact of reading the prefixtree on first load, with a full db.

I suspect that all of those are a win here, but we should be deliberate about it.

This is a big enough change that we should be ready to turn it off in aurora if it causes problems.  Might be as simple as reverting to the old codebase if needed.
Comment 32 Dave Camp (:dcamp) 2011-08-01 15:55:16 PDT
Comment on attachment 548371 [details] [diff] [review]
Patch v3 3. Probe the PrefixSet from the main thread

feedback+'ing the remaining patches instead of r+.  They look good to me, but:

a) I think the performance changes should be understood before landing
b) I'm not technically an url-classifier peer. tony@ponderer.org should sign off before landing big changes.
Comment 33 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:08:22 PDT
Created attachment 550336 [details] [diff] [review]
Patch v4 1. Add the PrefixTree datastructure

Carrying forward review
Comment 34 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:09:11 PDT
Created attachment 550337 [details] [diff] [review]
Patch v4 2. Use the Prefix Tree to short-circuit lookups
Comment 35 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:12:07 PDT
Created attachment 550338 [details] [diff] [review]
Patch v4 3. Probe the PrefixSet from the main thread
Comment 36 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:13:34 PDT
Created attachment 550339 [details] [diff] [review]
Patch v4 4. Reduce SQLite caches when not updating

Carrying forward review
Comment 37 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:14:26 PDT
Created attachment 550341 [details] [diff] [review]
Patch v4 5. Make the PrefixSet persistent + consistency fix
Comment 38 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:15:19 PDT
Created attachment 550342 [details] [diff] [review]
Patch v4 6. Remove the old fragment caches

Carrying forward review
Comment 39 Gian-Carlo Pascutto [:gcp] 2011-08-03 03:18:06 PDT
Created attachment 550343 [details] [diff] [review]
Patch v4 7. Remove storage of hostkeys

This is an *optional* patch that reduces the storage from 2.1M to 1.2M at the cost of a GetFragments call for every lookup. It is most likely a win overall, and simplifies the code. See also comment 30. Need to get performance data on this.
Comment 40 Tony Chang (Google) 2011-08-03 09:47:18 PDT
(In reply to comment #32)
> Comment on attachment 548371 [details] [diff] [review] [diff] [details] [review]
> Patch v3 3. Probe the PrefixSet from the main thread
> 
> feedback+'ing the remaining bugs instead of r+.  The look good to me, but:
> 
> a) I think the performance changes should be understood before landing

I too would like to know the performance impact before landing.  Here's an idea for a perf test:
Use wireshark/fiddler to catch safebrowsing traffic and save the response data for a while (a day or maybe a few days).  Keep each update from the server in a separate file and replay them against the old code vs the new PrefixTree code.  It would also be good to see what the perf impact is if the user hasn't used Firefox in a week and a big update occurs.

Another way to measure perf might be to use Test Pilot.  That would require landing the code to collect data from nightlies, but maybe that's OK (I'm not familiar with the release process to know).
Comment 41 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:21:55 PDT
Created attachment 556492 [details] [diff] [review]
Patch v5 1. Add the PrefixTree datastructure
Comment 42 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:23:12 PDT
Created attachment 556493 [details] [diff] [review]
Patch v5 2. Use the Prefix Tree to short-circuit lookups
Comment 43 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:25:45 PDT
Created attachment 556494 [details] [diff] [review]
Patch v5 3. Probe the PrefixSet from the main thread
Comment 44 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:27:06 PDT
Created attachment 556496 [details] [diff] [review]
Patch v5 4. Reduce SQLite caches when not updating
Comment 45 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:28:00 PDT
Created attachment 556497 [details] [diff] [review]
Patch v5 5. Make the PrefixSet persistent + consistency fix
Comment 46 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:29:41 PDT
Created attachment 556498 [details] [diff] [review]
Patch v5 6. Remove the old fragment caches
Comment 47 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:32:11 PDT
Created attachment 556499 [details] [diff] [review]
Patch v5 7. Remove storage of hostkeys
Comment 48 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:33:16 PDT
Created attachment 556501 [details] [diff] [review]
Patch v5 8. Rehash with a per-user key to prevent collisions
Comment 49 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:34:20 PDT
Created attachment 556502 [details] [diff] [review]
Patch v5. 9. Add telemetry for new PrefixSet operations
Comment 50 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:48:44 PDT
Patch rebased to current m-c and expanded with 2 parts:

- As explained in bug 669407, if the hostkeys are dropped we risk that every Firefox browser gets a collision and slowdown on the same site/page. To stop this, all keys are rehashed with a per-user random key. I originally used SHA-256 for this as described in that bug, but it is too slow. It has been replaced by MurmurHash3, and I confirmed that this gives approximately the same number of collisions as SHA-256.

- There is an extra patch which adds telemetry for the critical operations. This allows us to monitor the performance impact of this patch in the field and take action if any users are seeing a bigger than expected performance impact.

Some measurements on my machine (standard desktop, somewhat old disk): 

- Complete update of the database (dcamp's script in bug 673470): before ~20.9s, now ~22.2s.
- Constructing the PrefixSet after an update or on first startup: 0.5-2 seconds, in the background. Most of this is the read of the entire DB.
- Loading the PrefixSet (normal startup): 10-50ms, depending on disk activity. It's 1 disk seek + 1M serial read.

Memory usage changes:

- During normal usage, never more than 1M (down from >40M in worst case before).
- During an update, at most 40M, and basically identical to the old behavior.

The size of the updates has no effect on the performance impact of this patch.
Comment 51 Gian-Carlo Pascutto [:gcp] 2011-08-29 02:59:48 PDT
Created attachment 556507 [details] [diff] [review]
Patch v5. 9. Add telemetry for new PrefixSet operations

Fix silly typo.
Comment 52 Nicholas Nethercote [:njn] 2011-08-29 03:15:58 PDT
(In reply to Gian-Carlo Pascutto (:gcp) from comment #50)
>
> - During normal usage, never more than 1M (down from >40M in worst case
> before).

I like that! :)
Comment 53 (dormant account) 2011-08-29 13:53:22 PDT
Comment on attachment 556507 [details] [diff] [review]
Patch v5. 9. Add telemetry for new PrefixSet operations


> /**
>+ * Url-Classifier telemetry
>+ */
>+#ifdef MOZ_URL_CLASSIFIER
>+HISTOGRAM(URLCLASSIFIER_PS_FILELOAD_TIME, 1, 2000, 10, EXPONENTIAL, "Time spent loading PrefixSet from file (ms)")
>+HISTOGRAM(URLCLASSIFIER_PS_CONSTRUCT_TIME, 1, 10000, 15, EXPONENTIAL, "Time spent constructing PrefixSet from DB (ms)")
>+HISTOGRAM(URLCLASSIFIER_PS_LOOKUP_TIME, 1, 500, 10, EXPONENTIAL, "Time spent per PrefixSet lookup (ms)")
>+#endif

I suspect your max values are overly conservative.
I would change your maximum values: 2s->1s, 10s->3s. The exact value will still be sent with .sum component of the histogram, but the histogram will look nicer.
Comment 54 Tony Chang (Google) 2011-08-29 15:24:10 PDT
Comment on attachment 556493 [details] [diff] [review]
Patch v5 2. Use the Prefix Tree to short-circuit lookups

Review of attachment 556493 [details] [diff] [review]:
-----------------------------------------------------------------

It's been a long time since I've looked at this code, so this is mostly just a review of style and agreement that we should try the prefix tree.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +278,5 @@
>    const PRBool StartsWith(const nsUrlClassifierHash<PARTIAL_LENGTH>& hash) const {
>      NS_ASSERTION(sHashSize >= PARTIAL_LENGTH, "nsUrlClassifierHash must be at least PARTIAL_LENGTH bytes long");
>      return memcmp(buf, hash.buf, PARTIAL_LENGTH) == 0;
>    }
> +  const PRUint32 ToUint32() {

Nit: The const here doesn't really do anything.

@@ +3474,5 @@
> +    const PRUint8 *blobdomain = mDomainPrefixStatement->AsSharedBlob(0, &size);
> +    if (!blobdomain || (size != DOMAIN_LENGTH))
> +      return PR_FALSE;
> +
> +    domainval = *((PRUint32*)blobdomain);

Nit: static_cast?

@@ +3494,5 @@
> +    const PRUint8 *blobdomain = mAllPrefixStatement->AsSharedBlob(0, &size);
> +    if (!blobdomain || (size != DOMAIN_LENGTH))
> +      return PR_FALSE;
> +
> +    domainval = *((PRUint32*)blobdomain);

Nit: static_cast?

@@ +3515,5 @@
> +  }
> +
> +  LOG(("SB domains: %d prefixes: %d fulldomain: %d\n", dcnt, pcnt, fcnt));
> +
> +

Nit: Extra blank line?
Comment 55 Tony Chang (Google) 2011-08-29 15:38:23 PDT
Comment on attachment 556494 [details] [diff] [review]
Patch v5 3. Probe the PrefixSet from the main thread

Review of attachment 556494 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1384,1 @@
>                                                 PRBool *clean)

Nit: Align with ( on previous line.

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
@@ +218,5 @@
>    return NS_OK;
>  }
> +
> +NS_IMETHODIMP
> +nsUrlClassifierPrefixSet::Probe(PRUint32 aPrefix, PRBool* aReady, PRBool* aFound)

Nit: It would be nice if the spacing around * was consistent with the rest of the file.

@@ +222,5 @@
> +nsUrlClassifierPrefixSet::Probe(PRUint32 aPrefix, PRBool* aReady, PRBool* aFound)
> +{
> +  MutexAutoLock lock(mPrefixTreeLock);
> +
> +  /* check whether we are opportunistically probing or should wait */

Nit: Maybe use C++ style // comments so it's easier to comment out a block?

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.h
@@ +79,3 @@
>    PRUint32 BinSearch(PRUint32 start, PRUint32 end, PRUint32 target);
>  
> +  //  boolean indicating whether |setPrefixes| has been

Nit: Extra space after // ?
Comment 56 Tony Chang (Google) 2011-08-29 15:57:59 PDT
Comment on attachment 556497 [details] [diff] [review]
Patch v5 5. Make the PrefixSet persistent + consistency fix

Review of attachment 556497 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
@@ +329,5 @@
> +  nsCOMPtr<nsILocalFile> file(do_QueryInterface(aFile, &rv));
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  AutoFDClose fileFd;
> +  rv = file->OpenNSPRFileDesc(PR_RDONLY, 0644, &fileFd);

Do file permissions matter if you're opening as read only?

@@ +340,5 @@
> +  PRFileMap * fileMap = PR_CreateFileMap(fileFd, size, PR_PROT_READONLY);
> +  if (!fileMap) {
> +    mapWorks = PR_FALSE;
> +  } else {
> +    buf = (PRUint32*)PR_MemMap(fileMap, 0, size);

static_cast

@@ +350,5 @@
> +
> +  if (!mapWorks) {
> +    /* fallback if mmap doesn't work */
> +    LOG(("MemMap failed, falling back to PR_Read"));
> +    return LoadFromFd(fileFd);

If you're reading the whole file and closing it, what's the benefit of mmapping?

@@ +356,5 @@
> +    PRUint32 offset = 0;
> +
> +    if (buf[offset++] == PREFIXSET_VERSION_MAGIC) {
> +      PRUint32 indexSize = buf[offset++];
> +      PRUint32 deltaSize = buf[offset++];

Do you have to verify you're not overflowing buf?

@@ +367,5 @@
> +      nsTArray<PRUint32> mNewIndexStarts;
> +      nsTArray<PRUint32> mNewIndexPrefixes;
> +      nsTArray<PRUint16> mNewDeltas;
> +
> +      mNewIndexPrefixes.AppendElements(buf+offset, indexSize);

Same here, do you need to verify you're not reading past the end of |buf|?

@@ +372,5 @@
> +      offset += indexSize;
> +      mNewIndexStarts.AppendElements(buf+offset, indexSize);
> +      offset += indexSize;
> +      if (deltaSize > 0) {
> +        PRUint16 * buf16 = (PRUint16*)(buf+offset);

static_cast
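
To make the bounds concern concrete, here is a sketch of the kind of validation being asked for, against a simplified file layout (magic word, two counts, then the payload arrays; the real format also stores the index starts):

#include <cstdint>
#include <cstring>
#include <vector>

const uint32_t kPrefixSetVersionMagic = 1;  // illustrative value only

bool ParsePrefixSet(const uint8_t* buf, size_t bufLen,
                    std::vector<uint32_t>& indexPrefixes,
                    std::vector<uint16_t>& deltas) {
  // Header: magic, index count, delta count -- three 32-bit words.
  uint32_t header[3];
  if (bufLen < sizeof(header)) {
    return false;
  }
  memcpy(header, buf, sizeof(header));
  if (header[0] != kPrefixSetVersionMagic) {
    return false;
  }

  uint64_t indexSize = header[1];
  uint64_t deltaSize = header[2];
  // Check the declared sizes against the bytes we actually have before
  // touching any payload; this is what prevents reading past |buf|.
  uint64_t needed = sizeof(header) +
                    indexSize * sizeof(uint32_t) +
                    deltaSize * sizeof(uint16_t);
  if (needed > bufLen) {
    return false;
  }

  const uint8_t* p = buf + sizeof(header);
  indexPrefixes.resize(indexSize);
  if (indexSize != 0) {
    memcpy(indexPrefixes.data(), p, indexSize * sizeof(uint32_t));
    p += indexSize * sizeof(uint32_t);
  }

  deltas.resize(deltaSize);
  if (deltaSize != 0) {
    memcpy(deltas.data(), p, deltaSize * sizeof(uint16_t));
  }
  return true;
}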
Comment 57 Tony Chang (Google) 2011-08-29 16:17:39 PDT
Comment on attachment 556501 [details] [diff] [review]
Patch v5 8. Rehash with a per-user key to prevent collisions

Review of attachment 556501 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +3468,5 @@
> +*/
> +static nsresult KeyedHash(PRUint32 aPref, PRUint32 aDomain,
> +                          PRUint32 aKey, PRUint32 *aOut)
> +{
> +  /* This is a reimplementation of MurmurHash3 32-bit

Why did you decide to use MurmurHash3?  I'm not a hash expert, so it makes me a bit nervous and it would be nice if we could use an existing hash function.
Comment 58 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:12:27 PDT
Created attachment 558566 [details] [diff] [review]
Patch v6 1. Add the PrefixTree datastructure
Comment 59 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:13:27 PDT
Created attachment 558568 [details] [diff] [review]
Patch v6 2. Use the Prefix Tree to short-circuit lookups
Comment 60 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:14:15 PDT
Created attachment 558570 [details] [diff] [review]
Patch v6 3. Probe the PrefixSet from the main thread
Comment 61 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:15:15 PDT
Created attachment 558572 [details] [diff] [review]
Patch v6 4. Reduce SQLite caches when not updating
Comment 62 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:16:30 PDT
Created attachment 558573 [details] [diff] [review]
Patch v6 5. Make the PrefixSet persistent + consistency fix
Comment 63 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:17:39 PDT
Created attachment 558574 [details] [diff] [review]
Patch v6 6. Remove the old fragment caches
Comment 64 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:18:34 PDT
Created attachment 558575 [details] [diff] [review]
Patch v6 7. Remove storage of hostkeys
Comment 65 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:19:41 PDT
Created attachment 558577 [details] [diff] [review]
Patch v6 8. Rehash with a per-user key to prevent collisions
Comment 66 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:20:47 PDT
Created attachment 558580 [details] [diff] [review]
Patch v6. 9. Add telemetry for new PrefixSet operations
Comment 67 Gian-Carlo Pascutto [:gcp] 2011-09-06 13:41:57 PDT
>I suspect your max values are overly conservative.
>I would change your maximum values: 2s->1s, 10s->3s. The exact value will still 
>be sent with .sum component of the histogram, but the histogram will look 
>nicer.

I reduced these a bit. Not too much, because there might be some bad cases when there is disk I/O that we want to spot. Also, they're exponential, so it doesn't matter that much, right?

> +  const PRUint32 ToUint32() {
>Nit: The const here doesn't really do anything.

Right, it should have been at the end. I had big problems here; this function seems to be miscompiled with the version of GCC that is in Fedora 12, when compiling for 32-bit and using optimizations including PGO (i.e. one of the configurations on the tryserver). I narrowed it down far enough that the compiler is almost certainly to blame:

http://hg.mozilla.org/try/rev/656d2af9ff1f (broken)
versus
http://hg.mozilla.org/try/rev/a4bc19779cf4 (working)

Adding the logging made the problem disappear. I ended up slightly reformulating this function and its caller, which made the problem go away.

>Nit: static_cast?

Changed most of these to reinterpret_cast (we're copying char buffers and void* thingies into uint32)

>Do file permissions matter if you're opening as read only?

Fixed

>If you're reading the whole file and closing it, what's the benefit of mmapping?

For some reason I thought mmap was generally faster, but reading a file sequentially is apparently not a case where it helps. I removed this entire path and only use normal file reading now. This also fixes your complaints about potentially overrunning the buffer (the normal read path has checks).

>Why did you decide to use MurmurHash3?  I'm not a hash expert, so it makes me a bit nervous and 
>it would be nice if we could use an existing hash function.

I presume that with "existing hash function" you mean a cryptographic hash like MD5 or SHA. The problem with these is that they have elaborate setup and finalization phases, and as a result are very slow for hashing small amounts of data, which is our use case ((3 x 4 bytes) x 600 000 hashes!). Think on the order of 20-60s on a normal PC. So I had to find something faster. I ended up with MurmurHash3 for the following reasons:

- Of the non-cryptographic hashes (FNV, Jenkins, ...) it appears to be reasonably well researched.
- It is the fastest general purpose hash with good dispersion that I could find.
- It has a specific (faster) formulation for 32-bit output, which is exactly what we want.
- The code is compact and I could easily reimplement a specific version of the algorithm for the 12-byte case.
- The list of software using it (libstdc++, nginx, memcached, hadoop, kyoto cabinet) is very encouraging.
- I calculated that a perfect hash is expected to generate about 42 collisions in the Prefix set. Changing the hashing function from SHA-256 to Murmurhash3 increased the total number of collisions by only 2. This is small enough to be negligible (the increase could be random chance), and for 600k entries does not appear to be any issue for practical use.
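
For concreteness, a sketch of a MurmurHash3-style keyed hash over the three 32-bit inputs; the mixing and finalizer constants are MurmurHash3's published ones, but the seed, argument order, and function name here are illustrative rather than the patch's exact KeyedHash:

#include <cstdint>

static inline uint32_t RotateLeft32(uint32_t value, int bits) {
  return (value << bits) | (value >> (32 - bits));
}

// Mixes (domain, prefix, per-user key) into one 32-bit value using the
// MurmurHash3 x86_32 body and finalizer, specialised for a 12-byte input.
static uint32_t KeyedHash(uint32_t aDomain, uint32_t aPrefix, uint32_t aKey) {
  const uint32_t c1 = 0xcc9e2d51;
  const uint32_t c2 = 0x1b873593;
  uint32_t h = 0;                          // seed (illustrative choice)
  const uint32_t blocks[3] = { aDomain, aPrefix, aKey };

  for (uint32_t k : blocks) {              // one mixing round per block
    k *= c1;
    k = RotateLeft32(k, 15);
    k *= c2;
    h ^= k;
    h = RotateLeft32(h, 13);
    h = h * 5 + 0xe6546b64;
  }

  h ^= 12;                                 // finalizer: fold in input length
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}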

Try run of all patches:

https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=cfaa1e1147f2
Comment 68 Nicholas Nethercote [:njn] 2011-09-06 15:21:33 PDT
> I presume that with "existing hash function" you mean a cryptographic hash
> like MD5 or SHA. The problem with these is that they have elaborate setup
> and finalization phases, and as a result are very slow for hashing small
> amounts of data, which is our use case ((3 x 4 bytes) x 600 000 hashes!).
> Think on the order of 20-60s on a normal PC. So I had to find something
> faster. I ended up with MurmurHash3 for the following reasons:
> 
> - Of the non-cryptographic hashes (FNV, Jenkins, ...) it appears to be
> reasonably well researched.
> - It is the fastest general purpose hash with good dispersion that I could
> find.
> - It has a specific (faster) formulation for 32-bit output, which is exactly
> what we want.
> - The code is compact and I could easily reimplement a specific version of
> the algorithm for the 12-byte case.
> - The list of software using it (libstdc++, nginx, memcached, hadoop, kyoto
> cabinet) is very encouraging.
> - I calculated that a perfect hash is expected to generate about 42
> collisions in the Prefix set. Changing the hashing function from SHA-256 to
> Murmurhash3 increased the total number of collisions by only 2. This is
> small enough to be negligible (the increase could be random chance), and for 600k
> entries does not appear to be any issue for practical use.

Please put that wonderfully informative text in a comment in the code :)
Comment 69 Gian-Carlo Pascutto [:gcp] 2011-09-06 15:57:36 PDT
Created attachment 558651 [details] [diff] [review]
Patch v7 8. Rehash with a per-user key to prevent collisions

Added Murmurhash3 explanation as requested.
Comment 70 Tony Chang (Google) 2011-09-07 15:48:16 PDT
Comment on attachment 558573 [details] [diff] [review]
Patch v6 5. Make the PrefixSet persistent + consistency fix

Review of attachment 558573 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierPrefixSet.cpp
@@ +318,5 @@
> +    LOG(("Version magic mismatch, not loading"));
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  LOG(("Loading PrefixSet successfull"));

nit: typo: successful

@@ +363,5 @@
> +    written = PR_Write(fileFd, mDeltas.Elements(), deltaSize * sizeof(PRUint16));
> +    NS_ENSURE_TRUE(written > 0, NS_ERROR_FAILURE);
> +  }
> +
> +  LOG(("Saving PrefixSet successfull\n"));

nit: typo: successful
Comment 71 Tony Chang (Google) 2011-09-07 16:09:24 PDT
Comment on attachment 558651 [details] [diff] [review]
Patch v7 8. Rehash with a per-user key to prevent collisions

Review of attachment 558651 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1037,5 @@
> +//  www.mail.hostname.com/foo/bar -> mail.hostname.com
> +static nsresult GetKey(const nsACString& spec, nsUrlClassifierDomainHash& hash,
> +                       nsICryptoHash * aCryptoHash);
> +
> +/* We have both a prefix and a domain. Drop the domain, but

nit: Maybe use C++ style comments to be consistent with the rest of this file.

@@ +3455,5 @@
>  
>    return NS_OK;
>  }
>  
> +/* We have both a prefix and a domain. Drop the domain, but

nit: Maybe use C++ style comments to be consistent with the rest of this file.

@@ +3473,5 @@
> +*/
> +static nsresult KeyedHash(PRUint32 aPref, PRUint32 aDomain,
> +                          PRUint32 aKey, PRUint32 *aOut)
> +{
> +  /* This is a reimplementation of MurmurHash3 32-bit

nit: Maybe use C++ style comments to be consistent with the rest of this file.
Comment 72 Gian-Carlo Pascutto [:gcp] 2011-09-08 07:50:12 PDT
Created attachment 559152 [details] [diff] [review]
Patch v7 5. Make the PrefixSet persistent + consistency fix

Incorporate review comments.
Comment 73 Gian-Carlo Pascutto [:gcp] 2011-09-08 07:54:04 PDT
Created attachment 559154 [details] [diff] [review]
Patch v8 8. Rehash with a per-user key to prevent collisions

Incorporate review comments.
Comment 74 Gian-Carlo Pascutto [:gcp] 2011-09-08 07:59:43 PDT
Try run of updated patches.

https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=8dcd952801b9
Comment 75 Nicholas Nethercote [:njn] 2011-09-08 17:01:00 PDT
This bug has been marked [inbound] and I see 5 patches in the repository with this bug number, but no links to landed patches have been posted in this bug...
Comment 77 :Ms2ger 2011-09-16 07:01:34 PDT
Comment on attachment 558566 [details] [diff] [review]
Patch v6 1. Add the PrefixTree datastructure

>--- /dev/null
>+++ b/toolkit/components/url-classifier/nsIUrlClassifierPrefixSet.idl
>@@ -0,0 +1,12 @@
>+#include "nsISupports.idl"
>+
>+interface nsIArray;
>+
>+[scriptable, uuid(6e9f759a-3f8d-4552-99ed-dbf0ea0a5f67)]
>+interface nsIUrlClassifierPrefixSet : nsISupports
>+{
>+  void setPrefixes([const, array, size_is(aLength)] in unsigned long aPrefixes,
>+                   in unsigned long aLength);
>+
>+  boolean contains(in unsigned long aPrefix);
>+};

Add a license header to this file. This is not optional.
Comment 78 Gian-Carlo Pascutto [:gcp] 2011-09-16 14:29:21 PDT
Created attachment 560630 [details] [diff] [review]
Patch v1 9. Add missing copyright header

Added missing copyright header.
Comment 79 Matt Brubeck (:mbrubeck) 2011-09-16 14:40:08 PDT
(In reply to Gian-Carlo Pascutto (:gcp) from comment #78)
> Created attachment 560630 [details] [diff] [review]
> Patch v1 9. Add missing copyright header

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9bfdfbd9a89
Comment 80 Ed Morley [:emorley] 2011-09-17 02:09:24 PDT
Comment on attachment 560630 [details] [diff] [review]
Patch v1 9. Add missing copyright header

https://hg.mozilla.org/mozilla-central/rev/d9bfdfbd9a89
