Closed Bug 669407 Opened 13 years ago Closed 12 years ago
Reduce disk space requirements for Safe Browsing
The current Safe Browsing implementation uses an SQLite database which is currently +- 50M. This is too much storage for mobile devices, so Safe Browsing is currently disabled on Fennec. Figure out a way to reduce the space requirement. Currently under consideration is removing the SQLite database, and replacing it by a prefix tree. This should reduce the requirement from about 50M to about 4.5M (2.2M hash-prefixes, and likely another 2.2M for update information). Some of the work needed is already available here: http://hg.mozilla.org/users/mmmulani_uwaterloo.ca/safe-browsing/ http://hg.mozilla.org/users/gpascutto_mozilla.com/patches/
Bug 669410 tracks the first part of this, which is to construct the prefix tree and use it to reduce the accesses to the SQLite DB.
tracking-fennec: ? → ---
I'm not sure if this is the bug where you're tracking matching only on the prefixes and ignoring the hostname hashes. But here's the comment I posted at  regarding this change. If you drop the 32-bit hostname hash, then would you be matching only on the 32-bit prefixes? If so, I think you're setting yourself up for false-positives. Consider: With 600,000 hashes in the database, the chance of some other website colliding on a 32-bit hash is 600,000 * 2^(-32) = 0.00014. So you'd need to view .5 / 0.00014 = 3,600 pages to have a 50% chance of a false-positive. That's not a lot of pages. Using (truncated) hashes is itself probabilistic. So if you tuned the bloom filter so the chance of a false-positive is much lower than the chance of a hash collision, it might be sensible.  http://www.morbo.org/2011/07/bringing-safebrowsing-to-mobile-devies.html
Oops, the chance of a collision is actually 1 - (1 - 0.00014) ^ y where y is the number of websites you visit. So if you want probability of collision = 0.5, you get y = log 0.5 / log(1 - 0.00014) or about 5000 webpages.
>I'm not sure if this is the bug where you're tracking matching only on the >prefixes and ignoring the hostname hashes. For now it's an optional patch for bug 669410. >Using (truncated) hashes is itself probabilistic. So if you tuned the bloom >filter so the chance of a false-positive is much lower than the chance of a >hash collision, it might be sensible. There are 2 aspects here: the first one is a collision on a prefix causing a lookup to the Google server. You're right this will now happen about once every few thousand pages instead of, well, almost never. This will increases the load on Google's server, though note that full prefix lookups are cached. I don't know if the resulting load increase is substantial or meaningful. The second aspect is that if you want to handle updates, a "false positive" makes it impossible to safely remove an entry from the datastructure. This isn't an issue with a Prefix Trie (which can handle duplicated entries), whereas it *is* an issue with a Bloom filter. The persistent datastructure must *support* deletions. Lastly, a Bloom filter with false positive rate of 1 out of every 5000 lookups and 600 000 prefixes takes 17.7 bits per entry, or 1.7 bit per entry more than the Prefix Trie. So it's a loss in every case.
I see. I didn't realize that when there's a hash collision, we send the full URL to Google for verification. That's much more sensible. Increasing the probability of sending the URL to Google by to 0.00014 doesn't seem so bad. Do you know if we wait to render the page until Google responds? Then I suppose it might matter some. It sounds like a prefix trie tromps a bloom filter here, although I wonder how Chrome deals with deletions if they also use a bloom filter...
>Do you know if we wait to render the page until Google responds? Then I >suppose it might matter some. Yes, everything blocks until the URL is confirmed (for security reasons). If the hostkeys are still saved on disk (which I'd do for desktop), the impact will be 0. If they aren't (mobile), we'll load 1 in 5000 pages slower. >It sounds like a prefix trie tromps a bloom filter here, although I wonder how >Chrome deals with deletions if they also use a bloom filter... They don't. They delete it and reconstruct it entirely from a (larger) disk-based database. We'd like to use Prefix Tries for both. Not storing hostkeys is an optimization for either technique.
(In reply to comment #6) > >Do you know if we wait to render the page until Google responds? Then I > >suppose it might matter some. > > Yes, everything blocks until the URL is confirmed (for security reasons). If > the hostkeys are still saved on disk (which I'd do for desktop), the impact > will be 0. If they aren't (mobile), we'll load 1 in 5000 pages slower. This seems particularly painful because (a) we're talking about slow connections and (b) when it slows down a page, it does so for all our users (which is bad for whoever owns that page), right?
As for (a), I'm not sure connection speed is relevant as the time spent doing the extra query compared to the total page load time will be approximately the same independent of the connection speed, and quite low (tiny data query to fast Google server). Is it perceived more annoying by a user to wait 5 seconds for a page load instead of 4 seconds, as compared to 250 ms instead of 200 ms? Can they even detect that 1 out 5000 pages is a fraction slower? As for (b), we can rehash the prefixes with a unique-per-browser hash to ensure this 1/5000th page collision is happening on a different page for every user, if we feel that is needed. We cannot avoid false positives if we hash. If we want less of them, they have to be payed for in bits of flash storage. (An interesting consideration on the side: right now, in the current implementation existing since FF3, something like half the entries in the DB are full-domain blocks, where both hashes are equal (because hostkey == URL-key). This means that some (1/10000) domain owners already suffer from this issue.)
> As for (b), we can rehash the prefixes with a unique-per-browser hash to ensure > this 1/5000th page collision is happening on a different page for every user, > if we feel that is needed. How would this work, exactly? You don't get the URL itself, just the prefix (a hash)? So you generate a random number R and do for all prefixes, store sha32(prefix|R). Then when you visit a page, you check whether sha32(sha32(url)|R) is in the database? But doesn't this collide if sha32(url) == prefix, which is exactly the condition you had before?
You get from Google: domain prefix (sha32(domain))=DH + canonicalized-url prefix (sha32(url))=PH. for all prefixes, store sha32(DH|PH|R) When visiting a page, check whether sha32(sha32(domain)|sha32(url)|R) is in the database. Now sha32(url) = PH with sha32(domain) != DH will collide depending on R.
Ah, I see. I had to write it out to understand, so if it helps anyone else: * sha32(PH) collides if two URLs have the same hash. (probability 2^-32) * sha32(DH|PH) collides if * two URLs have the same hash and their domains have the same hash, or * two URLs' DH|PH values have the same hash. (probability 2^-64 + 2^-32) * sha32(DH|PH|R) collides if * two URLs have the same hash and their domains have the same hash, or * two URLs' DH|PH|R values have the same hash. (probability 2^-64 + 2^-32) So you're trading off a (tiny) increase in chance of collisions and getting back randomization on the 2^-32 term.
Component: General → Phishing Protection
OS: Android → All
Product: Fennec → Firefox
QA Contact: general → phishing.protection
Hardware: ARM → All
Bug 673470 landed, reducing the store from 40M to 5-6M.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.