Bug 730247 - Improve SafeBrowsing store SubPrefix compression
Status: RESOLVED FIXED
Product: Toolkit
Classification: Components
Component: Safe Browsing
Version: 13 Branch
Hardware/OS: All / All
Importance: -- enhancement
Target Milestone: Firefox 17
Assigned To: Gian-Carlo Pascutto [:gcp]
Reported: 2012-02-24 03:49 PST by Gian-Carlo Pascutto [:gcp]
Modified: 2014-05-27 12:25 PDT


Attachments

Patch 1. Split data into byte slices to improve compression. (9.31 KB, patch)
2012-02-24 03:50 PST, Gian-Carlo Pascutto [:gcp]
dcamp: review+

Patch. Backout (9.28 KB, patch)
2012-04-19 09:22 PDT, Gian-Carlo Pascutto [:gcp]
akeybl: approval-mozilla-aurora+
mark.finkle: approval-mozilla-central+

Description Gian-Carlo Pascutto [:gcp] 2012-02-24 03:49:14 PST
I've been keeping an eye on the growth of the SafeBrowsing database to be sure everything is working OK, and one thing that gets to me is the size of the malware files. When originally working on this, I thought that AddPrefixes and AddChunks were the important thing, and we'd get everything down to about 2.4M. And that is right, for the phishing list. The malware list is another matter:

2012-02-24 05:55:40.431215 UTC - 1657325312[7f546462b9c0]: Table goog-malware-shavar now has:
2012-02-24 05:55:40.431234 UTC - 1657325312[7f546462b9c0]:   12846 add chunks
2012-02-24 05:55:40.431239 UTC - 1657325312[7f546462b9c0]:   386464 add prefixes
2012-02-24 05:55:40.431242 UTC - 1657325312[7f546462b9c0]:   1 add completions
2012-02-24 05:55:40.431254 UTC - 1657325312[7f546462b9c0]:   10562 sub chunks
2012-02-24 05:55:40.431257 UTC - 1657325312[7f546462b9c0]:   226033 sub prefixes
2012-02-24 05:55:40.431259 UTC - 1657325312[7f546462b9c0]:   1 sub completions

The number of sub prefixes is over half the total. We currently use zlib (DEFLATE) on the chunk numbers (of which sub prefixes have two), and store the sub prefixes themselves uncompressed (mostly because PrefixSet doesn't have stream serialization support, and I didn't anticipate there being so many in the first place).

Because the chunk data only spans a small range of numbers (roughly 13,000, see above), I did a quick hack to store them delta-encoded (see the sketch after the list below). To my surprise, this *worsened* the compression. After looking at it a bit more, I realized there were two issues:

a) Because DEFLATE requires matches with length at least 3 to compress well, it will only be useful for consecutive numbers that aren't more than 256 apart.
b) The data is stored in prefix order, which means the chunk numbers are purely random. This means (a) will almost always fail.
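
(Illustration only, not code from the patch; the helper names are mine. The hack replaced each chunk number with its difference from the previous record, with unsigned wrap-around so the transform stays reversible. With the records sorted by prefix, neighbouring chunk numbers are unrelated, so the deltas spread over the whole chunk range instead of clustering near zero:)

  #include <cstdint>
  #include <vector>

  // Hypothetical sketch of the delta-encoding experiment. DeltaEncode
  // replaces each chunk number with its difference from the previous one,
  // working from the back so earlier values stay untouched; unsigned
  // wrap-around keeps the transform reversible.
  static void DeltaEncode(std::vector<uint32_t>& aChunks)
  {
    for (size_t i = aChunks.size(); i > 1; --i) {
      aChunks[i - 1] -= aChunks[i - 2];
    }
  }

  // Inverse transform, applied after decompression.
  static void DeltaDecode(std::vector<uint32_t>& aChunks)
  {
    for (size_t i = 1; i < aChunks.size(); ++i) {
      aChunks[i] += aChunks[i - 1];
    }
  }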

This in turn led to the following comment in HashStore::WriteSubPrefixes:
  // chunk-ordered prefixes are not compressible

This was true at some point, but before the code landed, I rewrote it to have everything work on the data in prefix order, and hence avoided a whole bunch of re-sorting that would otherwise have been necessary.

But this does mean that the prefixes should be compressible now. And so should the chunk numbers, if we're a bit smarter.

I kept the DEFLATE backend because we already have it and it does nice things such as Huffman coding for us, but added a very simple preprocessing step that splits all 32-bit values into four 8-bit slices. Because the MSBs of our chunk data are almost all identical, and the prefixes, which are sorted, also have long runs of identical high bytes, compressibility improves a lot. It also avoids the DEFLATE-needs-a-length-3-match problem entirely. And it's fast. Win!
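
A minimal sketch of that preprocessing step (my own illustration, so the helper names won't match the patch): transpose the array of 32-bit values into four byte planes, most significant bytes first, and hand each plane to DEFLATE.

  #include <cstdint>
  #include <vector>

  // Split an array of 32-bit values into four 8-bit slices: all the most
  // significant bytes, then the next byte plane, and so on. The MSB slice
  // of the chunk numbers is nearly constant, and the sorted prefixes give
  // long runs in their high-byte slices, so DEFLATE compresses each slice
  // far better than the interleaved original.
  static void ByteSlice(const std::vector<uint32_t>& aIn,
                        std::vector<uint8_t> aSlices[4])
  {
    for (uint32_t value : aIn) {
      aSlices[0].push_back(static_cast<uint8_t>(value >> 24));
      aSlices[1].push_back(static_cast<uint8_t>(value >> 16));
      aSlices[2].push_back(static_cast<uint8_t>(value >> 8));
      aSlices[3].push_back(static_cast<uint8_t>(value));
    }
  }

  // Inverse transform: reassemble the 32-bit values after decompression.
  static void ByteUnslice(const std::vector<uint8_t> aSlices[4],
                          std::vector<uint32_t>& aOut)
  {
    for (size_t i = 0; i < aSlices[0].size(); ++i) {
      aOut.push_back((uint32_t(aSlices[0][i]) << 24) |
                     (uint32_t(aSlices[1][i]) << 16) |
                     (uint32_t(aSlices[2][i]) << 8) |
                      uint32_t(aSlices[3][i]));
    }
  }

The ShortSlice variant mentioned below is the same transpose with two 16-bit slices instead of four 8-bit ones.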

These are the test results on the current database. ShortSlice was an experiment that split the 32-bit data into two 16-bit slices instead, which makes it behave very much like PrefixSet.

https://docs.google.com/spreadsheet/ccc?key=0AiObroK0ZHF-dGFmZWhKLVdSWnhYcGk1VDZaZjg3X1E#gid=0

This reduces our SafeBrowsing store from about 5.5M to 4.5M.
Comment 1 Gian-Carlo Pascutto [:gcp] 2012-02-24 03:50:40 PST
Created attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.
Comment 2 Gian-Carlo Pascutto [:gcp] 2012-02-24 11:48:18 PST
https://tbpl.mozilla.org/?tree=Try&rev=457ee3012f48
Comment 3 Dave Camp (:dcamp) 2012-02-24 11:54:16 PST
Comment on attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.

Neat.
Comment 4 Gian-Carlo Pascutto [:gcp] 2012-02-26 22:48:59 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/a01cf079ee0b
Comment 5 Marco Bonardo [::mak] 2012-02-27 04:55:57 PST
https://hg.mozilla.org/mozilla-central/rev/a01cf079ee0b
Comment 6 Gian-Carlo Pascutto [:gcp] 2012-04-19 09:22:25 PDT
Created attachment 616613 [details] [diff] [review]
Patch. Backout

[Approval Request Comment]
Backout due to bug 744993:

a01cf079ee0b Bug 730247
1a6d008acb4f Bug 729928
f8bf3795b851 Bug 729640
35bf0d62cc30 Bug 726002
a010dcf1a973 Bug 726002
e9291f227d63 Bug 725597
db52b4916cde Bug 673470
173f90d397a8 Bug 673470
Comment 7 Gian-Carlo Pascutto [:gcp] 2012-04-19 22:48:52 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad824b9f5c38
Comment 9 Alex Keybl [:akeybl] 2012-04-24 07:11:48 PDT
Comment on attachment 616613 [details] [diff] [review]
Patch. Backout

Sorry - thought this bug was reopened because the backout was backed out. My mistake. Approved for Aurora 13 (or Beta 13 if the merge occurs before we land).
Comment 10 Gian-Carlo Pascutto [:gcp] 2012-05-01 06:41:20 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/bcf80cb89d21
Comment 11 Gian-Carlo Pascutto [:gcp] 2012-08-15 00:16:11 PDT
Relanding after fixes in bug 673470 to fix bug 744993.
https://hg.mozilla.org/integration/mozilla-inbound/rev/c716b0e67d08
Comment 12 Ed Morley [:emorley] 2012-08-15 09:47:25 PDT
https://hg.mozilla.org/mozilla-central/rev/c716b0e67d08
