I've been keeping an eye on the growth of the SafeBrowsing database to be sure everything is working OK, and one thing that gets to me is the size of the malware files. When originally working on this, I thought that AddPrefixes and AddChunks were the important thing, and we'd get everything down to about 2.4M. And this is right, for the phishing list. The malware list is another matter:

2012-02-24 05:55:40.431215 UTC - 1657325312[7f546462b9c0]: Table goog-malware-shavar now has:
2012-02-24 05:55:40.431234 UTC - 1657325312[7f546462b9c0]: 12846 add chunks
2012-02-24 05:55:40.431239 UTC - 1657325312[7f546462b9c0]: 386464 add prefixes
2012-02-24 05:55:40.431242 UTC - 1657325312[7f546462b9c0]: 1 add completions
2012-02-24 05:55:40.431254 UTC - 1657325312[7f546462b9c0]: 10562 sub chunks
2012-02-24 05:55:40.431257 UTC - 1657325312[7f546462b9c0]: 226033 sub prefixes
2012-02-24 05:55:40.431259 UTC - 1657325312[7f546462b9c0]: 1 sub completions

The sub prefixes make up over half of the total. We currently use zlib (DEFLATE) on the chunk numbers (of which sub prefixes have 2), and store the sub prefixes themselves uncompressed (mostly because PrefixSet doesn't have stream serialization support, and I didn't anticipate there being so many in the first place).

Because the chunk data only spans a small range of numbers (+-13000, see above), I did a quick hack to store them delta-encoded. To my surprise, this *worsened* the compression. After looking at it a bit more, I realized 2 issues:

a) Because DEFLATE requires matches with length at least 3 to compress well, it will only be useful for consecutive numbers that aren't more than 256 apart.
b) The data is stored in prefix order, which means the chunk numbers are purely random. This means (a) will almost always fail.
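The failed experiment can be sketched roughly as follows. This is a minimal Python illustration, not the actual HashStore code: with chunk numbers in prefix order, the deltas between neighbors are just as random as the values themselves, so DEFLATE gains nothing from the transform.

```python
import random
import struct
import zlib

def delta_encode(values):
    """Replace each value with the difference from its predecessor."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

# Chunk numbers as seen in prefix order: drawn from a small range
# (~13000) but in effectively random order, so the deltas are large,
# signed, and random, and DEFLATE finds few matches of length >= 3.
random.seed(0)
chunks = [random.randint(1, 13000) for _ in range(50000)]
deltas = delta_encode(chunks)
raw = struct.pack("<%dI" % len(chunks), *chunks)
enc = struct.pack("<%di" % len(deltas), *deltas)  # deltas can be negative
print("raw compressed:  ", len(zlib.compress(raw)))
print("delta compressed:", len(zlib.compress(enc)))
```

On data like this the delta stream actually compresses worse than the raw stream, matching what was observed: the sign-extension bytes and wider delta range add entropy instead of removing it.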
This in turn led to the following comment in HashStore::WriteSubPrefixes:

// chunk-ordered prefixes are not compressible

This was true at some point, but before the code landed I actually rewrote it to have everything work on the data in prefix order, and hence avoid a whole bunch of re-sorting that was otherwise necessary. But this does mean that the prefixes should be compressible now. And so should the chunk numbers, if we're a bit smarter.

I kept the DEFLATE backend because we already have it and it does nice things such as Huffman coding for us, but added a very simple preprocessing step that splits all 32-bit values into 4 8-bit slices. Because the MSBs of our chunk data are almost all identical, and the prefixes, for example, which are sorted, will also have long runs of identical high bytes, compressibility improves a lot. It also avoids the DEFLATE-needs-a-length-3-match problem entirely. And it's fast. Win!

These are the test results on the current database. ShortSlice was a try with splitting the 32-bit data into 2 x 16-bit values, which makes it behave very close to PrefixSet:

https://docs.google.com/spreadsheet/ccc?key=0AiObroK0ZHF-dGFmZWhKLVdSWnhYcGk1VDZaZjg3X1E#gid=0

This reduces our SafeBrowsing store from about 5.5M to 4.5M.
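The preprocessing step described above can be illustrated like this. This is a hypothetical Python sketch, not the C++ patch itself: regroup the serialized 32-bit values into four byte planes (all most-significant bytes first, then the next byte of every value, and so on) before handing the stream to DEFLATE, so the mostly-identical high bytes become long runs.

```python
import random
import struct
import zlib

def byte_slice(values):
    """Regroup 32-bit values into four byte planes: the MSB of every
    value first, then the second byte of every value, and so on."""
    packed = struct.pack(">%dI" % len(values), *values)
    return b"".join(packed[i::4] for i in range(4))

def byte_unslice(data):
    """Invert byte_slice: re-interleave the four planes."""
    n = len(data) // 4
    planes = [data[i * n:(i + 1) * n] for i in range(4)]
    packed = bytes(b for quad in zip(*planes) for b in quad)
    return list(struct.unpack(">%dI" % n, packed))

# Sorted prefixes have long runs of identical high bytes; once those
# bytes are contiguous, the first planes compress extremely well and
# DEFLATE no longer needs length-3 matches across 4-byte records.
random.seed(0)
prefixes = sorted(random.getrandbits(32) for _ in range(50000))
plain = struct.pack(">%dI" % len(prefixes), *prefixes)
sliced = byte_slice(prefixes)
print("plain compressed: ", len(zlib.compress(plain)))
print("sliced compressed:", len(zlib.compress(sliced)))
```

The transform is a pure byte shuffle, so it costs one pass over the data in each direction, which is why it stays fast even on large stores.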
Created attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.
Comment on attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.

Neat.
Created attachment 616613 [details] [diff] [review]
Patch. Backout

[Approval Request Comment]
Backout due to bug 744993:
a01cf079ee0b Bug 730247
1a6d008acb4f Bug 729928
f8bf3795b851 Bug 729640
35bf0d62cc30 Bug 726002
a010dcf1a973 Bug 726002
e9291f227d63 Bug 725597
db52b4916cde Bug 673470
173f90d397a8 Bug 673470
Comment on attachment 616613 [details] [diff] [review]
Patch. Backout

Sorry - thought this bug was reopened because the backout was backed out. My mistake. Approved for Aurora 13 (or Beta 13 if the merge occurs before we land).
Relanding after fixes in bug 673470 to fix bug 744993. https://hg.mozilla.org/integration/mozilla-inbound/rev/c716b0e67d08