Improve SafeBrowsing store SubPrefix compression

RESOLVED FIXED in Firefox 17

Status

()

Toolkit
Safe Browsing
--
enhancement
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: gcp, Assigned: gcp)

Tracking

13 Branch
Firefox 17
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

5 years ago
I've been keeping an eye on the growth of the SafeBrowsing database to be sure everything is working OK, and one thing that gets to me is the size of the malware files. When originally working on this thing, I thought that AddPrefixes and AddChunks were the important thing, and we'd get everything down to about 2.4M. And this is right, for the phishing list. The malware list is another matter:

2012-02-24 05:55:40.431215 UTC - 1657325312[7f546462b9c0]: Table goog-malware-shavar now has:
2012-02-24 05:55:40.431234 UTC - 1657325312[7f546462b9c0]:   12846 add chunks
2012-02-24 05:55:40.431239 UTC - 1657325312[7f546462b9c0]:   386464 add prefixes
2012-02-24 05:55:40.431242 UTC - 1657325312[7f546462b9c0]:   1 add completions
2012-02-24 05:55:40.431254 UTC - 1657325312[7f546462b9c0]:   10562 sub chunks
2012-02-24 05:55:40.431257 UTC - 1657325312[7f546462b9c0]:   226033 sub prefixes
2012-02-24 05:55:40.431259 UTC - 1657325312[7f546462b9c0]:   1 sub completions

The amount of sub prefixes is over half the total amount. We currently use zlib (DEFLATE) on the chunk numbers (of which sub prefixes have 2), and store the sub prefixes uncompressed (mostly because PrefixSet doesn't have stream serialization support, and I didn't anticipate there being so many in the first place).

Because the chunk data only spans a small amount of numbers (+-13000, see above), I did a quick hack to store them delta-encoded. To my surprise, this *worsened* the compression. After looking at it a bit more, I realized 2 issues:

a) Because DEFLATE requires matches with length at least 3 to compress well, it will only be useful for consecutive numbers that aren't more than 256 apart.
b) The data is stored in prefix order, which means the chunk numbers are purely random. This means (a) will almost always fail.

This in turn lead to the following comment in HashStore::WriteSubPrefixes:
  // chunk-ordered prefixes are not compressible

This was true at some point, but before the coded landed, I actually rewrote it to have everything work on the data in prefix order, and hence avoid a whole bunch of re-sorting that was otherwise necessary.

But this does mean that the prefixes should be compressible now. And so should the chunk numbers, if we're a bit smarter.

I kept the DEFLATE backend because we already have it and it does nice things such as huffman coding for us, but added a very simply preprocessing step that splits all 32-bit values into 4 8-bit slices. Because the MSB of our chunkdata is almost all identical, and for example the prefixes, which are sorted, will also have long runs of identical values, compressibility gains a lot. It also avoids the DEFLATE-needs-a-length-3-match problem entirely. And it's fast. Win!

These are the test results on the current database. ShortSlice was a try with splitting the 32-bit data into 2 x 16-bit values, which makes it behave very close to PrefixSet.

https://docs.google.com/spreadsheet/ccc?key=0AiObroK0ZHF-dGFmZWhKLVdSWnhYcGk1VDZaZjg3X1E#gid=0

This reduces our SafeBrowsing store from about 5.5M to 4.5M.
(Assignee)

Comment 1

5 years ago
Created attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.
Assignee: nobody → gpascutto
Attachment #600351 - Flags: review?(dcamp)
(Assignee)

Comment 2

5 years ago
https://tbpl.mozilla.org/?tree=Try&rev=457ee3012f48

Comment 3

5 years ago
Comment on attachment 600351 [details] [diff] [review]
Patch 1. Split data into byte slices to improve compression.

Neat.
Attachment #600351 - Flags: review?(dcamp) → review+
(Assignee)

Comment 4

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/a01cf079ee0b
https://hg.mozilla.org/mozilla-central/rev/a01cf079ee0b
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 13
(Assignee)

Comment 6

5 years ago
Created attachment 616613 [details] [diff] [review]
Patch. Backout

[Approval Request Comment]
Backout due to bug 744993:

a01cf079ee0b Bug 730247
1a6d008acb4f Bug 729928
f8bf3795b851 Bug 729640
35bf0d62cc30 Bug 726002
a010dcf1a973 Bug 726002
e9291f227d63 Bug 725597
db52b4916cde Bug 673470
173f90d397a8 Bug 673470
Attachment #616613 - Flags: approval-mozilla-central?
Attachment #616613 - Flags: approval-mozilla-aurora?
Attachment #616613 - Flags: approval-mozilla-central? → approval-mozilla-central+
(Assignee)

Comment 7

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad824b9f5c38
https://hg.mozilla.org/mozilla-central/rev/ad824b9f5c38
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

5 years ago
Attachment #616613 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-

Comment 9

5 years ago
Comment on attachment 616613 [details] [diff] [review]
Patch. Backout

Sorry - thought this bug was reopened because the backout was backed out. My mistake. Approved for Aurora 13 (or Beta 13 if the merge occurs before we land).
Attachment #616613 - Flags: approval-mozilla-aurora- → approval-mozilla-aurora+
(Assignee)

Comment 10

5 years ago
https://hg.mozilla.org/releases/mozilla-beta/rev/bcf80cb89d21
(Assignee)

Comment 11

5 years ago
Relanding after fixes in bug 673470 to fix bug 744993.
https://hg.mozilla.org/integration/mozilla-inbound/rev/c716b0e67d08
https://hg.mozilla.org/mozilla-central/rev/c716b0e67d08
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: Firefox 13 → Firefox 17
Component: Phishing Protection → Phishing Protection
Product: Firefox → Toolkit
You need to log in before you can comment on or make changes to this bug.