Closed Bug 1297962 Opened 8 years ago Closed 7 years ago
Support adding noise when sending v4 gethash request
Categories
(Toolkit :: Safe Browsing, defect, P2)
Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla54
firefox54: fixed
People
(Reporter: dimi, Assigned: tnguyen)
References
Details
(Whiteboard: #sbv4-m5)
Attachments
(1 file)
The current AddNoise implementation is based on the 4-byte prefix set (see Classifier::ReadNoiseEntries). We should also support adding noise for the variable-length prefix set.
Reporter
Updated•8 years ago
Whiteboard: #sbv4-m2
Updated•8 years ago
Whiteboard: #sbv4-m2 → #sbv4-m3
Reporter
Updated•8 years ago
Whiteboard: #sbv4-m3 → #sbv4-m4
Reporter
Updated•7 years ago
Whiteboard: #sbv4-m4 → #sbv4-m5
Assignee
Updated•7 years ago
Assignee: nobody → tnguyen
Assignee
Comment 1•7 years ago
Before supporting noise requests in V4, we should figure out the length of the matched prefix that needs to be sent (based on comment 1 of bug 1323953). If it is possible to send a 4-byte prefix for both V2 and V4, I assume we may not need new noise code and can just reuse the same noise request as V2.
Comment 2•7 years ago
I moved this bug to block one that's still open, otherwise it doesn't show up on the V4 bug tree: https://bugzilla.mozilla.org/showdependencytree.cgi?id=1167038&hide_resolved=1

You're right about the dependency. I'll add that now.
Comment 3•7 years ago
As mentioned in bug 1323953, we can stick to 4-byte hash prefixes when doing completions and then re-use the existing noise code.
Reporter
Comment 4•7 years ago
The existing noise code uses a 4-byte prefix array to get noise entries. For V4, the prefix set is built from both a 4-byte prefix array and a variable-length prefix map. If it is OK to use only the 4-byte prefix array to create noise entries, then this bug will only require LookupCache to provide an API to retrieve the 4-byte prefix array. But the assumption here is that most prefixes are 4 bytes.
Comment 5•7 years ago
(In reply to Dimi Lee[:dimi][:dlee] from comment #4)
> If it is ok that we just use 4-bytes prefix array to create noise entries, then this bug will only require LookupCache to provide an API to retrieve 4-bytes prefix array.
> But the assumption here will be most of prefixes are 4-bytes.

I think that's a safe assumption. We will only ever be requesting 4-byte prefixes when doing V4 completions, so it makes sense to only use 4-byte noise prefixes.
Updated•7 years ago
Status: NEW → ASSIGNED
Comment hidden (mozreview-request)
Assignee
Updated•7 years ago
Attachment #8835362 - Flags: review?(francois)
Comment 7•7 years ago
mozreview-review
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review112302

Code looks all good. Let's see if you can add a gtest for it so that we make sure we always have the noise working.
Attachment #8835362 - Flags: review?(francois)
Comment hidden (mozreview-request)
Comment 9•7 years ago
mozreview-review
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review115466

::: toolkit/components/url-classifier/Classifier.cpp:1228
(Diff revisions 1 - 2)
>      return NS_ERROR_FAILURE;
>    }
>
> +  for (size_t i = 0; i < aCount; i++) {
> -  // Pick some prefixes from cache as noise
> +    // Pick some prefixes from cache as noise
> -  size_t idx = aPrefix.ToUint32() % prefixes.Length();
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();

This is getting a little bit too clever / convoluted. Maybe we should just use `rand() * length % length` to make it clear that we're picking a few random prefixes from the cache.

::: toolkit/components/url-classifier/Classifier.cpp:1230
(Diff revisions 1 - 2)
> -  // Pick some prefixes from cache as noise
> +    // Pick some prefixes from cache as noise
> -  size_t idx = aPrefix.ToUint32() % prefixes.Length();
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();
> -  idx -= idx % aCount;
>
> -  for (size_t i = 0; (i < aCount) && ((idx+i) < prefixes.Length()); i++) {
>      Prefix newPref;

While we're touching this code, let's call this variable `newPrefix` so that there's no confusion with preferences :)

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:86
(Diff revision 2)
> +PrefixArrayToAddPrefixArrayV2(const nsTArray<nsCString>& prefixArray,
> +                              AddPrefixArray& out)
> +{
> +  out.Clear();
> +
> +  for (uint32_t i = 0; i < prefixArray.Length(); i++) {

Should `uint32_t` be `size_t`?

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:89
(Diff revision 2)
> +  out.Clear();
> +
> +  for (uint32_t i = 0; i < prefixArray.Length(); i++) {
> +    // Create prefix hash from string
> +    Prefix hash;
> +    memcpy(hash.buf, prefixArray[i].BeginReading(), PREFIX_SIZE);

Can we put a static_assert here to ensure that `sizeof(hash.buf) == PREFIX_SIZE`? Also, we could add a runtime assert that checks that `prefixArray[i].Length() == PREFIX_SIZE`.

::: toolkit/components/url-classifier/tests/gtest/TestClassifier.cpp:103
(Diff revision 2)
> +
> +TEST(Classifier, ReadNoiseEntriesV4)
> +{
> +  UniquePtr<Classifier> classifier(GetClassifier());
> +  _PrefixArray array = { GeneratePrefix(_Fragment("bravo.com/"), 32),
> +                         GeneratePrefix(_Fragment("browsing.com/"), 8),

Maybe we should add a 5-byte prefix, just because that's not a power of 2.
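For context, a minimal standalone sketch of the helper under review with both checks the reviewer asks for (a compile-time size check on the buffer and a runtime check on each input string). The names `PrefixStringsToArray` and the simplified `Prefix`/`PREFIX_SIZE` definitions are assumptions for illustration, not the actual Common.cpp code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

constexpr size_t PREFIX_SIZE = 4;  // assumed: 4-byte hash prefixes, as discussed above

struct Prefix {
  uint8_t buf[PREFIX_SIZE];
};

// Convert 4-byte prefix strings into Prefix structs.
std::vector<Prefix> PrefixStringsToArray(const std::vector<std::string>& prefixArray) {
  std::vector<Prefix> out;
  for (size_t i = 0; i < prefixArray.size(); i++) {
    Prefix hash;
    // Compile-time check: the destination buffer really is PREFIX_SIZE wide.
    static_assert(sizeof(hash.buf) == PREFIX_SIZE, "Prefix must be 4 bytes long");
    // Runtime check: each input string is exactly one prefix wide.
    assert(prefixArray[i].size() == PREFIX_SIZE);
    std::memcpy(hash.buf, prefixArray[i].data(), PREFIX_SIZE);
    out.push_back(hash);
  }
  return out;
}
```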
Attachment #8835362 - Flags: review?(francois)
Comment hidden (mozreview-request)
Assignee
Comment 11•7 years ago
Rebased commit
Assignee
Comment 12•7 years ago
mozreview-review-reply
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review115466

> This is getting a little bit too clever / convoluted.
>
> Maybe we should just use `rand() * length % length` to make it clear that we're picking a few random prefixes from the cache.

`rand() % length` would be enough to take a random index from 0 -> length-1. I had avoided using rand() because I didn't know how it behaves on different platforms and thought it might break, but it seems I was wrong; rand() is used in many other places in our codebase. Fixed as you recommended, thank you for your review.
Comment hidden (mozreview-request)
Comment 14•7 years ago
mozreview-review
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review116546

Looks good, but please address the comment below.

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:91
(Diff revisions 3 - 4)
> -  for (uint32_t i = 0; i < prefixArray.Length(); i++) {
> +  for (size_t i = 0; i < prefixArray.Length(); i++) {
>      // Create prefix hash from string
>      Prefix hash;
> +    static_assert(sizeof(hash.buf) == PREFIX_SIZE, "Prefix must be 4 bytes length");
>      memcpy(hash.buf, prefixArray[i].BeginReading(), PREFIX_SIZE);
> +    EXPECT_TRUE(prefixArray[i].Length() == PREFIX_SIZE);

That should be `MOZ_ASSERT()`, not `EXPECT_TRUE`.
Attachment #8835362 - Flags: review?(francois) → review+
Comment hidden (mozreview-request)
Assignee
Comment 16•7 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=55b9e0e41a7e
Keywords: checkin-needed
Comment 17•7 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9c05f32dd0b1
Add noise data when sending v4 gethash request r=francois
Keywords: checkin-needed
Comment 18•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9c05f32dd0b1
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment 19•7 years ago

rand() is often predictable. If this is a privacy consideration, we would probably want to use something with stronger unpredictability properties. (I'm just jumping in here based off a coverity issue, so this may not be relevant.)

Flags: needinfo?(tnguyen)
Comment 20•7 years ago
The old code explicitly did _not_ use rand() but a deterministic algorithm. Using rand() makes it easy to filter out any popular prefix people are hitting by traffic analysis and simply averaging it out.

The deterministic algorithm made this impossible by generating the same noise for the same original URL being looked up. Thus, it was never possible to say with certainty (and without correlating with other data sources) which URL was actually being hit, because every user that hits a certain URL will send the *SAME* 4 prefixes instead of 1 *IDENTICAL* one and 3 *DIFFERENT* ones. It is obvious that the latter immediately allows filtering out the noise.

I suspect you'll find the original description with some use of hg blame to the original bug.
Comment 21•7 years ago
If you can't find the original bug, the algorithm simply divides the entire prefix array into chunks of 4, finds which group of 4 the prefix belongs to, and sends all prefixes in the group.
Assignee
Comment 22•7 years ago
Thanks gcp.

Originally, we sent a group of prefixes at consecutive indices in the prefix list (0, 1, 2, 3, 4 for example). In V2 that's fine, but in V4 we store the database as 2 sets, a "fixed 4-byte prefix" set and a "variable-length prefix" set, and decided to only send 4-byte noise taken from the "fixed 4-byte prefix" set. I observed there will be some cases where we send a real prefix such as "ABBBB" and 4 noise entries "C111, C112, C113, C114". The real one looks quite different from the noise, so the noise entries are meaningless. To fix that I used

> +  for (size_t i = 0; i < aCount; i++) {
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();

I used this before changing it to rand(), in comment 9 and 12. Francois is on PTO now, so he probably won't be able to comment on this. gcp, do you agree with this, so I can file another bug to remove rand() here?
Flags: needinfo?(tnguyen)
Assignee
Updated•7 years ago
Flags: needinfo?(gpascutto)
Comment 23•7 years ago
This constructs a kind of LCG random number generator, deterministically seeded from the prefix we're generating noise from, right? I think that achieves the desired effect. We probably want to document how this works a bit more in the code itself, to prevent this from regressing as it almost did here.

(Yes, Google can probably figure out the real URLs by correlating with the link traffic in their search engine, or by correlating with Chrome results. But that takes a conscious, intentional effort on their side to violate user privacy. Observing that consecutive queries from Firefox users have 2 prefixes in common only takes a cache. Let's do best effort for privacy, guys.)
Flags: needinfo?(gpascutto)
Comment 24•7 years ago
GCP has a point about the non-determinism allowing the noise to be removed more easily. Should we simply seed the random number generator with the prefix before calling rand()? The previous code was getting harder to understand and I feared it would be hard to spot a regression (ironically, my suggestions introduced a different regression). Thomas, could you please file a follow-up bug to make the noise values deterministic for the same prefix, if you haven't done so already?
Flags: needinfo?(tnguyen)
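An illustration of the suggestion in comment 24 (an assumption for discussion, not the landed fix): seed the PRNG from the prefix before drawing indices, making the "random" noise a pure function of the prefix so the same URL always produces the same noise entries. The predictability caveat from comment 19 still applies to rand() itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Draw aCount noise indices from a PRNG deterministically seeded by the prefix.
std::vector<size_t> SeededNoiseIndices(uint32_t prefix, size_t aCount, size_t length) {
  std::srand(prefix);  // deterministic seed derived from the looked-up prefix
  std::vector<size_t> indices;
  for (size_t i = 0; i < aCount; i++) {
    indices.push_back(static_cast<size_t>(std::rand()) % length);
  }
  return indices;
}
```

Re-seeding on every call is what restores determinism: with the same seed, rand() replays the same sequence, so two lookups of the same prefix emit identical noise.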
Comment 25•7 years ago
Nevermind, I found the follow-up :)
Depends on: 1343416
Flags: needinfo?(tnguyen)