Closed Bug 1297962 Opened 5 years ago Closed 5 years ago

Support adding noise when sending v4 gethash request

Categories

(Toolkit :: Safe Browsing, defect, P2)

Tracking

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: dimi, Assigned: tnguyen)

References

Details

(Whiteboard: #sbv4-m5)

Attachments

(1 file)

The current AddNoise implementation is based on the 4-byte prefix set (see Classifier::ReadNoiseEntries).

We should also support adding noise for variable-length prefix sets.
Whiteboard: #sbv4-m2
Blocks: 1276826
Priority: -- → P2
Whiteboard: #sbv4-m2 → #sbv4-m3
Whiteboard: #sbv4-m3 → #sbv4-m4
Whiteboard: #sbv4-m4 → #sbv4-m5
Assignee: nobody → tnguyen
Blocks: 1329808
No longer blocks: 1276826
Before supporting noise requests in V4, we should figure out the length of the matched prefix that needs to be sent (based on comment 1 of bug 1323953).
If it is possible to send 4-byte prefixes for both V2 and V4, I assume we may not need new noise code and can just reuse the V2 noise request.
Blocks: 1276826
No longer blocks: 1329808
I moved this bug to block one that's still open, otherwise it doesn't show up on the V4 bug tree: https://bugzilla.mozilla.org/showdependencytree.cgi?id=1167038&hide_resolved=1

You're right about the dependency. I'll add that now.
Blocks: 1329808
No longer blocks: 1276826
Depends on: 1323953
As mentioned in bug 1323953, we can stick to 4-byte hash prefixes when doing completions and then reuse the existing noise code.
The existing noise code uses a 4-byte prefix array to get noise entries.
For V4, the prefix set is built from both a 4-byte prefix array and a variable-length prefix map. If it is OK to use only the 4-byte prefix array to create noise entries, then this bug will only require LookupCache to provide an API to retrieve the 4-byte prefix array.
The assumption here is that most prefixes are 4 bytes.
(In reply to Dimi Lee[:dimi][:dlee] from comment #4)
> If it is OK to use only the 4-byte prefix
> array to create noise entries, then this bug will only require LookupCache
> to provide an API to retrieve the 4-byte prefix array.
> The assumption here is that most prefixes are 4 bytes.

I think that's a safe assumption.

We will only ever be requesting 4-byte prefixes when doing V4 completions, so it makes sense to only use 4-byte noise prefixes.
Status: NEW → ASSIGNED
No longer depends on: 1323953
Attachment #8835362 - Flags: review?(francois)
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review112302

Code looks all good. Let's see if you can add a gtest for it so that we make sure we always have the noise working.
Attachment #8835362 - Flags: review?(francois)
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review115466

::: toolkit/components/url-classifier/Classifier.cpp:1228
(Diff revisions 1 - 2)
>      return NS_ERROR_FAILURE;
>    }
>  
> +  for (size_t i = 0; i < aCount; i++) {
> -  // Pick some prefixes from cache as noise
> +    // Pick some prefixes from cache as noise
> -  size_t idx = aPrefix.ToUint32() % prefixes.Length();
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();

This is getting a little bit too clever / convoluted.

Maybe we should just use `rand() * length % length` to make it clear that we're picking a few random prefixes from the cache.

::: toolkit/components/url-classifier/Classifier.cpp:1230
(Diff revisions 1 - 2)
> -  // Pick some prefixes from cache as noise
> +    // Pick some prefixes from cache as noise
> -  size_t idx = aPrefix.ToUint32() % prefixes.Length();
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();
> -  idx -= idx % aCount;
>  
> -  for (size_t i = 0; (i < aCount) && ((idx+i) < prefixes.Length()); i++) {
>      Prefix newPref;

While we're touching this code, let's call this variable `newPrefix` so that there's no confusion with preferences :)

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:86
(Diff revision 2)
> +PrefixArrayToAddPrefixArrayV2(const nsTArray<nsCString>& prefixArray,
> +                              AddPrefixArray& out)
> +{
> +  out.Clear();
> +
> +  for (uint32_t i = 0; i < prefixArray.Length(); i++) {

Should `uint32_t` be `size_t`?

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:89
(Diff revision 2)
> +  out.Clear();
> +
> +  for (uint32_t i = 0; i < prefixArray.Length(); i++) {
> +    // Create prefix hash from string
> +    Prefix hash;
> +    memcpy(hash.buf, prefixArray[i].BeginReading(), PREFIX_SIZE);

Can we put a static_assert here to ensure that `sizeof(hash.buf) == PREFIX_SIZE`?

Also, we could add a runtime assert that checks that `prefixArray[i].Length() == PREFIX_SIZE`.

::: toolkit/components/url-classifier/tests/gtest/TestClassifier.cpp:103
(Diff revision 2)
> +
> +TEST(Classifier, ReadNoiseEntriesV4)
> +{
> +  UniquePtr<Classifier> classifier(GetClassifier());
> +  _PrefixArray array = { GeneratePrefix(_Fragment("bravo.com/"), 32),
> +                         GeneratePrefix(_Fragment("browsing.com/"), 8),

Maybe we should add a 5-byte prefix, just because that's not a power of 2.
Attachment #8835362 - Flags: review?(francois)
Rebased commit
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review115466

> This is getting a little bit too clever / convoluted.
> 
> Maybe we should just use `rand() * length % length` to make it clear that we're picking a few random prefixes from the cache.

rand() % length would be enough to pick a random index from 0 to length-1.
I had avoided using rand() because I didn't know how it behaves on different platforms, and I thought it might break.
But it seems I was wrong; I found rand() used in many other places in our codebase.
Fixed as you recommended, thank you for the review.
Comment on attachment 8835362 [details]
Bug 1297962 - Add noise data when sending v4 gethash request

https://reviewboard.mozilla.org/r/111040/#review116546

Looks good, but please address the comment below.

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:91
(Diff revisions 3 - 4)
> -  for (uint32_t i = 0; i < prefixArray.Length(); i++) {
> +  for (size_t i = 0; i < prefixArray.Length(); i++) {
>      // Create prefix hash from string
>      Prefix hash;
> +    static_assert(sizeof(hash.buf) == PREFIX_SIZE, "Prefix must be 4 bytes length");
>      memcpy(hash.buf, prefixArray[i].BeginReading(), PREFIX_SIZE);
> +    EXPECT_TRUE(prefixArray[i].Length() == PREFIX_SIZE);

That should be `MOZ_ASSERT()`, not `EXPECT_TRUE`.
Attachment #8835362 - Flags: review?(francois) → review+
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9c05f32dd0b1
Add noise data when sending v4 gethash request r=francois
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/9c05f32dd0b1
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
rand() is often predictable. If this is a privacy consideration, we would probably want to use something with stronger unpredictability properties. (I'm just jumping in here based off a coverity issue, so this may not be relevant.)
Flags: needinfo?(tnguyen)
The old code explicitly did _not_ use rand() but a deterministic algorithm. Using rand() makes it easy to filter out any popular prefix people are hitting by traffic analysis and simply averaging it out.

The deterministic algorithm made this impossible by generating the same noise for the same original URL being looked up. Thus, it was never possible to say with certainty (and without correlating with other data sources) which URL was actually being hit, because every user that hits a certain URL will send the *SAME* 4 prefixes, instead of 1 *IDENTICAL* one and 3 *DIFFERENT* ones. It is obvious that the latter immediately allows filtering out the noise.

I suspect you'll find the original rationale with some use of hg blame to track down the original bug.
If you can't find the original bug: the algorithm simply divides the entire prefix array into chunks of 4, finds which group of 4 the prefix belongs to, and sends all prefixes in that group.
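To make the property concrete, here is a hedged standalone sketch of that deterministic selection (not the actual Mozilla code; the index computation follows the `idx = aPrefix.ToUint32() % prefixes.Length(); idx -= idx % aCount;` logic visible in the patch diffs above):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch: deterministic V2-style noise selection. The index is
// derived from the matched prefix itself and snapped to the start of a
// group of `count`, so every user looking up the same URL sends the same
// group of prefixes, which is what defeats noise-averaging attacks.
std::vector<uint32_t> PickNoise(const std::vector<uint32_t>& prefixes,
                                uint32_t matchedPrefix, size_t count) {
  size_t idx = matchedPrefix % prefixes.size();
  idx -= idx % count;  // snap to the start of the group of `count`
  std::vector<uint32_t> noise;
  for (size_t i = 0; i < count && idx + i < prefixes.size(); i++) {
    noise.push_back(prefixes[idx + i]);
  }
  return noise;
}
```

Because the output depends only on the matched prefix and the (shared) prefix array, repeated lookups of the same URL by different users emit identical noise groups.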
Thanks gcp.
Originally, we sent a group of prefixes at consecutive indices in the prefix list (0, 1, 2, 3, 4 for example).

In V2 that is fine, but in V4 we store the database as two sets, a "fixed 4-byte prefix" set and a "variable-length prefix" set, and we decided to only send 4-byte noise from the "fixed 4-byte prefix" set.
I observed there would be cases where we send a real prefix such as "ABBBB" and 4 noise entries "C111, C112, C113, C114". The real one looks quite different from the noise, so the noise is meaningless.

To fix that I used
> +  for (size_t i = 0; i < aCount; i++) {
> +    uint32_t idx = ((aPrefix.ToUint32() / aCount) * (i + 1)) % prefixes.Length();

I used this before changing to rand(), in comments 9 and 12.
Francois is on PTO now, so he probably will not be able to comment. gcp, do you agree with this, so I can file another bug to remove rand() here?
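For illustration, a hedged standalone sketch of that index formula (hypothetical helper, not the actual patch):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch of the formula quoted above: every noise index is a
// deterministic function of the matched prefix, so the same prefix always
// yields the same noise, while the (i + 1) factor spreads the picks across
// the whole prefix array instead of taking consecutive entries.
std::vector<size_t> NoiseIndices(uint32_t prefix, size_t count, size_t length) {
  std::vector<size_t> out;
  for (size_t i = 0; i < count; i++) {
    out.push_back((static_cast<size_t>(prefix / count) * (i + 1)) % length);
  }
  return out;
}
```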
Flags: needinfo?(tnguyen)
Flags: needinfo?(gpascutto)
This constructs a kind of LCG random number generator, deterministically seeded from the prefix we're generating noise from, right? I think that achieves the desired effect.

We probably want to document how this works a bit more in the code itself, to prevent this from regressing as it almost did here.

(Yes, Google can probably figure out the real URLs by correlating with the link traffic in their search engine, or by correlating with Chrome results. But that takes a conscious, intentional effort on their side to violate user privacy. Observing that consecutive queries from Firefox users have 2 prefixes in common only takes a cache. Let's do best effort for privacy, guys.)
Flags: needinfo?(gpascutto)
GCP has a point about the non-determinism allowing the noise to be removed more easily.

Should we simply seed the random number generator with the prefix before calling rand()?
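One hypothetical way to sketch that idea (assuming plain srand/rand for illustration; this is not what landed):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hedged sketch: seed the PRNG with the matched prefix so the "random"
// noise indices are deterministic per prefix. Users looking up the same
// URL would then send identical noise, preserving the herd-privacy
// property gcp described. (srand/rand are process-global state, so real
// code would prefer a self-contained PRNG seeded from the prefix.)
std::vector<size_t> SeededNoiseIndices(uint32_t matchedPrefix,
                                       size_t prefixCount, size_t noiseCount) {
  std::srand(matchedPrefix);  // deterministic seed derived from the prefix
  std::vector<size_t> indices;
  for (size_t i = 0; i < noiseCount; i++) {
    indices.push_back(static_cast<size_t>(std::rand()) % prefixCount);
  }
  return indices;
}
```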

The previous code was getting harder to understand and I feared it would be hard to spot a regression (ironically, my suggestions introduced a different regression).

Thomas, could you please file a follow-up bug to make the noise values deterministic for the same prefix, if you haven't done so already?
Flags: needinfo?(tnguyen)
Nevermind, I found the follow-up :)
Depends on: 1343416
Flags: needinfo?(tnguyen)
Depends on: 1351472