Closed Bug 1285848 Opened 8 years ago Closed 8 years ago

Supports Rice-encoded table update for v4

Categories

(Toolkit :: Safe Browsing, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: hchang, Assigned: hchang)

References

Details

(Whiteboard: #sbv4-m1)

Attachments

(3 files, 4 obsolete files)

In SB v4, Rice Golomb encoding is introduced to reduce the prefix list size. We need a decompression function to decode the compressed prefixes.
Assignee: nobody → hchang
Attachment #8770498 - Attachment is obsolete: true
Whiteboard: #sbv4-m1
Summary: [SafeBrowsing] Supports encoded table update for v4 → Supports encoded table update for v4
Attachment #8770500 - Attachment is obsolete: true
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/68276/diff/1-2/
Attachment #8776499 - Attachment is obsolete: true
RICE decoding related codes have landed Chromium a week ago [1]. Reading the code and see we might be able to reuse ... (maybe just the test case)

[1] https://chromium.googlesource.com/chromium/src/+/9e71d408737100714f022dffe20dfb346d395bf9
Summary: Supports encoded table update for v4 → Supports Rice-encoded table update for v4
Priority: -- → P2
Attachment #8777288 - Attachment is obsolete: true
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

https://reviewboard.mozilla.org/r/68276/#review71496

Clearing review while Henry investigates the Chromium code.
Attachment #8776503 - Flags: review?(francois)
Comment on attachment 8783460 [details]
Bug 1285848 - Part 2: Request and parse RICE encoded prefix by default.

https://reviewboard.mozilla.org/r/73252/#review71498

Clearing review while Henry investigates the Chromium code.
Attachment #8783460 - Flags: review?(francois)
Attached file Bug 1285848.pdf
Attachment #8776503 - Flags: review?(francois)
Attachment #8776503 - Flags: review?(francois)
Depends on: 1305581
Comment on attachment 8783460 [details]
Bug 1285848 - Part 2: Request and parse RICE encoded prefix by default.

https://reviewboard.mozilla.org/r/73252/#review80424

A few things to address. The only thing that's wrong here is the bug that Henry already knew about (`1` v. `4`).

::: toolkit/components/url-classifier/HashStore.cpp:171
(Diff revision 3)
>  TableUpdateV4::NewPrefixes(int32_t aSize, std::string& aPrefixes)
>  {
> -  NS_ENSURE_TRUE_VOID(aPrefixes.size() % aSize == 0);
> -  NS_ENSURE_TRUE_VOID(!mPrefixesMap.Get(aSize));
> +  MOZ_ASSERT(aPrefixes.size() % aSize == 0);
> +  MOZ_ASSERT(!mPrefixesMap.Get(aSize));
> +
> +#if 1 // DEBUG

This should be `if (LOG_ENABLED()) { ... }`.

Henry says he wants to also dump the last ten elements.

::: toolkit/components/url-classifier/ProtocolParser.cpp:981
(Diff revision 3)
>  
>    return NS_OK;
>  }
>  
> +static void
> +ReverseByte(uint8_t& b)

If you can find an existing function in gecko, that would be great because we could rely on it and not have to write a test for this one.

::: toolkit/components/url-classifier/ProtocolParser.cpp:1048
(Diff revision 3)
> +
> +  nsTArray<uint32_t> decoded;
> +  nsresult rv = DoRiceDeltaDecode(aAddition.rice_hashes(), decoded);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  // Say we have raw prefixes

"Say we have the following raw prefixes"

::: toolkit/components/url-classifier/ProtocolParser.cpp:1060
(Diff revision 3)
> +  // which can be treated as uint32 (big-endian) sorted in increasing order:
> +  //
> +  // [1, 512, 196608, 67108864]
> +  //
> +  // According to https://developers.google.com/safe-browsing/v4/compression,
> +  // they should be done the following operations prior to compression:

"the following should be done prior to compression:"

::: toolkit/components/url-classifier/ProtocolParser.cpp:1063
(Diff revision 3)
> +  //
> +  // According to https://developers.google.com/safe-browsing/v4/compression,
> +  // they should be done the following operations prior to compression:
> +  //
> +  // 1) re-interpret in little-endian ==> [16777216, 131072, 768, 4]
> +  // 2) sort in increaing order       ==> [4, 768, 131072, 16777216]

typo: "increasing"

::: toolkit/components/url-classifier/ProtocolParser.cpp:1072
(Diff revision 3)
> +  //
> +  // 1) sort in big-endian order      ==> [16777216, 131072, 768, 4]
> +  // 2) copy each uint32 in little-endian to the result string
> +  //
> +
> +  // The 4-byte prefixes have to be re-sorted in the lexical order.

"in the lexical order" => "in the lexical (Big Endian) order"

::: toolkit/components/url-classifier/ProtocolParser.cpp:1092
(Diff revision 3)
> +
> +  // The encoded prefixes are always 4 bytes.
> +  std::string prefixes;
> +  for (size_t i = 0; i < decoded.Length(); i++) {
> +    char p[4];
> +    NativeEndian::copyAndSwapToLittleEndian(p, &decoded[i], 4);

Note from Henry: that last argument should be a `1`, not a `4` because it's the number of elements, not the size in bytes.

I will also suggest a small comment at the end of that line to say that it's number of elements and not bytes, because everywhere else is in bytes.

::: toolkit/components/url-classifier/nsUrlClassifierUtils.cpp:110
(Diff revision 3)
>  {
>    aListUpdateRequest->set_threat_type(aThreatType);
>    aListUpdateRequest->set_platform_type(GetPlatformType());
>    aListUpdateRequest->set_threat_entry_type(URL);
>  
>    // Only RAW data is supported for now.

We can remove this comment now :)
Attachment #8783460 - Flags: review?(francois) → review-
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

https://reviewboard.mozilla.org/r/68276/#review80460

A few general comments, as we discussed in person:

- Since we're going to fork this code from WebRTC, can we modify the guts of the decoder to simplify the rest of our code (e.g. by removing some bit reversals)?
- We should only support test cases in the Safe Browsing format to keep the code simpler.
- The decoder returning true/false is fine by me (as opposed to detailed error codes like in Chromium), but it would be good to `LOG()` all errors explicitly to help with our future debugging.

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:10
(Diff revision 5)
> +#include "RiceDeltaDecoder.h"
> +
> +namespace {
> +
> +////////////////////////////////////////////////////////////////////////
> +// BitBuffer is copied and modified from webrtc/base/bitbuffer.h

Since you're going to modify this, we should say something like "BitBuffer is based on the equivalent class in webrtc/base/bitbuffer.h"

And then include the copyright header from that file:

```
/*
 *  Copyright 2015 The WebRTC Project Authors. All rights reserved.
 *
 *  Use of this source code is governed by a BSD-style license
 *  that can be found in the LICENSE file in the root of the source
 *  tree (media/webrtc/<whatever>). An additional intellectual property rights grant can be found
 *  in the file PATENTS.  All contributing project authors may
 *  be found in the AUTHORS file in the root of the source tree.
 */
```

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:93
(Diff revision 5)
> +                         uint32_t aFirstValue,
> +                         uint32_t aNumEntries,
> +                         uint32_t* aDecodedData,
> +                         bool aIsRemainderBigEndian)
> +{
> +  BitBuffer bitBuffer(mEncodedData, mEncodedDataSize);

A comment defining the terminology (which seems to be standard) would be useful:

```
// N = the delta
// q = quotient
// r = remainder
// k = RICE parameter
```

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:96
(Diff revision 5)
> +                         bool aIsRemainderBigEndian)
> +{
> +  BitBuffer bitBuffer(mEncodedData, mEncodedDataSize);
> +
> +  const uint32_t k = aRiceParameter;
> +  uint32_t last = aFirstValue;

I would suggest naming this `previous` instead of `last` to remove the ambiguity.

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:102
(Diff revision 5)
> +  for (uint32_t i = 0; i < aNumEntries; i++) {
> +    // Read the quotient of N.
> +    uint32_t q;
> +    NS_ENSURE_TRUE(bitBuffer.ReadExponentialGolomb(&q), false);
> +
> +    // Read the remainder of N.

`// Read the remainder of N, one bit at a time.`

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:107
(Diff revision 5)
> +    // Read the remainder of N.
> +    uint32_t r = 0;
> +    for (uint32_t j = 0; j < k; j++) {
> +      uint32_t b = 0;
> +      if (!bitBuffer.ReadBits(&b, 1)) {
> +        // Insufficient bits. Just remain zero.

"Just leave them as zeros."

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:110
(Diff revision 5)
> +      uint32_t b = 0;
> +      if (!bitBuffer.ReadBits(&b, 1)) {
> +        // Insufficient bits. Just remain zero.
> +        break;
> +      }
> +      if (aIsRemainderBigEndian) {

`// Add the bit to the right position so that it's in Little Endian order.`

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:133
(Diff revision 5)
> +namespace {
> +//////////////////////////////////////////////////////////////////////////
> +// The BitBuffer impl is copied and modified from webrtc/base/bitbuffer.cc
> +//
> +
> +#define RTC_DCHECK_LE(v1, v2)

Since we're going to modify the code, we can now remove these empty defines and change the corresponding uses of them in the code to use the standard Mozilla assertions.

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:153
(Diff revision 5)
> +  return (byte & mask) >> shift;
> +}
> +
> +BitBuffer::BitBuffer(const uint8_t* bytes, size_t byte_count)
> +    : bytes_(bytes), byte_count_(byte_count), byte_offset_(), bit_offset_() {
> +  RTC_DCHECK(static_cast<uint64_t>(byte_count_) <=

I think we could simply replace `RTC_DCHECK` with `MOZ_ASSERT` here, no?

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:1
(Diff revision 5)
> +#include "gtest/gtest.h"

Include standard copyright header for tests (I think it's CC0).

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:20
(Diff revision 5)
> +static bool runOneTest(TestingData& aData, bool aSBV4Order);
> +
> +TEST(RiceDeltaDecoder, NetworkOrder)
> +{
> +  static const std::vector<TestingData> TESTING_DATA = {
> +    // The first batch of testing data, which has the same delta values.

"same delta values (1)."

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:52
(Diff revision 5)
> +    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15, 16, 17, 18, 19, 20},
> +      {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01},
> +      7
> +    },
> +
> +    // The second batch of testing data, which has random delta values.

"random" => "different"

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:102
(Diff revision 5)
> +// in |runOneTest| for more detail.
> +TEST(RiceDeltaDecoder, SBV4Order) {
> +
> +  // The following structure and testing data is copied from Chromium source code:
> +  //
> +  // http://goo.gl/VgFIAc

I don't trust the longevity of URL shorteners. Let's just use

https://chromium.googlesource.com/chromium/src.git/+/950f9975599768b6a08c7146cb4befa161be87aa/components/safe_browsing_db/v4_rice_unittest.cc#75

even though it's ugly :)

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:104
(Diff revision 5)
> +
> +  // The following structure and testing data is copied from Chromium source code:
> +  //
> +  // http://goo.gl/VgFIAc
> +  //
> +  // and will be tranlated to our own testing format.

typo: translated

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:120
(Diff revision 5)
> +      , mDeltas(aDeltas)
> +      , mEncoded(aEncoded)
> +    {
> +    }
> +  };
> +

We can add `// ----- Start of Chromium test code ----` here along with the Chromium copyright header:

```
// Copyright 2016 The Chromium Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the <wherever that file is in gecko>/LICENSE file.
```

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:187
(Diff revision 5)
> +               287633354, 801308671, 424169435, 372520475, 277287849},
> +          "\xb2\x2c\x26\x3a\xcd\x66\x9c\xdb\x5f\x07\x2e\x6f\xe6\xf9\x21\x10\x52"
> +          "\xd5\x94\xf4\x82\x22\x48\xf9\x9d\x24\xf6\xff\x2f\xfc\x6d\x3f\x21\x65"
> +          "\x1b\x36\x34\x56\xea\xc4\x21\x00"),
> +  };
> +

We can add `// ----- End of Chromium test code ----` here.

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:204
(Diff revision 5)
> +    // set to an arbitrary value, say 7, to avoid any assumption to the
> +    // first value in the implementation.
> +    d.mExpectedDecoded.resize(tdc.mDeltas.size() + 1);
> +    for (size_t i = 0; i < d.mExpectedDecoded.size(); i++) {
> +      if (0 == i) {
> +        d.mExpectedDecoded[i] = 7;

Add `// "7" is an arbitrary starting value` to the end of that line.
Attachment #8776503 - Flags: review?(francois) → review-
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

https://reviewboard.mozilla.org/r/68276/#review80460

As for "the guts of the decoder", I'd like to confirm with you that it doesn't imply the bitbuffer but the RiceDeltaDecoder.
No longer blocks: safebrowsingv4
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

https://reviewboard.mozilla.org/r/68276/#review81558

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:18
(Diff revisions 5 - 6)
> +/*
> + *  Copyright 2015 The WebRTC Project Authors. All rights reserved.
> + *
> + *  Use of this source code is governed by a BSD-style license
> + *  that can be found in the LICENSE file in the root of the source
> + *  tree (media/webrtc/<whatever>). An additional intellectual property

`<whatever>` should be replaced by the actual path :)

::: toolkit/components/url-classifier/RiceDeltaDecoder.cpp:159
(Diff revisions 5 - 6)
>  namespace {
>  //////////////////////////////////////////////////////////////////////////
>  // The BitBuffer impl is copied and modified from webrtc/base/bitbuffer.cc
>  //
>  
> -#define RTC_DCHECK_LE(v1, v2)
> +#define RTC_DCHECK_LE(v1, v2) MOZ_ASSERT(v1 <= v2)

Since we are forking the webrtc version of this code and aren't planning to merge changes in the future, I don't think there's any point in leaving the original macro names.

We may as well do a search/replace in the .cpp and use `MOZ_ASSERT()` directly.

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:2
(Diff revisions 5 - 6)
> +/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> +/* This Source Code Form is subject to the terms of the Mozilla Public

nit: For tests, we can use the Public Domain copyright header (https://www.mozilla.org/en-US/MPL/headers/).

This would enable Chromium to take our tests if they wanted to.

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:50
(Diff revisions 5 - 6)
>      }
>    };
>  
> +  // Copyright 2016 The Chromium Authors. All rights reserved.
> +  // Use of this source code is governed by a BSD-style license that can be
> +  // found in the <wherever that file is in gecko>/LICENSE file.

Again, the `<whatever...>` placeholder should be replaced with the actual location in the source tree.
Attachment #8776503 - Flags: review?(francois) → review-
Comment on attachment 8783460 [details]
Bug 1285848 - Part 2: Request and parse RICE encoded prefix by default.

https://reviewboard.mozilla.org/r/73252/#review81570
Attachment #8783460 - Flags: review?(francois) → review+
Comment on attachment 8776503 [details]
Bug 1285848 - Part 1: Implement Rice Delta Decoding. .

https://reviewboard.mozilla.org/r/68276/#review82282

r+ but please fix this license header before landing.

::: toolkit/components/url-classifier/tests/gtest/TestRiceDeltaDecoder.cpp:48
(Diff revisions 6 - 7)
>      }
>    };
>  
>    // Copyright 2016 The Chromium Authors. All rights reserved.
>    // Use of this source code is governed by a BSD-style license that can be
> -  // found in the <wherever that file is in gecko>/LICENSE file.
> +  // found in the TestRiceDeltaDecoder.cpp LICENSE file.

Here, I meant to use the path where the LICENSE file is located: `media/webrtc/trunk/webrtc/LICENSE`
Attachment #8776503 - Flags: review?(francois) → review+
Pushed by hchang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e7410bc7e068
Part 1: Implement Rice Delta Decoding. r=francois.
https://hg.mozilla.org/integration/autoland/rev/14a91d6bf939
Part 2: Request and parse RICE encoded prefix by default. r=francois
https://hg.mozilla.org/mozilla-central/rev/e7410bc7e068
https://hg.mozilla.org/mozilla-central/rev/14a91d6bf939
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Depends on: 1309440
Depends on: 1431370
You need to log in before you can comment on or make changes to this bug.