Closed Bug 1135022 Opened 5 years ago Closed 4 years ago

Optimize ChunkSet to efficiently store ranges

Categories

(Toolkit :: Safe Browsing, enhancement, P5)

enhancement

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: gcp, Unassigned, Mentored)

References

(Depends on 1 open bug)

Details

(Whiteboard: [lang=c++][good first bug])

Attachments

(18 files, 14 obsolete files)

2.62 KB, patch (mmc: review+)
3.52 KB, patch
6.28 KB, patch
5.32 KB, patch
9.24 KB, patch (gcp: review-)
5.71 KB, patch
7.12 KB, patch
5.25 KB, patch (francois: review+)
11.89 KB, patch
2.09 KB, patch
1.41 KB, patch
1.17 KB, patch
1.58 KB, patch
1.57 KB, patch (gcp: review+)
1.66 KB, patch (gcp: review+)
10.85 KB, patch (gcp: review+)
58 bytes, text/x-review-board-request (gcp: review+)
58 bytes, text/x-review-board-request (francois: review+)
SafeBrowsing uses ChunkSet to store the chunk numbers that it has processed or will process in a SafeBrowsing update. It's a simple interface implementing a set of integers:
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/ChunkSet.h

The current implementation uses a sorted array of numbers as storage. This works fine for typical data, which looks something like 10001, 10003, 10100-10190, 10191, but it's wasteful for ranges. The example above requires storing 94 values (1+1+91+1) in the current implementation. If we stored everything as a range (start and stop value) instead, it would require only 8 values (i.e. 10001-10001, 10003-10003, 10100-10190, 10191-10191). Most of the time the data consists of a single range, so this optimization offers potentially nice memory savings.

The main difficulty in this bug is implementing Set/Unset/Merge correctly.
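To make the idea concrete, here is a minimal sketch (plain C++ with std::pair standing in for the Mozilla container types; ToRanges is a hypothetical helper for illustration, not part of any patch here) of how a sorted chunk list collapses into ranges:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: a range is an inclusive [start, end] pair.
using ChunkRange = std::pair<uint32_t, uint32_t>;

// Collapse a sorted list of chunk numbers into sorted, non-overlapping
// ranges. Consecutive numbers extend the last range; gaps start a new one.
std::vector<ChunkRange> ToRanges(const std::vector<uint32_t>& aChunks) {
  std::vector<ChunkRange> ranges;
  for (uint32_t chunk : aChunks) {
    if (!ranges.empty() && chunk == ranges.back().second + 1) {
      ranges.back().second = chunk;  // extend the previous range
    } else {
      ranges.emplace_back(chunk, chunk);  // start a new single-value range
    }
  }
  return ranges;
}
```

Note that a sketch like this, which also merges adjacent values, would collapse the example data above into just three ranges, since 10191 directly follows 10100-10190.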
Whiteboard: [lang=c++][good first bug]
Small patch to remove some unused parts from the API. This should
simplify fixing this bug.
Attachment #8567071 - Flags: review?(mmc)
Attachment #8567071 - Flags: review?(mmc) → review+
Keywords: leave-open
I've had a look at ChunkSet.h and would like to have a go at this. In principle I should just be able to change mChunks to be FallibleTArray< nsTArray<uint32_t> >, so that the inner array stores the range start and stop values.
Hi, I would like to work on this bug.

Can someone give me some more information?
I don't think I have enough time to work on this, so go right ahead.

Currently ChunkSet stores chunk numbers individually. The basic idea, as far as I understand it, is to instead store chunk ranges. So, for example, if you had chunks 1, 2, 3, 4, 5 stored in ChunkSet, you could reduce memory usage by storing 1, 6 and noting that this specifies a range from 1 to 5. This could be implemented by switching uint32_t in mChunks for something like a std::pair<uint32_t, uint32_t>. The first value in the pair would specify the start, and the second would specify the last plus one.

I haven't thought about this in great detail, but I have some ideas on how ChunkSet would need to be modified to make this work:

* ChunkSet::Set will need some algorithm to determine whether aChunk is adjacent to an existing range, and to expand the range to accommodate the new value. If there's no adjacent range, a new one will have to be added.
* ChunkSet::Has will need to be modified so that the binary search checks whether the specified value is within one of the ranges. I don't know whether the existing binary search algorithms allow this; it may be that they do.
* There will also need to be some mechanism to make sure that the ranges do not overlap, or that, when they do, they're combined into a single range.
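A rough sketch of those three points (std::vector and std::lower_bound in place of Mozilla's FallibleTArray and fallible allocation; the ChunkRange struct and method names are assumptions for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical inclusive [start, end] range.
struct ChunkRange { uint32_t start, end; };

class ChunkSet {
 public:
  // Binary-search for the first range whose end >= aChunk, then check
  // whether aChunk also lies at or after that range's start.
  bool Has(uint32_t aChunk) const {
    auto it = std::lower_bound(
        mRanges.begin(), mRanges.end(), aChunk,
        [](const ChunkRange& r, uint32_t c) { return r.end < c; });
    return it != mRanges.end() && aChunk >= it->start;
  }

  // Extend an adjacent range, merge two ranges that become adjacent,
  // or insert a fresh single-value range at the right position.
  void Set(uint32_t aChunk) {
    if (Has(aChunk)) {
      return;
    }
    auto it = std::lower_bound(
        mRanges.begin(), mRanges.end(), aChunk,
        [](const ChunkRange& r, uint32_t c) { return r.end < c; });
    bool touchesPrev =
        it != mRanges.begin() && std::prev(it)->end + 1 == aChunk;
    bool touchesNext = it != mRanges.end() && it->start == aChunk + 1;
    if (touchesPrev && touchesNext) {
      std::prev(it)->end = it->end;  // aChunk bridges two ranges: merge them
      mRanges.erase(it);
    } else if (touchesPrev) {
      std::prev(it)->end = aChunk;
    } else if (touchesNext) {
      it->start = aChunk;
    } else {
      mRanges.insert(it, {aChunk, aChunk});
    }
  }

  size_t RangeCount() const { return mRanges.size(); }

 private:
  std::vector<ChunkRange> mRanges;  // sorted, non-overlapping
};
```

Keeping the ranges sorted and non-overlapping as an invariant is what lets Has stay a single binary search.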
This algorithm in ChunkSet::Set determines whether aChunk is adjacent to an existing range and expands the range to accommodate the new value. If there's no adjacent range, a new one is added.
Gian-Carlo, can you let me know if I'm going on the right track (attachment 8601775 [details] [diff] [review])? If so, I would like to assign this bug.
Hi Ryan,

If I may make a couple of comments at this stage.

* It'd generally be better to use a const int or constexpr int in place of a #define for STARTED and STOPED (STOPPED?), since this would facilitate debugging.
* On lines 76 and 91 of your patch, you can use else statements since the conditions in the previous if statements on lines 70 and 85 are simply the opposite of the statements on lines 76 and 91.
* Similarly, on line 83 you could probably use an else if statement, since the values you're comparing to are not going to be the same.
* It may be more performant to use something like a std::pair, which has a fixed number of elements, instead of what you've defined as nsT on line 136. Then you could refer to the first and second elements like newmChunks[x].first and newmChunks[x].second.
Matt, you should be able to use the review option on the patch to attach your comments to it directly, if you want to. You can comment directly in the code and things like that.

Ryan, you don't need to have a bug assigned to work on it; just update your patches in the bug every now and then (or comment on your progress) and it's obvious who's working on it. Assigning doesn't actually do anything, and sometimes the person working on a bug has no time to finish the patch, and the assignment then scares away anyone else who wants to pick it up. We mostly use it to tell people who work on Firefox full-time "please fix this first".

I'll comment more in the patch.
I didn't know you could do this. I've added my comments.
Comment on attachment 8601775 [details] [diff] [review]
Create existing range in ChunkSet::Set

Review of attachment 8601775 [details] [diff] [review]:
-----------------------------------------------------------------

These might be useful to know:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style
https://developer.mozilla.org/en-US/docs/Mozilla/C++_Portability_Guide

I'd say, keep on your efforts, see if you can fix the issues identified and keep iterating till you get a solution.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +11,5 @@
> +class Comparator
> +{
> + private:
> +	 typedef nsTArray<uint32_t> nsT;
> +	

You have a mix of indentations here, as well as a bunch of whitespace lying around. Check whether your editor can show you leading and/or trailing whitespace, it will help create cleaner patches.

nsT isn't really a descriptive name. I see you also already declared this type in the header too, so you probably will want to reuse that definition instead of (re)declaring a similar-but-not-identical one here.

@@ +57,5 @@
>  nsresult
>  ChunkSet::Set(uint32_t aChunk)
>  {
> +  #define STARTED 0
> +  #define STOPED 1

These should probably be constants in the class.

Alternatively, given that you just want to store two numbers and not really an array, maybe it makes more sense to just have the class have two member variables named start and stop.

There was a suggestion to use std::pair, but STL isn't generally usable inside Firefox. There is mozilla/Pair.h though: https://dxr.mozilla.org/mozilla-central/source/mfbt/Pair.h

@@ +68,5 @@
>      }
> +
> +    if(aChunk == (newmChunks[x][STOPED]+ 1))
> +    {
> +      if((x+1 == newmChunks.Length()) ||(newmChunks[x+1][STARTED] - 1 !=aChunk))

You're going to be handling a bunch of different cases. It might be good to comment what each piece of code is supposed to be doing. This will make it more obvious if cases are missed.

@@ +92,5 @@
> +		
> +       if((x!=0)&&(newmChunks[x-1][STOPED]+1 == aChunk))
> +       {
> +         newmChunks[x][STARTED] = newmChunks[x-1][STARTED];
> +	 newmChunks.RemoveElementsAt(x-1,1);	

This could get tricky later on because you're effectively shifting the array around underneath yourself. i.e. newmChunks[x] now points to another element.

@@ +100,5 @@
> +}
> +
> +	size_t g = newmChunks.Length();
> +
> +	newmChunks.AppendElements(nsT());

Note that it's a fallible array, which means this can fail if there is no more memory. This happens for example on mobile devices with very little RAM. You should check the return codes.

You might want to try to have it working first before handling these exceptional cases, though. Just putting it here for future reference.

::: toolkit/components/url-classifier/ChunkSet.h
@@ +40,5 @@
>    }
>  
>  private:
> +	typedef nsTArray<uint32_t> nsT;
> +	FallibleTArray<nsT> newmChunks;

FWIW, the "m" in mChunks stands for "member variable". So for consistency this should be mNewChunks.
I will work on updating the code per your inputs.
Gian-Carlo,

Per your review, I'm not sure what you mean by "reuse that definition instead of (re)declaring a similar-but-not-identical one here". Perhaps an example would suffice to show what you mean.
I *think* what Gian-Carlo means is that you declare the typedef in the header inside the ChunkSet class, but you then go on to use it again in Comparator. To improve maintainability you could make the typedef in ChunkSet public, then use it in Comparator like so:

bool Equals(const ChunkSet::nsT& a1, const ChunkSet::nsT& a2) const

This way, if the definition of nsT in ChunkSet ever changes, this is automatically propagated through to Comparator.
Yes, both ChunkSet.cpp and ChunkSet.h contain an identical definition:

typedef nsTArray<uint32_t> nsT;

Not only does that duplicate the code, you're also introducing the risk of bugs if you change it in one place and forget the other.

You should define it in the .h and then use that definition in the .cpp file.
Matt and Gian-Carlo, I've made updates based on your comments.
Just putting a note here that https://bugzilla.mozilla.org/show_bug.cgi?id=1163445 might add some code which overlaps in functionality and is potentially reusable.
Gian-Carlo,

I'm a bit confused on how ChunkSet::Remove should be implemented. Should Remove be removing a Pair?
Flags: needinfo?(gpascutto)
No, it takes a ChunkSet as an argument. So it's any arbitrary list of integers, too.
Flags: needinfo?(gpascutto)
Comment on attachment 8609517 [details] [diff] [review]
ChunkSet::Has modified so that the binary search checks whether the specified value is within one of the ranges.

Review of attachment 8609517 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like it's coming along nicely.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +12,5 @@
> +{
> + public:
> +   bool Equals(const ChunkSet::ChunkRange& a1, const ChunkSet::ChunkRange& b1) const
> +   {
> +     return a1.first() == b1.first();  

nit: spurious trailing whitespace. You have this problem throughout your patch, so check your editor settings.

Does it make sense to make the comparison = when the extent is possibly different? i.e. this makes [0, 1] == [0, 2].

The fact that you have to provide the Equals here is IMHO a bug in nsTArray, which I'll file.

@@ +51,5 @@
>  
>  nsresult
>  ChunkSet::Set(uint32_t aChunk)
>  {
> +  for(size_t x = 0; x< mNewChunks.Length(); x++)

nit: x < y please use consistent spacing (see coding style guideline)

@@ +58,5 @@
> +    if(aChunk >= mNewChunks[x].first() && aChunk <=mNewChunks[x].second())
> +    {
> +      return NS_OK;
> +    }
> +    // Increment the second value of the Pair by one to determine if aChunk should replace the second value of the Pair or 

General remark: "Increment the second value of the Pair by one" doesn't need to be documented, because it literally says the same as the line of code below. So it's as if you were repeating the code twice.

What it's really doing is something like "check if aChunk is immediately behind the end of the current ChunkRange" so document that instead. Explain what the code tries to achieve instead of what it's doing.

@@ +62,5 @@
> +    // Increment the second value of the Pair by one to determine if aChunk should replace the second value of the Pair or 
> +    // the second value of the next index replaces the second value of the current index 
> +    if(aChunk == (mNewChunks[x].second()+ 1))
> +    {
> +      //Replace the second value of the Pair with aChunk 

"Extend the range to include aChunk if we can't merge with the next range"

Maybe "x" should be renamed to something that makes it more obvious what it is.

@@ +68,5 @@
> +      {
> +        mNewChunks[x]=MakePair(mNewChunks[x].first(),aChunk);
> +        return NS_OK;
> +      }
> +      //Replace the second value of the Pair with the second value of the next index. Then delete the Pair in the next index. 

"Merge this range and the next one"

@@ +77,5 @@
> +        return NS_OK;
> +      }
> +    }
> +    // Decrement the first value of the range by one to determine if aChunk should replace the first value of the Pair or
> +    // the first value of the previous index replaces the first value of the current index

Same remark about documenting what it does vs. what it's trying to achieve.
Attachment #8609517 - Flags: feedback?(gpascutto) → feedback+
Depends on: 1168345
Comment on attachment 8625521 [details] [diff] [review]
ChunkSet::Remove modified so that it removes any arbitrary list of integers

Review of attachment 8625521 [details] [diff] [review]:
-----------------------------------------------------------------

There's a number of bugs here, and a bunch of stylistic issues that were already identified in the previous review. That said, nice progress.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +12,5 @@
> +{
> +public:
> +  bool Equals(const ChunkSet::ChunkRange& a1, const ChunkSet::ChunkRange& b1) const
> +  {
> +    return a1.first() == b1.first();  

nit: trailing whitespace

@@ +51,5 @@
>  
>  nsresult
>  ChunkSet::Set(uint32_t aChunk)
>  {
> +  for(size_t range = 0; range < mNewChunks.Length(); range++)

It's a single number, not a range, isn't it?

@@ +53,5 @@
>  ChunkSet::Set(uint32_t aChunk)
>  {
> +  for(size_t range = 0; range < mNewChunks.Length(); range++)
> +  {
> +    // Check the first and second value of the Pair to determine whether aChunk exist between the two values(inclusive).  

nit: trailing whitespace

Again: check your editor settings to make it visible, you have this problem throughout the patch.

@@ +136,3 @@
>  
> +  for(const ChunkRange *iter = dupIter; iter != end; iter++) 
> +  {

I would move the code below into a subroutine that operates on a ChunkRange parameter. That would reduce the amount of loop nesting and show that we break the problem of removing a ChunkSet down into removing a series of ranges.

It'll also fix some bugs, see below.

@@ +137,5 @@
> +  for(const ChunkRange *iter = dupIter; iter != end; iter++) 
> +  {
> +      if(Has(iter->first()) && Has(iter->second()))
> +      {         
> +        for(size_t range = 0; range < mNewChunks.Length(); range++)

Note that all 4 cases in the containing if contain this exact loop. So it can be hoisted outside the ifs.

@@ +139,5 @@
> +      if(Has(iter->first()) && Has(iter->second()))
> +      {         
> +        for(size_t range = 0; range < mNewChunks.Length(); range++)
> +        {
> +          if(iter->first() >= mNewChunks[range].first() && iter->second() <= mNewChunks[range].second())

There is no else here.

What about this case:

mNewChunks: [[2 10], [15 20]]
aOther: [3 18]

This will arrive here, but be rejected by the above if (because it straddles two ranges, and the check above only checks one range). There's no "else" so this wouldn't get handled, even though our mNewChunks should end up being modified to:

==> mNewChunks: [2 2], [19 20]

The number of straddled ranges can be anything, e.g.:

mNewChunks: [[2 10], [15 20], [25 30]]
aOther: [4 27]

==> mNewChunks: [2 3], [28 30]
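The straddling cases can be handled uniformly by rebuilding the list, keeping only what each stored range contributes outside the removed span. A sketch of that (plain C++; RemoveRange and the ChunkRange struct are hypothetical names, not code from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive [start, end] range.
struct ChunkRange { uint32_t start, end; };

// Remove [aStart, aEnd] from sorted, non-overlapping ranges. Each stored
// range is either kept whole, trimmed on one side, split in two, or
// dropped entirely, so spans straddling any number of ranges just work.
void RemoveRange(std::vector<ChunkRange>& aRanges,
                 uint32_t aStart, uint32_t aEnd) {
  std::vector<ChunkRange> result;
  for (const ChunkRange& r : aRanges) {
    if (r.end < aStart || r.start > aEnd) {
      result.push_back(r);  // no overlap: keep unchanged
      continue;
    }
    if (r.start < aStart) {
      result.push_back({r.start, aStart - 1});  // keep the left remainder
    }
    if (r.end > aEnd) {
      result.push_back({aEnd + 1, r.end});  // keep the right remainder
    }
  }
  aRanges = std::move(result);
}
```

On the worked examples: removing [3, 18] from [[2, 10], [15, 20]] leaves [[2, 2], [19, 20]], and removing [4, 27] from [[2, 10], [15, 20], [25, 30]] leaves [[2, 3], [28, 30]].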

@@ +148,5 @@
> +              if(iter->second() < mNewChunks[range].second())
> +              {
> +                Comparator sortPair;
> +                mNewChunks[range] = MakePair(mNewChunks[range].first(), iter->first() - 1); 
> +                mNewChunks[mNewChunks.Length()] = MakePair(iter->second() + 1, mNewChunks[range].second());

This doesn't work because you're writing past the end of mNewChunks, you need to call AppendElement first.

@@ +177,5 @@
> +            }
> +          } 
> +        }
> +      }
> +      // Check to see if the second element in aOther appears in mNewChunk       

Same comment as before: document what it's trying to achieve. In this situation, the end point of a range in aOther overlaps with an existing range.

@@ +181,5 @@
> +      // Check to see if the second element in aOther appears in mNewChunk       
> +      else if (Has(iter->second()))
> +      {
> +        for (size_t range = 0; range < mNewChunks.Length(); range++)
> +        { // Shrink the range by reducing the first value of mNewChunk 

First value? Why is it operating on second() then? And increasing it, to boot.

I think this illustrates that making a custom class or struct with "begin" and "end" fields may be more clear than trying to use Pair everywhere.

What is it trying to achieve? "Shrink the range by setting the new start just past the removed range and keeping the end"

@@ +185,5 @@
> +        { // Shrink the range by reducing the first value of mNewChunk 
> +          if (iter->second() < mNewChunks[range].second())
> +          {
> +            mNewChunks[range] = MakePair(iter->second() + 1, mNewChunks[range].second());
> +            return NS_OK;

So we're bailing out of the function here. What about all the other ChunkRanges in aOther? Won't they get skipped? You have this bug in a few places, splitting up the work in multiple functions will help with this.
Attachment #8625521 - Flags: review?(gpascutto) → review-
Is this still open? If so, I'd like to work on it. Thanks!
The last update was 2 months ago so I presume Ryan has had no time to continue working on it. Go ahead.
Alright, thanks! This is my first bug, and I'm new to C++ (though I've done C for a semester). Would you be able to chat over IRC or email? Thanks.
Yes, you should be able to see my email by hovering over my name in this bug, and I'm "gcp" on Mozilla's IRC server. Feel free to nag me.
Alright, thanks. Any idea when you'll be on IRC? I can never seem to catch you. Thanks.
I'm going to give this a shot. 

Gian-Carlo, what's the best approach for changing the read and write functions? As it is right now, we read in or write out elements one by one, and it looks like WriteTArray and ReadTArray can't handle ranges.
I'm not really sure how to unit test this. But here is my progress so far.
Attachment #8657649 - Flags: feedback?
Comment on attachment 8657649 [details] [diff] [review]
Attempt at implementing ranges

Review of attachment 8657649 [details] [diff] [review]:
-----------------------------------------------------------------

Code quality looks decent, but I need to think more about whether this works. You're also missing part of your changes, I think you forgot to add at least the changes to the .h file?

You are right about this being a pain to unit test. I'll see if I can fix that. When the original implementation was written, unit testing C++ code like this was a big problem (which is why there's none right now!), but I believe it should be easier now. Meanwhile, you can run tests like:

./mach xpcshell-test toolkit/components/url-classifier/

Which tests the functionality of the components using this code.
Attachment #8657649 - Flags: feedback? → feedback+
Sounds good. I'll try to implement Read and Write in the meantime and give those tests a try.
I implemented Read() and Write(). I'm not sure what I have is the best way to do this though. Also, I ran into an error running the tests. (I'm running Windows 8.1)

python mach xpcshell-test toolkit/components/url-classifier/

Error running mach:

    ['xpcshell-test', 'toolkit/components/url-classifier/']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

WindowsError: [Error 193] %1 is not a valid Win32 application
Attachment #8658462 - Flags: feedback?
Comment on attachment 8658462 [details] [diff] [review]
ChunkSet with Read() and Write()

Review of attachment 8658462 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for not responding earlier. If you set feedback? and don't fill in the Requestee field, then no one gets notified that this needs a response :-/

https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch -> Getting Reviews

I'll look at the situation with unit tests now. You seem to be progressing well but I have a bunch of remarks. Also, can you combine your patches? This patch didn't contain a cleaned up version of the last patch.

The Windows thing with mach I'll also check.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +7,5 @@
>  
>  namespace mozilla {
>  namespace safebrowsing {
>  
> +#define BUFFER_SIZE 10;

Move this to the .h file, it's just duplicated here.

@@ +16,5 @@
>    aChunkStr.Truncate();
>    uint32_t i = 0;
>    while (i < mChunkStarts.Length()) {
>  
> +    if (i != 0) {

No need to clean up the existing indentation here, it just adds noise to the patch.

@@ +72,3 @@
>      } else {
> +      if (!mChunkStarts.InsertElementSorted(start, fallible))
> +        return NS_ERROR_OUT_OF_MEMORY

bracing (code style), also you're missing semicolons so I doubt this compiles.

@@ +146,5 @@
> +  void *buffer = strmBuf.Elements();
> +  if (!strmBuf.SetLength(BUFFER_SIZE, fallible))
> +    return NS_ERROR_OUT_OF_MEMORY;
> +
> +  uint32_t i; // index of range

Move inside the for.

@@ +147,5 @@
> +  if (!strmBuf.SetLength(BUFFER_SIZE, fallible))
> +    return NS_ERROR_OUT_OF_MEMORY;
> +
> +  uint32_t i; // index of range
> +  uint32_t wr = 0; // current buffer index to write to

Make the var name more understandable.

@@ +154,5 @@
> +    while (cur <= mChunkEnds[i]) {
> +      while (wr < BUFFER_SiZE) {
> +        // check if single number range
> +        if (mChunkStarts[i] == mChunkEnds[i]) {
> +          buffer[wr++] = cur++;

Arithmetic on something defined as a void*?

@@ +158,5 @@
> +          buffer[wr++] = cur++;
> +          break;
> +        } else {
> +          if (cur <= mChunkEnds[i])
> +            buffer[wr++] = cur++;

You've got the two sides of the if statement here running the same code, so this can be simplified.

@@ +164,5 @@
> +            break;
> +        }
> +      }
> +      uint32_t written;
> +      if (wr == BUFFER_SIZE) {

What if the number of entries doesn't match up with BUFFER_SIZE? i.e. near the end of a ChunkSet, when there's only a few entries remaining?

@@ +165,5 @@
> +        }
> +      }
> +      uint32_t written;
> +      if (wr == BUFFER_SIZE) {
> +        nsresult rv = aOut->Write(reinterpret_cast<char*>(strmBuf.Elements()),

Didn't you define "buffer" for this?

::: toolkit/components/url-classifier/ChunkSet.h
@@ +12,5 @@
>  
>  namespace mozilla {
>  namespace safebrowsing {
>  
> +#define BUFFER_SIZE 10;

IO_BUFFER_SIZE? Make it a class const instead of a define? Also you can probably make it a lot bigger to cut down on IO, say 128 or so.

@@ +39,5 @@
>  
>  private:
> +  FallibleTArray<uint32_t> mChunkStarts;
> +  FallibleTArray<uint32_t> mChunkEnds;
> +  FallibleTArray<uint32_t> strmBuf;

mStreamBuffer, but this probably doesn't need to be a member var at all, and not sure if it needs to be an nsTArray subclass rather than a normal array either, given that it's small and can live on the stack.
Attachment #8658462 - Flags: feedback? → feedback+
(In reply to jay from comment #36)
> Also, I ran into an error running the tests. (I'm running
> Windows 8.1)
> 
> python mach xpcshell-test toolkit/components/url-classifier/

If you're running from the start-shell-msvcxxxx.bat shells from MozillaBuild (which you should be if you're following the Windows build instructions), then you don't need to add the "python", you can just run "mach".
Depends on: 1229051
Comment on attachment 8694120 [details] [diff] [review]
Add C++ test coverage for UrlClassifier ChunkSets

Review of attachment 8694120 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/tests/gtest/TestChunkSet.cpp
@@ +55,5 @@
> +
> +TEST(UrlClassifierChunkSet, Merge)
> +{
> +  static int testVals[] = {2, 1, 5, 6, 8, 7, 14, 10, 12, 13};
> +  static int mergeVals[] = {3, 4, 9, 16, 20};

Maybe it would be good to add say "14" to mergeVals to test that Merge does the right thing when the value already exists in the original set?

Of course, if you do that, you'll have to subtract one in the length test later on.
Attachment #8694120 - Flags: review?(francois) → review+
(In reply to jay from comment #33)
> Created attachment 8657649 [details] [diff] [review]
> Attempt at implementing ranges
> 
> I'm not really sure how to unit test this. But here is my progress so far.

I've just added basic unit tests for ChunkSet. You can now do:

mach gtest UrlClassifierChunkSet.*

to test the old code, or your replacement for it. (code is in mozilla-inbound but should hit mozilla-central soon)
Attachment #8707993 - Flags: feedback?(gpascutto)
Gian-Carlo, let me know if the Chunk class can be implemented in a better way. For now, this is just a prototype. Also, can you provide some more information on how Read and Write should work?
Flags: needinfo?(gpascutto)
Comment on attachment 8707993 [details] [diff] [review]
Optimize ChunkSet to efficiently store ranges

Review of attachment 8707993 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like a pretty clean solution. I've got a number of small remarks wrt C++ style, and I think the code likely doesn't work correctly in some edge cases, so for that reason f-. Let me know if I'm wrong here.

Does this pass the unit tests?

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +56,5 @@
> +    start = aChunk;
> +    end = aChunk;
> +  }
> +
> +  Chunk chunk = Chunk(start, end);

Chunk chunk(start, end);

@@ +66,5 @@
>    return NS_OK;
>  }
>  
> +int32_t
> +ChunkSet::BinaryRangeOf(uint32_t aChunk) const

Maybe call it SearchContainingRange or something? We don't care about it being a binary search, that's an implementation detail, and "RangeOf" doesn't really explain what it does.

@@ +68,5 @@
>  
> +int32_t
> +ChunkSet::BinaryRangeOf(uint32_t aChunk) const
> +{
> +  size_t low = 1;

I would expect this to be 0. If mChunks has one element, we'll break out of the loop immediately because 1 <= 0 so we abort without even looking if there was a match.

@@ +69,5 @@
> +int32_t
> +ChunkSet::BinaryRangeOf(uint32_t aChunk) const
> +{
> +  size_t low = 1;
> +  size_t high = mChunks.Length() - 1;

That looks like it will hurt if mChunks is empty (gets to be SIZE_MAX).

@@ +94,5 @@
>  
>  nsresult
>  ChunkSet::Merge(const ChunkSet& aOther)
>  {
> +  const Chunk* dupIter = aOther.mChunks.Elements();

Let's make that begin() for consistency with the line below.

@@ +112,5 @@
>  
>  nsresult
>  ChunkSet::Remove(const ChunkSet& aOther)
>  {
> +  Chunk* addIter = mChunks.Elements();

begin() and end() for consistency with the above code.

@@ +119,5 @@
> +  for (Chunk* iter = addIter; iter != endIter; iter++) {
> +    for (uint32_t chunk = iter->getStart(); chunk <= iter->getEnd(); chunk++) {
> +      if (aOther.Has(chunk)) {
> +        if (iter->getStart() == iter->getEnd()) {
> +          mChunks.RemoveElement(*iter);

This invalidates the iterators so now the outer loop will break.

@@ +129,5 @@
> +            iter->setEnd(chunk - 1);
> +          } else {
> +            iter->setEnd(chunk - 1);
> +            Chunk c = Chunk(chunk + 1, tmp);
> +            if (!mChunks.InsertElementSorted(c, fallible)) {

Same issue with the outer iterator being invalid now.

::: toolkit/components/url-classifier/ChunkSet.h
@@ +8,5 @@
>  
>  #include "Entries.h"
>  #include "nsString.h"
>  #include "nsTArray.h"
> +#include "mozilla/Pair.h"

It looks like you didn't end up using this.

@@ +16,5 @@
>  
>  /**
> + * Chunks are stored as ranges in order to optimize storage and efficiency.
> + * Consecutive chunks, i.e. [4, 5, 6], are compressed into a single range [4-6],
> + * while individual chunks, i.e. [1, 6, 8], are compressed into [1-1, 6-6, 8-8].

Very small nitpick: "compressed into" is probably not the right wording for individual chunks, as it's actually expanding them.

@@ +21,5 @@
> + */
> +class Chunk
> +{
> +public:
> +  Chunk() {}

Do you actually need this?

@@ +25,5 @@
> +  Chunk() {}
> +
> +  Chunk(uint32_t start, uint32_t end)
> +  {
> +    mStart = start;

You can use member initializer lists here.

@@ +29,5 @@
> +    mStart = start;
> +    mEnd = end;
> +  }
> +
> +  Chunk(const Chunk& aChunk)

Doesn't the compiler generate a default copy constructor for you here? It would do the right thing, I believe.

@@ +35,5 @@
> +    mStart = aChunk.getStart();
> +    mEnd = aChunk.getEnd();
> +  }
> +
> +  ~Chunk() {}

If it doesn't do anything you don't need it. The compiler will generate a default.

@@ +53,5 @@
> +  {
> +    return c1.mStart < c2.mStart && c1.mEnd < c2.mEnd;
> +  }
> +
> +  friend bool operator!=(const Chunk& c1, const Chunk& c2)

Do you actually need all of these?
Attachment #8707993 - Flags: feedback?(gpascutto) → feedback-
(In reply to Wilmer Paulino [:wpaulino] from comment #45)
> Gian-Carlo, let me know if the Chunk class can be implemented in a better
> way. For now, this is just a prototype. Also, can you provide some more
> information on how Read and Write should work?

They have to read/write the contents of the ChunkSet to a stream so it can be persisted. The complication is that we would prefer the format they're written in to be backwards compatible, so this means writing it out as a stream of integers (all the stored chunks), not as ranges (the optimized format we're creating in this bug). This is needed because we want to be able to read the data the users already have on their disks, and dealing with different format databases will make this a lot more complex so we don't want that for now.


I reviewed the "ChunkSet with Read() and Write()" patch that's attached to this bug, and it seemed like it would work bar the remarks I made in the review. You can likely adapt that to your Chunk classes.
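In other words, Write() has to expand the in-memory ranges back into the flat stream of individual chunk numbers that the old format stores. A sketch of that expansion (Flatten is a hypothetical helper; the real code would push these uint32_t values through nsIOutputStream in buffered blocks):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical inclusive [start, end] range.
struct ChunkRange { uint32_t start, end; };

// Expand compressed ranges back into the flat list of chunk numbers the
// pre-range on-disk format expects. (Assumes end < UINT32_MAX so the
// inner loop terminates.)
std::vector<uint32_t> Flatten(const std::vector<ChunkRange>& aRanges) {
  std::vector<uint32_t> flat;
  for (const ChunkRange& r : aRanges) {
    for (uint32_t chunk = r.start; chunk <= r.end; ++chunk) {
      flat.push_back(chunk);
    }
  }
  return flat;
}
```

Read() is the mirror image: it consumes a flat stream of integers and rebuilds the compressed ranges in memory.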
Flags: needinfo?(gpascutto)
Hey Gian-Carlo, I've made some changes for the tests. They now test the ranges rather than the individual chunks. All tests pass with this version. Let me know of any style issues, naming conventions, etc. Also, are Read and Write working correctly? Is there a way to test those?
Attachment #8709493 - Flags: feedback?(gpascutto)
Attachment #8707993 - Attachment is obsolete: true
Comment on attachment 8709493 [details] [diff] [review]
Optimize ChunkSet to efficiently store ranges v2

Review of attachment 8709493 [details] [diff] [review]:
-----------------------------------------------------------------

For testing the serialization code, see a previous comment: https://bugzilla.mozilla.org/show_bug.cgi?id=1135022#c34

This is looking pretty good! Just some remarks regarding the tests that you changed. You can likely ask for review on your next revision/update.

I have one "extra". If we are reading in a large amount of data (when starting the browser), we expect most of the chunks read in to be simply one higher than the last one read in (i.e. 111, 112, 113, 114, ...). Currently the code needs two binary searches for each chunk. Can we optimize the common case? Simply checking whether the chunk is one higher than the end of the last range and fast-tracking that case should work.

::: toolkit/components/url-classifier/ChunkSet.h
@@ +59,5 @@
>  
>    nsresult Serialize(nsACString& aStr);
>    nsresult Set(uint32_t aChunk);
> +  nsresult Remove(uint32_t aChunk);
> +  int32_t SearchContainingRange(uint32_t aChunk) const;

These should be private. In fact any internal functions that aren't used outside ChunkSet should be.

@@ +64,3 @@
>    bool Has(uint32_t chunk) const;
>    nsresult Merge(const ChunkSet& aOther);
> +  uint32_t Length() const { return mChunkRanges.Length(); }

This returns the number of ChunkRanges, not the number of Chunks, so it's not what the callers expect. This is also why you needed to delete several tests just to get it to pass.

Investigating the callers, most only care whether it's empty or not (good), but the others are in the serialization code, so they will break.

Note that, as explained in one of my previous comments in this bug, serializing in the more efficient format would unfortunately force all users to update or re-download their anti-phishing and malware databases, which we don't want.

So you'll have to code this to give a "compatible" answer.

::: toolkit/components/url-classifier/tests/gtest/TestChunkSet.cpp
@@ +162,5 @@
>    chunkSet.Serialize(mergeResult);
>  
>    printf("mergeResult: %s\n", mergeResult.get());
>  
> +  nsAutoCString expected(NS_LITERAL_CSTRING("1-10,12-14,16-16,20-20"));

You can't simply change this as the output of this is fed to a server running this protocol: https://developers.google.com/safe-browsing/developers_guide_v2?hl=en

Strictly speaking, I think this more verbose notation is also allowed by the spec, but I doubt any existing client sends it, so it's unnecessary risk. It also makes the messages to the server a lot bigger if there are many noncontiguous chunks. Note that the examples in the spec use the notation that was originally in this test.

So, you should fix your code to pass this test, not change the test to make your code pass.
Attachment #8709493 - Flags: feedback?(gpascutto) → feedback+
Can I contribute to this bug?
I am a beginner, so can you tell me how I should approach this bug?
What do I need to install?
(In reply to kshitija from comment #50)
> Can I contribute to this bug?
> I am a beginner, so can you tell me how I should approach this bug?
> What do I need to install?

You need to be able to fetch the Firefox source, build it and run the result. What you need to install depends on what platform you are on. There are already several approaches discussed in this bug, and proposed patches, so read up above. I think Wilmer Paulino was very close to having a finished patch but he seems to have run out of time before being able to put the finishing touches on it.

Start there:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Introduction
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Istructions
(In reply to Gian-Carlo Pascutto [:gcp] from comment #51)
> I think Wilmer Paulino was very close to having a finished
> patch but he seems to have run out of time before being able to put the
> finishing touches on it.

Unfortunately, I couldn't figure out the rest of Read/Write before my school semester started so I ran out of time. I have spring break coming up and will probably give it a try again.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #51)
> Start there:
> https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Introduction
> https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/
> Build_Istructions


The second link you provide gives an error
From where should I fetch the Firefox source? I am on ubuntu 14.03 so what should I install?
Just wanted to hear whether you approve of the implementation I'm going for, before I clean up what I have and take on Read & Write.
The code passes all of the associated tests in UrlClassifierChunkSet.
Attachment #8740641 - Flags: feedback?(gpascutto)
Comment on attachment 8740641 [details] [diff] [review]
Implement ranges, missing Read and Write

Review of attachment 8740641 [details] [diff] [review]:
-----------------------------------------------------------------

Generally this looks excellent! I have some remarks to make it more readable, some points where I think you can simplify, and there are one or two points where I'm not sure it will handle all cases correctly.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +80,2 @@
>  
> +  for (size_t i = 0; i < mRanges.Length() - 1;) {

I wouldn't use a for loop in this case.

@@ +109,5 @@
> +  ChunkRange lastRange = mRanges.LastElement();
> +  for (size_t i = 0; i < aOther.mRanges.Length(); i++) {
> +    ChunkRange range = aOther.mRanges[i];
> +
> +    if (lastRange.End() < range.Begin()) {

Any reason this is special cased but the opposite case (at the start) isn't?

@@ +113,5 @@
> +    if (lastRange.End() < range.Begin()) {
> +      break;
> +    }
> +
> +    nsTArray<size_t> tmp = IntersectingRanges(range, 0, mRanges.Length());

"tmp" isn't a very descriptive name.

@@ +114,5 @@
> +      break;
> +    }
> +
> +    nsTArray<size_t> tmp = IntersectingRanges(range, 0, mRanges.Length());
> +    for (size_t j = 0; mRanges.Length() != 0 && j < tmp.Length(); j++) {

I'd move "mRanges.Length() != 0" to a separate check at the end of the loop.

@@ +116,5 @@
> +
> +    nsTArray<size_t> tmp = IntersectingRanges(range, 0, mRanges.Length());
> +    for (size_t j = 0; mRanges.Length() != 0 && j < tmp.Length(); j++) {
> +      ChunkSet diff;
> +      if (mRanges[tmp[j]].Remove(range, diff) != NS_OK) {

Pull tmp[j] out into a var and give it a more descriptive name. Same for "diff", it isn't all that obvious what it is doing. It's really "SplitRemainder" or something.

@@ +122,5 @@
> +      }
> +
> +      mRanges.RemoveElementAt(tmp[j]);
> +      for (size_t k = 0; k < diff.mRanges.Length(); k++) {
> +        if (!mRanges.InsertElementSorted(diff.mRanges[k], fallible)) {

Would this not mess up the indexes that the "tmp" array contains and which also point into mRanges? If not, add a comment why this works. It's not obvious to me.

At worst I think you can run this "add split ranges back" pass outside the loop over "tmp"?

@@ +146,5 @@
> +  nsTArray<size_t> tmp;
> +  size_t idx;
> +  if (BinarySearchIf(mRanges, aMin, aMax,
> +                     ChunkRange::IntersectionComparator(aRange), &idx)) {
> +    tmp.AppendElements(IntersectingRanges(aRange, 0, idx));

Are there pathological cases where you'll end up recursing like crazy? Something like:

[1 - 20] [25 - 30] [40 - 50] [60 - 70] ...... [ 10000 - 10010]
  [2                       -                            10010]

Which I could imagine happening if, for example, the first chunk consists of test entries and we want to clear the rest of the database.

If yes then maybe consider rewriting this as a non-recursive function.

@@ +157,5 @@
> +
> +bool
> +ChunkSet::HasSubrange(const ChunkRange& aRange) const {
> +  const ChunkRange* end = mRanges.Elements() + mRanges.Length();
> +  for (const ChunkRange* i = mRanges.Elements(); i != end; i++) {

If I see "i" I generally think of a counter. If we're using short names "it" is more typical for something iterator/pointer like.

@@ +160,5 @@
> +  const ChunkRange* end = mRanges.Elements() + mRanges.Length();
> +  for (const ChunkRange* i = mRanges.Elements(); i != end; i++) {
> +    if ((*i).Begin() <= aRange.Begin() && aRange.End() <= (*i).End()) {
> +      return true;
> +    } else if (aRange.Intersects(*i)) {

Doesn't this cover the above case too?

@@ +176,5 @@
> +
> +nsresult
> +ChunkSet::ChunkRange::Remove(const ChunkRange& aRange, ChunkSet& aSetDiff) const
> +{
> +  if (mBegin < aRange.mBegin && aRange.mEnd < mEnd) {

I think you can cover this case by falling through to the two below and making them "if" instead of "else if".

@@ +201,5 @@
> +  return NS_OK;
> +}
> +
> +inline bool
> +ChunkSet::ChunkRange::FoldL(const ChunkRange& aRange)

FoldLeft?

::: toolkit/components/url-classifier/ChunkSet.h
@@ +96,1 @@
>    FallibleTArray<uint32_t> mChunks;

This can go now.
Attachment #8740641 - Flags: feedback?(gpascutto) → feedback+
(In reply to Gian-Carlo Pascutto [:gcp] from comment #57)
> @@ +176,5 @@
> > +
> > +nsresult
> > +ChunkSet::ChunkRange::Remove(const ChunkRange& aRange, ChunkSet& aSetDiff) const
> > +{
> > +  if (mBegin < aRange.mBegin && aRange.mEnd < mEnd) {
> 
> I think you can cover this case by falling through to the two below and
> making them "if" instead of "else if".

I'm working on a patch to address your points.
I'm not sure I follow your point about falling through to the two if statements below. What I gathered is that you asked me to remove the first case and have the two other conditions checked unconditionally, and that this would also cover the first case. I'm not sure I understood you correctly, but assuming I did: consider ([1-10] - [2-9]). In this example, neither of the "two remaining conditions" is met, and thus the range is not split at all.
Attachment #8740641 - Attachment is obsolete: true
Attachment #8741954 - Flags: feedback?(gpascutto)
(In reply to bd339 from comment #58)
> I'm not sure if I follow your point about falling through to the two if
> statements below. What I gathered, is that you asked me to remove the first
> case and make the two other conditions be checked unconditionally. And thus,
> you are saying that will also cover the first case? I'm not sure if I
> understood you correct, but assuming I did... consider ([1-10] - [2,9]). In
> this example none of the "two remaining conditions" are met and thus the
> range is not split at all.

I reread the code and you're right, so ignore my comment there.
Comment on attachment 8741954 [details] [diff] [review]
Implementation of ranges, 2nd revision (still missing Read & Write)

Review of attachment 8741954 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, I only have tiny suggestions. 

Also, maybe it's worth looking at whether some of the repeated if (a.begin < b.begin() && a.end()...) etc. conditions can be replaced by factored-out Contains(), Precedes() functions.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +84,5 @@
> +  if (oldLen < mRanges.Length()) {
> +    Range* range = mRanges.begin();
> +    while (range != mRanges.end() - 1) {
> +      if (range->FoldLeft(*(range + 1))) {
> +        mRanges.RemoveElementAt(range - mRanges.begin() + 1);

If you look, all three preceding lines effectively operate on "range + 1". So maybe this loop can be simplified.

@@ +158,2 @@
>        return true;
> +    } else if (!(aSubrange.Begin() > range->End() ||

This condition looks superfluous: if we don't hit the "else" we will fall through to return false at the end anyway.
Attachment #8741954 - Flags: feedback?(gpascutto) → feedback+
(In reply to Gian-Carlo Pascutto [:gcp] from comment #61)
> @@ +158,2 @@
> >        return true;
> > +    } else if (!(aSubrange.Begin() > range->End() ||
> 
> This condition looks superfluous: if we don't hit the "else" we will fall
> through to return false at the end anyway.

You are right that the code would be correct without it; however, the check allows breaking out of the loop early, perhaps very early.
It is there because if the "subrange" is not a direct subrange but "merely" intersects the current range, I know it can never be a subrange of any of the following ranges, because the implementation guarantees that there are no overlaps and that every range is as "wide" as possible.
Comments and readability improvements.
Attachment #8741954 - Attachment is obsolete: true
Attachment #8743521 - Flags: feedback?(gpascutto)
Attachment #8743521 - Flags: feedback?(gpascutto) → feedback+
Attached patch Implement Write (obsolete) — Splinter Review
I'm unsure whether a "buffer size" as proposed in comment 36 is preferable, considering that 'chunks' lives on the stack and is infallible.

I assume it is not too "IO-wasteful" to immediately write single-chunk ranges to the output stream, because the point of this bug is that we don't have many single-chunk ranges.
Attachment #8743521 - Attachment is obsolete: true
Attachment #8744615 - Flags: feedback?(gpascutto)
Attached patch Simplify Write (obsolete) — Splinter Review
Comment 64 still applies.
Attachment #8744615 - Attachment is obsolete: true
Attachment #8744615 - Flags: feedback?(gpascutto)
Attachment #8744617 - Flags: feedback?(gpascutto)
Attached patch Implement Read (obsolete) — Splinter Review
Might want to limit how many chunks are read at once?
Attachment #8744619 - Flags: feedback?(gpascutto)
(In reply to bd339 from comment #64)
> I assume it is not too "IO-wasteful" to immediately write "single-chunk"
> ranges to the output stream, because the point of this bug is that we don't
> have many "single-chunk" ranges.

*cough*

https://bugzilla.mozilla.org/show_bug.cgi?id=1150334
https://bugzilla.mozilla.org/show_bug.cgi?id=1211090
(In reply to Gian-Carlo Pascutto [:gcp] from comment #67)
> *cough*
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=1150334
> https://bugzilla.mozilla.org/show_bug.cgi?id=1211090

Oh, well, you can't be right every time :)
Considering this, what are your recommendations?
Attachment #8744617 - Flags: feedback?(gpascutto) → feedback+
Comment on attachment 8744619 [details] [diff] [review]
Implement Read

Review of attachment 8744619 [details] [diff] [review]:
-----------------------------------------------------------------

r- due to non-fallible memory allocation.

The style thing is a minor suggestion, if you don't want to change it it's ok for me too.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +171,5 @@
>  
>  nsresult
>  ChunkSet::Read(nsIInputStream* aIn, uint32_t aNumElements)
>  {
> +  nsTArray<uint32_t> chunks(aNumElements);

Reading it in at once is OK, but it must be a fallible array.

If not, a corrupted file (say the header claims it has 2^31 chunk elements) would crash the browser. And this data is read right after startup, so the situation would be completely unrecoverable and even persist after a reinstall, oops!

(I think you're actually safe here, as the relevant file is protected with an MD5 checksum. But the general principle applies: large arrays must be fallible, and certainly arrays whose contents come from outside the browser.)

@@ +173,5 @@
>  ChunkSet::Read(nsIInputStream* aIn, uint32_t aNumElements)
>  {
> +  nsTArray<uint32_t> chunks(aNumElements);
> +  nsresult readRes = ReadTArray(aIn, &chunks, aNumElements);
> +  NS_ENSURE_SUCCESS(readRes, readRes);

nit: this style of error handling is deprecated: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#Error_handling

Maybe it's better not to introduce it in new/rewritten code.
Attachment #8744619 - Flags: feedback?(gpascutto) → feedback-
(In reply to bd339 from comment #68)
> (In reply to Gian-Carlo Pascutto [:gcp] from comment #67)
> > *cough*
> > 
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1150334
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1211090
> 
> Oh, well, you can't be right every time :)
> Considering this, what are your recommendations?

You can move the buffer outside the loop and just flush it when it has X elements (and once at the end).

That said this isn't all that important - the impact is much less than for those bugs - I just wanted to point out that funnily enough both of your erroneous assumptions were actual problems we had and fixed :-)
Attached patch make Read fallible (obsolete) — Splinter Review
Attachment #8744619 - Attachment is obsolete: true
Attachment #8746526 - Flags: feedback?(gpascutto)
Attached patch make Write buffer its output (obsolete) — Splinter Review
Attachment #8744617 - Attachment is obsolete: true
Attachment #8746548 - Flags: feedback?(gpascutto)
Attached patch update error handling (obsolete) — Splinter Review
Use the error handling style referenced in comment 69.
Comment on attachment 8746526 [details] [diff] [review]
make Read fallible

Review of attachment 8746526 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +173,5 @@
>  ChunkSet::Read(nsIInputStream* aIn, uint32_t aNumElements)
>  {
> +  FallibleTArray<uint32_t> chunks;
> +
> +  if (ReadTArray(aIn, &chunks, aNumElements) != NS_OK) {

I'd use NS_FAILED() and return the actual error instead of masking and replacing it. i.e.

rv = ReadTArray(...)
if (NS_FAILED(rv)) {
  return rv;
}

@@ +180,3 @@
>  
>    for (const uint32_t* chunk = chunks.begin(); chunk != chunks.end(); chunk++) {
> +    if (Set(*chunk) != NS_OK) {

Same remark.
Attachment #8746526 - Flags: feedback?(gpascutto) → feedback-
Attachment #8746548 - Flags: feedback?(gpascutto) → feedback+
Attachment #8746554 - Flags: review+
Attached patch Implement ranges, 1st revision (obsolete) — Splinter Review
All of my patches towards this bug so far, combined into one.
Attachment #8746526 - Attachment is obsolete: true
Attachment #8746548 - Attachment is obsolete: true
Attachment #8746554 - Attachment is obsolete: true
Attachment #8746682 - Flags: review?(gpascutto)
Attached patch Implement ranges, 2nd revision (obsolete) — Splinter Review
No code changes, just submitting in the correct format (with commit message).
Attachment #8746682 - Attachment is obsolete: true
Attachment #8746682 - Flags: review?(gpascutto)
Attachment #8747516 - Flags: review?(gpascutto)
Priority: -- → P5
Comment on attachment 8747516 [details] [diff] [review]
Implement ranges, 2nd revision

Review of attachment 8747516 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Let me test this a bit and if I don't see any issues on my profile I'll land it.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +59,5 @@
>  bool
>  ChunkSet::Has(uint32_t aChunk) const
>  {
> +  Range tmp(aChunk, aChunk);
> +  for (const Range* range = mRanges.begin(); range != mRanges.end(); range++) {

The ranges are sorted, aren't they? This could be a binary search then.
Attachment #8747516 - Flags: review?(gpascutto) → review+
Comment on attachment 8747516 [details] [diff] [review]
Implement ranges, 2nd revision

Review of attachment 8747516 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/ChunkSet.cpp

@@ +189,5 @@
> +nsresult
> +ChunkSet::Read(nsIInputStream* aIn, uint32_t aNumElements)
> +{
> +  FallibleTArray<uint32_t> chunks;
> +  nsresult rv = ReadTArray(aIn, &chunks, aNumElements);

Reading the entire chunk array in at once means our maximum memory usage stays unchanged compared to the previous implementation. It's actually worse because both arrays are now "alive" at the same time.

Given that one point of this bug was to reduce memory usage, I think you'll want to read this in piece by piece, similar to the write function.
Comment on attachment 8747516 [details] [diff] [review]
Implement ranges, 2nd revision

I believe this is buggy in some cases, see attached patch with a diff for the unit tests.
Attachment #8747516 - Flags: review+ → review-
Attachment #8751939 - Flags: feedback?(gpascutto)
Attachment #8752247 - Flags: feedback?(gpascutto)
I believe the bug exposed by the unit tests in attachment 8750856 [details] [diff] [review] was caused by Merge not behaving correctly when merging ranges that contain existing ranges as subranges.
Attachment #8752317 - Flags: feedback?(gpascutto)
Attachment #8751939 - Flags: feedback?(gpascutto) → feedback+
Attachment #8752247 - Flags: feedback?(gpascutto) → feedback+
Comment on attachment 8752317 [details] [diff] [review]
fix subranging in Merge 1st revision

Review of attachment 8752317 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +82,5 @@
>    }
>  
>    if (oldLen < mRanges.Length()) {
>      for (size_t i = 1; i < mRanges.Length(); i++) {
> +      while (mRanges[i - 1].Contains(mRanges[i])) {

Can this be simplified to (mRanges[i - 1].Contains(mRanges[i]) || mRanges[i - 1].FoldLeft(mRanges[i]))?

I think an argument can also be made that FoldLeft can/should handle the Contains case.

In any case this duplication of code in the two loops is not good.
Attachment #8752317 - Flags: feedback?(gpascutto) → feedback-
:gcp was indeed correct that the looping was unnecessary, and the same fix can be applied by letting FoldLeft check for subranging. However, notice that the ordering of the if-statements matters: the Contains check has to come before the condition that checks whether aRange "overlaps to the right". Should this behaviour be documented in a comment?
Attachment #8752317 - Attachment is obsolete: true
Attachment #8753468 - Flags: feedback?(gpascutto)
Attached patch simplify RemoveSplinter Review
The condition that the chunk set has at least one range, checked before the loop and at the end of each iteration, can be replaced by a single check at the start of the loop.
Attached patch make Set use HasSplinter Review
Since Has was updated to use binary search in attachment 8752247 [details] [diff] [review], Set ought to use that.
Also removed the use of a redundant local variable in Has.
Attachment #8753468 - Flags: feedback?(gpascutto) → feedback+
Comment on attachment 8753478 [details] [diff] [review]
simplify Remove

Review of attachment 8753478 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +113,5 @@
>    for (const Range* removalRange = aOther.mRanges.begin();
>         removalRange != aOther.mRanges.end(); removalRange++) {
>  
> +    if (mRanges.Length() == 0) {
> +      return NS_OK;

In what circumstances could this code ever be reached? Given that it's right after the !=end check in the for loop, I can't see any?
Comment on attachment 8753478 [details] [diff] [review]
simplify Remove

Review of attachment 8753478 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +113,5 @@
>    for (const Range* removalRange = aOther.mRanges.begin();
>         removalRange != aOther.mRanges.end(); removalRange++) {
>  
> +    if (mRanges.Length() == 0) {
> +      return NS_OK;

Nevermind, it's checking the other chunkset, oops.
Attachment #8753478 - Flags: review+
Attachment #8753482 - Flags: review+
Can you fold all your patches together and ask for review on the total?
Flags: needinfo?(bd339)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #90)
> Can you fold all your patches together and ask for review on the total?

Certainly. I will do so later today. That patch will also have the comment about needed optimization in ChunkSet.h removed.
Flags: needinfo?(bd339)
Attached patch Implement ranges, 3rd revision (obsolete) — Splinter Review
This does not include the extended unit tests that exposed a bug in my Merge implementation. Should they also become a permanent addition?
Attachment #8747516 - Attachment is obsolete: true
Attachment #8753981 - Flags: review?(gpascutto)
(In reply to bd339 from comment #92)
> Created attachment 8753981 [details] [diff] [review]
> Implement ranges, 3rd revision
> 
> This does not include the extended unit tests which showed a bug in my Merge
> implementation. I think they should also be a permanent addition?

Yes, but I'll do that for you. I don't seem to have a good way to run a real update with and without your changes (because I can't guarantee I get the same data from the server) so I'll probably extend them even more to hit all corner cases I see in your solution.
Removes the ChunkSet constructor and destructor; both were empty, and the compiler will generate the defaults anyway.
Attachment #8753981 - Attachment is obsolete: true
Attachment #8753981 - Flags: review?(gpascutto)
Attachment #8755551 - Flags: review?(gpascutto)
Attachment #8755551 - Flags: review?(gpascutto) → review+
Comment on attachment 8755804 [details]
MozReview Request: Bug 1135022 - Optimize ChunkSet by storing ranges instead of numbers. r=gcp

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/54822/diff/1-2/
Comment on attachment 8755805 [details]
MozReview Request: Bug 1135022 - Extend UrlClassifier ChunkSet test with stress test. r?francois

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/54824/diff/1-2/
Comment on attachment 8755804 [details]
MozReview Request: Bug 1135022 - Optimize ChunkSet by storing ranges instead of numbers. r=gcp

https://reviewboard.mozilla.org/r/54822/#review51478
Attachment #8755804 - Flags: review?(gpascutto) → review+
Comment on attachment 8755805 [details]
MozReview Request: Bug 1135022 - Extend UrlClassifier ChunkSet test with stress test. r?francois

https://reviewboard.mozilla.org/r/54824/#review52014

With this new else clause, the test looks all good to me.

::: toolkit/components/url-classifier/tests/gtest/TestChunkSet.cpp:212
(Diff revision 2)
> +  // Unmerge
> +  chunkSet.Remove(origSet);
> +  for (int it = 0; it < TEST_RANGE; ++it) {
> +    if (origSet.Has(it)) {
> +      ASSERT_FALSE(chunkSet.Has(it));
> +    }

Should there be an else clause here to check that the entries that got added after `chunkSet.Merge(mergeSet);` are still there?

    } else if (mergeSet.Has(it)) {
      ASSERT_TRUE(chunkSet.Has(it));
    }

unless of course they were in both, but the `else` should take care of that.
Attachment #8755805 - Flags: review?(francois) → review+
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/5e6c3b69c77f
https://hg.mozilla.org/mozilla-central/rev/c79735102a2c
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in before you can comment on or make changes to this bug.