Closed Bug 730100 Opened 8 years ago Closed 8 years ago

Add a Bloom filter implementation

Categories

(Core :: MFBT, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: bzbarsky, Assigned: bzbarsky)

References

Details

Attachments

(1 file, 4 obsolete files)

I need this for bug 705877.
Attachment #600172 - Flags: review?(jwalden+bmo)
Attachment #600300 - Flags: review?(jwalden+bmo)
Attachment #600172 - Attachment is obsolete: true
Attachment #600172 - Flags: review?(jwalden+bmo)
Attachment #600304 - Flags: review?(jwalden+bmo)
Attachment #600300 - Attachment is obsolete: true
Attachment #600300 - Flags: review?(jwalden+bmo)
Attachment #600310 - Flags: review?(jwalden+bmo)
Attachment #600304 - Attachment is obsolete: true
Attachment #600304 - Flags: review?(jwalden+bmo)
Comment on attachment 600310 [details] [diff] [review]
Add a Bloom filter implementation.

Review of attachment 600310 [details] [diff] [review]:
-----------------------------------------------------------------

Looks pretty good.  Basically lots of nits, yet as some of them are relevant to reader/user understandability I'd like to look again.  That, plus for at least one or two of them no particularly better ideas occurred to me -- hopefully you can bail me out in those cases.  :-)

::: mfbt/BloomFilter.h
@@ +1,5 @@
> +/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */
> +

Add an overview descriptive comment here, like the ones the existing files have in MXR.  (MXR won't show it yet because it doesn't understand the MPL2 boilerplate, but hopefully that'll be fixed soon.  :-\  That's bug 717196.)

@@ +6,5 @@
> +#ifndef mozilla_BloomFilter_h_
> +#define mozilla_BloomFilter_h_
> +
> +#include "mozilla/Util.h"
> +#include "mozilla/StdInt.h"

Alphabetize these, and add a blank line between mfbt includes and standard includes.

@@ +7,5 @@
> +#define mozilla_BloomFilter_h_
> +
> +#include "mozilla/Util.h"
> +#include "mozilla/StdInt.h"
> +#include "string.h"

<string.h>

@@ +11,5 @@
> +#include "string.h"
> +
> +namespace mozilla {
> +
> +/**

For better or worse, mfbt doesn't use /**-style comments.

This comment is more of an implementation-information comment than an overview comment.  In this position I think we want a comment written for potential users of the class -- what it does (implement a probabilistic set-containment data structure, and link to some sort of definition of a Bloom filter too) and how to use it (specifically, how to determine what KeySize to use).

The mathy bits of this comment are quite interesting but not relevant at their level of specificity to anyone just looking for a tool to use and how to use it.  So move this comment just inside the class definition and out of the traditional user-documentation comment position, pulling out whatever pieces of it make sense for the overview comment.

@@ +13,5 @@
> +namespace mozilla {
> +
> +/**
> + * A counting Bloom filter with 8-bit counters.  For now we assume
> + * that having two hash functions is enough, but we may revisit that

If it were me, I couldn't possibly say "is enough" without adding "for anybody".  :-)

@@ +19,5 @@
> + *
> + * The filter is parametrized on the desired size of the key generated
> + * by each of the hash functions, in bits, and the type of object T
> + * being added and removed.  T must implement a const getHash method
> + * which returns a uint32_t hash.

Infallible methods in Gecko and SpiderMonkey don't have a "get" prefix, so this should just be hash(), which leads to this simplified wording: "T must implement a uint32_t hash() const method."

@@ +21,5 @@
> + * by each of the hash functions, in bits, and the type of object T
> + * being added and removed.  T must implement a const getHash method
> + * which returns a uint32_t hash.
> + *
> + * The filter uses 2^keySize bytes of memory.

Stupid C operators.  Could you use ** for exponentiation throughout to be unambiguous?

Oh, I guess this is LaTeX.  Sigh.  It took me over a day of looking at this patch, off and on, to realize that.  Python experience among readers seems likely to be far more common than LaTeX experience.  I think you still should use **.

@@ +25,5 @@
> + * The filter uses 2^keySize bytes of memory.
> + *
> + * Assuming a well-distributed hash function, a Bloom filter with
> + * array size m (in our case m == 2^keySize) containing n elements and
> + * using k hash function has false positive rate exactly

Expected false positive rate, surely?  It's all probabilities, so to me the current wording makes a stronger claim than it should.

@@ +45,5 @@
> + * where r is the false positive rate.  This can be used to compute
> + * the desired keySize for a given load n and false positive rate r.
> + *
> + * If n/m is assumed small, then the false positive rate can further
> + * be approximated as 4*n^2/m^2.  So each extra bit of keySize reduces

KeySize is a number of bits, but "each extra bit" kind of reads to me more like it being a bitmask or something.  I guess you mean "each extra bit of hash value used", maybe?  I'm having a hard time formulating a phrase that captures exactly the semantics you're ascribing.  Maybe that counsels for naming it HashBits.  Or I don't know, probably that's an awful name.  Help!

@@ +51,5 @@
> + * positive rate of 1% corresponds to about m/n == 20.
> + *
> + * What this means in practice is that for a few hundred keys using a
> + * 10-12-bit keySize gives a reasonably low false positive rate.  A
> + * 12-bit keySize has the additional benefit of using exactly one page

More moderately-confusing claims of keySize being sized in bits, for the reader that missed the previous one.

@@ +54,5 @@
> + * 10-12-bit keySize gives a reasonably low false positive rate.  A
> + * 12-bit keySize has the additional benefit of using exactly one page
> + * for the filter in typical hardware configurations.
> + */
> +template<unsigned keySize, class T>

I tended to confuse keySize and arraySize while reading this, not remembering which was the input parameter and which a derived parameter.  I'd be helped (and I think it's generally standard practice anyway) if keySize were KeySize, the capitalization indicating it's a template parameter.

@@ +56,5 @@
> + * for the filter in typical hardware configurations.
> + */
> +template<unsigned keySize, class T>
> +class BloomFilter {
> +public:

mfbt indents public/private by two spaces inside classes, like SpiderMonkey does.

@@ +61,5 @@
> +    BloomFilter() {
> +        MOZ_STATIC_ASSERT(keySize <= keyShift, "keySize too big");
> +
> +        // Should we have a custom operator new using calloc instead and
> +        // require that we're allocated via the operator?

Is it actually possible to require this?  I could be wrong, but I don't think it's possible.  If it's not, probably you should just remove this comment.

@@ +75,5 @@
> +
> +    /**
> +     * Add an item to the filter.
> +     */
> +    void add(const T *t);

mfbt style puts the * by the type name.  (throughout)

@@ +84,5 @@
> +    void remove(const T *t);
> +
> +    /**
> +     * Check whether the filter might contain an item.  This can have
> +     * false positives, but not false negatives.

I never remember which false positive and false negative mean, myself, so I'd prefer wording which avoided the terms.  Maybe "Returns true if the filter contains the item; returns either true or false (but probably false, assuming a uniform hash function and proper KeySize) if the filter doesn't contain the item."  Feel free to suggest better phrasing, as that's a bit wordy.

@@ +93,5 @@
> +     * Methods for add/remove/contain when we already have a hash computed
> +     */
> +    void add(uint32_t hash);
> +    void remove(uint32_t hash);
> +    bool mayContain(uint32_t hash) const;

I assume you have users which want to be able to access these directly?  My first inclination would be that only the T-parametric methods should be exposed, otherwise.

Another possibility is that we make this require both the const T* and the precomputed hash.  Then these methods can verify that the provided hash matches the item's actual hash, preventing user inconsistency.  Since this is all inline, the unused parameter would boil away in debug builds.

@@ +115,5 @@
> +};
> +
> +template<unsigned keySize, class T>
> +inline void
> +BloomFilter<keySize,T>::clear()

<KeySize, T> (space between the two) (throughout)

@@ +117,5 @@
> +template<unsigned keySize, class T>
> +inline void
> +BloomFilter<keySize,T>::clear()
> +{
> +    memset(counters, 0, arraySize);

Indentation of statements and such should be two spaces.

@@ +126,5 @@
> +BloomFilter<keySize,T>::add(const T* t)
> +{
> +    uint32_t hash = t->getHash();
> +    return add(hash);
> +}

These being templates only instantiated at use it doesn't really matter, but for consistency it seems nice to only use inline-defined methods after their definitions.  So the const T* methods would go after the uint32_t methods.

@@ +133,5 @@
> +inline void
> +BloomFilter<keySize,T>::add(uint32_t hash)
> +{
> +    uint8_t& slot1 = firstSlot(hash);
> +    if (MOZ_LIKELY(! full(slot1)))

No space after !.  (and elsewhere)

::: mfbt/Util.h
@@ +48,5 @@
>  #include "mozilla/Assertions.h"
>  #include "mozilla/Attributes.h"
>  #include "mozilla/Types.h"
>  
> +#if defined(__GNUC__) && (__GNUC__ > 2)

So, clang happens to slip through this because it pretends to be gcc 4.2ish (and I understand maybe even something newer as of recently), but it seems best to be clear about this.  Let's do it this way:

#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ > 2)
...
#else
...
#endif

@@ +50,5 @@
>  #include "mozilla/Types.h"
>  
> +#if defined(__GNUC__) && (__GNUC__ > 2)
> +# define MOZ_LIKELY(x)   (__builtin_expect((x), 1))
> +# define MOZ_UNLIKELY(x) (__builtin_expect((x), 0))

mfbt style for preprocessor directives uses two-space indentation:

#if ...
#  define MOZ_LIKELY(x) ...
#  define MOZ_UNLIKELY(x) ...
#else
#  define MOZ_LIKELY(x) ...
#  define MOZ_UNLIKELY(x) ...
#endif

@@ +54,5 @@
> +# define MOZ_UNLIKELY(x) (__builtin_expect((x), 0))
> +#else
> +# define MOZ_LIKELY(x)   (x)
> +# define MOZ_UNLIKELY(x) (x)
> +#endif

Util.h is a dumping-ground header we'd like to eventually remove (bug 713082), and dumping-ground headers tend to maximize build times versus narrower headers.  Plus it makes it harder to guess where functionality of interest might live, for the reader trying to find that thing he wants to use.  So this should go in a different header, Likely.h, I guess, for lack of anything better.

@@ +60,5 @@
> +/*
> + * Bit operations
> + */
> +#define MOZ_BIT(n) (1 << (n))
> +#define MOZ_BITMASK(n) (MOZ_BIT(n) - 1)

Is there any possibility I could persuade you to just inline this math at call sites?  I find the bit macros horribly unreadable myself, because I can never remember exactly how I should interpret any number passed as an argument to them.  I'd much rather see 1 << 5 or (1 << 5) - 1 or even 0x1F (when feasible, tho that's not the case for BloomFilter), myself.  Any chance I can convince you to use such things instead of adding these macros?

::: xpcom/tests/TestBloomFilter.cpp
@@ +36,5 @@
> +  }
> +
> +  if (filter->mayContain(&multiple)) {
> +    printf("TEST-UNEXPECTED-FAIL: Filter claims to contain 'multiple' when it "
> +	   "should not\n");

TestHarness.h gives you a fail() method which sets up the formatting properly for buildbot/tinderbox to scrape things correctly.  It would probably be a better idea to include that file and use that method than to hand-roll it.  (And to be honest, I'm not 100% sure if it's scraping for "TEST-UNEXPECTED-FAIL" or for "TEST-UNEXPECTED-FAIL |".)

@@ +48,5 @@
> +  }
> +
> +  filter->add(&two);
> +  if (!filter->mayContain(&multiple)) {
> +    printf("TEST-UNEXPECTED_FAIL: Filter should contain 'multiple'");

Add "(false positive)" to this for clarity?

@@ +61,5 @@
> +    return -1;
> +  }
> +
> +  // Test multiple addition/removal
> +  const int FILTER_SIZE = 255;

Make this size_t or unsigned (and the associated looping variables) since it's never used as a signed integer -- more reader clarity.

@@ +67,5 @@
> +    filter->add(&two);
> +  }
> +  if (!filter->mayContain(&multiple)) {
> +    printf("TEST-UNEXPECTED_FAIL: Filter should contain 'multiple' after 'two' "
> +	   "added lots of times");

(false positive)

@@ +85,5 @@
> +    filter->add(&two);
> +  }
> +  if (!filter->mayContain(&multiple)) {
> +    printf("TEST-UNEXPECTED_FAIL: Filter should contain 'multiple' after 'two' "
> +	   "added lots more times");

(false positive)

@@ +93,5 @@
> +    filter->remove(&two);
> +  }
> +  if (!filter->mayContain(&multiple)) {
> +    printf("TEST-UNEXPECTED-FAIL: Filter claims to not contain 'multiple' even "
> +	   "though we should have run out of space in the buckets");

(false positive)

@@ +97,5 @@
> +	   "though we should have run out of space in the buckets");
> +    return -1;
> +  }
> +  if (!filter->mayContain(&two)) {
> +    printf("TEST-UNEXPECTED-FAIL: Filter claims to not contain 'two' even "

(false positive)
Attachment #600310 - Flags: review?(jwalden+bmo)
> For better or worse, mfbt doesn't use /**-style comments.

OK.  So just nix the second '*'?

> Oh, I guess this is LaTeX.

Yes.  Trying to really write down math as ASCII in anything else is an exercise in fail.

I can use ** for exponentiation, though, I guess.

> I'm having a hard time formulating a phrase that captures exactly the semantics

How about "so increasing keySize by one reduces ..." ?

> mfbt indents public/private by two spaces inside classes, like SpiderMonkey does.

Got an emacs mode for that?  ;)

> Is it actually possible to require this?

Yes; you just make your constructor private ane expose a public static creator function which always allocates the memory via your (likewise private) operator new.

> mfbt style puts the * by the type name. 

Oh, ffs.  Fine.  Is any of this actually like _documented_ anywhere?

> I assume you have users which want to be able to access these directly?

Yes.  That saves those users a memory indirection, and furthermore allows them to store a uin32_t instead of an nsIAtom* in long-lived (and quite numerous, so this does matter) heap storage, which saves memory on 64-bit.

I started out without the direct passing of hashes, but it led to significant memory bloat.  Hence I would also be against the requirement to pass both, unless it's debugonly or something.

Note that what I actually don't have right now are any consumers using the T* versions of add/remove/etc...  I just left them for completeness.

> Indentation of statements and such should be two spaces.

I was told that mfbt used a 4-space indent.  And my look at the other files indicates that it does.  What gives?

> No space after !.  (and elsewhere)

I had it that way to start with.  It was utterly unreadable to have the "(!f" all in a row with nothing to break it up.  Suggestions?

> So, clang happens to slip through this because it pretends to be gcc 4.2ish

Note that I copied that code from our existing JS_LIKELY/NS_LIKELY impls.  I guess I can add clang explicitly...

> Is there any possibility I could persuade you to just inline this math at call sites? 

I guess.  I prefer having a semantic meaning to the math instead of just having the operations, because then it's clearer what the goal really is.  But I guess I can change this.

> TestHarness.h gives you a fail() method

Wish someone had told me that!  I'll look into it.

I'll make the other changes.
Attachment #601805 - Flags: review?(jwalden+bmo)
Attachment #600310 - Attachment is obsolete: true
(In reply to Boris Zbarsky (:bz) from comment #6)
> > For better or worse, mfbt doesn't use /**-style comments.
> 
> OK.  So just nix the second '*'?

Yes.

> > I'm having a hard time formulating a phrase that captures exactly the semantics
> 
> How about "so increasing keySize by one reduces ..." ?

Sounds good.

> > Is it actually possible to require this?
> 
> Yes; you just make your constructor private ane expose a public static
> creator function which always allocates the memory via your (likewise
> private) operator new.

Oh, that.  Hmm, let's just go with what we have for now, I guess.  C++'s allocation/initialization story makes me sad sometimes.

> > mfbt style puts the * by the type name. 
> 
> Oh, ffs.  Fine.  Is any of this actually like _documented_ anywhere?

No.  :-\  It's the usual glean-from-context thing.  I need to make everything consistent at some point and write everything down.

> > I assume you have users which want to be able to access these directly?
> 
> Yes.  That saves those users a memory indirection, and furthermore allows
> them to store a uin32_t instead of an nsIAtom* in long-lived (and quite
> numerous, so this does matter) heap storage, which saves memory on 64-bit.

Interesting, wouldn't have thought of that.

> I started out without the direct passing of hashes, but it led to
> significant memory bloat.  Hence I would also be against the requirement to
> pass both, unless it's debugonly or something.

Debug-only passing seems a bit much to muddy up every caller, so yeah, let's not do this.

> Note that what I actually don't have right now are any consumers using the
> T* versions of add/remove/etc...  I just left them for completeness.

Heh.  Makes sense.

> > Indentation of statements and such should be two spaces.
> 
> I was told that mfbt used a 4-space indent.  And my look at the other files
> indicates that it does.  What gives?

Just the normal thing of people being inconsistent.  :-\  I'll fix up whatever you go with later, I guess, since everything's already a bit weird.

> > No space after !.  (and elsewhere)
> 
> I had it that way to start with.  It was utterly unreadable to have the
> "(!f" all in a row with nothing to break it up.  Suggestions?

Hum, didn't seem unreadable to me.

> Note that I copied that code from our existing JS_LIKELY/NS_LIKELY impls.  I
> guess I can add clang explicitly...

Yeah, that's about what I expected.  mfbt generally does split clang out for readability, we should keep doing that.
Comment on attachment 601805 [details] [diff] [review]
Add a Bloom filter implementation.

Review of attachment 601805 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/BloomFilter.h
@@ +13,5 @@
> +#ifndef mozilla_BloomFilter_h_
> +#define mozilla_BloomFilter_h_
> +
> +#include "mozilla/Likely.h"
> +#include "mozilla/StdInt.h"

This may need to change if jlebar beats you to landing, as seems likely.

@@ +63,5 @@
> +     * Assuming a well-distributed hash function, a Bloom filter with
> +     * array size M containing N elements and
> +     * using k hash function has expected false positive rate exactly
> +     *
> +     * $$  (1 - (1 - 1/M)^{kN})^k  $$

Maybe I'm forgetting my LaTeX.  But wasn't the way to enter math mode to just use plain old $?  Seems like this should be $(1 - ...)^k$ most properly.  Same for the other math-mode expressions.

@@ +103,5 @@
> +        // require that we're allocated via the operator?
> +        clear();
> +    }
> +
> +    /**

Single-* comments as just noted.

@@ +137,5 @@
> +    bool mayContain(uint32_t hash) const;
> +
> +private:
> +    static const size_t arraySize = (1 << KeySize);
> +    static const uint32_t keyMask = arraySize - 1;

(1 << KeySize) - 1 makes clearer that it's a mask, I think.

::: xpcom/tests/TestBloomFilter.cpp
@@ +36,5 @@
> +    fail("Filter should contain 'one'");
> +  }
> +
> +  if (filter->mayContain(&multiple)) {
> +    fail("Filter claims to contain 'multiple' when it should not");

I can't remember if the failure condition for compiled tests is non-zero exit code or printing a failure message.  I think in case of failure it's probably the former, and the latter is icing on the cake -- but I'm really not sure what happens if the two actively disagree (0 but printed failures).  So having everything fall through here to the |return 0| probably isn't adequate, and you should probably readd the non-zero returns to all these fails.
Attachment #601805 - Flags: review?(jwalden+bmo) → review+
> But wasn't the way to enter math mode to just use plain old $?

That's for inline math mode.  $$ for display math mode.  I can do single $ here, since this is pseudo-TeX at best anyway, sure.

> (1 << KeySize) - 1 makes clearer that it's a mask, I think.

Will do.

> I can't remember if the failure condition for compiled tests is non-zero exit code or
> printing a failure message.

Non-zero return immediately stops running all other tests, fwiw.  But I'll add that.  Sigh....
(In reply to Boris Zbarsky (:bz) from comment #10)
> Non-zero return immediately stops running all other tests, fwiw.  But I'll
> add that.  Sigh....

Maybe you could set it up with independent tests, viz:

int rv = 0;

if (!TestOne())
  rv = 1;
if (!TestTwo())
  rv = 1;
...

return rv;
No, I meant it stops running all other tests in xpcom/tests.  As in, the test harness is killed off.
If you consider that an issue, it's one fundamental to the compiled-code test setup, not really to this particular test.  That could be changed, certainly, but it seems a bit beyond this bug.
http://hg.mozilla.org/integration/mozilla-inbound/rev/857f74eb3947
Flags: in-testsuite+
Whiteboard: [need review]
Target Milestone: --- → mozilla13
https://hg.mozilla.org/mozilla-central/rev/857f74eb3947
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 601805 [details] [diff] [review]
Add a Bloom filter implementation.

>--- /dev/null
>+++ b/mfbt/Likely.h
>@@ -0,0 +1,17 @@
>+/* This Source Code Form is subject to the terms of the Mozilla Public
>+ * License, v. 2.0. If a copy of the MPL was not distributed with this file,
>+ * You can obtain one at http://mozilla.org/MPL/2.0/. */
>+

No header guards?
Oops.  That's what I get for adding new files!
Depends on: 792180
You need to log in before you can comment on or make changes to this bug.