Closed Bug 673470 Opened 13 years ago Closed 12 years ago

Replace the sqlite safeb store with a flat file

Categories

(Toolkit :: Safe Browsing, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 17

People

(Reporter: dcamp, Assigned: gcp)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(25 files, 46 obsolete files)

1.55 KB, text/plain
Details
4.11 KB, application/x-javascript
Details
335.03 KB, patch
gcp
: review+
Details | Diff | Splinter Review
8.31 KB, patch
gcp
: review+
Details | Diff | Splinter Review
3.06 KB, patch
gcp
: review+
Details | Diff | Splinter Review
2.74 KB, patch
gcp
: review+
Details | Diff | Splinter Review
5.73 KB, patch
gcp
: review+
Details | Diff | Splinter Review
1.32 KB, patch
gcp
: review+
Details | Diff | Splinter Review
4.27 KB, patch
gcp
: review+
Details | Diff | Splinter Review
9.31 KB, patch
gcp
: review+
Details | Diff | Splinter Review
1.47 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
1.30 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
1.49 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
1.52 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
25.73 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
3.77 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
7.62 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
2.36 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
5.79 KB, patch
dcamp
: review+
justin.lebar+bug
: feedback+
Details | Diff | Splinter Review
2.86 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
1.17 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
1.47 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
18.27 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
2.98 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
19.61 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
Attached file barely-limping WIP (obsolete) —
I spent a couple of days gutting nsUrlClassifierDBService.cpp.  The attached patch starts to replace the SQLite database with store files similar to Chrome's Safe Browsing implementation.

Very similar - HashStore.cpp in particular is mostly a port of Chrome code, and retains their copyright header as a result.  Other parts are heavily influenced.  licensing@ would need to take a look at this before it lands.

ProtocolParser.[h,cpp] is used to build a series of TableUpdate objects that are applied by the HashStore.  LookupCache maintains a (currently stupid) lookup cache.

Some behavior improvements:
* Updates are a lot quicker with this code than with the SQLite database - a complete load of the db takes about 30 seconds with this code (mostly network-bound; I haven't done tests against local copies of the protocol data yet) vs. 1m30s with the old code (mostly SQLite-bound).  Haven't done real measurements yet, just wall-clock stuff.
* Updates take a lot less memory.  Again, no solid measurements yet, but glancing at an xpcshell instance running an entire empty-to-full update, this shaved 10-ish MB off the process.  This is with the very stupid lookup cache (see below), and there are probably other quick wins too.
* The backing storage is completely dumped out of memory after an update, leaving only the lookup cache implementation (similar to the work in bug 669410).
* Backing store is a lot smaller, and is not subject to the whims of sqlite fragmentation.
* Ignores host keys entirely (in the store and in the lookup cache).  I suspect the wins in simplicity and storage here outweigh the benefits of the hostkey cache, but we can revisit.

Things that aren't finished:
* Gethash result caching isn't finished yet.
* My fast-lookup cache is dumb; it should use the prefix tree implementation from bug 669410.
* Haven't run tests lately.
* There are quite a few XXX comments that will need to be dealt with.  In particular, ActiveTables() is completely stupid.
* Probably a lot of things are stupid; this was all done hastily.
* Haven't ported checksumming of storage files when they are read; that's important.
* All lookups are sent to the worker thread.  It should be pretty easy to send NO lookups to the worker thread, with proper locking around the LookupCache.
* Finally, the documentation suuucks.

Generally, things are a bit of a mess.  I hope to fix a few of these in future updates (particularly the documentation), but I have some other responsibilities and impending paternity leave.  I wanted to dump this here in case someone wants to pick it up (or pull individual pieces out of it for other work).
Attached file Chromium license
License applicable to the parts from Chrome.
I've been using this script to exercise the update code.  Set TESTPROF=[path to a directory] and run under xpcshell to start an update.  It ignores backoff and just keeps asking Google for the next update until it gets an empty update.
The script doesn't clear the profile or db on each run; you'll need to manually delete the store files to start from scratch.
Assignee: nobody → gpascutto
Attachment #547744 - Attachment is obsolete: true
Attached patch Patch v2. Exorcise SQLite (obsolete) — Splinter Review
Alright, here is my work in progress. Memory usage is down to 1.9M, and disk usage down to 5.5M for SafeBrowsing. There are a zillion issues still to deal with, but the code works and should make it usable on Mobile. These patches need to be applied on top of the v3 patches in bug 669410.

Issues fixed compared to the first WIP:
- Fixed URL parsing so updates using a MAC work.
- Fixed the LookupCache code so it doesn't give a false positive and hence hashcomplete to Google for every single lookup :-)
- Changed the order of processing an update so expires (adddel/subdel) are handled first. This makes the canonical testcase of "ad:1\na:1:32:xxx" not delete itself.
- Rewrote the LookupCache to only store completions and delegate to PrefixSet for everything else.
- Changed the HashStore to deinterleave (yuck!) the prefixes/chunk-ids, and zip the chunk-ids before storing to disk. They have a lot of redundancy, so this is a massive saving (see the sketch after this list). The prefixes should be encoded as a PrefixSet, but the code for that needs some extra helpers to be usable here.
- Implemented caching of successful completions, by executing an update that simulates an add. This required some tweaking to the update-handling code to not skip already known chunks, and avoid duplicating hashes within a chunk. An update rebuilds the entire LookupCache, so the old GetHash cache is gone now.
- Slightly better handling of corrupted stores.
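
To make the deinterleaving idea above concrete, here is a minimal sketch (the AddPrefix struct and helper names are hypothetical, not the actual HashStore code) that splits the interleaved entries into two parallel arrays and deflates the chunk-id array with zlib before it goes to disk:

#include <cstdint>
#include <vector>
#include <zlib.h>

struct AddPrefix {          // stand-in for the real entry type
  uint32_t prefix;
  uint32_t addChunk;
};

// Split interleaved entries into two parallel arrays so each can be stored
// (and compressed) in the representation that suits it best.
void Deinterleave(const std::vector<AddPrefix>& entries,
                  std::vector<uint32_t>* prefixes,
                  std::vector<uint32_t>* chunks) {
  for (const AddPrefix& e : entries) {
    prefixes->push_back(e.prefix);
    chunks->push_back(e.addChunk);
  }
}

// Chunk ids are highly repetitive, so plain zlib compression shrinks them a lot.
// (Return-value checking omitted for brevity.)
std::vector<unsigned char> CompressChunks(const std::vector<uint32_t>& chunks) {
  uLongf destLen = compressBound(chunks.size() * sizeof(uint32_t));
  std::vector<unsigned char> out(destLen);
  compress(out.data(), &destLen,
           reinterpret_cast<const Bytef*>(chunks.data()),
           chunks.size() * sizeof(uint32_t));
  out.resize(destLen);
  return out;
}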

Known issues remaining:
- Hundreds of XXX in the code
- All of what Dave said minus what is fixed in the above paragraph
- HashStore should save prefixes in prefix order (instead of chunk) when going to disk, and use PrefixSet for compressing them. Chunks can stay with zlib but will likely lose some compression from not being in chunk order. Should still be better than the other way around.
- mTableUpdates is duplicated in DBService(Worker) and ProtocolParser, I think.
- TableFreshness seems disabled. We probably need to audit the entire code wrt the protocol spec.
- Hashcomplete misses aren't cached. This is important without hostkeys, see the explanation in bug 669407.
- Hashcompletion noise entries are disabled.
- We need to get all the testcases to work again, minus the wrong ones.
- Did I mention there's still lots of XXX in the code?

The very good news is that this splits up a 4000+ line C++ file, with 10 or so different classes handling everything, into smaller, much more manageable pieces. But that's not my responsibility - blame :dcamp instead!
Blocks: 669407
Please put the full Chrome license header at the top of the file. Other than that, this is fine. You'll need a small patch to about:license adding the name of the file (whatever it ends up as) to the list of files covered by the Chrome license copy which is already present there.

Gerv
>Please put the full Chrome license header at the top of the file. Other than 
>that, this is fine.

What is the "full license header"? Is it this:

// Copyright (c) 2011 The Chromium Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

Should there be a regular Mozilla copyright header, too? (I expect to heavily edit, expand, and perhaps mostly rewrite this file)
I traced an issue in the new code that was causing the database sizes to differ from the old code. It turns out that all of the code keeps prefixes sorted by add-chunk, so routines like KnockoutSubs and removing matching add prefixes by subs work correctly.

Unfortunately, when taking an update containing subs, adding these new subs to the list of subs has to go by sub-chunk, and the code assumes the entries are sorted. But they're actually sorted by add-chunk, not sub-chunk. This was causing sub updates to get lost, leaving our database too large and still containing entries that should have been removed.

The attached patch fixes this by switching to an alternate sort order before processing sub chunks, and re-sorting back to add-chunk order afterwards.

Later on, we probably want to keep all lists in prefix order and modify the code to work from that assumption. This would remove the need to sort and re-sort continuously, and would integrate better with PrefixSet, which needs prefix order.
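
For illustration, the fix amounts to something like the following. The SubPrefix struct and the comparators are stand-ins for the real entry types, so treat this as a sketch of the idea rather than the patch itself:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct SubPrefix {          // stand-in for the real entry type
  uint32_t addChunk;
  uint32_t subChunk;
  uint32_t prefix;
};

static bool ByAddChunk(const SubPrefix& a, const SubPrefix& b) {
  return a.addChunk < b.addChunk;
}
static bool BySubChunk(const SubPrefix& a, const SubPrefix& b) {
  return a.subChunk < b.subChunk;
}

// Entries normally live in add-chunk order so knocking out matching add
// prefixes works, but merging newly received subs assumes sub-chunk order.
void ApplySubUpdate(std::vector<SubPrefix>& existing,
                    std::vector<SubPrefix>& incoming) {
  // Temporarily switch to sub-chunk order so the sorted-merge assumption holds.
  std::sort(existing.begin(), existing.end(), BySubChunk);
  std::sort(incoming.begin(), incoming.end(), BySubChunk);
  std::vector<SubPrefix> merged;
  std::merge(existing.begin(), existing.end(),
             incoming.begin(), incoming.end(),
             std::back_inserter(merged), BySubChunk);
  // Restore add-chunk order for the rest of the update processing.
  std::sort(merged.begin(), merged.end(), ByAddChunk);
  existing.swap(merged);
}
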
Attached patch Patch 7 Add Copyright headers (obsolete) — Splinter Review
Add copyright headers to the new source files.
Attachment #552424 - Attachment is obsolete: true
Attachment #552425 - Attachment is obsolete: true
Attachment #552426 - Attachment is obsolete: true
Attachment #552427 - Attachment is obsolete: true
Attachment #552428 - Attachment is obsolete: true
Attachment #553462 - Attachment is obsolete: true
Attachment #553463 - Attachment is obsolete: true
Attachment #561512 - Flags: review?(dcamp)
Attachment #561513 - Flags: review?(dcamp)
Attachment #561515 - Flags: review?(dcamp)
Attachment #561514 - Flags: feedback?(tony)
Attachment #561514 - Flags: feedback?(dcamp)
Attachment #561513 - Flags: review?(dcamp)
Attachment #561513 - Flags: feedback?(tony)
Attachment #561513 - Flags: feedback?(dcamp)
Attachment #561512 - Flags: review?(dcamp)
Attachment #561512 - Flags: feedback?(tony)
Attachment #561512 - Flags: feedback?(dcamp)
Attachment #561515 - Flags: review?(dcamp)
Attachment #561515 - Flags: feedback?(tony)
Attachment #561515 - Flags: feedback?(dcamp)
- Fixed an issue in the original protocol parser code that caused SubChunks to have swapped prefixes vs chunk numbers.
- Reworked the sort order to be prefix based and updated the code. This avoids resorting for subs on each update.
- Removed the Add prefixes from the hash store and have the Classifier fetch & forward them from the PrefixSet.

Total size of the (current) store is now 5.3M. 

I'm pondering whether the PrefixSet interaction should go directly into HashStore, so that Classifier doesn't have to know about it and forward the data as it does now. LookupCache would then only serve as a completion cache. The HashStore would need to be ordered so that the PrefixSet goes to disk first and we only load that on startup. Due to the latter, it might not actually be that much cleaner than what is here now. But it would allow compressing SubPrefixes for some small extra savings.

There are still many XXXs and remarks made above to address, but the functionality is now pretty much complete (save for caching Completion misses), so I'm throwing it out there for a first feedback round.
OS: Mac OS X → All
Hardware: x86 → All
Version: unspecified → Trunk
Whiteboard: [MemShrink]
Attachment #561512 - Attachment is obsolete: true
Attachment #561513 - Attachment is obsolete: true
Attachment #561514 - Attachment is obsolete: true
Attachment #561515 - Attachment is obsolete: true
Attachment #561512 - Flags: feedback?(tony)
Attachment #561512 - Flags: feedback?(dcamp)
Attachment #561513 - Flags: feedback?(tony)
Attachment #561513 - Flags: feedback?(dcamp)
Attachment #561514 - Flags: feedback?(tony)
Attachment #561514 - Flags: feedback?(dcamp)
Attachment #561515 - Flags: feedback?(tony)
Attachment #561515 - Flags: feedback?(dcamp)
Attachment #564583 - Flags: review?
Attachment #564583 - Flags: feedback?(tony)
Attachment #564584 - Flags: review?(dcamp)
Attachment #564584 - Flags: feedback?(tony)
Attachment #564581 - Flags: review?(dcamp)
Attachment #564581 - Flags: feedback?(tony)
Attachment #564583 - Flags: review? → review?(dcamp)
Whiteboard: [MemShrink] → [MemShrink:P2]
I forgot that I had "temporarily" commented out some tests. This uncomments them, fixes the bugs, and adds back telemetry.
Attachment #564584 - Attachment is obsolete: true
Attachment #564584 - Flags: review?(dcamp)
Attachment #564584 - Flags: feedback?(tony)
Attachment #564765 - Flags: review?(dcamp)
Attachment #564765 - Flags: feedback?(tony)
What remains to be done:

- Noise. That is a Mozilla extension, so I'd rather have this reviewed first and add it separately afterwards.
- There are a number of XXXs that I believe aren't critical, such as hashing the store to check for corruption. The SQLite solution didn't do this either, and we do checksum part of the data through zlib CRCs.
- I left some XXX's from Dave where I'm not sure what he was referring to, or where I wasn't sure of the best way to clean things up.
- A number of tests are commented out because I believe they're wrong.
- We don't use host caches or fragment caches, and all lookups are in the background thread. We should profile to see which, if any, are worthwhile.
Fix build issues in optimized mode and on win32. Fix mochitest-5 failures.

https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=ad7f27d16993
Attachment #564765 - Attachment is obsolete: true
Attachment #564765 - Flags: review?(dcamp)
Attachment #564765 - Flags: feedback?(tony)
Comment on attachment 564966 [details] [diff] [review]
Patch 3. Use PrefixSet code in new store

This should be the last update for a while, until review comments come in.
Attachment #564966 - Flags: review?(dcamp)
Attachment #564966 - Flags: feedback?(tony)
Blocks: 383031
Blocks: 441481
Is the old urlclassifier3.sqlite supposed to stick around after an upgrade?

I'm running 2011-10-16's nightly and have both that and urlclassifier.pset.
None of the patches have been reviewed yet, so they are not in a Nightly.

Even if it did, I'm not clear if we should clean up the old database (yet). See for example bug 499362.
Oh. Where did %localappdata%\Mozilla\Firefox\Profiles\xxxxxxx.default\urlclassifier.pset come from then?
It's from bug 669410.
Reviewers, what's the ETA on getting to this patch? If it's going to be a while, is there anyone you can suggest for a feedback-r pass first?
Comment on attachment 564581 [details] [diff] [review]
Patch 1. Add the Safebrowsing store sources

Review of attachment 564581 [details] [diff] [review]:
-----------------------------------------------------------------

This is a lot of code; I wasn't able to do a very thorough review of it.  At a high level, it seems OK.

For new on-disk file formats, it would be good to document what the format is supposed to be.  This makes it easier to verify whether a file is correct vs. whether there's a bug in the code.  You don't want to be like the Mork file format :)  I'm not sure if the best place for this type of documentation is a web page or a comment.

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +40,5 @@
> +
> +nsresult
> +ChunkSet::Unset(uint32 chunk)
> +{
> +  mChunks.RemoveElementSorted(chunk);

If there are duplicates, will they all be removed?  Maybe Set() and Merge() should check for duplicates?

::: toolkit/components/url-classifier/ChunkSet.h
@@ +9,5 @@
> +namespace mozilla {
> +namespace safebrowsing {
> +
> +/**
> + * This is an awful implementation.

Nit: Can you remove this comment or be specific about why it's awful (maybe with suggestions on how to improve it)?

::: toolkit/components/url-classifier/Classifier.cpp
@@ +21,5 @@
> +}
> +
> +Classifier::~Classifier()
> +{
> +}

Nit: Should this call DropStores()?

::: toolkit/components/url-classifier/Classifier.h
@@ +28,5 @@
> +  /**
> +   * Get the list of active tables and their chunks in a format
> +   * suitable for an update request.
> +   */
> +  void TableRequest(nsACString &result);

Nit: In some files you use pointers for out params and in others you pass by ref.  Would be nice to be consistent.

::: toolkit/components/url-classifier/Entries.h
@@ +115,5 @@
> +  uint32 addChunk;
> +  union {
> +    Prefix prefix;
> +    Completion complete;
> +  } hash;

What does this get initialized to?  Where are Prefix and Completion defined?

@@ +117,5 @@
> +    Prefix prefix;
> +    Completion complete;
> +  } hash;
> +
> +  AddComplete() : addChunk(0) {};

Nit: ; is not needed

::: toolkit/components/url-classifier/HashStore.cpp
@@ +394,5 @@
> +  rv = storeFile->AppendNative(mTableName + NS_LITERAL_CSTRING(".sbstore"));
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsCOMPtr<nsIOutputStream> out;
> +  rv = NS_NewSafeLocalFileOutputStream(getter_AddRefs(out), storeFile,

Is this safe against power outage during writes?  Would it be safer to write to a temp file and then rename the temp file over the old file?

::: toolkit/components/url-classifier/LookupCache.h
@@ +33,5 @@
> +
> +  // True if we have a complete match for this hash in the table.
> +  bool mComplete;
> +
> +  // True if this is a noise entry. XXX: explain what that is

Nit: Can you do this XXX now?

@@ +46,5 @@
> +};
> +
> +typedef nsTArray<LookupResult> LookupResultArray;
> +
> +// XXX: Inconsistent

Can you elaborate on this more?

::: toolkit/components/url-classifier/ProtocolParser.h
@@ +6,5 @@
> +
> +namespace mozilla {
> +namespace safebrowsing {
> +
> +struct ForwardedUpdate {

Nit: Can we declare this in the class?
Attachment #564581 - Flags: feedback?(tony) → feedback+
> I'm not sure if the best place for this type of documentation is on a
> web page or in a comment.

If it's in a web page, there should be a comment giving the link to the web page.
(In reply to Tony Chang (Google) from comment #31)

Thanks for your time on this Tony!

> For new on disk file formats, it would be good to document what the format
> is supposed to be.  This makes it easier to verify a file correct vs whether
> there's a bug in the code.  You don't want to be like the mork file format
> :)  I'm not sure if the best place for this type of documentation is on a
> web page or in a comment.

My preference would be for it to be documented in a comment, as close to the relevant code as is possible. Web pages for things like this tend to age poorly.
dcamp, tony:  you have review/feedback requests here that are over 5 weeks old.  Could you do them this week, please?
I started looking at this, but it would be a lot easier as one big patch (I'm realizing a lot of problems in the first patch are solved in the second and third patches).  The second patch isn't applying cleanly at all for me.  Gian-Carlo, could I please get one big megapatch that applies against m-c?
Attachment #574884 - Flags: review?(dcamp)
(In reply to Tony Chang (Google) from comment #31)

> > +  nsCOMPtr<nsIOutputStream> out;
> > +  rv = NS_NewSafeLocalFileOutputStream(getter_AddRefs(out), storeFile,
> 
> Is this safe against power outage during writes?  Would it be safer to write
> to a temp file and then rename the temp file over the old file?

SafeLocalFileOutputStream takes care of that; it renames the temp file over the original after a successful close.
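
For reference, the usage pattern looks roughly like this. A hedged sketch: error handling is trimmed, and the function name, buffer parameters, and open flags are assumptions rather than code copied from the patch.

#include "nsCOMPtr.h"
#include "nsIFile.h"
#include "nsISafeOutputStream.h"
#include "nsNetUtil.h"          // NS_NewSafeLocalFileOutputStream
#include "prio.h"               // PR_WRONLY etc.

// Sketch: write a buffer "atomically" via the safe output stream. The data
// only replaces storeFile after Finish() succeeds; otherwise the temp file
// is discarded and the old contents survive.
static nsresult WriteStoreAtomically(nsIFile* storeFile,
                                     const char* data, uint32_t length)
{
  nsCOMPtr<nsIOutputStream> out;
  nsresult rv = NS_NewSafeLocalFileOutputStream(getter_AddRefs(out), storeFile,
                                                PR_WRONLY | PR_TRUNCATE | PR_CREATE_FILE);
  NS_ENSURE_SUCCESS(rv, rv);

  uint32_t written;
  rv = out->Write(data, length, &written);
  NS_ENSURE_SUCCESS(rv, rv);

  // Without the QI + Finish(), the temporary file is never renamed over
  // the original.
  nsCOMPtr<nsISafeOutputStream> safe = do_QueryInterface(out);
  return safe ? safe->Finish() : NS_ERROR_UNEXPECTED;
}
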
Comment on attachment 574884 [details] [diff] [review]
Patch 4. Folded version of above patches, rebased

Review of attachment 574884 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good - Tony's comments need to be addressed in addition to the ones I'm adding below:

This patch hasn't reimplemented gethash noise; I think we decided in the security review that we needed to keep that?

::: toolkit/components/url-classifier/ChunkSet.cpp
@@ +72,5 @@
> +
> +nsresult
> +ChunkSet::Set(uint32 chunk)
> +{
> +  mChunks.InsertElementSorted(chunk);

I think this needs to check for duplicates.
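
Something along these lines would do it; a sketch against a plain sorted vector rather than the real nsTArray-based ChunkSet:

#include <algorithm>
#include <cstdint>
#include <vector>

// Insert a chunk number into a sorted list, skipping duplicates.
void SetChunk(std::vector<uint32_t>& chunks, uint32_t chunk) {
  auto it = std::lower_bound(chunks.begin(), chunks.end(), chunk);
  if (it == chunks.end() || *it != chunk) {
    chunks.insert(it, chunk);   // keeps the list sorted and duplicate-free
  }
}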

::: toolkit/components/url-classifier/ChunkSet.h
@@ +49,5 @@
> +namespace safebrowsing {
> +
> +/**
> + * This is an awful implementation.
> + */

To elaborate a bit here, this implementation just stores chunks as an array of integers, which is wasteful if you have a lot of contiguous chunks.  Should file a followup to improve that.
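
One possible followup shape, purely illustrative: store contiguous runs as inclusive [first, last] ranges instead of individual integers.

#include <cstdint>
#include <vector>

// Chunks 1..5000 take one entry instead of 5000 uint32 slots.
struct ChunkRange {
  uint32_t first;
  uint32_t last;
};

bool HasChunk(const std::vector<ChunkRange>& ranges, uint32_t chunk) {
  for (const ChunkRange& r : ranges) {
    if (chunk >= r.first && chunk <= r.last) {
      return true;
    }
  }
  // A real implementation would keep the ranges sorted and binary-search them.
  return false;
}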

::: toolkit/components/url-classifier/Classifier.cpp
@@ +70,5 @@
> +}
> +
> +nsresult
> +Classifier::InitKey()
> +{

Can you document what this key is for?

@@ +438,5 @@
> +
> +nsresult
> +Classifier::ApplyTableUpdates(nsTArray<TableUpdate*> &updates,
> +                              const nsACString &table)
> +{

Should document that this consumes/deletes the updates that it applies.

@@ +531,5 @@
> +
> +  // At this point the store is updated and written out to disk, but
> +  // the data is still in memory.  Build our quick-lookup table here.
> +  rv = prefixSet->Build(store->AddPrefixes(), store->AddCompletes());
> +  // XXXnewstore: Deal with failure correctly here.

Does this still need addressing?  If Build() fails what happens (and what should happen?)

@@ +534,5 @@
> +  rv = prefixSet->Build(store->AddPrefixes(), store->AddCompletes());
> +  // XXXnewstore: Deal with failure correctly here.
> +  NS_ENSURE_SUCCESS(rv, rv);
> +#if defined(DEBUG) && defined(PR_LOGGING)
> +  prefixSet->Dump();

Probably want to wrap this with if (LOG_ENABLED())

::: toolkit/components/url-classifier/HashStore.cpp
@@ +1,5 @@
> +//* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> +// Based on chrome stuff:
> +// Copyright (c) 2010 The Chromium Authors. All rights reserved.
> +// Use of this source code is governed by a BSD-style license that can be
> +// found in the LICENSE file.

I think Comment 11 is still waiting on an answer?

@@ +27,5 @@
> +
> +const uint32 STORE_MAGIC = 0x1231af3b;
> +const uint32 CURRENT_VERSION = 1;
> +
> +Timer::Timer(const char *name) : mName(name) {

Can probably dump this little wall clock test I was doing, particularly if we have good telemetry probes.

@@ +275,5 @@
> +  }
> +
> +  nsCOMPtr<nsISeekableStream> seekable = do_QueryInterface(mInputStream);
> +
> +  // XXX: md5sum

I think we need to do some sort of integrity check here before landing.

::: toolkit/components/url-classifier/LookupCache.cpp
@@ +138,5 @@
> +void
> +LookupCache::Dump()
> +{
> +  if (!LOG_ENABLED())
> +    return;

Oh, I see Dump() takes care of the LOG_ENABLED(), ignore that other comment.

@@ +199,5 @@
> +                                       -1, -1, 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  UpdateHeader();
> +  // XXX: md5sum

Same here, integrity checking is important.

@@ +266,5 @@
> +                                  &buffer,
> +                                  sizeof(Header));
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  // XXX: Sanity check.

Should do this (at least a magic/version check).

::: toolkit/components/url-classifier/LookupCache.h
@@ +54,5 @@
> +
> +#define MAX_HOST_COMPONENTS 5
> +#define MAX_PATH_COMPONENTS 4
> +
> +// XXX: Pretty sure we can clean this up quite a bit.

This comment should either be removed if it's no longer true, or a followup should be filed.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +454,5 @@
>  void
>  nsUrlClassifierDBServiceWorker::ResetStream()
>  {
> +  LOG(("ResetStream"));
> +  //XXXnewstore: Audit this

This comment should be removed.

@@ +693,5 @@
>  nsUrlClassifierDBServiceWorker::CancelUpdate()
>  {
> +  LOG(("nsUrlClassifierDBServiceWorker::CancelUpdate"));
> +
> +  // XXXnewstore: fix this.

It looks like CancelUpdate() works now, this XXX can be removed.

@@ +703,4 @@
>      mUpdateObserver->UpdateError(mUpdateStatus);
> +    // XXX: See no point in removing TableFreshness here
> +    // it's not as if its info becomes invalid?
> +    // This does break tests which check this specifically.

iirc we were asked to do this.  If they block a valid site and need to unblock it and updates are failing, they'll be blocking legitimate sites longer (45 minutes rather than the usual 30 minute update latency).

@@ +996,5 @@
> +    // thread.
> +    mDBService->CacheCompletions(mCacheResults.forget());
> +  } else if (mResults->Length() > 0) {
> +    // XXX: this doesn't work with our noise - noise might generate
> +    // hits even if the main hash doesn't.

This will need to be dealt with (once noise is added back).

::: toolkit/components/url-classifier/tests/unit/test_addsub.js
@@ +481,5 @@
>      testSubPartiallyMatches2,
>      testSubsDifferentChunks,
>      testSubsDifferentChunksSameHostId,
>      testExpireLists,
> +    // Chrome's implementation doesn't get these right, we won't either.

Should probably just remove these tests rather than commenting them out.

::: toolkit/components/url-classifier/tests/unit/test_partial.js
@@ +818,5 @@
>        testMixedSizesDifferentDomains,
>        testInvalidHashSize,
> +      // XXX: Changed how caching works with unexpected tables
> +      // testWrongTable,
> +      // XXX: No idea why the current behavior would be wrong

Does this need further investigation?
Attachment #574884 - Flags: review?(dcamp)
Attachment #574884 - Flags: review-
Attachment #574884 - Flags: feedback+
(In reply to Gian-Carlo Pascutto (:gcp) from comment #11)
> >Please put the full Chrome license header at the top of the file. Other than 
> >that, this is fine.
> 
> What is the "full license header"? Is it this:
> 
> // Copyright (c) 2011 The Chromium Authors. All rights reserved.
> // Use of this source code is governed by a BSD-style license that can be
> // found in the LICENSE file.

The full license is what is found in LICENSE, which is here:
http://git.chromium.org/gitweb/?p=chromium/chromium.git;a=blob;f=LICENSE;h=a6e0dcfbc54451334e1ad2ee859aa2bffaa08692;hb=HEAD
Attachment #564581 - Flags: review?(dcamp)
Attachment #564583 - Flags: review?(dcamp)
Attachment #564966 - Flags: review?(dcamp)
>Does this need further investigation?

testWrongChunk does the following test: it adds a URL to the database, then lets the classifier fire a completion for that URL, and lets the completion server return a hit but with a different chunk number than the database knows. The old test checks that this result is not cached. I see no reason for this behavior in the spec (on the contrary...), but it was likely a limitation in the old database that probably had to "fill in" the completion in the same database row as the original prefix, or something like that. As far as I can tell we should behave correctly in that we'll report owning the chunk back to the server, and it can be expired or subbed by the server as needed. So I think this test is probably wrong.

testWrongTable does something similar, but instead of the wrong chunk it lets the completer reply with a different table. I kept this one, fixed a small bug in my code (a wrong table shouldn't "confirm" a prefix, i.e. if a user has some lists disabled, a completion hit in a non-checked list when probing a checked one shouldn't block the page) but changed the test somewhat too (the completer result will now always be cached and this changes the behavior of the second probe). 

Anyway, I *think* we're fine here.
Yeah, OK, agreed.  We should just remove those tests then.
Attachment #574884 - Attachment is obsolete: true
Attachment #578004 - Flags: feedback?(tony)
Attachment #578005 - Flags: review?(dcamp)
Incorporated review comments in the 4th patch. Some comments:

>For new on disk file formats, it would be good to document what the format is 
>supposed to be.  This makes it easier to verify a file correct vs whether there's 
>a bug in the code.  You don't want to be like the mork file format :)  I'm not 
>sure if the best place for this type of documentation is on a web page or in a 
>comment.

I added comments about the expected file format inside LookupCache and HashStore. I didn't document the PrefixSet, as it's in practice defined by the actual data structure, so this made less sense.

>To elaborate a bit here, this implementation just stores chunks as an array of 
>integers, which is wasteful if you have a lot of contiguous chunks.  Should file a 
>followup to improve that.

We're talking about 50-100 KB of RAM here, only used during updates, so I don't think we need to care about it.

>Nit: In some files you use pointers for out params and in others you pass by ref.  
>Would be nice to be consistent.

There's a discussion on this issue going on right now at:
news://news.mozilla.org/mozilla.dev.platform
As far as I understand, out parameters should be pointers, unless they're strings(?).

There is also some inconsistency in parameter naming, i.e. my code mostly uses aParam but dcamp's code doesn't (Dave, is your code simply reflecting a more modern style in Mozilla code, or should it be "fixed"?). And some files use "type * varname", some "type* varname".

I cleaned *some* of this but there's much inconsistency left. Maybe I should just clean all that up mechanically in a last patch.

>What does this get initialized to?  Where are Prefix and Completion defined?

Not sure about the question here. They're defined right above.

>>Is this safe against power outage during writes?  Would it be safer to write to a temp 
>>file and then rename the temp file over the old file?
>SafeLocalFileOutputStream takes care of that, renames the temp file after a successful close.

I seem to have run into a problem with this and it's causing xpcshell test failures on Windows, so I had to remove the usages. Will file follow-up bugs. Code is XXX'ed.

>> +// XXX: Inconsistent
>Can you elaborate on this more?

This was dcamp's comment, and given that he didn't say anything about it, I just removed it.

>This patch hasn't reimplemented gethash noise, I think we decided in the security review that 
>we needed to keep that?

Implemented this. I ran into an issue. The old noise code would randomly take some hashes from right before and after the hash that hit. I figured that this still allows you to detect the true URL being queried if you can intercept enough different users, because the true URL is simply the one being probed most.
A solution I wanted to implement is to fix the prefixes being probed, for example, by rounding the index to the nearest multiple of 4 (instead of making it random) and then requesting those 4 URLs, causing all users to request the same noise URLs, and making the larger scale analysis much more difficult.

Unfortunately, we encode the prefixes with a per-user key to avoid them from colliding on the same sites. But this also means that the prefixes have a different order for every user, so this technique cannot work.

So we can't do anything more than add some random prefixes. You can't say for sure which URL a user requested, but you can still make an educated guess if you can observe enough users.

>Does this still need addressing?  If Build() fails what happens (and what should happen?)

Build() can't really fail, unless it OOM's or whatever, at which point recovery is a mess anyway :-P

>I think we need to do some sort of integrity check here before landing.

Too late :-/

https://bugzilla.mozilla.org/show_bug.cgi?id=702217
https://bugzilla.mozilla.org/show_bug.cgi?id=706049

I figured the following:
- LookupCache needs to protect its header. Corruption in the actual data can't cause crashes and will only cause visible problems on a SHA-256 collision. So it doesn't need further checks.
- HashStore needs MD5s. Unfortunately, integrity checking on an nsIOutputStream on the fly is quite difficult, and I think I'd have needed to rework a lot of code for that. So instead I wrote the code to checksum the file right after writing, and then append the checksum to the file (see the sketch after this list). This doesn't prevent the case where the file gets corrupted between the data being in memory and written out to disk and the checksum being calculated, but given how modern OSes work, that seems quite unlikely anyway.
- PrefixSet only sanity checks its header, though it will detect a lot of data corruption because it crosschecks its data with the checksumed HashStore. Right now it can cause a crash if it tries to load corrupted data. This should be fixed for bug 706049 independent of this bug, so that isn't included in the patches here.
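
To make the HashStore bullet concrete, here is a minimal sketch of computing an MD5 over the store data with nsICryptoHash and appending it to the output stream. The function name and the aData/aLength parameters are illustrative (the actual patch checksums the file it just wrote), and error handling is trimmed:

#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"   // do_CreateInstance
#include "nsICryptoHash.h"
#include "nsIOutputStream.h"
#include "nsString.h"

static nsresult AppendMD5(nsIOutputStream* aOut,
                          const uint8_t* aData, uint32_t aLength)
{
  nsCOMPtr<nsICryptoHash> hash =
    do_CreateInstance("@mozilla.org/security/hash;1");
  NS_ENSURE_TRUE(hash, NS_ERROR_FAILURE);

  nsresult rv = hash->Init(nsICryptoHash::MD5);
  NS_ENSURE_SUCCESS(rv, rv);
  rv = hash->Update(aData, aLength);
  NS_ENSURE_SUCCESS(rv, rv);

  nsCString digest;
  rv = hash->Finish(false, digest);   // false = raw 16-byte digest, not base64
  NS_ENSURE_SUCCESS(rv, rv);

  uint32_t written;
  return aOut->Write(digest.get(), digest.Length(), &written);
}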

I added Telemetry for the completions cached as requested in the security review. 
More stats on completions requested/failed are a bit more complicated, I didn't see
how to do that with current Telemetry. Can file follow up bugs if its deemed important.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #44)

> >To elaborate a bit here, this implementation just stores chunks as an array of 
> >integers, which is wasteful if you have a lot of contiguous chunks.  Should file a 
> >followup to improve that.
> 
> We're talking about 50-100k RAM here, only used during updates, so I think
> we don't need to care about it.

Yeah, it's not a big deal.

> There's a discussion on this issue going on right now at:
> news://news.mozilla.org/mozilla.dev.platform
> As far as I understand, out parameters should be pointers, unless they're
> strings(?).

Yeah, should follow the recommendations in that thread re: out pointers.

> There is also some inconsistency in parameters, i.e. my code was mostly
> aParam but dcamp's code doesn't use that (Dave, is your code simply
> reflecting what is a more modern style in Mozilla code, or should it be
> "fixed"). And some files use "type * varname", some "type* varname".

No, it's not reflecting a more modern style.  Feel free to fix up new code to aStyle.

> >>Is this safe against power outage during writes?  Would it be safer to write to a temp 
> >>file and then rename the temp file over the old file?
> >SafeLocalFileOutputStream takes care of that, renames the temp file after a successful close.
> 
> I seem to have run into a problem with this and it's causing xpcshell test
> failures on Windows, so I had to remove the usages. Will file follow-up
> bugs. Code is XXX'ed.

I bet it needs some help from the dummy directory provider in the xpcshell tests.  I'll take a look at the SafeLocalFileOutputStream impl and figure out what the tests need to do differently...

> >This patch hasn't reimplemented gethash noise, I think we decided in the security review that 
> >we needed to keep that?
> 
> Implemented this. I ran into an issue. The old noise code would randomly
> take some hashes from right before and after the hash that hit. I figured
> that this still allows you to detect the true URL being queried if you can
> intercept enough of different users, because the true URL is simply the one
> being probed most. 
> A solution I wanted to implement is to fix the prefixes being probed, for
> example, by rounding the index to the nearest multiple of 4 (instead of
> making it random) and then requesting those 4 URLs, causing all users to
> request the same noise URLs, and making the larger scale analysis much more
> difficult.
> 
> Unfortunately, we encode the prefixes with a per-user key to avoid them from
> colliding on the same sites. But this also means that the prefixes have a
> different order for every user, so this technique cannot work.
> 
> So we can't do anything more than add some random prefixes. You can't say
> for sure which URL a user requested, but you can still make an educated
> guess if you can observe enough users.

Yeah, that's a pretty general problem with the noise strategy.  Even with an ideal noise solution, they can still correlate against chrome users to make an educated guess.  Let me think about this for a bit.

> I figured the following:
> - LookupCache needs to protect its header. Corruption in the actual data
> can't cause crashes and will only cause visible problems on a SHA-256

> collision. So it doesn't need further checks.
> - HashStore needs MD5s. Unfortunately, integrity checking on an
> nsIOutputStream on the fly is quite difficult, and I think I'd have needed
> to rework a lot of code for that. So instead I wrote the code to checksum
> the file right after writing, and then append the checksum to the file. This
> doesn't prevent the case where the file gets corrupted between the data
> being in memory and written out to disk, and the checksum being calculated,
> but due to how modern OS's work, that seems quite unlikely anyway.

That's a fairly hefty chunk of extra IO.  How much work would it be to just add a ChecksumWrite(data, length, outputstream, digest) helper?

> - PrefixSet only sanity checks its header, though it will detect a lot of
> data corruption because it crosschecks its data with the checksumed
> HashStore. Right now it can cause a crash if it tries to load corrupted
> data. This should be fixed for bug 706049 independent of this bug, so that
> isn't included in the patches here.

It doesn't crosscheck during lookup, though, right?  It's regenerated with each database update, so maybe that's not as important...
Comment on attachment 578005 [details] [diff] [review]
Patch 5. Folded version of all patches

Review of attachment 578005 [details] [diff] [review]:
-----------------------------------------------------------------

A few quick comments, in addition to the ones above.  Looking good.

r-'ing for the following stuff:

* Fixing SafeLocalFileOutputStream would be nice if possible, let's see if there's a quick fix.
* Same with checksumming as we write the store, rather than as a separate pass.  If that's troublesome we can do it as a followup, but if we decide to do that I need to do one more review pass on that code.
* gcp and I discussed some changes to the noise implementation on IRC, so I didn't review this implementation [will post a summary in a followup comment].
* InitKey() question below.

::: toolkit/components/url-classifier/Classifier.cpp
@@ +79,5 @@
> + * https://bugzilla.mozilla.org/show_bug.cgi?id=669407#c10
> + */
> +nsresult
> +Classifier::InitKey()
> +{

If this file is corrupt/missing, we need to throw away the prefixset, right?

::: toolkit/components/url-classifier/ProtocolParser.h
@@ +53,5 @@
> +public:
> +  struct ForwardedUpdate {
> +    nsCString table;
> +    nsCString url;
> +  nsCString mac;

Tiny nit: indentation is off here.
Attachment #578005 - Flags: review?(dcamp) → review-
The noise change we discussed was:

gcp wanted to use a multiple-of-4 sliding window for choosing noise entries.  He decided that wouldn't work as well as hoped, because each user has an individual hash that mixes up the hash order.  So he fell back to random numbers.  But now he's thinking that the multiple-of-4 sliding window would work as well as random numbers.

I'm not really comfortable r+ing a change to the character of the noise without getting a bit of feedback from the security team.  But given the limited usefulness of the noise to begin with, I don't think we should worry too much about it.
Attachment #578004 - Attachment is obsolete: true
Attachment #578005 - Attachment is obsolete: true
Attachment #578004 - Flags: feedback?(tony)
Attachment #579307 - Flags: review?(dcamp)
Attached patch Patch 5. Fix all style issues (obsolete) — Splinter Review
Attachment #579310 - Flags: review?(dcamp)
Attachment #579311 - Flags: review?(dcamp)
Attachment #579312 - Flags: review?(dcamp)
Attachment #579316 - Flags: review?(dcamp)
>No, it's not reflecting a more modern style.  Feel free to fix up new code to 
>aStyle.

I reworked all the affected code to be consistent there, also with in/out parameters (references in, pointers for inout, strings always by reference but const if not modified) as per the style discussion in the newsgroups.

>I bet it needs some help from the dummy directory provider in the xpcshell tests.  
>I'll take a look at the Safe LocalFileOutputStream impl and figure out what the 
>tests need to do differently...

FWIW, the directory seems to exist (I already checked this by logging it and verifying that it was created). I don't know what the problem with that code is.

>That's a fairly hefty chunk of extra IO.  How much work would it be to just add a 
>ChecksumWrite(data, length, outputstream, digest) helper?

There's no simple way to construct such a helper that I can see (that doesn't cause the extra IO, otherwise it's exactly what the first patch did).

It's possible to create a wrapper class that checksums output streams on the fly, which I just did with nsCheckSummingOutputStream. I'm not sure if this is very clean, because it directly extends nsFileOutputStream, so it must find the nsFileStream headers through a ../../... Maybe we can put this nsCheckSummingOutputStream right into the netwerk/... tree and avoid that? Is this even a real issue?

But on input, it's necessary to first read in the file entirely before you can verify the checksum, and you want to do this before you pass the data on. So you must go over the file twice. The cleanest solution here looked like simply wrapping the input in an nsBufferedStream with a 6 MB buffer, which uses more memory but avoids the duplicate IO altogether.
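
For what it's worth, the on-the-fly output variant boils down to feeding every written buffer into an nsICryptoHash before forwarding it. The class below is a simplified, hypothetical sketch that wraps an arbitrary nsIOutputStream, rather than deriving from nsFileOutputStream the way the actual nsCheckSummingOutputStream does; constructor error handling is trimmed.

#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"
#include "nsICryptoHash.h"
#include "nsIOutputStream.h"
#include "nsString.h"

// Every byte that passes through Write() is fed to the digest as well,
// so no separate checksumming pass over the data is needed.
class CheckSummingWriter {
public:
  explicit CheckSummingWriter(nsIOutputStream* aInner) : mInner(aInner) {
    mHash = do_CreateInstance("@mozilla.org/security/hash;1");
    mHash->Init(nsICryptoHash::MD5);
  }

  nsresult Write(const char* aBuf, uint32_t aCount, uint32_t* aWritten) {
    nsresult rv = mHash->Update(reinterpret_cast<const uint8_t*>(aBuf), aCount);
    NS_ENSURE_SUCCESS(rv, rv);
    return mInner->Write(aBuf, aCount, aWritten);
  }

  // Append the digest at the end, as the store format expects.
  nsresult Finish() {
    nsCString digest;
    nsresult rv = mHash->Finish(false, digest);
    NS_ENSURE_SUCCESS(rv, rv);
    uint32_t written;
    return mInner->Write(digest.get(), digest.Length(), &written);
  }

private:
  nsCOMPtr<nsIOutputStream> mInner;
  nsCOMPtr<nsICryptoHash> mHash;
};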

>It doesn't crosscheck during lookup though, right?  It's regenerated with each 
>database update though, so maybe that's not as important..

The corruption can't exist for more than 45 minutes (at which point all databases are considered stale), typically can't exist more than 30 minutes (update happens), and is safe because the nsPrefixSet code was audited for bug 706049 to not misbehave even if the underlying data is completely bogus.

>gcp wanted to use a multiple-of-4 sliding window for choosing noise entries.  He decided that wouldn't 
>work as well as hoped, because each user has an individual hash that mixes up the hash order.  So he fell 
>back to random numbers.  But now he's thinking that the multiple-of-4 sliding window would work as well as 
>random numbers.

To elaborate on comment 44, I concluded that we can do no better than send random entries, because there is no concept of "neighboring entries" with each user having randomized prefixes (due to rehashing with a specific per-user key), so the order between each user differs.

However, we had problems with bug 706740, which means that we want to avoid generating random numbers in the nsUrlClassifierDBServiceWorker thread. So I implemented the multiple-of-4 window anyway, in the knowledge that this is equivalent to random picks to an outside attacker/observer, but avoids the need to call the random number generator.
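
Concretely, the window selection is just index arithmetic; a hedged sketch (the array and function names are hypothetical):

#include <cstdint>
#include <vector>

// Given the index of the matching prefix in the sorted prefix array, return
// the 4-aligned window containing it. This avoids calling the RNG on the
// worker thread; because prefix order differs per user (per-client key),
// this looks equivalent to random picks to an outside observer.
std::vector<uint32_t> NoiseWindow(const std::vector<uint32_t>& prefixes,
                                  size_t hitIndex) {
  size_t start = hitIndex & ~static_cast<size_t>(3);   // round down to a multiple of 4
  if (start + 4 > prefixes.size()) {                   // clamp at the end of the array
    start = prefixes.size() >= 4 ? prefixes.size() - 4 : 0;
  }
  std::vector<uint32_t> window;
  for (size_t i = start; i < prefixes.size() && i < start + 4; ++i) {
    window.push_back(prefixes[i]);
  }
  return window;
}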

> InitKey() question below.

Good catch, fixed. I also found and fixed a case where the LookupCache detects corruption and resets - it needs to tell this upwards to Classifier, which needs to pass it down to HashStore. This is necessary because the data sits split between LookupCache and HashStore, and having a HashStore without LookupCache causes problems.

I added Telemetry for the number of Prefixes, which seems to expose a bug in LINEAR Histograms :-P

I think I have everything from the review, aside from the nsISafeOutputStream thing.

One thing to investigate in the future: the old nsUrlClassifierPrefixSet had code to handle lookups on the main thread and wait until the PrefixSet gets loaded (in the worker thread) if necessary. This complicated the code, and given that the main thread lookups are entirely gone, we might be able to axe that if we don't plan to reinstate "probe on the main thread". (There is Telemetry included to track the performance of the actual lookup in the worker vs the observed delay in the main thread, including thread ping-pong.)
If we want to export the checksumming output stream for xpcom consumers, it probably belongs in netwerk or something rather than url-classifier.  Rather than block this bug on a new exposed API, maybe it would be best to not export a contract ID.  You can create it manually with operator new rather than do_CreateInstance.

You can file a followup to export this functionality somewhere more appropriate.
Blocks: 568893
Attachment #564581 - Attachment is obsolete: true
Attachment #564583 - Attachment is obsolete: true
Attachment #564966 - Attachment is obsolete: true
Attachment #564583 - Flags: feedback?(tony)
Attachment #564966 - Flags: feedback?(tony)
Attachment #579307 - Attachment is obsolete: true
Attachment #579310 - Attachment is obsolete: true
Attachment #579311 - Attachment is obsolete: true
Attachment #579312 - Attachment is obsolete: true
Attachment #579316 - Attachment is obsolete: true
Attachment #592678 - Attachment is obsolete: true
Attachment #592679 - Attachment is obsolete: true
Attachment #592680 - Attachment is obsolete: true
Attachment #579307 - Flags: review?(dcamp)
Attachment #579310 - Flags: review?(dcamp)
Attachment #579311 - Flags: review?(dcamp)
Attachment #579312 - Flags: review?(dcamp)
Attachment #579316 - Flags: review?(dcamp)
Attachment #592691 - Flags: review?(dcamp)
Attachment #592692 - Flags: review?(dcamp)
Attachment #592690 - Attachment is obsolete: true
Attachment #593174 - Flags: review?(dcamp)
Attachment #593175 - Flags: review?(dcamp)
Attachment #592692 - Attachment is obsolete: true
Attachment #592692 - Flags: review?(dcamp)
Attachment #593177 - Flags: review?(dcamp)
I think I'm done for now.

Changes
1) Unbitrot the patches to the changes on m-c (PrefixSet OOM, Memory Reporter).
2) Fixes so the nsISafeOutputStream usage works on Windows.
3) NS_GetSpecialDirectory should run on the main thread, because it causes an enumeration of Directory reporters.
4) I had some occasional orange based on freeing callbacks on the wrong thread. Fixed this with an NS_ReleaseProxy. As far as I can tell this has *always* been broken.

Try run: https://tbpl.mozilla.org/?tree=Try&rev=e75426a42c3e
(The other orange is telemetry, not due to this patch.)
Attachment #592691 - Attachment is obsolete: true
Attachment #592691 - Flags: review?(dcamp)
Comment on attachment 593174 [details] [diff] [review]
Patch 7. Add nsCheckSummedOutputStream and use it.

Review of attachment 593174 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC, can you file a followup to implement this as a stream that wraps another stream, rather than deriving from the file stream please?
Attachment #593174 - Flags: review?(dcamp) → review+
Attachment #593175 - Flags: review?(dcamp) → review+
(In reply to Gian-Carlo Pascutto (:gcp) from comment #54)

> But on input, it's necessary to first read in the file entirely before you
> can verify the checksum, and you want to do this before you pass the data
> on. So you must go over the file twice. The cleanest solution here looked
> like simply wrapping the input into a nsBufferedStream with a 6Mb buffer,
> which uses more memory but avoids the duplicate IO altogether. 

If we added a ChecksummingInputStream along the lines of the ChecksummingOutputStream, and added a bailout after reading in all the data the first time, couldn't that work?

As it stands this patch should significantly reduce IO, so I wouldn't hold off landing over that.
Comment on attachment 593177 [details] [diff] [review]
Patch. Amalgamation of all the above

This is a *giant* patch.

With this most recent version I've been through this patch a few times, and I'm pretty confident that it's a solid improvement over our current implementation.   Let's file followups for the checksumming questions I had and get it landed early in the 12 cycle.

Thanks for your hard work on this, Gian-Carlo.
Attachment #593177 - Flags: review?(dcamp) → review+
(In reply to Dave Camp (:dcamp) from comment #74)
> Let's file followups for the checksumming questions I had
> and get it landed early in the 12 cycle.

(* early in the 13 cycle - 12 is aurora now. The trains, they run evermore.)
Filed bug 723146 for nsCheckSummedOutputStream improvements.
Filed bug 723153 for cleaning up the old database.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae2d52111205

Regression  Trace Malloc MaxHeap increase 20.3% on CentOS release 5 (Final) Mozilla-Inbound
---------------------------------------------------------------------------------------------
    Previous: avg 26145933.333 stddev 5968.211 of 30 runs up to revision 66fcee339e32
    New     : avg 31442740.000 stddev 21819.555 of 5 runs since revision 44a0dc4fb9ff
    Change  : +5296806.667 (20.3% / z=887.503)
    Graph   : http://mzl.la/xJXSES


I suspect this is the fixed-size 6M buffer in nsBufferedInputStream. Will add a patch to make it dependent on the actual file size.
The Talos memory use regression almost certainly comes from the fixed size 6M buffer used for avoiding the double read. Now sizing it exactly right.

Performance regressions are trickier because there are multiple potential causes. The worst I found was that we rescan the filesystem on each probe to see which databases are present. This patch checks once on startup and after every update, and caches the result elsewhere.

From comparing the try run with other try runs I think the regression is gone or reduced.

https://tbpl.mozilla.org/?tree=Try&rev=e102f89a5cf8
Attachment #593756 - Flags: review?(dcamp)
https://hg.mozilla.org/mozilla-central/rev/44a0dc4fb9ff
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 13
This was backed out by
https://hg.mozilla.org/mozilla-central/rev/ae2d52111205
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry, yeah, I realised as soon as I pressed submit that the backout was in the same 96-cset merge :-)
What about using mmap?
What would the upsides be? It doesn't give any more guarantee that the result is cached, and it's certainly not faster than regular reads for the purely serial access we do. Obvious downside: it's not available in a streaming API, so it requires a bunch more code to be written.
Attachment #593756 - Flags: review?(dcamp) → review+
https://hg.mozilla.org/mozilla-central/rev/173f90d397a8
https://hg.mozilla.org/mozilla-central/rev/db52b4916cde
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
I noticed that the old urlclassifier3.sqlite database wasn't removed when the new folder was added. Also, I see that the files for the 'test' provider are always present, next to the ones from Google.

I see that one of the tests (head_urlclassifier.js) has a cleanup procedure - that's exactly what is missing.
>I noticed that the old url3classifier3.sqlite database wasn't removed when the new 
>folder was added.

See comment 78.

>Also, I see that the files for 'test' provider are always present, next to the 
>ones form Google.

Working as designed; see for example bug 693389.
For those following along, who may not have seen gcp's excellent summary post :-)
http://www.morbo.org/2012/02/new-safebrowsing-backend.html
Depends on: 725872
Comment on attachment 592682 [details] [diff] [review]
Patch 1. Add the optimized store for SafeBrowsing.

>+  LOG(("Applied %d update(s) to %s.", applied, PromiseFlatCString(store->TableName()).get()));
You call PromiseFlatCString a number of times on store->TableName(), which is already flat, so in the best case you're just exercising the compiler's PGO code...
Blocks: 741528
No longer blocks: 741528
Attached patch Patch. Backout part 1 (obsolete) — Splinter Review
[Approval Request Comment]
Backout due to bug 744993:

a01cf079ee0b Bug 730247
1a6d008acb4f Bug 729928
f8bf3795b851 Bug 729640
35bf0d62cc30 Bug 726002
a010dcf1a973 Bug 726002
e9291f227d63 Bug 725597
db52b4916cde Bug 673470
173f90d397a8 Bug 673470
Attachment #616616 - Flags: approval-mozilla-central?
Attachment #616616 - Flags: approval-mozilla-aurora?
Attached patch Patch. Backout part 2 (obsolete) — Splinter Review
Attachment #616618 - Flags: approval-mozilla-central?
Attachment #616618 - Flags: approval-mozilla-aurora?
This isn't actually on beta yet, is it? Didn't this land in the 13 cycle?
Sorry, misread the approval flags - nevermind!
Depends on: 744993
Attachment #616616 - Flags: approval-mozilla-central? → approval-mozilla-central+
Attachment #616618 - Flags: approval-mozilla-central? → approval-mozilla-central+
When this re-lands, please also pick up the toolkit/components/url-classifier/Entries.h fix from bug 738533. The patch will be landing without it.
Attachment #616616 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Attachment #616618 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment on attachment 616616 [details] [diff] [review]
Patch. Backout part 1

Sorry - thought this bug was reopened because the backout was backed out. My mistake. Approved for Aurora 13 (or Beta 13 if the merge occurs before we land).
Attachment #616616 - Flags: approval-mozilla-aurora- → approval-mozilla-aurora+
Attachment #616618 - Flags: approval-mozilla-aurora- → approval-mozilla-aurora+
Depends on: 750625
Carrying forward review. Only changes are rebasing.
Attachment #592682 - Attachment is obsolete: true
Attachment #592683 - Attachment is obsolete: true
Attachment #592684 - Attachment is obsolete: true
Attachment #592686 - Attachment is obsolete: true
Attachment #592687 - Attachment is obsolete: true
Attachment #592688 - Attachment is obsolete: true
Attachment #593174 - Attachment is obsolete: true
Attachment #593175 - Attachment is obsolete: true
Attachment #593177 - Attachment is obsolete: true
Attachment #616616 - Attachment is obsolete: true
Attachment #616618 - Attachment is obsolete: true
Attachment #650036 - Flags: review+
Fixes the performance regression seen on the initial landing. Already reviewed => rebased.
Attachment #650037 - Flags: review+
Originally landed in bug 725597, already reviewed.
Attachment #593756 - Attachment is obsolete: true
Attachment #650038 - Flags: review+
Originally landed in bug 726002. Rebased.
Attachment #650039 - Flags: review+
Originally landed in 726002, already reviewed, rebased.
Attachment #650040 - Flags: review+
Bug 729640, already reviewed, rebased.
Attachment #650041 - Flags: review+
Bug 729928, already reviewed, rebased.
Attachment #650042 - Flags: review+
Bug 730247, already reviewed, rebased.
Attachment #650044 - Flags: review+
The comments describing the sbstore/HashStore on-disk format were erroneous.  These mistakes were found when writing a Python tool to analyze the databases: https://github.com/gcp/sbdbdump
Attachment #650045 - Flags: review?(dcamp)
We knock out addprefixes that match received subs, but this function also deletes the sub that caused the knockout. If we have previously had a hit on a malware/phishing site, and store an addcomplete for it, we need to knock out that addcomplete. 

Because we use per-client randomization, this is done by generating both randomized and non-randomized subs, where we assign the non-randomized subs to subchunk 0.

This reorders/cleans up the ProcessSubs function to do the Complete removals based on matching prefixes first, clean up the fake Subs (which can only be in SubPrefixes), and then do the normal knockout afterwards (which now only needs to check half as much data).
Attachment #650049 - Flags: review?(dcamp)
Add an option to disable the per-client randomization. This is useful for debugging the protocol and running tests, as it makes the database behavior reproducible.
Attachment #650052 - Flags: review?(dcamp)
In one instance the order of arguments to the per-client randomization function was reversed, causing about half of the prefixes in the database to be wrong. This happens even if per-client randomization is disabled.
Attachment #650053 - Flags: review?(dcamp)
There were many superfluous sort calls in the existing code. We only need to sort the prefixes once: directly after they have gone through a Merge() call. (Note that in that function, we could quicksort the new prefixes and then mergesort them into the already sorted database, but we probably don't need to care about such further optimization right now.)

Add assertions and some functions that confirm that all the other code maintains the sorted order of the list, when running in debug mode.

>return *((uint32*)a) - *((uint32*)b);

Dave, this doesn't work :)

None of the sorts were working, causing subprefix knockouts (among others) to be entirely broken.
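
The usual problem with that subtraction pattern is that the difference of two uint32 values is itself unsigned; once the true difference no longer fits in an int, the converted result gets the wrong sign and the comparator becomes inconsistent. An explicit three-way comparison avoids this; a sketch:

#include <cstdint>
#include <cstdlib>   // qsort

static int CompareUint32(const void* a, const void* b) {
  uint32_t va = *static_cast<const uint32_t*>(a);
  uint32_t vb = *static_cast<const uint32_t*>(b);
  if (va < vb) return -1;
  if (va > vb) return 1;
  return 0;
}

// Usage: qsort(array, count, sizeof(uint32_t), CompareUint32);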

This also cleans up another part of the file format documentation being wrong.
Attachment #650056 - Flags: review?(dcamp)
If we receive a SubPrefix that refers to an addChunk that we already have, but the SubPrefix survives the Knockout phase, this must mean the update server already optimized away the AddPrefix the Sub was referring to, in order to save bandwidth. The SubPrefix obviously can't ever do anything in this case.

Detect such Prefixes and remove then after the Knockout phase.

It should be noted that the old Firefox Safebrowing code was also doing this optimization. (rv = mPendingSubStore.ExpireAddChunk(tableId, chunkNum);)
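In sketch form (illustrative names and simplified containers, not the actual patch), the post-knockout cleanup amounts to:

  #include <algorithm>
  #include <cstdint>
  #include <set>
  #include <vector>

  struct SubPrefix { uint32_t addChunk; uint32_t subChunk; uint32_t prefix; };

  // After knockout, a sub that still refers to an add chunk we already have
  // can never match anything: the server optimized the add away, so drop it.
  void DropDeadSubs(std::vector<SubPrefix>& subs,
                    const std::set<uint32_t>& addChunksWeHave)
  {
    subs.erase(
      std::remove_if(subs.begin(), subs.end(),
        [&](const SubPrefix& s) {
          return addChunksWeHave.count(s.addChunk) != 0;
        }),
      subs.end());
  }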
Attachment #650069 - Flags: review?(dcamp)
At some point we made the PrefixSet code fallible with respect to OOM situations. This was possible because the old SQLite-backed store could always fall back to doing disk lookups with SQLite in case there wasn't enough memory to construct our in-memory database.

Because SQLite is gone, and the PrefixSet is now the sole place where the essential AddPrefixes are stored, this fallback is no longer available.

We now apply "Security or DEATH" (again) by making the PrefixSet crash rather than OOM.

The new code only needs to manipulate a dataset half as large as the previous implementation, and generally tries to avoid allocating more memory than necessary. On top of that, our diagnostics for Firefox crashes caused by OOM have improved, so we should be better able to understand and fix things if this happens again.
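As a minimal illustration of the policy, in plain C++ with an illustrative helper (not the actual Mozilla allocation hooks):

  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  // With SQLite gone there is no fallback store to serve lookups from, so if
  // we can't allocate the PrefixSet arrays we abort rather than silently run
  // without Safe Browsing protection.
  static uint32_t* AllocPrefixArrayOrCrash(size_t count)
  {
    uint32_t* p = static_cast<uint32_t*>(malloc(count * sizeof(uint32_t)));
    if (!p) {
      fputs("PrefixSet: out of memory while building the lookup set\n", stderr);
      abort();
    }
    return p;
  }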
Attachment #650079 - Flags: review?(dcamp)
Attachment #650079 - Flags: feedback?(justin.lebar+bug)
Minor code cleanup pointed out in this bug.
Attachment #650082 - Flags: review?(dcamp)
See notes in this bug and Bug 738533. There's no reason to use NS_QuickSort over qsort, and it simplifies the code.
Attachment #650084 - Flags: review?(dcamp)
The PrefixSet we have now is threadsafe and can be used, with minimal blocking, while database updates are in progress. This was a complicated but useful thing to have when we still had an SQLite backend, as it allowed almost every query to be resolved immediately on the main thread without ever blocking or hitting disk (though it might defer a lookup if we are still loading the prefixset from disk).

This patch removes the multithread handling entirely from PrefixSet. This has several advantages: it's simpler, it supports empty prefixsets without the ugly hacks, and *it cuts the maximum memory requirement in half*.

The new SafeBrowsing code is more modular, and splits the database into a prefixset per safebrowsing list. It hasn't ever supported probing from within the main thread, and when it originally landed no performance regressions were reported for this, so the complications in prefixset are no longer useful.

In case we ever do need to accelerate lookups by doing them from the main thread, I'd advise using a separate, simpler, LRU-style cache for fragments or prefixes, similar to what the code had before we added PrefixSets.
Attachment #650088 - Flags: review?(dcamp)
Attachment #650045 - Flags: review?(dcamp) → review+
Attachment #650046 - Flags: review?(dcamp) → review+
Attachment #650047 - Flags: review?(dcamp) → review+
> We now apply "Security or DEATH" (again) by making the PrefixSet crash rather than OOM.

What's the maximum amount of memory allocated here?
Attachment #650053 - Flags: review?(dcamp) → review+
Attachment #650056 - Flags: review?(dcamp) → review+
Attachment #650082 - Flags: review?(dcamp) → review+
Attachment #650084 - Flags: review?(dcamp) → review+
Attachment #650049 - Flags: review?(dcamp) → review+
Attachment #650069 - Flags: review?(dcamp) → review+
Attachment #650083 - Flags: review?(dcamp) → review+
Attachment #650088 - Flags: review?(dcamp) → review+
Comment on attachment 650052 [details] [diff] [review]
Patch 13. Provide an option to disable per-client randomization.

Review of attachment 650052 [details] [diff] [review]:
-----------------------------------------------------------------

If the pref changes, won't databases built before the pref change break silently?
How are you testing these changes?  Can we get something automated to catch these problems if they spring up again?
>If the pref changes, won't databases built before the pref change break silently?

Yes. It's meant for debugging, in the "don't toggle this unless you know what you're doing" style.

>How are you testing these changes?  Can we get something automated to catch these 
>problems if they spring up again?

There's a tracking bug for improving the tests to actually test the real protocol:
https://bugzilla.mozilla.org/show_bug.cgi?id=750751

But, what I *really* did here to debug this was to use the tool I linked earlier: https://github.com/gcp/sbdbdump

I compile 2 Firefoxen, one with the old code, one with the new code, both with per-client randomization disabled, and launch them both at the same time pointing to different profiles. I then use the sbdbdump tool to compare the databases from time to time. They should not differ greatly or start to diverge, though they can obviously differ a little due to collisions or the server not sending 100% the same data.

I've simply kept fixing bugs and using the tool to analyze the remaining differences until I can receive and process the entire current set without any unexpected divergence. The current patchset seems to do this, i.e. it will produce the same database contents as the old code.

Based on this, I'm quite confident there are no serious bugs remaining, perhaps more so than if we'd just pass a static set of tests, and I'd consider landing the current code before we have bug 750751. We can keep monitoring the divergence for a few more weeks manually and react if needed. 

I do think bug 727370 should land reasonably soon after, and in the same versions, as this one lands.
>What's the maximum amount of memory allocated here?

The biggest single set is 400K add prefixes + 170k sub prefixes.

Classifier calls BeginUpdate which calls ReadEntireStore. This means we will have the entire store in RAM at this time:

400k * (uint32 prefix, uint32 chunk)
170k * (uint32 prefix, uint32 chunk, uint32 chunk)

3.2M + 2.0M = 5.2M

ReadSubPrefixes temporarily keeps two copies of the data live during reading (this is fixable). It SetCapacities the nsTArray so there's no loss from that. So peak is 3.2M + 2.0M*2 = 7.2M

It will then call GetPrefixes followed by AugmentAdds, which will load only the add prefixes. So another 400k * uint32 = 1.6M. We might be able to save that by making GetPrefixes aware of the datastructure it's filling into, though that'd make the code uglier. The extra storage is freed immediately at this point. (Peak = 5.2M + 1.6M = 6.8M)

While applying an update, we do an nsTArray AppendElements onto the existing database. I suppose that at worst this can double the memory usage if the nsTArray's capacity needs to resize. So new peak = 6.4M (3.2M*2) + 2.0M = 8.4M, resident = 5.2M

After this point, we could free the SubPrefixes part of the data already, lowering peak and resident by 2.0M (We don't do this right now!)

We then call Build->ConstructPrefixSet. This needs to make a temporary uint32 array, so peak = 5.2M + 1.6M = 6.8M, but it frees the AddPrefixes immediately after (6.8M-3.2M => 3.6M). SetPrefixes/MakePrefixSet then get called, which will allocate 2 uint32 arrays and 1 uint16 one. The worst case is a bit hard to analyze here because it's dependent on the compression, and we're guaranteed to get compression if the data is big. The unrealistic worst case is 400k * (uint32 + uint32) = 3.2M, realistically it'll be closer to 1M (400k * uint16). New peak = 3.6M + 3.2M = 6.8M. Realistic peak = 3.6M + 1M = 4.6M

Because the PrefixSet is resident, I'm doing a Compact() call at this point. This could, worst case, temporarily double the memory usage for one of the uint32 arrays. Peak = 6.8M + 1.6M = 8.4M. (Realistic = 4.6M + 2M = 6.6M)

So, the peak memory usage at 2 points is 8.4M. Because the second occurrence (during PrefixSet construction) is less realistic, the real peak will always happen when we have the data resident and are merging more data into the biggest nsTArray, potentially causing a realloc. 

Fixing this might be possible by adding slack space to process the update at the very beginning. Most updates seem to be <100k entries, so if we allocate that slack space (100k/400k=25%), we would turn that 8.4M peak into 3.2M * 1.25 + 2.0M = 6.0M

The next remaining peaks are then 7.2M, which requires fixing ReadSubPrefixes, and 6.8M, which requires GetPrefixes to work with the AddPrefixArray data structure directly.

But I'd hope we are able to temporarily allocate 8.4M.
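For concreteness, the record layouts behind those numbers look roughly like this (a sketch; the real code uses nsTArray and its own struct definitions):

  #include <cstdint>
  #include <cstdio>

  struct AddPrefix { uint32_t prefix; uint32_t addChunk; };                     // 8 bytes
  struct SubPrefix { uint32_t prefix; uint32_t addChunk; uint32_t subChunk; };  // 12 bytes

  int main()
  {
    const size_t kAdds = 400000, kSubs = 170000;
    // 400k * 8 B = 3.2M, 170k * 12 B ~= 2.0M, resident total ~= 5.2M.
    printf("adds:     %.1f MB\n", kAdds * sizeof(AddPrefix) / 1e6);
    printf("subs:     %.1f MB\n", kSubs * sizeof(SubPrefix) / 1e6);
    printf("resident: %.1f MB\n",
           (kAdds * sizeof(AddPrefix) + kSubs * sizeof(SubPrefix)) / 1e6);
    return 0;
  }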
This implements the ideas from the previous comment, and should cut the peak memory usage from 8.4M to ~7.5M. It also avoids a lot of reallocation when merging updates. Unfortunately the "slack" space we preallocate is a magic number, as the only way to know it is to see what Google's servers are actually sending out. It also isn't possible to make it depend on the size of the DB, because the size of an update you receive isn't correlated with the size of the database.
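The preallocation itself is just a capacity bump before merging; in plain-C++ analogy form (the real code would use nsTArray, and the names here are illustrative, not the patch's actual identifiers):

  #include <cstdint>
  #include <vector>

  // Reserve room for a typical update before merging, so the big resident
  // array isn't reallocated -- and temporarily duplicated -- mid-update.
  // 150,000 is a guess based on the update sizes the servers currently send;
  // it can't be derived from the local database size.
  const size_t kUpdateSlack = 150000;

  void ReserveUpdateSlack(std::vector<uint32_t>& addPrefixes)
  {
    addPrefixes.reserve(addPrefixes.size() + kUpdateSlack);
  }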

This means that this patch will cause Mr. Talos to complain to me that we regress MaxHeap by 1.8M, which is quite annoying. (Even though the MaxHeap of a real installation would be lowered.)

What do you think?
Attachment #650503 - Flags: review?(justin.lebar+bug)
Attachment #650052 - Flags: review?(dcamp) → review+
Oops.
Attachment #650619 - Flags: review?(justin.lebar+bug)
Attachment #650503 - Attachment is obsolete: true
Attachment #650503 - Flags: review?(justin.lebar+bug)
> [...]
> But I'd hope we are able to temporarily allocate 8.4M.

Wow, that was impressive.  You earned this f+.  :)
Comment on attachment 650079 [details] [diff] [review]
Patch 17. Make the PrefixSet/LookupCache construction infallible again.

f=me, but perhaps you want to accumulate (URLCLASSIFIER_PS_FAILURE, 0)?  Otherwise it's hard to tell whether an uptick in failures is due to an uptick in usage or a regression.
Attachment #650079 - Flags: feedback?(justin.lebar+bug) → feedback+
Comment on attachment 650619 [details] [diff] [review]
Patch 22. v2 Reduce peak RAM. Add slack space.

I haven't read the other 21 patches in this bug, so I don't feel comfortable reviewing this.

It's not clear to me that this is completely right, but I have more questions than answers.  There are clownshoes issues -- is 150,000 the right number of elements to expand, or will that leave a bunch of the allocated space unused?  (Maybe nsTArray should just use malloc_usable_size.  In fact, it probably should.)

Also, suppose we go over the additional 150,000 elements.  Now nsTArray will double its already-large size.  This is perhaps worse than the original behavior.  How confident are you that we'll stay within the 150,000 elements?  Is it worth adding telemetry for this?

I'll take another look in the morning, perhaps with all the patches applied to my tree so I have some idea what the heck is going on.  :)
Attachment #650619 - Flags: review?(justin.lebar+bug)
Attachment #650619 - Flags: review?(dcamp)
Attachment #650619 - Flags: feedback?(justin.lebar+bug)
>There are clownshoes issues -- is 150,000 the right number of elements to expand, 
>or will that leave a bunch of the allocated space unused?  

For most updates, that will leave part of that space unused. For the test databases, it will leave 149999 of them unused, hence the expected Talos regression in our tests.

>Also, suppose we go over the additional 150,000 elements.  Now nsTArray will 
>double its already-large size.

Unless I misread the nsTArray source, that won't happen. The array is added to via AppendElement*s* calls, each of which does a single, up-front SetCapacity to the combined size of both arrays. So the behavior shouldn't be worse than what we have.

>How confident are you that we'll stay within the 150,000 elements?  Is it worth 
>adding telemetry for this?

Might be a good idea.

I'm not strongly attached to this patch because I agree it's not particularly elegant. For what it's worth, Chrome switched to deques recently: http://src.chromium.org/viewvc/chrome/trunk/src/chrome/browser/safe_browsing/safe_browsing_store.cc?revision=107415&view=markup

But as far as I know such modern STL functionality is not available in our codebase?
Attachment #650079 - Flags: review?(dcamp) → review+
Would it make sense to handle patch 22 (maybe alongside patch 17) as a separate bug?
I'd take 17 with this one (it's made possible by the rewrite). Patch 22 is optional; we can revisit it if we see OOMs.

Looks like we're ready to land? I'll keep the database compare running over the weekend. Meanwhile we can see if Google replies regarding the per-client randomization. If we can get rid of that, I'd do it here too, as it simplifies much of the code that this adds.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #125)
> Based on this, I'm quite confident there are no serious bugs remaining,
> perhaps more so than if we'd just pass a static set of tests, and I'd
> consider landing the current code before we have bug 750751. We can keep
> monitoring the divergence for a few more weeks manually and react if needed. 

This isn't ideal, but: this patch doesn't regress the poor testing state of the safebrowsing code, it's a better basis on which to build testing code, and it's a pretty big overall win.  I do hope we can get attention on bug 750751 sooner rather than later though.

> I do think bug 727370 should land reasonably soon after, and in the same
> versions, as this one lands.

Agreed.

(In reply to Gian-Carlo Pascutto (:gcp) from comment #135)
> I'd take 17 with this one (it's made possible by the rewrite). Patch 22 is
> optional, we can revisit it if we see OOMs.

OK.  We should probably open a followup bug; it's easy for a patch on this bug to get lost in the noise.

> Looks like we're ready to land? 

I think so.
Tackling patch 22 in a separate bug sounds good to me.  I'll follow up once it's filed.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #135)
> Meanwhile we can see if Google replies regarding the per-client
> randomization. If we can get rid of that, I'd do it here too, as it
> simplifies many of the code that this adds.

I must be missing some context, but talk of getting rid of the randomization seems surprising. I'd like to hear more about this (ideally in a new bug). Seems like we're straying pretty far away from the summary at this point, so I think you should move as much work as you can into new bugs.
The randomization is much harder to deal with in the new code because we don't store the non-randomized hashes any more, so it's closely related to this bug. Likewise, removing the randomization will require blowing all the databases away a second time, because we can't recover the original data.

For this reason, it should preferably land at the same time as this bug, even though we can certainly track it in a separate bug.
Comment on attachment 650619 [details] [diff] [review]
Patch 22. v2 Reduce peak RAM. Add slack space.

Clearing f?, since we're going to handle this in a separate bug (right?).

(I had another look at the relevant tarray code, and I think this is basically fine, btw.)
Attachment #650619 - Flags: feedback?(justin.lebar+bug)
It's possible to get an empty chunk (optimized away) in one update, and to get it again (not empty) in an older update. This means that we can both a) store information that is unneeded and b) because we drop SubPrefix entries belonging to an addchunk we already have, after we finish knockout we potentially won't have the SubPrefixes to knock out the now useless data when we receive that addchunk again.

Skip processing chunks we already have. It's a performance optimization, too.

The old code also had this behavior, in this snippet:

  if (!InsertChunkId(mCachedAddChunks, chunkNum)) {
    LOG(("Ignoring duplicate add chunk %d in table %d", chunkNum, tableId));
    return NS_OK;
  }

With this patch, after several days, I still get a 100% match on the phishing database, one single missing AddPrefix in the malware database (no idea what causes that one...collision?), and 36 missing SubPrefixes that all refer to an AddChunk so old that the clients can never receive it. The latter looks to me like one client got a slightly more recent update than the other one, where those subs were already optimized away. 

I'm slightly puzzled by that one missing Prefix, but this doesn't look problematic enough that it'd stop me from landing. (And I think isolating that difference *will* likely need Bug 750751 to get 100% reproducible updates)
Attachment #650619 - Attachment is obsolete: true
Attachment #650619 - Flags: review?(dcamp)
Attachment #651753 - Flags: review?(dcamp)
Attachment #651753 - Flags: review?(dcamp) → review+
(In reply to Gian-Carlo Pascutto (:gcp) from comment #139)
> The randomization

Sorry, I think I was just confused about what you meant when you referred to "randomization". I thought you were referring to the gethashnoise code, but it sounds like you're referring to something else entirely.
There was a bug in the previous patch with respect to Completes: they will appear with the chunk of the corresponding Prefix set (which we will have), but they should not be skipped.
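The corrected skip condition boils down to something like this (an illustrative helper, not the actual patch):

  // Prefixes from a chunk we've already processed can be skipped, but a
  // Complete arrives under the chunk number of the prefix it extends --
  // which we have by definition -- so Completes must never be skipped.
  bool ShouldSkipChunkData(bool alreadyHaveChunk, bool isComplete)
  {
    return alreadyHaveChunk && !isComplete;
  }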
Attachment #651753 - Attachment is obsolete: true
Attachment #651920 - Flags: review?(dcamp)
Last one before landing.

Try run of all the above:
https://tbpl.mozilla.org/?tree=Try&rev=2dd3c6ca79f3
Attachment #651922 - Flags: review?(dcamp)
Attachment #651922 - Flags: review?(dcamp) → review+
Attachment #651920 - Flags: review?(dcamp) → review+
Bug 782887, bug 723153, bug 727370 and bug 750751 will contain the further development of this feature (if this doesn't get backed out, which I truly hope it won't).
uint32? :/
On this bug and some of the dependent bugs that landed at the same time, the target milestone was still set to mozilla13. Please can you check when landing on inbound that the milestone is correct (or else leave it blank and the new merge tool will choose the current milestone) :-)
Depends on: 797302
Depends on: 825627
Product: Firefox → Toolkit
Regressions: 1781027