Closed Bug 1254766 Opened 4 years ago Closed 3 years ago

Stop caching Safe Browsing completions to disk

Categories

(Toolkit :: Safe Browsing, defect, P2)


Tracking


RESOLVED FIXED
mozilla51
Tracking Status
firefox51 --- fixed

People

(Reporter: francois, Assigned: dimi)

References

Details

(Whiteboard: #sbv4-m0)

Attachments

(1 file, 20 obsolete files)

63.95 KB, patch
dimi
: review+
Details | Diff | Splinter Review
Chrome only caches completions in memory for the current browsing session. We should do the same to reduce the disk I/O we have to do.

gcp also made the following comment:

"Right now the completes are in .sbtore and .cache. This was a mistake, which hurt for example when I tried fixing https://bugzilla.mozilla.org/show_bug.cgi?id=1164518#c50"

In bug 1164518, I discovered that the on-disk completion cache probably isn't working anyway, because if it were, it would have made reproducing the bug a lot harder (i.e. I would have had to copy the safebrowsing cache prior to doing the gethash calls and then overwrite the updated files with the backup afterwards).
Summary: Stop caching completions on disk → Stop caching Safe Browsing completions to disk
See bug 806422 for the explanation of why this is allowed in SafeBrowsing v2 (v2.2->v2.3).
Duplicate of this bug: 518236
Duplicate of this bug: 825895
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Just an update that I am working on this right now. I'm not really writing the patch yet; I'm still trying to figure out the .cache, .pset, and .sbstore files and all the logic around caching prefixes and completions...
Hi Francois,
One question: do we want to remove those .cache files left over from the old version, or just leave them there?
Flags: needinfo?(francois)
(In reply to Dimi Lee[:dimi][:dlee] from comment #6)
> Hi Francois,
> One question: do we want to remove those .cache files left over from the
> old version, or just leave them there?
That comment was unclear, sorry; let me restate what I would like to ask.

I think we should remove the .cache files, but doing so would make Gecko check for a .cache file on every startup, and we would have to keep that behavior until someday we decide we can remove the code (or we never will).

So maybe keeping the .cache files is not really a bad idea, since they are not very large and their size will not grow once we stop using them.

So I would like to know whether you have any opinion on this. Thanks!
Whatever behavior you program, remember that people can switch between Nightly/Aurora/Beta, and it would probably be nice if that didn't wipe their DB each time. (You may be lucky here because IIRC missing .cache files are not considered corrupted profiles.)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #8)
> Whatever behavior you program, remember that people can switch between
> Nightly/Aurora/Beta, and it would probably be nice if that didn't wipe their
> DB each time. (You may be lucky here because IIRC missing .cache files are
> not considered corrupted profiles.)

Thanks for the information :) Then I will not remove the .cache files in this bug, to avoid wiping the DB files for other builds, as gcp suggested.
(In reply to Dimi Lee[:dimi][:dlee] from comment #9)
> Thanks for the information :) Then I will not remove the .cache files in
> this bug, to avoid wiping the DB files for other builds, as gcp suggested.

Sounds good to me too. These files are small and cleaning them up would be tricky. Better to leave them alone for now.
Flags: needinfo?(francois)
Hi gcp, francois,
I am a little uncertain about what we are trying to achieve here.
The following is my understanding; it could be totally wrong, so please correct me if anything is off. Thanks!

The purpose of this bug is to stop caching completions to disk.
In the current v2 implementation, completions may come from an update[1] or from gethash[2]. In both cases the completions are cached in |LookupCache::mCompletions| and will eventually be stored to .cache and .sbstore.

We could stop writing completions to the .cache files, but I guess we would still need to write completions to .sbstore and read them back into |mCompletions| on startup, since .sbstore may contain completions from updates.
If that is the way we want to do this, I am confused about what we gain here, because we would still write completions to disk and read them back, only from a more expensive file (.sbstore) compared to .cache.

Another question comes up when I think about the v4 implementation.
The v3 spec[3] says "Full-length hashes obtained in shavar add chunks must also be verified via a full-length hash request prior to showing a warning", and the v4 spec talks about caching, where "caching" refers to the responses from GetEncodedFullHashes. So I think we will need to separate completions from update and from gethash in the v4 implementation.

If we separate completions from update and from gethash, then we could store only the update completions to disk, while gethash completions would only be cached in memory. But this leaves me with a question: if we get completions and restart Firefox immediately, those completions will be lost because they were never written to disk. That seems fine, but the v4 spec's "caching" defines positive and negative cache durations; if we don't store the cache on disk, we will not be able to honor those durations after a restart. Is that the correct behavior? (A rough sketch of what such an in-memory cache might look like follows after the references.)

[1]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/ProtocolParser.cpp#609
[2]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#697
[3]https://developers.google.com/safe-browsing/developers_guide_v3#Overview
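
To make the v4 caching question concrete, here is a rough sketch of the kind of in-memory structure the spec's durations imply. Every name below is invented for illustration; this is not Gecko code.

#include <array>
#include <chrono>
#include <cstdint>
#include <map>

using Clock = std::chrono::steady_clock;
using FullHash = std::array<uint8_t, 32>;

struct PrefixCacheEntry {
  // Negative duration: until this time, a miss on this prefix means
  // "known safe" and no new gethash request is needed.
  Clock::time_point negativeExpiry;
  // Positive duration per full hash: until then, a match is a confirmed hit.
  std::map<FullHash, Clock::time_point> fullHashes;
};

// Keyed by the 4-byte prefix we asked the server about. Held in memory
// only, so a restart discards it -- which is exactly the question above.
std::map<uint32_t, PrefixCacheEntry> gGetHashCache;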
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
>We could stop writing completions to the .cache files, but I guess we would still need to write completions to .sbstore and read them back into |mCompletions| on startup, since .sbstore may contain completions from updates.

Yes. In fact, TP (Tracking Protection) is *all* completions.

I think what we don't want to do is cache gethash completes.

>So I think we will need to separate completions from update and from gethash in the v4 implementation.

Exactly, and maybe for v2 too (if we want to fix this bug!).

>But this leaves me with a question: if we get completions and restart Firefox immediately, those completions will be lost because they were never written to disk. That seems fine, but the v4 spec's "caching" defines positive and negative cache durations; if we don't store the cache on disk, we will not be able to honor those durations after a restart. Is that the correct behavior?

You send out the HTTP GET to the Google server, and while the request is in-flight I cut your network cable. Now nothing that respects the laws of physics will prevent you from breaking spec :-)

More seriously, I'd verify what the current Chromium implementation is doing and/or have Francois ask the Chrome guys about it. I agree respecting cache durations means you'd have to store them on disk with associated cache data, so maybe the "don't store on disk" was specific to v2.
Flags: needinfo?(gpascutto)
Like gcp said, we only want to stop caching gethash completions to disk. So as Dimi points out, that means splitting the completions up so that we keep the ones that come from updates.

We should do this in V2 so that we get the same behaviour on V4.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)
> More seriously, I'd verify what the current Chromium implementation is doing
> and/or have Francois ask the Chrome guys about it. I agree respecting cache
> durations means you'd have to store them on disk with associated cache data,
> so maybe the "don't store on disk" was specific to v2.

Google told me that Chrome only caches gethash completions in memory. They don't care about clients asking for the same completions again in the case of a browser restart.

I would expect they will have the same policy in V4 since completions will be even rarer in V4.
Flags: needinfo?(francois)
Thanks, gcp and francois!
That is really helpful; I know exactly what I should do next now :)
One question:
For the completion cache, the v2 spec doesn't mention that we should clear the cache on update, and the v3 spec says "Clients must clear cached full-length hashes each time they send an update request." in its "Changes since version 2.2" paragraph. So I think what we are doing right now (not clearing the cache on update) doesn't violate the v2 spec.
As for v4, I didn't find any part of the spec talking about this (only cache durations), but I assume it remains the same as v3.

Since we will separate gethash and update completions in this bug, I propose that we also clear the gethash completions on update in the current implementation. The benefits are:
1. It reduces the complexity when an update occurs; otherwise we have to deal with things like checking whether we need to remove completions from the cache when the update contains an expiration, a sub complete, and so on.
2. It makes it easier to migrate to the v4 implementation.
This change won't affect the correctness of Safe Browsing; it only affects how often we send a gethash request.

gcp, francois, what do you think?
(In reply to Dimi Lee[:dimi][:dlee] from comment #15)
> One question:
> For the completion cache, the v2 spec doesn't mention that we should clear
> the cache on update

See:
https://bugzilla.mozilla.org/show_bug.cgi?id=806422

This was a change requested by Google that they never documented.

> So I think what we are doing right now (not clearing the cache on update)
> doesn't violate the v2 spec.

We do clear the gethash cache on updates. See the bug above for where the relevant code lives.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #16)
> We do clear the gethash cache on updates. See the bug above for where the
> relevant code lives.

Thanks!
One more question: in the ClearCompletes[1] code here, we don't clear the cache if the update contains a complete.
So if an update contains completions, we will not remove the gethash completions from the cache.
Do we still want to keep this behavior, or should we always clear the gethash completions from the cache when there is an update?

[1]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/Classifier.cpp#652
The existing code doesn't know the difference between Completions coming from a "real" update from the server containing Completions in addition to Prefixes, and an "update" constructed from the result of a gethash request here:
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp?from=nsUrlClassifierDBService.cpp#697

That is, if it has completions, it doesn't know where they came from.

The logic there makes the assumption that a real update would consist of a mix of Completions and Prefixes, so that in that case, we'd not want to drop the Completions from the update (immediately).

I suspect that logic may also be required in order to pass the tests or to make TP work, as they do use Completions in updates.

If you split the Completions in two classes then you can make the behavior more correct.
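
As a rough illustration of that two-class split (std:: types standing in for the Gecko containers; the member names here are made up, not necessarily what the patch will use):

#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

using Completion = std::array<uint8_t, 32>;

class SplitCompletions {
 public:
  void ApplyUpdate(std::vector<Completion> fromUpdate) {
    mUpdateCompletions = std::move(fromUpdate);
    std::sort(mUpdateCompletions.begin(), mUpdateCompletions.end());
    // Dropping gethash results on update is always safe: the worst case
    // is one extra gethash request later.
    mGetHashCache.clear();
  }

  void AddGetHashResult(const Completion& c) {
    auto it = std::lower_bound(mGetHashCache.begin(), mGetHashCache.end(), c);
    if (it == mGetHashCache.end() || *it != c) {
      mGetHashCache.insert(it, c);
    }
  }

  bool Has(const Completion& c) const {
    return std::binary_search(mUpdateCompletions.begin(),
                              mUpdateCompletions.end(), c) ||
           std::binary_search(mGetHashCache.begin(), mGetHashCache.end(), c);
  }

 private:
  std::vector<Completion> mUpdateCompletions; // persisted with the store
  std::vector<Completion> mGetHashCache;      // memory only
};

With the two arrays separate, "clear the gethash cache on update" becomes a one-liner and the update code never has to guess where a Completion came from.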
(In reply to Gian-Carlo Pascutto [:gcp] from comment #18)
> The existing code doesn't know the difference between Completions coming
> from a "real" update from the server containing Completions in addition to
> Prefixes, and an "update" constructed from the result of a gethash request
> here:
> 
> That is, if it has completions, it doesn't know where they came from.
> 

I saw there is an |mLocalUpdate| flag[1] that seems like it could distinguish completions from update from completions from gethash. Why don't we use that? Or does the flag not work the way I thought?
I am going to remove it in this bug; I just want to make sure I understand the logic correctly.

Thanks in advance, and thanks for keeping up with all of my questions :)

[1]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#708
I think the problem is that that flag doesn't tell you where the *existing* Completions were from, just the ones you're receiving now.
Also note:

a) that flag controls the updates of database liveliness, so you can't outright remove it without also breaking that logic. (I guess if the database update goes away entirely it doesn't matter)

b) If you have a database with prefixes and completes (not from gethash) and receive an update consisting of only prefixes, we'd wipe the completes part of the DB. But no such database/server behavior exists in the wild that we've seen, which is why the current code still manages to get by.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #21)
> Also note:
> 
> a) that flag controls the updates of database liveliness, so you can't
> outright remove it without also breaking that logic. (I guess if the
> database update goes away entirely it doesn't matter)

Got it, thanks!
Attachment #8745251 - Attachment is obsolete: true
Attached patch (WIP) P1. Remove .cache files (obsolete) — Splinter Review
Just an update that I am working on writing the testcase (mochitest) right now.

The following is what I am going to test:
1. completions from update should work
2. completions from gethash should work
3. completions retrieved from update should still exist after a restart (not sure how to simulate a restart)
4. completions retrieved from gethash should not exist after a restart
5. an update will clear completions coming from gethash
6. an update will not clear completions coming from a previous update

Please let me know if anything else should be tested in this bug.
(In reply to Dimi Lee[:dimi][:dlee] from comment #29)
> The following is what I am going to test:
> 1. completions from update should work
> 2. completions from gethash should work
> 3. completions retrieved from update should still exist after a restart
> (not sure how to simulate a restart)
> 4. completions retrieved from gethash should not exist after a restart
> 5. an update will clear completions coming from gethash
> 6. an update will not clear completions coming from a previous update

That looks both right and complete to me.

I'm not sure either how we can simulate a restart though.
It seems we disable gethash for the "test-xxx" tables in many places[1][2]; we always assume a test database should only contain complete hashes.

But I would like to trigger a gethash request in the testcase. Do you have any suggestions?
Do we still want to block gethash for the "test-" databases?

[1]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#1630
[2]https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#866
Flags: needinfo?(francois)
Maybe it makes more sense to (only) block gethashes based on the "disallow_completions" list in the prefs? You can then simply unset/reset that pref in the relevant test.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #32)
> Maybe it makes more sense to (only) block gethashes based on the
> "disallow_completions" list in the prefs? You can then simply unset/reset
> that pref in the relevant test.

That sounds like a good solution. We always add the test lists to urlclassifier.disallow_completions anyway.
Flags: needinfo?(francois)
Depends on: 1272239
Thanks for the suggestion!
I opened bug 1272239 to remove the code that forbids "test-" lists from triggering completions. Then we can use "disallow_completions" in the testcase as suggested.
.cache files will only store completions from update
Attachment #8748997 - Attachment is obsolete: true
Attachment #8748999 - Attachment is obsolete: true
Attachment #8749001 - Attachment is obsolete: true
Attachment #8749002 - Attachment is obsolete: true
Attachment #8749003 - Attachment is obsolete: true
Attachment #8749004 - Attachment is obsolete: true
Attachment #8764811 - Flags: review?(francois)
Attachment #8764812 - Flags: review?(francois)
Attached patch P5. Implement UpdateCache (obsolete) — Splinter Review
Attachment #8764815 - Flags: review?(francois)
(In reply to Dimi Lee[:dimi][:dlee] from comment #40)
> Created attachment 8764816 [details] [diff] [review]
> P6. Testcase for stopping caching Safe Browsing completions to disk

Hi Francois,
I don't really have a good way to test a restart, so in this patch I add an API to "simulate" the behavior we perform on restart. Please let me know if you have any suggestions. Thanks!
Comment on attachment 8764811 [details] [diff] [review]
P1. Do not store completions from gethash to .cache files

Review of attachment 8764811 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/LookupCache.cpp
@@ +18,5 @@
>  // the updates from the protocol.
>  
>  // This module has its own store, which stores the Completions,
> +// mCache caching lookups that have happened over the net.
> +// Full length hashes retriving by updating are stored in

nit: "retrieved"

@@ +19,5 @@
>  
>  // This module has its own store, which stores the Completions,
> +// mCache caching lookups that have happened over the net.
> +// Full length hashes retriving by updating are stored in
> +// mUpateCompletions. The prefixes are cached/checked by looking

typo: missing "d" in mUpdateCompletions
Attachment #8764811 - Flags: review?(gpascutto)
Attachment #8764811 - Flags: review?(francois)
Attachment #8764811 - Flags: feedback+
Comment on attachment 8764812 [details] [diff] [review]
P2. Separate code path for gethash and update

Review of attachment 8764812 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/Classifier.cpp
@@ +697,5 @@
> +                        const nsACString& aTable)
> +{
> +  LOG(("Classifier::UpdateCache(%s)", PromiseFlatCString(aTable).get()));
> +
> +  // TODO : Implement it

I see that you've implemented this in Patch 5 so I think it makes sense to combine these two. There's not much point in splitting these patches if it leaves incomplete functions around.
Attachment #8764812 - Flags: review?(francois)
Attachment #8764815 - Flags: review?(francois)
Attachment #8764813 - Flags: review?(francois)
Attachment #8764814 - Flags: review?(francois)
Comment on attachment 8764811 [details] [diff] [review]
P1. Do not store completions from gethash to .cache files

Review of attachment 8764811 [details] [diff] [review]:
-----------------------------------------------------------------

Patches 1-5 are not very big and also aren't exactly stand-alone. I think they would be easier to read as a single patch.

Once you've merged them, I think gcp would be the best person to review them since he's a lot more familiar with this code than I am.
Attachment #8764811 - Flags: review?(gpascutto)
Comment on attachment 8764816 [details] [diff] [review]
Testcase for stopping caching Safe Browsing completions to disk

Review of attachment 8764816 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dimi Lee[:dimi][:dlee] from comment #41)
> I don't really have a good way to test a restart, so in this patch I add
> an API to "simulate" the behavior we perform on restart. Please let me
> know if you have any suggestions. Thanks!

I think you were going to try Dimi's suggestion (watching for changes to the nextupdatetime pref), right?
Attachment #8764816 - Flags: review?(francois)
(In reply to François Marier [:francois] from comment #45)
> Comment on attachment 8764816 [details] [diff] [review]
> P6. Testcase for stopping caching Safe Browsing completions to disk
> 
> Review of attachment 8764816 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> (In reply to Dimi Lee[:dimi][:dlee] from comment #41)
> > I don't really have a good way to test a restart, so in this patch I add
> > an API to "simulate" the behavior we perform on restart. Please let me
> > know if you have any suggestions. Thanks!
> 
> I think you were going to try Dimi's suggestion (watching for changes to the
> nextupdatetime pref), right?

Not this one; that is bug 1281083.
Hi gcp,
Could you help review this patch? Thanks!
This patch merges patches P1-P5, following francois's suggestion.
Attachment #8770117 - Flags: review?(gpascutto)
Attachment #8764816 - Attachment description: P6. Testcase for stopping caching Safe Browsing completions to disk → Testcase for stopping caching Safe Browsing completions to disk
Attachment #8764816 - Flags: review?(gpascutto)
Comment on attachment 8770117 [details] [diff] [review]
Merged patch for stopping caching Safe Browsing completions to disk.

Review of attachment 8770117 [details] [diff] [review]:
-----------------------------------------------------------------

I think there's a strong argument to be made that LookupCache shouldn't operate on the .cache files at all any more (and they should effectively be ignored and eventually disappear from new Firefox profiles), but just read in the mUpdateCompletions from .sbstore. 

https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/HashStore.cpp#87

I do regret not putting all the fixed-length data *before* the variable size, ByteSlice/zlib stuff, which would have made that trivial. Now you'd have to read in the lengths of the compressed chunks and skip over them, rather than being able to seek to the right place in that file directly. But that may still be preferable to spreading the IO over LookupCache.cpp and HashStore.cpp, and duplicating the data.

I can live with not doing this but think about it and let me know your opinion.

::: toolkit/components/url-classifier/Classifier.cpp
@@ +350,5 @@
> +  LOG(("Applying %d table gethashes.", aUpdates->Length()));
> +
> +  for (uint32_t i = 0; i < aUpdates->Length(); i++) {
> +    nsCString updateTable(aUpdates->ElementAt(i)->TableName());
> +    nsresult rv = UpdateCache(aUpdates, updateTable);

updateTable applies to aUpdates[i], so passing aUpdates rather than aUpdates[i] looks wrong. I think the callee is needlessly looping over the array again trying to find updates that match the passed table name.

Let's rework this to just pass aUpdates[i] directly?

@@ +701,5 @@
> +{
> +  LOG(("Classifier::UpdateCache(%s)", PromiseFlatCString(aTable).get()));
> +
> +  if (NS_FAILED(CheckValidUpdate(aUpdates, aTable))) {
> +    return NS_OK;

This if (NS_FAILED) return NS_OK looks pretty ugly. I'd suggest making CheckValidUpdate return a boolean instead.

::: toolkit/components/url-classifier/LookupCache.cpp
@@ +18,5 @@
>  // the updates from the protocol.
>  
>  // This module has its own store, which stores the Completions,
> +// mCache caching lookups that have happened over the net.
> +// Full length hashes retriving by updating are stored in

..retrieved..

@@ +43,5 @@
>  namespace mozilla {
>  namespace safebrowsing {
>  
>  const uint32_t LOOKUPCACHE_MAGIC = 0x1231af3e;
> +const uint32_t CURRENT_VERSION = 3;

This is going to blow away the SafeBrowsing data of every Firefox user and force a redownload. And it will happen every time if they have for example Release and Dev Edition and switch between the two.

https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/LookupCache.cpp#327
https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/LookupCache.cpp#138

It's not clear to me what it's trying to accomplish, either. The on-disk format didn't change?

There might be a concern that there are fullhash-completion-Completes in the .cache file, which we now no longer will want there. But the update logic in Classifier should get rid of these after the first remote update, as the .cache file is rebuilt from the contents of .sbstore, which won't have those.

@@ +172,5 @@
> +    if (mCache.BinaryIndexOf(aAddCompletes[i].CompleteHash()) == mCache.NoIndex) {
> +      mCache.AppendElement(aAddCompletes[i].CompleteHash());
> +    }
> +  }
> +  mCache.Sort();

I guess as soon as there are a few aAddCompletes this is indeed faster than InsertElementSorted.
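
For reference, the two strategies side by side, with std::vector standing in for nsTArray (illustration only):

#include <algorithm>
#include <vector>

// Batch: append everything, sort once, dedupe once.
// O((n + k) log (n + k)) for k new items -- wins once k is more than a few.
void AppendThenSort(std::vector<int>& sorted, const std::vector<int>& batch) {
  sorted.insert(sorted.end(), batch.begin(), batch.end());
  std::sort(sorted.begin(), sorted.end());
  sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());
}

// Per element: keeps the array sorted throughout, but each insert shifts
// the tail, so k inserts cost O(k * n) -- fine only for a handful of items.
void InsertSorted(std::vector<int>& sorted, int v) {
  auto it = std::lower_bound(sorted.begin(), sorted.end(), v);
  if (it == sorted.end() || *it != v) {
    sorted.insert(it, v);
  }
}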
Attachment #8770117 - Flags: review?(gpascutto) → review-
Comment on attachment 8764816 [details] [diff] [review]
Testcase for stopping caching Safe Browsing completions to disk

Review of attachment 8764816 [details] [diff] [review]:
-----------------------------------------------------------------

I don't seem to have toolkit/components/url-classifier/tests/mochitest/gethash.sjs

LGTM

::: toolkit/components/url-classifier/Classifier.cpp
@@ +381,5 @@
> +Classifier::GetLastUpdateTime(const nsACString& aTableName)
> +{
> +  int64_t age;
> +  bool found = mTableFreshness.Get(aTableName, &age);
> +  return found ? age * PR_MSEC_PER_SEC: 0;

nit: add a space after PR_MSEC_PER_SEC

Maybe parentheses around age * PR_MSEC_PER_SEC for readability?

::: toolkit/components/url-classifier/tests/mochitest/gethash.sjs
@@ +49,3 @@
>  
> +    var counter = getState("counter");
> +    counter = counter == "" ? "1" : (parseInt(counter) + 1).toString();

Can't getState/setState store integers?

::: toolkit/components/url-classifier/tests/mochitest/test_bug1254766.html
@@ +199,5 @@
> +      ok(gCurGethashCounter > gPreGethashCounter, "Gethash request is triggered."); })
> +    .then(reset);
> +}
> +
> +// This testcae is to make sure completions in update works:

nit: testcase
Attachment #8764816 - Flags: review?(gpascutto) → review+
(In reply to Gian-Carlo Pascutto [:gcp] from comment #49)
> Comment on attachment 8764816 [details] [diff] [review]
> Testcase for stopping caching Safe Browsing completions to disk
> 
> Review of attachment 8764816 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't seem to have
> toolkit/components/url-classifier/tests/mochitest/gethash.sjs
> 
The patch is currently based on bug 1272239, which has not yet landed.
Attachment #8764811 - Attachment is obsolete: true
Attachment #8764812 - Attachment is obsolete: true
Attachment #8764813 - Attachment is obsolete: true
Attachment #8764814 - Attachment is obsolete: true
Attachment #8764815 - Attachment is obsolete: true
Attachment #8770117 - Attachment is obsolete: true
Thanks for the review! This is really helpful!

(In reply to Gian-Carlo Pascutto [:gcp] from comment #48)
> Comment on attachment 8770117 [details] [diff] [review]
> Merged patch for stopping caching Safe Browsing completions to disk.
> 
> Review of attachment 8770117 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think there's a strong argument to be made that LookupCache shouldn't
> operate on the .cache files at all any more (and they should effectively be
> ignored and eventually disappear from new Firefox profiles), but just read
> in the mUpdateCompletions from .sbstore. 
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> 88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/
> HashStore.cpp#87
> 
> I do regret not putting all the fixed-length data *before* the variable
> size, ByteSlice/zlib stuff, which would have made that trivial. Now you'd
> have to read in the lengths of the compressed chunks and skip over them,
> rather than being able to seek to the right place in that file directly. But
> that may still be preferable to spreading the IO over LookupCache.cpp and
> HashStore.cpp, and duplicating the data.
> 
> I can live with not doing this but think about it and let me know your
> opinion.
> 

To my understanding there were originally two reasons to have both .cache and .sbstore; please correct me if I am wrong:
1. .cache contains completions from both gethash and update, while .sbstore only has completions from update
2. .cache is not as complex as .sbstore, so it is faster to read all the completions from it

Since we no longer want to keep completions from gethash, the only remaining reason to keep .cache is fast access (that's the reason I kept it in the first patch).

But I think you are right: we can read completions directly from .sbstore as long as we don't have to parse the complex data structures to get the data we want. Then we can get rid of the .cache file entirely.

I implemented reading the fixed-size completions directly from .sbstore by calculating their offset from the end of the file, so we don't have to compute the size of the prefix data. Please check the patch to see if you agree with this approach; if you are OK with it, then I think we can keep the .sbstore file format as it is now.
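
The idea in a rough sketch, with plain <fstream> standing in for the real stream code (the layout details here are simplified, and the count would come from the fixed-size store header):

#include <array>
#include <cstdint>
#include <fstream>
#include <ios>
#include <vector>

constexpr std::size_t kCompletionSize = 32; // full-length SHA-256 hash

std::vector<std::array<uint8_t, kCompletionSize>>
ReadCompletionsFromEnd(std::ifstream& store, std::uint32_t numCompletions) {
  // The completions are fixed-size and stored at the tail of the file,
  // so we can seek straight to them without parsing the variable-length
  // (ByteSliced/zlib-compressed) prefix data at all.
  store.seekg(-static_cast<std::streamoff>(numCompletions * kCompletionSize),
              std::ios::end);
  std::vector<std::array<uint8_t, kCompletionSize>> out(numCompletions);
  for (auto& c : out) {
    store.read(reinterpret_cast<char*>(c.data()), kCompletionSize);
  }
  return out;
}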

> >  const uint32_t LOOKUPCACHE_MAGIC = 0x1231af3e;
> > +const uint32_t CURRENT_VERSION = 3;
> 
> This is going to blow away the SafeBrowsing data of every Firefox user and
> force a redownload. And it will happen every time if they have for example
> Release and Dev Edition and switch between the two.
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> 88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/
> LookupCache.cpp#327
> https://dxr.mozilla.org/mozilla-central/rev/
> 88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/
> LookupCache.cpp#138
> 
> It's not clear to me what it's trying to accomplish, too. The on-disk format
> didn't change?
> 
> There might be a concern that there are fullhash-completion-Completes in the
> .cache file, which we now no longer will want there. But the update logic in
> Classifier should get rid of these after the first remote update, as the
> .cache file is rebuilt from the contents of .sbstore, which won't have those.
> 
Yes, initially I bumped the version because there were fullhash-completion-Completes that we no longer want. But since we are now going to remove the .cache file entirely, this part will be removed.
Attachment #8770954 - Flags: review?(gpascutto)
Whiteboard: #sbv4-m0
Comment on attachment 8770954 [details] [diff] [review]
P1. Stop caching Safe Browsing completions to disk. v1

Review of attachment 8770954 [details] [diff] [review]:
-----------------------------------------------------------------

Doing the read by seeking from the end of the file is a great idea! The code simplifies a whole lot now that we can get rid of the .cache files, so these patches are great. Unfortunately, you'll have to do a bit more work to refactor HashStore::Open() to not do any extra work, else I foresee an extra performance penalty on startup. So r- because of the latter.

::: toolkit/components/url-classifier/LookupCache.cpp
@@ +19,5 @@
> +// This module has its own store, which stores the PrefixSet,
> +// mCache caching lookups that have happened over the net.
> +// Full length hashes retrieved by updating are stored in
> +// mUpateCompletions. The prefixes are cached/checked by looking
> +// them up in the PrefixSet.

Let's reword this, as PrefixSet isn't stored in LookupCache.

// This module provides a front for PrefixSet, mUpdateCompletions, 
// and mCache, which together contain everything needed to
// provide a classification as long as the data is up to date.

// PrefixSet stores and provides lookups for 4-byte prefixes.
// mUpdateCompletions contains 32-byte completions which were
// contained in updates. They are retrieved from HashStore/.sbstore
// on startup.
// mGetHashCache contains 32-byte completions which were
// returned from the gethash server. They are not serialized,
// only cached until the next update.

@@ +234,2 @@
>  
> +  nsresult rv = store.Open();

https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/url-classifier/HashStore.cpp#230

That function is now doing too much to be good for you.

a) fileSize should probably be a field in HashStore, so you don't have to open up things again in ReadCompletions just to get the fileSize (again) which you just retrieved in Open().

b) We probably don't care about detecting corruption when we just open the file to read Completions (which would be at startup). It's OK to delay detecting corruption (i.e. CheckCheckSum) until we do an update. That avoids reading in the entire database at startup.

c) You don't care about ReadChunkNumbers either.

I suggest splitting HashStore::Open() into Open() and PrepareForUpdate() and moving (b) and (c) into the latter, so they don't get called when you're just setting up LookupCache.

::: toolkit/components/url-classifier/LookupCache.h
@@ +139,5 @@
>    RefPtr<nsUrlClassifierPrefixSet> mPrefixSet;
> +  // Full length hashes obtained in update request
> +  CompletionArray mUpdatedCompletions;
> +  // Full length hashes obtained in gethash request
> +  CompletionArray mCache;

mGetHashCache?
Attachment #8770954 - Flags: review?(gpascutto) → review-
Hi gcp,
Thanks for your suggestion! This patch addresses comment 53.
But there is one change in this patch I would like to highlight:

I moved |ReadCompletions| to a private function. Callers now use AddChunk()/AddCompletes() directly; inside AddChunk/AddCompletes we first read the data from disk if it hasn't been read before.

This change is needed because reading chunks/completes no longer happens in HashStore::Open(); deferring the read this way makes sure a caller can never see empty data. A sketch of the pattern follows below.
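
The pattern, as a sketch (names approximate, error handling reduced to a bool; not the actual patch):

class HashStoreSketch {
 public:
  bool AddChunk(/* parsed chunk data */) {
    if (!ReadEntriesIfNeeded()) {
      return false;
    }
    // ... apply the add chunk to the in-memory arrays ...
    return true;
  }

  bool AddCompletes(/* parsed completions */) {
    if (!ReadEntriesIfNeeded()) {
      return false;
    }
    // ... merge the completions ...
    return true;
  }

 private:
  // Deferred out of Open(): merely opening the store at startup no longer
  // pays for parsing data the caller may never need, and callers can no
  // longer observe a store whose entries haven't been read yet.
  bool ReadEntriesIfNeeded() {
    if (mEntriesRead) {
      return true;
    }
    // ... read chunk numbers, prefixes, and completions from disk ...
    mEntriesRead = true;
    return true;
  }

  bool mEntriesRead = false;
};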
Attachment #8770954 - Attachment is obsolete: true
Attachment #8771899 - Flags: review?(gpascutto)
Comment on attachment 8771899 [details] [diff] [review]
P1. Stop caching Safe Browsing completions to disk. v2

Review of attachment 8771899 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.

::: toolkit/components/url-classifier/HashStore.cpp
@@ +470,5 @@
> +
> +nsresult
> +HashStore::BeginUpdate()
> +{
> +  // Check of the file is corrupted and read the rest of the store

...check whether...

@@ +479,4 @@
>    // Close input stream, won't be needed any more and
>    // we will rewrite ourselves.
>    if (mInputStream) {
> +    mFileSize = 0;

I'd rather reset this at the point where we open the file for writing (which is really the place where it'll become 0-sized).

::: toolkit/components/url-classifier/LookupCache.h
@@ +138,4 @@
>    // Set of prefixes known to be in the database
>    RefPtr<nsUrlClassifierPrefixSet> mPrefixSet;
> +  // Full length hashes obtained in update request
> +  CompletionArray mUpdatedCompletions;

grammar: mUpdateCompletions (they haven't been updated, rather they belong to/come from an update)
Attachment #8771899 - Flags: review?(gpascutto) → review+
Attachment #8771899 - Attachment is obsolete: true
Hi gcp,
I have a question about |ApplyTableUpdates|: currently we loop through the entire TableUpdates array and apply an update if the table name matches[1]. This implies it is possible for more than one |TableUpdate| object to exist with the same table name.

To my understanding this may happen when gethash returns multiple results for the same table[2], but can it happen when we process an update response? From what I saw, |TableUpdates| will not contain two objects with the same table name[3][4]. If that is correct, I think I should make some changes to this patch, because the code paths for gethash and update are separate now.

[1] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/Classifier.cpp#612
[2] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#936
[3] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#528
[4] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/ProtocolParser.cpp#683
Flags: needinfo?(gpascutto)
(In reply to Dimi Lee[:dimi][:dlee] from comment #57)
> Hi gcp,
> I have a question about |ApplyTableUpdates|: currently we loop through the
> entire TableUpdates array and apply an update if the table name
> matches[1]. This implies it is possible for more than one |TableUpdate|
> object to exist with the same table name.

Or a TableUpdates array could contain updates for multiple tables. I do not know if the code can construct it like that; you'd have to check.

> To my understanding this may happen when gethash returns multiple results
> for the same table[2], but can it happen when we process an update
> response? From what I saw, |TableUpdates| will not contain two objects
> with the same table name[3][4]. If that is correct, I think I should make
> some changes to this patch, because the code paths for gethash and update
> are separate now.

For the code you linked, note that there is GetTableUpdate and GetTableUpdateS. Yes, that's horrible naming, sorry!
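
For illustration, the distinction is roughly this (paraphrased with std:: types; check ProtocolParser.h for the real signatures):

#include <memory>
#include <string>
#include <utility>
#include <vector>

struct TableUpdate {
  explicit TableUpdate(std::string name) : mName(std::move(name)) {}
  const std::string& TableName() const { return mName; }
 private:
  std::string mName;
};

class ProtocolParserSketch {
 public:
  // Singular: find-or-create the one TableUpdate for a given table name,
  // so parsing an update response yields at most one object per table.
  TableUpdate* GetTableUpdate(const std::string& table) {
    for (auto& u : mUpdates) {
      if (u->TableName() == table) {
        return u.get();
      }
    }
    mUpdates.push_back(std::make_unique<TableUpdate>(table));
    return mUpdates.back().get();
  }

  // Plural: the whole pending array, which may span several tables.
  const std::vector<std::unique_ptr<TableUpdate>>& GetTableUpdates() const {
    return mUpdates;
  }

 private:
  std::vector<std::unique_ptr<TableUpdate>> mUpdates;
};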
Flags: needinfo?(gpascutto)
See Also: → 1289028
Depends on: 1289028
See Also: 1289028
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/fx-team/rev/482f81651a2e
Stop caching Safe Browsing completions to disk. r=gcp
Keywords: checkin-needed
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/3bc5415cc37d
Backed out changeset 482f81651a2e for failing on own test
Depends on: 1291024
Hi francois, gcp,
After digging into the try failure, I found the root cause: on certain platforms the testcase may run before SafeBrowsing.jsm is initialized, so nothing is ready at that moment.

Whether the problem occurs depends on timing, because we delay initializing SafeBrowsing.jsm for 2 seconds[1] (bug 778855, bug 779008).

So I guess the possible solutions are:
1. Fix bug 779008: put addMozEntries somewhere else and remove the setTimeout[1].
2. Split the init of Safe Browsing into two parts, put addMozEntries in the second part, and only delay that one.
3. Do not use the setTimeout for tests, but I am not sure whether this might affect other tests.
4. Set a flag indicating that SafeBrowsing.jsm is initialized; the testcase should not run until the flag is set (see the sketch below the reference). But the problem is where to put this flag: listmanager? dbserver? a preference?

Personally I prefer the first two approaches, but I would like to know your opinion. Do you have any suggestions? Thanks!

[1] https://dxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#1180
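
A sketch of what option 4 could look like, as a one-shot latch that also handles the "already initialized" case (every name here is invented):

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class InitLatch {
 public:
  // Called once, at the end of the (delayed) Safe Browsing init.
  void NotifyInitialized() {
    std::vector<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mLock);
      mInitialized = true;
      pending.swap(mWaiters);
    }
    for (auto& cb : pending) {
      cb();
    }
  }

  // Testcases register here instead of racing the 2-second delay.
  void RunWhenInitialized(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> lock(mLock);
      if (!mInitialized) {
        mWaiters.push_back(std::move(cb));
        return;
      }
    }
    cb(); // already initialized: run immediately
  }

 private:
  std::mutex mLock;
  bool mInitialized = false;
  std::vector<std::function<void()>> mWaiters;
};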
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
Take a look at https://bugzilla.mozilla.org/show_bug.cgi?id=1244259

SafeBrowsing testcases that require the service to be initialized, or the test entries to be present, should probably hold until that event has fired.
Flags: needinfo?(gpascutto)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #64)
> Take a look at https://bugzilla.mozilla.org/show_bug.cgi?id=1244259
> 
> SafeBrowsing testcases that require the service to be initialized, or the
> test entries to be present, should probably hold until that event has fired.

Thanks for the info!
Flags: needinfo?(francois)
Flags: needinfo?(dlee)
Attached patch Patch to fix try error (obsolete) — Splinter Review
Hi gcp,
Could you help review this patch? Thanks!
Attachment #8778793 - Flags: review?(gpascutto)
Comment on attachment 8778793 [details] [diff] [review]
Patch to fix try error

Obsolete because I uploaded the wrong file
Attachment #8778793 - Attachment is obsolete: true
Attachment #8778793 - Flags: review?(gpascutto)
Attached patch Patch to fix try error (obsolete) — Splinter Review
Attachment #8778795 - Flags: review?(gpascutto)
Comment on attachment 8778795 [details] [diff] [review]
Patch to fix try error

Review of attachment 8778795 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/tests/mochitest/classifierCommon.js
@@ +55,5 @@
>    sendAsyncMessage("reloadSuccess");
>  }
>  
> +// SafeBrowsing.jsm is initialized after mozEntries are added.
> +// Add observer to receive "finished" event. If this function

Wording: For the case when this function is called after the event had already been notified...
Attachment #8778795 - Flags: review?(gpascutto) → review+
Attachment #8776427 - Attachment is obsolete: true
Attachment #8778795 - Attachment is obsolete: true
Comment on attachment 8780334 [details] [diff] [review]
Patch - Stop caching Safe Browsing completions to disk

carry r+
Attachment #8780334 - Flags: review+
Keywords: checkin-needed
Priority: -- → P2
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/fx-team/rev/05e5ceb799a1
Stop caching Safe Browsing completions to disk. r=gcp
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/05e5ceb799a1
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51