Closed Bug 1338970 Opened 7 years ago Closed 7 years ago

Refine DB update scheme: update in a separate directory and swap in when finished

Categories

(Toolkit :: Safe Browsing, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: hchang, Assigned: hchang)

References

Details

Attachments

(3 files, 3 obsolete files)

In order to improve the overall DB update blocking time in Bug 1338017, we need to firstly change the DB update scheme to

1) Copy the current DB to a backup directory (we actually already do this)
2) Update against the backup directory (instead of the one in use)
3) Swap the backup and the in-use directory. This is only the operation    
   where lock is required. After swapping, remove the backup directory.

For some special cases:

1) If we met crash during update, the in-use directory will still be sane.
   In other words, if the backup directory is present after bootup, it implies
   a crash occurred in the previous update and we can just remove the backup
   directory. 

   Question: backward compatibility. If we were using an old firefox and hitting
             update crash, the backup directory is sane but the in-use directory
             is be a junk. What if we restart with new firefox under this 
             configuration?

   Possible solution: just use a different name from "safebrowsing-backup" and
                      tell apart the two cases.

2) anything else?
Blocks: 1338017
Assignee: nobody → hchang
Attached image Safebrowsing new DB update scheme.png (obsolete) —
Summary: Improve DB update scheme to reduce lock time → Refine DB update scheme: update in a separate directory and swap in when finished
Hi Francois,

I haven't tested this patch but it should be not far from what I imagined.
Please refer to the attached diagram and the patch for a early sense that
what I am going to do.

There is totally no worries about the DB concurrent read/write issue since
this bug will not change the thread that the any Classifier function will 
be running on. Moving update code to a separate thread will be done in 
other bug.

I'll keep refining my patch but the basic idea will not change. You can
take a quick look at the diagram to make sure if we are on the same page.

Thanks.
Flags: needinfo?(francois)
Attached image Safebrowsing new DB update scheme.png (obsolete) —
Attachment #8836713 - Attachment is obsolete: true
Attachment #8836716 - Flags: review?(gpascutto)
Attachment #8836716 - Flags: review?(francois)
Flags: needinfo?(francois)
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review113636

I think the way this is written potentially leads to total data loss as you're missing at least one recovery scenario.

::: toolkit/components/url-classifier/Classifier.cpp:569
(Diff revision 3)
> +
> +  LOG(("Swapping directories %s and %s...", dirName1.get(),
> +                                            dirName2.get()));
> +
> +  // 1. Rename "dirName1" to "temp"
> +  rv = aDir1->RenameToNative(nullptr, TEMP_DIR_NAME);

What happens if we crash right after this operation? We've just moved mRootStoreDirectory away, and we haven't moved mUpdatingDirectory in its place.

The recovery code only tries to delete an incomplete updating directory.

At this point we don't know where our valid data is so we just lost everything.

::: toolkit/components/url-classifier/Classifier.cpp:573
(Diff revision 3)
> +  // 1. Rename "dirName1" to "temp"
> +  rv = aDir1->RenameToNative(nullptr, TEMP_DIR_NAME);
> +  if (NS_FAILED(rv)) {
> +    LOG(("Unable to rename %s to %s", dirName1.get(),
> +                                      TEMP_DIR_NAME.get()));
> +    return rv; // Nothting to roll back.

typo: Nothing

::: toolkit/components/url-classifier/Classifier.cpp:583
(Diff revision 3)
> +  //      object points to.
> +  nsCOMPtr<nsIFile> tempDirectory;
> +  rv = aParentDir->Clone(getter_AddRefs(tempDirectory));
> +  rv = tempDirectory->AppendNative(TEMP_DIR_NAME);
> +
> +  // 2. Rename "dirName2" to "dirName2".

typo: dirName1
Attachment #8836716 - Flags: review?(gpascutto) → review-
I think the patch has been ready for review. The goal of the patch is described in comment 0.
As for some implementation detail:

1) Most of the work is done in Classifier, including
   - Introduce mUpdatingDirectory (i.e. "safebrowsing-updating" for on-going update)
   - Remove the use of mToDeleteDirectory/mBackupDirectory but still do clean and recover
     in the very beginning just for backward compatibility.
   - Implement |SwapDirectoryContent|. It will be used in |SwapInUpdatedTables|.
   - Get rid of "2-step deletion". In the past, when we remove backup directory, we firstly
     rename it to "safebrowsing-to_delete" then do the actual removal. I don't see any
     necessity.

2) The overall flow is depicted in the attached diagram. The text version is
   - In Classifier::OpenDb(), recover "safebrowsing-backup" (if existed),
                              remove "safebrowsing-to_delete" and "safebrowsing-updating"
   - In Classifier::ApplyUpdate, we 
       a) Copy "safebrowsing" to "safebrowsing-updating",
       b) Do in-memory update
       c) Write out to "safebrowsing-updating"
       d) If (b) or (c) failed, dump "safebrowsing" to "safebrowsing-failedupdate",
          remove "safebrowsing-updating" and finish update.
       e) Swap directory "safebrowsing" and "safebrowing-updating". After swap,
          "safebrowsing" will contain the updated DB and "safebrowsing-updating"
          will contain outdated DB.
       f) Remove "safebrowsing-updating"

3) LookupCache::WriteFile, LookupCacheV4::WriteMetadat and HashStore::WriteFile
   will take an additional argument for "write out location", which was the same
   one for read. Note that we have to infer the actual write location by using
   |Classifier::GetPrivateStoreDirectory|.
  
All the changes will not change the threading. I.e, everything will still run
on the worker thread until Bug 1339050.
Status: NEW → ASSIGNED
Priority: -- → P2
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review113636

I've addressed the case which you pointed out and can't think of others. All possible snapshots for a crash are

0) nothing: nothing to do.
1) "safebrowsing": Perfect!
2) "safebrowsing-updating" (new) and "safebrowsing" (old): new DB may have been downloaded but not active. Acceptable.
3) "safebrowsing-backup"   (new) and "safebrowsing" (old): The case you pointed out. (assume we rename "updating" to "backup" first) 
                                                           The new DB will be recovered after reboot. 
4) "safebrowsing-backup"   (new) and "safebrowsing-updating" (old): The new DB will be recovered after reboot.
5) "safebrowsing"          (new) and "safebrowsing-updating" (old): Perfect!

New patch with a little change around "SwapInUpdatedTables" will be submitted later.
I forgot to indicate the "in-memory" tables should be also swapped atomically with in-disk tables.
But it will still be fine because we haven't introduce concurrent lookup and update. 
They still both on the same thread in this patch.

> What happens if we crash right after this operation? We've just moved mRootStoreDirectory away, and we haven't moved mUpdatingDirectory in its place.
> 
> The recovery code only tries to delete an incomplete updating directory.
> 
> At this point we don't know where our valid data is so we just lost everything.

My solution is using "safebrowsing-backup" as the intermediary of swapping. If we crash after renaming "safebrowsing" to "safebrowsing-backup", we will recover the databases from "safebrowsing-backup". (I kept the recovery code for backward
compatibility at first but it turned out also effective in this case.)
Blocks: 1340179
I just greatly simplified the entire design to make the changes less error-prone:

Background: 

Classifier maintains a list of LookupCache, which is the only interface to
read/write DB from/into disk and update in-memory data. In theory we don't 
need such a list: just create a new one whenever needed. However, creating
a LookupCache requires opening files and data parsing. So we maintain
a list of created LookupCache to avoid re-creation.

Original solution (in the current patch):

In order to have a "background-y" update, I modified LookupCache::WriteFile()
so that we can assign a "special" directory to output files. Besides, I have to
also do something like in Bug 1340179 to deal with the in-memory update issue.

Simplified solution:

Bug 1340179 is fairly simple but really annoying. We have to take care of
every implementation detail. For example, we need to know what member variables 
of LookupCache will be mutated while doing an update.

The idea behind the simplification is we don't have to stick to the current
maintained list of LookupCache. Analogue to writing file to a "temp" directory,
we can just allocate new LookupCaches for update only. Writing file and updating
in-memory data will not interfere the in-use ones.

Assume we originally had a list of LookupCaches like

[goog-phish-shavar, goog-unwanted-shavar]

and we've just downloaded something in memory for update. In addition to 
copying "safebrowsing" directory to "safebrowsing-updating", we "conceptually" 
clone the currently maintained LookupCache list. Since we haven't implemented
deep copy function or copy constructor for LookupCache, "re-open" is required 
right now.

So after update, we have

in-use but old list |mLookupCache|    => [goog-phish-shavar, goog-unwanted-shavar]
new list            |mNewLookupCache| => [goog-phish-shavar, goog-unwanted-shavar, goog-malware-shavar]

and can then simply swap these two lists.
Attachment #8837079 - Attachment is obsolete: true
The latest diagram attachment 8838502 [details]  and patch revision 5 are for the new design in comment 11.
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review115406

::: toolkit/components/url-classifier/Classifier.cpp:620
(Diff revision 5)
> +void
> +Classifier::RemoveUpdateIntermediaries()
> +{
> +  // Remove old LookupCaches.
> +  for (auto c: mNewLookupCaches) {
> +    delete c;

Maybe it's possible to add a debug assert somewhere, checking that if we are not in an update, this array is indeed cleared. Right now it seems all paths are properly covered, but manual memory management is something that can easily regress.

::: toolkit/components/url-classifier/Classifier.cpp:747
(Diff revision 5)
>  
>    } // End of scopedUpdatesClearer scope.
>  
> -  rv = RegenActiveTables();
> -  NS_ENSURE_SUCCESS(rv, rv);
> -
> +  rv = SwapInNewTablesAndCleanup();
> +  if (NS_FAILED(rv)) {
> +    LOG(("Failed to swap in new tables."));

Does it make sense to propagate the error here? If we can't update the tables, does it make sense to do a full reset? (We don't ever want to get stuck with non-updatable, old tables)
Attachment #8836716 - Flags: review?(gpascutto) → review+
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review115500

Looks good, just a few suggestions to make the code clearer.

::: toolkit/components/url-classifier/Classifier.cpp:588
(Diff revision 5)
> +  rv = aDir2->RenameToNative(nullptr, dirName1);
> +  if (NS_FAILED(rv)) {
> +    LOG(("Failed to rename %s to %s. Rename temp directory back to %s",
> +         dirName2.get(), dirName1.get(), dirName1.get()));
> +    nsresult rbrv = tempDirectory->RenameToNative(nullptr, dirName1);
> +    if (NS_FAILED(rbrv)) {

Given that we can't recover from this failure, should we do like in Step 3 and only have `NS_ENSURE_SUCCESS(rbrv, rbrv)` here?

::: toolkit/components/url-classifier/Classifier.cpp:626
(Diff revision 5)
> +  }
> +  mNewLookupCaches.Clear();
> +
> +  // Remove the "old" directory. (despite its looking-new name)
> +  if (NS_FAILED(mUpdatingDirectory->Remove(true))) {
> +    LOG(("Failed to remove updating directory but we don't care."));

Why do we not care?

Maybe you can have "Failed to remove updating directory." in the LOG and then a comment above that explains why we ignore that error.

::: toolkit/components/url-classifier/Classifier.cpp:653
(Diff revision 5)
> +
> +  // Step 2. Swap in in-memory tables.
> +  mLookupCaches.SwapElements(mNewLookupCaches);
> +
> +  // Step 3. Re-generate active tables based on on-disk tables.
> +  // XXX: If step 2 failed, should we re-generate?

(Answering the question you ask) Is there an alternative? If we don't have a working lookup cache, Safe Browsing won't work, no?

We should answer the question before landing and remove that comment :)

::: toolkit/components/url-classifier/Classifier.cpp:997
(Diff revision 5)
>  }
>  
> -nsresult
> -Classifier::RemoveBackupTables()
> +void
> +Classifier::CopyInUseLookupCacheForUpdate()
>  {
> -  nsCString toDeleteName;
> +  // TODO: Implement deep copy function or copy ctor. For now

Is this something we should open a follow-up bug for?

::: toolkit/components/url-classifier/Classifier.cpp:998
(Diff revision 5)
>  
> -nsresult
> -Classifier::RemoveBackupTables()
> +void
> +Classifier::CopyInUseLookupCacheForUpdate()
>  {
> -  nsCString toDeleteName;
> -  nsresult rv = mToDeleteDirectory->GetNativeLeafName(toDeleteName);
> +  // TODO: Implement deep copy function or copy ctor. For now
> +  // we just re-open every table we have opened. Another option

Can you clarify what "re-opening every table we have opened" does?

Unless I'm misunderstanding your code, I think what this does is: "Populate the new lookup cache based on the tables in the updating directory by loading them from disk."
Attachment #8836716 - Flags: review?(francois) → review+
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review115406

Thanks!

> Maybe it's possible to add a debug assert somewhere, checking that if we are not in an update, this array is indeed cleared. Right now it seems all paths are properly covered, but manual memory management is something that can easily regress.

I might also have to add debug assert for intermediary directory.

> Does it make sense to propagate the error here? If we can't update the tables, does it make sense to do a full reset? (We don't ever want to get stuck with non-updatable, old tables)

It makes total sense. Thanks for reminding :)
Comment on attachment 8836716 [details]
Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in)

https://reviewboard.mozilla.org/r/112070/#review115500

Thanks, too!

> Given that we can't recover from this failure, should we do like in Step 3 and only have `NS_ENSURE_SUCCESS(rbrv, rbrv)` here?

Yes I think we should :)

> Why do we not care?
> 
> Maybe you can have "Failed to remove updating directory." in the LOG and then a comment above that explains why we ignore that error.

Because it's just a clean up and the failure will not cause anything to misbahave until the next udpate. I will add it to the comment then.

> (Answering the question you ask) Is there an alternative? If we don't have a working lookup cache, Safe Browsing won't work, no?
> 
> We should answer the question before landing and remove that comment :)

The question should no longer be a question because step 2 is infalliable and the step 3 will not be reached if step 1 failed. 

To sum up, if we failed to swap in update intermediaries, every thing in-use should remain untouched (since we just didn't touch the in-use stuff.)

> Is this something we should open a follow-up bug for?

Sure and I'll not add dependency here because it's a kind of optimization.

> Can you clarify what "re-opening every table we have opened" does?
> 
> Unless I'm misunderstanding your code, I think what this does is: "Populate the new lookup cache based on the tables in the updating directory by loading them from disk."

In this stage, the update hasn't started and we'd only like to "mirror" the already opended LookupCaches for update. (just like we copy "safebrowsing" to "safebrowsing-updating" for update.)
Target Milestone: --- → mozilla53
Pushed by hchang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2c1fc455e81e
Introduce 2-step DB update scheme. (background update then swap in) r=francois,gcp
https://hg.mozilla.org/mozilla-central/rev/2c1fc455e81e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: mozilla53 → mozilla54
Depends on: 1342397
You need to log in before you can comment on or make changes to this bug.