Closed Bug 1339050 Opened 8 years ago Closed 8 years ago

Run DB update off the worker thread with real concurrency with table lookup

Categories

(Toolkit :: Safe Browsing, defect, P2)


Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: hchang, Assigned: hchang)

References

Details

Attachments

(2 files, 3 obsolete files)

This is a sub-task of Bug 1338017. The goal is to move the update code off the worker thread, which requires some mutex work in Classifier.
Assignee: nobody → hchang
Blocks: 1338017
Depends on: 1339760
Summary: Move DB update off the worker thread → Run DB update off the worker thread with real concurrency with table lookup
Depends on: 1340179
Status: NEW → ASSIGNED
Priority: -- → P2
Attachment #8844749 - Flags: review?(gpascutto)
Attachment #8844749 - Flags: review?(francois)
I spotted many try errors [1], and most of them were caused by misuse of nsIUrlClassifierDBService.beginUpdate() in test cases such as [2]. nsIUrlClassifierDBService.beginUpdate() might fail due to an ongoing update. I have reviewed all uses of nsIUrlClassifierDBService.beginUpdate(); the production code is still fine, since the StreamUpdater checks the return value of beginUpdate() and queues the request if beginUpdate() fails. (The queued requests are only re-processed when a new request comes in; a timer could be added here to improve this. A sketch of this queuing behaviour follows below.)

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6eb50561a59bb1be1e83380c23f81c32aace741&selectedJob=82351323
[2] http://searchfox.org/mozilla-central/rev/d4adaabd6d2f88be0d0b151e022c06f6554909da/toolkit/components/url-classifier/tests/UrlClassifierTestUtils.jsm#89
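To illustrate the queuing behaviour described above, here is a minimal, self-contained sketch in plain C++. This is not the actual nsUrlClassifierStreamUpdater code; PendingRequest and StartUpdate are hypothetical names standing in for the real machinery.

```
#include <deque>
#include <string>

struct PendingRequest {
  std::string tables;
  std::string updateUrl;
};

class StreamUpdaterSketch {
 public:
  // Mirrors how downloadUpdate() reacts when beginUpdate() fails with
  // NS_ERROR_NOT_AVAILABLE: the request is queued rather than dropped.
  bool DownloadUpdate(const PendingRequest& aRequest) {
    mQueue.push_back(aRequest);
    if (mInUpdate) {
      // beginUpdate() would fail here; the queued request is only retried
      // the next time DownloadUpdate() is called (a timer would avoid that).
      return false;
    }
    mInUpdate = true;
    PendingRequest next = mQueue.front();
    mQueue.pop_front();
    StartUpdate(next);  // placeholder for the actual fetch/apply work
    return true;
  }

  void UpdateFinished() { mInUpdate = false; }

 private:
  void StartUpdate(const PendingRequest&) { /* fetch and apply the update */ }

  bool mInUpdate = false;
  std::deque<PendingRequest> mQueue;
};
```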
Depends on: 1345922
Other than the primary change (asynchronously dispatching the background update task to the update thread), here is more explanation about the patch:

1) Refined the mLookupCaches/mNewLookupCaches swap scheme.

Original: we mirror mLookupCaches to mNewLookupCaches, apply updates on mNewLookupCaches, then swap them.

Refined: we start with an empty mNewLookupCaches. If a table needs to be updated, mNewLookupCaches grows by appending a newly created LookupCache. After applying updates, instead of swapping mLookupCaches/mNewLookupCaches, we "merge" mNewLookupCaches into mLookupCaches. That is, each entry in mNewLookupCaches will either
  i) replace exactly one entry (in mLookupCaches) which has the same table name, or
  ii) be appended to mLookupCaches if its table name is not found in mLookupCaches.
After the merge, mNewLookupCaches will store the "replaced" (old) LookupCaches. (A minimal sketch of the merge step appears right after this comment.)

2) Introduced a "Reset Mutex" and a flag "mUpdateInterrupted" to deal with the following case:

Failing to open a table (on the worker thread, happens in GetLookupCache()) while copying "safebrowsing" to "safebrowsing-updating" (on the update thread): the table will be removed and Classifier::Reset() will be called.

Consider the following possible sequences. (CopyDirectory()/LookupCache::Reset()/Classifier::Reset() are mutually exclusive, and Classifier::Reset() will set mUpdateInterrupted to 1.)

a)
 Worker thread         | Update thread
------------------------------------------------
                       | CopyDirectory()
 LookupCache::Reset()  | Updating...
 Classifier::Reset()   |
                       |

CopyDirectory() wins all the races. The update will not be interrupted until it reaches an "mUpdateInterrupted" checkpoint.

b)
 Worker thread         | Update thread
------------------------------------------------
 LookupCache::Reset()  |
                       | CopyDirectory()
 Classifier::Reset()   | Updating...

CopyDirectory() loses the race with LookupCache::Reset() but beats Classifier::Reset(). Similar to (a): the update will not be interrupted until it reaches an "mUpdateInterrupted" checkpoint.

c)
 Worker thread         | Update thread
------------------------------------------------
 LookupCache::Reset()  |
 Classifier::Reset()   |
                       | CopyDirectory()
                       | Updating...

CopyDirectory() loses all the races. We can add an "mUpdateInterrupted" check right after CopyDirectory() acquires the mutex, to interrupt the update even earlier.

To sum up: during an update, if GetLookupCache() (on the worker thread) fails and a reset is required, we have to stop the update as soon as possible.

3) Dealt with the racing issue between BeginUpdate() and NotifyUpdateObserver().

|nsUrlClassifierDBService::mInUpdate| [1] used to work well because "notifying the update observer" was done synchronously in Worker::FinishUpdate(). Even if the update observer hasn't been notified at that point, we can still dispatch BeginUpdate(), since it will NOT be executed on the worker thread until the previous Worker::FinishUpdate() returns.

However, since some tasks in Worker::FinishUpdate() have been moved to another thread, the update observer will NOT have been notified when Worker::FinishUpdate() returns. If we only check |mInUpdate|, the following sequence might happen on the worker thread:

Worker::FinishUpdate()          // for update 1
Worker::BeginUpdate()           // for update 2
Worker::NotifyUpdateObserver()  // for update 1

So we have to find a way to reject BeginUpdate() right there if the previous update observer hasn't been notified. Directly probing the worker's state is the most lightweight solution. No lock is required since Worker::BeginUpdate() and Worker::NotifyUpdateObserver() are by nature mutually exclusive (both run on the worker thread).

[1] http://searchfox.org/mozilla-central/rev/ef0b6a528e0e7b354ddf8cdce08392be8b8ca679/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#2041
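As promised in (1) above, a minimal sketch of the merge step, using std containers in place of nsTArray<LookupCache*>. The real Classifier::MergeNewLookupCaches() differs in details (ownership, assertions, directory handles); this only shows the table-name matching logic.

```
#include <string>
#include <utility>
#include <vector>

struct LookupCacheSketch {
  std::string tableName;
  // ... prefix set, completions, file handles ...
};

void MergeNewLookupCaches(std::vector<LookupCacheSketch*>& aLookupCaches,
                          std::vector<LookupCacheSketch*>& aNewLookupCaches) {
  for (auto*& newCache : aNewLookupCaches) {
    bool replaced = false;
    for (auto*& oldCache : aLookupCaches) {
      if (oldCache->tableName == newCache->tableName) {
        std::swap(oldCache, newCache);  // the old cache ends up in aNewLookupCaches
        replaced = true;
        break;
      }
    }
    if (!replaced) {
      aLookupCaches.push_back(newCache);
      newCache = nullptr;  // nothing to clean up for this slot later
    }
  }
  // aNewLookupCaches now holds only the replaced (outdated) caches (plus nulls
  // for appended entries); the caller disposes of them in
  // RemoveUpdateIntermediaries().
}
```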
Depends on: 1346757
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review122296

I will need to go through this again to review the logic along with your diagrams, but here are a few minor things I noticed in my first reading.

::: toolkit/components/url-classifier/Classifier.cpp:690
(Diff revision 1)
>   RemoveUpdateIntermediaries();
>   return rv;
> }
>
> - // Step 2. Swap in in-memory tables and update root directroy handles.
> - mLookupCaches.SwapElements(mNewLookupCaches);
> + // Step 2. Merge mNewLookupCaches into mLookupCaches. The outdated
> + //         LookupCaches will be stored in mNewLookupCaches and be cleanup later.

"and cleaned up later."

::: toolkit/components/url-classifier/Classifier.cpp:730
(Diff revision 1)
> +
> nsresult
> Classifier::ApplyUpdatesBackground(nsTArray<TableUpdate*>* aUpdates,
>                                    nsACString& aFailedTableName)
> {
>   // Will run on the update thread after Bug 1339760.

That's the wrong bug number, it should be bug 1339050. Therefore, the comment needs to be updated in this patch.

::: toolkit/components/url-classifier/Classifier.cpp:737
(Diff revision 1)
>   // the table lookup may lead to database removal.
>
> + // |mUpdateInterrupted| is guaranteed to have been unset.
> + // If |mUpdateInterrupted| is set at any point, Reset() must have
> + // been called then we need to interrupt the update process.
> + // We only add check points to where the non-trial task

I don't understand this sentence. Do you mean something like "we only add checkpoints for non-trivial tasks"?

::: toolkit/components/url-classifier/Classifier.cpp:765
(Diff revision 1)
>
> {
>   ScopedUpdatesClearer scopedUpdatesClearer(aUpdates);
>
>   {
>     // TODO: Bug 1339050. This section should be mutual exclusive to

Presumably this TODO should be removed?

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2038
(Diff revision 1)
>     LOG(("Already updating, not available"));
>     return NS_ERROR_NOT_AVAILABLE;
>   }
> + if (mWorker->IsBusy()) {
> +   // |mInUpdate| used to work well because "notifying update observer"
> +   // is synchronously done in Worker::FinishUpdate(). Even if the the

typo: "the the"

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2048
(Diff revision 1)
> +   // However, some tasks in Worker::FinishUpdate() have been moved to
> +   // another thread. The update observer will NOT be notified when
> +   // Worker::FinishUpdate() returns. If we only check |mInUpdate|,
> +   // the following sequence might happen on worker thread:
> +   //
> +   // Worker::FinsihUpdate() // for update 1

typo: Finish
Attachment #8839099 - Attachment is obsolete: true
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review122500

::: commit-message-f3b22:3
(Diff revision 1)
> +Bug 1339050 - Concurrently do update (whenever possible) and lookup.
> +Conflicts:
> +	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp

Remember to clean the conflicts up in this message, it's from your own merges.

::: toolkit/components/url-classifier/Classifier.h:222
(Diff revision 1)
>   // The copy of mLookupCaches for update only.
>   nsTArray<LookupCache*> mNewLookupCaches;
> +
> + bool mUpdateInterrupted;
> +
> + // Hold this lock to prevent from in-use data being reset.

nit: ...to prevent in-use data from being reset...

::: toolkit/components/url-classifier/Classifier.cpp:659
(Diff revision 1)
> +   if (mLookupCaches[swapIndex]->TableName() == newCache->TableName()) {
> +     break;
> +   }
> + }
> + if (swapIndex == mLookupCaches.Length()) {
> +   mLookupCaches.AppendElement(nullptr);

I'd like to see some ASSERTs in here that verify whatever invariants are needed to make sure no one else can try to get a mLookupCache now, because they risk getting a sudden nullptr.

::: toolkit/components/url-classifier/Classifier.cpp:691
(Diff revision 1)
>     return rv;
>   }
>
> - // Step 2. Swap in in-memory tables and update root directroy handles.
> - mLookupCaches.SwapElements(mNewLookupCaches);
> - for (auto c: mLookupCaches) {
> + // Step 2. Merge mNewLookupCaches into mLookupCaches. The outdated
> + //         LookupCaches will be stored in mNewLookupCaches and be cleanup later.
> + MergeNewLookupCaches();

Any reason the cleanup can't happen here?

::: toolkit/components/url-classifier/HashStore.h:242
(Diff revision 1)
>   // Wipe out all Completes.
>   void ClearCompletes();
>
> private:
>   nsresult Reset();
> + nsresult ResetNoLock();

More common naming is ResetInternal(). After all, things may well be locked when entering the function.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.h:185
(Diff revision 1)
>   nsresult GCC_MANGLING_WORKAROUND CloseDb();
>
>   nsresult CacheCompletions(CacheResultArray * aEntries);
>   nsresult CacheMisses(PrefixArray * aEntries);
>
> + bool IsBusy() const { return !!mUpdateObserver; }

Add a comment here pointing out that this is nulled after updating finishes (and that's why this works).

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:644
(Diff revision 1)
> - NS_NewRunnableFunction([self, &backgroundRv, &failedTableName] () -> void {
> -   backgroundRv = self->ApplyUpdatesBackground(failedTableName);
> +
> + LOG(("Background update finished. Do foreground update!"));
> +
> + // nsUrlClassifierDBService::BackgroundThread() is relatively
> + // "foreground" to the DB update thread.
> + auto foregroundThread = nsUrlClassifierDBService::BackgroundThread();

No. Please use consistent naming for the two threads here. Update Thread and Worker Thread is fine.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2032
(Diff revision 1)
> nsUrlClassifierDBService::BeginUpdate(nsIUrlClassifierUpdateObserver *observer,
>                                       const nsACString &updateTables)
> {
>   NS_ENSURE_TRUE(gDbBackgroundThread, NS_ERROR_NOT_INITIALIZED);
>
>   if (mInUpdate) {

Merge this with the next check?
Attachment #8844749 - Flags: review?(gpascutto) → review-
(In reply to Henry Chang [:henry][:hchang] from comment #7)
> 2) Introduced a "Reset Mutex" and a flag "mUpdateInterrupted" to deal with
>    the following case:
>
>    Failing to open a table (on the worker thread, happens in GetLookupCache())
>    while copying "safebrowsing" to "safebrowsing-updating" (on the update thread):
>    the table will be removed and Classifier::Reset() will be called.

Why not deal with this case *synchronously* by Dispatching to the Update Thread to do the Resets, synchronously from the Worker Thread?

This avoids the (nasty!) passing around of the Mutex, and the use of the mutex already makes those calls synchronous anyway.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)
> (In reply to Henry Chang [:henry][:hchang] from comment #7)
> > 2) Introduced a "Reset Mutex" and a flag "mUpdateInterrupted" to deal with
> >    the following case:
> >
> >    Failing to open a table (on the worker thread, happens in GetLookupCache())
> >    while copying "safebrowsing" to "safebrowsing-updating" (on the update thread):
> >    the table will be removed and Classifier::Reset() will be called.
>
> Why not deal with this case *synchronously* by Dispatching to the Update
> Thread to do the Resets, synchronously from the Worker Thread?
>
> This avoids the (nasty!) passing around of the Mutex, and the use of the
> mutex already makes those calls synchronous anyway.

It sounds good!

My only concern is that Classifier/LookupCache/HashStore would then rely on the update thread, and that makes it hard to write unit tests. We could mock the update thread for unit tests, but I will try to avoid that.

I need to think about what I can do for this issue...
Blocks: 1347657
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review122500

Thanks for your first round of review :) Using SyncRunnable to avoid locking is a brilliant idea, but I haven't found a way to do it gracefully yet. So I'll leave the lock there for now and submit a new patch with some fixes for the "shutdown" issues.

> Any reason the cleanup can't happen here?

I have 3 weak reasons :p

1) Consistency. SwapDirectoryContent() and MergeNewLookupCaches() are only responsible for bringing in the updated data. All the update intermediaries will be removed in the following RemoveUpdateIntermediaries().
2) I try not to make a function too versatile.
3) I was thinking the cleanup could one day be done on the update thread (if removing the directory takes a long while).

But as I said, these are not strong reasons, so I don't mind changing the design.

> Merge this with the next check?

I intentionally split them apart because they are at different "busy levels". If |mInUpdate| is true, it means FinishUpdate() hasn't been called. This could be caused by a dangling BeginUpdate (i.e. misuse of the API) or by concurrent use of the API. If |mInUpdate| is false but mWorker->IsBusy() is true, FinishUpdate() must have been called but the update observer hasn't been notified yet. That is usually because the caller didn't start a new update until being notified. Both cases lead to the same external/public state, so splitting them is only useful when I am debugging ^^"
(In reply to Henry Chang [:henry][:hchang] from comment #13)
> (In reply to Gian-Carlo Pascutto [:gcp] from comment #12)
> > Why not deal with this case *synchronously* by Dispatching to the Update
> > Thread to do the Resets, synchronously from the Worker Thread?
> >
> > This avoids the (nasty!) passing around of the Mutex, and the use of the
> > mutex already makes those calls synchronous anyway.
>
> It sounds good!
>
> My only concern is that Classifier/LookupCache/HashStore would then rely on
> the update thread, and that makes it hard to write unit tests. We could mock
> the update thread for unit tests, but I will try to avoid that.

I've come up with a solution: let Classifier own the update thread.

I had already implicitly made the update thread strongly coupled with Classifier (not submitted yet) to avoid shutdown issues, and now I'd like to make that coupling an explicit "composition" relationship:

1) DBService only knows the worker object (DBServiceWorker) and the worker thread.
2) DBServiceWorker only knows Classifier but *NOT* the update thread.
3) Classifier privately owns the update thread and provides an async update API: AsyncApplyUpdates(). Its callback will run on the caller thread, which is the worker thread in our case.

So it now becomes simpler for DBServiceWorker to do an async update:

mClassifier->AsyncApplyUpdates([=] { NotifyUpdateObserver(); });

(A rough sketch of AsyncApplyUpdates() itself follows below.)

Obviously the shutdown order should be:

DBService shutdown begins
  DBServiceWorker shutdown begins
    worker thread shutdown begins
      Classifier shutdown begins
        update thread shutdown begins
        update thread shutdown ends
      Classifier shutdown ends
    worker thread shutdown ends
  DBServiceWorker shutdown ends
DBService shutdown ends

Finally, since Classifier owns the update thread, we can do those "update-safety-required" tasks (e.g. Reset()) by synchronously dispatching to the owning update thread.

I will include this approach in the next patch.
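A minimal sketch of the AsyncApplyUpdates() flow described above, assuming the Gecko helpers already used in this patch (NS_GetCurrentThread, NS_NewRunnableFunction, nsIThread::Dispatch). Error handling, the failed-table-name plumbing, and the mUpdateInterrupted checkpoints are omitted, so this is not the code that actually landed.

```
nsresult
Classifier::AsyncApplyUpdates(nsTArray<TableUpdate*>* aUpdates,
                              AsyncUpdateCallback aCallback)
{
  // Capturing |this| is safe only because Classifier owns mUpdateThread and
  // flushes/joins it (FlushAndDisableAsyncUpdate) before being destroyed.
  nsCOMPtr<nsIThread> callerThread = NS_GetCurrentThread(); // the worker thread

  nsCOMPtr<nsIRunnable> bgTask = NS_NewRunnableFunction([=]() {
    // Step 1: runs on the private update thread.
    nsresult bgRv = ApplyUpdatesBackground(aUpdates);

    nsCOMPtr<nsIRunnable> fgTask = NS_NewRunnableFunction([=]() {
      // Step 2: back on the caller (worker) thread.
      nsresult rv = NS_SUCCEEDED(bgRv) ? ApplyUpdatesForeground(aUpdates) : bgRv;
      aCallback(rv); // e.g. DBServiceWorker notifies its update observer here
    });
    callerThread->Dispatch(fgTask, NS_DISPATCH_NORMAL);
  });

  return mUpdateThread->Dispatch(bgTask, NS_DISPATCH_NORMAL);
}
```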
Attachment #8843227 - Attachment is obsolete: true
Attachment #8847520 - Attachment is obsolete: true
Some major changes from the previous revision:

1) Removed DBServiceWorker::ApplyUpdatesBackground/Foreground, so the only exception I added that does not run on the worker thread is DBServiceWorker::IsBusy.
2) Made Classifier::ApplyUpdatesBackground/Foreground private, so none of the public Classifier interfaces run on the update thread.
3) Added Classifier::AsyncApplyUpdates, so DBServiceWorker::FinishUpdate can directly use this function and install a callback (run on the worker thread) to accomplish the async update very easily.
4) Moved gDbUpdateThread into Classifier (Classifier::mUpdateThread). The existence of the update thread is transparent to DBService[Worker].
5) Added Classifier::FlushAndDisableAsyncUpdate for synchronizing with the update thread; it simply shuts the update thread down.
6) In the past we deleted the LookupCache/HashStore file when Open() failed due to file corruption. This seems unnecessary since Classifier::Reset() will be called a couple of lines below [1], so I got rid of LookupCache::Reset() and HashStore::Reset().

About the shutdown process:

We mutate two flags on the main thread when it is notified to shut down:

- gDbBackgroundThread: set to null to avoid further events being sent to the worker thread from the main thread. In the past only the main thread would send events to the worker thread, so this guarantees the worker thread will no longer receive events.
- gShuttingDownThread: to interrupt the update. Since this flag is set on the main thread and read on the worker thread with no lock, it is only meaningful to the worker thread when it is true. When the flag is false, there is no guarantee that shutdown will not start at any later point (chances are the flag is set right after we check its value).

The current shutdown process is (a rough sketch follows after this comment):

1) Set |gShuttingDownThread| to true to interrupt the ongoing update as soon as possible. It is just a "nice to have": even if the flag were always false, the threads should still be able to shut down.
2) Synchronize with the update thread. This is really, really important, because the update thread might be doing an update and could post an event to the worker thread at any point in the shutdown process. Checking the above two flags is useless here because they are mutated on the main thread and we would be reading them on the update thread.
3) Once we get to this stage, the worker thread will no longer receive events except the last two we are going to send: CancelUpdate() and CloseDb(). The former notifies the "dangling update" [2] (which is not covered by step 2) and the latter clears mClassifier (which is why that event needs to be the very last one on the worker thread).
4) Now we can just wait until the worker thread finishes all pending events.

Note that step (2) is done by joining the update thread *on the worker thread*. Why not join on the main thread? If we joined the update thread from the main thread, it would imply manipulating Classifier::mUpdateThread on both the main thread and the worker thread, which would require locking around Classifier::mUpdateThread — no good. The final solution is to *synchronously* post an event to the worker thread to shut down the update thread. Why *synchronously*? If we posted it asynchronously, the async update callback event might be flushed to the worker thread *after* the CloseDb() event which we are about to send in step (3).

[1] http://searchfox.org/mozilla-central/rev/ee7cfd05d7d9f83b017dd54c2052a2ec19cbd48b/toolkit/components/url-classifier/Classifier.cpp#1393
[2] Dangling update means BeginUpdate() is called but FinishUpdate() isn't.
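A rough sketch of that shutdown sequence, assuming the Gecko helpers quoted elsewhere in this bug (NewRunnableMethod, mozilla::SyncRunnable, mWorkerProxy). Whether the worker exposes a FlushAndDisableAsyncUpdate() forwarder to Classifier exactly like this is an assumption, and error handling plus the exact ordering around gDbBackgroundThread are trimmed, so treat this as an illustration of the steps above rather than the landed code.

```
nsresult
nsUrlClassifierDBService::Shutdown()
{
  // (1) Ask any ongoing update to stop at its next checkpoint.
  gShuttingDownThread = true;

  // (2) Synchronously flush/shut down the update thread *via the worker
  // thread*, so that any "update finished" callback the update thread posts
  // is already queued on the worker thread before the events sent in (3).
  nsCOMPtr<nsIRunnable> flush = NewRunnableMethod(
    mWorker, &nsUrlClassifierDBServiceWorker::FlushAndDisableAsyncUpdate);
  SyncRunnable::DispatchToThread(gDbBackgroundThread, flush);

  // (3) The last two events the worker thread will ever receive.
  mWorkerProxy->CancelUpdate();  // notify the dangling update, if any
  mWorkerProxy->CloseDb();       // clears mClassifier; must be the very last event

  // (4) Wait for the worker thread to drain its pending events.
  gDbBackgroundThread->Shutdown();
  return NS_OK;
}
```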
(In reply to Henry Chang [:henry][:hchang] from comment #7)
> 3) Dealt with the racing issue between BeginUpdate() and
>    NotifyUpdateObserver().

I wonder whether it might be worth creating a setter function to modify mUpdateObserver. Inside that function, we could check that we're on the right thread, and also check that we're either:

1. replacing a nullptr with a non-nullptr
2. replacing a non-nullptr with a nullptr

i.e. checking that we're not trying to add a second observer before the previous one has been notified.
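One possible shape for the setter suggested above (a hypothetical helper, not something in the tree): it asserts the expected thread and rejects installing a second observer before the previous one has been notified.

```
void
nsUrlClassifierDBServiceWorker::SetUpdateObserver(nsIUrlClassifierUpdateObserver* aObserver)
{
  MOZ_ASSERT(!NS_IsMainThread(), "must run on the worker thread");
  // Allowed transitions: null -> non-null (BeginUpdate) and non-null -> null
  // (after the observer has been notified). Anything else is a logic error.
  MOZ_ASSERT(!aObserver || !mUpdateObserver,
             "installing a second update observer before notifying the first");
  mUpdateObserver = aObserver;
}
```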
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review124572

I like the approach (thanks for the diagrams, they were very helpful) and didn't find anything obviously wrong with the implementation. I'm suggesting a few extra checks to improve clarity and easily catch some possible threading mistakes we might make in the future. There are also a few questions inline.

::: toolkit/components/url-classifier/Classifier.h:78
(Diff revision 3)
> + * used in gtest only.
>   */
>   nsresult ApplyUpdates(nsTArray<TableUpdate*>* aUpdates);
>
>   /**
> -  * The "background" part of ApplyUpdates. Once the background update
> +  * Equivalent to ApplyUpdates execept that the update is done

nit: "except"

::: toolkit/components/url-classifier/Classifier.cpp:281
(Diff revision 3)
>
> void
> Classifier::Reset()
> {
> + MOZ_ASSERT(NS_GetCurrentThread() != mUpdateThread,
> +            "Reset() MUST not be called on update thread");

nit: "MUST NOT"

Do you think we should also assert that it's not running on the main thread?

::: toolkit/components/url-classifier/Classifier.cpp:284
(Diff revision 3)
> {
> + MOZ_ASSERT(NS_GetCurrentThread() != mUpdateThread,
> +            "Reset() MUST not be called on update thread");
> +
> + LOG(("Reset() is called so we interrupt the update."));
> + mUpdateInterrupted = true;

As far as I can see, we don't need a lock on `mUpdateInterrupted` because we never write to it from the update thread. I would therefore suggest that we create a small setter function:

```
void SetUpdateInterrupted(bool aInterrupted) {
  MOZ_ASSERT(current thread != update thread)
  mUpdateInterrupted = aInterrupted
}
```

::: toolkit/components/url-classifier/Classifier.cpp:661
(Diff revision 3)
> }
>
> +void
> +Classifier::MergeNewLookupCaches()
> +{
> + MOZ_ASSERT(NS_GetCurrentThread() != mUpdateThread,

Maybe we should also assert that we're not on the main thread?

::: toolkit/components/url-classifier/Classifier.cpp:708
(Diff revision 3)
>   RemoveUpdateIntermediaries();
>   return rv;
> }
>
> - // Step 2. Swap in in-memory tables and update root directroy handles.
> - mLookupCaches.SwapElements(mNewLookupCaches);
> + // Step 2. Merge mNewLookupCaches into mLookupCaches. The outdated
> + //         LookupCaches will be stored in mNewLookupCaches and be cleaned

When/where will it be cleaned up?

::: toolkit/components/url-classifier/Classifier.cpp:781
(Diff revision 3)
> +void Classifier::FlushAndDisableAsyncUpdate()
> +{
> + LOG(("Classifier::FlushAndDisableAsyncUpdate [%p, %p]", this, mUpdateThread.get()));
> +
> + if (!mUpdateThread) {
> +   LOG(("Async update has been disabled."));

nit: maybe this should say "already been disabled"?

::: toolkit/components/url-classifier/Classifier.cpp:810
(Diff revision 3)
>
> +nsresult
> +Classifier::AsyncApplyUpdates(nsTArray<TableUpdate*>* aUpdates,
> +                              AsyncUpdateCallback aCallback)
> +{
> + LOG(("Classifier::AsyncApplyUpdates"));

We should probably assert that we're not on the update thread.

::: toolkit/components/url-classifier/Classifier.cpp:825
(Diff revision 3)
> + // (processing other task)  | (bg-update done. ping back to caller thread)
> + // (processing other task)  | idle...
> + // ApplyUpdatesForeground   |
> + // callback                 |
>
> + mUpdateInterrupted = false;

This should also use `SetUpdateInterrupted()`.

::: toolkit/components/url-classifier/Classifier.cpp:1496
(Diff revision 3)
> LookupCache *
> Classifier::GetLookupCache(const nsACString& aTable, bool aForUpdate)
> {
> + if (aForUpdate) {
> +   MOZ_ASSERT(NS_GetCurrentThread() == mUpdateThread,
> +              "GetLookupCacheForUpdate can only be called on update thread.");

The name of the function is `GetLookupCache()` so maybe that should be `GetLookupCache(aForUpdate==true)` for clarity.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:635
(Diff revision 3)
> }
>
> RefPtr<nsUrlClassifierDBServiceWorker> self = this;
> + nsresult rv = mClassifier->AsyncApplyUpdates(&mTableUpdates,
> +                                              [=] (nsresult aRv) -> void {
> +#ifdef MOZ_SAFEBROWSING_DUMP_FAILED_UPDATES

An assertion checking that the thread is the right one would be helpful here.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:660
(Diff revision 3)
>
> nsresult
> nsUrlClassifierDBServiceWorker::NotifyUpdateObserver(nsresult aUpdateStatus)
> {
> + LOG(("nsUrlClassifierDBServiceWorker::NotifyUpdateObserver"));
> +

Again, checking the thread is the right one might be useful.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:693
(Diff revision 3)
>                          NS_ERROR_GET_CODE(updateStatus));
>   }
>
>   mMissCache.Clear();
>
> + // Null out mUpdateObserver before notifying so that BeginUpdate()

Is this a good idea?

If I understand this part correctly, we're talking about two updates trying to run at the same time. Maybe we should just wait until everyone has been notified before letting another update start.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:695
(Diff revision 3)
>
>   mMissCache.Clear();
>
> + // Null out mUpdateObserver before notifying so that BeginUpdate()
> + // becomes available prior to callback.
> + nsCOMPtr<nsIUrlClassifierUpdateObserver> updateObserver;

Adding `= nullptr` would emphasize that we are nulling this out via the swap.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2012
(Diff revision 3)
>     return NS_ERROR_NOT_AVAILABLE;
>   }
> + if (mWorker->IsBusy()) {
> +   // |mInUpdate| used to work well because "notifying update observer"
> +   // is synchronously done in Worker::FinishUpdate(). Even if the
> +   // update observe hasn't been notified at this point, we can still

typo: "update observer"

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2096
(Diff revision 3)
> nsUrlClassifierDBService::ResetDatabase()
> {
>   NS_ENSURE_TRUE(gDbBackgroundThread, NS_ERROR_NOT_INITIALIZED);
>
> + if (mWorker->IsBusy()) {
> +   LOG(("Failed to ResetDatabase because of the unfinished update."));

If we don't reset or reload the database when the worker is busy, are we going to retry later?

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2189
(Diff revision 3)
>   } else if (NS_LITERAL_STRING(CONFIRM_AGE_PREF).Equals(aData)) {
>     gFreshnessGuarantee = Preferences::GetInt(CONFIRM_AGE_PREF,
>                                               CONFIRM_AGE_DEFAULT_SEC);
>   }
> } else if (!strcmp(aTopic, "quit-application")) {
> -   Shutdown();
> +   // Hint the update thread to finish as soon as possible.

nit: "Tell" instead of "Hint"

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2254
(Diff revision 3)
> +   // events on the worker thread in the shutdown process.
>   DebugOnly<nsresult> rv;
> - // First close the db connection.
> - if (mWorker) {
> -   rv = mWorkerProxy->CancelUpdate();
> +   rv = mWorkerProxy->CancelUpdate();
> -   NS_ASSERTION(NS_SUCCEEDED(rv), "failed to post cancel update event");
> +   MOZ_ASSERT(NS_SUCCEEDED(rv), "failed to post close db event");

This should be "cancel update event".
Attachment #8844749 - Flags: review?(francois)
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review124572

Thanks for the review!

As for your suggestions about assertions in Classifier, I kept struggling with adding threading assertions there:

1) I try to decouple Classifier from the worker thread. Coupling Classifier with the worker thread makes it difficult to write unit tests. In order not to fail the assertion in a unit test, we have to assert the current thread "negatively": for example, we test "not on the main thread" instead of "on the worker thread". Actually, testing "if on the main thread" is equivalent to testing "if on the worker thread" in terms of how much difficulty it introduces for unit tests: "not on the main thread" only implies being on some non-main thread, not necessarily "the" worker thread. In gtest, we have to spin up a new thread to call some Classifier member functions.

http://searchfox.org/mozilla-central/rev/0079c7adf3b329bff579d3bbe6ac7ba2f6218a19/toolkit/components/url-classifier/tests/gtest/Common.cpp#56

2) The global worker thread pointer (gDbBackgroundThread) will be nulled out in the shutdown process. By contrast, the "update thread" is entirely managed by and only known to Classifier, so it's always fine to add assertions about the update thread. In other words, every Classifier member function has a strict restriction that it must or must not run on the update thread.

That said, it might be worth enforcing that all Classifier public APIs are called from a consistent thread. For example, we could record the thread on which Classifier is constructed and check that every public API is called on that same thread. This enforcement ensures all public APIs are mutually exclusive without coupling to the worker thread.

Maybe I am over-designing, and we have already coupled many things to the worker thread, so I will not insist on avoiding the assertions if there is no alternative. What do you think? :)

> When/where will it be cleaned up?

In step 4: RemoveUpdateIntermediaries() :)

> Is this a good idea?
>
> If I understand this part correctly, we're talking about two updates trying to run at the same time. Maybe we should just wait until everyone has been notified before letting another update start.

Regardless of whether this is a good design, given that

1) mUpdateObserver can only be set on the worker thread (DBServiceWorker::BeginUpdate),
2) DBServiceWorker::NotifyUpdateObserver() runs on the worker thread, too,
3) DBService::BeginUpdate will reject the update request if DBServiceWorker::mUpdateObserver is non-null,
4) mUpdateObserver::UpdateSuccess/UpdateError/DownloadError sends an event to the main thread, so there will be no re-entrance into this function or a nested DBServiceWorker::BeginUpdate(),

if the main thread requests a new update before DBServiceWorker::BeginUpdate() returns, the request will be accepted but the BeginUpdate() event will not be processed until this function returns.

> If we don't reset or reload the database when the worker is busy, are we going to retry later?

ResetDatabase() and ReloadDatabase() are the only two public APIs whose behaviour I failed to retain. I was trying to not reject the request by doing something like

```
RefPtr<nsIRunnable> r =
  NewRunnableMethod(mWorker, &Worker::FlushAndDisableAsyncUpdate);
SyncRunnable::DispatchToThread(gDbBackgroundThread, r);
```

to wait until the previous update is notified. However, that would fail when there is an |OpenDb| on the worker thread. The reason is that OpenDb() will initialize a CryptoHash, which is constructed *synchronously* on the main thread:

http://searchfox.org/mozilla-central/rev/0079c7adf3b329bff579d3bbe6ac7ba2f6218a19/security/manager/ssl/nsNSSComponent.cpp#99

I was so close!

Thankfully, ResetDatabase and ReloadDatabase are only used by test cases. If they are not used properly, an error (JS exception) will be thrown and I will fix them on demand. BTW, all the existing uses of ResetDatabase and ReloadDatabase are fine: they are called only after the previous update has been notified.
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review124572

I see, that's annoying. If adding assertions would make it too hard to test or cause you to test code/threads that are different from what is running for real on release, then it's probably not worth it.

> Regardless of whether this is a good design, given that
>
> 1) mUpdateObserver can only be set on the worker thread (DBServiceWorker::BeginUpdate),
> 2) DBServiceWorker::NotifyUpdateObserver() runs on the worker thread, too,
> 3) DBService::BeginUpdate will reject the update request if DBServiceWorker::mUpdateObserver is non-null,
> 4) mUpdateObserver::UpdateSuccess/UpdateError/DownloadError sends an event to the main thread,
>    so there will be no re-entrance into this function or a nested DBServiceWorker::BeginUpdate(),
>
> if the main thread requests a new update before DBServiceWorker::BeginUpdate() returns, the request
> will be accepted but the BeginUpdate() event will not be processed until this function returns.

And are there any benefits to making `BeginUpdate()` available prior to the callback?

> ResetDatabase() and ReloadDatabase() are the only two public APIs whose behaviour I failed to retain.
> I was trying to not reject the request by doing something like
>
> ```
> RefPtr<nsIRunnable> r =
>   NewRunnableMethod(mWorker, &Worker::FlushAndDisableAsyncUpdate);
> SyncRunnable::DispatchToThread(gDbBackgroundThread, r);
> ```
>
> to wait until the previous update is notified. However, that would fail when there is an |OpenDb| on the
> worker thread. The reason is that OpenDb() will initialize a CryptoHash, which is constructed
> *synchronously* on the main thread:
>
> http://searchfox.org/mozilla-central/rev/0079c7adf3b329bff579d3bbe6ac7ba2f6218a19/security/manager/ssl/nsNSSComponent.cpp#99
>
> I was so close!
>
> Thankfully, ResetDatabase and ReloadDatabase are only used by test cases. If they are not used properly,
> an error (JS exception) will be thrown and I will fix them on demand. BTW, all the existing uses of
> ResetDatabase and ReloadDatabase are fine: they are called only after the previous update has been notified.

Ok, if they're only used in tests then it will be obvious if they are used improperly (i.e. we'll get test failures).
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review124572

> And are there any benefits to making `BeginUpdate()` available prior to the callback?

The one and only benefit is to ensure that

```
streamUpdater.downloadUpdate(....., function updateSuccess() {
  streamUpdater.downloadUpdate();
}, ...);
```

or even

```
streamUpdater.downloadUpdate()
streamUpdater.downloadUpdate()
```

work as before. As I mentioned in (4), the callback is async. It's not impossible that a new downloadUpdate() is called before DBServiceWorker::mUpdateObserver is nulled out. Even though the new update request would be queued, it takes yet another streamUpdater.downloadUpdate() to trigger it. The latter case might be avoidable, but the former one cannot be made "more correct". So I ended up taking the current approach you are reviewing :)
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review126344

::: commit-message-f3b22:5
(Diff revisions 1 - 3)
> -Conflicts:
> -	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
>
> -MozReview-Commit-ID: DFnpyQ0MKTl
> +A new function Classifier::AsyncApplyUpdates() is implemented for async update.
> +Besides, all public Classifier interfaces become "worker thread only" and
> +remove DBServiceWorker::ApplyUpdatesBackground/Foreground.

nit: ...and we remove...

::: toolkit/components/url-classifier/Classifier.h:73
(Diff revisions 1 - 3)
>
>   /**
>    * Apply the table updates in the array. Takes ownership of
>    * the updates in the array and clears it. Wacky!
> +  * This function is only a wrapper of AsyncApplyUpdates and
> +  * used in gtest only.

I don't see anything in the implementation that needs specific peeking into Classifier, so shouldn't this function be moved into the tests themselves?

::: toolkit/components/url-classifier/Classifier.cpp:301
(Diff revisions 1 - 3)
> -   mTableFreshness.Clear();
> +   mTableFreshness.Clear();
> -   RegenActiveTables();
> +   RegenActiveTables();
> + };
>
> - LOG(("Reset() is called so we interrupt the update."));
> - mUpdateInterrupted = true;
> + if (!mUpdateThread) {
> +   LOG(("Async update has been disabled. Just Reset() on worker thread."));

In what case do we end up here? Do we want to run Reset during Shutdown?

::: toolkit/components/url-classifier/Classifier.cpp:754
(Diff revisions 1 - 3)
> +   monitor.Notify();
> + });
> +
> + // Failed to do AsyncApplyUpdates.
> + if (NS_FAILED(rv)) {
> +   ret = rv;

Can't you put the runnable in a var and reuse it? (This code is copy-pasted from the code above.)

::: toolkit/components/url-classifier/Classifier.cpp:787
(Diff revisions 1 - 3)
> +   return;
> + }
> +
> + // nsIThread.shutdown has side effetc: it causes the joining thread to
> + // process pending events. We spin a specific thread for shutting down
> + // the update thread.

We need to spin up a thread to call Shutdown on another one? I can't believe this is the correct solution.

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2095
(Diff revisions 1 - 3)
> NS_IMETHODIMP
> nsUrlClassifierDBService::ResetDatabase()
> {
>   NS_ENSURE_TRUE(gDbBackgroundThread, NS_ERROR_NOT_INITIALIZED);
>
> + if (mWorker->IsBusy()) {

IsBusyUpdating? Can it be busy with something else? If not, the log message is potentially wrong.
Attachment #8844749 - Flags: review?(gpascutto)
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review126344

> In what case do we end up here? Do we want to run Reset during Shutdown?

Assume we have 10 tables, 7 of which have been loaded into Classifier::mLookupCaches, and an async update is running in the background. Since this patch makes it possible to do lookups during an update, chances are we fail to load one of the 3 not-yet-loaded tables because of file corruption. If that happens:

1) The worker thread will set mUpdateInterrupted.
2) The worker thread will *synchronously* send an event to the update thread to do Classifier::Reset().
3) The update thread will be interrupted by mUpdateInterrupted and then post an event to the worker thread to do Classifier::RemoveUpdateIntermediaries().

(A sketch of this Reset() shape follows below.)

As for "Do we want to run Reset during Shutdown?", I don't have a good answer. Reset() is not only for cleaning up in-memory data; whenever possible, I want to ensure on-disk data integrity. That's also why I don't add nsUrlClassifierDBService::ShutdownHasStarted() checks everywhere.

> We need to spin up a thread to call Shutdown on another one? I can't believe this is the correct solution.

nsIThread::Shutdown has the same side effect as DISPATCH_SYNC:

http://searchfox.org/mozilla-central/rev/7419b368156a6efa24777b21b0e5706be89a9c2f/xpcom/threads/nsIEventTarget.idl#39

I was quite certain that spinning up a thread to shut down a thread was required before I moved "shutting down the update thread" here. However, since I've changed where Shutdown is called, I am not sure the trick is still needed in the current approach. I'll try to figure out whether it is.
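A minimal sketch of how that Reset() can be serialized against the update thread that Classifier owns, assuming mozilla::SyncRunnable and NS_NewRunnableFunction as used elsewhere in this patch. The real Reset() body does more work (e.g. deleting the store files), so only the dispatch shape is shown.

```
void
Classifier::Reset()
{
  MOZ_ASSERT(NS_GetCurrentThread() != mUpdateThread,
             "Reset() must not be called on the update thread");

  // Checkpoints in ApplyUpdatesBackground() will notice this and bail out.
  mUpdateInterrupted = true;

  nsCOMPtr<nsIRunnable> doReset = NS_NewRunnableFunction([=]() {
    // Runs on the update thread, so it cannot overlap with the background
    // part of an update. (Deleting the store files etc. is elided here.)
    mTableFreshness.Clear();
    RegenActiveTables();
  });

  if (mUpdateThread) {
    SyncRunnable::DispatchToThread(mUpdateThread, doReset);
  } else {
    // Async update already disabled (e.g. during shutdown): just run it here.
    doReset->Run();
  }
}
```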
Submitted a new patch for review. Only one comment from Francois is not addressed, the one about mUpdateInterrupted: Classifier::mUpdateInterrupted is written and read on different threads. The reason we don't lock around it is that the flag is NOT associated with any non-thread-safe data, so asserting the thread on which we change the flag doesn't seem to guarantee anything.
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review127800

::: commit-message-f3b22:1
(Diff revisions 3 - 4)
> -Bug 1339050 - Asynchronously apply safebrowsing DB update.
> +Bug 1339050 - Asynchronously apply safebrowsing DB update.e

There is a stray "e" after the period.

::: toolkit/components/url-classifier/Classifier.h:70
(Diff revisions 3 - 4)
>                    const nsACString& tables,
>                    uint32_t aFreshnessGuarantee,
>                    LookupResultArray& aResults);
>
>   /**
> -  * Apply the table updates in the array. Takes ownership of
> +  * Equivalent to ApplyUpdates except that the update is done

That comment now refers to a function that has been removed. It should be rephrased to remove that reference.
Attachment #8844749 - Flags: review?(francois) → review+
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review128680

::: toolkit/components/url-classifier/tests/gtest/Common.cpp:59
(Diff revision 4)
> +  // the main thread. As a result we have to keep processing
> +  // pending event until |done| becomes true. If there's no
> +  // more pending event, what we only can do is wait.
> +  // Condition variable doesn't work here because instrusively
> +  // notifying the from NS_NewCheckSummedOutputStream() or
> +  // HashStore::WriteFile() is weird.

Eeuw. Given that we own CheckSummedOutputStream, can we get rid of this, if necessary in a follow-up?
Attachment #8844749 - Flags: review?(gpascutto) → review+
Comment on attachment 8844749 [details]
Bug 1339050 - Asynchronously apply safebrowsing DB update.

https://reviewboard.mozilla.org/r/118054/#review128680

> Eeuw. Given that we own CheckSummedOutputStream, can we get rid of this, if necessary in a follow-up?

According to the implementation, we can get around this by *synchronously* initializing NS_CRYPTO_HASH_CONTRACTID on an arbitrary non-main thread; all non-main threads share the same |initialized| flag. I ended up not doing this because it's just a test case, so we can afford to block the main thread.

http://searchfox.org/mozilla-central/rev/381a7b8f8a78b481336cfa384cb0a5536e217e4a/security/manager/ssl/nsNSSComponent.cpp#86

BTW, I suspect the NSS initialization scheme has changed, because I found the following code which seems to address a similar issue but no longer works:

http://searchfox.org/mozilla-central/rev/c283b146cd745b8c85e7625d5ef5d7e7803d7e7d/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#1125
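For reference, the gtest pattern being discussed looks roughly like this — a sketch assuming NS_ProcessNextEvent and an AsyncApplyUpdates() callback that lands on the calling (main) thread; |classifier| and |updates| stand for the test fixture's objects, and this is not the exact Common.cpp code:

```
// Keep servicing main-thread events until the async update calls back,
// because the update thread may itself need the main thread (e.g. the
// crypto-hash creation behind the checksummed output stream).
bool done = false;
classifier->AsyncApplyUpdates(&updates, [&done](nsresult) { done = true; });
while (!done) {
  NS_ProcessNextEvent(nullptr, /* aMayWait = */ true);
}
```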
I believe I found the root causes of those two try failures.

The crash is caused by concurrent use of mRootStoreDirectory during an update. CopyInUseDirForUpdate(), which runs on the update thread, will invalidate mRootStoreDirectory at [1] and can lead to a bad access on the worker thread. (SetupPathNames() manipulates a ton of non-thread-safe objects, and I'm surprised I didn't hit any issue before.) So the solution is to avoid SetupPathNames() when we can be sure nothing needs to be updated. Since nsIFile.idl doesn't specify whether the object will be updated to refer to the new path, I make a copy of mRootStoreDirectory, for update purposes only, every time right before an update (sketched below).

As for the browser_bug400731.js intermittent failure, it's just because the testing entries were not added to the DB before the first test started.

[1] http://searchfox.org/mozilla-central/rev/b8cce081200129a1307e2a682f121135b5ca1d19/toolkit/components/url-classifier/Classifier.cpp#997
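A sketch of the fix described above: give the update thread its own copy of mRootStoreDirectory so that nothing it does can mutate the nsIFile the worker thread is using. The CopyDirectoryForUpdate helper and the simplified signature are assumptions for illustration; only nsIFile::Clone() is the real API being relied on.

```
nsresult
Classifier::CopyInUseDirForUpdate()
{
  // Clone() gives the update thread a private nsIFile pointing at the same
  // path, so mutating it cannot race with lookups on the worker thread.
  nsCOMPtr<nsIFile> rootStoreDirectoryForUpdate;
  nsresult rv = mRootStoreDirectory->Clone(getter_AddRefs(rootStoreDirectoryForUpdate));
  NS_ENSURE_SUCCESS(rv, rv);

  // Copy "safebrowsing" to "safebrowsing-updating" using the private copy only.
  return CopyDirectoryForUpdate(rootStoreDirectoryForUpdate,
                                NS_LITERAL_CSTRING("safebrowsing-updating"));
}
```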
Pushed by hchang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/baaee2deb3ac Asynchronously apply safebrowsing DB update. r=francois,gcp
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1338638
It seems as though this was responsible for the sudden change in two urlclassifier Telemetry probes: [1][2] Is this intentional? Is this okay? Is this good or bad or neither? Are these probes still measuring useful things? [1]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/111/alerts/?from=2017-04-07&to=2017-04-07 [2]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/2058/alerts/?from=2017-04-07&to=2017-04-07
Flags: needinfo?(hchang)
The previous telemetry data did not measure the actual shutdown time:

http://searchfox.org/mozilla-central/rev/821d5fb522f396fe072dd9dbbc3be6437ca53de5/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#2237

It only measured the "Shutdown()" function itself, which simply posts tasks to the worker thread. The most time-consuming work is done later at

http://searchfox.org/mozilla-central/rev/821d5fb522f396fe072dd9dbbc3be6437ca53de5/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#2200

which was not measured :)
Flags: needinfo?(hchang)
As for the "improvement" on URLCLASSIFIER_PS_CONSTRUCT_TIME [1], it's not expected (by me, at least). The only change I made to this measurement is that the function we measure will run on different thread. BTW, the telemetry things including the sudden change auto detection really really help! So grateful to the team :) [1] http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/111/alerts/?from=2017-04-07&to=2017-04-07
It seems as though there are 40-50% fewer samples in URLCLASSIFIER_PS_CONSTRUCT_TIME (before (1.94M Samples): [1], after (1M Samples): [2]) Is there something you did that may have reduced the number of times it was called? [1]: https://mzl.la/2nEFLwE [2]: https://mzl.la/2nEJ6vz
Flags: needinfo?(hchang)
(In reply to Chris H-C :chutten from comment #41)
> It seems as though there are 40-50% fewer samples in
> URLCLASSIFIER_PS_CONSTRUCT_TIME (before (1.94M Samples): [1], after (1M
> Samples): [2])
>
> Is there something you did that may have reduced the number of times it was
> called?
>
> [1]: https://mzl.la/2nEFLwE
> [2]: https://mzl.la/2nEJ6vz

I was not expecting this result :(

In theory, the update frequency, which is controlled by the server, may also affect the number of calls.

Dimi, Thomas, can you think of any changes you made lately which might lead to this result?
Flags: needinfo?(hchang) → needinfo?(dlee)
(In reply to Henry Chang [:henry][:hchang] from comment #42)
> (In reply to Chris H-C :chutten from comment #41)
> Dimi, Thomas,
>
> Can you think of any changes you made lately which might lead
> to this result?

I am not sure, but I think the sample count change may not be related to this bug. I just checked telemetry, and it seems that the sample count of every update-related probe dropped, for example URLCLASSIFIER_UPDATE_ERROR [1][2]. The sample count of lookup probes [3][4] also decreased, but not as much as for update. Is it possible something changed globally in Firefox?

[1] https://goo.gl/UfNMtO
[2] https://goo.gl/JEGuVd
[3] https://goo.gl/aRR5de
[4] https://goo.gl/IHQBMd
Flags: needinfo?(dlee)
I just had a look at this patch and the triggered try build. I noticed that you did not include the firefox-ui tests, which basically test the remote behavior against the real services. All the trouble we have had with those tests over the last 11 days could have been avoided if they had also been triggered.

Henry, can you please make sure to include them in try builds for patches on other bugs in the future? Thanks.
(In reply to Henrik Skupin (:whimboo) from comment #44)
> I just had a look at this patch and the triggered try build. I noticed
> that you did not include the firefox-ui tests, which basically test the
> remote behavior against the real services. All the trouble we have had with
> those tests over the last 11 days could have been avoided if they had also
> been triggered.
>
> Henry, can you please make sure to include them in try builds for patches
> on other bugs in the future? Thanks.

First of all, sorry about that :(

However, I am quite certain I did trigger firefox-ui (though maybe not on every triggered build). I think the point is that I didn't carefully check whether the "orange failures" in the try results were actual intermittent ones. As far as I can tell from the try results, the test case that bothered you over the last couple of days has other related intermittent failures. I checked the error messages and found they were identical to those of already-filed intermittent-failure bugs. Anyway, I am very sorry about that.
By the way, I triggered a lot of try builds while I was debugging some thread-racing issues. Most of them were not pushed via ReviewBoard, so maybe that's why you concluded that the firefox-ui tests were not included in the triggered try builds?
No problem at all. I just wanted to point out that we should include those tests in the future. They were useful a couple of times in the past by showing regressions in the safebrowsing code. Also because those are the only ones which test the remote ends. Thank you for the fixes!
(In reply to Henrik Skupin (:whimboo) from comment #47)
> No problem at all. I just wanted to point out that we should include those
> tests in the future. They were useful a couple of times in the past by
> showing regressions in the safebrowsing code. Also because those are the
> only ones which test the remote ends. Thank you for the fixes!

Thanks :)

Do you have any rule of thumb for when we see orange results in try builds? My practice is to compare the error messages with the filed bugs, but sometimes the error messages are identical even when the failure is actually new. I did catch that some of the orange results were not the already-filed intermittent failures, but I still missed this one. (These test cases luckily passed on autoland.) I sometimes check other people's try builds to figure out whether the oranges are real.
Flags: needinfo?(hskupin)
Sadly, there is no general rule; it varies from test to test. But we try to make the test failures more descriptive. If there are still some which only say "Timed out", we should definitely enhance those tests. In general, if you see failures, feel free to ping me on IRC and I can help out.
Flags: needinfo?(hskupin)