47 bytes, text/x-phabricator-request
|Details | Review|
Followup bug for crashes in 64 beta. +++ This bug was initially created as a clone of Bug #1453038 +++ This bug was filed from the Socorro interface and is report bp-53ff0449-a421-4467-b2c2-5b1e70180410. ============================================================= Windows only crash: https://bit.ly/2HqEfF4. Some user comments mention continuous crashes. Top 10 frames of crashing thread: 0 ntdll.dll NtWaitForAlertByThreadId 1 ntdll.dll RtlSleepConditionVariableCS 2 kernelbase.dll SleepConditionVariableCS 3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58 4 xul.dll mozilla::CondVar::Wait xpcom/threads/CondVar.h:68 5 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:70 6 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:98 7 xul.dll nsUrlClassifierDBService::Shutdown toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2539 8 xul.dll nsUrlClassifierDBService::Observe toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2480 9 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112 =============================================================
At least for the crash in comment 0, the crashing thread is a lie. The crash reason is: MOZ_CRASH Reason MOZ_CRASH(Shutdown hanging before starting.) and so the crashing thread is something else: Thread 50, Name: Shutdown Hang Terminator Frame Module Signature Source 0 xul.dll mozilla::`anonymous namespace'::RunWatchdog toolkit/components/terminator/nsTerminator.cpp:206 1 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:397 2 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:137 3 ucrtbase.dll thread_start<unsigned int (__stdcall*)(void*)> 4 kernel32.dll BaseThreadInitThunk 5 mozglue.dll patched_BaseThreadInitThunk mozglue/build/WindowsDllBlocklist.cpp:838 But the thread that's taking too long to do something, i.e. is stuck, is the main thread waiting on the URL classifier service. Looking at the first couple reports in: https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3ASyncRunnable%3A%3ADispatchToThread&date=%3E%3D2018-11-01T10%3A36%3A06.000Z&date=%3C2018-11-08T09%3A36%3A06.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-version&_sort=-date&page=1 this seems to be the common pattern. What is taking so long here?
This happened when URL classifier was doing a SafeBrowsing update which copies the blocklist database files to a tempory folder, update them, and then move them back to the original folder. We have added several checkpoints so we can abort the update as soon as possible while shutting down. The crash report in Comment 0 is from Firefox 59, that happened while we were copying the entire database directory. That should be fixed in Bug 1481819 which landed in 64. I'll see if I can find anything by checking crash reports in firefox 64, 65.
Assignee: nobody → dlee
Status: NEW → ASSIGNED
In Bug 1453038, |mUpdateInterrupted| is set in Classifer::Close() which is called by PreShutdown to abort an ongoing update. That doesn't handle all the cases. The SafeBrowsing update involves two threads, worker thread, and update thread. Update thread takes care of most of the update work, when it finishes its task, it posts a task back to the worker thread to apply the updated database and also do some cleanup stuff. Then the update is complete. The fix in Bug 1453038 doesn't abort an update if the woker thread is doing the job. This is because the |mUpdateInterrupted| flag is set in the worker thread. The PreShutdown event which eventually sets the flag has to wait until the worker thread's current task is done. In this patch: 1. Check nsUrlClassifierDBService::ShutdownHasStarted() to abort shutdown. This is set by main thread so both worker thread and update thread can be interrupted now. 2. mIsClosed is now replaced by the mUpdateInterrupted. The semantics of mUpdateInterrupted is now changed to abort update for any internal APIs which should cause an update to abort. 3. Remove |mUpdateInterrupted| and |ShutdownHasStarted()| checks and unify with |ShouldAbort()|
This patch addressed an issue I found while checking this bug. But it doesn't cover all the cases. To be honest, most of the crashes are pretty strange and I still don't have any clue.
(In reply to Julien Cristau [:jcristau] from comment #5) > Is this patch ready to land? Yes, I'll test on try again and land if everything goes well.
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/5b1c8bfb5cd2 Fix url-classifier worker thread is not aborted while shutting down. r=francois
Please request Beta approval on this when you get a chance.
Comment on attachment 9025961 [details] Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois [Beta/Release Uplift Approval Request] Feature/Bug causing the regression: None User impact if declined: crash while shutting down Is this code covered by automated tests?: Yes Has the fix been verified in Nightly?: Yes Needs manual test from QE?: No If yes, steps to reproduce: List of other uplifts needed: None Risk to taking this patch: Medium Why is the change risky/not risky? (and alternatives if risky): This bug is really related to the timing while shutting down. It is not easy to verify every scenario. String changes made/needed: no
Attachment #9025961 - Flags: approval-mozilla-beta?
Comment on attachment 9025961 [details] Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois [Triage Comment] Improved shutdown handling to avoid crashes. Approved for 65.0b8.
Attachment #9025961 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Dimi Lee [:dimi][:dlee] from comment #11) > Is this code covered by automated tests?: Yes > Needs manual test from QE?: No Setting qe-verify- since it is covered by automated tests.
You need to log in before you can comment on or make changes to this bug.