Closed Bug 1504774 Opened 2 years ago Closed 2 years ago

Crash in shutdownhang | mozilla::SyncRunnable::DispatchToThread


(Toolkit :: Safe Browsing, defect, P1)

65 Branch
Windows 10



Tracking Status
firefox-esr60 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed
firefox66 --- fixed


(Reporter: lizzard, Assigned: dimi)



(Keywords: crash, regression)

Crash Data


(1 file)

Followup bug for crashes in 64 beta. 

+++ This bug was initially created as a clone of Bug #1453038 +++

This bug was filed from the Socorro interface and is
report bp-53ff0449-a421-4467-b2c2-5b1e70180410.

Windows only crash: Some user comments mention continuous crashes.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableCS 
2 kernelbase.dll SleepConditionVariableCS 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::CondVar::Wait xpcom/threads/CondVar.h:68
5 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:70
6 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:98
7 xul.dll nsUrlClassifierDBService::Shutdown toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2539
8 xul.dll nsUrlClassifierDBService::Observe toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2480
9 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112

At least for the crash in comment 0, the crashing thread is a lie.  The crash reason is:

MOZ_CRASH Reason 	MOZ_CRASH(Shutdown hanging before starting.)

and so the crashing thread is something else:

Thread 50, Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog 	toolkit/components/terminator/nsTerminator.cpp:206
1 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:137
3 	ucrtbase.dll 	thread_start<unsigned int (__stdcall*)(void*)> 	
4 	kernel32.dll 	BaseThreadInitThunk 	
5 	mozglue.dll 	patched_BaseThreadInitThunk 	mozglue/build/WindowsDllBlocklist.cpp:838

But the thread that's taking too long to do something, i.e. is stuck, is the main thread waiting on the URL classifier service.

Looking at the first couple reports in:

this seems to be the common pattern.  What is taking so long here?
Flags: needinfo?(francois)
This happened when URL classifier was doing a SafeBrowsing update which copies the blocklist database files to a tempory folder, update them, and then move them back to the original folder.
We have added several checkpoints so we can abort the update as soon as possible while shutting down.
The crash report in Comment 0 is from Firefox 59, that happened while we were copying the entire database directory. That should be fixed in Bug 1481819 which landed in 64.

I'll see if I can find anything by checking crash reports in firefox 64, 65.
Assignee: nobody → dlee
Flags: needinfo?(francois)
In Bug 1453038, |mUpdateInterrupted| is set in Classifer::Close() which is
called by PreShutdown to abort an ongoing update. That doesn't handle
all the cases.

The SafeBrowsing update involves two threads, worker thread, and update
thread. Update thread takes care of most of the update work, when it finishes
its task, it posts a task back to the worker thread to apply the updated database
and also do some cleanup stuff. Then the update is complete.

The fix in Bug 1453038 doesn't abort an update if the woker thread is doing
the job. This is because the |mUpdateInterrupted| flag is set in the
worker thread. The PreShutdown event which eventually sets the flag has to
wait until the worker thread's current task is done.

In this patch:
1. Check nsUrlClassifierDBService::ShutdownHasStarted() to abort shutdown.
This is set by main thread so both worker thread and update thread can
be interrupted now.
2. mIsClosed is now replaced by the mUpdateInterrupted. The semantics of
mUpdateInterrupted is now changed to abort update for any internal APIs
which should cause an update to abort.
3. Remove |mUpdateInterrupted| and |ShutdownHasStarted()| checks and
unify with |ShouldAbort()|
This patch addressed an issue I found while checking this bug. But it doesn't cover all the cases.
To be honest, most of the crashes are pretty strange and I still don't have any clue.
Is this patch ready to land?
(In reply to Julien Cristau [:jcristau] from comment #5)
> Is this patch ready to land?

Yes, I'll test on try again and land if everything goes well.
Pushed by
Fix url-classifier worker thread is not aborted while shutting down. r=francois
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Please request Beta approval on this when you get a chance.
Flags: needinfo?(dlee)
Comment on attachment 9025961 [details]
Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: crash while shutting down

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Medium

Why is the change risky/not risky? (and alternatives if risky): This bug is really related to the timing while shutting down.
It is not easy to verify every scenario.

String changes made/needed: no
Flags: needinfo?(dlee)
Attachment #9025961 - Flags: approval-mozilla-beta?
Comment on attachment 9025961 [details]
Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois

[Triage Comment]
Improved shutdown handling to avoid crashes. Approved for 65.0b8.
Attachment #9025961 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Dimi Lee [:dimi][:dlee] from comment #11)

> Is this code covered by automated tests?: Yes

> Needs manual test from QE?: No

Setting qe-verify- since it is covered by automated tests.
Flags: qe-verify-

Hi Dimi, since ESR60 is affected, should we consider uplifting this patch to ESR60 repo?

Flags: needinfo?(dlee)

(In reply to Ritu Kothari (:ritu) from comment #15)

Hi Dimi, since ESR60 is affected, should we consider uplifting this patch to ESR60 repo?

Hi Ritu,
We can't uplift this to ESR60 because it depends on lots of things we have done for the past 6 months.

Flags: needinfo?(dlee)
You need to log in before you can comment on or make changes to this bug.