Crash in shutdownhang | mozilla::SyncRunnable::DispatchToThread

RESOLVED FIXED in Firefox 65

Status

()

defect
P1
critical
RESOLVED FIXED
7 months ago
4 months ago

People

(Reporter: lizzard, Assigned: dimi)

Tracking

({crash, regression})

65 Branch
mozilla66
x86
Windows 10
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr60 wontfix, firefox63 wontfix, firefox64 wontfix, firefox65 fixed, firefox66 fixed)

Details

(crash signature)

Attachments

(1 attachment)

Followup bug for crashes in 64 beta. 

+++ This bug was initially created as a clone of Bug #1453038 +++

This bug was filed from the Socorro interface and is
report bp-53ff0449-a421-4467-b2c2-5b1e70180410.
=============================================================

Windows only crash: https://bit.ly/2HqEfF4. Some user comments mention continuous crashes.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableCS 
2 kernelbase.dll SleepConditionVariableCS 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::CondVar::Wait xpcom/threads/CondVar.h:68
5 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:70
6 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:98
7 xul.dll nsUrlClassifierDBService::Shutdown toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2539
8 xul.dll nsUrlClassifierDBService::Observe toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:2480
9 xul.dll nsObserverList::NotifyObservers xpcom/ds/nsObserverList.cpp:112

=============================================================
At least for the crash in comment 0, the crashing thread is a lie.  The crash reason is:

MOZ_CRASH Reason 	MOZ_CRASH(Shutdown hanging before starting.)

and so the crashing thread is something else:

Thread 50, Name: Shutdown Hang Terminator
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog 	toolkit/components/terminator/nsTerminator.cpp:206
1 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:137
3 	ucrtbase.dll 	thread_start<unsigned int (__stdcall*)(void*)> 	
4 	kernel32.dll 	BaseThreadInitThunk 	
5 	mozglue.dll 	patched_BaseThreadInitThunk 	mozglue/build/WindowsDllBlocklist.cpp:838

But the thread that's taking too long to do something, i.e. is stuck, is the main thread waiting on the URL classifier service.

Looking at the first couple reports in:

https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3ASyncRunnable%3A%3ADispatchToThread&date=%3E%3D2018-11-01T10%3A36%3A06.000Z&date=%3C2018-11-08T09%3A36%3A06.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-version&_sort=-date&page=1

this seems to be the common pattern.  What is taking so long here?
Flags: needinfo?(francois)
Assignee

Comment 2

6 months ago
This happened when URL classifier was doing a SafeBrowsing update which copies the blocklist database files to a tempory folder, update them, and then move them back to the original folder.
 
We have added several checkpoints so we can abort the update as soon as possible while shutting down.
The crash report in Comment 0 is from Firefox 59, that happened while we were copying the entire database directory. That should be fixed in Bug 1481819 which landed in 64.

I'll see if I can find anything by checking crash reports in firefox 64, 65.
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Flags: needinfo?(francois)
Assignee

Comment 3

6 months ago
In Bug 1453038, |mUpdateInterrupted| is set in Classifer::Close() which is
called by PreShutdown to abort an ongoing update. That doesn't handle
all the cases.

The SafeBrowsing update involves two threads, worker thread, and update
thread. Update thread takes care of most of the update work, when it finishes
its task, it posts a task back to the worker thread to apply the updated database
and also do some cleanup stuff. Then the update is complete.

The fix in Bug 1453038 doesn't abort an update if the woker thread is doing
the job. This is because the |mUpdateInterrupted| flag is set in the
worker thread. The PreShutdown event which eventually sets the flag has to
wait until the worker thread's current task is done.

In this patch:
1. Check nsUrlClassifierDBService::ShutdownHasStarted() to abort shutdown.
This is set by main thread so both worker thread and update thread can
be interrupted now.
2. mIsClosed is now replaced by the mUpdateInterrupted. The semantics of
mUpdateInterrupted is now changed to abort update for any internal APIs
which should cause an update to abort.
3. Remove |mUpdateInterrupted| and |ShutdownHasStarted()| checks and
unify with |ShouldAbort()|
Assignee

Comment 4

6 months ago
This patch addressed an issue I found while checking this bug. But it doesn't cover all the cases.
To be honest, most of the crashes are pretty strange and I still don't have any clue.
Is this patch ready to land?
Assignee

Comment 6

5 months ago
(In reply to Julien Cristau [:jcristau] from comment #5)
> Is this patch ready to land?

Yes, I'll test on try again and land if everything goes well.

Comment 8

5 months ago
Pushed by dlee@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5b1c8bfb5cd2
Fix url-classifier worker thread is not aborted while shutting down. r=francois

Comment 9

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5b1c8bfb5cd2
Status: ASSIGNED → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Please request Beta approval on this when you get a chance.
Flags: needinfo?(dlee)
Assignee

Comment 11

5 months ago
Comment on attachment 9025961 [details]
Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: crash while shutting down

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Medium

Why is the change risky/not risky? (and alternatives if risky): This bug is really related to the timing while shutting down.
It is not easy to verify every scenario.

String changes made/needed: no
Flags: needinfo?(dlee)
Attachment #9025961 - Flags: approval-mozilla-beta?
Comment on attachment 9025961 [details]
Bug 1504774 - Fix url-classifier worker thread is not aborted while shutting down. r?francois

[Triage Comment]
Improved shutdown handling to avoid crashes. Approved for 65.0b8.
Attachment #9025961 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Dimi Lee [:dimi][:dlee] from comment #11)

> Is this code covered by automated tests?: Yes

> Needs manual test from QE?: No

Setting qe-verify- since it is covered by automated tests.
Flags: qe-verify-

Hi Dimi, since ESR60 is affected, should we consider uplifting this patch to ESR60 repo?

Flags: needinfo?(dlee)
Assignee

Comment 16

4 months ago

(In reply to Ritu Kothari (:ritu) from comment #15)

Hi Dimi, since ESR60 is affected, should we consider uplifting this patch to ESR60 repo?

Hi Ritu,
We can't uplift this to ESR60 because it depends on lots of things we have done for the past 6 months.

Flags: needinfo?(dlee)

Thanks! In that case, let's update this to a wontfix.

You need to log in before you can comment on or make changes to this bug.