
Crash in shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown

RESOLVED WONTFIX

Status

Product: Toolkit
Component: Safe Browsing
Priority: P1
Severity: critical
Status: RESOLVED WONTFIX
Reported: 11 months ago
Last modified: 3 months ago

People

(Reporter: lizzard, Unassigned)

Tracking

Keywords: crash
Version: unspecified
Hardware: x86
OS: Windows XP
Points: ---

Firefox Tracking Flags

(firefox49 affected, firefox50 affected, firefox51- affected, firefox52 affected)

(Reporter)

Description

11 months ago
This bug was filed from the Socorro interface and is report bp-530585c4-21eb-4bef-a9c1-642872161013.
=============================================================

digitarald pointed out a crash spike on release in the past couple of days. I think it may be from this crash signature which has more than doubled since Oct. 11th. (200-300 crashes per day, then over 1000 on Wednesday). That was Patch Tuesday, so maybe this is from a Windows update. 

Crashing thread: 

0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	ZwWaitForSingleObject 	
2 	kernel32.dll 	WaitForSingleObjectEx 	
3 	kernel32.dll 	WaitForSingleObject 	
4 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:525
5 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:79
6 	xul.dll 	nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsEventQueue.cpp:55
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1057
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:290
9 	xul.dll 	nsThread::Shutdown() 	xpcom/threads/nsThread.cpp:931
10 	xul.dll 	nsUrlClassifierDBService::Shutdown() 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:1738
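Reading the stack above: the calling thread is inside nsThread::Shutdown(), processing its own event queue while it waits (presumably for the Safe Browsing worker thread to finish and acknowledge the shutdown), and with nothing left to process it blocks in CondVar::Wait() inside GetEvent(). The following is only a rough, self-contained C++ sketch of that blocking-shutdown shape with hypothetical names, not the actual XPCOM implementation; the plain condition-variable wait stands in for the event-loop spin.

// Rough sketch (hypothetical names): the caller queues a shutdown task on the
// worker thread and then blocks until the worker acknowledges it. If the
// worker is stuck in a long operation, the caller stays blocked in the
// condition-variable wait, which is what the signature above captures.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

struct WorkerQueue {
  std::mutex lock;
  std::condition_variable cv;
  std::queue<std::function<void()>> events;
  bool shutdownRequested = false;  // only touched on the worker thread

  void Dispatch(std::function<void()> e) {
    { std::lock_guard<std::mutex> g(lock); events.push(std::move(e)); }
    cv.notify_all();
  }

  // Runs on the worker thread: one long event (e.g. a database update)
  // delays everything queued behind it, including the shutdown task.
  void Run() {
    while (!shutdownRequested) {
      std::function<void()> e;
      {
        std::unique_lock<std::mutex> g(lock);
        cv.wait(g, [&] { return !events.empty(); });
        e = std::move(events.front());
        events.pop();
      }
      e();  // may take arbitrarily long
    }
  }

  // Runs on the calling (main) thread: queue a shutdown task, then block
  // until the worker has actually processed it. This wait is roughly where
  // the crash reports show the shutting-down thread sitting.
  void Shutdown() {
    std::mutex m;
    std::condition_variable ack;
    bool acked = false;
    Dispatch([&] {
      shutdownRequested = true;  // runs on the worker
      { std::lock_guard<std::mutex> g(m); acked = true; }
      ack.notify_all();
    });
    std::unique_lock<std::mutex> g(m);
    ack.wait(g, [&] { return acked; });  // blocks for as long as the worker is busy
  }
};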
Crash volume for signature 'shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown':
 - nightly (version 52): 6 crashes from 2016-09-19.
 - aurora  (version 51): 6 crashes from 2016-09-19.
 - beta    (version 50): 11 crashes from 2016-09-20.
 - release (version 49): 8106 crashes from 2016-09-05.
 - esr     (version 45): 0 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly       3       0       2       1
 - aurora        2       1       3       0
 - beta          4       3       3       0
 - release    3557    1995    1466     291
 - esr           0       0       0       0

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #314
 - aurora  #2215
 - beta    #3532
 - release #11
 - esr
status-firefox49: --- → affected
status-firefox50: --- → affected
status-firefox51: --- → affected
status-firefox52: --- → affected
Component: Networking → Safe Browsing
Product: Core → Toolkit
Priority: -- → P1

Comment 2

11 months ago
From the crash call stack, this looks like a crash while closing Firefox.
However, according to some of the comments in [1], people seemed to close
the browser because it had become unresponsive, and the crash happened afterward.

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsUrlClassifierDBService%3A%3AShutdown&date=%3E%3D2016-10-20T03%3A07%3A00.000Z&date=%3C2016-10-27T03%3A07%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#comments
After checking the crash reports, I found the shutdown hang may happen in many places.
For example:
1. ApplyUpdate (the most frequent one)
https://crash-stats.mozilla.com/report/index/a5f8bded-3ad1-48f8-a22c-cd4852161027#allthreads (Thread 35)
https://crash-stats.mozilla.com/report/index/dea682e4-baf3-422d-97ed-b0a392161027#allthreads (Thread 28)

2. ApplyUpdate (triggered by CacheCompletions)
https://crash-stats.mozilla.com/report/index/c4d725cf-f31d-4933-a0cb-9ecdd2161027#allthreads (Thread 35)

3. CloseDB
https://crash-stats.mozilla.com/report/index/4a3e3f27-4b1a-4db1-b7f2-03cf92161027#allthreads (Thread 35)

4. Classifier::Open
https://crash-stats.mozilla.com/report/index/ea06837c-ad49-4e34-afcf-136932161027#allthreads (Thread 30)

So I think the shutdown hang is not caused by a deadlock in the Safe Browsing code; it happens because certain operations (especially ApplyUpdate) take a long time, so when shutdown takes longer than usual there is a chance we are still in the middle of closing the Safe Browsing background thread.
From the logs I cannot tell whether the original "hang" is related to Safe Browsing code or not, but there is a lot of file-access code on the SB background thread, and in this bug many cases happen while we are doing I/O. So I/O may be one reason shutdown takes too long, but I am not sure.
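One mitigation pattern that fits this analysis is to have the long-running work on the background thread periodically check a shutdown flag so it can bail out early instead of blocking the queued shutdown event. The sketch below uses hypothetical names and is not necessarily what bug 1315386 ended up doing.

// Sketch with hypothetical names: chunked work on the background thread
// checks an atomic "shutting down" flag between chunks and aborts early, so
// the shutdown event queued behind it can run within the timeout budget.
#include <atomic>
#include <vector>

std::atomic<bool> gShuttingDown{false};  // hypothetical; set when shutdown starts

// Hypothetical stand-in for a long operation such as ApplyUpdate().
bool ApplyUpdateInChunks(const std::vector<std::vector<char>>& chunks) {
  for (const auto& chunk : chunks) {
    if (gShuttingDown.load(std::memory_order_relaxed)) {
      return false;  // abandon the partial update; it can be retried on next startup
    }
    // ... apply `chunk` to the on-disk store (the expensive I/O the reports
    // above point at) ...
    (void)chunk;
  }
  return true;
}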
Blocks: 1305486
Hi Francois,
I think this bug is not related to V4 because lots of these cases happened before we landed the V4 code.
Is there any reason you think this blocks bug 1305486?
Flags: needinfo?(francois)
(In reply to Dimi Lee[:dimi][:dlee] from comment #5)
> Is there any reason you think this blocks bug 1305486 ?

I just want to avoid enabling any of the new code (V4) while we figure out whether or not the existing SB code is responsible for this hang. According to your investigation (comment 3 in particular), it doesn't look like it.

I've asked Tanvi and Dan if they have experience debugging this or if they know someone who could help us figure out what we should do with this bug.
Flags: needinfo?(francois)
Depends on: 1315140

Comment 7

11 months ago
From what I can infer from [0][1], the story is likely to be:

1) Firefox hung for some reason. Maybe Safe Browsing, maybe not.
2) Users closed the browser because they had waited too long.
3) However, even shutdown then took more than 63 seconds, because of
   nsUrlClassifierDBService::Shutdown().
4) Therefore MOZ_CRASH() was called in [1] with the reason

"Shutdown too long, probably frozen, causing a crash."

We are probably not handling nsUrlClassifierDBService::Shutdown() very well
in some edge cases.

[0] https://crash-stats.mozilla.com/report/index/4a3e3f27-4b1a-4db1-b7f2-03cf92161027#tab-rawdump
[1] https://hg.mozilla.org/releases/mozilla-release/annotate/7356baae8e73/toolkit/components/terminator/nsTerminator.cpp#l158
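For context, the mechanism in [1] is a shutdown watchdog: if shutdown does not make progress within a time budget, the process is deliberately crashed so that a report gets filed. A very rough sketch of the idea follows, with hypothetical names and the ~63-second budget mentioned in step 3; the real nsTerminator tracks shutdown phases and is more involved.

// Very simplified watchdog sketch (hypothetical names, not the real code):
// a dedicated thread counts how long the current shutdown phase has been
// running and deliberately crashes the process once the budget is exceeded,
// producing the "shutdownhang" reports.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

std::atomic<int> gSecondsInCurrentPhase{0};  // reset each time shutdown advances a phase
constexpr int kShutdownBudgetSeconds = 63;   // budget mentioned above

void ShutdownWatchdog() {
  for (;;) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    if (++gSecondsInCurrentPhase > kShutdownBudgetSeconds) {
      // The real code calls MOZ_CRASH("Shutdown too long, probably frozen,
      // causing a crash."); std::abort() stands in for it here.
      std::abort();
    }
  }
}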
Depends on: 1315386
(In reply to Henry Chang [:henry][:hchang] from comment #7)
> Probably we are not dealing with nsUrlClassifierDBService::Shutdown() very
> well in some edge cases.

I put some ideas of what we could add in bug 1315386.
What happened on Aug 30? There are no crashes before then (zero), and then suddenly 17 crashes on Aug 31, settling out in the mid double digits per day. At the end of September it ramps up to ~300 per day over the course of a week or two, matching pretty well with the September 20 release of Fx49. Also, I don't see any crashes in versions older than 49.

But the crashes didn't start when 49 moved to beta (Aug 2 or so); the jump was abrupt and mid-cycle.

https://ashughes1.github.io/bugzilla-socorro-lens/chart.htm?s=shutdownhang%20|%20mozilla::CondVar::Wait%20|%20nsEventQueue::GetEvent%20|%20nsThread::ProcessNextEvent%20|%20NS_ProcessNextEvent%20|%20nsThread::Shutdown%20|%20nsUrlClassifierDBService::Shutdown

(mouseover the graph to get numbers. If you find that useful check out Anthony's awesome add-on https://addons.mozilla.org/en-US/firefox/addon/bugzilla-socorro-lens/ which will put that right in the bug itself for you)

All the crashes with that signature since August 1st:
https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsUrlClassifierDBService%3A%3AShutdown&date=%3E%3D2016-08-01T06%3A09%3A00.000Z&date=%3C2016-11-04T06%3A09%3A00.000Z

At first I was thinking we should look for whatever got uplifted to beta around then, but looking closely it's clear that the crashes started abruptly on that date but affected much older versions of 49 (nightly builds, dev-edition, and lots of that first Aug-2 beta build that worked fine for most of the month). That suggests something external triggered the crashes. Might still be a bug in our code, of course, but might help narrow it down if we could figure out what exposed it.

Did the SafeBrowsing service change something on that date?

Did the total number of shutdown kills go way up unrelated to safebrowsing, and it just increased the odds that the user was in the middle of an update?

I really don't have any idea how to debug this but I hope these clues are helpful in some way.
Searching crash reports [0] for the signature "shutdownhang | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown", you will see:

1 crash each on 5/23, 7/31 and 8/30,
and then 256 crashes on 8/31, across different versions (45, 46, 47, 48).
(This should be the same issue as this bug; for some reason, versions prior to 49 don't show the call-stack frames "mozilla::CondVar::Wait | nsEventQueue::GetEvent".)

So I'll assume that something which happened on 8/31 (as Dan suspected) caused this bug.
It should not be related to anything that was uplifted to or landed in 49, since it occurs across different versions.

So maybe a Windows update? Or did the Safe Browsing server change some behavior? I'll see if I can find anything...

[0] https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsUrlClassifierDBService%3A%3AShutdown&date=%3E%3D2016-01-01T07%3A49%3A00.000Z&date=%3C2016-12-28T07%3A49%3A00.000Z#graphs
From the Windows update history and the crash reports:

Windows 10:
https://support.microsoft.com/en-us/help/12387

Windows 7:
https://support.microsoft.com/en-us/help/22801/windows-7-and-windows-server-2008-r2-update-history

I don't find any connection between a Windows update and the crash spike (which starts on 8/31).

Comment 12

11 months ago
Setting tracking-firefox51 to "-" as the volume of crashes is low.
tracking-firefox51: --- → -
Ah, I finally know what happened on Aug 31...
Before Aug 31, crash-stats reported shutdown hangs with the following signature

nsThread::ProcessNextEvent
  NS_ProcessNextEvent
    nsThread::Shutdown
       ........

as one report, so shutdown hangs caused by different background threads were all classified under that single signature. You can check this link [1] to see that that signature never occurs after Aug 31.

Correspondingly, crash reports for shutdown hangs caused by specific background threads (with their own, deeper signatures) only begin to occur after Aug 31, for example nsUrlClassifierDBService [2], nsSocketTransportService [3], etc.

So this bug has actually existed for a long time; the reason it looks like it only started to appear recently is just how crash-stats signature bucketing works.

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&platform=Windows&signature=shutdownhang%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown&date=%3E%3D2016-05-08T09%3A08%3A30.000Z&date=%3C2016-11-08T09%3A08%3A30.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&_sort=&page=1#graphs
[2] https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsUrlClassifierDBService%3A%3AShutdown&date=%3E%3D2016-05-08T09%3A29%3A39.000Z&date=%3C2016-11-08T09%3A29%3A39.000Z#graphs
[3] https://crash-stats.mozilla.com/signature/?_sort=-date&signature=shutdownhang%20%7C%20_PR_MD_WAIT_CV%20%7C%20_PR_WaitCondVar%20%7C%20mozilla%3A%3ACondVar%3A%3AWait%20%7C%20nsEventQueue%3A%3AGetEvent%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsSocketTransportService%3A%3AShutdown&date=%3E%3D2016-11-01T09%3A30%3A00.000Z&date=%3C2016-11-08T09%3A30%3A00.000Z#graphs
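To make the bucketing change concrete (an illustration based only on the signatures already quoted in this bug): before Aug 31, a report with the stack from comment 0 would have been filed under the generic signature

shutdownhang | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown

alongside every other background thread that hung during shutdown. After Aug 31, the deeper frames are kept and the same report is filed under

shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown

so an old crash suddenly shows up as a "new", per-component signature starting on that date.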
FWIW
https://crash-stats.mozilla.com/signature/?signature=shutdownhang%20%7C%20PR_Wait%20%7C%20nsThread%3A%3AProcessNextEvent%20%7C%20NS_ProcessNextEvent%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsUrlClassifierDBService%3A%3AShutdown&date=%3E%3D2016-05-08T10%3A35%3A33.000Z&date=%3C2016-11-08T10%3A35%3A33.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=version&_sort=date&_sort=&page=1#graphs

It also occurred on various versions from 35 to 44 (even in ESR versions).

Comment 15

11 months ago
(In reply to Dimi Lee[:dimi][:dlee] from comment #13)
> Ah, I finally know what happened on Aug 31...
> Before Aug 31, crash-stats reported shutdown hangs with the following signature
> nsThread::ProcessNextEvent
>   NS_ProcessNextEvent
>     nsThread::Shutdown
>        ........
> as one report, so shutdown hangs caused by different background threads were
> all classified under that single signature. You can check this link [1] to
> see that that signature never occurs after Aug 31.

Dimi, I am a bit confused.  Please help me to clarify it.

Do you mean that the crash reports with this signature but caused by different threads were aggregated into a single report, which is the original report of this bug, "after" the date August 31?

Comment 16

11 months ago
Clarified with Dimi offline.

For some reason, starting from August 31, 2016, crashes with the signature
"shutdownhang | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown"
had more information added (the signatures were extended with deeper call-stack frames)
and were consequently separated into different crash signatures.

This explains why the crash reports for this bug only start on August 31.
(https://crash-stats.mozilla.com/report/index/530585c4-21eb-4bef-a9c1-642872161013)
In reality, this crash has existed for a long time.

We believe that bug 1315386 could alleviate this crash, but it will be difficult to eliminate it completely.
(In reply to Daniel Veditz [:dveditz] from comment #9)
> Did the SafeBrowsing​ service change something on that date?

I checked with Google and there were no changes or service interruptions / slowdowns on the server-side around that time.
No longer blocks: 1305486
Now that bug 1315386 has landed and the crash rate is low, let's close this bug.
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → WONTFIX

Updated

7 months ago
Crash Signature: [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown] → [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown] [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsT…
FWIW this looks like it's the #1 browser crash on 52.0esr.

Updated

5 months ago
Crash Signature: [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown] [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsT… → [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsThread::ProcessNextEvent | NS_ProcessNextEvent | nsThread::Shutdown | nsUrlClassifierDBService::Shutdown] [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent | nsT…
(In reply to Julien Cristau [:jcristau] from comment #19)
> FWIW this looks like it's the #1 browser crash on 52.0esr.

https://hg.mozilla.org/releases/mozilla-esr52/file/tip/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp

It appears that the fix from bug 1315386 didn't land on esr52.
Flags: needinfo?(tnguyen)
Perhaps it was considered a bit risky at the time we landed bug 1315386, and we decided not to push it to ESR to avoid any side effects.
However, given the last several months, we probably have to rethink that and push the fix to ESR due to the high crash rate.
Francois, do you agree with pushing it to ESR?
Flags: needinfo?(tnguyen) → needinfo?(francois)
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #21)
> Perhaps it was considered a bit risky at the time we landed bug 1315386,
> and we decided not to push it to ESR to avoid any side effects.
> However, given the last several months, we probably have to rethink that
> and push the fix to ESR due to the high crash rate.
> Francois, do you agree with pushing it to ESR?

I think there were a couple of follow-up commits that we'd need to backport too.

I'm not sure what the policy is for what we should be backporting to ESR. Julien, are shutdown crashes something we generally want to address on ESR, or do we normally restrict fixes to sec-high bugs and the like?
Flags: needinfo?(francois) → needinfo?(jcristau)
Sorry for dropping this on the floor.  Probably not worth the risk to backport to esr.
Flags: needinfo?(jcristau)