Closed Bug 1143866 Opened 5 years ago Closed 3 years ago

crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()

Categories

(Core :: DOM: Core & HTML, defect, critical)

x86
Windows
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox36 --- affected
firefox37 --- affected
firefox38 --- affected
firefox39 --- affected
firefox42 --- affected
firefox43 --- affected
firefox44 --- affected
firefox45 --- affected
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox-esr38 --- affected
firefox-esr45 --- affected
firefox50 --- affected
firefox51 --- affected

People

(Reporter: ashughes, Unassigned)

References

()

Details

(Keywords: crash, steps-wanted, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-593574cb-90c4-430d-a5a2-2220c2150309.
=============================================================
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 	toolkit/components/terminator/nsTerminator.cpp
1 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
3 	msvcr120.dll 	_callthreadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
4 	msvcr120.dll 	msvcr120.dll@0x2c000 	
5 	kernel32.dll 	BaseThreadInitThunk 	
6 	ntdll.dll 	__RtlUserThreadStart 	
7 	ntdll.dll 	_RtlUserThreadStart 	
=============================================================
More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown%28%29#tab-reports

While there have been over 1600 reports of this crash over the last week, I do not believe this qualifies it for topcrash status.

Correlations:
* Top platform is Windows 7 @ 51%
* Top product is Firefox 36.0.1 @ 77%
* Top 10 GPUs are all Intel
* Top URLs are facebook, youtube, gmail, and twitter
* Comments: several people mention this occurs when restoring from a sleep/hibernate state

Florin, do you think you have time to see if you can shake out steps to reproduce?
Flags: needinfo?(florin.mezei)
QA Whiteboard: [triaged]
Maybe Honza has some ideas.
Flags: needinfo?(honzab.moz)
We figured out earlier today that the "crashing thread" stack for shutdownhangs in Socorro does not match that of the reported signature. To get this you need to drill down to the other threads in the report. I have filed a Socorro bug to rectify this.

For clarity, here is the stack for the matching signature in this bug:
0 	ntdll.dll 	NtWaitForSingleObject 	
1 	kernelbase.dll 	WaitForSingleObjectEx 	
2 	kernel32.dll 	WaitForSingleObjectExImplementation 	
3 	kernel32.dll 	WaitForSingleObject 	
4 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c
5 	nss3.dll 	PR_JoinThread 	nsprpub/pr/src/threads/combined/pruthr.c
6 	xul.dll 	mozilla::dom::DOMStorageDBThread::Shutdown() 	dom/storage/DOMStorageDBThread.cpp
7 	xul.dll 	mozilla::dom::DOMStorageCache::StopDatabase() 	dom/storage/DOMStorageCache.cpp
8 	xul.dll 	mozilla::dom::DOMStorageObserver::Observe(nsISupports*, char const*, wchar_t const*) 	dom/storage/DOMStorageObserver.cpp
9 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, wchar_t const*) 	xpcom/ds/nsObserverList.cpp
10 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, wchar_t const*) 	xpcom/ds/nsObserverService.cpp
11 	xul.dll 	nsXREDirProvider::DoShutdown() 	toolkit/xre/nsXREDirProvider.cpp
12 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp
13 	xul.dll 	ScopedXPCOMStartup::`scalar deleting destructor'(unsigned int) 	
14 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp
15 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp
16 	firefox.exe 	do_main 	browser/app/nsBrowserApp.cpp
17 	firefox.exe 	NS_internal_main(int, char**) 	browser/app/nsBrowserApp.cpp
18 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp
19 	firefox.exe 	__tmainCRTStartup 	f:/dd/vctools/crt/crtw32/startup/crt0.c
20 	kernel32.dll 	BaseThreadInitThunk 	
21 	ntdll.dll 	__RtlUserThreadStart 	
22 	ntdll.dll 	_RtlUserThreadStart
Hmm.. it actually "hangs" on the DOMStorage IO thread.  I see two cases:

1. mozilla::storage::Connection::Close -> NtClose
2. mozilla::storage::Connection::Close -> arena_dalloc_large

I've encountered two other bugs (1124880 and 1131330) that hang in those two functions as well.

We either have a global problem here or it's just another individual case we have to handle separately.

I'm on both of those other bugs already, so I may find some conclusion there.
Flags: needinfo?(honzab.moz)
See Also: → 1124880, 1131330
We tried to reproduce the issue today on two separate machines (Windows 7 x64 and x86) using info from comment 0, but we could not get the crash. Also we have not encountered this during our testing last week. At this point it seems highly unlikely that we could find steps to reproduce this.
Flags: needinfo?(florin.mezei)
I don't know when, but in the last weeks there was a patch checked in that fix a lot of the [OOM | small] crashes. So for me - I had this signature often before - it is nearly impossible to crash FF with this signature again.

Can somebody figure out when this patch was checked in and if the [shutdownhang] and [WaitForSingleObjectEx] crashes rose after that date ???

If so, I guess that it is (nearly) the same problem as described from me here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37

In short words:
FF now crashes no more on complex JS with a [OOM | small] (e.g. by Facebook, Google, Youtube, Twitter) because FF have now a patch that prevent FF from crashing with this signature, but now FF is not able to handle the cases that have FF crashed before.
As it normally only happens on Windows and there the memory goes very often by this pages "crazy", I guess it is a Garbage Collection problem (in different variations).
Solution should be IMHO to analyze mem-reports again, find the problems, fix GC and look if the crash signatures go down again.
Some "new" signature I had in last time:

[@ shutdownhang | ReleaseData(void*, unsigned int) ]

[@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_InvokeByIndex ]

[@ shutdownhang | _VEC_memset ]

[@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | F_1152915508___________________________________ ]

[@ shutdownhang | RtlEnterCriticalSection | PR_Lock | mozilla::net::CacheFile::~CacheFile()]

[OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xrealloc | mozilla::CycleCollectedJSRuntime::DeferredFinalize(nsISupports*)]

[@ js::gc::AllocateNonObject<JSFatInlineString, int>(js::ExclusiveContext*)]



I attach two mem-reports from the last days that were auto-created by FF but didn't send out because FF was no more crashing.

Btw.: FF hangs during creating this reports but don't send them out anymore because FF crash no more. Think there should be something changed again ...

The second archive is incomplete because it seems I killed the task before FF was able to finish the report.
But I don't think that FF would ever finished the report because the file in the archive would have - when complete - over 3GB !!!
Some more infos:
- I never use sleep/hibernate.
- Plz try to open in Facebook e.g. a group, where at the right side the advertisements and suggestions reload all the time. Then let it run for some time (some hours).
Tobias, unless you are sure your reports belong to this *specific* shutdownhang in DOMStorageDBThread shutdown, please put your comments on a separate bug report instead.

Also, there are no good reasons I know of to believe that OOM|small would have anything to do with any shutdownhangs. The former means we were out of (virtual) memory, the latter means Firefox was already being (intentionally) closed and could not finish that in more than a minute of time.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Tobias, unless you are sure your reports belong to this *specific*
> shutdownhang in DOMStorageDBThread shutdown, please put your comments on a
> separate bug report instead.
> 
> Also, there are no good reasons I know of to believe that OOM|small would
> have anything to do with any shutdownhangs. The former means we were out of
> (virtual) memory, the latter means Firefox was already being (intentionally)
> closed and could not finish that in more than a minute of time.

Plz read again:

(In reply to Tobias B. Besemer from comment #5)
> I don't know when, but in the last weeks there was a patch checked in that
> fix a lot of the [OOM | small] crashes. So for me - I had this signature
> often before - it is nearly impossible to crash FF with this signature again.
> 
> Can somebody figure out when this patch was checked in and if the
> [shutdownhang] and [WaitForSingleObjectEx] crashes rose after that date ???
> 
> If so, I guess that it is (nearly) the same problem as described from me
> here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37
> 
> In short words:
> FF now crashes no more on complex JS with a [OOM | small] (e.g. by Facebook,
> Google, Youtube, Twitter) because FF have now a patch that prevent FF from
> crashing with this signature, but now FF is not able to handle the cases
> that have FF crashed before.
> As it normally only happens on Windows and there the memory goes very often
> by this pages "crazy", I guess it is a Garbage Collection problem (in
> different variations).
> Solution should be IMHO to analyze mem-reports again, find the problems, fix
> GC and look if the crash signatures go down again.

Maybe you should read: https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37
Robert, I comment on this bug because of "See Also: 1131330 Remove" and "Keywords: steps-wanted".

Bug 1131330 was also "steps-wanted" and you can find them here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37

I was able to reproduce it because I had an idea what should be the reason for the crash and so I guess what is the reason of the crash. ;-)
They still do not have anything to do with each other. You are speculating.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #13)
> They still do not have anything to do with each other. You are speculating.

OK, I speculating! ;-)

But the crash seems to be gone in the last days, that's fine, or? :-)
https://crash-stats.mozilla.com/report/list?signature=shutdownhang%20|%20WaitForSingleObjectEx%20|%20WaitForSingleObject%20|%20PR_WaitCondVar%20|%20PR_JoinThread%20|%20mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown%28%29#tab-table
OS: Windows NT → Windows
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::…
*arg* Signature have changed ... :-/

https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=28&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown

Last 28 days:
Operating System Percentage Number Of Crashes
Windows 7 	59.66% 	2181
Windows XP 	14.28% 	522
Windows 8.1 	13.48% 	493
Windows Vista 	6.32% 	231
Windows 10 	4.18% 	153
Windows 8 	2.02% 	74
Windows Unknown 0.06% 	2

18 comments, just one useful:
part of the problem is that time warner wifi service is slower that the old telephone dialup service. they ae the only game in town so I am stuck with them
Submitted: 2015-11-29T13:57:43+00:00
https://crash-stats.mozilla.com/report/index/d2fbe824-e635-4a7b-9175-f37332151129

Last report on FF44.0a2:
https://crash-stats.mozilla.com/report/index/85e33bee-3ef1-46b2-8979-854cc2151202
Crashing Thread
Frame 	Module 		Signature 	Source
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*)
 				toolkit/components/terminator/nsTerminator.cpp
1 	nss3.dll 	_PR_NativeRunThread
 				nsprpub/pr/src/threads/combined/pruthr.c
2 	nss3.dll 	pr_root
			 	nsprpub/pr/src/md/windows/w95thred.c
3 	msvcr120.dll 	_callthreadstartex
			 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
4 	msvcr120.dll 	msvcr120.dll@0x2c000 	
5 	kernel32.dll 	BaseThreadStart

Seems the Crashing Thread is always the same.
Status: RESOLVED → REOPENED
Keywords: topcrash-win
Resolution: WORKSFORME → ---
Component: DOM → Hardware Abstraction Layer (HAL)
See Also: → 1144718
Component: Hardware Abstraction Layer (HAL) → DOM
It's been a year since you were looking at those other bugs, and this is still here ...
Flags: needinfo?(honzab.moz)
I've spend a lot of time on this.  We all did in necko, also asking around etc.  No idea where from hangs in ntclose() and other IO operations are coming.  Could be just a slow hard drive and just a coincidence the watchdog catches it in NTClose (where it may hang for few hundred ms).  I also start thinking that _VEC_memset is just hanging because of system swaps, so again - blocked on io.

True is that we should not crash.  But we are crashing artificially...  I start thinking that the shutdown watchdog is a bad thing to do by only one measure.  Apparently slower systems should have the delay longer.


To sum, no, I personally don't have any new data here.  Sorry.
Flags: needinfo?(honzab.moz)
BTW, the hangs are inside mozstorage.  You should target your questions to someone who maintains that code, like mak77.
Dropping in as a mozStorage peer... the NtClose thing seems like this is a system issue, but indeed there seem to be a lot of other indications in other stacks that indicate I/O issues, like the mozStorage stack being in a call to FlushFileBuffers, etc.  There seem to be two major possibilities:

* Anti-virus issues.  Brief searching brought up http://stackoverflow.com/questions/24169830/how-can-i-work-out-which-process-thread-owns-the-resource-that-my-program-is-han where someone who seems to know their windows kernels investigation stuff identified Symantec anti-virus as a problem with NtClose.
* I/O backlog (which can also be compounded with anti-virus).  It probably also doesn't help that closes on Windows can be excessively slow, as :gps identified at http://gregoryszorc.com/blog/2015/10/22/append-i/o-performance-on-windows/

Anti-virus wise, does Socorro know whether there is an active anti-virus on the system and what it is?

I/O backlog wise... could the watchdog maybe do something crazy like join on a thread that issues one or more writes to the profile directory followed by a FlushFileBuffer invocation?  The idea would be that the watchdog wouldn't kill the process if I/O is hopelessly backlogged while also grabbing a measurement point about the I/O status if it does force a crash.  It would be very useful to know that it took 5 seconds to issue a write and a flush when the forced crash happens, for example.  (Or even if there's just a number that indicates how long the crash-dump process took from start to finish if we have that?)
(In reply to Andrew Sutherland [:asuth] from comment #20)
> Anti-virus wise, does Socorro know whether there is an active anti-virus on
> the system and what it is?

Only if they have loaded a library into our process or installed and LSP.
Crash volume for signature 'shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown':
 - nightly (version 50): 3 crashes from 2016-06-06.
 - aurora  (version 49): 41 crashes from 2016-06-07.
 - beta    (version 48): 1774 crashes from 2016-06-06.
 - release (version 47): 11676 crashes from 2016-05-31.
 - esr     (version 45): 749 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          2          1          0
 - aurora           8          4          8          6          6          7          1
 - beta           224        239        207        274        345        327         83
 - release       1814       1918       1724       1706       1764       1725        490
 - esr             81         94         71         64        111         88         64

Affected platform: Windows
Crash volume for signature 'shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown':
 - nightly (version 51): 2 crashes from 2016-08-01.
 - aurora  (version 50): 6 crashes from 2016-08-01.
 - beta    (version 49): 389 crashes from 2016-08-02.
 - release (version 48): 477 crashes from 2016-07-25.
 - esr     (version 45): 967 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       1       1
 - aurora        0       5       1
 - beta        103     222      64
 - release     107     275      95
 - esr          47      89      83

Affected platform: Windows

Crash rank on the last 7 days:
             Browser   Content   Plugin
 - nightly
 - aurora
 - beta
 - release
 - esr
Status: REOPENED → RESOLVED
Closed: 4 years ago3 years ago
Resolution: --- → WORKSFORME
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.