Closed Bug 705761 Opened 13 years ago Closed 9 years ago

chromehang | ZwQueryFullAttributesFile in nsDiskCacheDevice::Shutdown()

Categories

(Core :: Networking: Cache, defect)

x86_64
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: jimm, Assigned: michal)

References

Details

(Keywords: hang, Whiteboard: [Snappy:P1])

Crash Data

Attachments

(1 file)

All appear to be in the same place, occurring while deleting a "trash" folder.


0|0|ntdll.dll|ZwQueryFullAttributesFile|||0xa 0|1|KERNELBASE.dll||||0x81cc 0|2|xul.dll|nsAString_internal::ReplacePrepInternal(unsigned int,unsigned int,unsigned int,unsigned int)|hg:hg.mozilla.org/mozilla-central:xpcom/string/src/nsTSubstring.cpp:c58bad0b4640|209|0x10 0|3|xul.dll|nsAString_internal::Replace(unsigned int,unsigned int,nsSubstringTuple const &)|hg:hg.mozilla.org/mozilla-central:xpcom/string/src/nsTSubstring.cpp:c58bad0b4640|530|0xf 0|4|xul.dll|nsAString_internal::Assign(wchar_t const *,unsigned int)|hg:hg.mozilla.org/mozilla-central:xpcom/string/src/nsTSubstring.cpp:c58bad0b4640|336|0x15 0|5|xul.dll|GetFileInfo|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|460|0xf 0|6|KERNELBASE.dll||||0x13c5 0|7|xul.dll|nsLocalFile::ResolveAndStat()|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|814|0xd 0|8|xul.dll|nsLocalFile::AppendInternal(nsString const &,bool)|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|1107|0x44 0|9|xul.dll|IsShortcutPath|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|320|0x4b 0|10|xul.dll|nsLocalFile::IsDirectory(bool *)|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|2471|0x4 0|11|xul.dll|nsLocalFile::Remove(bool)|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|1833|0x13 0|12|xul.dll|nsCOMPtr_base::assign_from_qi(nsQueryInterface,nsID const &)|hg:hg.mozilla.org/mozilla-central:obj-firefox/xpcom/build/nsCOMPtr.cpp:c58bad0b4640|96|0x18 0|13|xul.dll|nsDirEnumerator::HasMoreElements(bool *)|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|652|0x28 0|14|xul.dll|nsCOMPtr<nsIFile>::nsCOMPtr<nsIFile>(nsQueryInterface)|hg:hg.mozilla.org/mozilla-central:obj-firefox/dist/include/nsCOMPtr.h:c58bad0b4640|623|0xe 0|15|xul.dll|nsLocalFile::Remove(bool)|||0xc2 0|16|xul.dll|nsLocalFile::Remove(bool)|||0xc2 0|17|xul.dll|DeleteDirThreadFunc|hg:hg.mozilla.org/mozilla-central:netwerk/cache/nsDeleteDir.cpp:c58bad0b4640|53|0xe 0|18|xul.dll|DeleteDir(nsIFile *,bool,bool,unsigned int)|hg:hg.mozilla.org/mozilla-central:netwerk/cache/nsDeleteDir.cpp:c58bad0b4640|115|0x7 0|19|xul.dll|GetTrashDir(nsIFile *,nsCOMPtr<nsIFile> *)|hg:hg.mozilla.org/mozilla-central:netwerk/cache/nsDeleteDir.cpp:c58bad0b4640|144|0xf 0|20|xul.dll|nsLocalFile::Exists(bool *)|hg:hg.mozilla.org/mozilla-central:xpcom/io/nsLocalFileWin.cpp:c58bad0b4640|2252|0x7 0|21|xul.dll|nsDiskCacheDevice::Shutdown()|||0x11
Crash Signature: [@ chromehang | ZwQueryFullAttributesFile ] → [@ chromehang | ZwQueryFullAttributesFile ] [@ chromehang | ZwClose ] [@ chromehang | NtClose ]
Those last two signatures include some other cases, but profile shutdown and the deleting of this folder appear to be the most common.
Most, but not all, of the dozen hangs I looked at in NtQueryFullAttributesFile are also in ::Shutdown().
Crash Signature: [@ chromehang | ZwQueryFullAttributesFile ] [@ chromehang | ZwClose ] [@ chromehang | NtClose ] → [@ chromehang | ZwQueryFullAttributesFile ] [@ chromehang | ZwClose ] [@ chromehang | NtClose ] [@ chromehang | NtQueryFullAttributesFile]
Severity: normal → critical
Keywords: hang
Stack in comment one is due to browser invalidating our disk cache. Is there a reason this bug isn't under networking?
Nevermind, it's something in netwerk that removes .Trash, no idea what that directory is used for
Component: XPCOM → Networking: Cache
QA Contact: xpcom → networking.cache
Whiteboard: [Snappy:P1]
(In reply to Taras Glek (:taras) from comment #5)
> Nevermind, it's something in netwerk that removes .Trash, no idea what that
> directory is used for

That's nsDeleteDir, isn't it? (ref bug 670911)
Is this on trunk or a release?  Michal fixed a race condition bug with the trash deletion that IIRC should be in aurora at this point.  Might be a dupe of bug 695003?
Oh, I'm wrong--bug 695003 hasn't landed yet.  We did fix another cache deletion issue in bug 681407, but the stack trace is different.  But if this is somehow related to either 1) multiple threads deleting the same directory, or 2) the delete thread continuing to delete stuff too late in XPCOM shutdown, then bug 695003 might still be our answer.  

Assigning to michal for now--let me know if you get stuck or can't get to it soon.

Do we have any idea how common this is?  Would be nice to help us prioritize.  If this is low-frequency, one answer might just be to land bug 695003 and see if this goes away.
Assignee: nobody → michal.novotny
(In reply to Jason Duell (:jduell) from comment #8)
> Do we have any idea how common this is?
With combined signatures, it's #1 top crasher in 11.0a1 from build 20111126 to build 20111129 (hang detector turn on and plugin time out lower than chrome time out).
Depends on: 695003
Crash Signature: [@ chromehang | ZwQueryFullAttributesFile ] [@ chromehang | ZwClose ] [@ chromehang | NtClose ] [@ chromehang | NtQueryFullAttributesFile] → [@ chromehang | ZwQueryFullAttributesFile ] [@ chromehang | ZwClose ] [@ chromehang | NtClose ] [@ chromehang | NtQueryFullAttributesFile] [@ chromehang | NtOpenFile] [@ chromehang | ZwOpenFile] [@ chromehang | ZwQueryDirectoryFile]
Generally happens more on Win7 than other os, and as scoob pointed out if you combine the various signatures this shows up under it's very common.

ZwQueryFullAttributesFile:

https://crash-stats.mozilla.com/report/list?range_value=7&range_unit=days&date=2011-11-30&signature=chromehang%20|%20ZwQueryFullAttributesFile&version=Firefox%3A11.0a1

To find the actual stack, you have to look at the raw dump in these.

See bug 706027 for background on the hang detector.
Correction - bug 429592 is the hang detector bug.
Blocks: 685027
No longer blocks: 685027
BTW what counts as a "hang?" for the hang detector?  Is it doing a Turing-complete analysis to make sure the program will never exit (sorry-that's a joke :) or are we just hitting a timeout?  The cache deletion may take a while (and used to cause a crash in come cases--that's fixed) but not an actual permanent hang.

As long as this isn't a permanent hang, I'm fairly confident bug 695003 will fix this for all but the (rare) cases where a user has selected "clear private data" AND their browser session has accumulated a large cache, in which case it's still a *feature* that we wait for the whole cache to be deleted: (see bug 695003 comment 27).
By "clear private data" I mean Prefs | Privacy | custom settings for history | clear history when APP closes.  Don't have data on how many users use it, but suspect it's a small percentage (and only a small percent of those will have long enough sessions to suffer long shutdown hangs, hopefully.  And they asked for it :)
I think it triggers when main thread hasn't responded in 30 seconds.
(In reply to Jason Duell (:jduell) from comment #12)
> BTW what counts as a "hang?" for the hang detector?

See the patch in bug 429592 for details, but it's when our main thread doesn't respond to pings from another thread for 30 seconds, AFAIK.
On windows it's tied into the main event dispatch loop:

http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsAppShell.cpp#318
Attached patch fixSplinter Review
As Jason already wrote, most of the crashes should be now fixed by bug 695003. This patch suspends the hang detector in case that user wants to clear the cache on shutdown.
Attachment #578835 - Flags: review?(jduell.mcbugs)
There are also a rare cases in the crash stats that will be fixed neither by 695003 nor by this bug (e.g. https://crash-stats.mozilla.com/report/index/ff8ade26-f283-4c3c-9321-76a2b2111128). I'll fill a follow-up bug for this issue.
(In reply to Michal Novotny (:michal) from comment #18)
> I'll fill a follow-up bug for this issue.

bug #707436
(In reply to Jason Duell (:jduell) from comment #13)
> By "clear private data" I mean Prefs | Privacy | custom settings for history
> | clear history when APP closes.  Don't have data on how many users use it,
> but suspect it's a small percentage (and only a small percent of those will
> have long enough sessions to suffer long shutdown hangs, hopefully.  And
> they asked for it :)

I just posted on dev.platform, saying that we should remove this feature, partially for this reason. Please comment in that thread.
https://crash-stats.mozilla.com/report/index/bp-3239f21b-f08d-4d8f-9601-299852111212

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:11.0a1) Gecko/20111212 Firefox/11.0a1 ID:20111212031122

Should I file another bug or this one covers this hang?
Comment on attachment 578835 [details] [diff] [review]
fix

Review of attachment 578835 [details] [diff] [review]:
-----------------------------------------------------------------

Michal: I don't think we'll need this once bug 709262 lands, correct?
Attachment #578835 - Flags: review?(jduell.mcbugs)
Can verify that the incidence of this has gone down significantly?

We shouldn't be seeing this hang (or at best should see it a very small % of the time, for users who have "Clear private data at shutdown set", and/or hit bug 707436) as of builds after 12/4, ie after

   https://bugzilla.mozilla.org/show_bug.cgi?id=695003#c34
Comment on attachment 578835 [details] [diff] [review]
fix

(In reply to Jason Duell (:jduell) from comment #22)
> Michal: I don't think we'll need this once bug 709262 lands, correct?

See https://bugzilla.mozilla.org/show_bug.cgi?id=709964#c1
Attachment #578835 - Flags: review?(jduell.mcbugs)
(In reply to Jason Duell (:jduell) from comment #23)
> Can verify that the incidence of this has gone down significantly?

All chromehang signatures have faded away as we disabled the hang monitor that made them happen on 2011-11-29 in bug 705748, so Nightly builds since 2011-11.30 don't see those signatures any more.

That didn't make the hangs go away, it just made them not report crashes any more.
(In reply to alex_mayorga from comment #21)
> https://crash-stats.mozilla.com/report/index/bp-3239f21b-f08d-4d8f-9601-
> 299852111212
> 
> Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:11.0a1) Gecko/20111212
> Firefox/11.0a1 ID:20111212031122
> 
> Should I file another bug or this one covers this hang?

This is a different hang, file another bug.
(In reply to Michal Novotny (:michal) from comment #26)
> (In reply to alex_mayorga from comment #21)
> > https://crash-stats.mozilla.com/report/index/bp-3239f21b-f08d-4d8f-9601-
> > 299852111212
> > 
> > Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:11.0a1) Gecko/20111212
> > Firefox/11.0a1 ID:20111212031122
> > 
> > Should I file another bug or this one covers this hang?
> 
> This is a different hang, file another bug.

Filed bug 710705
Comment on attachment 578835 [details] [diff] [review]
fix

Review of attachment 578835 [details] [diff] [review]:
-----------------------------------------------------------------

OK, this will now be a rare case, but if you want to suspend the hang monitor for it I'm OK with that.
Attachment #578835 - Flags: review?(jduell.mcbugs) → review+
Comment on attachment 578835 [details] [diff] [review]
fix

I think that this will be very rare now, but this still isn't the right fix. The right fix will be whatever we do for bug 648232, where we block the UI the "right" way without making the UI unresponsive (e.g. modal dialog box, like we use in PSM for long-running operations).

I don't think we need this patch to wallpaper over things because bug 705748 already temporarily disabled in bug 705748.

I suggest we make this (or a follow-up bug) depend on bug 648232, and implement the fix for bug 648232 with this in mind. If we do that, then we could do the "clear cache" operation upon closing the preferences dialog box, which would cover the edge case that we are worried about.
Attachment #578835 - Flags: review-
(In reply to Brian Smith (:bsmith) from comment #29)
> I don't think we need this patch to wallpaper over things because bug 705748
> already temporarily disabled in bug 705748.

In English: I don't think we need this patch to wallpaper over the problem because bug 705748 already temporarily disables the hang detector completely.
Well after chain-crashing about 12 more times in a row, I figured I'd toss in some wisdom from one of my MOZILLA CRASH REPORTs (under the RAW DUMP tab).

https://crash-stats.mozilla.com/report/index/532a04c4-abae-449e-b5de-8304e2131023

OS|Windows NT|5.1.2600 Service Pack 3
CPU|x86|AuthenticAMD family 15 model 107 stepping 2|2
Crash|EXCEPTION_BREAKPOINT|0x3e119c|10 Module|mozalloc.dll|24.0.0.5001|mozalloc.pdb|C4CA4FDDBAD64EB8A36992C68FF8B63F1|0x003e0000|0x003e5fff|0


10|0|mozalloc.dll|mozalloc_abort(char const * const)|hg:hg.mozilla.org/releases/mozilla-release:memory/mozalloc/mozalloc_abort.cpp:7c3b0732e765|30|0x0
10|1|xul.dll|NS_DebugBreak|hg:hg.mozilla.org/releases/mozilla-release:xpcom/base/nsDebugImpl.cpp:7c3b0732e765|417|0xb

And of course a whole bunch of other stuff, but the critical break is 
mozalloc.dll|mozalloc_abort(char const * const)

(Which probably explains why the thing chews up 1.7 GB of RAM & 1.5 GB of disk space before the freakout goes into chain-loop crashing.)

Bug 901346 - crash in mozalloc_abort(char const* const) | js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 
https://bugzilla.mozilla.org/show_bug.cgi?id=901346
2012-01-08 20:06 PST by rtrappman 

Bug 716443 - Large Data Submission causes crash: [@ mozalloc_abort(char const* const) | mozalloc_handle_oom() | xul.dll@0x126615 ] via js::Shape::set 
https://bugzilla.mozilla.org/show_bug.cgi?id=716443
2013-08-04 09:36 PDT by Marcia Knous [:marcia]
Thankfully this crash in mozalloc_abort(char const* const) is "resolved" as in STILL CRASHING EVERYONE'S COMPUTERS.
The new definition of "Resolved" = "Never Fixed, Ignored for a decade".
Gridlock: you're in the wrong bug. Also, if you take a look at http://dxr.mozilla.org/mozilla-central/source/memory/mozalloc/mozalloc_abort.cpp?from=mozalloc_abort#l22, you will realize that mozalloc_abort crashes *by definition* - it is one of our main assertion failure mechanisms.
https://bugzilla.mozilla.org/show_bug.cgi?id=637282
Bug 637282 - Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | mozilla::plugins::PPluginInstance::Transition(mozilla::plugins::PPluginInstance::State, mozilla::ipc::Trigger, mozilla::plugins::PPluginInstance::State*) ] 
Reported: 	2011-02-28 05:08 PST by Scoobidiver (away) 

SO why exactly are we filling the a CHAR length variable with the square of a constant variable?

char = Character or small integer.
1byte =	signed: -128 to 127
unsigned: 0 to 255

What is the value of the CONST variable and is it bigger than one byte?
If it is, then the source of the crash is pretty obvious to even a first year computer programmer.

Conflict stemming from implicit retyping (which can lead to buffer overflows which corrupt your active dataspaces).

http://www.4d.com/docs/CMU/CMU10048.HTM
Some functions return variables of a very precise data type. Assigning the result of one of such variables to a variable already typed differently will cause a data type conflict if you are not careful.
(In reply to David Rajchenbach Teller [:Yoric] from comment #33)
> Gridlock: you're in the wrong bug. Also, if you take a look at
> http://dxr.mozilla.org/mozilla-central/source/memory/mozalloc/mozalloc_abort.
> cpp?from=mozalloc_abort#l22, you will realize that mozalloc_abort crashes
> *by definition* - it is one of our main assertion failure mechanisms.

I see... and why is this happening on WindowsXP if this deliberate bug is supposed to be catching only UNIX Malloc errors?

#if defined(XP_WIN) || defined(XP_OS2)
#  define MOZALLOC_EXPORT __declspec(dllexport)
#endif

Going to the source code, I have the message in the code.

MOZALLOC_EXPORT
#if !defined(__arm__)
  MOZ_NORETURN
#endif
  void mozalloc_abort(const char* const msg);


#endif  /* ifndef mozilla_mozalloc_abort_h */

And yet the error is mozalloc_abort(char const* const)
mozalloc_abort(char const * const)

It is just some very messy looking pointer coding to me and the error message looks like a byte-length character variable is being squared by a constant reference variable which is exceptionally confusing to a casual viewer.

Are you certain there isn't some unintentional hijinks in play with Wild Pointers or Dangling Pointers?
It looks like this is something that should have a second process in play watching the memory allocation exceptions to ensure that a cleanup is double-checking to make certain memory is actually being released rather than being left to zombie memory allocations after being hooked for process observation.
(Like a maid patiently checking that the garbage in the trash cans have all been tossed into the incinerator)
Sorry about the annoyance, but given the quirky aspects of C++ versus standard C the error message appears like a bad variable initialization math expression rather than an odd choice of pointer allocation exception handling.
Gridlock, let's move this conversation to IRC. Here, we are adding noise in an unrelated bug.
( irc.mozilla.org, channels #introduction or #firefox )
I wanted to at least briefly answer comment #36, since I'm not sure I understand your confusion.

1) |const char * const| is the same type as |char const * const| - a read-only pointer to one or more read-only chars. The latter is arguably more consistent , but allowing both |const char| and |char const| is just a quirk of the language. To be clear: |char const *| is a non-const pointer to const chars, and |char * const| is a const pointer to non-const chars.

2) This parameter just refers to a null-terminated ASCII string, presumably intended to indicate why the malloc failed (in your case, it sounds like it ran out of address space). There's no type punning or other weirdness going on.

I hope that clears things up - if not, I must be missing something about what you're saying.
chromehang crashing is turned off by default in current versions. You have turned it on specifically in your preferences, I think, otherwise you cannot get to even crashing this way.
Oh, and BTW, in mozalloc_abort crashes, the most important info is in the abort message, which you will find in the app notes field:

xpcom_runtime_abort(###!!! ABORT: HangMonitor triggered: file e:/builds/moz2_slave/rel-m-rel-w32_bld-000000000000/build/xpcom/threads/HangMonitor.cpp, line 112)

But then, this code is only active if the preference hangmonitor.timeout is set to anything else than the default value of 0. Check about:config to see if that is set to anything else and reset it to disable this code.
(In reply to Brian Smith (:briansmith, :bsmith, use NEEDINFO?) from comment #29)
> Comment on attachment 578835 [details] [diff] [review]
> fix
> 
> I think that this will be very rare now, but this still isn't the right fix.
> The right fix will be whatever we do for bug 648232, where we block the UI
> the "right" way without making the UI unresponsive (e.g. modal dialog box,
> like we use in PSM for long-running operations).
> 
> I don't think we need this patch to wallpaper over things because bug 705748
> already temporarily disabled in bug 705748.
> 
> I suggest we make this (or a follow-up bug) depend on bug 648232, and
> implement the fix for bug 648232 with this in mind. If we do that, then we
> could do the "clear cache" operation upon closing the preferences dialog
> box, which would cover the edge case that we are worried about.

This was fixed in 32. So this bug can be closed?
Flags: needinfo?(brian)
I have no idea what's happening with the disk cache nowadays.
Flags: needinfo?(brian)
:mayhemer see cmoment 41
Flags: needinfo?(honzab.moz)
We are fully async with cache2, and this code is triggered not any more (even for app cache).
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.