Closed Bug 1277826 Opened 3 years ago Closed 3 years ago

Crash in shutdownhang | mozilla::net::CacheIndex::WriteLogToDisk

Categories

(Core :: Networking: Cache, defect, critical)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- affected
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: mayhemer, Assigned: michal)

Details

(Keywords: crash, Whiteboard: [necko-active])

Crash Data

Attachments

(1 file, 1 obsolete file)

Crash volume for signature 'shutdownhang | mozilla::net::CacheIndex::WriteLogToDisk':
 - nightly(version 50):2 crashes from 2016-06-06.
 - aurora (version 49):1 crash from 2016-06-07.
 - beta   (version 48):65 crashes from 2016-06-06.
 - release(version 47):839 crashes from 2016-05-31.
 - esr    (version 45):8 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       1       0       1       0       0       0       0
 - aurora        0       1       0       0       0       0       0
 - beta          6       7       7       6      11      10      17
 - release     112     118     129     115     125     100     111
 - esr           1       0       2       1       1       2       1

Affected platforms: Windows, Mac OS X
Attached patch fix (obsolete) — Splinter Review
(In reply to Honza Bambas (:mayhemer) from comment #1)
> Some crashes are here, maybe this loop could be the cause?
> 
> https://hg.mozilla.org/releases/mozilla-release/annotate/0b8492c110be/xpcom/
> glue/PLDHashTable.cpp#l786

I don't think the loop is the problem. It's likely that we spent a lot of time shutting down the cache (CacheFiles and CacheEntries) and firefox is killed when we're flushing the index although it doesn't take too much time. We've made shutdown of the cache faster in the latest release which explains why this crash occurs in 48 less than in 47 without any change in the index code.

Anyway, it's nice to close the index cleanly but not necessary. The patch changes the code so that writing of index is skipped or interrupted if CacheObserver::IsPastShutdownIOLag returns true. We might want to introduce a new limit for index, but let's use sMaxShutdownIOLag for simplicity now and see if it helps. I also revised deleting of the whole index when something fails during the shutdown. It should be safe and more efficient to just remove journal and keep the index file, because then we don't need to rebuild the whole index on the next start.
Attachment #8785929 - Flags: review?(honzab.moz)
Comment on attachment 8785929 [details] [diff] [review]
fix

Review of attachment 8785929 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheIndex.cpp
@@ +1844,5 @@
> +  LOG(("CacheIndex::RemoveIndexFromDisk() [keepIndexFile=%d]", aKeepIndexFile));
> +
> +  if (!aKeepIndexFile) {
> +    RemoveFile(NS_LITERAL_CSTRING(INDEX_NAME));
> +  }

small nit: could we have two methods like following:

RemoveIndexFromDiskButMainIndexFile() - deletes TEMP_INDEX_NAME, JOURNAL_NAME
RemoveIndexFromDisk() - deletes INDEX_NAME, calls the previous method

the bool arg is kinda cryptic.

@@ +1976,5 @@
>      CacheIndexEntry* entry = iter.Get();
>      if (entry->IsRemoved() || entry->IsDirty()) {
> +      rv = wlh.AddEntry(entry);
> +      if (NS_WARN_IF(NS_FAILED(rv))) {
> +        PR_Close(fd);

do we need to close it?  close can be expensive, but I presume you are later deleting the file (doesn't it have some checksums preventing it from use when we leave the file incomplete on disk?)
Attachment #8785929 - Flags: review?(honzab.moz) → review+
Attached patch patch v2Splinter Review
(In reply to Honza Bambas (:mayhemer) from comment #4)
> @@ +1976,5 @@
> >      CacheIndexEntry* entry = iter.Get();
> >      if (entry->IsRemoved() || entry->IsDirty()) {
> > +      rv = wlh.AddEntry(entry);
> > +      if (NS_WARN_IF(NS_FAILED(rv))) {
> > +        PR_Close(fd);
> 
> do we need to close it?  close can be expensive, but I presume you are later
> deleting the file (doesn't it have some checksums preventing it from use
> when we leave the file incomplete on disk?)

It's not necessary. The journal file has checksum but the file will be ignored without reading it because the main file won't be marked as clean.
Attachment #8785929 - Attachment is obsolete: true
Attachment #8787922 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/8225b084f18f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Not a very big volume but we could take an uplift to 50. Thanks
Comment on attachment 8787922 [details] [diff] [review]
patch v2

Approval Request Comment
[Feature/regressing bug #]: It's in the cache2 code from the beginning. 
[User impact if declined]: Rare shutdown crash.
[Describe test coverage new/current, TreeHerder]: There is no automated test for cache shutdown. 
[Risks and why]: Low, although there is no automated test, the situation when index is not successfully shut down is very well tested, because it happens whenever firefox crashes. 
[String/UUID change made/needed]: None.
Flags: needinfo?(michal.novotny)
Attachment #8787922 - Flags: approval-mozilla-aurora?
Comment on attachment 8787922 [details] [diff] [review]
patch v2

Potential fix for some kinds of shutdown crashes, even though I don't see it in 50, patch is fairly straightforward, Aurora50+
Attachment #8787922 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This is still reproducible on Fx50 in low volume, based on last month's worth of crash data.

  SIGNATURE   | shutdownhang | mozilla::net::CacheIndex::WriteLogToDisk
  ---------------------------------------------------------------------
  CRASH STATS | http://tinyurl.com/j3j865v
  ---------------------------------------------------------------------
  OVERVIEW    | 0 crashes on nightly 52
	      | 1 crash on nightly 51
	      | 0 crashes on aurora 51
	      | 0 crashes on nightly 50
	      | 0 crashes on aurora 50
	      | 4 crashes on beta 50
  ---------------------------------------------------------------------
  LAST CRASH  | 2016-10-04 (on 50.0b3)
You need to log in before you can comment on or make changes to this bug.