Last Comment Bug 716289 - crash in nsCacheService::SearchCacheDevices
: crash in nsCacheService::SearchCacheDevices
Status: VERIFIED FIXED
startupcrash
: crash, regression, reproducible, topcrash
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: All All
: -- critical with 1 vote (vote)
: mozilla12
Assigned To: Michal Novotny (:michal)
:
: Patrick McManus [:mcmanus]
Mentors:
: 716533 (view as bug list)
Depends on:
Blocks: 707436
  Show dependency treegraph
 
Reported: 2012-01-07 15:09 PST by Steve Fink [:sfink] [:s:]
Modified: 2012-04-06 04:38 PDT (History)
11 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch v1 - fixes regression caused by bug 707436 (1.62 KB, patch)
2012-01-10 07:08 PST, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
patch v2 (2.12 KB, patch)
2012-01-10 09:26 PST, Michal Novotny (:michal)
bjarne: review+
Details | Diff | Splinter Review
patch v3 (1.83 KB, patch)
2012-01-11 16:14 PST, Michal Novotny (:michal)
bjarne: review+
Details | Diff | Splinter Review

Description Steve Fink [:sfink] [:s:] 2012-01-07 15:09:54 PST
See eg https://crash-stats.mozilla.com/report/index/bp-5ea1e626-0a62-45f7-b52e-5749c2120107 (it's mine).

I updated the nightly and started to get a crash when restoring the session. I was able to reproduce on a local debug build.

The immediate problem seems to be that in this code in netwerk/cache/nsCacheService.cpp

1898 if (mEnableDiskDevice) {
1899   if (!mDiskDevice) {
1900     nsresult rv = CreateDiskDevice();
1901     if (NS_FAILED(rv))
1902       return nsnull;
1903     }
1904
1905     entry = mDiskDevice->FindEntry(key, collision);
1906   }

CreateDiskDevice seems to be returning success, but not setting mDiskDevice, so it's getting a null pointer exception.

Looking further into what fails, somewhere underneath CreateDiskDevice() it is getting a failure when opening $PROFILE/Cache/_CACHE_003_ because the file size is less than the "expected" size. (Sorry, I shut down my debug window so I don't have the exact code.) I didn't trace through to see where the error was getting eaten.

Deleting the Cache directory made the problem go away.
Comment 1 Steve Fink [:sfink] [:s:] 2012-01-07 15:11:37 PST
Not sure what I'm supposed to put in the Crash Signature field of bugzilla.
Comment 2 Steve Fink [:sfink] [:s:] 2012-01-07 23:31:59 PST
So the problem is in nsCacheService::CreateDiskDevice(). mDiskDevice->Init() fails and returns an error code. That is detected and triggers this warning:

        printf("###\n");
        printf("### mDiskDevice->Init() failed (0x%.8x)\n", rv);
        printf("###    - disabling disk cache for this session.\n");
        printf("###\n");

which I'm not sure if I believe, but anyway...

It then deletes mDiskDevice and sets it to null, and continues. It sets up a timer to recreate the "disk device", whatever that is, which incidentally succeeds and sets the return value to NS_OK. That gets returned from CreateDiskDevice(), which triggers the crash in the code in comment 0 because it now expects the disk device to exist.

I don't know if the delayed re-creation is new or something (I'm guessing it is, since it's a perf optimization and this would all work if the re-creation were immediate). Nor do I understand what it's trying to do well enough to know if the right thing is to propagate the error code, or somehow cause it to perform the original action once the disk device has been re-created. But one of those two sounds right to me. :)
Comment 3 Scoobidiver (away) 2012-01-08 05:58:39 PST
It's #1 top crasher in 12.0a1.

It first appeared in 12.0a1/20120107. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=5a446202be5f
It might be a regression from bug 703100.
Comment 4 Scoobidiver (away) 2012-01-09 08:42:57 PST
*** Bug 716533 has been marked as a duplicate of this bug. ***
Comment 5 Scoobidiver (away) 2012-01-09 08:45:06 PST
See bug 716533 for STR.
Comment 6 Steve Fink [:sfink] [:s:] 2012-01-09 10:55:53 PST
Apart from avoiding the startup crash when the cache is corrupted, any idea why the cache is getting corrupted? Is it known, or should it be a separate bug?
Comment 7 Michal Novotny (:michal) 2012-01-10 07:08:43 PST
Created attachment 587296 [details] [diff] [review]
patch v1 - fixes regression caused by bug 707436
Comment 8 Bjarne (:bjarne) 2012-01-10 08:04:35 PST
Comment on attachment 587296 [details] [diff] [review]
patch v1 - fixes regression caused by bug 707436

Review of attachment 587296 [details] [diff] [review]:
-----------------------------------------------------------------

If creating the device fails just return the failure; otherwise try to set up the timer but always return OK. Means that the timer will never be set up if device is not created.

Looks reasonable. But can the timer exist at this point (there is a check for that)? I.e. can it exist, creation fails (setting mDiskDevice to null), and then an existing timer fires and screws up something? Another question is why the code listed in comment #0 checks the return-code of CreateDiskDevice. Irrc, in other places we make sure to ignore this value and just check if the device exists...

(Clearing review-request - please re-request when issues above has been considered.)
Comment 9 Michal Novotny (:michal) 2012-01-10 09:26:08 PST
Created attachment 587349 [details] [diff] [review]
patch v2

(In reply to Bjarne (:bjarne) from comment #8)
> Looks reasonable. But can the timer exist at this point (there is a check
> for that)?

It was there for case that we do the following:

1) create disk device
2) destroy it before timer runs
3) create it again

This is not possible with the current code since we destroy the disk device only at shutdown. Anyway, we should cancel the timer at point 2 if we ever decide to allow destroying the disk device at other places. So I've changed the check to assertion.


> Another question is why the code listed in comment #0 checks the return-code
> of CreateDiskDevice. Irrc, in other places we make sure to ignore this value
> and just check if the device exists...

I don't think it is a common rule. E.g. we check the return value in nsCacheService::VisitEntries() too.
Comment 10 Sean Newman 2012-01-11 15:02:48 PST
having this pref:
user_pref("browser.cache.disk.parent_directory", "R:\\\\Temp");
also causes an instant crash on start even in safe mode.
It also appeared in build from 7th of Jan, is it this bug or another one?
Comment 11 Michal Novotny (:michal) 2012-01-11 15:07:15 PST
(In reply to Sean Newman from comment #10)
> It also appeared in build from 7th of Jan, is it this bug or another one?

It is this bug. Patch in bug #707436 landed on 6th January.
Comment 12 Bjarne (:bjarne) 2012-01-11 15:23:10 PST
Comment on attachment 587349 [details] [diff] [review]
patch v2

Review of attachment 587349 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good but please address comment before committing (note btw that checking that the device exists instead of return-value is robust wrt this  ;)  )

I assume this passes local testing and the known failure-cases.

::: netwerk/cache/nsCacheService.cpp
@@ +1475,5 @@
> +                                           nsITimer::TYPE_ONE_SHOT);
> +    if (NS_FAILED(rv)) {
> +        NS_WARNING("Failed to post smart size timer");
> +        mSmartSizeTimer = nsnull;
> +        return rv;

Do you really want to return failure if posting the timer fails?
Comment 13 Michal Novotny (:michal) 2012-01-11 16:14:07 PST
Created attachment 587865 [details] [diff] [review]
patch v3

(In reply to Bjarne (:bjarne) from comment #12)
> I assume this passes local testing and the known failure-cases.

Yes


> ::: netwerk/cache/nsCacheService.cpp
> @@ +1475,5 @@
> > +                                           nsITimer::TYPE_ONE_SHOT);
> > +    if (NS_FAILED(rv)) {
> > +        NS_WARNING("Failed to post smart size timer");
> > +        mSmartSizeTimer = nsnull;
> > +        return rv;
> 
> Do you really want to return failure if posting the timer fails?

This shouldn't be a problem, but you're probably right that we should fail iff mDiskDevice is null. I've changed the patch.
Comment 14 Bjarne (:bjarne) 2012-01-12 02:36:32 PST
Comment on attachment 587865 [details] [diff] [review]
patch v3

Review of attachment 587865 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache/nsCacheService.cpp
@@ +1480,2 @@
>      }
>  

Proposed comment: Ignore state of the timer and return Ok here since the purpose of the method (create the disk-device) has been fulfilled
Comment 16 Marco Bonardo [::mak] 2012-01-13 01:06:03 PST
https://hg.mozilla.org/mozilla-central/rev/d16983821d2f
Comment 17 Scoobidiver (away) 2012-01-15 00:56:04 PST
The latest crashes took place in 12.0a1/20120113.
Comment 18 chris hofmann 2012-04-05 19:14:25 PDT
wonder if its possible that this created the volume regression described in crash in  Bug 741179 - nsDiskCacheBlockFile::Write that ramped up around 12.0b3
Comment 19 Robert Kaiser 2012-04-06 04:38:49 PDT
(In reply to chris hofmann from comment #18)
> wonder if its possible that this created the volume regression described in
> crash in  Bug 741179 - nsDiskCacheBlockFile::Write that ramped up around
> 12.0b3

I doubt it, as that one has spiked in 11.0 as well.

Note You need to log in before you can comment on or make changes to this bug.