Closed Bug 716289 Opened 13 years ago Closed 12 years ago

crash in nsCacheService::SearchCacheDevices

Categories

(Core :: Networking: Cache, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla12

People

(Reporter: sfink, Assigned: michal)

References

Details

(4 keywords, Whiteboard: startupcrash)

Crash Data

Attachments

(1 file, 2 obsolete files)

See eg https://crash-stats.mozilla.com/report/index/bp-5ea1e626-0a62-45f7-b52e-5749c2120107 (it's mine).

I updated the nightly and started to get a crash when restoring the session. I was able to reproduce on a local debug build.

The immediate problem seems to be that in this code in netwerk/cache/nsCacheService.cpp

1898 if (mEnableDiskDevice) {
1899   if (!mDiskDevice) {
1900     nsresult rv = CreateDiskDevice();
1901     if (NS_FAILED(rv))
1902       return nsnull;
1903     }
1904
1905     entry = mDiskDevice->FindEntry(key, collision);
1906   }

CreateDiskDevice seems to be returning success, but not setting mDiskDevice, so it's getting a null pointer exception.

Looking further into what fails, somewhere underneath CreateDiskDevice() it is getting a failure when opening $PROFILE/Cache/_CACHE_003_ because the file size is less than the "expected" size. (Sorry, I shut down my debug window so I don't have the exact code.) I didn't trace through to see where the error was getting eaten.

Deleting the Cache directory made the problem go away.
Not sure what I'm supposed to put in the Crash Signature field of bugzilla.
Crash Signature: [@ nsCacheService::SearchCacheDevices ]
Assignee: nobody → michal.novotny
So the problem is in nsCacheService::CreateDiskDevice(). mDiskDevice->Init() fails and returns an error code. That is detected and triggers this warning:

        printf("###\n");
        printf("### mDiskDevice->Init() failed (0x%.8x)\n", rv);
        printf("###    - disabling disk cache for this session.\n");
        printf("###\n");

which I'm not sure if I believe, but anyway...

It then deletes mDiskDevice and sets it to null, and continues. It sets up a timer to recreate the "disk device", whatever that is, which incidentally succeeds and sets the return value to NS_OK. That gets returned from CreateDiskDevice(), which triggers the crash in the code in comment 0 because it now expects the disk device to exist.

I don't know if the delayed re-creation is new or something (I'm guessing it is, since it's a perf optimization and this would all work if the re-creation were immediate). Nor do I understand what it's trying to do well enough to know if the right thing is to propagate the error code, or somehow cause it to perform the original action once the disk device has been re-created. But one of those two sounds right to me. :)
It's #1 top crasher in 12.0a1.

It first appeared in 12.0a1/20120107. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=5a446202be5f
It might be a regression from bug 703100.
Severity: normal → critical
Crash Signature: [@ nsCacheService::SearchCacheDevices ] → [@ nsCacheService::SearchCacheDevices] [@ nsCacheService::SearchCacheDevices(nsCString*, int, bool*)]
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: startupcrash
Version: unspecified → Trunk
See bug 716533 for STR.
Keywords: reproducible
Apart from avoiding the startup crash when the cache is corrupted, any idea why the cache is getting corrupted? Is it known, or should it be a separate bug?
Attachment #587296 - Flags: review?(bjarne)
Blocks: 707436
Comment on attachment 587296 [details] [diff] [review]
patch v1 - fixes regression caused by bug 707436

Review of attachment 587296 [details] [diff] [review]:
-----------------------------------------------------------------

If creating the device fails just return the failure; otherwise try to set up the timer but always return OK. Means that the timer will never be set up if device is not created.

Looks reasonable. But can the timer exist at this point (there is a check for that)? I.e. can it exist, creation fails (setting mDiskDevice to null), and then an existing timer fires and screws up something? Another question is why the code listed in comment #0 checks the return-code of CreateDiskDevice. Irrc, in other places we make sure to ignore this value and just check if the device exists...

(Clearing review-request - please re-request when issues above has been considered.)
Attachment #587296 - Flags: review?(bjarne)
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to Bjarne (:bjarne) from comment #8)
> Looks reasonable. But can the timer exist at this point (there is a check
> for that)?

It was there for case that we do the following:

1) create disk device
2) destroy it before timer runs
3) create it again

This is not possible with the current code since we destroy the disk device only at shutdown. Anyway, we should cancel the timer at point 2 if we ever decide to allow destroying the disk device at other places. So I've changed the check to assertion.


> Another question is why the code listed in comment #0 checks the return-code
> of CreateDiskDevice. Irrc, in other places we make sure to ignore this value
> and just check if the device exists...

I don't think it is a common rule. E.g. we check the return value in nsCacheService::VisitEntries() too.
Attachment #587296 - Attachment is obsolete: true
Attachment #587349 - Flags: review?(bjarne)
having this pref:
user_pref("browser.cache.disk.parent_directory", "R:\\\\Temp");
also causes an instant crash on start even in safe mode.
It also appeared in build from 7th of Jan, is it this bug or another one?
(In reply to Sean Newman from comment #10)
> It also appeared in build from 7th of Jan, is it this bug or another one?

It is this bug. Patch in bug #707436 landed on 6th January.
Comment on attachment 587349 [details] [diff] [review]
patch v2

Review of attachment 587349 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good but please address comment before committing (note btw that checking that the device exists instead of return-value is robust wrt this  ;)  )

I assume this passes local testing and the known failure-cases.

::: netwerk/cache/nsCacheService.cpp
@@ +1475,5 @@
> +                                           nsITimer::TYPE_ONE_SHOT);
> +    if (NS_FAILED(rv)) {
> +        NS_WARNING("Failed to post smart size timer");
> +        mSmartSizeTimer = nsnull;
> +        return rv;

Do you really want to return failure if posting the timer fails?
Attachment #587349 - Flags: review?(bjarne) → review+
Attached patch patch v3Splinter Review
(In reply to Bjarne (:bjarne) from comment #12)
> I assume this passes local testing and the known failure-cases.

Yes


> ::: netwerk/cache/nsCacheService.cpp
> @@ +1475,5 @@
> > +                                           nsITimer::TYPE_ONE_SHOT);
> > +    if (NS_FAILED(rv)) {
> > +        NS_WARNING("Failed to post smart size timer");
> > +        mSmartSizeTimer = nsnull;
> > +        return rv;
> 
> Do you really want to return failure if posting the timer fails?

This shouldn't be a problem, but you're probably right that we should fail iff mDiskDevice is null. I've changed the patch.
Attachment #587349 - Attachment is obsolete: true
Attachment #587865 - Flags: review?(bjarne)
Comment on attachment 587865 [details] [diff] [review]
patch v3

Review of attachment 587865 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache/nsCacheService.cpp
@@ +1480,2 @@
>      }
>  

Proposed comment: Ignore state of the timer and return Ok here since the purpose of the method (create the disk-device) has been fulfilled
Attachment #587865 - Flags: review?(bjarne) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/d16983821d2f
Whiteboard: startupcrash → [inbound] startupcrash
Target Milestone: --- → mozilla12
Whiteboard: [inbound] startupcrash → [inbound], startupcrash
https://hg.mozilla.org/mozilla-central/rev/d16983821d2f
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [inbound], startupcrash → startupcrash
The latest crashes took place in 12.0a1/20120113.
Status: RESOLVED → VERIFIED
wonder if its possible that this created the volume regression described in crash in  Bug 741179 - nsDiskCacheBlockFile::Write that ramped up around 12.0b3
(In reply to chris hofmann from comment #18)
> wonder if its possible that this created the volume regression described in
> crash in  Bug 741179 - nsDiskCacheBlockFile::Write that ramped up around
> 12.0b3

I doubt it, as that one has spiked in 11.0 as well.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: