Closed Bug 774146 Opened 12 years ago Closed 12 years ago

Get more detail on reason for DISK_CACHE_CORRUPT failures

Categories

(Core :: Networking: Cache, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla17

People

(Reporter: bbondy, Assigned: bbondy)

References

Details

(Whiteboard: [snappy:p2])

Attachments

(1 file, 3 obsolete files)

From the data in DISK_CACHE_CORRUPT, 12.5% of network startups are corrupt.  This seems very high.  There are many different code paths that return the corrupt error code that triggers this 12.5% though. 

The reason for the corrupt cache is likely a dirty header, but we should add telemetry data to be sure.
Whiteboard: [snappy]
Attached patch Patch v1. (obsolete) — Splinter Review
I seen you did the previous telemetry data around this area of code so thought I'd hit you up again for a review :)
Attachment #642984 - Flags: review?(jduell.mcbugs)
Comment on attachment 642984 [details] [diff] [review]
Patch v1.

Review of attachment 642984 [details] [diff] [review]:
-----------------------------------------------------------------

Brian or Michal are better to review this--giving to Michal because I know Brian is busy with data-jar for B2G this week.

::: toolkit/components/telemetry/TelemetryHistograms.h
@@ +194,5 @@
>  #undef _HTTP_HIST
>  #undef HTTP_HISTOGRAMS
>  
>  HISTOGRAM(DISK_CACHE_CORRUPT, 0, 1, 2, BOOLEAN, "Was the HTTP disk cache corrupt at startup?")
> +HISTOGRAM(DISK_CACHE_CORRUPT_DETAILS, 1, 30, 31, LINEAR, "Why the HTTP disk cache was corrupted at startup")

min/max values are 1-30, so bucket count should be 30 instead of 31?
Attachment #642984 - Flags: review?(jduell.mcbugs) → review?(michal.novotny)
> Brian or Michal are better to review this--giving to Michal 
> because I know Brian is busy with data-jar for B2G this week.

OK, thanks!

> min/max values are 1-30, so bucket count should be 30 instead of 31?

31 is correct. The min value (1 in this case) is the smallest non-zero value, and 0 is always implicitly included.
Comment on attachment 642984 [details] [diff] [review]
Patch v1.

>  nsresult
> -nsDiskCacheMap::Open(nsIFile *  cacheDirectory)
> +nsDiskCacheMap::Open(nsIFile *  cacheDirectory,
> +                     nsDiskCacheMap::CorruptCacheInfo & corruptInfo)
>  {

Please pass a pointer instead of a reference.


> +    corruptInfo = nsDiskCacheMap::kNotCorrupt;

IMO the default value should be some error (e.g. kUnexpectedError). There are few places in nsDiskCacheMap::Open() where corruptInfo isn't set appropriately in case of failure. These failures are highly unlikely or almost impossible, but just in case it really happen we shouldn't return kNotCorrupt.


>     mRecordArray = (nsDiskCacheRecord *)
>         PR_CALLOC(mHeader.mRecordCount * sizeof(nsDiskCacheRecord));
>     if (!mRecordArray) {
> +       corruptInfo = nsDiskCacheMap::kOutOfMemory;

>     mRecordArray = (nsDiskCacheRecord *) PR_MALLOC(recordArraySize);
>     if (!mRecordArray) {
> +       corruptInfo = nsDiskCacheMap::kOutOfMemory;

AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right? So there is no need to handle this error...


>          PRUint32 recordArraySize =
>                  mHeader.mRecordCount * sizeof(nsDiskCacheRecord);
> -        if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader))
> +        if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader)) {
> +            corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough;
>              goto error_exit;
> +        }

>      } else {
> +        corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough;
>          goto error_exit;
>      }

We should differentiate the corruptInfo in these 2 cases. The first could be named like kRecordsIncomplete and the second kHeaderIncomplete.


>      rv = OpenBlockFiles();
> -    if (NS_FAILED(rv))  goto error_exit;
> +    if (NS_FAILED(rv)) {
> +      corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFiles;
> +      goto error_exit;
> +    }

Maybe we could pass the corruptInfo to nsDiskCacheMap::OpenBlockFiles() and then to nsDiskCacheBlockFile::Open() to get a more detailed error here.
Attachment #642984 - Flags: review?(michal.novotny) → review-
> Please pass a pointer instead of a reference.

Is there a reason why? I'll change it if you really want, but I don't see why it's a problem and I prefer the simplified code where a NULL error condition check is not needed. I've used references pretty extensively throughout a couple hundred patches and haven't had any negative reviews or nits because of it.  I don't see it documented here either: https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide

The only argument I've ever heard is that it makes it more readable when you call the function, but in this case I think it's pretty obvious. 

> re: the other nits

Sounds good, I'll update them.
Attached patch Patch v2. (obsolete) — Splinter Review
Implemented review comments minus what was mentioned in Comment 5.
Attachment #642984 - Attachment is obsolete: true
Attachment #643384 - Flags: review?(michal.novotny)
> AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right?

Nope.  moz_xmalloc is infallible, but not the NSPR functions.  Is there any reason we need to use the PR_ allocators?  

  https://developer.mozilla.org/en/Infallible_memory_allocation
I can re-add the memory error codes as a nit before pushing if you want or on the next set of review comments. As for changing the allocator functions that should be a different bug if a change is wanted.
> as for changing the allocator functions

That's fine, but then keep the error check for the existing allocation calls if they fail :)
Yup I will, just will do it before pushing or on the next set of review comments. 

> I can re-add the memory error codes as a nit before pushing if 
> you want or on the next set of review comments
Whiteboard: [snappy] → [snappy:p2]
Comment on attachment 643384 [details] [diff] [review]
Patch v2.

(In reply to Brian R. Bondy [:bbondy] from comment #5)
> > Please pass a pointer instead of a reference.
> 
> Is there a reason why? I'll change it if you really want, but I don't see
> why it's a problem and I prefer the simplified code where a NULL error
> condition check is not needed. I've used references pretty extensively
> throughout a couple hundred patches and haven't had any negative reviews or
> nits because of it.  I don't see it documented here either:
> https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide
> 
> The only argument I've ever heard is that it makes it more readable when you
> call the function, but in this case I think it's pretty obvious. 

It could be that this has changed over the time, but a long time ago I was told that reference should be used only when passing const nsACString/nsAString... Anyway, the general rule is to use the style of the subtree and we don't use references in the cache tree.


> +    if (!cacheDirectory) {
> +        corruptInfo = nsDiskCacheMap::kNotCorrupt;
> +        NS_WARNING("cacheDirectory must not be NULL");
> +        return NS_ERROR_INVALID_POINTER;
> +    }

That's exactly what I wanted to avoid by using kUnexpectedError. In this case nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This doesn't make any sense.

All these errors that shouldn't normally happen (not just this one) should be logged as kUnexpectedError. And in case the telemetry data shows non zero for this error we should investigate why that happened.


>          rv = mBlockFile[i].Open(blockFile, blockSize, bitMapSize);
> -        if (NS_FAILED(rv)) break;
> +        if (NS_FAILED(rv)) {
> +            corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFileForIndex;
> +            break;
> +        }

I meant to propagate the corruptInfo down to nsDiskCacheBlockFile::Open().
Attachment #643384 - Flags: review?(michal.novotny) → review-
We are trying to track the cases exactly where we have a NS_ERROR_FILE_CORRUPTED.  Because this is the only time that we call nsDeleteDir::DeleteDir(mCacheDirectory...

>        } else if (rv == NS_ERROR_FILE_CORRUPTED) {
>            Telemetry::Accumulate(Telemetry::DISK_CACHE_CORRUPT, 1);
>            // delay delete by 1 minute to avoid IO thrash at startup
>            rv = nsDeleteDir::DeleteDir(mCacheDirectory, true, 60000);
>            if (NS_FAILED(rv))
>                return rv;
>            exists = false;
>        } else {

For that reason I only want corrupt errors to show up in this histogram. For the errors at the top, I want them to be marked as not corrupt because they are not, they are some other type of error that does not lead to us nuking the cache directory.
> the general rule is to use the style of the subtree and we don't use 
> references in the cache tree.

Sounds fair on those grounds.  I'll update to pointers.
Submitting a new patch shortly with all review comments addressed.
> That's exactly what I wanted to avoid by using kUnexpectedError. In this case 
> nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as 
> Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This 
> doesn't make any sense.

By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED exactly so it does make sense.  I'll add the extra error codes prefixed with kNotCorrupt though just to move things along.
(In reply to Brian R. Bondy [:bbondy] from comment #15)
> By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED
> exactly so it does make sense.  I'll add the extra error codes prefixed with
> kNotCorrupt though just to move things along.

I overlooked it, sorry. So what about logging Telemetry::DISK_CACHE_CORRUPT_DETAILS only when we log Telemetry::DISK_CACHE_CORRUPT? Then you would not need to handle these few special cases in nsDiskCacheMap::Open()...
> I overlooked it, sorry. So what about logging
> Telemetry::DISK_CACHE_CORRUPT_DETAILS only when 
> we log Telemetry::DISK_CACHE_CORRUPT

That sounds good.
Attached patch Patch v3 (obsolete) — Splinter Review
Thanks for the quick reviews, I'm looking forward to seeing this data.
I expect most to be header dirty, but it would be interesting to see if something else comes up.

Since we moved this to align with the DISK_CACHE_CORRUPT, that histogram is no longer useful as the same info can be seen inside DISK_CACHE_CORRUPT_DETAILS.
Attachment #643384 - Attachment is obsolete: true
Attachment #644418 - Flags: review?(michal.novotny)
Comment on attachment 644418 [details] [diff] [review]
Patch v3

The patch looks good. Just one last thing. Couldn't be the enum CorruptCacheInfo defined e.g. in nsDiskCache class? Logically it doesn't belong to nsDiskCacheBlockFile.
Attachment #644418 - Flags: review?(michal.novotny) → review+
Ya I agree regarding nsDiskCacheBlockFile.  The only downside is we'll have to include nsDIskCache.h from within nsDiskCacheBlockFile.h since enums cannot be forward declared in C++ except in C++11.  I'll attach a new patch with that and carry forward the r+.  Thanks for the review.
Attached patch Patch v4Splinter Review
- Moved enum to nsDiskCache
- Added #include "nsDiskCache.h" inside nsDiskCacheBlockFile.h

Carrying forward r+ since it was previously r+ and this was a nit.
Attachment #644418 - Attachment is obsolete: true
Attachment #644525 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/4c632ce71344
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Most of the cache clearing is due to a dirty header as expected.
A couple of interesting notes:

kCreateCacheSubdirectories is returned only on Windows and for 0.5%.
I think this probably suggests that the move of corrupted cache folder sometimes fails.
Let me know if you have any other ideas about that.

kEntryCountIncorrect is returned 0.4% of the times in Mac Which I think suggests entries with hashes that are 0.  On Windows this also appears with 0.04% of the time, and on Linux not at all.
Blocks: 105843
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: