Last Comment Bug 774146 - Get more detail on reason for DISK_CACHE_CORRUPT failures
: Get more detail on reason for DISK_CACHE_CORRUPT failures
Status: RESOLVED FIXED
[snappy:p2]
:
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: x86_64 Windows 7
: -- normal (vote)
: mozilla17
Assigned To: Brian R. Bondy [:bbondy]
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on:
Blocks: 105843
  Show dependency treegraph
 
Reported: 2012-07-15 18:17 PDT by Brian R. Bondy [:bbondy]
Modified: 2012-07-24 11:17 PDT (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch v1. (10.83 KB, patch)
2012-07-17 09:07 PDT, Brian R. Bondy [:bbondy]
michal.novotny: review-
Details | Diff | Splinter Review
Patch v2. (13.51 KB, patch)
2012-07-18 08:13 PDT, Brian R. Bondy [:bbondy]
michal.novotny: review-
Details | Diff | Splinter Review
Patch v3 (18.81 KB, patch)
2012-07-20 12:23 PDT, Brian R. Bondy [:bbondy]
michal.novotny: review+
Details | Diff | Splinter Review
Patch v4 (19.93 KB, patch)
2012-07-20 16:45 PDT, Brian R. Bondy [:bbondy]
netzen: review+
Details | Diff | Splinter Review

Description Brian R. Bondy [:bbondy] 2012-07-15 18:17:12 PDT
From the data in DISK_CACHE_CORRUPT, 12.5% of network startups are corrupt.  This seems very high.  There are many different code paths that return the corrupt error code that triggers this 12.5% though. 

The reason for the corrupt cache is likely a dirty header, but we should add telemetry data to be sure.
Comment 1 Brian R. Bondy [:bbondy] 2012-07-17 09:07:17 PDT
Created attachment 642984 [details] [diff] [review]
Patch v1.

I seen you did the previous telemetry data around this area of code so thought I'd hit you up again for a review :)
Comment 2 Jason Duell [:jduell] (needinfo me) 2012-07-17 10:50:17 PDT
Comment on attachment 642984 [details] [diff] [review]
Patch v1.

Review of attachment 642984 [details] [diff] [review]:
-----------------------------------------------------------------

Brian or Michal are better to review this--giving to Michal because I know Brian is busy with data-jar for B2G this week.

::: toolkit/components/telemetry/TelemetryHistograms.h
@@ +194,5 @@
>  #undef _HTTP_HIST
>  #undef HTTP_HISTOGRAMS
>  
>  HISTOGRAM(DISK_CACHE_CORRUPT, 0, 1, 2, BOOLEAN, "Was the HTTP disk cache corrupt at startup?")
> +HISTOGRAM(DISK_CACHE_CORRUPT_DETAILS, 1, 30, 31, LINEAR, "Why the HTTP disk cache was corrupted at startup")

min/max values are 1-30, so bucket count should be 30 instead of 31?
Comment 3 Brian R. Bondy [:bbondy] 2012-07-17 11:05:57 PDT
> Brian or Michal are better to review this--giving to Michal 
> because I know Brian is busy with data-jar for B2G this week.

OK, thanks!

> min/max values are 1-30, so bucket count should be 30 instead of 31?

31 is correct. The min value (1 in this case) is the smallest non-zero value, and 0 is always implicitly included.
Comment 4 Michal Novotny (:michal) 2012-07-18 06:21:12 PDT
Comment on attachment 642984 [details] [diff] [review]
Patch v1.

>  nsresult
> -nsDiskCacheMap::Open(nsIFile *  cacheDirectory)
> +nsDiskCacheMap::Open(nsIFile *  cacheDirectory,
> +                     nsDiskCacheMap::CorruptCacheInfo & corruptInfo)
>  {

Please pass a pointer instead of a reference.


> +    corruptInfo = nsDiskCacheMap::kNotCorrupt;

IMO the default value should be some error (e.g. kUnexpectedError). There are few places in nsDiskCacheMap::Open() where corruptInfo isn't set appropriately in case of failure. These failures are highly unlikely or almost impossible, but just in case it really happen we shouldn't return kNotCorrupt.


>     mRecordArray = (nsDiskCacheRecord *)
>         PR_CALLOC(mHeader.mRecordCount * sizeof(nsDiskCacheRecord));
>     if (!mRecordArray) {
> +       corruptInfo = nsDiskCacheMap::kOutOfMemory;

>     mRecordArray = (nsDiskCacheRecord *) PR_MALLOC(recordArraySize);
>     if (!mRecordArray) {
> +       corruptInfo = nsDiskCacheMap::kOutOfMemory;

AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right? So there is no need to handle this error...


>          PRUint32 recordArraySize =
>                  mHeader.mRecordCount * sizeof(nsDiskCacheRecord);
> -        if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader))
> +        if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader)) {
> +            corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough;
>              goto error_exit;
> +        }

>      } else {
> +        corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough;
>          goto error_exit;
>      }

We should differentiate the corruptInfo in these 2 cases. The first could be named like kRecordsIncomplete and the second kHeaderIncomplete.


>      rv = OpenBlockFiles();
> -    if (NS_FAILED(rv))  goto error_exit;
> +    if (NS_FAILED(rv)) {
> +      corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFiles;
> +      goto error_exit;
> +    }

Maybe we could pass the corruptInfo to nsDiskCacheMap::OpenBlockFiles() and then to nsDiskCacheBlockFile::Open() to get a more detailed error here.
Comment 5 Brian R. Bondy [:bbondy] 2012-07-18 07:07:44 PDT
> Please pass a pointer instead of a reference.

Is there a reason why? I'll change it if you really want, but I don't see why it's a problem and I prefer the simplified code where a NULL error condition check is not needed. I've used references pretty extensively throughout a couple hundred patches and haven't had any negative reviews or nits because of it.  I don't see it documented here either: https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide

The only argument I've ever heard is that it makes it more readable when you call the function, but in this case I think it's pretty obvious. 

> re: the other nits

Sounds good, I'll update them.
Comment 6 Brian R. Bondy [:bbondy] 2012-07-18 08:13:10 PDT
Created attachment 643384 [details] [diff] [review]
Patch v2.

Implemented review comments minus what was mentioned in Comment 5.
Comment 7 Jason Duell [:jduell] (needinfo me) 2012-07-18 10:26:18 PDT
> AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right?

Nope.  moz_xmalloc is infallible, but not the NSPR functions.  Is there any reason we need to use the PR_ allocators?  

  https://developer.mozilla.org/en/Infallible_memory_allocation
Comment 8 Brian R. Bondy [:bbondy] 2012-07-18 10:36:20 PDT
I can re-add the memory error codes as a nit before pushing if you want or on the next set of review comments. As for changing the allocator functions that should be a different bug if a change is wanted.
Comment 9 Jason Duell [:jduell] (needinfo me) 2012-07-18 10:42:17 PDT
> as for changing the allocator functions

That's fine, but then keep the error check for the existing allocation calls if they fail :)
Comment 10 Brian R. Bondy [:bbondy] 2012-07-18 10:44:30 PDT
Yup I will, just will do it before pushing or on the next set of review comments. 

> I can re-add the memory error codes as a nit before pushing if 
> you want or on the next set of review comments
Comment 11 Michal Novotny (:michal) 2012-07-20 09:03:45 PDT
Comment on attachment 643384 [details] [diff] [review]
Patch v2.

(In reply to Brian R. Bondy [:bbondy] from comment #5)
> > Please pass a pointer instead of a reference.
> 
> Is there a reason why? I'll change it if you really want, but I don't see
> why it's a problem and I prefer the simplified code where a NULL error
> condition check is not needed. I've used references pretty extensively
> throughout a couple hundred patches and haven't had any negative reviews or
> nits because of it.  I don't see it documented here either:
> https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide
> 
> The only argument I've ever heard is that it makes it more readable when you
> call the function, but in this case I think it's pretty obvious. 

It could be that this has changed over the time, but a long time ago I was told that reference should be used only when passing const nsACString/nsAString... Anyway, the general rule is to use the style of the subtree and we don't use references in the cache tree.


> +    if (!cacheDirectory) {
> +        corruptInfo = nsDiskCacheMap::kNotCorrupt;
> +        NS_WARNING("cacheDirectory must not be NULL");
> +        return NS_ERROR_INVALID_POINTER;
> +    }

That's exactly what I wanted to avoid by using kUnexpectedError. In this case nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This doesn't make any sense.

All these errors that shouldn't normally happen (not just this one) should be logged as kUnexpectedError. And in case the telemetry data shows non zero for this error we should investigate why that happened.


>          rv = mBlockFile[i].Open(blockFile, blockSize, bitMapSize);
> -        if (NS_FAILED(rv)) break;
> +        if (NS_FAILED(rv)) {
> +            corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFileForIndex;
> +            break;
> +        }

I meant to propagate the corruptInfo down to nsDiskCacheBlockFile::Open().
Comment 12 Brian R. Bondy [:bbondy] 2012-07-20 09:11:04 PDT
We are trying to track the cases exactly where we have a NS_ERROR_FILE_CORRUPTED.  Because this is the only time that we call nsDeleteDir::DeleteDir(mCacheDirectory...

>        } else if (rv == NS_ERROR_FILE_CORRUPTED) {
>            Telemetry::Accumulate(Telemetry::DISK_CACHE_CORRUPT, 1);
>            // delay delete by 1 minute to avoid IO thrash at startup
>            rv = nsDeleteDir::DeleteDir(mCacheDirectory, true, 60000);
>            if (NS_FAILED(rv))
>                return rv;
>            exists = false;
>        } else {

For that reason I only want corrupt errors to show up in this histogram. For the errors at the top, I want them to be marked as not corrupt because they are not, they are some other type of error that does not lead to us nuking the cache directory.
Comment 13 Brian R. Bondy [:bbondy] 2012-07-20 09:38:22 PDT
> the general rule is to use the style of the subtree and we don't use 
> references in the cache tree.

Sounds fair on those grounds.  I'll update to pointers.
Comment 14 Brian R. Bondy [:bbondy] 2012-07-20 09:45:50 PDT
Submitting a new patch shortly with all review comments addressed.
Comment 15 Brian R. Bondy [:bbondy] 2012-07-20 09:50:03 PDT
> That's exactly what I wanted to avoid by using kUnexpectedError. In this case 
> nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as 
> Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This 
> doesn't make any sense.

By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED exactly so it does make sense.  I'll add the extra error codes prefixed with kNotCorrupt though just to move things along.
Comment 16 Michal Novotny (:michal) 2012-07-20 10:36:16 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #15)
> By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED
> exactly so it does make sense.  I'll add the extra error codes prefixed with
> kNotCorrupt though just to move things along.

I overlooked it, sorry. So what about logging Telemetry::DISK_CACHE_CORRUPT_DETAILS only when we log Telemetry::DISK_CACHE_CORRUPT? Then you would not need to handle these few special cases in nsDiskCacheMap::Open()...
Comment 17 Brian R. Bondy [:bbondy] 2012-07-20 10:56:43 PDT
> I overlooked it, sorry. So what about logging
> Telemetry::DISK_CACHE_CORRUPT_DETAILS only when 
> we log Telemetry::DISK_CACHE_CORRUPT

That sounds good.
Comment 18 Brian R. Bondy [:bbondy] 2012-07-20 12:23:17 PDT
Created attachment 644418 [details] [diff] [review]
Patch v3

Thanks for the quick reviews, I'm looking forward to seeing this data.
I expect most to be header dirty, but it would be interesting to see if something else comes up.

Since we moved this to align with the DISK_CACHE_CORRUPT, that histogram is no longer useful as the same info can be seen inside DISK_CACHE_CORRUPT_DETAILS.
Comment 19 Michal Novotny (:michal) 2012-07-20 15:10:51 PDT
Comment on attachment 644418 [details] [diff] [review]
Patch v3

The patch looks good. Just one last thing. Couldn't be the enum CorruptCacheInfo defined e.g. in nsDiskCache class? Logically it doesn't belong to nsDiskCacheBlockFile.
Comment 20 Brian R. Bondy [:bbondy] 2012-07-20 16:20:26 PDT
Ya I agree regarding nsDiskCacheBlockFile.  The only downside is we'll have to include nsDIskCache.h from within nsDiskCacheBlockFile.h since enums cannot be forward declared in C++ except in C++11.  I'll attach a new patch with that and carry forward the r+.  Thanks for the review.
Comment 21 Brian R. Bondy [:bbondy] 2012-07-20 16:45:55 PDT
Created attachment 644525 [details] [diff] [review]
Patch v4

- Moved enum to nsDiskCache
- Added #include "nsDiskCache.h" inside nsDiskCacheBlockFile.h

Carrying forward r+ since it was previously r+ and this was a nit.
Comment 23 Ryan VanderMeulen [:RyanVM] 2012-07-21 14:42:40 PDT
https://hg.mozilla.org/mozilla-central/rev/4c632ce71344
Comment 24 Brian R. Bondy [:bbondy] 2012-07-24 11:16:48 PDT
Most of the cache clearing is due to a dirty header as expected.
A couple of interesting notes:

kCreateCacheSubdirectories is returned only on Windows and for 0.5%.
I think this probably suggests that the move of corrupted cache folder sometimes fails.
Let me know if you have any other ideas about that.

kEntryCountIncorrect is returned 0.4% of the times in Mac Which I think suggests entries with hashes that are 0.  On Windows this also appears with 0.04% of the time, and on Linux not at all.

Note You need to log in before you can comment on or make changes to this bug.