Closed Bug 774146 Opened 13 years ago Closed 13 years ago

Get more detail on reason for DISK_CACHE_CORRUPT failures

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla17

People

(Reporter: bbondy, Assigned: bbondy)

References

Details

(Whiteboard: [snappy:p2])

Attachments

(1 file, 3 obsolete files)

Patch v1. 13 years ago Brian R. Bondy [:bbondy] 10.83 KB, patch	michal : review-	Details \| Diff \| Splinter Review
Patch v2. 13 years ago Brian R. Bondy [:bbondy] 13.51 KB, patch	michal : review-	Details \| Diff \| Splinter Review
Patch v3 13 years ago Brian R. Bondy [:bbondy] 18.81 KB, patch	michal : review+	Details \| Diff \| Splinter Review
Patch v4 13 years ago Brian R. Bondy [:bbondy] 19.93 KB, patch	bbondy : review+	Details \| Diff \| Splinter Review

Brian R. Bondy [:bbondy]

Assignee

Description

•

13 years ago

From the data in DISK_CACHE_CORRUPT, 12.5% of network startups are corrupt. This seems very high. There are many different code paths that return the corrupt error code that triggers this 12.5% though. The reason for the corrupt cache is likely a dirty header, but we should add telemetry data to be sure.

Brian R. Bondy [:bbondy]

Assignee

Updated

•

13 years ago

Whiteboard: [snappy]

Brian R. Bondy [:bbondy]

Assignee

Comment 1

•

13 years ago

Attached patch Patch v1. (obsolete) — Details — Splinter Review

I seen you did the previous telemetry data around this area of code so thought I'd hit you up again for a review :)

Attachment #642984 - Flags: review?(jduell.mcbugs)

Jason Duell

Comment 2

•

13 years ago

Comment on attachment 642984 [details] [diff] [review] Patch v1. Review of attachment 642984 [details] [diff] [review]: ----------------------------------------------------------------- Brian or Michal are better to review this--giving to Michal because I know Brian is busy with data-jar for B2G this week. ::: toolkit/components/telemetry/TelemetryHistograms.h @@ +194,5 @@ > #undef _HTTP_HIST > #undef HTTP_HISTOGRAMS > > HISTOGRAM(DISK_CACHE_CORRUPT, 0, 1, 2, BOOLEAN, "Was the HTTP disk cache corrupt at startup?") > +HISTOGRAM(DISK_CACHE_CORRUPT_DETAILS, 1, 30, 31, LINEAR, "Why the HTTP disk cache was corrupted at startup") min/max values are 1-30, so bucket count should be 30 instead of 31?

Attachment #642984 - Flags: review?(jduell.mcbugs) → review?(michal.novotny)

Brian R. Bondy [:bbondy]

Assignee

Comment 3

•

13 years ago

> Brian or Michal are better to review this--giving to Michal > because I know Brian is busy with data-jar for B2G this week. OK, thanks! > min/max values are 1-30, so bucket count should be 30 instead of 31? 31 is correct. The min value (1 in this case) is the smallest non-zero value, and 0 is always implicitly included.

Michal Novotny [:michal]

Comment 4

•

13 years ago

Comment on attachment 642984 [details] [diff] [review] Patch v1. > nsresult > -nsDiskCacheMap::Open(nsIFile * cacheDirectory) > +nsDiskCacheMap::Open(nsIFile * cacheDirectory, > + nsDiskCacheMap::CorruptCacheInfo & corruptInfo) > { Please pass a pointer instead of a reference. > + corruptInfo = nsDiskCacheMap::kNotCorrupt; IMO the default value should be some error (e.g. kUnexpectedError). There are few places in nsDiskCacheMap::Open() where corruptInfo isn't set appropriately in case of failure. These failures are highly unlikely or almost impossible, but just in case it really happen we shouldn't return kNotCorrupt. > mRecordArray = (nsDiskCacheRecord *) > PR_CALLOC(mHeader.mRecordCount * sizeof(nsDiskCacheRecord)); > if (!mRecordArray) { > + corruptInfo = nsDiskCacheMap::kOutOfMemory; > mRecordArray = (nsDiskCacheRecord *) PR_MALLOC(recordArraySize); > if (!mRecordArray) { > + corruptInfo = nsDiskCacheMap::kOutOfMemory; AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right? So there is no need to handle this error... > PRUint32 recordArraySize = > mHeader.mRecordCount * sizeof(nsDiskCacheRecord); > - if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader)) > + if (mapSize < recordArraySize + sizeof(nsDiskCacheHeader)) { > + corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough; > goto error_exit; > + } > } else { > + corruptInfo = nsDiskCacheMap::kMapSizeNotLargeEnough; > goto error_exit; > } We should differentiate the corruptInfo in these 2 cases. The first could be named like kRecordsIncomplete and the second kHeaderIncomplete. > rv = OpenBlockFiles(); > - if (NS_FAILED(rv)) goto error_exit; > + if (NS_FAILED(rv)) { > + corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFiles; > + goto error_exit; > + } Maybe we could pass the corruptInfo to nsDiskCacheMap::OpenBlockFiles() and then to nsDiskCacheBlockFile::Open() to get a more detailed error here.

Attachment #642984 - Flags: review?(michal.novotny) → review-

Brian R. Bondy [:bbondy]

Assignee

Comment 5

•

13 years ago

> Please pass a pointer instead of a reference. Is there a reason why? I'll change it if you really want, but I don't see why it's a problem and I prefer the simplified code where a NULL error condition check is not needed. I've used references pretty extensively throughout a couple hundred patches and haven't had any negative reviews or nits because of it. I don't see it documented here either: https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide The only argument I've ever heard is that it makes it more readable when you call the function, but in this case I think it's pretty obvious. > re: the other nits Sounds good, I'll update them.

Brian R. Bondy [:bbondy]

Assignee

Comment 6

•

13 years ago

Attached patch Patch v2. (obsolete) — Details — Splinter Review

Implemented review comments minus what was mentioned in Comment 5.

Attachment #642984 - Attachment is obsolete: true

Attachment #643384 - Flags: review?(michal.novotny)

Jason Duell

Comment 7

•

13 years ago

> AFAIK both PR_CALLOC() and PR_MALLOC() are now infallible just like new[], right? Nope. moz_xmalloc is infallible, but not the NSPR functions. Is there any reason we need to use the PR_ allocators? https://developer.mozilla.org/en/Infallible_memory_allocation

Brian R. Bondy [:bbondy]

Assignee

Comment 8

•

13 years ago

I can re-add the memory error codes as a nit before pushing if you want or on the next set of review comments. As for changing the allocator functions that should be a different bug if a change is wanted.

Jason Duell

Comment 9

•

13 years ago

> as for changing the allocator functions That's fine, but then keep the error check for the existing allocation calls if they fail :)

Brian R. Bondy [:bbondy]

Assignee

Comment 10

•

13 years ago

Yup I will, just will do it before pushing or on the next set of review comments. > I can re-add the memory error codes as a nit before pushing if > you want or on the next set of review comments

(dormant account)

Updated

•

13 years ago

Whiteboard: [snappy] → [snappy:p2]

Michal Novotny [:michal]

Comment 11

•

13 years ago

Comment on attachment 643384 [details] [diff] [review] Patch v2. (In reply to Brian R. Bondy [:bbondy] from comment #5) > > Please pass a pointer instead of a reference. > > Is there a reason why? I'll change it if you really want, but I don't see > why it's a problem and I prefer the simplified code where a NULL error > condition check is not needed. I've used references pretty extensively > throughout a couple hundred patches and haven't had any negative reviews or > nits because of it. I don't see it documented here either: > https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide > > The only argument I've ever heard is that it makes it more readable when you > call the function, but in this case I think it's pretty obvious. It could be that this has changed over the time, but a long time ago I was told that reference should be used only when passing const nsACString/nsAString... Anyway, the general rule is to use the style of the subtree and we don't use references in the cache tree. > + if (!cacheDirectory) { > + corruptInfo = nsDiskCacheMap::kNotCorrupt; > + NS_WARNING("cacheDirectory must not be NULL"); > + return NS_ERROR_INVALID_POINTER; > + } That's exactly what I wanted to avoid by using kUnexpectedError. In this case nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This doesn't make any sense. All these errors that shouldn't normally happen (not just this one) should be logged as kUnexpectedError. And in case the telemetry data shows non zero for this error we should investigate why that happened. > rv = mBlockFile[i].Open(blockFile, blockSize, bitMapSize); > - if (NS_FAILED(rv)) break; > + if (NS_FAILED(rv)) { > + corruptInfo = nsDiskCacheMap::kCouldNotOpenBlockFileForIndex; > + break; > + } I meant to propagate the corruptInfo down to nsDiskCacheBlockFile::Open().

Attachment #643384 - Flags: review?(michal.novotny) → review-

Brian R. Bondy [:bbondy]

Assignee

Comment 12

•

13 years ago

We are trying to track the cases exactly where we have a NS_ERROR_FILE_CORRUPTED. Because this is the only time that we call nsDeleteDir::DeleteDir(mCacheDirectory... > } else if (rv == NS_ERROR_FILE_CORRUPTED) { > Telemetry::Accumulate(Telemetry::DISK_CACHE_CORRUPT, 1); > // delay delete by 1 minute to avoid IO thrash at startup > rv = nsDeleteDir::DeleteDir(mCacheDirectory, true, 60000); > if (NS_FAILED(rv)) > return rv; > exists = false; > } else { For that reason I only want corrupt errors to show up in this histogram. For the errors at the top, I want them to be marked as not corrupt because they are not, they are some other type of error that does not lead to us nuking the cache directory.

Brian R. Bondy [:bbondy]

Assignee

Comment 13

•

13 years ago

> the general rule is to use the style of the subtree and we don't use > references in the cache tree. Sounds fair on those grounds. I'll update to pointers.

Brian R. Bondy [:bbondy]

Assignee

Comment 14

•

13 years ago

Submitting a new patch shortly with all review comments addressed.

Brian R. Bondy [:bbondy]

Assignee

Comment 15

•

13 years ago

> That's exactly what I wanted to avoid by using kUnexpectedError. In this case > nsDiskCacheMap::Open() fails, in nsDiskCacheDevice::OpenDiskCache() we log 1 as > Telemetry::DISK_CACHE_CORRUPT, but the detailed reason is kNotCorrupt. This > doesn't make any sense. By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED exactly so it does make sense. I'll add the extra error codes prefixed with kNotCorrupt though just to move things along.

Michal Novotny [:michal]

Comment 16

•

13 years ago

(In reply to Brian R. Bondy [:bbondy] from comment #15) > By the way we only log 1 when a return value is NS_ERROR_FILE_CORRUPTED > exactly so it does make sense. I'll add the extra error codes prefixed with > kNotCorrupt though just to move things along. I overlooked it, sorry. So what about logging Telemetry::DISK_CACHE_CORRUPT_DETAILS only when we log Telemetry::DISK_CACHE_CORRUPT? Then you would not need to handle these few special cases in nsDiskCacheMap::Open()...

Brian R. Bondy [:bbondy]

Assignee

Comment 17

•

13 years ago

> I overlooked it, sorry. So what about logging > Telemetry::DISK_CACHE_CORRUPT_DETAILS only when > we log Telemetry::DISK_CACHE_CORRUPT That sounds good.

Brian R. Bondy [:bbondy]

Assignee

Comment 18

•

13 years ago

Attached patch Patch v3 (obsolete) — Details — Splinter Review

Thanks for the quick reviews, I'm looking forward to seeing this data. I expect most to be header dirty, but it would be interesting to see if something else comes up. Since we moved this to align with the DISK_CACHE_CORRUPT, that histogram is no longer useful as the same info can be seen inside DISK_CACHE_CORRUPT_DETAILS.

Attachment #643384 - Attachment is obsolete: true

Attachment #644418 - Flags: review?(michal.novotny)

Michal Novotny [:michal]

Comment 19

•

13 years ago

Comment on attachment 644418 [details] [diff] [review] Patch v3 The patch looks good. Just one last thing. Couldn't be the enum CorruptCacheInfo defined e.g. in nsDiskCache class? Logically it doesn't belong to nsDiskCacheBlockFile.

Attachment #644418 - Flags: review?(michal.novotny) → review+

Brian R. Bondy [:bbondy]

Assignee

Comment 20

•

13 years ago

Ya I agree regarding nsDiskCacheBlockFile. The only downside is we'll have to include nsDIskCache.h from within nsDiskCacheBlockFile.h since enums cannot be forward declared in C++ except in C++11. I'll attach a new patch with that and carry forward the r+. Thanks for the review.

Brian R. Bondy [:bbondy]

Assignee

Comment 21

•

13 years ago

Attached patch Patch v4 — Details — Splinter Review

- Moved enum to nsDiskCache - Added #include "nsDiskCache.h" inside nsDiskCacheBlockFile.h Carrying forward r+ since it was previously r+ and this was a nit.

Attachment #644418 - Attachment is obsolete: true

Attachment #644525 - Flags: review+

Brian R. Bondy [:bbondy]

Assignee

Comment 22

•

13 years ago

http://hg.mozilla.org/integration/mozilla-inbound/rev/4c632ce71344

Ryan VanderMeulen [:RyanVM]

Comment 23

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/4c632ce71344

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Brian R. Bondy [:bbondy]

Assignee

Comment 24

•

13 years ago

Most of the cache clearing is due to a dirty header as expected. A couple of interesting notes: kCreateCacheSubdirectories is returned only on Windows and for 0.5%. I think this probably suggests that the move of corrupted cache folder sometimes fails. Let me know if you have any other ideas about that. kEntryCountIncorrect is returned 0.4% of the times in Mac Which I think suggests entries with hashes that are 0. On Windows this also appears with 0.04% of the time, and on Linux not at all.

Brian R. Bondy [:bbondy]

Assignee

Updated

•

13 years ago

Blocks: 105843

You need to log in before you can comment on or make changes to this bug.