Last Comment Bug 777328 - Gather telemetry data for how much cache corruption reduction plan would help
: Gather telemetry data for how much cache corruption reduction plan would help
Status: RESOLVED FIXED
[Snappy]
:
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: x86_64 All
: -- normal (vote)
: mozilla17
Assigned To: Brian R. Bondy [:bbondy]
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on: 784414 784657 787576
Blocks: 105843
  Show dependency treegraph
 
Reported: 2012-07-25 06:22 PDT by Brian R. Bondy [:bbondy]
Modified: 2012-08-31 15:26 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch v1. (18.18 KB, patch)
2012-07-28 16:27 PDT, Brian R. Bondy [:bbondy]
michal.novotny: review-
Details | Diff | Splinter Review
Patch v2 (24.24 KB, patch)
2012-08-03 10:00 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v3 (26.25 KB, patch)
2012-08-19 20:50 PDT, Brian R. Bondy [:bbondy]
michal.novotny: review+
Details | Diff | Splinter Review
Patch v4. (26.40 KB, patch)
2012-08-20 05:04 PDT, Brian R. Bondy [:bbondy]
netzen: review+
Details | Diff | Splinter Review

Description Brian R. Bondy [:bbondy] 2012-07-25 06:22:08 PDT
This bug is to gather telemetry data for the cache corruption reduction plan mentioned in Bug 110894.  In particular, we'd make the cache header clean when it is not in use in that Bug 110894 plan.

This bug will add a new file, cachecorruption, which will store 0 or 1 which aligns with when the cache would be dirty / clean once Bug 110894 is implemented. 

We can then compare the new boolean telemetry histogram corrupt / not corrupt to the existing telemetry histogram for cache corruption which showed 10% corruption on Windows & Mac, and 20% on Linux.
Comment 1 Brian R. Bondy [:bbondy] 2012-07-28 16:27:07 PDT
Created attachment 646922 [details] [diff] [review]
Patch v1.

This patch doesn't actually change the way we detect a cache header being dirty (for now), but it does track an alternate way.

We report the result of the new tracked way with telemetry.  In this way we will be able to determine if it's possible to significantly reduce the cache corruption.

The alternate way that this patch tracks the cache being dirty, is the same as the cache corruption reduction plan I wrote about in bug 105843.

It works by having a new _CACHE_CLEAN_ file that stores a value of '0' for not clean or '1' for clean.   
Once the cache hasn't been used for 5 seconds, we call the revalidation function which writes out '1' to the _CACHE_CLEAN_ file (and in bug 105843 would also write out the changed records and header before that).
The first cache change in a spurt will write out a value of '0' in the _CACHE_CLEAN_ file.
Every cache change causes the timer that revalidates the cache back to 0.
Comment 2 Jason Duell [:jduell] (needinfo me) 2012-07-29 15:34:07 PDT
Comment on attachment 646922 [details] [diff] [review]
Patch v1.

Review of attachment 646922 [details] [diff] [review]:
-----------------------------------------------------------------

I'm guessing :michal or :hurley would be better for this.  I can take it if neither of them can.
Comment 3 Brian R. Bondy [:bbondy] 2012-07-29 15:38:22 PDT
OK cool, I was wondering if I should mark it as Michal but you had provided feedback in Bug 105843 so I went with you :)  Thanks for readjusting.
Comment 4 Michal Novotny (:michal) 2012-08-01 04:00:51 PDT
Comment on attachment 646922 [details] [diff] [review]
Patch v1.

> +      NS_WARNING("Could seek in cache map file!");

"Could not seek"


> +  mCleanCacheTimer = do_CreateInstance("@mozilla.org/timer;1", &rv);
> +  if (NS_SUCCEEDED(rv)) {
> +    rv = ResetCacheTimer();
> +  }

We could still initialize the cache on the main thread in some rare cases. So if we are on the main thread here, we should set cache IO thread as the target.


> +nsresult
> +nsDiskCacheMap::RevalidateCache()
> +{
> +  CACHE_LOG_DEBUG(("CACHE: RevalidateCache\n"));
> +  nsresult rv;

This method needs to grab the cache lock.


> +nsresult
> +nsDiskCacheMap::ResetCacheTimer()
> +{
> +  mCleanCacheTimer->Cancel();

IMO this isn't enough to ensure that the callback won't be called. RevalidateCache() could be already waiting for the lock mentioned above.


> Once the cache hasn't been used for 5 seconds, we call the revalidation
> function which writes out '1' to the _CACHE_CLEAN_ file (and in bug 105843
> would also write out the changed records and header before that).

The fact that the cache hasn't been used for some time doesn't ensure that the cache could be flushed to a consistent state. Doomed entries that are in use are removed from the records, but they still occupy the storage. So if we flush the records and header and the crash occurs before the storage is freed then the space will never be freed. Ie. we must not revalidate the cache if the list mDoomedEntries is not empty.
Comment 5 Michal Novotny (:michal) 2012-08-01 04:03:33 PDT
> > +nsresult
> > +nsDiskCacheMap::RevalidateCache()
> > +{
> > +  CACHE_LOG_DEBUG(("CACHE: RevalidateCache\n"));
> > +  nsresult rv;
> 
> This method needs to grab the cache lock.

This is not correct. The lock must be grabbed in RevalidateTimerCallback() since RevalidateCache() is called from nsDiskCacheMap::FlushHeader() under the lock.
Comment 6 Brian R. Bondy [:bbondy] 2012-08-03 10:00:57 PDT
Created attachment 648755 [details] [diff] [review]
Patch v2

Thanks for the review!

Implemented all review comments except this one:
> +nsresult
> +nsDiskCacheMap::ResetCacheTimer()
> +{
> +  mCleanCacheTimer->Cancel();
> IMO this isn't enough to ensure that the callback won't be called. 
> RevalidateCache() could be already waiting for the lock mentioned above.

I think both the timer and RevalidateCache will always be called off the main thread in the Cache IO thread. In the case of init Open where you mentioned there could be a race condition, that will not have anything inside RevalidateCache already.  So I think that should be enough, but let me know if I'm missing something else.
Comment 7 Brian R. Bondy [:bbondy] 2012-08-13 06:51:37 PDT
You might have been away, but just wondering if you had an ETA on this patch review? Thanks!
Comment 8 Michal Novotny (:michal) 2012-08-15 07:55:11 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #6)
> > +  mCleanCacheTimer->Cancel();
> > IMO this isn't enough to ensure that the callback won't be called. 
> > RevalidateCache() could be already waiting for the lock mentioned above.
> 
> I think both the timer and RevalidateCache will always be called off the
> main thread in the Cache IO thread.

mCleanCacheTimer->Cancel() in nsDiskCacheMap::ResetCacheTimer() is called from nsDiskCacheMap::InvalidateCache() which can be called on any thread. I.e. InvalidateCache() can be called on the main thread while the timer might be already running (and waiting for a lock in nsDiskCacheMap::RevalidateTimerCallback()), so mCleanCacheTimer->Cancel() will do nothing and the timer will be executed, instead of being canceled.


> +nsDiskCacheMap::IsCacheInSafeState()
> +{
> +  return nsCacheService::GlobalInstance()->IsDoomListEmpty();
> +}

I thought about this and I'm not sure it is enough. I need to check what exactly happens if the crash occurs in the middle of writing to some entry.


Btw. you use 2 spaces for indentation but both nsCacheService.cpp and nsDiskCacheMap.cpp use 4 spaces.
Comment 9 Brian R. Bondy [:bbondy] 2012-08-17 08:04:32 PDT
> I thought about this and I'm not sure it is enough. I need to check what exactly > happens if the crash occurs in the middle of writing to some entry.

Do you think you'll have time to look into this more soon? If not, do you think we could land what we have with the first comment fixed since it's just to give us a rough estimate of how much such a system would help us?  We could report as a new stat if we discover other things that must be checked after.
Comment 10 Michal Novotny (:michal) 2012-08-17 12:21:44 PDT
I'll have a look at it during the weekend.
Comment 11 Brian R. Bondy [:bbondy] 2012-08-19 20:50:58 PDT
Created attachment 653248 [details] [diff] [review]
Patch v3

Rebased to m-c tip, implemented review comments.
Comment 12 Michal Novotny (:michal) 2012-08-20 04:57:19 PDT
Comment on attachment 653248 [details] [diff] [review]
Patch v3

Review of attachment 653248 [details] [diff] [review]:
-----------------------------------------------------------------

Current cache code doesn't handle partially written entries correctly, but let's assume that it can be easily fixed. r+ with those nits fixed.

::: netwerk/cache/nsDiskCacheMap.cpp
@@ +61,5 @@
>      rv = NS_ERROR_FILE_CORRUPTED;  // presume the worst
>  
> +    if (NS_FAILED(InitCacheClean(cacheDirectory, corruptInfo))) {
> +        // corruptInfo is set in the call to InitCacheClean
> +        goto error_exit;

Compilation fails on my linux box because this goto crosses initialization of mapSize variable.

@@ +1355,5 @@
> +        rv = diskCacheMap->RevalidateCache();
> +    }
> +
> +    if (NS_FAILED(rv)) {
> +        diskCacheMap->ResetCacheTimer(1000);

Maybe create a constant for this value too?
Comment 13 Brian R. Bondy [:bbondy] 2012-08-20 05:04:55 PDT
Created attachment 653341 [details] [diff] [review]
Patch v4.

Implemented nits, carrying forward r+.
I'll push this to try and if there are no errors I'll land it on m-i.
Comment 15 Brian R. Bondy [:bbondy] 2012-08-20 10:37:41 PDT
Thanks for the review Michal.

After we get data, assuming it is a significant benefit, I'll continue work in Bug 105843.
Comment 16 Ed Morley [:emorley] 2012-08-21 06:31:13 PDT
https://hg.mozilla.org/mozilla-central/rev/f11e42f6678b

Note You need to log in before you can comment on or make changes to this bug.