Closed Bug 724513 Opened 12 years ago Closed 12 years ago

Startupcache can load invalid cache if unable to delete startupCache file

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla19

People

(Reporter: graememcc, Assigned: graememcc)

References

Details

Attachments

(3 files, 8 obsolete files)

I have been wrestling with bug 722806, with possibly corrupted XBL in the startup cache for a particular profile. The corruption aside, I realised the corrupted cache survived the version bump from 12 to 13 on January 31st - it should have been invalidated and recreated. (I have since learned it should also have been invalidated whenever I updated my build). Something was circumventing the version checking protections.

I examined the startupCache file in the profile dir, I found it hailed from January 14th. Also, the startupCache file in that profile directory had been created with owner root - no idea how I managed that.

However, this highlighted how the cache checking code in XRE_Main was failing. After determining that the cache was out of date, RemoveComponentRegistries tries to delete the startup cache directory on disk. In this case, it fails with an access denied error, however RemoveComponentRegistries does not report any errors. Hence the operation is assumed to have succeeded, and when the startupCache singleton is initialised, the out of date file is still there, and is read in, (and continues to be re-read any time InvalidateCache is called).

I'm aware this is very much an uncommon edge case, but should we consider signalling to the startup cache not to trust the file on disk in this case? It would mean the profile user pays the cost of slow-loading of all components, but that should be measured against the risk of loading invalid data.
A local flag would be able to help *once* on version upgrade, but the next startup our compatibility.ini is updated with new values, so how would we know then that the startup cache was invalid?

I suspect that this isn't worth fully fixing, although I'd certainly support adding a console warning for this case if we can manage it and maybe even a crash annotation or something like that.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #1)
> A local flag would be able to help *once* on version upgrade, but the next
> startup our compatibility.ini is updated with new values, so how would we
> know then that the startup cache was invalid?

I was thinking if the deletion fails on a version bump, we could have WriteVersion append the "InvalidateCaches=1" line, analogous to what InvalidateCachesOnRestart currently does, which would be picked up by CheckCompatability on next restart, and trigger another deletion attempt.
Attached patch Part 1: StartupCache changes (obsolete) — Splinter Review
API change to allow clients to signal to StartupCache to disregard the contents of the disk file.

I'm not providing the ability for clients to signal the inverse; I don't see that any one client can safely make that determination, given that other clients can trigger invalidations, and writes to disk etc. However, I do allow the startupCache itself to do so, on the conclusion of a successful write.

Not in love with the conditional test, but can't think of a way to temporarily prevent the file deletion on Windows.
Attachment #596015 - Flags: superreview?(benjamin)
Attachment #596015 - Flags: review?(benjamin)
Attached patch Part 1: nsAppRunner changes (obsolete) — Splinter Review
As mentioned in comment 2, I think we can use "InvalidateCaches" to continue to spot this case after a version bump. Using NS_WARNING for the moment, or would you prefer me to grab the console service after XPCOM init, and warn there?
Attachment #596016 - Flags: review?(benjamin)
Comment on attachment 596016 [details] [diff] [review]
Part 1: nsAppRunner changes

Sorry for the review delay. I'm torn about the complexity, but I guess we should try this. Detailed patch notes inline:

>diff --git a/toolkit/xre/nsAppRunner.cpp b/toolkit/xre/nsAppRunner.cpp

>-static void RemoveComponentRegistries(nsIFile* aProfileDir, nsIFile* aLocalProfileDir,
>+static nsresult RemoveComponentRegistries(nsIFile* aProfileDir, nsIFile* aLocalProfileDir,
>                                       bool aRemoveEMFiles)


1) please make this return bool not nsresult. The meaning of the return value should be documented. The checks for NS_ERROR_FILE_TARGET_DOES_NOT_EXIST that are no sprinkled through the callers should be moved into this method itself. You can then get rid of the local "rv"

2) While you're touching this, please fix the indentation to
static bool
RemoveComponentReg...

>   nsCOMPtr<nsIFile> file;
>   aProfileDir->Clone(getter_AddRefs(file));
>   if (!file)
>-    return;
>+    return rv;

Clone should never fail, but if it does you should return false, not ok. It means we're OOM. Once more below.

>@@ -3471,6 +3490,14 @@ XRE_main(int argc, char* argv[], const n
> 
>         SaveStateForAppInitiatedRestart();
> 
>+        if (invalidateCache) {
>+          StartupCache* sc = StartupCache::GetSingleton();
>+          if (sc)
>+            sc->IgnoreDiskCache();
>+
>+          NS_WARNING("Unable to remove invalid startupCache. Ignoring contents");
>+        }

I think this will be too late. We've already started up a lot of stuff by this point. Ideally I think we should make this a static variable which we set before launching XPCOM at all, so that the startup cache checks it right when it is initialized during the startup sequence.
Attachment #596016 - Flags: review?(benjamin) → review-
Oh also, we should definitely add a telemetry notation for this case so that we can tell if people are hitting it in the wild.
Comment on attachment 596015 [details] [diff] [review]
Part 1: StartupCache changes

As noted in the other review, mIgnoreDiskCache should probably be a static gIgnoreDiskCache. With that out of the way, I don't think it's necessary to expose this on nsIStartupCache at all.

On Windows you should be able to prevent a file from being deleted by holding it open.
Attachment #596015 - Flags: superreview?(benjamin)
Attachment #596015 - Flags: review?(benjamin)
Attached patch Part 1: nsAppRunner changes (obsolete) — Splinter Review
Attachment #596016 - Attachment is obsolete: true
Attachment #616617 - Flags: review?(benjamin)
Attached patch Part 2: StartupCache changes (obsolete) — Splinter Review
Detecting the failure in appRunner means my previous testing approach won't work.
Attachment #596015 - Attachment is obsolete: true
Attachment #616622 - Flags: review?(benjamin)
Comment on attachment 616622 [details] [diff] [review]
Part 2: StartupCache changes

Sorry, this one's not quite right
Attachment #616622 - Flags: review?(benjamin)
Comment on attachment 616622 [details] [diff] [review]
Part 2: StartupCache changes

Actually, I stand by it. The test failures I were getting were an honest to goodness Telemetry bug.
Attachment #616622 - Flags: review?(benjamin)
Depends on: 747163
Comment on attachment 616617 [details] [diff] [review]
Part 1: nsAppRunner changes

mwu has graciously agreed to do this review since it has just kept falling off my list of urgent things. I apologize for the delay!
Attachment #616617 - Flags: review?(benjamin) → review?(mwu)
Attachment #616622 - Flags: review?(benjamin) → review?(mwu)
Comment on attachment 616617 [details] [diff] [review]
Part 1: nsAppRunner changes

Review of attachment 616617 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry about the delay - I think we're close here, but I'd like to see the interface to disabling loads in StartupCache changed.

::: toolkit/xre/nsAppRunner.cpp
@@ +2419,5 @@
>  
> +  if (!gDiskStartupCacheValid) {
> +    static const char kInvalidationHeader[] = NS_LINEBREAK "InvalidateCaches=1" NS_LINEBREAK;
> +    PR_Write(fd, kInvalidationHeader, sizeof(kInvalidationHeader) - 1);
> +  }

A few things to make this fit the rest of the function better -
1. Put kInvalidationHeader above the check.
2. Put this above the code that writes kNL, and
3. Change kInvalidationHeader to "InvalidateCaches=1" NS_LINEBREAK (removing the NS_LINEBREAK at the beginning).

@@ +2464,5 @@
> +
> +  if (NS_FAILED(rv) && rv != NS_ERROR_FILE_TARGET_DOES_NOT_EXIST)
> +    return false;
> +
> +  return true;

I think we can do something like
return NS_SUCCEEDED(rv) || rv == NS_ERROR_FILE_TARGET_DOES_NOT_EXIST;

@@ +3794,4 @@
>    return rv;
>  }
>  
> +bool gDiskStartupCacheValid = true;

I like the previous approach of adding a method to StartupCache a bit more. You'll just have to make that method static so it can be called immediately after the call to RemoveComponentRegistries.
Attachment #616617 - Flags: review?(mwu)
Comment on attachment 616622 [details] [diff] [review]
Part 2: StartupCache changes

Clearing review here - comments are on the other patch.
Attachment #616622 - Flags: review?(mwu)
Attached patch Part 1: Startupcache changes (obsolete) — Splinter Review
This follows bsmedberg's suggestion in comment 7 for Windows, which worked perfectly on Try.

Lifting the code from StartupCache.cpp for the filename in the test isn't ideal, but the alternative of trying to find a known location we could put a temporary file, setting the MOZ_STARTUP_CACHE environment variable, ensuring the cache was flushed at the right time.. was getting too messy.

Note this still modifies nsIStartupCache to expose the method on the wrapper, but I understand the interface is purely a convenience for testing anyway.
Attachment #616622 - Attachment is obsolete: true
Attachment #670347 - Flags: review?(mwu)
Attached patch Part 2: AppRunner changes (obsolete) — Splinter Review
Attachment #616617 - Attachment is obsolete: true
Attachment #670348 - Flags: review?(mwu)
InvalidateCache seemed to be the best place the telemetry change, as we will always call this, regardless of whether gIgnoreDiskCache is changed before or after initialization.
Attachment #670349 - Flags: review?(mwu)
Comment on attachment 670349 [details] [diff] [review]
Part 3: Telemetry

Taras, would you be able to look at this? I'm not too familiar with telemetry code..
Attachment #670349 - Flags: review?(mwu) → review?(taras.mozilla)
Comment on attachment 670349 [details] [diff] [review]
Part 3: Telemetry

+ mozilla::Telemetry::Accumulate(Telemetry::STARTUP_CACHE_INVALID, gIgnoreDiskCache);
+  if (gIgnoreDiskCache) {
+         return;
   }

might be slightly more robust in face of changing code.

+  "STARTUP_CACHE_INVALID": {
+    "kind": "boolean",
+    "description": "Was the disk startup cache file detected as invalid"
+  },

this should be a flag(flags are always initialized to 0). More to come later
Comment on attachment 670348 [details] [diff] [review]
Part 2: AppRunner changes

Review of attachment 670348 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good except for a broken looking check for file->Remove. The removal of the global was good.

::: toolkit/xre/nsAppRunner.cpp
@@ +2449,4 @@
>    static const char kNL[] = NS_LINEBREAK;
>    PR_Write(fd, kNL, sizeof(kNL) - 1);
>  
> +

nit: unnecessary new line

@@ +2490,5 @@
>    file->Remove(false);
>  
>    file->SetNativeLeafName(NS_LITERAL_CSTRING("startupCache"));
> +  nsresult rv = file->Remove(true);
> +  return NS_SUCCEEDED(rv) || NS_ERROR_FILE_TARGET_DOES_NOT_EXIST;

I think the second check is suppose to be rv == NS_ERROR_FILE_TARGET_DOES_NOT_EXIST

@@ +3570,5 @@
>    }
>  
> +  if (invalidateStartupCache)
> +    StartupCache::IgnoreDiskCache();
> +    

nit: trailing space.
Attachment #670348 - Flags: review?(mwu)
Comment on attachment 670347 [details] [diff] [review]
Part 1: Startupcache changes

Review of attachment 670347 [details] [diff] [review]:
-----------------------------------------------------------------

::: startupcache/StartupCache.cpp
@@ +203,5 @@
>                                       false);
>    NS_ENSURE_SUCCESS(rv, rv);
>    
> +  if (!gIgnoreDiskCache)
> +    rv = LoadArchive(RECORD_AGE);

There's already a check in LoadArchive, so we don't need another one here, though I don't care much either way.

@@ +480,5 @@
> +  if (NS_FAILED(rv) && rv != NS_ERROR_FILE_TARGET_DOES_NOT_EXIST &&
> +      rv != NS_ERROR_FILE_NOT_FOUND) {
> +    gIgnoreDiskCache = true;
> +    return;
> +  }

Should we set gIgnoreDiskCache to false if the remove succeeds? It seems that if we don't, gIgnoreDiskCache will be left on in the rest of the startupcache tests.

@@ +732,5 @@
>    return NS_OK;
>  }
>  
> +nsresult
> +StartupCacheWrapper::IgnoreDiskCache() 

nit: trailing space.

::: startupcache/test/TestStartupCache.cpp
@@ +311,5 @@
> +  sc->InvalidateCache();
> +  
> +  const char* buf = "Get a Beardbook app for your smartphone";
> +  const char* id = "id";
> +  char* outbuf= NULL;

nit: space before =

@@ +319,5 @@
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  rv = sc->ResetStartupWriteTimer();
> +  rv = WaitForStartupTimer();
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  

nit: trailing spaces
Attachment #670347 - Flags: review?(mwu)
Attachment #670349 - Flags: review?(taras.mozilla) → review+
Attached patch Part 1: Startupcache changes (obsolete) — Splinter Review
Per comments.

Thanks for spotting that gIgnoreDiskCache would be left as true for any remaining tests. I've made the appropriate change. However, I can't think of a way to test it: once we've "unlocked" the file, we'd need to place a value in the cache, wait on a successful write to disk, clear the in-memory cache, reload from disk and confirm the value is present. The only method that wipes the in-memory cache is InvalidateCache, but that also blows away the disk file.
Attachment #670347 - Attachment is obsolete: true
Attachment #673360 - Flags: review?(mwu)
Attached patch Part 2: AppRunner changes (obsolete) — Splinter Review
Per comments.
Attachment #670348 - Attachment is obsolete: true
Attachment #673362 - Flags: review?(mwu)
Attachment #673362 - Flags: review?(mwu) → review+
> mozilla::Telemetry::Accumulate(Telemetry::STARTUP_CACHE_INVALID, gIgnoreDiskCache);

Unfortunately this won't work - see http://hg.mozilla.org/mozilla-central/file/85172b4eb961/ipc/chromium/src/base/histogram.cc#l982

FlagHistograms assume that any call to Accumulate means the flag condition has been triggered - this would always set the flag, regardless of the value of gIgnoreDiskCache. Indeed line 983 aborts on debug builds if gIgnoreDiskCache is false.
(In reply to Graeme McCutcheon [:graememcc] from comment #22)
> Thanks for spotting that gIgnoreDiskCache would be left as true for any
> remaining tests. I've made the appropriate change. However, I can't think of
> a way to test it: once we've "unlocked" the file, we'd need to place a value
> in the cache, wait on a successful write to disk, clear the in-memory cache,
> reload from disk and confirm the value is present. The only method that
> wipes the in-memory cache is InvalidateCache, but that also blows away the
> disk file.

Yeah I don't think we can test this without adding an additional method to wipe the in-memory cache but not the version on the disk.
Comment on attachment 673360 [details] [diff] [review]
Part 1: Startupcache changes

Review of attachment 673360 [details] [diff] [review]:
-----------------------------------------------------------------

::: startupcache/StartupCache.cpp
@@ -202,4 @@
>                                       false);
>    NS_ENSURE_SUCCESS(rv, rv);
>    
> -  rv = LoadArchive(RECORD_AGE);

Did you mean to remove this?
(In reply to Michael Wu [:mwu] from comment #25)
> (In reply to Graeme McCutcheon [:graememcc] from comment #22)
> > Thanks for spotting that gIgnoreDiskCache would be left as true for any
> > remaining tests. I've made the appropriate change. However, I can't think of
> > a way to test it: once we've "unlocked" the file, we'd need to place a value
> > in the cache, wait on a successful write to disk, clear the in-memory cache,
> > reload from disk and confirm the value is present. The only method that
> > wipes the in-memory cache is InvalidateCache, but that also blows away the
> > disk file.
> 
> Yeah I don't think we can test this without adding an additional method to
> wipe the in-memory cache but not the version on the disk.

Though there would be one terrible hack you can try - make a copy of the startup cache early in the test and copy it back into the place when we want to check that we can read from the startup cache.
(In reply to Michael Wu [:mwu] from comment #26)
> Did you mean to remove this?

Ugh, no sorry. Late night carelessness creeping in.

I'll take a look at testing by copying the sc file over the weekend.
(In reply to Graeme McCutcheon [:graememcc] from comment #24)
> > mozilla::Telemetry::Accumulate(Telemetry::STARTUP_CACHE_INVALID, gIgnoreDiskCache);
> 
> Unfortunately this won't work - see
> http://hg.mozilla.org/mozilla-central/file/85172b4eb961/ipc/chromium/src/
> base/histogram.cc#l982
> 
> FlagHistograms assume that any call to Accumulate means the flag condition
> has been triggered - this would always set the flag, regardless of the value
> of gIgnoreDiskCache. Indeed line 983 aborts on debug builds if
> gIgnoreDiskCache is false.

You are right, thanks for pointing this out.
> Though there would be one terrible hack you can try - make a copy of the 
> startup cache early in the test and copy it back into the place when we want
> to check that we can read from the startup cache.

Tried to construct a test with this approach, but was foiled again. When InvalidateCache calls LoadArchive, LoadArchive calls OpenArchive, which is where the file handle is opened. So even copying back our earlier copy immediately after the InvalidateCache call, we're too late - calls to GetBufferFromArchive will read from the old (empty) file. The only way we can coerce a LoadArchive call is via InvalidateCache, which deletes the file.
Attachment #673360 - Attachment is obsolete: true
Attachment #673360 - Flags: review?(mwu)
Attachment #674343 - Flags: review?(mwu)
Comment on attachment 674343 [details] [diff] [review]
Part 1: Startupcache changes

Let's get a followup bug filed about not being able to check if we're still ignoring the disk cache.
Attachment #674343 - Flags: review?(mwu) → review+
I slightly tweaked TestHistogramValues in part 3 to get rid of the redundant else statements after a return. 

https://hg.mozilla.org/integration/mozilla-inbound/rev/1baac3f2334c
https://hg.mozilla.org/integration/mozilla-inbound/rev/f418d97d12ac
https://hg.mozilla.org/integration/mozilla-inbound/rev/006eb98e9d6f
Assignee: nobody → graememcc_firefox
This regressed Ts across the board, for example, see:

Regression  Ts Paint, MAX Dirty Profile increase 27.1% on XP Mozilla-Inbound-Non-PGO
--------------------------------------------------------------------------------------
    Previous: avg 582.219 stddev 1.596 of 30 runs up to revision de25c9c57c53
    New     : avg 740.074 stddev 0.991 of 5 runs since revision 006eb98e9d6f
    Change  : +157.855 (27.1% / z=98.920)
    Graph   : http://mzl.la/RvDB8U

Changeset range: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=de25c9c57c53&tochange=006eb98e9d6f

Changesets:
  * http://hg.mozilla.org/integration/mozilla-inbound/rev/1baac3f2334c
    : Graeme McCutcheon <graememcc_firefox@graeme-online.co.uk> - Bug 724513 - Part 1 - Add StartupCache method for disregarding disk file. r=mwu
    : http://bugzilla.mozilla.org/show_bug.cgi?id=724513

  * http://hg.mozilla.org/integration/mozilla-inbound/rev/f418d97d12ac
    : Graeme McCutcheon <graememcc_firefox@graeme-online.co.uk> - Bug 724513 - Part 2 - nsAppRunner shouldn't assume the deletion of invalid startup cache file succeeded. r=mwu
    : http://bugzilla.mozilla.org/show_bug.cgi?id=724513

  * http://hg.mozilla.org/integration/mozilla-inbound/rev/006eb98e9d6f
    : Graeme McCutcheon <graememcc_firefox@graeme-online.co.uk> - Bug 724513 - Part 3 - Add telemetry for the invalid disk cache situation. r=taras
    : http://bugzilla.mozilla.org/show_bug.cgi?id=724513

Bugs:
  * http://bugzilla.mozilla.org/show_bug.cgi?id=724513 - Startupcache can load invalid cache if unable to delete startupCache file
_______________________________________________
dev-tree-management mailing list
dev-tree-management@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-tree-management

I backed it out because of this: https://hg.mozilla.org/integration/mozilla-inbound/rev/3f4fef3ca7e9
At some point in the iterations of this part of the patch, when I inverted the meaning of the variable in XRE_mainStartup, I didn't invert all of the logic.

https://tbpl.mozilla.org/?tree=Try&rev=7d59f1a36f54 produced Talos numbers in line with https://hg.mozilla.org/integration/mozilla-inbound/rev/de25c9c57c53 - the cset preceding my inbound landing.
Attachment #673362 - Attachment is obsolete: true
Attachment #677188 - Flags: review?(mwu)
Attachment #677188 - Flags: review?(mwu) → review+
https://hg.mozilla.org/mozilla-central/rev/eb5c570b8a0b
https://hg.mozilla.org/mozilla-central/rev/6ec71de7ba2d
https://hg.mozilla.org/mozilla-central/rev/b032c4cc9d7d
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: