The startup cache is not invalidated on extension update

RESOLVED INVALID

Status

()

defect
RESOLVED INVALID
Last year
Last year

People

(Reporter: lucian, Unassigned)

Tracking

59 Branch
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0
Build ID: 20180308232640

Steps to reproduce:

Hey guys,

Since Firefox 56 we are experiencing random weird behaviours of the Cliqz extension and the only thing we knew so far was that it happens mostly after the extension updates.
We finally managed to pin it down and we think it is related with the issue described in this bug.
When a user runs version X of our extension and updates to version X+1 he has a chance (20-30% from our telemetry) to run a version which uses files from both versions.
More precisely the issue seems to be in “Services.scriptloader.loadSubScript” which will load a cached version of some files (even bootstrap.js!) from an older version from "/Users/YYYYY/Library/Caches/Firefox/Profiles/XXXXX/startupCache/scriptCache.bin” or ".../scriptCache-current.bin"

It is not 100% reproducible but after a few tries it happens. Here are the steps:
 - new profile in latest Firefox release (58.0.2) probably happens in more versions, likely after Firefox 56
 - install any version from https://addons.mozilla.org/de/firefox/addon/cliqz/versions/   (Eg 2.23.3)
 - use the search dropdown and the browser in general
 - restart the browser, use more the search, restart again (maybe? :) )
 - install another version (eg 2.24.5)
 - in this moment all the resources loaded with “Services.scriptloader.loadSubScript” are from the old version (2.23.3) including our bootstrap.js file which is the entry to our extension although we upgraded to 2.24.5
 - and all the other files are updated (eg process scripts, or iframes or whatever) 
 - it seems to happen more often on downgrade (from 2.24.5 to 2.23.3) but not exclusively so try to alternate upgrade and downgrade to reproduce

To prove that we prepared a small snippet:
  // this will return the wrong version (2.24.5 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScript('chrome://cliqz/content/runloop.js', win);
  Services.scriptloader.loadSubScript('chrome://cliqz/content/core/app.bundle.js', win);
  console.log(win.global.config.EXTENSION_VERSION); 

  // and this will return the right version (2.23.3 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/runloop.js',  { target: win, ignoreCache: true });
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/core/app.bundle.js',  { target: win, ignoreCache: true });
  console.log(win.global.config.EXTENSION_VERSION); 

Also a screenshot: https://screenshots.firefox.com/DtUTAx7fKo585FQp/null 
Behind the Browser Console you can see the right version of the code loaded via the chrome url.

We temporarily moved towards "loadSubScriptWithOptions" where we are able to ignore the Cache but it would be great to find a better solution.  https://github.com/cliqz-oss/browser-core/compare/2.24.5...2.24.7#diff-33f3f1d203440162225e08f676fa1d4eR20

Please let me know if these findings make sense.

Cheers!
Lucian
Status: UNCONFIRMED → NEW
Component: Untriaged → WebExtensions: General
Ever confirmed: true
Product: Firefox → Toolkit
Component: WebExtensions: General → Add-ons Manager
The snippet from comment 0 doesn't work in current Beta or Nightly builds since the extension's boostrap methods throw (due to bug 792808).
However, we do invalidate the startup cache during regular upgrades and downgrades.  I recall that we ship cliqz through a variety of methods, are you experiencing this problem for regular profile installations of the extension?  (as opposed to when it is delivered as a system addon or funnelcake or something else)
Flags: needinfo?(lucian)
(In reply to Andrew Swan [:aswan] from comment #1)
> The snippet from comment 0 doesn't work in current Beta or Nightly builds
> since the extension's boostrap methods throw (due to bug 792808).

This issue was fixed in 2.24.7 but we disabled that version due to some crashes correlated with it.

> However, we do invalidate the startup cache during regular upgrades and
> downgrades.  

Most of the time this is true. We however detected some cases in which this invalidation does not happen. Using the details in comment 0 it can be reproduced. 

> I recall that we ship cliqz through a variety of methods, are
> you experiencing this problem for regular profile installations of the
> extension?  (as opposed to when it is delivered as a system addon or
> funnelcake or something else)

We definitely experience this for regular profile installations where our users install from AMO. From our investigations this is very likely to happen for all the other versions as even bootstrap.js seems to be affected by the invalid cache when upgrading/downgrading the extension.
I haven't been able to reproduce this, not matter how many times I try. From my investigations so far, though, it seems like the most likely cause of this is that your extension does async work at shutdown which, among other things, risks loading resources from your the old version of your extension after we've flushed our caches, but before the new version has started up.
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #3)
> I haven't been able to reproduce this, not matter how many times I try. From
> my investigations so far, though, it seems like the most likely cause of
> this is that your extension does async work at shutdown which, among other
> things, risks loading resources from your the old version of your extension
> after we've flushed our caches, but before the new version has started up.

The async explanation makes sense, and although we are hunting down all those cases they might still happen.
But this should be unrelated with the actual cache issue. The cache issue persists even after browser and full machine restart. 
Here are snapshots for both my test profile and the corresponding cache:
 - /Users/[USER]/Library/Application Support/Firefox/Profiles/t7dl1g4q.XXXXXXX - https://www.dropbox.com/s/eaz1webx5yjt4eg/profile_t7dl1g4q.aaaa.zip?dl=0
- /Users/[USER]/Library/Caches/Firefox/Profiles/t7dl1g4q.XXXXXXX - https://www.dropbox.com/s/sj4ih86cuwe3pjl/cache_t7dl1g4q.aaaa.zip?dl=0

In those snapshots we have 2.24.5 version installed in the profile but somehow the cachedScript corresponds to 2.23.3

Is it possible to somehow analyze those snapshots?

Thanks for looking into in Kris!

PS: We managed to reproduce it again even in Firefox 59.
Flags: needinfo?(lucian) → needinfo?(kmaglione+bmo)
Presumably it persists since your asynchronous reference to code from the old version is immediately (re-)cached and that cache is persistent.
I'm not sure what we would get out of examining the profile?  Assuming we confirm that the startup cache does indeed have a copy of the scripts from the old extension version, what would we do with that information?
(In reply to Andrew Swan [:aswan] from comment #5)
> Presumably it persists since your asynchronous reference to code from the
> old version is immediately (re-)cached and that cache is persistent.

Not completely sure how and when Firefox reads the actual files but having some living references in memory should not trigger in any case a new read + an eventual (re-cache). Isn't that right? Would there be a way to debug/confirm this theory?

> I'm not sure what we would get out of examining the profile?  Assuming we
> confirm that the startup cache does indeed have a copy of the scripts from
> the old extension version, what would we do with that information?

If our hypothesis is correct maybe the cache entry could have some extra information like extension version, or file hash, or file last changed date, or something else which could make it easier to invalidate. (maybe such meta flags are already in place)
In our concrete case we could always load using "loadSubScriptWithOptions" with "ignoreCache" but maybe this hits in more places and not only our extension.

More than happy to provide more information to debug this.
(In reply to lucian from comment #6)
> Not completely sure how and when Firefox reads the actual files but having
> some living references in memory should not trigger in any case a new read +
> an eventual (re-cache). Isn't that right?

I'm not all that familiar with the internals of how caching of compiled scripts works.  A lot of work compilation happens lazily so it wouldn't at all surprise me if touching something in memory caused some compilation work that then got saved.

> Would there be a way to
> debug/confirm this theory?

I think the simplest way is to remove any asynchronous work from your shutdown() method and then see if the issue can be reproduced.  (I realize that doing that while keeping all your functionality intact is difficult, but if you just want to do an experiment to confirm that this is the problem, perhaps you can just stop running some existing code?)

> > I'm not sure what we would get out of examining the profile?  Assuming we
> > confirm that the startup cache does indeed have a copy of the scripts from
> > the old extension version, what would we do with that information?
> 
> If our hypothesis is correct maybe the cache entry could have some extra
> information like extension version, or file hash, or file last changed date,
> or something else which could make it easier to invalidate. (maybe such meta
> flags are already in place)

I don't believe that cache has any of that information nor does it have interfaces to do finer-grained invalidation.  It is designed primarily for caching scripts that are part of the browser, it never really interacted very well with bootstrapped extensions (that is, we had to take the heavyweight approach of invalidating the entire cache whenever an extension is uninstalled)

> In our concrete case we could always load using "loadSubScriptWithOptions"
> with "ignoreCache" but maybe this hits in more places and not only our
> extension.

I don't know of any scenarios other than legacy extensions in which chrome-privileged code is unloaded from the browser at runtime.  And legacy extensions have been deprecated for some time.  We are actively moving away from them for internal applications such as system add-ons...
Haven't heard anything here in 2 weeks, re-open or create a new bug if this isn't resolved.
Status: NEW → RESOLVED
Closed: Last year
Flags: needinfo?(kmaglione+bmo)
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.