No revalidation - Cache ignoring timeout (+profile dependent?)

REOPENED
Unassigned

Status

()

Core
Networking: Cache
P3
critical
REOPENED
a year ago
4 months ago

People

(Reporter: adrian_sv, Unassigned)

Tracking

(Blocks: 1 bug, {stale-bug})

unspecified
stale-bug
Points:
---

Firefox Tracking Flags

(firefox53+ wontfix, firefox54 ?, firefox55 ?)

Details

(Whiteboard: [necko-triaged][see comment 38])

Attachments

(8 attachments, 1 obsolete attachment)

(Reporter)

Description

a year ago
[Pe-requisites:]
1. Set the preference value for extensions.systemaddon.update.url set to https://aus5.mozilla.org/update/3/SystemAddons/53.0.0/20160922113459/default/en-US/beta-sysaddon/default/default/default/update.xml
2. Go to Tools -> Developer Tools -> Web Console -> Open "Toolbox Options" from "Advanced settings" mark as checked the last 3 available check boxes. (browser chrome....; remote debugging; worker debugging)
3. Open Browser Console and run this script: Components.utils.import("resource://gre/modules/AddonManager.jsm"); AddonManagerPrivate.backgroundUpdateCheck();
4. Set the preference value for extensions.shield-recipe-client.api_url set to https://normandy.stage.mozaws.net/api/v1
5. Set the extensions.shield-recipe-client.dev_mode preference to true to run recipes immediately on startup.
6. Set the extensions.shield-recipe-client.logging.level preference to 0 to enable more logging.
7. Set the security.content.signature.root_hash to DB:74:CE:58:E4:F9:D0:9E:E0:42:36:BE:6C:C5:C4:F6:6A:E7:74:7D:C0:21:42:7A:03:BC:2F:57:0C:8B:9B:90

[Description:]
Although at this point we haven't identified the causes or how to reliably reproduce any of the below cases, it is clear by now that after a while the used profiles gets stuck in a state in which shield system add-on and implicitly the recipes are starting to malfunction.

[Steps:]
TBD:
- we are using two security.content.signature.root_hash prefs to test: one is from the staging server: "DB:74:CE:58:E4:F9:D0:9E:E0:42:36:BE:6C:C5:C4:F6:6A:E7:74:7D:C0:21:42:7A:03:BC:2F:57:0C:8B:9B:90." and one is from mock server: "4C:35:B1:C3:E3:12:D9:55:E7:78:ED:D0:A7:E7:8A:38:83:04:EF:01:BF:FA:03:29:B2:46:9F:3C:C5:EC:36:04" 
- we are using the shield-recipe-client.json to alter the time stamp in regards to when a recipe has been executed
- we are using control center to create/enable/edit/delete recipes.

[Actual Result:]
As a rough estimate about 1/3 of the used profiles show signs of instability as follows:
a. on a profile no console output regarding any shield actions: no shield recipes fetched, no warnings, no errors. (recipes/heartbeats were ran successfully on this profile).
b. on another profile Shield is fetching and executing only one recipe, while the recipe is disabled and now deleted. (no other recipes seem visible for this particular case)
c. on several profiles, the Normandy would state the signature is invalid, while after restart Shield would function normally.
d. on several profiles, the version filter ( normandy.version ) didn't return true: (the recipe would not be executed)

[Additional Note1:] 
-earlier this week, while switching to use beta version + system add-on we attempted to use profiles that were used to test Shield while Shield was in add-on mode and the Shield System Add-on wouldn't get installed. (following the listed Pre-Requisites).


[Additional Note2:] 
While testing on Aurora/Nightly using the initial System-add-on or even after moving to test with the standalone add-on, we haven't encountered similar behavior.
(Reporter)

Comment 1

a year ago
Created attachment 8855406 [details]
invalid_signature.zip

Attaching example profiles.
(Reporter)

Comment 2

a year ago
Created attachment 8855407 [details]
delete_recipe_ran.zip
(Reporter)

Comment 3

a year ago
Created attachment 8855427 [details]
shield beta - profile 2.zip
(Reporter)

Comment 4

a year ago
Created attachment 8855428 [details]
cache.jpg
Michal,

Can you take a look at the jpg in comment 4?  I'm not sure what's going on there, but on IRC they told me the request was made more than an hour after the Date header's date, and since max-age was set to just 30 seconds, the entry should be stale.

Adrian: Do you know what loadFlags the channel has set?  There are a number of cache-related flags that can significantly change our cache behavior (on purpose: there are times like session restore when we /want/ to use stale entries).
Flags: needinfo?(michal.novotny)
About loadFlags: The code in question is using fetch to make the request, and the code is here:

https://github.com/mozilla/normandy/blob/4c8401c8e89671ca54d9dbdc39a8a505ba78d313/recipe-client-addon/lib/NormandyApi.jsm#L78

All the code for the add-on is in JS. I don't think we have direct access to setting flags on the channel. Does that help answer the question?
(In reply to Adrian Florinescu [:AdrianSV] from comment #0)
> a. on a profile no console output regarding any shield actions: no shield
> recipes fetched, no warnings, no errors. (recipes/heartbeats were ran
> successfully on this profile).

Based on our testing over Vidyo this morning, I suspect this was due to a mis-configured logging pref, as by default we only log warnings.

> b. on another profile Shield is fetching and executing only one recipe,
> while the recipe is disabled and now deleted. (no other recipes seem visible
> for this particular case)

This is the main caching issue that we're tracking down as a problem.

> c. on several profiles, the Normandy would state the signature is invalid,
> while after restart Shield would function normally.

The update we deployed to staging that added peer approvals broke signatures, and we failed to re-generate them. Disabling all active recipes and re-approving them fixed this. We also observed some signature failures after the fact, and I believe this is also due to the aforementioned caching issues, as a stale cache using staging responses from before the signatures were fixed would present as invalid signatures.

> d. on several profiles, the version filter ( normandy.version ) didn't
> return true: (the recipe would not be executed)

We found this morning that 53.0b9 actually reports its version as 53.0, because it is the last beta being built and will be used as the release build. That, plus stale responses from the caching, can explain these issues.

> [Additional Note1:] 
> -earlier this week, while switching to use beta version + system add-on we
> attempted to use profiles that were used to test Shield while Shield was in
> add-on mode and the Shield System Add-on wouldn't get installed. (following
> the listed Pre-Requisites).

I've been unable to replicate this. If we have any profiles consistently hitting this problem, or STR, I can chase this down.

> [Additional Note2:] 
> While testing on Aurora/Nightly using the initial System-add-on or even
> after moving to test with the standalone add-on, we haven't encountered
> similar behavior.

This might indicate a bug in Beta in caching behavior, depending on what we track down there.

I've been unable to replicate any of the issues with the attached profiles, although I definitely observed the weird cache behavior this morning.
(In reply to Jason Duell [:jduell] (needinfo me) from comment #5)
> Can you take a look at the jpg in comment 4?  I'm not sure what's going on
> there, but on IRC they told me the request was made more than an hour after
> the Date header's date, and since max-age was set to just 30 seconds, the
> entry should be stale.

Looks like there might be a bug, see https://bugzilla.mozilla.org/show_bug.cgi?id=1348701#c2
Flags: needinfo?(michal.novotny)
tracking-firefox53: --- → +
(In reply to Michal Novotny (:michal) from comment #8)
> Looks like there might be a bug, see
> https://bugzilla.mozilla.org/show_bug.cgi?id=1348701#c2

If I'm reading that bug correctly, the issue there is using a stale page within seconds of initially receiving it. In our case, we were seeing stale cache entries used multiple times, hours after it should've expired, which doesn't sound like the same bug. Or am I missing something?
Flags: needinfo?(michal.novotny)
I think Honza was wrong in comment #8.
Flags: needinfo?(michal.novotny) → needinfo?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #10)
> I think Honza was wrong in comment #8.

?
Flags: needinfo?(honzab.moz) → needinfo?(michal.novotny)
My understanding was that expiration time is computed incorrectly when the response is already stale when fetched and it's not necessary to re-download it within the same second to trigger the bug. But I might be wrong, in this case please have a look at Jason's comment #5.
Flags: needinfo?(michal.novotny) → needinfo?(honzab.moz)
(Reporter)

Comment 13

a year ago
Created attachment 8856379 [details]
HAR.HAR

Attaching the HAR file from the stale request (one of the broken profiles - the one attached to the bug - https://bugzilla.mozilla.org/attachment.cgi?id=8855407.

The forward steps I think we've made on Friday (Mike & I) was to confirm that I had no entry of normandy request in about:cache, the curl for the staging request was returning the expected result - so Mike concluded that this points that the problem is somewhere in FF and not resides in the ISP / SV firewall - rooter.
Additionally, when we changed the extensions.shield-recipe-client.api_url to point to production, the request was again successful and the request was fetched. It is interesting that when we returned to using the staging api, the request was still stale and taken from cache: - still broken. 

IMHO, what's even more puzzling is that using the broken profiles on other machine doesn't seem to get the same result.
I think the next step here is for the SoftVision folks to turn on HTTP logging so we can get more info about what the cache is doing with this URI.  Follow the instructions on this page, and make sure to add "cache2:5" to the module list as described here:

  https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging?document_saved=true#Turning_on_logging_for_the_HTTP_cache

You can zip up the log files and attach them to this bug.  Thanks!
Additionally, mythmon pushed a build to try with the fix from bug 1348701 merged into beta: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5dabd3cd2e28929c5038de5dba0c11b9d2a88a75&selectedJob=90254731

I'm curious to see if this build fixes the cache issues. A zip of the Windows build can be downloaded from here: https://archive.mozilla.org/pub/firefox/try-builds/mcooper@mozilla.com-5dabd3cd2e28929c5038de5dba0c11b9d2a88a75/try-win64/firefox-53.0.en-US.win64.zip
Flags: needinfo?(adrian.florinescu)
I'll wait for how the log looks like.  It will give the answer.
Flags: needinfo?(honzab.moz)
(Reporter)

Comment 17

a year ago
Created attachment 8856940 [details]
networking_logs.zip

Following above comments, enabled networking logs with the following modules: timestamp,rotate:50,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,cache2:5

Then for both the stock build and the build from comment 15, executed in browser console: 
Cu.import('resource://shield-recipe-client/lib/RecipeRunner.jsm', {}).RecipeRunner.testRun("https://normandy.stage.mozaws.net/api/v1");
which should return the active recipes on staging. The result for each build was that it returned the days old response:

1491911778409	extensions.shield-recipe-client.normandy-api	DEBUG	Fetched 1 recipes from the server:: VERSION VERSION VERSION VERSION VERSION VERSION VERSION VERSION VERSION VERSION

The right results should've been:
1491912164475	extensions.shield-recipe-client.normandy-api	DEBUG	Fetched 2 recipes from the server:: version check - clone, 7days - until user interacts

See the networking logs for both builds executed with the broken profile.
Flags: needinfo?(adrian.florinescu)
I'll be curious to see if bug 1348701  fixes this, although it looks like that bug was possibly only for very short race periods (~1 second), so I'd guess no.

Adrian, is this a bug that happens only for a specific profile on your machine, or can you reproduce it with a clean new profile?  If you see it with a new profile, we could try doing another HTTP log where you clear the cache first, then store the entry in the cache (i.e visit the URI for the 1st time), then hit the bug (visit the 2nd time). 

Honza, can you see anything interesting in the HTTP log above?  Also note that the cache entry is apparently not showing up in about:cache--I don't know if that's a useful clue.
Flags: needinfo?(honzab.moz)
(In reply to Jason Duell [:jduell] (needinfo me) from comment #18)
> I'll be curious to see if bug 1348701  fixes this, although it looks like
> that bug was possibly only for very short race periods (~1 second), so I'd
> guess no.

Per comment 17, the Beta build that mythmon made that included that patch did not fix the issue, as you guessed. :(
(In reply to Jason Duell [:jduell] (needinfo me) from comment #18)
> Honza, can you see anything interesting in the HTTP log above?  Also note
> that the cache entry is apparently not showing up in about:cache--I don't
> know if that's a useful clue.

2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp no mandatory validation requirement
2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp   NowInSeconds()=1491911778, expiration time=1491914181, freshness lifetime=30, age=429598
2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp   not validating, expire time not in the past
2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp Not validating based on expiration time

https://dxr.mozilla.org/mozilla-central/rev/f40e24f40b4c4556944c762d4764eace261297f5/netwerk/protocol/http/nsHttpChannel.cpp#4017

where expiration time - NowInSeconds = 2403.  But apparently freshness is way below age, so we should revalidate.  I'll take a look deeper.
Assignee: nobody → honzab.moz
Component: Add-on → Networking: HTTP
Flags: needinfo?(honzab.moz)
Product: Shield → Core
Whiteboard: [necko-next]
Whiteboard: [necko-next] → [necko-active]
Michael, can you please establish priority and severity of this bug?  If not, can you suggest someone who can? thanks.
Flags: needinfo?(mkelly)
Now that we have a workaround for this, the priority isn't as high, although it is still mildly concerning that the cache just completely ignored the timeout. At this point, I think you can just treat this as an incoming bug report and triage it however you would normally. Sorry if that's not helpful.
Flags: needinfo?(mkelly)
Assignee: honzab.moz → nobody
Component: Networking: HTTP → Networking: Cache
Whiteboard: [necko-active] → [necko-next]
(Reporter)

Comment 23

a year ago
With the info that is currently present in this bug, I'm concerned that we might miss on a bigger issue and lack enough information to actually push forward with it even if we want to. Shield and Normandy is currently patched against this, so no chance to actually reproduce it again there. Mike, is there any chance that you have in mind a possible reduced testcase for this (and ofc time)?
Flags: needinfo?(mkelly)
Summary: Shield - Possible profile integrity issues → No revalidation - Cache ignoring timeout (+profile dependent?)
(In reply to Adrian Florinescu [:AdrianSV] from comment #23)
> With the info that is currently present in this bug, I'm concerned that we
> might miss on a bigger issue and lack enough information to actually push
> forward with it even if we want to. Shield and Normandy is currently patched
> against this, so no chance to actually reproduce it again there. Mike, is
> there any chance that you have in mind a possible reduced testcase for this
> (and ofc time)?

I don't, sadly. We were never able to replicate the issue, so I dunno how we'd come up with a test case. At this point I think we have two options:

1. Set up a one-off instance of Normandy without the caching fix, and have you replicate the issue in a virtual machine. Then, send that virtual machine over for us to verify that we still get the failing behavior.

2. Use what we know already to look for logic errors in the caching code.

I was opting for #2, as the effort of #1 doesn't really match up with the amount of affected users. If we were getting reports of this failing in the wild it'd be a different story.
Flags: needinfo?(mkelly)
Is this something we need to be concerned about for 54/55? Doesn't look like there's anything we can do or need to do for 53.
status-firefox53: affected → wontfix
status-firefox54: --- → ?
status-firefox55: --- → ?
Flags: needinfo?(mkelly)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #25)
> Is this something we need to be concerned about for 54/55? Doesn't look like
> there's anything we can do or need to do for 53.

I don't think so. I don't think this meets the threshold of reproduceability we'd need to actually tie it to a version.
Flags: needinfo?(mkelly)
This (or a very similar bug) just bit us again in a different context in bug 1368807. Specifically, the certificate file for signed recipes in SHIELD is being cached by Firefox, despite the response not having any cache headers (vs the previous instance of this, which was Firefox ignoring a 30-second cache header). 

Because of this caching, we can't send any new recipes to users affected, because they cannot verify the signatures on our recipes, which is blocking some high priority preference experiments and surveys being run through SHIELD, hence the P1 marker. We currently have a mitigation plan (cache-busting via query parameters), but the impact is severe enough that I think we need to look into this again.

mythmon's personal profile on his machine is affected by this bug, and he can help get more debug info on what's happening, including a HAR file and logs like what Adrian already provided.
Flags: needinfo?(mcooper)
Priority: -- → P1
Created attachment 8874672 [details]
cache2.debug.log

This is the result of setting MOZ_LOG=cache2:5. I've trimmed out everything from before I made the request that was incorrectly cached (though several requests were made, see the HAR file).
Created attachment 8874673 [details]
mythmon.har

This is the HAR file created during the logs I attached. This is all of the requests that Normandy made, ending in the one with incorrect cache behavior.

I stripped out a cookie from this that was probably something sensitive.
Flags: needinfo?(mcooper)
(Reporter)

Updated

a year ago
See Also: → bug 1368807
Playing around with this, I have found that running `fetch(theFailingUrl)` in about:config's devtools does not reproduce this issue, but running the same code in the browser console reproduces the caching problem. Running the same code in the browser toolbox does not reproduce the problem either.
Honza, does the info from comment 27 onwards give us anything more we can do here?
Flags: needinfo?(honzab.moz)
(In reply to Mike Cooper [:mythmon] from comment #29)
> Created attachment 8874672 [details]
> cache2.debug.log
> 
> This is the result of setting MOZ_LOG=cache2:5. I've trimmed out everything
> from before I made the request that was incorrectly cached (though several
> requests were made, see the HAR file).

Please create a log as: MOZ_LOG=sync,timestamp,nsHttp:5,cache2:5.  just cache logs don't say a thing.  thanks.
Flags: needinfo?(honzab.moz) → needinfo?(mcooper)
Created attachment 8876244 [details]
network.log

Here is the log again with MOZ_LOG=sync,timestamp,nsHttp:5,cache2:5. Again, I trimmed this to just the time around when the requests was made.
Attachment #8874672 - Attachment is obsolete: true
Flags: needinfo?(mcooper)
Thanks!  I'll look at this bug within couple next days.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #27)
> This (or a very similar bug) just bit us again in a different context in bug
> 1368807. Specifically, the certificate file for signed recipes in SHIELD is
> being cached by Firefox, despite the response not having any cache headers
> (vs the previous instance of this, which was Firefox ignoring a 30-second
> cache header). 
> 
> Because of this caching, we can't send any new recipes to users affected,
> because they cannot verify the signatures on our recipes, which is blocking
> some high priority preference experiments and surveys being run through
> SHIELD, hence the P1 marker. We currently have a mitigation plan
> (cache-busting via query parameters), but the impact is severe enough that I
> think we need to look into this again.
> 
> mythmon's personal profile on his machine is affected by this bug, and he
> can help get more debug info on what's happening, including a HAR file and
> logs like what Adrian already provided.

Firefox behaves correctly.  Your response headers are incorrect.  Since there is no cache directive (Cache-control) we ignore the ETag and we use Last-Modified time to calculate a heuristic freshness lifetime as (Last-Modified - Date) * 0.1 - this is according the spec.

The calculated values are clearly:
last-modified =              1464809046, Wed, 01 Jun 2016 19:24:06 GMT
date =                       1497031540, Fri, 09 Jun 2017 18:05:40 GMT
NowInSeconds()=              1497031939, Fri, 09 Jun 2017 18:12:19 GMT
expiration time=             1497603022, Fri, 16 Jun 2017 08:50:22 GMT
freshness lifetime=              604800, one week = max(one-week, date - last-modified)
age=                              33318, max(now - date [ = 399], 33318)

So, you must setup your server to either say what is the maximum age to consider the resource fresh (say one day - Cache-control: max-age=lazy-to-calc-seconds-in-one-day) or make Necko revalidate everytime (Cache-control: no-cache) ; note that "no-cache" doesn't mean "don't cache".
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → INVALID
Comment 17
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #20)
> (In reply to Jason Duell [:jduell] (needinfo me) from comment #18)
> > Honza, can you see anything interesting in the HTTP log above?  Also note
> > that the cache entry is apparently not showing up in about:cache--I don't
> > know if that's a useful clue.
> 
> 2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp no mandatory
> validation requirement
> 2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp  
> NowInSeconds()=1491911778, expiration time=1491914181, freshness
> lifetime=30, age=429598
> 2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp   not validating,
> expire time not in the past
> 2017-04-11 11:56:18.383000 UTC - [Main Thread]: D/nsHttp Not validating
> based on expiration time
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> f40e24f40b4c4556944c762d4764eace261297f5/netwerk/protocol/http/nsHttpChannel.
> cpp#4017
> 
> where expiration time - NowInSeconds = 2403.  But apparently freshness is
> way below age, so we should revalidate.  I'll take a look deeper.


I'm reopening the bug, but I have no idea how to proceed here.  Looks like the cache entries were intermixed or badly merged somehow when doing the expiration time calculation.  

Is fetch interception involved here?  Adding :jdm as he added some code that might be responsible for this.

Is your server sending the same caching headers when a range request is made for this particular resource?



Analyzes:

By looking into the code I can't find out how (when walking through the standard http channel paths), with that response headers, would an expiration time difference from now ever become larger than a freshness lifetime.  We calculate (on every headers update - aka first 200/304/206) the exp time as now + lifetime - age.  If lifetime < age, exp time is set to 0.

Hence, as lifetime is 30 seconds (max-age=30), I can't understand how expiration time could ever be shifted so far to the future past now (not just 30 seconds, but 2403 seconds which is roughly 40 minutes).

The log is unfortunately too short to see what all was happening with the cache entry during the session.

IMPORATNT: note that the response expiration time is exactly 7 days from it's creation time (according Date:) which would suggest using heuristic to calculate freshness lifetime (we crop at 1 week for heuristic) when we were calculating expiration time before.  But then it would be calculated for a completely different response, since we need Last-Modified to do heuristic, which the response in question here lacks.
Assignee: honzab.moz → nobody
Status: RESOLVED → REOPENED
Flags: needinfo?(honzab.moz)
Resolution: INVALID → ---
Whiteboard: [necko-active] → [necko-active] [see comment 38]
This is a P1 bug without an assignee. 

P1 are bugs which are being worked on for the current release cycle/iteration/sprint. 

If the bug is not assigned by Monday, 28 August, the bug's priority will be reset to '--'.
Keywords: stale-bug
This is an old bug, and there is a lot going on. To summarize my understanding

* There is some weirdness going on that may or may not be bugs, but...
* ...a lot of it can be explained by not providing a `Last-Modified` on the response header, triggering heuristic based freshness calculations
* We don't really have enough information

We want to change Normandy to run more traffic through a CDN, hopefully reducing cost. We want to revisit the workaround implemented last year. It would be easiest to remove the workaround, and start sending cache-control headers again (along with last-modified headers). Hopefully we'll either not encounter the bug, or if we do see the bug again, we'll be able to get better debugging info this time.
Priority: P1 → P3
Whiteboard: [necko-active] [see comment 38] → [necko-triaged][see comment 38]
You need to log in before you can comment on or make changes to this bug.