Last Comment Bug 717350 - Doom Cache Entries with last-modified inconsistency
: Doom Cache Entries with last-modified inconsistency
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: 11 Branch
: x86_64 Linux
: -- normal with 1 vote (vote)
: mozilla14
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-01-11 12:33 PST by Patrick McManus [:mcmanus]
Modified: 2012-03-23 05:58 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch 0 (5.31 KB, patch)
2012-01-11 12:38 PST, Patrick McManus [:mcmanus]
no flags Details | Diff | Splinter Review
patch 1 (8.78 KB, patch)
2012-01-13 10:33 PST, Patrick McManus [:mcmanus]
honzab.moz: review+
Details | Diff | Splinter Review
v2 (11.43 KB, patch)
2012-02-15 08:48 PST, Patrick McManus [:mcmanus]
mcmanus: review+
Details | Diff | Splinter Review

Description Patrick McManus [:mcmanus] 2012-01-11 12:33:39 PST
If I do 

GET /foo
If-Modified-Since: Jan-10

and I get

HTTP/1.1 304 not modified
Last-Modified: Jan-7

that's pretty suspicious. Because the IMS header is derived from a previous l-m entry and now the server l-m is going backwards in time. One of those things is probably wrong (I suspect the cache entry) or at least out of date.

This is a failure mode of the server pipeline implementation from bug 716840.

I suggest we take a pretty conservative approach here and just doom the cache entry when we see it. i.e. honor the 304 but clean up the cache so that future reloads will get a version from scratch.
Comment 1 Patrick McManus [:mcmanus] 2012-01-11 12:38:38 PST
Created attachment 587783 [details] [diff] [review]
patch 0

honza, this is in my pipeline queue after restart-in-progress
Comment 2 Honza Bambas (:mayhemer) 2012-01-11 13:05:21 PST
Comment on attachment 587783 [details] [diff] [review]
patch 0

Changes like these needs a test.  xpc-shell probably.
Comment 3 Michal Novotny (:michal) 2012-01-12 08:39:17 PST
Comment on attachment 587783 [details] [diff] [review]
patch 0

(In reply to Patrick McManus [:mcmanus] from comment #0)
> cache entry when we see it. i.e. honor the 304 but clean up the cache so
> that future reloads will get a version from scratch.

Why should we honor 304 for the current request when it is suspicious?


> +    rv = mResponseHead->GetHeader(nsHttp::Last_Modified,
> +                                  lastModified);
> +    if (NS_SUCCEEDED(rv))
> +        rv = mResponseHead->GetHeader(nsHttp::Last_Modified, 
> +                                      lastModified304);

The first one should be mCachedResponseHead, right?


> +        if (mCacheEntry)
> +            mCacheEntry->Doom();

We always have mCacheEntry here. There is a check NS_ENSURE_TRUE(mCacheEntry,...) few lines above.
Comment 4 Patrick McManus [:mcmanus] 2012-01-12 09:26:03 PST
(In reply to Michal Novotny (:michal) from comment #3)
> Comment on attachment 587783 [details] [diff] [review]
> patch 0
> 
> (In reply to Patrick McManus [:mcmanus] from comment #0)
> > cache entry when we see it. i.e. honor the 304 but clean up the cache so
> > that future reloads will get a version from scratch.
> 
> Why should we honor 304 for the current request when it is suspicious?

I'm trying to take the least invasive route here. the 304 is suspicious, but I haven't 100% concluded that it is an error - so I don't want to start tossing errors.. but if we can "right the ship" in the background but evicting that cache entry, I think that's a good conservative first step.


> The first one should be mCachedResponseHead, right?
 

one of them should be, yes. thanks.
Comment 5 Honza Bambas (:mayhemer) 2012-01-12 11:56:44 PST
(In reply to Patrick McManus [:mcmanus] from comment #0)
One of those
> things is probably wrong (I suspect the cache entry) or at least out of date.

But you prefer to load it, a bit controversial.  When you detect this, it could potentially be also some kind of protection (not sure if not double-blade right now) against cache poisoning.

> 
> This is a failure mode of the server pipeline implementation from bug 716840.

Can you please rephrase this?  I don't quit understand the relation.
Comment 6 Patrick McManus [:mcmanus] 2012-01-12 17:42:46 PST
(In reply to Honza Bambas (:mayhemer) from comment #5)
> (In reply to Patrick McManus [:mcmanus] from comment #0)
> One of those
> > things is probably wrong (I suspect the cache entry) or at least out of date.
> 
> But you prefer to load it, a bit controversial.  

we can add more, even in another patch. I'd like to get a little experience with it first to validate (pun!) the idea with a penalty that won't stop anything from operating.

> > 
> > This is a failure mode of the server pipeline implementation from bug 716840.
> 
> Can you please rephrase this?  I don't quit understand the relation.

sure - first a bit about this .no "news site".. it (or one of its partner sites) is the only one I've had cited to me by another browser as being known to cause img mixups since IIS 5 and a busted squid version have left the scene many years ago now. And I've done a lot of asking on the topic. (netscaler hardware has other failure modes - generally rsts which are much more detectable as broken images). I can't find any other ones myself and fwiw it is extremely hard to get a reproduction on the vg/db.no site.

if you weren't aware iphone ios 5 does pipelining by default. That's awesome for us - it really helps validate the general suitability of the algorithm on the web. Because of this I have much greater concerns about head of line blocking performance problems than I do about serious safety issues.

Nonetheless, there is this .no site referenced in 716840... a variation on it shows up in a bunch of ancient bugzilla entries too. If you follow the link in that bug to a blog post you'll see the server operators have published a lot of information and even 2 pcaps.

basically at some point a request for "B" was given a valid response to "C" instead. I personally suspect something along the lines of a pipeline request of A,B,C was served as A,C,EOF - which would have the effect of using the C response for the B request (and caching it if the headers said so under B's URI). You can see how that would happen if something on the server side did an over-read when parsing A (analagous to our pushback code).

we can tell something like that happened because you can see in the packet traces a conditional-validation of the "B" URI using a If-Modified-Since of C's timestamp. The actual timestamp of B is older than C so it gets a 304 even though they don't match. (now you see why etags were invented!)

That brings us full circle to this patch - which basically says if the Last-Modified's of our cache entry (which is where we got the IMS value from) and the 304 revalidation don't match then we've got a problem.

It doesn't solve the initial problem (at the 200 stage) - I don't think you really can solve that. But it at least appears fringe enough at this point to not worry about it more than we would worry about any other broken server behavior. (which is to say - do our best but don't disable it for the vast majority of the web because of a break by something non compliant at the very long end of the tail - there will always be broken implementations.)
Comment 7 Patrick McManus [:mcmanus] 2012-01-13 10:33:26 PST
Created attachment 588454 [details] [diff] [review]
patch 1

* adds an xpcshell test

* changes to use an appropriate existing 'red' feedback code instead of minting a new one.

* fixes the bug michal noted
Comment 8 Honza Bambas (:mayhemer) 2012-02-14 07:26:22 PST
Comment on attachment 588454 [details] [diff] [review]
patch 1

Review of attachment 588454 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab however controversial the patch is..

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +1977,5 @@
> +    // our cache with a bad response. Take the minor step here of just dooming
> +    // that cache entry so there is a fighting chance of getting things on the
> +    // right track as well as disabling pipelining for that host.
> +
> +    nsCAutoString lastModified;

lastModifiedCached ?

@@ +1984,5 @@
> +    rv = mCachedResponseHead->GetHeader(nsHttp::Last_Modified,
> +                                        lastModified);
> +    if (NS_SUCCEEDED(rv))
> +        rv = mResponseHead->GetHeader(nsHttp::Last_Modified, 
> +                                      lastModified304);

In braces please.

@@ +1987,5 @@
> +        rv = mResponseHead->GetHeader(nsHttp::Last_Modified, 
> +                                      lastModified304);
> +    if (NS_SUCCEEDED(rv) && !lastModified304.Equals(lastModified)) {
> +        LOG(("Cache Entry and 304 Last-Modified Headers Do Not Match "
> +             "%s and %s\n", lastModified.get(), lastModified304.get()));

Nit: "'%s' and '%s'\n" to see any spaces and special chars.

::: netwerk/test/unit/test_mismatch_lm.js
@@ +25,5 @@
> +                                       offset, count) {
> +	var data = new BinaryInputStream(inputStream).readByteArray(count);
> +      
> +	// This is 'B'
> +	do_check_eq(data, 66);

Shouldn't there be data[0] ?  And maybe instead of 66 do "B".charCodeAt(0);

@@ +59,5 @@
> +	httpChan.asyncOpen(listener_3, null);
> +    }
> +};
> +
> +var listener_1 = {

You should add 

  QueryInterface: function(iid) {
    if (iid.equals(Components.interfaces.nsIStreamListener) ||
        iid.equals(Components.interfaces.nsIRequestObserver) ||
        iid.equals(Components.interfaces.nsISupports))
      return this;
    throw Components.results.NS_ERROR_NO_INTERFACE;
  },
 
to your listeners.


Btw, why not to use ChannelListener from head_channels.js here?

@@ +95,5 @@
> +    httpserver.start(4444);
> +
> +    var chan = ios.newChannel("http://localhost:4444/test1", "", null);
> +    var httpChan = chan.QueryInterface(Ci.nsIHttpChannel);
> +    httpChan.requestMethod = "GET";

Do you really need to do this?

@@ +103,5 @@
> +}
> +
> +var iter=0;
> +function handler(metadata, response) {
> +    iter++;

Nit: it might be a bit better to drive this iterator from the test code (just before asyncOpen with a direct assign) and not here.

::: netwerk/test/unit/xpcshell.ini
@@ +130,5 @@
>  [test_httpcancel.js]
>  [test_httpsuspend.js]
>  [test_idnservice.js]
>  [test_localstreams.js]
> +[test_mismatch_lm.js]

test_mismatch_last-modified.js
Comment 9 Honza Bambas (:mayhemer) 2012-02-14 07:27:03 PST
> I personally suspect something along the lines of a pipeline request of A,B,C was
> served as A,C,EOF ... something on the server side did an over-read when parsing A

In other words, A must then be received broken.  We could detect this, restart its load (should be triggered from the load context with properties not to pipeline - let's call it "reload in safe mode" that may retry w/o pipelining when it was engaged before) and along with it restart the rest of the pipeline sooner we start distributing the potentially broken content to upper level consumers.

I think we should have a mechanism like this anyway one day for cases of failed loads because of pipelining.


Re: the patch: I suspect we just end up with 200 OK/304 -> doom/200 OK/304 -> doom/etc looping on reloads.  However, from user's point of view, there will be a bad load on the first try from cache, and a good load on F5, which is good...

Telemetry?
Comment 10 Patrick McManus [:mcmanus] 2012-02-15 08:22:47 PST
(In reply to Honza Bambas (:mayhemer) from comment #9)
> > I personally suspect something along the lines of a pipeline request of A,B,C was
> > served as A,C,EOF ... something on the server side did an over-read when parsing A
> 
> In other words, A must then be received broken.  

A is just fine.. B gets C's response body (but in a valid HTTP way), and C is detected as a hangup and probly restarted and obtained fine. That matches the data of what is seen off 716840 - two cache items with the same L-M header (i.e. C's), where upon revalidation it is obviously wrong for one of them.

> 
> Re: the patch: I suspect we just end up with 200 OK/304 -> doom/200 OK/304
> -> doom/etc looping on reloads.  However, from user's point of view, there
> will be a bad load on the first try from cache, and a good load on F5, which
> is good...


I don't think we'll end up in the loop for a couple of reasons.. one, this is exceptionally rare and a race condition. two, this patch generates a negative pipeline feedback event which will probably break the cycle.
Comment 11 Patrick McManus [:mcmanus] 2012-02-15 08:48:43 PST
Created attachment 597424 [details] [diff] [review]
v2

updated rev
Comment 12 Patrick McManus [:mcmanus] 2012-03-20 10:38:31 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d37b359199f
Comment 13 Matt Brubeck (:mbrubeck) 2012-03-20 12:59:42 PDT
Sorry, I backed this out on inbound because something in the push appeared to cause a leak in mochitest-browser-chrome:
https://hg.mozilla.org/integration/mozilla-inbound/rev/577da08878cb
Comment 14 Patrick McManus [:mcmanus] 2012-03-22 18:04:18 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/f94cf04e8d73
Comment 15 Marco Bonardo [::mak] (Away 6-20 Aug) 2012-03-23 05:58:16 PDT
https://hg.mozilla.org/mozilla-central/rev/f94cf04e8d73

Note You need to log in before you can comment on or make changes to this bug.