Closed Bug 812483 Opened 12 years ago Closed 10 years ago

Handle HTTP cache input stream open errors and load from network instead

Categories

(Core :: Networking: Cache, defect)

18 Branch
defect
Not set
major

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox18 - affected
firefox19 - affected

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

Attachments

(1 file)

Since there is no progress on bug 808532, it will in 4 days reach Beta channel.

That is pretty bad.

I want to have a simple and straight solution to bypass the stream open error.  I don't want to quickly fix the true error.

I think this should only land on current Aurora (Fx18) to get to Beta and shouldn't land on Nightly since we still want to catch cause of bug 808532 and it wouldn't be easy when hidden by this workaround.
Attached patch v1Splinter Review
- works around ERROR_FILE_NOT_FOUND in all cases
- when the error is detected we do the following steps:
  - prevent call of OnStartRequest/OnStopRequest of the channel listener
  - async doom the cache entry
  - close the cache entry
  - after async doom is done we do internal redirect to the same URI
- when any of the steps fail, we propagate ERROR_FILE_NOT_FOUND to the listener since there is not much to do

Code of new redirect methods (Continue)AsyncReloadChannel() are almost identical copies of (Continue)AsyncRedirectChannelToHttps().  There is a bug 735538 that should handle all of the duplication around redirects, but since this should be accepted to Beta I want to keep this patch simple.
 

Try: https://tbpl.mozilla.org/?tree=Try&rev=6bfbc0958672
Attachment #682478 - Flags: review?(bsmith)
Btw, reason why I've chosen to redirect is that after already processing some response (usually 304) it is hard (and nasty) to setup the channel to be opened again.  Redirect is a very clean way to reload.
(In reply to Honza Bambas (:mayhemer) from comment #0)
> I think this should only land on current Aurora (Fx18) to get to Beta and
> shouldn't land on Nightly since we still want to catch cause of bug 808532
> and it wouldn't be easy when hidden by this workaround.

Basically, I agree with that. However, similar new regressions might come again in the future. So, I think that the patch should be landed on m-c and disabled by pref or #ifdef only on Nightly forever.
Comment on attachment 682478 [details] [diff] [review]
v1

Review of attachment 682478 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.   I have only minor quibbles/sugggestions, none of which are important.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4788,5 @@
> +
> +        // cache input stream failed to open, set the channel to load
> +        // directly from network (finished in OnStopRequest)
> +
> +        NS_ASSERTION(mCacheEntry,

might as well use MOZ_ASSERT (which does a hard fail instead of a warning which none of our test automation will pay any attention to).  At this point I suggest we not add any new NS_ASSERTIONs to the codebase.

@@ +4889,5 @@
>      SAMPLE_LABEL("network", "nsHttpChannel::OnStopRequest");
>      LOG(("nsHttpChannel::OnStopRequest [this=%p request=%p status=%x]\n",
>          this, request, status));
>  
> +    if (mCacheInputStreamOpenFailed && mCacheEntry) {

I'd rather see 

  if (mCacheInputStreamOpenFailed) {

      MOZ_ASSERT(mCacheEntry)

Not a big deal, but seems to make the relationship clearer.  There shouldn't be any cases with mCacheInputStreamOpenFailed and !mCacheEntry.

@@ +5541,5 @@
>  {
> +    LOG(("nsHttpChannel::OnCacheEntryDoomed [channel=%p, status=%X]", this, status));
> +
> +    if (!mCacheInputStreamOpenFailed) {
> +        LOG(("  called while mCacheInputStreamOpenFailed not set?"));

Why the question mark?  Won't all of our other, many existing calls to DoomEntry and/or AsyncDoom all cause this to be called without mCacheInputStreamOpenFailed being set?  If true we probably don't need any LOG at all for this case.

@@ +5545,5 @@
> +        LOG(("  called while mCacheInputStreamOpenFailed not set?"));
> +        return NS_OK;
> +    }
> +
> +    mCacheInputStreamOpenFailed = false;

Why do you set this back to false here?  Is there some reason for it?  My understanding is that the original channel will basically stay in that state no matter what (even if the redirect

@@ +5570,5 @@
> +
> +    nsCOMPtr<nsIChannel> newChannel;
> +    nsCOMPtr<nsIURI> URI;
> +
> +    rv = mURI->Clone(getter_AddRefs(URI));

Efficiency obviously isn't a big concern here, but couldn't you omit 'URI' and just pass in mURI everywhere it's used here?

@@ +5581,5 @@
> +    rv = ioService->NewChannelFromURI(URI, getter_AddRefs(newChannel));
> +    NS_ENSURE_SUCCESS(rv, rv);
> +
> +    rv = SetupReplacementChannel(URI, newChannel, true);
> +    NS_ENSURE_SUCCESS(rv, rv);

Would it be worth setting LOAD_BYPASS_LOCAL_CACHE on the new channel to tell it not to check cache?  Both to be super-sure we don't get into some kind of loop (which the doom should take care of), but also so we can make the load faster--we know we don't need to check cache here.

@@ +5604,5 @@
> +    return rv;
> +}
> +
> +nsresult
> +nsHttpChannel::ContinueAsyncReloadChannel(nsresult rv)

> Code of new redirect methods (Continue)AsyncReloadChannel() 
> are  almost identical copies of 
> (Continue)AsyncRedirectChannelToHttps().

Yeah, this is really almost identical.  We should clean that up at some point.  Doesn't have to be now.

@@ +5612,5 @@
> +
> +    if (NS_FAILED(rv)) {
> +        // Fill the failure status here, the update to https had been vetoed
> +        // but from the security reasons we have to discard the whole channel
> +        // load.

I assume this comment is bogus, and just cut-and-pasted from ContinueAsyncRedirectChannelToHttps?  Remove?
Attachment #682478 - Flags: review?(bsmith) → review+
Comment on attachment 682478 [details] [diff] [review]
v1

This isn't a tiny patch, but the code here is only invoked if we're already about to fail (i.e. we're in the middle of hitting bug 808532), so this can really only make things better AFAICT.  Even though it's late in the cycle, I think we should take this.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 808532
User impact if declined: Users will see annoying "File not found" errors for resources that could be loaded OK off the web (see bug 808532).  Not clear how common this is, but it's bad.
Testing completed (on m-c, etc.):  Not easy to test in automated way.  I assume Honza has run tests to check.
Risk to taking this patch (and alternatives if risky):  It only tries to fix a failure that's about to happen, so hopefully low risk.
String or UUID changes made by this patch:  none.
Attachment #682478 - Flags: approval-mozilla-aurora?
(In reply to Jason Duell (:jduell) from comment #4)
> Comment on attachment 682478 [details] [diff] [review]
> v1
> 
> Review of attachment 682478 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good.   I have only minor quibbles/sugggestions, none of which are
> important.

Thanks Jason!

> 
> ::: netwerk/protocol/http/nsHttpChannel.cpp
> @@ +4788,5 @@
> > +
> > +        // cache input stream failed to open, set the channel to load
> > +        // directly from network (finished in OnStopRequest)
> > +
> > +        NS_ASSERTION(mCacheEntry,
> 
> might as well use MOZ_ASSERT (which does a hard fail instead of a warning
> which none of our test automation will pay any attention to).  At this point
> I suggest we not add any new NS_ASSERTIONs to the codebase.

I personally hate MOZ_ASSERT, since I sometimes cannot continue execution when I want.  I'm on the pro-NS_ASSERTION side of the assertion api dispute.

> 
> @@ +4889,5 @@
> >      SAMPLE_LABEL("network", "nsHttpChannel::OnStopRequest");
> >      LOG(("nsHttpChannel::OnStopRequest [this=%p request=%p status=%x]\n",
> >          this, request, status));
> >  
> > +    if (mCacheInputStreamOpenFailed && mCacheEntry) {
> 
> I'd rather see 
> 
>   if (mCacheInputStreamOpenFailed) {
> 
>       MOZ_ASSERT(mCacheEntry)
> 
> Not a big deal, but seems to make the relationship clearer.  There shouldn't
> be any cases with mCacheInputStreamOpenFailed and !mCacheEntry.

Absolutely not.  If it happens that mCacheEntry is null we introduce with this patch a potential for a null crash!

> 
> @@ +5541,5 @@
> >  {
> > +    LOG(("nsHttpChannel::OnCacheEntryDoomed [channel=%p, status=%X]", this, status));
> > +
> > +    if (!mCacheInputStreamOpenFailed) {
> > +        LOG(("  called while mCacheInputStreamOpenFailed not set?"));
> 
> Why the question mark?  Won't all of our other, many existing calls to
> DoomEntry and/or AsyncDoom all cause this to be called without
> mCacheInputStreamOpenFailed being set?  If true we probably don't need any
> LOG at all for this case.

I simply want to monitor glitches like these in the log.  There is no other log that would tell me this has happened.

> 
> @@ +5545,5 @@
> > +        LOG(("  called while mCacheInputStreamOpenFailed not set?"));
> > +        return NS_OK;
> > +    }
> > +
> > +    mCacheInputStreamOpenFailed = false;
> 
> Why do you set this back to false here?  Is there some reason for it?  My
> understanding is that the original channel will basically stay in that state
> no matter what (even if the redirect

Hmm... I'd rather do it anyway.  Just for safety and since I think it is logical.

> 
> @@ +5570,5 @@
> > +
> > +    nsCOMPtr<nsIChannel> newChannel;
> > +    nsCOMPtr<nsIURI> URI;
> > +
> > +    rv = mURI->Clone(getter_AddRefs(URI));
> 
> Efficiency obviously isn't a big concern here, but couldn't you omit 'URI'
> and just pass in mURI everywhere it's used here?

We always clone URI for the replace channel.  Since this is about to go to a stable branch I really want to preserve cloning here.

> 
> @@ +5581,5 @@
> > +    rv = ioService->NewChannelFromURI(URI, getter_AddRefs(newChannel));
> > +    NS_ENSURE_SUCCESS(rv, rv);
> > +
> > +    rv = SetupReplacementChannel(URI, newChannel, true);
> > +    NS_ENSURE_SUCCESS(rv, rv);
> 
> Would it be worth setting LOAD_BYPASS_LOCAL_CACHE on the new channel to tell
> it not to check cache?  Both to be super-sure we don't get into some kind of
> loop (which the doom should take care of), but also so we can make the load
> faster--we know we don't need to check cache here.

LOAD_BYPASS_LOCAL_CACHE didn't prevent load from cache for me on the new channel.  Also it does more then just bypassing read from cache.  It also refreshes DNS and changes how we open cache entries.  It was one of my attempts for fix, to reload with this flag.  No luck...  I won't add the flag.

> @@ +5612,5 @@
> > +
> > +    if (NS_FAILED(rv)) {
> > +        // Fill the failure status here, the update to https had been vetoed
> > +        // but from the security reasons we have to discard the whole channel
> > +        // load.
> 
> I assume this comment is bogus, and just cut-and-pasted from
> ContinueAsyncRedirectChannelToHttps?  Remove?

Thanks.  I will remove it.
Comment on attachment 682478 [details] [diff] [review]
v1

For now dropping approval (now this needs to go to beta anyway).  Depends on how backing out of bug 405407 goes.

Then we may push this in disabled for m-c (if it can even be done, not sure we have a define for it).
Attachment #682478 - Flags: approval-mozilla-aurora?
>> +        LOG(("  called while mCacheInputStreamOpenFailed not set?"));
>> 
>> Why the question mark? 
> 
> I simply want to monitor glitches like these in the log

I'm not opposed to logging, I just don't think it's a good idea to put in a question mark here (which makes it sounds like this is an unexpected/impossible state) when AFAICT this is perfectly normal to hit for all other AsyncDoom calls we make.
(In reply to Jason Duell (:jduell) from comment #8)
> AFAICT this is perfectly normal to hit for
> all other AsyncDoom calls we make.

We are not doing any other async doom calls in http channel that would take it as a callback.  I can remove ? if you want, tho.
> We are not doing any other async doom calls in http channel that would take 
> it as a callback

Ah, missed that.  Keep the question mark then!
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Comment on attachment 682478 [details] [diff] [review]
> v1
> 
> For now dropping approval (now this needs to go to beta anyway).  Depends on
> how backing out of bug 405407 goes.

Now that this is backed out, can you confirm that we can untrack?
(In reply to Alex Keybl [:akeybl] from comment #11)
> Now that this is backed out, can you confirm that we can untrack?

I think yes, we want to leave the code manifest any left potential errors, so this patch now would be counter productive.  But later I would like to land it anyway using the normal train path.
What should we do with this bug?
Flags: needinfo?(joshmoz)
So this bug is good in that it will have users still see content if the cache is broken somehow and fails to deliver it (after promising to).  And it's bad in that it will mask when that situation is happening (it's a bug and we should know about it).

Brian tells me we have the infrastructure to switch prefs without a binary browser update.  If that's true, I suggest we change the patch to use a pref, land it preff'd off, and then if we discover that shipped cache code is breaking, we can switch the pref on so users with shipped releases get the workaround without us needing to ship a a new binary on the release channel.
Run-time switch of default preferences?  Yeah!  I designed the same thing for allpeers 5 years ago :)

I'll jump on it before this rots.
I don't think I'm qualified to decide what to do here - I'd like to see a discussion between Honza, Michal, and Biesi (possibly other) result in a decision.

I just have general advice - do the right thing for the user at runtime, but include ways for developers to spot errors instead of entirely covering them up.
Flags: needinfo?(joshmoz)
Hiding problems in the cache is bad, but forcing a user to do a shift-reload to see a page makes a bad user experience. Ideal solution IMO would be to enhance this patch so, that we would be notified in case this problem becomes widespread. We might use telemetry for this purpose. It's not ideal, but we probably don't have anything better.
See Also: → 867960
In Bug 867960, what I think is the wrongest is that I could click "try again" and still see the "File not found" error. Clicking "try again" should always trigger a "shift-reload". That would already alleviate the problem.
New cache solves this.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: