Last Comment Bug 770243 - MOZ_ASSERT(mCacheInputStream) failing @ nsHttpChannel::ReadFromCache
: MOZ_ASSERT(mCacheInputStream) failing @ nsHttpChannel::ReadFromCache
Status: RESOLVED FIXED
[qa-]
: regression
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla18
Assigned To: Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
:
: Patrick McManus [:mcmanus]
Mentors:
: 766274 795452 (view as bug list)
Depends on:
Blocks: 746018 764171 770088
  Show dependency treegraph
 
Reported: 2012-07-02 10:13 PDT by Honza Bambas (:mayhemer)
Modified: 2012-10-17 16:48 PDT (History)
9 users (show)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
fixed
fixed


Attachments
log (313.62 KB, text/plain)
2012-07-02 10:13 PDT, Honza Bambas (:mayhemer)
no flags Details
v1 (1.19 KB, patch)
2012-07-02 10:40 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v1 + test (8.65 KB, patch)
2012-07-13 15:26 PDT, Honza Bambas (:mayhemer)
brian: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
Close cache input stream only when we're sure we don't need it (2.29 KB, patch)
2012-07-18 14:37 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review
Address nits in test code (5.62 KB, patch)
2012-07-18 14:38 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review
Part 4: Fix test (5.30 KB, patch)
2012-09-24 14:57 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review

Description Honza Bambas (:mayhemer) 2012-07-02 10:13:11 PDT
Created attachment 638401 [details]
log

This should be a higher priority to check on, since when this fails in a release build we don't display any content (cannot read the cache entry and not having a content response from the server).

STR, if I remember correctly:
1. configure proxy for authentication (basic), e.g. squid, for all protocols
2. load http://www.websocket.org/echo.html
3. [optional] try to connect (doesn't matter what the result is)
4. change the proxy settings to just set proxy for http, leave all other protocols blank
5. restart Fx
6. load http://www.websocket.org/echo.html

=> assertion failure

>	xul.dll!mozilla::net::nsHttpChannel::ReadFromCache(bool alreadyMarkedValid=false)  Line 3491 + 0x2c bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::ProcessNotModified()  Line 2210 + 0xd bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::ProcessResponse()  Line 1265 + 0x8 bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::OnStartRequest(nsIRequest * request=0x086e01e0, nsISupports * ctxt=0x00000000)  Line 4784 + 0xe bytes	C++
 	xul.dll!nsInputStreamPump::OnStateStart()  Line 416 + 0x2c bytes	C++
Comment 1 Honza Bambas (:mayhemer) 2012-07-02 10:16:46 PDT
Just confirming that I am able to reproduce using the STR from comment 0.  Hint: you may want to delete the cache before step 1, but I think it is not strictly necessary.  Also, the assertion fails only after the proxy auth dialog is confirmed.  (I have remembered the password.)
Comment 2 Honza Bambas (:mayhemer) 2012-07-02 10:21:44 PDT
BTW: looks like we have to fix the logs in HttpCacheQuery.  Also a log that channel X opens a query Y would be useful...
Comment 3 Honza Bambas (:mayhemer) 2012-07-02 10:33:38 PDT
Simpler STR:
1. configure proxy for authentication (basic), e.g. squid, for all protocols
2. load http://www.websocket.org/echo.html
3. restart Fx
4. load http://www.websocket.org/echo.html

Culprit is the following code at [1]:

    if (httpStatus != 304 && httpStatus != 206) {
        mCacheInputStream.CloseAndRelease();
    }

httpStatus is 407.

I had to catch this at the review time.

[1] http://hg.mozilla.org/mozilla-central/annotate/9c6ad99de47a/netwerk/protocol/http/nsHttpChannel.cpp#l1183
Comment 4 Honza Bambas (:mayhemer) 2012-07-02 10:40:52 PDT
Created attachment 638410 [details] [diff] [review]
v1
Comment 5 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-10 19:22:40 PDT
Honza, your change looks reasonable but in order to understand why it works it would be good to have test cases:

1. 401 followed by 304
2. 401 followed by cacheable 200 OK
3. 407 followed by 304
4. 407 followed by cacheable 200 OK

I guess the cases that are failing now are #1 and #3. We need to test #2 and #4 to test that we're closing the input stream before we try to open an output stream for the cache entry.
Comment 6 Honza Bambas (:mayhemer) 2012-07-11 11:14:47 PDT
I'll add the proposed test.
Comment 7 Honza Bambas (:mayhemer) 2012-07-13 15:26:19 PDT
Created attachment 642096 [details] [diff] [review]
v1 + test

Added a test.  Quit self explaining.
Comment 8 Alex Keybl [:akeybl] 2012-07-16 14:59:04 PDT
We either need to take this fix in Aurora 16, or backout bug 722034. Tracking for that release.
Comment 9 Honza Bambas (:mayhemer) 2012-07-18 10:46:17 PDT
Brian, review ping?
Comment 10 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-18 14:37:47 PDT
Created attachment 643594 [details] [diff] [review]
Close cache input stream only when we're sure we don't need it

Honza, here's an alternate fix that I think is better because:

(a) It does not special-case any status codes
(b) It avoids acquiring the cache service lock at that point in the code (this may have be responsible for the CACHE_SERVICE_LOCK_WAIT_MAINTHREAD regression.)

If you don't think this alternate fix is correct, then we can use yours, which I also think is correct.

Note that this patch applies on top of yours. You can just qfold it onto yours.
Comment 11 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-18 14:38:53 PDT
Created attachment 643595 [details] [diff] [review]
Address nits in test code

Instead of listing the nits, I just created a patch to fix them. Up to you whether you want to fold this onto your patch.
Comment 12 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-18 14:40:11 PDT
Comment on attachment 642096 [details] [diff] [review]
v1 + test

Review of attachment 642096 [details] [diff] [review]:
-----------------------------------------------------------------

Honza, thanks for fixing the bug. Your patch seems reasonable to me. I would prefer us to use the alternate fix that I attached, if possible, but if that's problematic then your patch is also good as-is.
Comment 13 Honza Bambas (:mayhemer) 2012-07-19 14:17:35 PDT
Comment on attachment 643595 [details] [diff] [review]
Address nits in test code

Review of attachment 643595 [details] [diff] [review]:
-----------------------------------------------------------------

:D funny...
Comment 14 Honza Bambas (:mayhemer) 2012-07-19 14:42:51 PDT
Comment on attachment 643594 [details] [diff] [review]
Close cache input stream only when we're sure we don't need it

Review of attachment 643594 [details] [diff] [review]:
-----------------------------------------------------------------

Yep, this could work.  But aren't you actually removing the optimization of throwing away the stream ASAP we know we won't need it?

Test passes for me, but w/o the "funny" patch since I cannot apply it.
Comment 15 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-19 15:04:32 PDT
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Yep, this could work.  But aren't you actually removing the optimization of
> throwing away the stream ASAP we know we won't need it?

Throwing away the input stream could be expensive because that process takes the cache service lock on the main thread, so it is not necessarily an optimization. 

Plus, the optimization would only work if we released our descriptor (not just the input stream) so another channel could open a descriptor.
Comment 16 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-19 16:55:54 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/dcc9891978e1
Comment 17 Ryan VanderMeulen [:RyanVM] 2012-07-19 19:08:20 PDT
Sorry, but I had to back this out due to xpcshell test failures.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a36544756614

https://tbpl.mozilla.org/php/getParsedLog.php?id=13690848&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_bug770243.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_bug770243.js | Response body 2 == Response body 3 - See following stack:
Comment 18 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-19 19:54:50 PDT
Out of time to investigate tonight, but I believe the root cause is actually bug 767277; i.e. I think we should have backed out bug 767277 instead (which I will do shortly). See the comment I am about to add to that bug.
Comment 19 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-19 21:00:59 PDT
Re-pushed to inbound after backing out bug 767277:
https://hg.mozilla.org/integration/mozilla-inbound/rev/64b30d7e3932
Comment 20 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2012-07-19 23:49:40 PDT
Backed out again

https://hg.mozilla.org/integration/mozilla-inbound/rev/ecb953611462

seems to have been failing a test here quite consistently.
Comment 21 Honza Bambas (:mayhemer) 2012-07-21 12:43:41 PDT
What about the v1 (attachment 642096 [details] [diff] [review])?  Does it also cause these problems?
Comment 22 Honza Bambas (:mayhemer) 2012-08-01 05:13:52 PDT
Could we please move here forward?  This makes diagnoses of some auth bugs much more complicated.  Thanks.
Comment 23 Honza Bambas (:mayhemer) 2012-08-01 10:54:11 PDT
Comment on attachment 642096 [details] [diff] [review]
v1 + test

https://tbpl.mozilla.org/?tree=Try&rev=b9f5c4565ae2
Comment 24 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-08-01 17:55:04 PDT
Honza, let me know if you want me to take this back.

The test is failing in the same way as it did when these patches got backed out of mozilla-inbound.

Bug 767277 wasn't the problem after all. Here's the comment I added to bug 767277 previously about it:

(In reply to Brian Smith (:bsmith) from bug 767277 comment #12)
> Backed out because I suspect it is causing the test for bug 770243 to fail:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/71854c060f90
> 
> The problem case is this:
> 
> We async dispatch a runnable to close the cache entry "later," thus leaving
> the cache entry open for a while after we've marked it valid. During that
> time between when we marked the cache entry valid and the time we close it,
> another channel decides to open the cache entry. That channel gets a
> ACCESS_READ descriptor instead of an ACCESS_READ_WRITE descriptor, and so it
> skips the validation of the cached response.
> 
> Bug 770243's tests passed previously when this patch was not applied (we
> should verify this last claim). But, with this patch, case 4 of that bug's
> tests fails. If I modify that case to use do_execute_soon, then the test
> succeeds. Debugging the failure, I saw that the cache is returning an
> ACCESS_READ descriptor. This causes HttpCacheQuery::CheckCache() to skip the
> validation of the entry on the assumption that it is valid. But, the test
> only works when each channel opened does revalidation.
Comment 25 Honza Bambas (:mayhemer) 2012-08-02 09:47:31 PDT
Ah!  My new test is failing!  As usual, the behavior is from some reason different on windows (the test passes).

Why didn't I read that comment sooner more carefully...  We are blocked all the time here just because of bad timing.

I will add execute-soon or sync-on-cache-thread to the test.  

If there is anything we *really* need to fix, then let's do it in a different bug (bug 775830?) with reference to this one to have a tool for reproduce/check.
Comment 26 Alex Keybl [:akeybl] 2012-08-22 17:15:01 PDT
I've asked in bug 722034 if we should consider performing the backout in https://bugzilla.mozilla.org/show_bug.cgi?id=722034#c79 for FF16 as well. If so, we'll untrack this bug.
Comment 27 George Billios 2012-09-02 22:22:46 PDT
Any news on this now that 16 beta has been released?
Comment 28 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-19 06:18:38 PDT
We have a patch for this already that landed once but got backed out because it was causing a test to fail. We believe that the problem is with the test and that the patches already attached to this bug are safe enough to land in Nightly, Aurora, and Beta as soon as we tweak that test, which I am planning to do today.
Comment 29 Alex Keybl [:akeybl] 2012-09-20 15:18:03 PDT
(In reply to Brian Smith (:bsmith) from comment #28)
> We have a patch for this already that landed once but got backed out because
> it was causing a test to fail. We believe that the problem is with the test
> and that the patches already attached to this bug are safe enough to land in
> Nightly, Aurora, and Beta as soon as we tweak that test, which I am planning
> to do today.

Thanks for the update. Sounds like this will make it into beta 5, going to build on Tuesday.
Comment 30 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-24 14:57:46 PDT
Created attachment 664232 [details] [diff] [review]
Part 4: Fix test

All four patches in this bug need to be qfolded together before landing. 

Part 4 does three things:

1. Rebases the test (from Parts 1-3) to the current mozilla-central.
2. For each case, adds a check that each response is or is not from the cache, as appropriate.
3. Syncs with the cache between each case to avoid race conditions.
Comment 31 Honza Bambas (:mayhemer) 2012-09-25 08:36:43 PDT
Thanks Brian.  Any try run reference?
Comment 32 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-25 08:58:30 PDT
https://tbpl.mozilla.org/?tree=Try&rev=6336b7c8acb4

It looks shockingly green, except for the Linux "B" failures and the normal Android rainbow. I just retriggered the Linux builds to see if they will go green.
Comment 33 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-26 12:13:09 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/cbfe6a468a12
Comment 34 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-26 12:44:35 PDT
Comment on attachment 642096 [details] [diff] [review]
v1 + test

Try runs for -aurora and -beta are here:
https://tbpl.mozilla.org/?tree=Try&rev=5c787fc0db0d
https://tbpl.mozilla.org/?tree=Try&rev=be55a62862e4

Note that this approval request is for all four patches in this bug, folded together.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 746018

User impact if declined: Some pages and sub-resources of pages will not load correctly, especially when HTTP proxy authentication is used.

Testing completed (on m-c, etc.): Automated test case, landed today on m-i.

Risk to taking this patch (and alternatives if risky): Moderate. There is some possibility that closing the cache entry later could cause unforseen loading problems similar to the problems we encountered and fixed with the automated test.
 
String or UUID changes made by this patch: none
Comment 35 Alex Keybl [:akeybl] 2012-09-26 14:48:47 PDT
(In reply to Brian Smith (:bsmith) from comment #34)
> Risk to taking this patch (and alternatives if risky): Moderate. There is
> some possibility that closing the cache entry later could cause unforseen
> loading problems similar to the problems we encountered and fixed with the
> automated test.

We're leaning towards backing out bug 722034 and bug 746018 for FF16 at this point, given the moderate risk evaluation here, the other regression tracked in bug 788365, and the fact that a fix would go into our final beta for the first time (this didn't make beta 5).

Can you make the case for taking a forward fix at this point?
Comment 36 Alex Keybl [:akeybl] 2012-09-26 14:51:37 PDT
Comment on attachment 642096 [details] [diff] [review]
v1 + test

[Triage Comment]
Approving for Aurora in preparation for Beta 16 consideration.
Comment 37 Ryan VanderMeulen [:RyanVM] 2012-09-26 18:43:59 PDT
https://hg.mozilla.org/mozilla-central/rev/cbfe6a468a12
Comment 38 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 10:43:59 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/f59da8ed70ea
Comment 39 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 13:10:49 PDT
*** Bug 766274 has been marked as a duplicate of this bug. ***
Comment 40 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 13:13:23 PDT
Bug 766274, especially bug 766274 comment 2 and bug 766274 comment16, show that this happens with regular (not proxy) HTTP auth and that this patch fixes the problem when applied to beta 16.
Comment 41 Alex Keybl [:akeybl] 2012-09-28 14:47:49 PDT
Comment on attachment 642096 [details] [diff] [review]
v1 + test

[Triage Comment]
This bug has a medium risk evaluation, so I want to be very clear with why we've decided to approve for Beta.

1) We've seen reports here and in bug 766274, with very standard browser configurations
2) Our alternatives are to back out bug 722034 (deemed risky) or wontfix
3) Brian believes we'll see regressions on Nightly/Aurora over the next week, if any

Given this, we're willing to take the fix on mozilla-beta with an escape valve of backing out and wontfixing before next Friday's final release build.

In the future, bugs like this need to get attention much sooner in the release.
Comment 42 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 15:50:51 PDT
*** Bug 795452 has been marked as a duplicate of this bug. ***
Comment 43 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 17:41:18 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/e34ef88d3b1e

(In reply to Alex Keybl [:akeybl] from comment #41)
> In the future, bugs like this need to get attention much sooner in the
> release.

I agree. And, also, Alex did everything he could to help push this along. The delay was mostly my fault.
Comment 44 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-09-28 20:31:01 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/bf691ea80fa4

Add syncWithCacheIOThread to head_cache.js from bug 737615 (written by michal and r=honzab) because the tests require it, a=bustage
Comment 45 Honza Bambas (:mayhemer) 2012-10-01 09:27:42 PDT
Alex, even you were not very happy to approve this patch, I have to say again it is quit important, I was hitting this issue very often.  

Also, for Beta I personally more trust my patch (the first patch "v1 + test" added to this bug) since it only and only adds 401 and 407 to an already well tested code path that simply doesn't throw away cache input stream that early.

The patch has been reviewed, however it hasn't spent any time on m-c or m-a for real testing...  Hence it may not be a very good option either...
Comment 46 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-10-17 16:48:49 PDT
No QA verification since this has in-testsuite coverage. Please add verifyme keyword to request manual verification.

Note You need to log in before you can comment on or make changes to this bug.