Created attachment 634590 [details]
I have created a small tool that searches where all a bug has landed at http://www.janbambas.cz/links/ that displays results in iframes.
0. set network.http.pipelining to true
1. load http://www.janbambas.cz/links/
2. wait a bit (3 - 5 secs) to load the version board (initially a blank space under "Releases" link
3. enter some random bug # to the "Search bug landings" field (needs to be 6 numbers)
4. click the Go button, 5 iframes will be added (initially just blank) that load query to hg.mozilla.org for m-i, m-c up to m-r repos with the search string entered
5. wait some time, may be even a minute - queries take a long time
Sometimes (relatively often) it happens that CSS for the first (m-i) iframe doesn't load. I've also seen failed load also for m-c.
- I can reproduce while all the pages has already been loaded in the past (are cached)
- I can reproduce with Nigthly, Aurora
- seems unreproducible with Beta and Release channels, but that could be a red herring
There are two logs attached, one from a debug Nightly and one from a release Aurora.
I didn't check deeper, but try to search for http://hg.mozilla.org/integration/mozilla-inbound/static/style-gitweb.css that probably failed to load in both cases.
love a good testcase - thanks!
I can say with some confidence that this is related to reschedule-timeout (the time at which head of line blocked transactions are moved from 1 connection to another). I'm still looking for the bug - but setting that timeout quite low repros the issue and making it high sees it go away.
I've fixed a couple issues related to read-timeout which is the more serious condition where the front-of-the-line transaction is aborted and restarted in a context that does not include pipelines. However, this does not appear to be the issue according to the logs (that code is not running - it has a much larger timeout value that is not reached).
From the logs the rescheduled transaction appears to run ok and receive a 304. I wonder if we cache the wrong thing on a restart somehow. too soon to say.
Here's what's happening.
Conn 1 has pipelined onto it A, B
We determine that HOL blocking has occurred and B gets rescheduled onto Conn 2.. but we don't give up on Conn 1 - we're still waiting for A. However, we can't take back the request for B that was written to conn 1 - so conn 1 has to be marked "DontReuse" when rescheduling that blocked pipeline.
And we do that :)
However, when the response for A comes back the values in the response header end up overwriting the Keep-Alive member variables that DontReuse was implemented with.. in effect, undoing the DontReuse(). Ouch!
Now that the connection is marked as reusable a new transaction ("C") can be placed on it like any other reusable persistent connection - in this case its the css file request. The response being read first is actually for "B", and we get ourselves a mixup.
The fix is easy - create a non-reversible bool for dontreuse().
despite the bug, this is actually a vindiction for the pipelining algorithm.. it detected a real head of line blocking problem and routed around it dynamically!
Created attachment 637504 [details] [diff] [review]
(In reply to Patrick McManus [:mcmanus] from comment #4)
> despite the bug, this is actually a vindiction for the pipelining
> algorithm.. it detected a real head of line blocking problem and routed
> around it dynamically!
I'd more say it's sad we are confirming a functionality's working randomly when a bug occurs rather then by automated tests...
(In reply to Honza Bambas (:mayhemer) from comment #6)
> (In reply to Patrick McManus [:mcmanus] from comment #4)
> > despite the bug, this is actually a vindiction for the pipelining
> > algorithm.. it detected a real head of line blocking problem and routed
> > around it dynamically!
> I'd more say it's sad we are confirming a functionality's working randomly
> when a bug occurs rather then by automated tests...
not mutually exclusive thoughts!
Comment on attachment 637504 [details] [diff] [review]
Review of attachment 637504 [details] [diff] [review]:
Thanks for the investigation. This fixes this bug. I don't immediately see any regressions, but real usage will show.