Created attachment 634590 [details]
logs

I have created a small tool at http://www.janbambas.cz/links/ that searches where a bug has landed and displays the results in iframes.

STR:
0. set network.http.pipelining to true
1. load http://www.janbambas.cz/links/
2. wait a bit (3-5 secs) for the version board to load (initially a blank space under the "Releases" link)
3. enter some random bug # into the "Search bug landings" field (needs to be 6 digits)
4. click the Go button; 5 iframes will be added (initially just blank) that load queries to hg.mozilla.org for the m-i, m-c up to m-r repos with the search string entered
5. wait some time, maybe even a minute - the queries take a long time

The bug: Sometimes (relatively often) the CSS for the first (m-i) iframe doesn't load. I've also seen the load fail for m-c.

Notes:
- I can reproduce even though all the pages have already been loaded in the past (are cached)
- I can reproduce with Nightly and Aurora
- it seems unreproducible with the Beta and Release channels, but that could be a red herring

There are two logs attached, one from a debug Nightly and one from a release Aurora. I didn't check deeper, but try searching for http://hg.mozilla.org/integration/mozilla-inbound/static/style-gitweb.css, which probably failed to load in both cases.
love a good testcase - thanks!
I can say with some confidence that this is related to the reschedule timeout (the time at which head-of-line blocked transactions are moved from one connection to another). I'm still looking for the bug - but setting that timeout quite low reproduces the issue, and setting it high makes it go away.

I've fixed a couple of issues related to the read timeout, which is the more serious condition where the front-of-the-line transaction is aborted and restarted in a context that does not include pipelines. However, according to the logs this does not appear to be the issue here (that code is not running - it has a much larger timeout value that is never reached).

From the logs the rescheduled transaction appears to run OK and receive a 304. I wonder if we cache the wrong thing on a restart somehow. Too soon to say.
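To make the two timeout tiers concrete, here is a minimal sketch of the decision described above. The names, values, and the `CheckPipelineTimeouts` helper are illustrative assumptions, not the actual Necko code: the milder reschedule timeout moves HOL-blocked transactions to another connection, while the much larger read timeout aborts and restarts the front transaction without pipelining.

```cpp
#include <cstdint>

// Illustrative values only - the real preferences/timeouts differ.
constexpr uint32_t kRescheduleTimeoutMs = 1500;  // move blocked txns to another conn
constexpr uint32_t kReadTimeoutMs = 10000;       // abort + restart front txn, no pipelines

enum class Action { None, RescheduleBlocked, RestartFront };

// Decide what to do based on how long the connection has gone without data.
Action CheckPipelineTimeouts(uint32_t msSinceLastRead) {
    if (msSinceLastRead >= kReadTimeoutMs)
        return Action::RestartFront;       // severe: restart outside pipelines
    if (msSinceLastRead >= kRescheduleTimeoutMs)
        return Action::RescheduleBlocked;  // mild: HOL-blocked txns change connections
    return Action::None;
}
```

Setting `kRescheduleTimeoutMs` very low triggers rescheduling constantly, which is why a low value reproduces the bug; the read-timeout branch never fires in the attached logs.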
Here's what's happening.

Conn 1 has A and B pipelined onto it. We determine that HOL blocking has occurred, and B gets rescheduled onto Conn 2, but we don't give up on Conn 1 - we're still waiting for A. However, we can't take back the request for B that was already written to Conn 1, so Conn 1 has to be marked "DontReuse" when rescheduling that blocked pipeline. And we do that :)

However, when the response for A comes back, the values in the response headers end up overwriting the keep-alive member variables that DontReuse was implemented with, in effect undoing the DontReuse(). Ouch!

Now that the connection is marked as reusable, a new transaction ("C") can be placed on it like on any other reusable persistent connection - in this case it's the CSS file request. But the response read first is actually for "B", and we get ourselves a mixup.

The fix is easy - create a non-reversible bool for DontReuse().
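A minimal sketch of the fix, with hypothetical names rather than the real nsHttpConnection members: once DontReuse() latches the flag, parsing a later Keep-Alive response header can no longer flip the connection back to reusable.

```cpp
#include <cassert>

// Hypothetical connection object; not the actual Necko class.
class Connection {
public:
    void DontReuse() {
        mDontReuse = true;  // one-way latch: never cleared again
        mKeepAlive = false;
    }

    // Called when response headers arrive. Before the fix, a Keep-Alive
    // response here could set mKeepAlive back to true, silently undoing
    // an earlier DontReuse().
    void OnResponseHeaders(bool serverAllowsKeepAlive) {
        mKeepAlive = serverAllowsKeepAlive && !mDontReuse;
    }

    bool CanReuse() const { return mKeepAlive && !mDontReuse; }

private:
    bool mKeepAlive = true;
    bool mDontReuse = false;  // the non-reversible bool
};
```

With only the reversible `mKeepAlive` state, the A-response would re-enable the connection and transaction C would read B's leftover response; the latch keeps the poisoned connection out of the reuse pool.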
despite the bug, this is actually a vindication of the pipelining algorithm: it detected a real head-of-line blocking problem and routed around it dynamically!
Created attachment 637504 [details] [diff] [review] patch 0
(In reply to Patrick McManus [:mcmanus] from comment #4)
> despite the bug, this is actually a vindication of the pipelining
> algorithm: it detected a real head of line blocking problem and routed
> around it dynamically!

I'd rather say it's sad we are confirming the functionality works only incidentally, when a bug occurs, rather than by automated tests...
(In reply to Honza Bambas (:mayhemer) from comment #6)
> (In reply to Patrick McManus [:mcmanus] from comment #4)
> > despite the bug, this is actually a vindication of the pipelining
> > algorithm: it detected a real head of line blocking problem and routed
> > around it dynamically!
>
> I'd rather say it's sad we are confirming the functionality works only
> incidentally, when a bug occurs, rather than by automated tests...

not mutually exclusive thoughts!
Comment on attachment 637504 [details] [diff] [review]
patch 0

Review of attachment 637504 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the investigation. This fixes the bug. I don't immediately see any regressions, but real-world usage will tell.