Seeing the Connection reset neterror way too often

RESOLVED FIXED in Firefox 32

Status

()

Core
Networking
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: glandium, Assigned: mcmanus)

Tracking

({regression})

unspecified
mozilla33
x86_64
Linux
regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox31 unaffected, firefox32 fixed, firefox33 fixed)

Details

Attachments

(2 attachments)

(Reporter)

Description

3 years ago
I used to be running aurora 20140523004002 and upgraded to nightly 20140606030206. With the former, I was never seeing the Connection reset neterror. With the latter, I'm seeing it way too often, although it's hard to reliably reproduce.

Is there some NSPR_LOG output that would be helpful here?
can't think of any change that would manifest itself that way - but a good log would come from

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
(Reporter)

Comment 2

3 years ago
Created attachment 8436515 [details]
Gzipped log

Lucky me, this happened (at least) twice (that i saw) after a restart with those flags:
- Once on youtube
- Once on bugzilla while uploading this attachment for the first time.

In case it matters, in both cases, it's HTTPS. I don't remember if that was the case before.

Note bug 1022371 may or may not be related.
(Reporter)

Comment 3

3 years ago
(In reply to Mike Hommey [:glandium] from comment #2)
> Note bug 1022371 may or may not be related.

And it sounds plausible, because both curl http://youtube.com:443 and curl http://bugzilla.mozilla.org:443 get a RST from the server while both curl http://www.debian.org:443 and curl http://wikipedia.org:443 return an HTTP 400 error.
The bad think is that (in case of the bugzilla attachment update) we do:
progress=10125312/9494978, apparently we proclaim a length X but send out X+N.  Grep for nsHttpTransaction with this=7fa1f6b26000
According youtube, the server gracefully closes a new connection in some 60ms.  Grep for nsHttpConnection with this=7fa1f89d9740 and nsHttpTransaction with this=7fa1f7fe3800:


2014-06-08 15:03:28.018963 UTC - 481294080[7fa232538a40]: Creating nsHttpConnection @7fa1f89d9740
2014-06-08 15:03:28.018967 UTC - 481294080[7fa232538a40]: nsHalfOpenSocket::OnOutputStreamReady Created new nshttpconnection 7fa1f89d9740

....

2014-06-08 15:03:28.088325 UTC - 481294080[7fa232538a40]: nsHttpConnection::OnSocketReadable [this=7fa1f89d9740]
2014-06-08 15:03:28.088334 UTC - 481294080[7fa232538a40]: nsSocketInputStream::Read [this=7fa1f8aedfb8 count=32768]
2014-06-08 15:03:28.088339 UTC - 481294080[7fa232538a40]:   calling PR_Read [count=32768]
2014-06-08 15:03:28.088347 UTC - 481294080[7fa232538a40]:   PR_Read returned [n=0]
2014-06-08 15:03:28.088351 UTC - 481294080[7fa232538a40]: JIMB: ReleaseFD_Locked: mFDref = 2
2014-06-08 15:03:28.088356 UTC - 481294080[7fa232538a40]: nsHttpConnection::CloseTransaction[this=7fa1f89d9740 trans=7fa1f7fe3800 reason=80470002]
2014-06-08 15:03:28.088361 UTC - 481294080[7fa232538a40]: nsHttpTransaction::Close [this=7fa1f7fe3800 reason=0]
2014-06-08 15:03:28.088367 UTC - 481294080[7fa232538a40]: Assessing red penalty to www.youtube.com class 3 for event 131080. Penalty now 0, throttle[3] = 7000
2014-06-08 15:03:28.088373 UTC - 481294080[7fa232538a40]: nsHttpTransaction::ParseHead [count=1]
2014-06-08 15:03:28.088381 UTC - 481294080[7fa232538a40]: nsHttpResponseHead::ParseVersion [version=]
2014-06-08 15:03:28.088393 UTC - 481294080[7fa232538a40]: looks like a HTTP/0.9 response

....

2014-06-08 15:03:28.088464 UTC - 481294080[7fa232538a40]: nsHttpConnection::Close [this=7fa1f89d9740 reason=80004004]
2014-06-08 15:03:28.088573 UTC - 481294080[7fa232538a40]: Destroying nsHttpConnection @7fa1f89d9740
2014-06-08 15:03:28.088577 UTC - 481294080[7fa232538a40]: nsHttpConnection 7fa1f89d9740 performed 1 HTTP/1.x transactions
2014-06-08 15:03:30.530487 UTC - 864159552[7fa2325386e0]: nsHttpChannel::Resume [this=7fa1f143e800]
2014-06-08 15:03:30.530725 UTC - 864159552[7fa2325386e0]: nsHttpChannel::OnLookupComplete [this=7fa1f143e800] prefetch complete: success status[0x0]
2014-06-08 15:03:30.540897 UTC - 864159552[7fa2325386e0]: sending status notification [this=7fa1f143e800 status=804b000a progress=0/0]
2014-06-08 15:03:30.544666 UTC - 864159552[7fa2325386e0]: nsHttpChannel::OnStartRequest [this=7fa1f143e800 request=7fa1f6e71c00 status=804b0014]
2014-06-08 15:03:30.544677 UTC - 864159552[7fa2325386e0]: nsHttpChannel::ProcessFallback [this=7fa1f143e800]
2014-06-08 15:03:30.544689 UTC - 864159552[7fa2325386e0]: nsHttpChannel::OnStopRequest [this=7fa1f143e800 request=7fa1f6e71c00 status=804b0014]
(In reply to Honza Bambas (:mayhemer) from comment #5)
> According youtube, the server gracefully closes a new connection in some
> 60ms.  Grep for nsHttpConnection with this=7fa1f89d9740 and
> nsHttpTransaction with this=7fa1f7fe3800:

Err... we send out something - 1443 bytes (to the ssl socket).  However, this is somewhat strange, since there is just a single call to PR_Write that returns 1443 bytes has been written.  In case of an ssl socket, we need to loop at least two times with -1/WOULD_BLOCK to drive the handshake, so it seems like there is no ssl socket!

We do two connection attempts, first attempt is for 2404:6800:4008:c01::5b, second for 173.194.38.37.  The first ends immediately with NS_ERROR_CONNECTION_REFUSED, the other one seems to succeed.

But then I can see (on the second transport):

2014-06-08 15:03:28.018991 UTC - 481294080[7fa232538a40]: nsHttpConnection::SetupSSL 7fa1f89d9740 caps=0x1 .S..www.youtube.com:443
2014-06-08 15:03:28.018995 UTC - 481294080[7fa232538a40]: nsHttpConnection::InitSSLParams [this=7fa1f89d9740] connectingToProxy=0
2014-06-08 15:03:28.018999 UTC - 481294080[7fa232538a40]: nsHttpConnection::GetSecurityInfo trans=0 tlsfilter=0 socket=7fa1f8aede78
2014-06-08 15:03:28.019004 UTC - 481294080[7fa232538a40]: nsHttpConnection::SetupSSL Allow SPDY NPN selection
2014-06-08 15:03:28.019011 UTC - 481294080[7fa232538a40]: nsHttpConnection::SetupNPNList 7fa1f89d9740 80004005 < < <
2014-06-08 15:03:28.019020 UTC - 481294080[7fa232538a40]: JIMB: ReleaseFD_Locked: mFDref = 2
2014-06-08 15:03:28.019023 UTC - 481294080[7fa232538a40]:   adding connection to idle list


Seems like SSL_SetNextProtoNego has failed or there is even no ssl socket - that would support the fact we send out plain requests.

Could it be that the second socket setup (nsSocketTransport::InitiateSocket/nsSocketTransport::BuildSocket) fails after the fail-over?
Seems like it is that:

First build

2014-06-08 15:03:27.942376 UTC - 481294080[7fa232538a40]: nsSocketTransport::InitiateSocket [this=7fa1f8aede60]
2014-06-08 15:03:27.942380 UTC - 481294080[7fa232538a40]: nsSocketTransport::BuildSocket [this=7fa1f8aede60]
2014-06-08 15:03:27.942384 UTC - 481294080[7fa232538a40]:   pushing io layer [0:ssl]
2014-06-08 15:03:27.942531 UTC - 481294080[7fa232538a40]:   [secinfo=2554d80 callbacks=f701cee0]
2014-06-08 15:03:27.942548 UTC - 481294080[7fa232538a40]: nsSocketTransportService::AttachSocket [handler=7fa1f8aede60]

Second build

2014-06-08 15:03:27.956453 UTC - 481294080[7fa232538a40]: nsSocketTransport::InitiateSocket [this=7fa1f8aede60]
2014-06-08 15:03:27.956456 UTC - 481294080[7fa232538a40]: nsSocketTransport::BuildSocket [this=7fa1f8aede60]
2014-06-08 15:03:27.956468 UTC - 481294080[7fa232538a40]: nsSocketTransportService::AttachSocket [handler=7fa1f8aede60]

(no ssl push)
Regression from bug 378637.

We call CleanupTypes() that sets mTypeCount = 0 at the end of BuildSocket().
Blocks: 378637
Duplicate of this bug: 1022371
hozna - thanks! that's a trivial optimization making an early (too early!) free of something small that gets cleaned up in the dtor anyhow. The safest thing is just to remove it.

diff --git a/netwerk/base/src/nsSocketTransport2.cpp b/netwerk/base/src/nsSocketTransport2.cpp
--- a/netwerk/base/src/nsSocketTransport2.cpp
+++ b/netwerk/base/src/nsSocketTransport2.cpp
@@ -1158,17 +1158,16 @@ nsSocketTransport::BuildSocket(PRFileDes

         if (NS_FAILED(rv)) {
             SOCKET_LOG(("  error pushing io layer [%u:%s rv=%x]\n", i, mTypes[i], rv));
             if (fd)
                 PR_Close(fd);
         }
     }

-    CleanupTypes();
     return rv;
 }

 nsresult
 nsSocketTransport::InitiateSocket()
 {
     SOCKET_LOG(("nsSocketTransport::InitiateSocket [this=%p]\n", this));
glandium do you want to test https://tbpl.mozilla.org/?tree=Try&rev=06fd18f87272
Created attachment 8437392 [details] [diff] [review]
cleaned up ssl type list too early
Attachment #8437392 - Flags: review?(honzab.moz)
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
i wonder if that will clear up the other two ssl-not-initiated bugs we talked about..
(Reporter)

Comment 14

3 years ago
(In reply to Patrick McManus [:mcmanus] from comment #11)
> glandium do you want to test
> https://tbpl.mozilla.org/?tree=Try&rev=06fd18f87272

So far, so good.
Attachment #8437392 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #4)
> The bad think is that (in case of the bugzilla attachment update) we do:
> progress=10125312/9494978, apparently we proclaim a length X but send out
> X+N.  Grep for nsHttpTransaction with this=7fa1f6b26000

What about this?  Seems like a different bug.
  https://hg.mozilla.org/integration/mozilla-inbound/rev/3663dc6df783

assuming this lands on 33 it should be backported to 32
status-firefox31: --- → unaffected
status-firefox32: --- → affected
(In reply to Honza Bambas (:mayhemer) from comment #15)
> (In reply to Honza Bambas (:mayhemer) from comment #4)
> > The bad think is that (in case of the bugzilla attachment update) we do:
> > progress=10125312/9494978, apparently we proclaim a length X but send out
> > X+N.  Grep for nsHttpTransaction with this=7fa1f6b26000
> 
> What about this?  Seems like a different bug.

we should retest - but given the pretty large delta it seems different I agree.
(In reply to Patrick McManus [:mcmanus] from comment #17)
> (In reply to Honza Bambas (:mayhemer) from comment #15)
> > (In reply to Honza Bambas (:mayhemer) from comment #4)
> > > The bad think is that (in case of the bugzilla attachment update) we do:
> > > progress=10125312/9494978, apparently we proclaim a length X but send out
> > > X+N.  Grep for nsHttpTransaction with this=7fa1f6b26000
> > 
> > What about this?  Seems like a different bug.
> 
> we should retest - but given the pretty large delta it seems different I
> agree.

See also bug 1018883.
https://hg.mozilla.org/mozilla-central/rev/3663dc6df783
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
(Reporter)

Comment 20

3 years ago
(In reply to Honza Bambas (:mayhemer) from comment #18)
> (In reply to Patrick McManus [:mcmanus] from comment #17)
> > (In reply to Honza Bambas (:mayhemer) from comment #15)
> > > (In reply to Honza Bambas (:mayhemer) from comment #4)
> > > > The bad think is that (in case of the bugzilla attachment update) we do:
> > > > progress=10125312/9494978, apparently we proclaim a length X but send out
> > > > X+N.  Grep for nsHttpTransaction with this=7fa1f6b26000
> > > 
> > > What about this?  Seems like a different bug.
> > 
> > we should retest - but given the pretty large delta it seems different I
> > agree.
> 
> See also bug 1018883.

Note that i don't have pipelining enabled (cf. last question in that bug).
Also note that bugzilla rejects attachments bigger than (I think) 10MB and that's also something that I got an error response from bugzilla from, and that could very well be a symptom of bugzilla just throwing anything about 10MB (and bad logging).
(Reporter)

Comment 21

3 years ago
Okay, I know how that one happened, and it's unrelated, and may or may not be considered a bug: the file was growing as I was uploading it (because I'm crazy and was attaching the log file of the current running firefox). There /might/ be a bug as to the HTTP headers sent, but it doesn't look like we're sending a Content-Length.
(In reply to Mike Hommey [:glandium] from comment #21)
> Okay, I know how that one happened, and it's unrelated, and may or may not
> be considered a bug: the file was growing as I was uploading it (because I'm
> crazy and was attaching the log file of the current running firefox). There
> /might/ be a bug as to the HTTP headers sent, but it doesn't look like we're
> sending a Content-Length.

Aha!  So we just ask the input stream at the start how many bytes are there and then we upload until EOF.  Hence the progress log is wrong and since it went over 10MB bugzilla dropped you.  So, I think it's not a bug.

Cool!

Updated

3 years ago
Blocks: 1014250
Thank you for fixing this and thus bug 1014250 - really appreciated :-)
Comment on attachment 8437392 [details] [diff] [review]
cleaned up ssl type list too early

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 378637 .. added to gecko 32
User impact if declined: https instability
Testing completed (on m-c, etc.): on m-c. fix resolves 3 bugs
Risk to taking this patch (and alternatives if risky): low - the patch removes a small memory optimization that was unsafe. (~100 bytes per socket).
String or IDL/UUID changes made by this patch: none
Attachment #8437392 - Flags: approval-mozilla-aurora?
status-firefox33: --- → fixed
Duplicate of this bug: 1015898
Comment on attachment 8437392 [details] [diff] [review]
cleaned up ssl type list too early

Aurora approval granted.
Attachment #8437392 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Keywords: regression
https://hg.mozilla.org/releases/mozilla-aurora/rev/d5851385e16a
status-firefox32: affected → fixed
You need to log in before you can comment on or make changes to this bug.