The default bug view has changed. See this FAQ.

Intermittent test_spdy.js | test failed (with xpcshell return code: 0), | false == true - See following stack:

RESOLVED FIXED in mozilla15

Status

()

Core
Networking
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: philor, Assigned: briansmith)

Tracking

({intermittent-failure})

10 Branch
mozilla15
x86_64
Linux
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [spdy])

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

5 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11282433&tree=Mozilla-Inbound
Rev3 Fedora 12x64 mozilla-inbound debug test xpcshell on 2012-04-27 17:00:52 PDT for push df3acd833280
slave: talos-r3-fed64-052

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
### XPCOM_MEM_LEAK_LOG defined -- logging leaks to /tmp/tmp0Reeqt/runxpcshelltests_leaks.log

TEST-INFO | (xpcshell/head.js) | test 1 pending
WARNING: SSL handshake timed out: file /builds/slave/m-in-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 2118

TEST-INFO | (xpcshell/head.js) | test 2 pending

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-INFO | (xpcshell/head.js) | running event loop

TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStartRequest : 50] 200 == 200

TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStartRequest : 51] true == true

TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStopRequest : 62] true == true

TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStopRequest : 63] true == true

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | false == true - See following stack:
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_throw :: line 462
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: _do_check_eq :: line 556
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_check_eq :: line 577
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_check_true :: line 591
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js :: testOnStopRequest :: line 64

TEST-INFO | (xpcshell/head.js) | exiting test
WARNING: cannot post event if not initialized: file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 213
WARNING: cannot post event if not initialized: file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 213
WARNING: nsExceptionService ignoring thread destruction after shutdown: file /builds/slave/m-in-lnx64-dbg/build/xpcom/base/nsExceptionService.cpp, line 199
WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/m-in-lnx64-dbg/build/toolkit/crashreporter/nsExceptionHandler.cpp, line 1998
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
I'm presuming this is a failure of the new test, not of the tested code. Nick let me know if you feel otherwise.
Whiteboard: [orange] → [orange][spdy]
(In reply to Patrick McManus [:mcmanus] from comment #3)
> I'm presuming this is a failure of the new test, not of the tested code.
> Nick let me know if you feel otherwise.

I'll probably need to instrument the test a bit to see exactly why the failure is occurring. Somewhere along the line, we are getting a response on a connection that either the browser or the server (or both!) doesn't think is SPDY.

The first thing that popped into my mind was that perhaps this is the same issue causing the spdy indicator add-on to not show a page as spdy? (Some failure to set the x-firefox-spdy header, perhaps?) Was there ever a resolution on that? I can't find anything in bugzilla, or in email, so maybe I'm just making stuff up in a desperate attempt to not feel bad about causing an intermittent orange?

Patrick, do you remember anything like that when the spdy indicator add-on first hit, or am I really just making stuff up? I'll have more time to look into this Monday morning and instrument the test to see why, exactly, a connection is getting labeled as non-spdy in the test.
i don't really know anything about the add-on.. I've used it but that's about it.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
I'm running an instrumented version of the test locally in a loop to see if I can figure out which side thinks we're not using spdy. I saw the failure once this morning on my local machine, so I know I can reproduce it, unfortunately it was using the non-instrumented version of the patch when it failed :)
Comment hidden (Treeherder Robot)
Alright, I was able to duplicate this (after a few hours running test_spdy.js in a loop), and the error is coming because we get an NS_ERROR_NOT_AVAILABLE when trying to get the x-firefox-spdy header off the response to make sure the connection is actually spdy. My best guess is for some reason is that we fail to negotiate spdy with the server, although whether that's a server issue or a client issue remains to be seen.

Patrick, what would be the best way to get deep debugging info from the client side to see why negotiating SPDY fails?
Comment hidden (Treeherder Robot)
(In reply to Nick Hurley [:hurley] from comment #16) 
> Patrick, what would be the best way to get deep debugging info from the
> client side to see why negotiating SPDY fails?

I should note, I mean if there's anything other than http logging & packet capture I can do here (I'm already running with those enabled, which I should've done the first time around).
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Nick Hurley [:hurley] from comment #18)
> (In reply to Nick Hurley [:hurley] from comment #16) 
> > Patrick, what would be the best way to get deep debugging info from the
> > client side to see why negotiating SPDY fails?
> 
> I should note, I mean if there's anything other than http logging & packet
> capture I can do here (I'm already running with those enabled, which I
> should've done the first time around).

those are what I would start with. Any SSL or PSM logging would probably help too - but I don't know how to do that. honza or bsmith?
OK, after another 20 hours, I've succeeded in reproducing again, this time with http logging and packet capture! The files are

http://people.mozilla.org/~hurley/spdy_test_failure_xpcshell.txt (this is the output of make check-one on the failed run)
http://people.mozilla.org/~hurley/spdy_test_failure.log (this is the http log from the failed run)
http://people.mozilla.org/~hurley/spdy_test_failure.pcap (this is the packet capture of the failure - it also contains packets from some successful runs, the failures are (obviously) at the end of the capture)

From a quick glance at the pcap, it looks like at some point firefox just decides to use SSLv3 instead of TLS, which means we can't use NPN to upgrade to SPDY. I don't know why it makes this decision (ssl/tls debugging is not my strong suit). Patrick, Josh said to hand this off to you for further investigation.
Assignee: nobody → mcmanus
nick - thanks for the data!

bsmith - I think I need some help.. I can take it a little farther than nick, but then I'm just guessing at some point..

as nick says for most of the time we tcp connect, offer up a tls v1 client hello with extension 13172 (npn) in it and get back a tls v1 server hello with the npn list of options. you can spot check that on "tcp.stream eq 10" in nick's pcap.

but in stream 697 (or packet 77934 if you prefer) we tcp connect and offer up a SSL 3.0 client hello with no NPN extension. The connection proceeds to use HTTP/1.1. 

the preceeding stream 696 (which thankfully is not overlapped) (packet 77920) starts a TLS handshake but the client sends a fin before it is complete. The whole thing happens in 8ms, so there wasn't a timeout that I can think of.. but closing a socket that we don't need is certainly the kind of thing the network stack does these days.

could that failed handshake in 696 (even though the failure was caused by the client) have caused some kind of tls intolerance trigger to downgrade stream 697? 798 is also at v3.

Given that intolerance can happen based on timeouts and is therefore racy, for the purposes of the spdy test it might make sense to just check the SSL version.. but if we've got some kind of behavior that is downgrading our ssl level that needs to be figured out.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Created attachment 620066 [details] [diff] [review]
test instrumentation

Just for reference, here's the patch I used that instruments the test for (slightly) more info on which end is failing (this was in case, for example, my server-side spdy detection was flaky and that's what was causing the test to identify the connection as non-spdy, which is unfortunately not the case here)
Created attachment 620069 [details]
script to run test until failure occurs

Also for reference (and in case anyone wants to try this at home), here's the script I used to run the test over and over again until it failed. Paths will need to be changed to match your setup. This script expects to be run with your cwd = the top-level of your mozilla-central checkout.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
First, are we sure that the end of the pcap corresponds to the XPCShell log that was posted?

In that log, the TLS intolerance timeout is definitely being hit:

WARNING: SSL handshake timed out: file /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1015

The code for calculating the timeout is:

  PRIntervalTime now = PR_IntervalNow();
  PRIntervalTime certVerificationTime =
      mCertVerificationEnded - mCertVerificationStarted;
  PRIntervalTime totalTime = now - mHandshakeStartTime;
  PRIntervalTime totalTimeExceptCertVerificationTime =
      totalTime - certVerificationTime;

Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is it possible that PR_IntervalNow() isn't monotonically increasing, which could cause these calculations to be wrong?

HandshakeTimeout() is only called in nsSSLIOLayerPoll (the PSM implementation of the poll method). So, we cannot be detecting a handshake timeout after we've closed a socket, because that would mean that we're polling the socket after we've closed it, which is wrong. (But, even if we polled after we closed the socket, then the PSM nsNSSIOLayer would have been poppoed off the socket so nsSSLIOLayerPoll would never get called.)

Once we've made a successful TLS 1.0 connection to a server, we do not allow fallback to SSL 3.0 for that server for the rest of the browsing session. Perhaps there is a bug caused by the (fully intentional) race between the handshake and the certificate verification, but it seems unlikely. I will look at it more

In (tcp.stream eq 695) we close the connection sending a TLS alert first.

In (tcp.stream eq 696) we close the connection without sending a TLS alert.

The logic for determining whether we send a TLS alert is in ssl_SecureClose:

    if (ss->version >= SSL_LIBRARY_VERSION_3_0 	&& /* cond 1 */
	!(ss->shutdownHow & ssl_SHUTDOWN_SEND)	&& /* cond 2 */
    	ss->firstHsDone 			&& /* cond 3 */
	!ss->recvdCloseNotify                   && /* cond 4 */
	ss->ssl3.initialized) {                    /* cond 5 */

Conditions 1, 4, 5 are definitely true.

Is it possible that we have shutdown the socket for sending (Condition 2) before closing it?

If condition 3 isn't true, then that would indicate that the certificate verification might not have finished before we closed the connection. But, if certificate verification were still pending, then we would not hit the TLS intolerance timeout because there is an explicit check that ensures we never to the TLS intolerance fallback if cert verification is pending.

If there are more packet captures for this, I would love to see them, to see if we consistently have this "no sent alert" pattern when the test fails.
(In reply to Brian Smith (:bsmith) from comment #44)
> Is
> it possible that PR_IntervalNow() isn't monotonically increasing, which
> could cause these calculations to be wrong?

I don't believe so.

> Is it possible that we have shutdown the socket for sending (Condition 2)
> before closing it?

We don't shutdown sockets at all (necko).  Just closing them.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Honza Bambas (:mayhemer) from comment #45)
> (In reply to Brian Smith (:bsmith) from comment #44)

> > before closing it?
> 
> We don't shutdown sockets at all (necko).  Just closing them.


in the trace in question when we close the socket there is unread data queued in the kernel.. that generates the tcp rst. I'm wondering if there is some way that manifests in an error that gets interpreted as the intolerance.

(you would think that would be more of a problem for the peer - after we issue the close I wouldn't think we'd notice any io events, but its a possibility.)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Brian Smith (:bsmith) from comment #44)
> First, are we sure that the end of the pcap corresponds to the XPCShell log
> that was posted?

Yes, there was no other traffic on localhost:4443 on the machine I ran this on, this was the most recent pcap available (I was cycling between 10 different files for posterity), and there would have been no xpcshell runs after the test failed.

> If there are more packet captures for this, I would love to see them, to see
> if we consistently have this "no sent alert" pattern when the test fails.

I will arrange for more pcaps, it just might take a while (given how intermittent this issue can be).
(In reply to Brian Smith (:bsmith) from comment #44)

> WARNING: SSL handshake timed out: file
> /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line
> 1015
> 
[..]
> Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is
> it possible that PR_IntervalNow() isn't monotonically increasing, which
> could cause these calculations to be wrong?

all the tbpl reports are on linux, and nick's repro is on linux. (and actually I think we might only be running the test on linux.) So let's look at linux pr_intervalnow() handling.. :)

a breakpoint on my desktop shows me: http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinrval.c#77 which bears the comment:

 * This version of interval times is based on the time of day
 * capability offered by system. This isn't valid for two reasons:
 * 1) The time of day is neither linear nor montonically increasing
  * 2) The units here are milliseconds. That's not appropriate for our use.
 */

doh! And stepping through that you can see that it calls gettimeofday() directly.. a little googling on that and you'll easily see that its not guaranteed monotonic. That would be a pretty big problem.

Linux does provide a cheap monotonic time source - clock_gettiime(CLOCK_MONOTONIC).. perhaps we should be using that?
(In reply to Patrick McManus [:mcmanus] from comment #64)
> (In reply to Brian Smith (:bsmith) from comment #44)
> 
> > WARNING: SSL handshake timed out: file
> > /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line
> > 1015
> > 
> [..]
> > Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is
> > it possible that PR_IntervalNow() isn't monotonically increasing, which
> > could cause these calculations to be wrong?
> 
> all the tbpl reports are on linux, and nick's repro is on linux. (and
> actually I think we might only be running the test on linux.) So let's look
> at linux pr_intervalnow() handling.. :)

It's true that we're only running the test on linux right now, and linux is where the data above comes from, however, I've also reproduced this on my mac. This does appear to use the same gettimeofday implementation as linux (not surprising, given that they're both unix-ish). Not that this particularly adds any more info to the mix, but I wanted to make sure the whole picture (failures on multiple platforms) is known.
Comment hidden (Treeherder Robot)
fwiw I have not been able to reproduce it - hundreds of thousands of test_spdy iterations and millions of iterations of a simpler c++ test that just tests the ordering of two pr_intervalnow() calls. but clock issues can be extremely hardware dependent.

if it is a problem with unix interval handling, that would also potentially bring doubt to a lot of different code paths on android.. so while its tempting to just try changing the handshaketimeout() methods to use mozilla::timestamp() (which does use clock_gettime) there is more work to do.

nick can you try this patch? (obviously not suitable for promotion to nspr as is) and see if it changes things for you? I'll also let tryserver at it for a long series of runs - the CI seems to produce it fairly often.

diff --git a/nsprpub/pr/src/Makefile.in b/nsprpub/pr/src/Makefile.in
--- a/nsprpub/pr/src/Makefile.in
+++ b/nsprpub/pr/src/Makefile.in
@@ -160,16 +160,17 @@ ifeq ($(OS_TARGET),Android)
 # Android has no libpthread.so in NDK
 OS_LIBS		= -ldl
 else
 OS_LIBS		= -lpthread -ldl
 endif
 else
 OS_LIBS		= -ldl
 endif
+OS_LIBS		+= -lrt
 endif
 
 ifeq ($(OS_ARCH),HP-UX)
 ifeq ($(USE_PTHREADS), 1)
 ifeq (,$(filter-out B.10.10 B.10.20,$(OS_RELEASE)))
 OS_LIBS 	= -ldce
 else
 OS_LIBS 	= -lpthread -lrt
diff --git a/nsprpub/pr/src/md/unix/unix.c b/nsprpub/pr/src/md/unix/unix.c
--- a/nsprpub/pr/src/md/unix/unix.c
+++ b/nsprpub/pr/src/md/unix/unix.c
@@ -3046,22 +3046,22 @@ PR_Now(void)
     LL_I2L(us, tv.tv_usec);
     LL_MUL(s, s, s2us);
     LL_ADD(s, s, us);
     return s;
 }
 
 PRIntervalTime _PR_UNIX_GetInterval()
 {
-    struct timeval time;
+    struct timespec time;
     PRIntervalTime ticks;
 
-    (void)GETTIMEOFDAY(&time);  /* fallicy of course */
+    clock_gettime(CLOCK_MONOTONIC, &time);
     ticks = (PRUint32)time.tv_sec * PR_MSEC_PER_SEC;  /* that's in milliseconds */
-    ticks += (PRUint32)time.tv_usec / PR_USEC_PER_MSEC;  /* so's that */
+    ticks += (PRUint32)time.tv_nsec / PR_NSEC_PER_MSEC;  /* so's that */
     return ticks;
 }  /* _PR_SUNOS_GetInterval */
 
 PRIntervalTime _PR_UNIX_TicksPerSecond()
 {
     return 1000;  /* this needs some work :) */
 }
(In reply to Patrick McManus [:mcmanus] from comment #67)
> nick can you try this patch? (obviously not suitable for promotion to nspr
> as is) and see if it changes things for you? I'll also let tryserver at it
> for a long series of runs - the CI seems to produce it fairly often.

OK, I've got this running on my linux machine. I'll report back in a couple days (or sooner, if the problem recurs before then).
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Try server continues to have intermittent fails negotiating spdy with the patch from comment 67. so the root cause here is something other than the pr_interval using gettimeofday()

https://tbpl.mozilla.org/?tree=Try&rev=a92e6b55c585
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
I insturmented HandshakeTimeout() to print some stuff and had try reproduce the failure

    1.11  
    1.12 -  return totalTimeExceptCertVerificationTime > 
    1.13 +  bool rv = totalTimeExceptCertVerificationTime > 
    1.14        PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
    1.15 +  fprintf(stderr,"handshaketimeout:: rv=%d time=%dms state=%d\n",
    1.16 +          rv, PR_IntervalToMilliseconds(totalTimeExceptCertVerificationTime),
    1.17 +          mCertVerificationState);
    1.18 +  return rv;

try::
handshaketimeout:: rv=0 time=0ms state=0
handshaketimeout:: rv=1 time=-3ms state=2
WARNING: SSL handshake timed out: file /builds/slave/try-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1019

(and then in another test handshaketimeout:: rv=1 time=-2ms state=2)

I had expected it to print an absurd elapsed time, indicating something was corrupted or maybe being compared to '0'.. but instead it prints -3ms (or -2ms). I'm kinda surprised that's not -1 if the issue is monotonicity.

I wonder about some kind of a rounding propogation problem in this snippet.. (printervaltime is essentially an unisgned number of milliseconds, taken from gtod which does usec, so they are rounded to form the now() value)

 PRIntervalTime now = PR_IntervalNow();
  PRIntervalTime certVerificationTime =
      mCertVerificationEnded - mCertVerificationStarted;
  PRIntervalTime totalTime = now - mHandshakeStartTime;
  PRIntervalTime totalTimeExceptCertVerificationTime =
      totalTime - certVerificationTime;

  bool rv = totalTimeExceptCertVerificationTime >
      PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
the negative time is also consistent with setHandshakeInProgress() being called after setcertverficationwaiting(), though I have nothing else to suggest that is happening.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Patrick McManus [:mcmanus] from comment #76)
> the negative time is also consistent with setHandshakeInProgress() being
> called after setcertverficationwaiting(), though I have nothing else to
> suggest that is happening.

confirmed.

See annotated try failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=11589533&tree=Try#error0

setting certverification started to 2F4DD401, handshake start is 0
setting handshake starttime to 2F4DD407
setting certverification ended to 2F4DD408, handshake start is 2F4DD407
handshaketimeout:: rv=1 time=-6ms state=2
WARNING: SSL handshake timed out: file /builds/slave/try-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1024

This is the patch with the printfs: https://hg.mozilla.org/try/rev/8bea3e6f1206

In this case SetHandshakeInProgress() was called 6ms after SetCertVerificationWaiting(), but that's the most extreme example I've seen.

I believe Brian says he knows how to fix the race condition here and offered to do so. Thanks!
Assignee: mcmanus → bsmith
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
https://tbpl.mozilla.org/?tree=Try&rev=2d5b7bbddbb0
Oops. Let's try again:

https://tbpl.mozilla.org/?tree=Try&rev=9f8aa02d059e
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Created attachment 623210 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello.

The TLS intolerance timeout is needed for servers that do not respond *at all*
to our hello request; see the long comment in checkHandshake() in nsNSSIOLayer.cpp. 

Once we've received a server hello from the server,
there's no need to do the timeout, and it is better NOT to, because any
timeouts after we've received a response. By explicitly avoiding those wrong
timeouts, we avoid the need for the (incorrect) complicated comparisons of
the PRIntervalTimes I introduced, which caused this bug.
Attachment #623210 - Flags: review?(honzab.moz)
Attachment #623210 - Flags: feedback?(mcmanus)
Here is the try build for just this test, with ~450 passing runs:
https://tbpl.mozilla.org/?tree=Try&rev=908620cfe336

Here is the full try run, just submitted:
https://tbpl.mozilla.org/?tree=Try&rev=6380b35ed74b
(In reply to Brian Smith (:bsmith) from comment #114)
> Created attachment 623210 [details] [diff] [review]
> Do not do TLS intolerance handshake timeout after we know we've received the
> server hello.
> 
> The TLS intolerance timeout is needed for servers that do not respond *at
> all*
> to our hello request; see the long comment in checkHandshake() in
> nsNSSIOLayer.cpp. 
> 
> Once we've received a server hello from the server,
> there's no need to do the timeout, and it is better NOT to, because any
> timeouts after we've received a response

... are false positives.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment on attachment 623210 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello.

Review of attachment 623210 [details] [diff] [review]:
-----------------------------------------------------------------

Brian, please add [diff] showfunc=true to your hgrc.

Dropping r because of the first comment.  Brian, please fix and retest this patch.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +478,5 @@
>  #define HANDSHAKE_TIMEOUT_SECONDS 25
>  
>  bool nsNSSSocketInfo::HandshakeTimeout()
>  {
> +  if (!!mAllowTLSIntoleranceTimeout)

Rather just a singe '!' ?

According this change, it seems like some of your tests of this code were not done or could have misleading results...

@@ +498,3 @@
>    PRIntervalTime now = PR_IntervalNow();
> +  bool result = (now - mHandshakeStartTime) 
> +                      > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);

You can do:

static const kHandshakeTimeout = PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);

and use that konst rather.  Time conversion may be expensive.

@@ -502,3 @@
>    PRIntervalTime now = PR_IntervalNow();
> -  PRIntervalTime certVerificationTime =
> -      mCertVerificationEnded - mCertVerificationStarted;

You can remove these two members now.
Attachment #623210 - Flags: review?(honzab.moz)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment on attachment 623210 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello.

I don't have anything to add beyond honza's comments other than PR_SecondsToInterval is (cross-platform) just a multiply so there is no perf concern there.. but const is always good :)
Attachment #623210 - Flags: feedback?(mcmanus)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Blocks: 738992
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Comment 206

5 years ago
Why have we been sitting on a patch for a common intermittent orange like this? What is the next step here?
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Created attachment 626931 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello [v2]

Here's the updated patch that fixes the typo and which address the other review comments.
Attachment #623210 - Attachment is obsolete: true
Attachment #626931 - Flags: review?(honzab.moz)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment on attachment 626931 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello [v2]

Review of attachment 626931 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

Thanks.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +495,1 @@
>    }

static const PRIntervalTime handshakeTimeoutInterval = PR_SecondsToInterval(25) is called the first time this line gets executed.  It is not a static initializer.  Thus, you don't need your lazy init code, compiler does it for you automatically.
Attachment #626931 - Flags: review?(honzab.moz) → review+
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d09d60b4bc1
Target Milestone: --- → mozilla15
(In reply to Honza Bambas (:mayhemer) from comment #219)
> static const PRIntervalTime handshakeTimeoutInterval =
> PR_SecondsToInterval(25) is called the first time this line gets executed. 
> It is not a static initializer.  Thus, you don't need your lazy init code,
> compiler does it for you automatically.

I fixed this before I pushed:
https://hg.mozilla.org/integration/mozilla-inbound/diff/1d09d60b4bc1/security/manager/ssl/src/nsNSSIOLayer.cpp#l1.113
https://hg.mozilla.org/mozilla-central/rev/1d09d60b4bc1
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Keywords: intermittent-failure
Whiteboard: [orange][spdy] → [spdy]
Comment hidden (Treeherder Robot)
Depends on: 840625
You need to log in before you can comment on or make changes to this bug.