Closed
Bug 749890
Opened 13 years ago
Closed 13 years ago
Intermittent test_spdy.js | test failed (with xpcshell return code: 0), | false == true - See following stack:
Categories
(Core :: Networking, defect)
Tracking
()
RESOLVED
FIXED
mozilla15
People
(Reporter: philor, Assigned: briansmith)
References
Details
(Keywords: intermittent-failure, Whiteboard: [spdy])
Attachments
(3 files, 1 obsolete file)
5.62 KB,
patch
|
Details | Diff | Splinter Review | |
630 bytes,
application/x-sh
|
Details | |
9.57 KB,
patch
|
mayhemer
:
review+
|
Details | Diff | Splinter Review |
https://tbpl.mozilla.org/php/getParsedLog.php?id=11282433&tree=Mozilla-Inbound
Rev3 Fedora 12x64 mozilla-inbound debug test xpcshell on 2012-04-27 17:00:52 PDT for push df3acd833280
slave: talos-r3-fed64-052
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
### XPCOM_MEM_LEAK_LOG defined -- logging leaks to /tmp/tmp0Reeqt/runxpcshelltests_leaks.log
TEST-INFO | (xpcshell/head.js) | test 1 pending
WARNING: SSL handshake timed out: file /builds/slave/m-in-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 2118
TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | (xpcshell/head.js) | test 2 finished
TEST-INFO | (xpcshell/head.js) | running event loop
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStartRequest : 50] 200 == 200
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStartRequest : 51] true == true
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStopRequest : 62] true == true
TEST-PASS | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | [testOnStopRequest : 63] true == true
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js | false == true - See following stack:
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_throw :: line 462
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: _do_check_eq :: line 556
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_check_eq :: line 577
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/head.js :: do_check_true :: line 591
JS frame :: /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_spdy.js :: testOnStopRequest :: line 64
TEST-INFO | (xpcshell/head.js) | exiting test
WARNING: cannot post event if not initialized: file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 213
WARNING: cannot post event if not initialized: file /builds/slave/m-in-lnx64-dbg/build/netwerk/protocol/http/nsHttpConnectionMgr.cpp, line 213
WARNING: nsExceptionService ignoring thread destruction after shutdown: file /builds/slave/m-in-lnx64-dbg/build/xpcom/base/nsExceptionService.cpp, line 199
WARNING: OOPDeinit() without successful OOPInit(): file /builds/slave/m-in-lnx64-dbg/build/toolkit/crashreporter/nsExceptionHandler.cpp, line 1998
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 3•13 years ago
|
||
I'm presuming this is a failure of the new test, not of the tested code. Nick let me know if you feel otherwise.
Whiteboard: [orange] → [orange][spdy]
(In reply to Patrick McManus [:mcmanus] from comment #3)
> I'm presuming this is a failure of the new test, not of the tested code.
> Nick let me know if you feel otherwise.
I'll probably need to instrument the test a bit to see exactly why the failure is occurring. Somewhere along the line, we are getting a response on a connection that either the browser or the server (or both!) doesn't think is SPDY.
The first thing that popped into my mind was that perhaps this is the same issue causing the spdy indicator add-on to not show a page as spdy? (Some failure to set the x-firefox-spdy header, perhaps?) Was there ever a resolution on that? I can't find anything in bugzilla, or in email, so maybe I'm just making stuff up in a desperate attempt to not feel bad about causing an intermittent orange?
Patrick, do you remember anything like that when the spdy indicator add-on first hit, or am I really just making stuff up? I'll have more time to look into this Monday morning and instrument the test to see why, exactly, a connection is getting labeled as non-spdy in the test.
Comment 5•13 years ago
|
||
i don't really know anything about the add-on.. I've used it but that's about it.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 14•13 years ago
|
||
I'm running an instrumented version of the test locally in a loop to see if I can figure out which side thinks we're not using spdy. I saw the failure once this morning on my local machine, so I know I can reproduce it, unfortunately it was using the non-instrumented version of the patch when it failed :)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 16•13 years ago
|
||
Alright, I was able to duplicate this (after a few hours running test_spdy.js in a loop), and the error is coming because we get an NS_ERROR_NOT_AVAILABLE when trying to get the x-firefox-spdy header off the response to make sure the connection is actually spdy. My best guess is for some reason is that we fail to negotiate spdy with the server, although whether that's a server issue or a client issue remains to be seen.
Patrick, what would be the best way to get deep debugging info from the client side to see why negotiating SPDY fails?
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 18•13 years ago
|
||
(In reply to Nick Hurley [:hurley] from comment #16)
> Patrick, what would be the best way to get deep debugging info from the
> client side to see why negotiating SPDY fails?
I should note, I mean if there's anything other than http logging & packet capture I can do here (I'm already running with those enabled, which I should've done the first time around).
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 21•13 years ago
|
||
(In reply to Nick Hurley [:hurley] from comment #18)
> (In reply to Nick Hurley [:hurley] from comment #16)
> > Patrick, what would be the best way to get deep debugging info from the
> > client side to see why negotiating SPDY fails?
>
> I should note, I mean if there's anything other than http logging & packet
> capture I can do here (I'm already running with those enabled, which I
> should've done the first time around).
those are what I would start with. Any SSL or PSM logging would probably help too - but I don't know how to do that. honza or bsmith?
Comment 22•13 years ago
|
||
OK, after another 20 hours, I've succeeded in reproducing again, this time with http logging and packet capture! The files are
http://people.mozilla.org/~hurley/spdy_test_failure_xpcshell.txt (this is the output of make check-one on the failed run)
http://people.mozilla.org/~hurley/spdy_test_failure.log (this is the http log from the failed run)
http://people.mozilla.org/~hurley/spdy_test_failure.pcap (this is the packet capture of the failure - it also contains packets from some successful runs, the failures are (obviously) at the end of the capture)
From a quick glance at the pcap, it looks like at some point firefox just decides to use SSLv3 instead of TLS, which means we can't use NPN to upgrade to SPDY. I don't know why it makes this decision (ssl/tls debugging is not my strong suit). Patrick, Josh said to hand this off to you for further investigation.
Assignee: nobody → mcmanus
Comment 23•13 years ago
|
||
nick - thanks for the data!
bsmith - I think I need some help.. I can take it a little farther than nick, but then I'm just guessing at some point..
as nick says for most of the time we tcp connect, offer up a tls v1 client hello with extension 13172 (npn) in it and get back a tls v1 server hello with the npn list of options. you can spot check that on "tcp.stream eq 10" in nick's pcap.
but in stream 697 (or packet 77934 if you prefer) we tcp connect and offer up a SSL 3.0 client hello with no NPN extension. The connection proceeds to use HTTP/1.1.
the preceeding stream 696 (which thankfully is not overlapped) (packet 77920) starts a TLS handshake but the client sends a fin before it is complete. The whole thing happens in 8ms, so there wasn't a timeout that I can think of.. but closing a socket that we don't need is certainly the kind of thing the network stack does these days.
could that failed handshake in 696 (even though the failure was caused by the client) have caused some kind of tls intolerance trigger to downgrade stream 697? 798 is also at v3.
Given that intolerance can happen based on timeouts and is therefore racy, for the purposes of the spdy test it might make sense to just check the SSL version.. but if we've got some kind of behavior that is downgrading our ssl level that needs to be figured out.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 26•13 years ago
|
||
Just for reference, here's the patch I used that instruments the test for (slightly) more info on which end is failing (this was in case, for example, my server-side spdy detection was flaky and that's what was causing the test to identify the connection as non-spdy, which is unfortunately not the case here)
Comment 27•13 years ago
|
||
Also for reference (and in case anyone wants to try this at home), here's the script I used to run the test over and over again until it failed. Paths will need to be changed to match your setup. This script expects to be run with your cwd = the top-level of your mozilla-central checkout.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 44•13 years ago
|
||
First, are we sure that the end of the pcap corresponds to the XPCShell log that was posted?
In that log, the TLS intolerance timeout is definitely being hit:
WARNING: SSL handshake timed out: file /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1015
The code for calculating the timeout is:
PRIntervalTime now = PR_IntervalNow();
PRIntervalTime certVerificationTime =
mCertVerificationEnded - mCertVerificationStarted;
PRIntervalTime totalTime = now - mHandshakeStartTime;
PRIntervalTime totalTimeExceptCertVerificationTime =
totalTime - certVerificationTime;
Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is it possible that PR_IntervalNow() isn't monotonically increasing, which could cause these calculations to be wrong?
HandshakeTimeout() is only called in nsSSLIOLayerPoll (the PSM implementation of the poll method). So, we cannot be detecting a handshake timeout after we've closed a socket, because that would mean that we're polling the socket after we've closed it, which is wrong. (But, even if we polled after we closed the socket, then the PSM nsNSSIOLayer would have been poppoed off the socket so nsSSLIOLayerPoll would never get called.)
Once we've made a successful TLS 1.0 connection to a server, we do not allow fallback to SSL 3.0 for that server for the rest of the browsing session. Perhaps there is a bug caused by the (fully intentional) race between the handshake and the certificate verification, but it seems unlikely. I will look at it more
In (tcp.stream eq 695) we close the connection sending a TLS alert first.
In (tcp.stream eq 696) we close the connection without sending a TLS alert.
The logic for determining whether we send a TLS alert is in ssl_SecureClose:
if (ss->version >= SSL_LIBRARY_VERSION_3_0 && /* cond 1 */
!(ss->shutdownHow & ssl_SHUTDOWN_SEND) && /* cond 2 */
ss->firstHsDone && /* cond 3 */
!ss->recvdCloseNotify && /* cond 4 */
ss->ssl3.initialized) { /* cond 5 */
Conditions 1, 4, 5 are definitely true.
Is it possible that we have shutdown the socket for sending (Condition 2) before closing it?
If condition 3 isn't true, then that would indicate that the certificate verification might not have finished before we closed the connection. But, if certificate verification were still pending, then we would not hit the TLS intolerance timeout because there is an explicit check that ensures we never to the TLS intolerance fallback if cert verification is pending.
If there are more packet captures for this, I would love to see them, to see if we consistently have this "no sent alert" pattern when the test fails.
Comment 45•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #44)
> Is
> it possible that PR_IntervalNow() isn't monotonically increasing, which
> could cause these calculations to be wrong?
I don't believe so.
> Is it possible that we have shutdown the socket for sending (Condition 2)
> before closing it?
We don't shutdown sockets at all (necko). Just closing them.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 49•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #45)
> (In reply to Brian Smith (:bsmith) from comment #44)
> > before closing it?
>
> We don't shutdown sockets at all (necko). Just closing them.
in the trace in question when we close the socket there is unread data queued in the kernel.. that generates the tcp rst. I'm wondering if there is some way that manifests in an error that gets interpreted as the intolerance.
(you would think that would be more of a problem for the peer - after we issue the close I wouldn't think we'd notice any io events, but its a possibility.)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 63•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #44)
> First, are we sure that the end of the pcap corresponds to the XPCShell log
> that was posted?
Yes, there was no other traffic on localhost:4443 on the machine I ran this on, this was the most recent pcap available (I was cycling between 10 different files for posterity), and there would have been no xpcshell runs after the test failed.
> If there are more packet captures for this, I would love to see them, to see
> if we consistently have this "no sent alert" pattern when the test fails.
I will arrange for more pcaps, it just might take a while (given how intermittent this issue can be).
Comment 64•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #44)
> WARNING: SSL handshake timed out: file
> /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line
> 1015
>
[..]
> Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is
> it possible that PR_IntervalNow() isn't monotonically increasing, which
> could cause these calculations to be wrong?
all the tbpl reports are on linux, and nick's repro is on linux. (and actually I think we might only be running the test on linux.) So let's look at linux pr_intervalnow() handling.. :)
a breakpoint on my desktop shows me: http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinrval.c#77 which bears the comment:
* This version of interval times is based on the time of day
* capability offered by system. This isn't valid for two reasons:
* 1) The time of day is neither linear nor montonically increasing
* 2) The units here are milliseconds. That's not appropriate for our use.
*/
doh! And stepping through that you can see that it calls gettimeofday() directly.. a little googling on that and you'll easily see that its not guaranteed monotonic. That would be a pretty big problem.
Linux does provide a cheap monotonic time source - clock_gettiime(CLOCK_MONOTONIC).. perhaps we should be using that?
Comment 65•13 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #64)
> (In reply to Brian Smith (:bsmith) from comment #44)
>
> > WARNING: SSL handshake timed out: file
> > /home/hurley/src/mozilla/m-c/security/manager/ssl/src/nsNSSIOLayer.cpp, line
> > 1015
> >
> [..]
> > Patrick is right that the handshake has only taken 8ms, not 25 SECONDS. Is
> > it possible that PR_IntervalNow() isn't monotonically increasing, which
> > could cause these calculations to be wrong?
>
> all the tbpl reports are on linux, and nick's repro is on linux. (and
> actually I think we might only be running the test on linux.) So let's look
> at linux pr_intervalnow() handling.. :)
It's true that we're only running the test on linux right now, and linux is where the data above comes from, however, I've also reproduced this on my mac. This does appear to use the same gettimeofday implementation as linux (not surprising, given that they're both unix-ish). Not that this particularly adds any more info to the mix, but I wanted to make sure the whole picture (failures on multiple platforms) is known.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 67•13 years ago
|
||
fwiw I have not been able to reproduce it - hundreds of thousands of test_spdy iterations and millions of iterations of a simpler c++ test that just tests the ordering of two pr_intervalnow() calls. but clock issues can be extremely hardware dependent.
if it is a problem with unix interval handling, that would also potentially bring doubt to a lot of different code paths on android.. so while its tempting to just try changing the handshaketimeout() methods to use mozilla::timestamp() (which does use clock_gettime) there is more work to do.
nick can you try this patch? (obviously not suitable for promotion to nspr as is) and see if it changes things for you? I'll also let tryserver at it for a long series of runs - the CI seems to produce it fairly often.
diff --git a/nsprpub/pr/src/Makefile.in b/nsprpub/pr/src/Makefile.in
--- a/nsprpub/pr/src/Makefile.in
+++ b/nsprpub/pr/src/Makefile.in
@@ -160,16 +160,17 @@ ifeq ($(OS_TARGET),Android)
# Android has no libpthread.so in NDK
OS_LIBS = -ldl
else
OS_LIBS = -lpthread -ldl
endif
else
OS_LIBS = -ldl
endif
+OS_LIBS += -lrt
endif
ifeq ($(OS_ARCH),HP-UX)
ifeq ($(USE_PTHREADS), 1)
ifeq (,$(filter-out B.10.10 B.10.20,$(OS_RELEASE)))
OS_LIBS = -ldce
else
OS_LIBS = -lpthread -lrt
diff --git a/nsprpub/pr/src/md/unix/unix.c b/nsprpub/pr/src/md/unix/unix.c
--- a/nsprpub/pr/src/md/unix/unix.c
+++ b/nsprpub/pr/src/md/unix/unix.c
@@ -3046,22 +3046,22 @@ PR_Now(void)
LL_I2L(us, tv.tv_usec);
LL_MUL(s, s, s2us);
LL_ADD(s, s, us);
return s;
}
PRIntervalTime _PR_UNIX_GetInterval()
{
- struct timeval time;
+ struct timespec time;
PRIntervalTime ticks;
- (void)GETTIMEOFDAY(&time); /* fallicy of course */
+ clock_gettime(CLOCK_MONOTONIC, &time);
ticks = (PRUint32)time.tv_sec * PR_MSEC_PER_SEC; /* that's in milliseconds */
- ticks += (PRUint32)time.tv_usec / PR_USEC_PER_MSEC; /* so's that */
+ ticks += (PRUint32)time.tv_nsec / PR_NSEC_PER_MSEC; /* so's that */
return ticks;
} /* _PR_SUNOS_GetInterval */
PRIntervalTime _PR_UNIX_TicksPerSecond()
{
return 1000; /* this needs some work :) */
}
Comment 68•13 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #67)
> nick can you try this patch? (obviously not suitable for promotion to nspr
> as is) and see if it changes things for you? I'll also let tryserver at it
> for a long series of runs - the CI seems to produce it fairly often.
OK, I've got this running on my linux machine. I'll report back in a couple days (or sooner, if the problem recurs before then).
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 72•13 years ago
|
||
Try server continues to have intermittent fails negotiating spdy with the patch from comment 67. so the root cause here is something other than the pr_interval using gettimeofday()
https://tbpl.mozilla.org/?tree=Try&rev=a92e6b55c585
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 75•13 years ago
|
||
I insturmented HandshakeTimeout() to print some stuff and had try reproduce the failure
1.11
1.12 - return totalTimeExceptCertVerificationTime >
1.13 + bool rv = totalTimeExceptCertVerificationTime >
1.14 PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
1.15 + fprintf(stderr,"handshaketimeout:: rv=%d time=%dms state=%d\n",
1.16 + rv, PR_IntervalToMilliseconds(totalTimeExceptCertVerificationTime),
1.17 + mCertVerificationState);
1.18 + return rv;
try::
handshaketimeout:: rv=0 time=0ms state=0
handshaketimeout:: rv=1 time=-3ms state=2
WARNING: SSL handshake timed out: file /builds/slave/try-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1019
(and then in another test handshaketimeout:: rv=1 time=-2ms state=2)
I had expected it to print an absurd elapsed time, indicating something was corrupted or maybe being compared to '0'.. but instead it prints -3ms (or -2ms). I'm kinda surprised that's not -1 if the issue is monotonicity.
I wonder about some kind of a rounding propogation problem in this snippet.. (printervaltime is essentially an unisgned number of milliseconds, taken from gtod which does usec, so they are rounded to form the now() value)
PRIntervalTime now = PR_IntervalNow();
PRIntervalTime certVerificationTime =
mCertVerificationEnded - mCertVerificationStarted;
PRIntervalTime totalTime = now - mHandshakeStartTime;
PRIntervalTime totalTimeExceptCertVerificationTime =
totalTime - certVerificationTime;
bool rv = totalTimeExceptCertVerificationTime >
PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
Comment 76•13 years ago
|
||
the negative time is also consistent with setHandshakeInProgress() being called after setcertverficationwaiting(), though I have nothing else to suggest that is happening.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 79•13 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #76)
> the negative time is also consistent with setHandshakeInProgress() being
> called after setcertverficationwaiting(), though I have nothing else to
> suggest that is happening.
confirmed.
See annotated try failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=11589533&tree=Try#error0
setting certverification started to 2F4DD401, handshake start is 0
setting handshake starttime to 2F4DD407
setting certverification ended to 2F4DD408, handshake start is 2F4DD407
handshaketimeout:: rv=1 time=-6ms state=2
WARNING: SSL handshake timed out: file /builds/slave/try-lnx64-dbg/build/security/manager/ssl/src/nsNSSIOLayer.cpp, line 1024
This is the patch with the printfs: https://hg.mozilla.org/try/rev/8bea3e6f1206
In this case SetHandshakeInProgress() was called 6ms after SetCertVerificationWaiting(), but that's the most extreme example I've seen.
I believe Brian says he knows how to fix the race condition here and offered to do so. Thanks!
Assignee: mcmanus → bsmith
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 103•13 years ago
|
||
Assignee | ||
Comment 104•13 years ago
|
||
Oops. Let's try again:
https://tbpl.mozilla.org/?tree=Try&rev=9f8aa02d059e
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 114•13 years ago
|
||
The TLS intolerance timeout is needed for servers that do not respond *at all*
to our hello request; see the long comment in checkHandshake() in nsNSSIOLayer.cpp.
Once we've received a server hello from the server,
there's no need to do the timeout, and it is better NOT to, because any
timeouts after we've received a response. By explicitly avoiding those wrong
timeouts, we avoid the need for the (incorrect) complicated comparisons of
the PRIntervalTimes I introduced, which caused this bug.
Attachment #623210 -
Flags: review?(honzab.moz)
Attachment #623210 -
Flags: feedback?(mcmanus)
Assignee | ||
Comment 115•13 years ago
|
||
Here is the try build for just this test, with ~450 passing runs:
https://tbpl.mozilla.org/?tree=Try&rev=908620cfe336
Here is the full try run, just submitted:
https://tbpl.mozilla.org/?tree=Try&rev=6380b35ed74b
Assignee | ||
Comment 116•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #114)
> Created attachment 623210 [details] [diff] [review]
> Do not do TLS intolerance handshake timeout after we know we've received the
> server hello.
>
> The TLS intolerance timeout is needed for servers that do not respond *at
> all*
> to our hello request; see the long comment in checkHandshake() in
> nsNSSIOLayer.cpp.
>
> Once we've received a server hello from the server,
> there's no need to do the timeout, and it is better NOT to, because any
> timeouts after we've received a response
... are false positives.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 122•13 years ago
|
||
Comment on attachment 623210 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello.
Review of attachment 623210 [details] [diff] [review]:
-----------------------------------------------------------------
Brian, please add [diff] showfunc=true to your hgrc.
Dropping r because of the first comment. Brian, please fix and retest this patch.
::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +478,5 @@
> #define HANDSHAKE_TIMEOUT_SECONDS 25
>
> bool nsNSSSocketInfo::HandshakeTimeout()
> {
> + if (!!mAllowTLSIntoleranceTimeout)
Rather just a singe '!' ?
According this change, it seems like some of your tests of this code were not done or could have misleading results...
@@ +498,3 @@
> PRIntervalTime now = PR_IntervalNow();
> + bool result = (now - mHandshakeStartTime)
> + > PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
You can do:
static const kHandshakeTimeout = PR_SecondsToInterval(HANDSHAKE_TIMEOUT_SECONDS);
and use that konst rather. Time conversion may be expensive.
@@ -502,3 @@
> PRIntervalTime now = PR_IntervalNow();
> - PRIntervalTime certVerificationTime =
> - mCertVerificationEnded - mCertVerificationStarted;
You can remove these two members now.
Attachment #623210 -
Flags: review?(honzab.moz)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 140•13 years ago
|
||
Comment on attachment 623210 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello.
I don't have anything to add beyond honza's comments other than PR_SecondsToInterval is (cross-platform) just a multiply so there is no perf concern there.. but const is always good :)
Attachment #623210 -
Flags: feedback?(mcmanus)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 206•13 years ago
|
||
Why have we been sitting on a patch for a common intermittent orange like this? What is the next step here?
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 209•13 years ago
|
||
Here's the updated patch that fixes the typo and which address the other review comments.
Attachment #623210 -
Attachment is obsolete: true
Attachment #626931 -
Flags: review?(honzab.moz)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 219•13 years ago
|
||
Comment on attachment 626931 [details] [diff] [review]
Do not do TLS intolerance handshake timeout after we know we've received the server hello [v2]
Review of attachment 626931 [details] [diff] [review]:
-----------------------------------------------------------------
r=honzab
Thanks.
::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +495,1 @@
> }
static const PRIntervalTime handshakeTimeoutInterval = PR_SecondsToInterval(25) is called the first time this line gets executed. It is not a static initializer. Thus, you don't need your lazy init code, compiler does it for you automatically.
Attachment #626931 -
Flags: review?(honzab.moz) → review+
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 228•13 years ago
|
||
Target Milestone: --- → mozilla15
Assignee | ||
Comment 229•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #219)
> static const PRIntervalTime handshakeTimeoutInterval =
> PR_SecondsToInterval(25) is called the first time this line gets executed.
> It is not a static initializer. Thus, you don't need your lazy init code,
> compiler does it for you automatically.
I fixed this before I pushed:
https://hg.mozilla.org/integration/mozilla-inbound/diff/1d09d60b4bc1/security/manager/ssl/src/nsNSSIOLayer.cpp#l1.113
Comment 230•13 years ago
|
||
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange][spdy] → [spdy]
Comment hidden (Legacy TBPL/Treeherder Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•