Closed
Bug 106188
Opened 23 years ago
Closed 23 years ago
Use of blocking connect for SSL sockets locks up the browser
Categories
(Core Graveyard :: Security: UI, defect, P1)
Tracking
(Not tracked)
VERIFIED
FIXED
psm2.2
People
(Reporter: bryner, Assigned: sfraser_bugs)
References
Details
(Whiteboard: [driver:blizzard])
Attachments
(3 files, 1 obsolete file)
4.53 KB,
patch
|
Details | Diff | Splinter Review | |
3.42 KB,
patch
|
Details | Diff | Splinter Review | |
2.68 KB,
patch
|
Details | Diff | Splinter Review |
I finally saw this bug in action tonight (I'd theorized for awhile that it could
happen). Basically, if you try to load an SSL site that does not respond, it
puts the browser into an unrecoverable state. The particular problem I was
seeing was on https://banking.wellsfargo.com/ -- the image at the top of the
page is on an SSL server that was not responding. While an SSL connect is being
attempted, we block the _entire_ socket transport thread, so it can't even poll
other sockets. This means that all TCP/IP activity in the browser is brought to
a halt -- you can't load a page in another window, for example. If you try to
cancel the SSL load, it locks up the entire browser (I'm guessing that the UI
thread might be trying to synchronously proxy an event to the socket transport
thread which is being blocked by the connect).
So, the reason that I originally wrote this code to execute a blocking connect
was because we needed to wedge in the extra force-handshake behavior after the
connect completed, and the NSPR version we were using at the time (4.0) did not
provide a way to do this for nonblocking connections. With our current NSPR
version, we could do this cleanly using PR_ConnectContinue. The other option
would be to get rid of PSM's extra NSPR IO layer entirely, as nelsonb has
suggested. This would require callers who want the force-handshake behavior to
initiate the handshake themselves (by doing a PR_Write of 0 bytes, for example).
Comment 2•23 years ago
|
||
->kai.
If this needs to block on 99303 or needs to be duplicated, then let's do that.
If there's any question as to whether we have a consensus on what to do about
this, we must first clear that before implementing a solution.
Assignee: ssaux → kaie
Priority: -- → P1
Target Milestone: --- → 2.2
Comment 4•23 years ago
|
||
*** Bug 105226 has been marked as a duplicate of this bug. ***
Comment 5•23 years ago
|
||
*** Bug 109540 has been marked as a duplicate of this bug. ***
Comment 6•23 years ago
|
||
Makring worksforme. This appears to be working now.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
Comment 7•23 years ago
|
||
> Makring worksforme. This appears to be working now.
I'd be very happy if it were fixed, but my understanding is this bug is *very*
difficult to reproduce.
I think we should not close this bug unless we have set up a wrong behaving
environment and tested that.
Comment 8•23 years ago
|
||
kaie:
You can easily reproduce it on Linux or Solaris.
Steps:
- Install kernel filewall like IPtables (Linux) or IPfilter (Solaris)
- Configure filewall to block SSH port but allow HTTP
- surf around and visit a SSL page
The browser should completely lock-up until the connection times out if this
issue was not fixed.
Comment 9•23 years ago
|
||
Oups...
s/filewall/firewall/ and s/SSH/https/
Comment 10•23 years ago
|
||
Reopening, we still have this bug.
Thanks for the idea, Roland, I found a good test case.
Just try to open the following URL:
https://www.netscape.com:9999/
Wait two seconds, press the stop button. Browser locks up.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 11•23 years ago
|
||
I think this is a serious isue. I had to turn off SSL for my email or otherwise
the browser would lock up randomly (because my email server sometimes refuses
connection). I think is bug is by far more critical then say the "reload page
from server for printing or view source" or something else. It should be
threaded like a mozilla crash without a talkback ID. Someone should add some
keywords (like blocker or so)
Comment 12•23 years ago
|
||
*** Bug 120144 has been marked as a duplicate of this bug. ***
Blocks: 115520
Comment 13•23 years ago
|
||
Bug 120144 has good stack traces in it and steps to reproduce as well. I don't
know how far you are on this...
Comment 14•23 years ago
|
||
Brian suggested to use PR_ConnectContinue to fix the problem. My understanding
is that nsSocketTransport::doConnection needs to be extended. He also suggested
me the place where I could try to add the function call.
I'm attaching a fix that seems to work for me.
Darin, do you think the patch is correct?
Comment 15•23 years ago
|
||
Adding darin to cc list.
Darin, can you please review the patch? (see previous comment).
Status: REOPENED → ASSIGNED
Comment 16•23 years ago
|
||
this patch makes sense to me, r=darin
please make sure however that you test it thoroughly... this is very sensitive
code ;-)
Comment 17•23 years ago
|
||
blizzard: What do you think? Should we try to check it in for 098? Or wait after
the cut and test it on the trunk first?
Comment 18•23 years ago
|
||
cc'ing Javi.
Javi, Bryner, do you agree with the changes to the NSS i/o layer?
I tested it somewhat, it seems to work for me with and without using a proxy.
Comment 19•23 years ago
|
||
Comment on attachment 65103 [details] [diff] [review]
Fix for discussion
Assuming the necko mods are correct, then this seems OK to me.
Comment 20•23 years ago
|
||
Comment on attachment 65103 [details] [diff] [review]
Fix for discussion
sr=jag, provided you've thoroughly tested this
Attachment #65103 -
Flags: superreview+
Attachment #65103 -
Flags: review+
Comment 21•23 years ago
|
||
so, i feel confident about this patch provided PR_ConnectContinue follows it's
implementation per the documentation in prio.h ;-)
it seems odd to me that PR_Poll would return PR_POLL_WRITE on a socket that is
not yet connected, but after poking around at some socket documentation, i think
it makes sense. i've seen documents that state that poll could indicate a
readable or writable socket, but you will not know if the socket is really
connected unless you call getsockopt(SO_ERROR), which is what PR_ConnectContinue
does.
cc'ing wtc for reassurance.
how was this ever supposed to work before PR_ConnectContinue was added to the
NSPR API?
Comment 22•23 years ago
|
||
I tested this more, including some tests with TLS intolerant sites. My
impression is, on both Windows and Linux, everything works when not using a proxy.
When using a proxy, flexible https sites work with and without my patch (where
no retry to site is necessary).
However, when using a proxy, and trying to connect to an TLS intolerant site, it
does not work. But it doesn't work, regardless of whether I apply this patch or
not. (I updated bug 96024.)
I tested Windows (with NSS 3.4) and Linux (NSS 3.3).
It seems this patch does not cause a regression.
Comment 23•23 years ago
|
||
kaie, can you file a bug on the issue of TLS intolerant sites not working with a
proxy?
Comment 24•23 years ago
|
||
Blizzard, the bug is 96024.
Updated•23 years ago
|
Attachment #65103 -
Flags: needs-work+
Comment 25•23 years ago
|
||
Comment on attachment 65103 [details] [diff] [review]
Fix for discussion
I have to veto this patch. Sorry.
1. mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp
I am not qualified to review the changes to this file.
They look fine to me. I suggest that instead of commenting
out the old code, you should delete it.
2. mozilla/netwerk/base/src/nsSocketTransport.cpp
The changes to this file need more work.
The whole block marked "Step 3" from line 872 to line 889
has two problems. First, it is not portable. Second, it
doesn't work with layered sockets such as SSL sockets (the
topic of this bug). I suggest rewriting that block as follows.
Note that the checks for PR_POLL_EXCEPT and PR_POLL_HUP in
aSelectFlags are done by PR_ConnectContinue(). I also marked
a problem with "FIXME" in the comments.
//
// Step 3:
// Process the flags returned by PR_Poll() if any...
//
else if (aSelectFlags) {
status = PR_ConnectContinue(mSocketFD, aSelectFlags);
LOG(("nsSocketTransport: PR_ConnectContinue\n"));
if (status == PR_SUCCESS) {
//
// We are connected.
//
rv = NS_OK;
}
else {
PRErrorCode code = PR_GetError();
//
// If the connect is still not ready, then return WOULD_BLOCK...
// It is the callers responsibility to place the transport on the
// select list of the transport thread...
//
if (PR_IN_PROGRESS_ERROR == code) {
// Set up the select flags for connect...
mSelectFlags = (PR_POLL_READ | PR_POLL_EXCEPT | PR_POLL_WRITE);
rv = NS_BASE_STREAM_WOULD_BLOCK;
}
//
// The connection failed...
//
else {
LOG(("nsSocketTransport: Connection Failed [%s:%d %x].
PRErrorCode = %x\n",
mHostName, mPort, this, code));
// FIXME: "Connection refused" is just one of the possible
// errors. Other possible errors are "Connection timed out"
// and "No route to host". 'rv' should be based on 'code',
// the error code provided by NSPR.
rv = NS_ERROR_CONNECTION_REFUSED;
}
}
} else
rv = NS_BASE_STREAM_WOULD_BLOCK;
Comment 26•23 years ago
|
||
Darin Fisher wrote:
> i've seen documents that state that poll could indicate a
> readable or writable socket, but you will not know if the
> socket is really connected unless you call getsockopt(SO_ERROR),
> which is what PR_ConnectContinue does.
Yes, this is correct, at least under BSD sockets. The handling
of non-blocking connect is non-portable and differs between
BSD-derived Unix, System V derived Unix, and Windows. This is
why I invented PR_GetConnectStatus (which I generalized to
PR_ConnectContinue in NSPR 4.1) to hide the platform-dependent
handling of non-blocking connect.
Under BSD sockets, when poll() tells you the socket is writable,
you only know the non-blocking connect has completed, but you
don't know whether it completed successfully or failed. (The
current necko code incorrectly equates a PR_POLL_WRITE flag to
a successful completion of connection.) You need to call
getsockopt(SO_ERROR) to find out whether the connection succeeded,
or the error code if it failed. (The current necko code incorrectly
assumes that the error is always "Connection refused".)
> how was this ever supposed to work before PR_ConnectContinue
> was added to the NSPR API?
You would have two problems.
1. You would have false connection success. You would eventually
realize the connection failed when you write to the socket. This
could be why you did not notice the false connection success problem.
2. Your connection failure error message would always be "Connection
refused". Average users would not know there are other possible
connection errors, such as "Connection timed out" or "No route to
host".
Comment 27•23 years ago
|
||
*** Bug 120181 has been marked as a duplicate of this bug. ***
Comment 28•23 years ago
|
||
The code posted here also fixes an issue on OS/2 where connecting to
localhost with no server running produced no error message (bug
120181).
Comment 29•23 years ago
|
||
wtc: thanks for the clarification on non-blocking connect. we'll probably want
to add a necko error for "no route to host" and then add code to docshell to
handle this case with an alert dialog. there is already a necko error for
"timeout" -> NS_ERROR_NET_TIMEOUT that docshell understands.
Comment 30•23 years ago
|
||
This patch uses Wan-Teh's code and has the obosolete code in security removed.
Please review.
Attachment #65103 -
Attachment is obsolete: true
Comment 31•23 years ago
|
||
Comment on attachment 65486 [details] [diff] [review]
Updated patch
The one-size-fit-all connection error message I pointed
out in my "FIXME" comment should also be fixed. But
that can be addressed in a separate bug.
Comment 32•23 years ago
|
||
Filed bug 120717 for the error message.
Blizzard, if you approve this, I'll check it in for 098.
Comment 33•23 years ago
|
||
a=blizzard on behalf of drivers for 0.9.8. Let's get this in to find any
regressions sooner instead of later.
Keywords: mozilla0.9.8+
Updated•23 years ago
|
Whiteboard: [driver:blizzard]
Comment 34•23 years ago
|
||
Patch checked in, marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 35•23 years ago
|
||
Comment on attachment 65486 [details] [diff] [review]
Updated patch
Did darin really review the nsSocketTransport.cpp changes? They're shot full
of tabs, which should be removed (expanded per the Emacs modeline) ASAP.
/be
Attachment #65486 -
Flags: needs-work+
Comment 36•23 years ago
|
||
Comment 37•23 years ago
|
||
It seems the hard tabs were introduced when I copied the text from the html
view, and my tab size had been set to "2", so I didn't see the problem.
Checked in with a=brendan (from IRC).
Comment 38•23 years ago
|
||
Reopening, this introduced a regression, see bug 121327.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 39•23 years ago
|
||
kai: IMO this should stay fixed. the fact that this broke BeOS (not a primary
platform) is not really any fault of your patch. your patch would work if NSPR
implemented its API correctly on BeOS :-)
Assignee | ||
Comment 40•23 years ago
|
||
Mac also regressed: bug 121326.
Assignee | ||
Comment 41•23 years ago
|
||
Note: some of these changes may hvae to be backed out, or #ifdeffed XP_MAC on the
0.9.8 branch, because of bug 121326.
Comment 42•23 years ago
|
||
I still think Kai's patch is correct, but it exposes
two Mac NSPR bugs, which must be fixed first.
1. PR_Poll does not work with layered file descriptors
on Mac. (SSL sockets are layered NSPR FDs.) This is
bug 121951.
2. PR_ConnectContinue has never been tested on Mac
before. We need to verify that it is working correctly.
This is bug 121952.
Comment 43•23 years ago
|
||
I don't want to ship with a broken mac, but I don't want to ship with broken
windows or linux, either. :)
If you guys don't think that you can fix it in the 0.9.8 timeframe, which is
only a few days now, we can ifdef it. Feel free to whip up a patch.
Comment 44•23 years ago
|
||
Please try this patch for the 0.9.8 branch.
This patch reverts to the original code (blocking
SSL connect) on Mac.
Comment 45•23 years ago
|
||
i'll test this patch on the release machine for the trunk as soon as it's
finished with it's current build.
Comment 46•23 years ago
|
||
I did a test build on OS 9 machine with the patch id=66652. You can download
this build from
ftp://sweetlou.mcom.com/products/client/seamonkey/macos/8.x/ppc/2002-01-28-11-0.
9.8/
(Fyi, I cannot test the patch on OS X build machine, because the branch
verification build today is still building)
Loan
Comment 47•23 years ago
|
||
wan-teh, if you want to check in the patch, go for it so we can get the tree
reopened.
Comment 48•23 years ago
|
||
checked into the branch on behalf of wtc.
Comment 49•23 years ago
|
||
I've tested this morning's Mac classic builds and this patch seems to have fixed
the hang on ssl sites. Tested https://verisign.com https://sourceforge.net and
http://hotmail.com.
Comment 50•23 years ago
|
||
I just respin the Mac OS X build with the same patch id=66652. You can test it
from here
ftp://sweetlou.netscape.com/products/client/seamonkey/macos/10.x/ppc/2002-01-28-
13-0.9.8
Loan
Assignee | ||
Comment 51•23 years ago
|
||
I just checked the last patch into the trunk (which uses
NSPRPUB_PRE_4_2_CLIENT_BRANCH) to get the tree open. This bug is now Mac-only.
Assignee: kaie → sfraser
Status: REOPENED → NEW
OS: All → Mac System 9.x
Hardware: All → Macintosh
Assignee | ||
Comment 53•23 years ago
|
||
NSPR fixes have been checked in (bug 121951, bug 121952). This should work on Mac
now.
Status: NEW → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Updated•8 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•