Closed Bug 106188 Opened 23 years ago Closed 23 years ago

Use of blocking connect for SSL sockets locks up the browser

Categories

(Core Graveyard :: Security: UI, defect, P1)

1.0 Branch
PowerPC
Mac System 9.x
defect

Tracking

(Not tracked)

VERIFIED FIXED
psm2.2

People

(Reporter: bryner, Assigned: sfraser_bugs)

References

Details

(Whiteboard: [driver:blizzard])

Attachments

(3 files, 1 obsolete file)

I finally saw this bug in action tonight (I'd theorized for awhile that it could happen). Basically, if you try to load an SSL site that does not respond, it puts the browser into an unrecoverable state. The particular problem I was seeing was on https://banking.wellsfargo.com/ -- the image at the top of the page is on an SSL server that was not responding. While an SSL connect is being attempted, we block the _entire_ socket transport thread, so it can't even poll other sockets. This means that all TCP/IP activity in the browser is brought to a halt -- you can't load a page in another window, for example. If you try to cancel the SSL load, it locks up the entire browser (I'm guessing that the UI thread might be trying to synchronously proxy an event to the socket transport thread which is being blocked by the connect). So, the reason that I originally wrote this code to execute a blocking connect was because we needed to wedge in the extra force-handshake behavior after the connect completed, and the NSPR version we were using at the time (4.0) did not provide a way to do this for nonblocking connections. With our current NSPR version, we could do this cleanly using PR_ConnectContinue. The other option would be to get rid of PSM's extra NSPR IO layer entirely, as nelsonb has suggested. This would require callers who want the force-handshake behavior to initiate the handshake themselves (by doing a PR_Write of 0 bytes, for example).
The mentioned "remove extra layer" bug is bug 99303.
->kai. If this needs to block on 99303 or needs to be duplicated, then let's do that. If there's any question as to whether we have a consensus on what to do about this, we must first clear that before implementing a solution.
Assignee: ssaux → kaie
Priority: -- → P1
Target Milestone: --- → 2.2
*** Bug 84219 has been marked as a duplicate of this bug. ***
*** Bug 105226 has been marked as a duplicate of this bug. ***
*** Bug 109540 has been marked as a duplicate of this bug. ***
Depends on: 99303
Makring worksforme. This appears to be working now.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
> Makring worksforme. This appears to be working now. I'd be very happy if it were fixed, but my understanding is this bug is *very* difficult to reproduce. I think we should not close this bug unless we have set up a wrong behaving environment and tested that.
kaie: You can easily reproduce it on Linux or Solaris. Steps: - Install kernel filewall like IPtables (Linux) or IPfilter (Solaris) - Configure filewall to block SSH port but allow HTTP - surf around and visit a SSL page The browser should completely lock-up until the connection times out if this issue was not fixed.
Oups... s/filewall/firewall/ and s/SSH/https/
Reopening, we still have this bug. Thanks for the idea, Roland, I found a good test case. Just try to open the following URL: https://www.netscape.com:9999/ Wait two seconds, press the stop button. Browser locks up.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I think this is a serious isue. I had to turn off SSL for my email or otherwise the browser would lock up randomly (because my email server sometimes refuses connection). I think is bug is by far more critical then say the "reload page from server for printing or view source" or something else. It should be threaded like a mozilla crash without a talkback ID. Someone should add some keywords (like blocker or so)
*** Bug 120144 has been marked as a duplicate of this bug. ***
Bug 120144 has good stack traces in it and steps to reproduce as well. I don't know how far you are on this...
Attached patch Fix for discussion (obsolete) — Splinter Review
Brian suggested to use PR_ConnectContinue to fix the problem. My understanding is that nsSocketTransport::doConnection needs to be extended. He also suggested me the place where I could try to add the function call. I'm attaching a fix that seems to work for me. Darin, do you think the patch is correct?
Adding darin to cc list. Darin, can you please review the patch? (see previous comment).
Status: REOPENED → ASSIGNED
this patch makes sense to me, r=darin please make sure however that you test it thoroughly... this is very sensitive code ;-)
blizzard: What do you think? Should we try to check it in for 098? Or wait after the cut and test it on the trunk first?
cc'ing Javi. Javi, Bryner, do you agree with the changes to the NSS i/o layer? I tested it somewhat, it seems to work for me with and without using a proxy.
Comment on attachment 65103 [details] [diff] [review] Fix for discussion Assuming the necko mods are correct, then this seems OK to me.
Comment on attachment 65103 [details] [diff] [review] Fix for discussion sr=jag, provided you've thoroughly tested this
Attachment #65103 - Flags: superreview+
Attachment #65103 - Flags: review+
so, i feel confident about this patch provided PR_ConnectContinue follows it's implementation per the documentation in prio.h ;-) it seems odd to me that PR_Poll would return PR_POLL_WRITE on a socket that is not yet connected, but after poking around at some socket documentation, i think it makes sense. i've seen documents that state that poll could indicate a readable or writable socket, but you will not know if the socket is really connected unless you call getsockopt(SO_ERROR), which is what PR_ConnectContinue does. cc'ing wtc for reassurance. how was this ever supposed to work before PR_ConnectContinue was added to the NSPR API?
I tested this more, including some tests with TLS intolerant sites. My impression is, on both Windows and Linux, everything works when not using a proxy. When using a proxy, flexible https sites work with and without my patch (where no retry to site is necessary). However, when using a proxy, and trying to connect to an TLS intolerant site, it does not work. But it doesn't work, regardless of whether I apply this patch or not. (I updated bug 96024.) I tested Windows (with NSS 3.4) and Linux (NSS 3.3). It seems this patch does not cause a regression.
kaie, can you file a bug on the issue of TLS intolerant sites not working with a proxy?
Blizzard, the bug is 96024.
Attachment #65103 - Flags: needs-work+
Comment on attachment 65103 [details] [diff] [review] Fix for discussion I have to veto this patch. Sorry. 1. mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp I am not qualified to review the changes to this file. They look fine to me. I suggest that instead of commenting out the old code, you should delete it. 2. mozilla/netwerk/base/src/nsSocketTransport.cpp The changes to this file need more work. The whole block marked "Step 3" from line 872 to line 889 has two problems. First, it is not portable. Second, it doesn't work with layered sockets such as SSL sockets (the topic of this bug). I suggest rewriting that block as follows. Note that the checks for PR_POLL_EXCEPT and PR_POLL_HUP in aSelectFlags are done by PR_ConnectContinue(). I also marked a problem with "FIXME" in the comments. // // Step 3: // Process the flags returned by PR_Poll() if any... // else if (aSelectFlags) { status = PR_ConnectContinue(mSocketFD, aSelectFlags); LOG(("nsSocketTransport: PR_ConnectContinue\n")); if (status == PR_SUCCESS) { // // We are connected. // rv = NS_OK; } else { PRErrorCode code = PR_GetError(); // // If the connect is still not ready, then return WOULD_BLOCK... // It is the callers responsibility to place the transport on the // select list of the transport thread... // if (PR_IN_PROGRESS_ERROR == code) { // Set up the select flags for connect... mSelectFlags = (PR_POLL_READ | PR_POLL_EXCEPT | PR_POLL_WRITE); rv = NS_BASE_STREAM_WOULD_BLOCK; } // // The connection failed... // else { LOG(("nsSocketTransport: Connection Failed [%s:%d %x]. PRErrorCode = %x\n", mHostName, mPort, this, code)); // FIXME: "Connection refused" is just one of the possible // errors. Other possible errors are "Connection timed out" // and "No route to host". 'rv' should be based on 'code', // the error code provided by NSPR. rv = NS_ERROR_CONNECTION_REFUSED; } } } else rv = NS_BASE_STREAM_WOULD_BLOCK;
Darin Fisher wrote: > i've seen documents that state that poll could indicate a > readable or writable socket, but you will not know if the > socket is really connected unless you call getsockopt(SO_ERROR), > which is what PR_ConnectContinue does. Yes, this is correct, at least under BSD sockets. The handling of non-blocking connect is non-portable and differs between BSD-derived Unix, System V derived Unix, and Windows. This is why I invented PR_GetConnectStatus (which I generalized to PR_ConnectContinue in NSPR 4.1) to hide the platform-dependent handling of non-blocking connect. Under BSD sockets, when poll() tells you the socket is writable, you only know the non-blocking connect has completed, but you don't know whether it completed successfully or failed. (The current necko code incorrectly equates a PR_POLL_WRITE flag to a successful completion of connection.) You need to call getsockopt(SO_ERROR) to find out whether the connection succeeded, or the error code if it failed. (The current necko code incorrectly assumes that the error is always "Connection refused".) > how was this ever supposed to work before PR_ConnectContinue > was added to the NSPR API? You would have two problems. 1. You would have false connection success. You would eventually realize the connection failed when you write to the socket. This could be why you did not notice the false connection success problem. 2. Your connection failure error message would always be "Connection refused". Average users would not know there are other possible connection errors, such as "Connection timed out" or "No route to host".
*** Bug 120181 has been marked as a duplicate of this bug. ***
The code posted here also fixes an issue on OS/2 where connecting to localhost with no server running produced no error message (bug 120181).
wtc: thanks for the clarification on non-blocking connect. we'll probably want to add a necko error for "no route to host" and then add code to docshell to handle this case with an alert dialog. there is already a necko error for "timeout" -> NS_ERROR_NET_TIMEOUT that docshell understands.
Attached patch Updated patchSplinter Review
This patch uses Wan-Teh's code and has the obosolete code in security removed. Please review.
Attachment #65103 - Attachment is obsolete: true
Comment on attachment 65486 [details] [diff] [review] Updated patch The one-size-fit-all connection error message I pointed out in my "FIXME" comment should also be fixed. But that can be addressed in a separate bug.
Filed bug 120717 for the error message. Blizzard, if you approve this, I'll check it in for 098.
a=blizzard on behalf of drivers for 0.9.8. Let's get this in to find any regressions sooner instead of later.
Keywords: mozilla0.9.8+
Whiteboard: [driver:blizzard]
Patch checked in, marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Comment on attachment 65486 [details] [diff] [review] Updated patch Did darin really review the nsSocketTransport.cpp changes? They're shot full of tabs, which should be removed (expanded per the Emacs modeline) ASAP. /be
Attachment #65486 - Flags: needs-work+
It seems the hard tabs were introduced when I copied the text from the html view, and my tab size had been set to "2", so I didn't see the problem. Checked in with a=brendan (from IRC).
Reopening, this introduced a regression, see bug 121327.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
kai: IMO this should stay fixed. the fact that this broke BeOS (not a primary platform) is not really any fault of your patch. your patch would work if NSPR implemented its API correctly on BeOS :-)
Mac also regressed: bug 121326.
Note: some of these changes may hvae to be backed out, or #ifdeffed XP_MAC on the 0.9.8 branch, because of bug 121326.
I still think Kai's patch is correct, but it exposes two Mac NSPR bugs, which must be fixed first. 1. PR_Poll does not work with layered file descriptors on Mac. (SSL sockets are layered NSPR FDs.) This is bug 121951. 2. PR_ConnectContinue has never been tested on Mac before. We need to verify that it is working correctly. This is bug 121952.
Depends on: 121951, 121952
I don't want to ship with a broken mac, but I don't want to ship with broken windows or linux, either. :) If you guys don't think that you can fix it in the 0.9.8 timeframe, which is only a few days now, we can ifdef it. Feel free to whip up a patch.
Please try this patch for the 0.9.8 branch. This patch reverts to the original code (blocking SSL connect) on Mac.
i'll test this patch on the release machine for the trunk as soon as it's finished with it's current build.
I did a test build on OS 9 machine with the patch id=66652. You can download this build from ftp://sweetlou.mcom.com/products/client/seamonkey/macos/8.x/ppc/2002-01-28-11-0. 9.8/ (Fyi, I cannot test the patch on OS X build machine, because the branch verification build today is still building) Loan
wan-teh, if you want to check in the patch, go for it so we can get the tree reopened.
checked into the branch on behalf of wtc.
I've tested this morning's Mac classic builds and this patch seems to have fixed the hang on ssl sites. Tested https://verisign.com https://sourceforge.net and http://hotmail.com.
I just respin the Mac OS X build with the same patch id=66652. You can test it from here ftp://sweetlou.netscape.com/products/client/seamonkey/macos/10.x/ppc/2002-01-28- 13-0.9.8 Loan
I just checked the last patch into the trunk (which uses NSPRPUB_PRE_4_2_CLIENT_BRANCH) to get the tree open. This bug is now Mac-only.
Assignee: kaie → sfraser
Status: REOPENED → NEW
OS: All → Mac System 9.x
Hardware: All → Macintosh
removing keyword, verified on 0.9.8 branch
Keywords: mozilla0.9.8+
NSPR fixes have been checked in (bug 121951, bug 121952). This should work on Mac now.
Status: NEW → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Verified fixed.
Status: RESOLVED → VERIFIED
Product: PSM → Core
Version: psm2.0 → 1.0 Branch
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: