106188 - Use of blocking connect for SSL sockets locks up the browser

Reporter

Description

•

23 years ago

I finally saw this bug in action tonight (I'd theorized for awhile that it could happen). Basically, if you try to load an SSL site that does not respond, it puts the browser into an unrecoverable state. The particular problem I was seeing was on https://banking.wellsfargo.com/ -- the image at the top of the page is on an SSL server that was not responding. While an SSL connect is being attempted, we block the _entire_ socket transport thread, so it can't even poll other sockets. This means that all TCP/IP activity in the browser is brought to a halt -- you can't load a page in another window, for example. If you try to cancel the SSL load, it locks up the entire browser (I'm guessing that the UI thread might be trying to synchronously proxy an event to the socket transport thread which is being blocked by the connect). So, the reason that I originally wrote this code to execute a blocking connect was because we needed to wedge in the extra force-handshake behavior after the connect completed, and the NSPR version we were using at the time (4.0) did not provide a way to do this for nonblocking connections. With our current NSPR version, we could do this cleanly using PR_ConnectContinue. The other option would be to get rid of PSM's extra NSPR IO layer entirely, as nelsonb has suggested. This would require callers who want the force-handshake behavior to initiate the handshake themselves (by doing a PR_Write of 0 bytes, for example).

Kai Engert [:KaiE:]

Comment 1

•

23 years ago

The mentioned "remove extra layer" bug is bug 99303.

Stephane Saux

Comment 2

•

23 years ago

->kai. If this needs to block on 99303 or needs to be duplicated, then let's do that. If there's any question as to whether we have a consensus on what to do about this, we must first clear that before implementing a solution.

Assignee: ssaux → kaie

Priority: -- → P1

Target Milestone: --- → 2.2

John Unruh

Comment 3

•

23 years ago

*** Bug 84219 has been marked as a duplicate of this bug. ***

Rangan Sen

Comment 4

•

23 years ago

*** Bug 105226 has been marked as a duplicate of this bug. ***

Olav Vitters

Comment 5

•

23 years ago

*** Bug 109540 has been marked as a duplicate of this bug. ***

Kai Engert [:KaiE:]

Updated

•

23 years ago

Depends on: 99303

John Unruh

Comment 6

•

23 years ago

Makring worksforme. This appears to be working now.

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → WORKSFORME

Kai Engert [:KaiE:]

Comment 7

•

23 years ago

> Makring worksforme. This appears to be working now. I'd be very happy if it were fixed, but my understanding is this bug is *very* difficult to reproduce. I think we should not close this bug unless we have set up a wrong behaving environment and tested that.

Roland Mainz

Comment 8

•

23 years ago

kaie: You can easily reproduce it on Linux or Solaris. Steps: - Install kernel filewall like IPtables (Linux) or IPfilter (Solaris) - Configure filewall to block SSH port but allow HTTP - surf around and visit a SSL page The browser should completely lock-up until the connection times out if this issue was not fixed.

Roland Mainz

Comment 9

•

23 years ago

Oups... s/filewall/firewall/ and s/SSH/https/

Kai Engert [:KaiE:]

Comment 10

•

23 years ago

Reopening, we still have this bug. Thanks for the idea, Roland, I found a good test case. Just try to open the following URL: https://www.netscape.com:9999/ Wait two seconds, press the stop button. Browser locks up.

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

MatthiasMann

Comment 11

•

23 years ago

I think this is a serious isue. I had to turn off SSL for my email or otherwise the browser would lock up randomly (because my email server sometimes refuses connection). I think is bug is by far more critical then say the "reload page from server for printing or view source" or something else. It should be threaded like a mozilla crash without a talkback ID. Someone should add some keywords (like blocker or so)

Kai Engert [:KaiE:]

Comment 12

•

23 years ago

*** Bug 120144 has been marked as a duplicate of this bug. ***

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

23 years ago

Blocks: 115520

Christopher Blizzard (:blizzard)

Comment 13

•

23 years ago

Bug 120144 has good stack traces in it and steps to reproduce as well. I don't know how far you are on this...

Kai Engert [:KaiE:]

Comment 14

•

23 years ago

Attached patch Fix for discussion (obsolete) — Details — Splinter Review

Brian suggested to use PR_ConnectContinue to fix the problem. My understanding is that nsSocketTransport::doConnection needs to be extended. He also suggested me the place where I could try to add the function call. I'm attaching a fix that seems to work for me. Darin, do you think the patch is correct?

Kai Engert [:KaiE:]

Comment 15

•

23 years ago

Adding darin to cc list. Darin, can you please review the patch? (see previous comment).

Status: REOPENED → ASSIGNED

Darin Fisher

Comment 16

•

23 years ago

this patch makes sense to me, r=darin please make sure however that you test it thoroughly... this is very sensitive code ;-)

Kai Engert [:KaiE:]

Comment 17

•

23 years ago

blizzard: What do you think? Should we try to check it in for 098? Or wait after the cut and test it on the trunk first?

Kai Engert [:KaiE:]

Comment 18

•

23 years ago

cc'ing Javi. Javi, Bryner, do you agree with the changes to the NSS i/o layer? I tested it somewhat, it seems to work for me with and without using a proxy.

Javier Delgadillo

Comment 19

•

23 years ago

Comment on attachment 65103 [details] [diff] [review] Fix for discussion Assuming the necko mods are correct, then this seems OK to me.

jag (Peter Annema)

Comment 20

•

23 years ago

Comment on attachment 65103 [details] [diff] [review] Fix for discussion sr=jag, provided you've thoroughly tested this

Attachment #65103 - Flags: superreview+

Attachment #65103 - Flags: review+

Darin Fisher

Comment 21

•

23 years ago

so, i feel confident about this patch provided PR_ConnectContinue follows it's implementation per the documentation in prio.h ;-) it seems odd to me that PR_Poll would return PR_POLL_WRITE on a socket that is not yet connected, but after poking around at some socket documentation, i think it makes sense. i've seen documents that state that poll could indicate a readable or writable socket, but you will not know if the socket is really connected unless you call getsockopt(SO_ERROR), which is what PR_ConnectContinue does. cc'ing wtc for reassurance. how was this ever supposed to work before PR_ConnectContinue was added to the NSPR API?

Kai Engert [:KaiE:]

Comment 22

•

23 years ago

I tested this more, including some tests with TLS intolerant sites. My impression is, on both Windows and Linux, everything works when not using a proxy. When using a proxy, flexible https sites work with and without my patch (where no retry to site is necessary). However, when using a proxy, and trying to connect to an TLS intolerant site, it does not work. But it doesn't work, regardless of whether I apply this patch or not. (I updated bug 96024.) I tested Windows (with NSS 3.4) and Linux (NSS 3.3). It seems this patch does not cause a regression.

Christopher Blizzard (:blizzard)

Comment 23

•

23 years ago

kaie, can you file a bug on the issue of TLS intolerant sites not working with a proxy?

Kai Engert [:KaiE:]

Comment 24

•

23 years ago

Blizzard, the bug is 96024.

Wan-Teh Chang

Updated

•

23 years ago

Attachment #65103 - Flags: needs-work+

Wan-Teh Chang

Comment 25

•

23 years ago

Comment on attachment 65103 [details] [diff] [review] Fix for discussion I have to veto this patch. Sorry. 1. mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp I am not qualified to review the changes to this file. They look fine to me. I suggest that instead of commenting out the old code, you should delete it. 2. mozilla/netwerk/base/src/nsSocketTransport.cpp The changes to this file need more work. The whole block marked "Step 3" from line 872 to line 889 has two problems. First, it is not portable. Second, it doesn't work with layered sockets such as SSL sockets (the topic of this bug). I suggest rewriting that block as follows. Note that the checks for PR_POLL_EXCEPT and PR_POLL_HUP in aSelectFlags are done by PR_ConnectContinue(). I also marked a problem with "FIXME" in the comments. // // Step 3: // Process the flags returned by PR_Poll() if any... // else if (aSelectFlags) { status = PR_ConnectContinue(mSocketFD, aSelectFlags); LOG(("nsSocketTransport: PR_ConnectContinue\n")); if (status == PR_SUCCESS) { // // We are connected. // rv = NS_OK; } else { PRErrorCode code = PR_GetError(); // // If the connect is still not ready, then return WOULD_BLOCK... // It is the callers responsibility to place the transport on the // select list of the transport thread... // if (PR_IN_PROGRESS_ERROR == code) { // Set up the select flags for connect... mSelectFlags = (PR_POLL_READ | PR_POLL_EXCEPT | PR_POLL_WRITE); rv = NS_BASE_STREAM_WOULD_BLOCK; } // // The connection failed... // else { LOG(("nsSocketTransport: Connection Failed [%s:%d %x]. PRErrorCode = %x\n", mHostName, mPort, this, code)); // FIXME: "Connection refused" is just one of the possible // errors. Other possible errors are "Connection timed out" // and "No route to host". 'rv' should be based on 'code', // the error code provided by NSPR. rv = NS_ERROR_CONNECTION_REFUSED; } } } else rv = NS_BASE_STREAM_WOULD_BLOCK;

Wan-Teh Chang

Comment 26

•

23 years ago

Darin Fisher wrote: > i've seen documents that state that poll could indicate a > readable or writable socket, but you will not know if the > socket is really connected unless you call getsockopt(SO_ERROR), > which is what PR_ConnectContinue does. Yes, this is correct, at least under BSD sockets. The handling of non-blocking connect is non-portable and differs between BSD-derived Unix, System V derived Unix, and Windows. This is why I invented PR_GetConnectStatus (which I generalized to PR_ConnectContinue in NSPR 4.1) to hide the platform-dependent handling of non-blocking connect. Under BSD sockets, when poll() tells you the socket is writable, you only know the non-blocking connect has completed, but you don't know whether it completed successfully or failed. (The current necko code incorrectly equates a PR_POLL_WRITE flag to a successful completion of connection.) You need to call getsockopt(SO_ERROR) to find out whether the connection succeeded, or the error code if it failed. (The current necko code incorrectly assumes that the error is always "Connection refused".) > how was this ever supposed to work before PR_ConnectContinue > was added to the NSPR API? You would have two problems. 1. You would have false connection success. You would eventually realize the connection failed when you write to the socket. This could be why you did not notice the false connection success problem. 2. Your connection failure error message would always be "Connection refused". Average users would not know there are other possible connection errors, such as "Connection timed out" or "No route to host".

jhp (no longer active)

Comment 27

•

23 years ago

*** Bug 120181 has been marked as a duplicate of this bug. ***

jhp (no longer active)

Comment 28

•

23 years ago

The code posted here also fixes an issue on OS/2 where connecting to localhost with no server running produced no error message (bug 120181).

Darin Fisher

Comment 29

•

23 years ago

wtc: thanks for the clarification on non-blocking connect. we'll probably want to add a necko error for "no route to host" and then add code to docshell to handle this case with an alert dialog. there is already a necko error for "timeout" -> NS_ERROR_NET_TIMEOUT that docshell understands.

Kai Engert [:KaiE:]

Comment 30

•

23 years ago

Attached patch Updated patch — Details — Splinter Review

This patch uses Wan-Teh's code and has the obosolete code in security removed. Please review.

Attachment #65103 - Attachment is obsolete: true

Wan-Teh Chang

Comment 31

•

23 years ago

Comment on attachment 65486 [details] [diff] [review] Updated patch The one-size-fit-all connection error message I pointed out in my "FIXME" comment should also be fixed. But that can be addressed in a separate bug.

Kai Engert [:KaiE:]

Comment 32

•

23 years ago

Filed bug 120717 for the error message. Blizzard, if you approve this, I'll check it in for 098.

Christopher Blizzard (:blizzard)

Comment 33

•

23 years ago

a=blizzard on behalf of drivers for 0.9.8. Let's get this in to find any regressions sooner instead of later.

Keywords: mozilla0.9.8+

Christopher Blizzard (:blizzard)

Updated

•

23 years ago

Whiteboard: [driver:blizzard]

Kai Engert [:KaiE:]

Comment 34

•

23 years ago

Patch checked in, marking fixed.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago → 23 years ago

Resolution: --- → FIXED

Brendan Eich [:brendan]

Comment 35

•

23 years ago

Comment on attachment 65486 [details] [diff] [review] Updated patch Did darin really review the nsSocketTransport.cpp changes? They're shot full of tabs, which should be removed (expanded per the Emacs modeline) ASAP. /be

Attachment #65486 - Flags: needs-work+

Kai Engert [:KaiE:]

Comment 36

•

23 years ago

Attached patch Patch to correct whitespace — Details — Splinter Review

Kai Engert [:KaiE:]

Comment 37

•

23 years ago

It seems the hard tabs were introduced when I copied the text from the html view, and my tab size had been set to "2", so I didn't see the problem. Checked in with a=brendan (from IRC).

Kai Engert [:KaiE:]

Comment 38

•

23 years ago

Reopening, this introduced a regression, see bug 121327.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Darin Fisher

Comment 39

•

23 years ago

kai: IMO this should stay fixed. the fact that this broke BeOS (not a primary platform) is not really any fault of your patch. your patch would work if NSPR implemented its API correctly on BeOS :-)

Simon Fraser [no longer active]

Assignee

Comment 40

•

23 years ago

Mac also regressed: bug 121326.

Simon Fraser [no longer active]

Assignee

Comment 41

•

23 years ago

Note: some of these changes may hvae to be backed out, or #ifdeffed XP_MAC on the 0.9.8 branch, because of bug 121326.

Wan-Teh Chang

Comment 42

•

23 years ago

I still think Kai's patch is correct, but it exposes two Mac NSPR bugs, which must be fixed first. 1. PR_Poll does not work with layered file descriptors on Mac. (SSL sockets are layered NSPR FDs.) This is bug 121951. 2. PR_ConnectContinue has never been tested on Mac before. We need to verify that it is working correctly. This is bug 121952.

Depends on: 121951, 121952

Christopher Blizzard (:blizzard)

Comment 43

•

23 years ago

I don't want to ship with a broken mac, but I don't want to ship with broken windows or linux, either. :) If you guys don't think that you can fix it in the 0.9.8 timeframe, which is only a few days now, we can ifdef it. Feel free to whip up a patch.

Wan-Teh Chang

Comment 44

•

23 years ago

Attached patch Patch (a Unix text file) for the 0.9.8 branch — Details — Splinter Review

Please try this patch for the 0.9.8 branch. This patch reverts to the original code (blocking SSL connect) on Mac.

Daniel (Leaf) Nunes

Comment 45

•

23 years ago

i'll test this patch on the release machine for the trunk as soon as it's finished with it's current build.

lpham

Comment 46

•

23 years ago

I did a test build on OS 9 machine with the patch id=66652. You can download this build from ftp://sweetlou.mcom.com/products/client/seamonkey/macos/8.x/ppc/2002-01-28-11-0. 9.8/ (Fyi, I cannot test the patch on OS X build machine, because the branch verification build today is still building) Loan

Daniel (Leaf) Nunes

Comment 47

•

23 years ago

wan-teh, if you want to check in the patch, go for it so we can get the tree reopened.

Daniel (Leaf) Nunes

Comment 48

•

23 years ago

checked into the branch on behalf of wtc.

Asa Dotzler [:asa]

Comment 49

•

23 years ago

I've tested this morning's Mac classic builds and this patch seems to have fixed the hang on ssl sites. Tested https://verisign.com https://sourceforge.net and http://hotmail.com.

lpham

Comment 50

•

23 years ago

I just respin the Mac OS X build with the same patch id=66652. You can test it from here ftp://sweetlou.netscape.com/products/client/seamonkey/macos/10.x/ppc/2002-01-28- 13-0.9.8 Loan

Simon Fraser [no longer active]

Assignee

Comment 51

•

23 years ago

I just checked the last patch into the trunk (which uses NSPRPUB_PRE_4_2_CLIENT_BRANCH) to get the tree open. This bug is now Mac-only.

Assignee: kaie → sfraser

Status: REOPENED → NEW

OS: All → Mac System 9.x

Hardware: All → Macintosh

Asa Dotzler [:asa]

Comment 52

•

23 years ago

removing keyword, verified on 0.9.8 branch

Keywords: mozilla0.9.8+

Simon Fraser [no longer active]

Assignee

Comment 53

•

23 years ago

NSPR fixes have been checked in (bug 121951, bug 121952). This should work on Mac now.

Status: NEW → RESOLVED

Closed: 23 years ago → 23 years ago

Resolution: --- → FIXED

John Unruh

Comment 54

•

23 years ago

Verified fixed.

Status: RESOLVED → VERIFIED

timeless

Updated

•

20 years ago

Product: PSM → Core

timeless

Updated

•

16 years ago

Version: psm2.0 → 1.0 Branch

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Core → Core Graveyard

Fix for discussion 23 years ago Kai Engert [:KaiE:] 6.10 KB, patch	jag+mozilla : review+ jag+mozilla : superreview+	Details \| Diff \| Splinter Review
Updated patch 23 years ago Kai Engert [:KaiE:] 4.53 KB, patch		Details \| Diff \| Splinter Review
Patch to correct whitespace 23 years ago Kai Engert [:KaiE:] 3.42 KB, patch		Details \| Diff \| Splinter Review
Patch (a Unix text file) for the 0.9.8 branch 23 years ago Wan-Teh Chang 2.68 KB, patch		Details \| Diff \| Splinter Review