Closed Bug 153769 Opened 22 years ago Closed 22 years ago

[PATCH] Race condition causing connection timeout when using pop3s (pop3 with ssl)

Categories

(Core Graveyard :: Security: UI, defect, P2)

Other Branch
x86
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED
Future

People

(Reporter: michael, Assigned: KaiE)

References

Details

Attachments

(1 file)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0) Gecko/20020529
BuildID:    2002052918

When using using pop3s (connection to server on the local machine) 
the connection will more often than not time out while performing the 
login stages (it fails about 9 times out of 10).

Reproducible: Sometimes
Steps to Reproduce:
1.  rm -rf .mozilla 
2.  Start mozilla, set up new mail account with secure connection (SSL)
3.  Attemp to download emails
4.  If successful, repeat step 3

Actual Results:  Mozilla is apparently trying to connect to the pop3s server. 
It won't connect to anything else (pop3 or http).  Eventually, the 
pop3s connection attempt times out.

Expected Results:  Connected to the pop3s server, checked for new mail and
collected any that were waiting.

In nsSSLIOLayerWrite (security/manager/ssl/src/nsNSSIOLayer.cpp)
the socket is put into blocking mode for the first write (even if it
has been set as a non-blocking socket).  If a read occurs at the wrong 
time, it will block.  (At the higher levels, reads always request more 
data than is available so the low level read will fetch the data that 
is there, then go round again to see if there is any more - this is 
where it is blocking.)

I have got round this by adding a lock to nsNSSSocketInfo which is 
grabbed by the nsSSLIOLayerRead function (always) and the 
nsSSLIOLayerWrite function (only on the first write where it is 
changing the socket to blocking mode and only if was previously in 
non-blocking mode).  I have no idea if this is the Right Thing (but 
it works for me :).  I will attach the patch for this in case it is 
of any interest.  (The files affected are nsNSSIOLayer.cpp and 
nsNSSIOLayer.h)

As an aside, if anyone adds a comment to this and knows how to get 
hold of the output of the PR_LOG( .. ) calls I would find this 
very helpful.
Knew I'd forgotten something.  The patch is for 1.0.  Previous 
0.9.x versions and 1.1a also display the same behaviour.
>POP
Assignee: mstoltz → naving
Component: Security: General → Networking: POP
QA Contact: junruh → sheelar
Marking NEW for patch loving. Reporter please email one of the owners of the
module http://www.mozilla.org/owners.html and cc: reviewers@mozilla.org asking
for a review and super review of that patch. Thanks again for the patch.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Race condition causing connection timeout when using pop3s (pop3 with ssl) → [PATCH] Race condition causing connection timeout when using pop3s (pop3 with ssl)
Michael, thanks for the patch, but in the future, please use "diff -u" to
produce unified patches, they are easier to read and apply, and are usually used
for patches attached in bugzilla.

The patch itself looks good, but I want to have a closer look for potential side
effects or deadlocks, before I give the reviewed attribute.
This bug should go to the owner of this code. Who owns this code ? looks like
javi@netscape.com

reassigning to javi. 
Assignee: naving → javi
assigning to Kai.
Assignee: javi → kaie
Priority: -- → P2
Target Milestone: --- → Future
See also bug 87902. Before I review the patch, I want to review the SSL I/O
layer code again.
over to PSM
Component: Networking: POP → Client Library
Keywords: nsbeta1
Product: MailNews → PSM
QA Contact: sheelar → junruh
Version: other → unspecified
Michael,

first let me say, I've been using POP3/SSL on Linux every day for about a year
for my private e-mail account, and I have never run into the problem you describe.

I tried to understand what you are doing with your patch.

Bascially one can describe Mozilla's I/O as layered functions, that generally
pass read requests through to lower layer to do the real work, and can add their
own special handling on certain events.

Layer 1: Necko and Mozilla's network thread
Layer 2: Mozilla's PSM logic to encapsulate security I/O logic from the rest of
the Mozilla code.
Layer 3: The NSS library which is a general purpose library to do SSL communication.

In layer 2, we have a read and a write function.

Normally, I/O is asynchronous. But still I assume that Layer 1 would never use
multithreading to call Layer2Read and Layer2Write at the same time.

On the first write in Layer2Write, we switch to blocked I/O to detect some
special failure.

The lock you added the following locks:

Layer2Read
{
  lock
  call Layer3Read
  unlock
}

Layer2Write
{
  lock
  switch to blocking mode if necessary
  call Layer3Write
  do some error checks and data structure updates
  unlock
}


I can not understand how this could have helped.

I only understood, if during our blocking call to Layer3Write, Layer 1 would be
reached and executed a call to Layer2Read.

I would assume that's not possible.


My only guess is that your change caused a different timing.
I see that Layer 2 (the code you were trying to change) is using obsolete
workarounds, that I did not fully understand. Those workarounds included writing
null bytes to trigger other activity. Actually, when I tried to change that
workaround in an attempt to fix bug 87902, I broke the functionality.

So I consider the existing code is broken. I have created a patch to bug 87902
that removes the workarounds and changes the code to a simpler logic. Our tests
show, that SSL communication with the latest patch from 87902 still works.

I will check in 87902 today, and I hope everything will continue to work.

Please try out the patch from bug 87902 or just test tomorrow's trunk build and
let us know whether you still see your problem.

Before I check in I will make a final test with my patch and POP3/SSL, which I
realized I have not yet tested.
Thanks for your detailed response, Kai.  I tried the patch (on 1.1a) and 
also the 20020815 build but I still have the same problem.  My assumption 
is that reads and writes are being handled on separate threads with the 
write thread being the one passing through the POP3 protocol state 
machine and the read thread waiting on select and sending a message to the 
write thread when data had been collected and put into the pipe.  The order of
events would then be:

1) Layer2Write puts socket in blocking mode
2) Layer2Write calls Layer3Write
3) Server responds and data becomes available to read
4) Read thread is woken up, calls down to Layer3Read and blocks 
   (because read is always called with a request for more data
   than is available)
5) Layer3Write returns and Layer2Write puts socket back into
   non-blocking mode.
6) Read thread is now stuck in a blocking read so no message gets
   back to the write thread that data is available to process.

The function of the lock was to hold up the read until the write had
put the socket back into non-blocking mode.

This would effectively be the situation that you describe with Layer 1 
being responsible for setting off reads and writes at the same time.  I
have now managed to run the code in a debugger (with break points on the
Layer2Write and Layer2Read) and when stepping over the call to Layer3Write, 
there is another thread that then hits the breakpoint in Layer2Read but I
haven't managed to work out exactly how/where the two threads are being
set up.

Unfortunately, I don't have a remote POP3 account that can be accessed
over SSL so I can't test this but I guess that that it would be OK and I
assume, from your experience, that this isn't a very big problem.
Michael, you may want to track bug 163605. I have hope that bug will fix your
problem.
Thanks for keeping me in touch.

I have applied your patch from bug 163605 (v5) to Mozilla 1.2a
(which displayed the same behaviour as previous versions) and
the problem has gone away.  Thankyou.
Depends on: 163605
This should be fixed starting with tomorrow's build, by the patch for bug
163605. Please reopen if you still can reproduce the problem.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified.
Status: RESOLVED → VERIFIED
Product: PSM → Core
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: