TCP Nagle delays slow SSL performance

RESOLVED FIXED in 3.6

Status

NSS
Libraries
P2
normal
RESOLVED FIXED
18 years ago
16 years ago

People

(Reporter: Nelson Bolyard (seldom reads bugmail), Assigned: Kirk Erickson)

Tracking

({perf})

Attachments

(3 attachments)

In the TCP protocol, when an application writes a small set of
data to a TCP connection, and then immediately thereafter writes
a second set of data to the same socket, the second set of data 
is typically delayed (not sent out) until the acknowledgement 
for the first set of data is received from the peer.  This 
added transmission delay is called a "Nagle" delay.

So, with TCP, it is desirable to coalesce data to be written, 
and to write it with a single write (or send or writev) call
rather than with a series of small writes.  
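
As a minimal sketch of that advice (not NSS code; the helper name
send_coalesced is made up for illustration), two buffers can be handed
to the kernel in one writev call on a POSIX system, so TCP sees a single
write rather than a small write followed by a second one:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send a small header and a body with one writev()
 * call instead of two separate write() calls.  Returns the number of
 * bytes written (possibly short), or -1 on error. */
ssize_t send_coalesced(int fd, const void *hdr, size_t hdr_len,
                       const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(fd >= 0);
    iov[0].iov_base = (void *)hdr;   /* writev() does not modify the data */
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;
    return writev(fd, iov, 2);
}
```

A caller still has to check for a short count and retry the remainder,
exactly as with a plain write.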

NSS's SSL library already coalesces the writing of SSL records 
during a handshake, but there are still (at least) two circumstances
where two writes are done in succession that could be coalesced.

1. In an application where the client speaks first (e.g. https)
and an SSL "restart" handshake is used, the client is the last
of the two peers to write an SSL handshake record, and then 
immediately thereafter the client sends its application 
data.  This results in two writes on the socket in rapid 
succession, the first of which is small.

2. In an application where the server speaks first (e.g. IMAPS)
and a full SSL handshake is used, the server is the last of the
two peers to write an SSL handshake record, and then immediately
thereafter the server writes its initial application data. 
This results in two writes on the socket in rapid succession.

When composing the last record in the SSL handshake, it is 
possible to detect that the application is about to write
data (e.g. if the handshake is taking place as part of a 
call to PR_Write or PR_Send or PR_Writev) and to force the 
final handshake record into a buffer, thereby coalescing it
with the application data to follow.  
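
The idea can be sketched roughly as follows.  This is an illustration
under assumptions, not the actual NSS change: struct conn, conn_defer
and conn_write are hypothetical names, and plain POSIX writev stands in
for the NSPR I/O layer.  The last handshake record is parked in a
buffer, and the next application write sends both in one call:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define PENDING_MAX 256

/* Hypothetical connection state: holds a handshake record that has been
 * composed but deliberately not yet written. */
struct conn {
    int fd;
    unsigned char pending[PENDING_MAX];
    size_t pending_len;
};

/* Buffer the final handshake record instead of writing it immediately.
 * Returns 0 on success, -1 if the buffer has no room. */
int conn_defer(struct conn *c, const void *rec, size_t len)
{
    assert(c != NULL);
    if (len > PENDING_MAX - c->pending_len)
        return -1;
    memcpy(c->pending + c->pending_len, rec, len);
    c->pending_len += len;
    return 0;
}

/* Write any deferred record together with the application data in a
 * single writev() call.  The return value counts deferred bytes too. */
ssize_t conn_write(struct conn *c, const void *data, size_t len)
{
    struct iovec iov[2];
    ssize_t n;

    assert(c != NULL);
    iov[0].iov_base = c->pending;
    iov[0].iov_len = c->pending_len;
    iov[1].iov_base = (void *)data;
    iov[1].iov_len = len;
    n = writev(c->fd, iov, 2);
    if (n > 0) {
        /* Drop whatever part of the deferred record actually went out. */
        size_t sent = (size_t)n < c->pending_len ? (size_t)n : c->pending_len;
        memmove(c->pending, c->pending + sent, c->pending_len - sent);
        c->pending_len -= sent;
    }
    return n;
}
```

If the handshake is not running inside an application write call, the
record would simply be written directly instead of being deferred.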

This bug report requests that this coalescing be performed
in NSS 3.2.
(Reporter)

Comment 1

18 years ago
Marking P1, since this is a performance issue and performance is
one of the key goals of NSS 3.2.  
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.2
(Reporter)

Comment 2

18 years ago
Created attachment 24603 [details] [diff] [review]
Proposed patch
(Reporter)

Comment 3

18 years ago
Fixed by the following file revisions:
sslimpl.h  1.9
ssl3con.c  1.16
sslsecur.c 1.5
sslsock.c  1.10
Reviewed and approved by Wan-Teh
(Reporter)

Comment 4

18 years ago
marking fixed.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED
(Reporter)

Updated

18 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 5

18 years ago
Reopening this bug.  Although the code passed all my tests at
the time I wrote it, it now seems to be completely failing to
coalesce client writes for restart handshakes.  That is what
we see in the packet traces Kirk recently recorded using our
AIX test system.  

I'd change the target fix version to 3.2.1, but there's no
3.2.1 in the list.
(Reporter)

Comment 6

18 years ago
False alarm.  The changes in NSS 3.2 to coalesce handshake
writes with application writes _are_ working.  

However, I think we should also disable Nagle delays prior to 
sending the Close Notify alert and closing the socket.  No 
point in delaying the final write, since there won't be 
another.  So, I'll leave this bug open, and change its target
to NSS 3.3, and lower it to P2.
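
For reference, disabling Nagle delays on a socket comes down to setting
the TCP_NODELAY option.  The sketch below uses the plain POSIX socket
API rather than NSPR (which offers the same switch through its
PR_SockOpt_NoDelay socket option); set_nodelay and get_nodelay are
hypothetical helper names:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn Nagle's algorithm off (on_off != 0) or back on (on_off == 0)
 * for a TCP socket.  Returns 0 on success, -1 on error. */
int set_nodelay(int fd, int on_off)
{
    int flag = on_off ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
}

/* Read the option back: 1 if Nagle is disabled, 0 if it is enabled,
 * -1 on error. */
int get_nodelay(int fd)
{
    int flag = 0;
    socklen_t len = sizeof flag;

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0)
        return -1;
    return flag != 0;
}
```

Calling set_nodelay(fd, 1) just before the final alert is written would
correspond to the close-time behavior proposed here.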
Status: REOPENED → ASSIGNED
Priority: P1 → P2
Target Milestone: 3.2 → 3.3

Updated

17 years ago
Keywords: perf

Comment 7

17 years ago
*** Bug 67718 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 8

17 years ago
I believe 67718 is NOT a duplicate of this bug.
(Reporter)

Comment 9

17 years ago
Nagle delays are now disabled for all SSL sockets.

ssldef.c   1.4
sslimpl.h  1.13
sslsecur.c 1.11
sslsock.c  1.17
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago → 17 years ago
Resolution: --- → FIXED

Comment 10

17 years ago
Disabling Nagle for all SSL sends may be a reasonable workaround, but it is 
inappropriate as a fix.

First, the bug that is being worked around by disabling Nagle in ssl_DefSend() 
needs to be clearly stated.  Then it needs to be fixed and revision 1.4 to 
ssldef.c needs to be backed out.


Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 11

17 years ago
NSS has a "stress test" client and a test server that we use for NSS
performance measurement.  Each of those test programs has an option
(-D) to completely disable Nagle delays.  This allows us to run the 
same code both with and without those delays and measure the difference.
That's how this bug came to be filed in the first place.

As explained previously in the cited bug report, NSS was changed so 
that it does not do two successive writes to the underlying socket 
from a single call to PR_Write or PR_Send.  I verified that two 
successive write/send calls are never done to the underlying socket 
while the test client program is running, without doing one (or more) 
intervening reads of data (data that would have been sent by the 
server in response to the data just previously written to the client's 
socket).  That is, at least one round trip time occurs between one 
write and the next on the client's socket.

With this behavior, one would predict that there would be no Nagle 
delays introduced into the stream of data written by the client.  
I expected that the -D option on the client would thereafter have no 
effect on test performance, that the test would behave the same with 
or without -D (on the client) because the pattern of writes being 
done on the client would not trigger any Nagle delays.
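
The property being checked can be stated as a small predicate over a
recorded sequence of socket calls.  The trace encoding ('W' for a write
or send, 'R' for a read) and the function name are assumptions made for
this illustration; the actual verification was done by observing the
running test client:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trace encoding: 'W' = write/send on the socket,
 * 'R' = read/recv.  Returns the index of the first write that directly
 * follows another write, or -1 if every pair of writes is separated by
 * at least one read. */
int first_back_to_back_write(const char *trace)
{
    size_t i;

    assert(trace != NULL);
    for (i = 1; trace[0] != '\0' && trace[i] != '\0'; i++) {
        if (trace[i - 1] == 'W' && trace[i] == 'W')
            return (int)i;
    }
    return -1;
}
```

A result of -1 for the client's trace is the pattern that should not
trigger any Nagle delay on the client side.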

However, after making these changes and verifying that the client's 
write behavior was as described above, the -D option on the client 
continued to have some significant effect on the test results on 
several platforms.  

Since the writes are already perfectly and completely coalesced, no 
further improvement is to be expected from further changing the write 
behavior.  

That's the motivation for this change.

I am willing, however, to make the call to ssl_EnableNagleDelay
in ssl_DefSend be conditionally compiled so that it is done only
on those platforms that benefit significantly.
(Reporter)

Comment 12

17 years ago
The call to ssl_EnableNagleDelay in ssl_DefSend is now 
conditionally compiled.  
The call in ssl_SecureClose is not.  
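
A rough sketch of what such conditional compilation looks like,
assuming a macro named NSS_DISABLE_NAGLE_DELAYS as the build-time
switch.  The function maybe_disable_nagle is hypothetical and uses
POSIX setsockopt in place of the internal ssl_EnableNagleDelay call:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Returns 1 if Nagle was disabled, 0 if the call was compiled out,
 * -1 if setsockopt() failed. */
int maybe_disable_nagle(int fd)
{
    assert(fd >= 0);
#ifdef NSS_DISABLE_NAGLE_DELAYS
    {
        int one = 1;

        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one) != 0)
            return -1;
        return 1;
    }
#else
    (void)fd;   /* option left at the platform default */
    return 0;
#endif
}
```

Building with -DNSS_DISABLE_NAGLE_DELAYS on the platforms that benefit
would enable the call there and leave it out everywhere else.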
Status: REOPENED → RESOLVED
Last Resolved: 17 years ago → 17 years ago
Resolution: --- → FIXED

Comment 13

17 years ago
It would be a good idea to find out why Nagle was affecting those platforms.  
This is indicative of a problem somewhere.

My guess would be that if a write on the underlying socket returns EWOULDBLOCK, 
then NSS will on the next write call do a short write of the buffered, unsent 
partial record before processing and sending the next record.  The next record 
would then be held by Nagle.

Diagnosing this issue (and fixing it) should be tracked as a bug.  Nelson, do 
you want this to be a separate bug?
(Reporter)

Comment 14

17 years ago
Our tests do not experience EWOULDBLOCK.

Comment 15

17 years ago
Kirk's performance measurements of NSS 3.2.1 RTM and NSS 3.3
Beta showed that the setting of the TCP_NODELAY socket option
for the final alert has no effect on Linux and Windows NT
but has mixed effects on Solaris (this change was made in
mozilla/security/nss/lib/ssl/sslsecur.c, revision 1.11).
We may remove this change from NSS 3.3.

I am re-opening this bug and moving the target to NSS 3.4.
We need to investigate the effect of TCP_NODELAY socket
option more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 3.3 → 3.4

Comment 16

17 years ago
Reassigned the bug to Kirk.
Assignee: nelsonb → kirke
Status: REOPENED → NEW

Comment 17

17 years ago
I removed the code to disable Nagle algorithm for the final alert
from sslsecur.c on the NSS_3_3_BRANCH.

Comment 18

16 years ago
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee

Comment 19

16 years ago
Set target milestone to NSS 3.5.
Target Milestone: 3.4 → 3.5
(Assignee)

Comment 20

16 years ago
Set target milestone to NSS 3.6.
Target Milestone: 3.5 → 3.6
(Assignee)

Comment 21

16 years ago
Created attachment 87770 [details]
selfserv stress under Solaris (with and without delay)

I inserted printfs in the spots we disable delay in ssl_DefSend() and
ssl_SecureClose().  On the server, we're hitting the code in ssl_SecureClose.
The clients hit the code in ssl_SecureClose() only after the server is
killed.  This is doing restarts and starting all the clients before the
server, with and without -D.

The code in ssl_DefSend() is ifdef'd away on all platforms.  It's the only
instance of NSS_DISABLE_NAGLE_DELAYS in the tree.

I benchmarked selfserv running under Solaris (soupnazi) at the tip with and
without the close code, then without the close code but always disabling
Nagle delays.

SunOS soupnazi 5.8 Generic_108528-14 sun4u sparc SUNW,Ultra-4
-------------------------------------------------------------
SunOS alphachip 5.8 Generic_108528-13 sun4u sparc SUNW,Ultra-5_10
SunOS cal3 5.8 Generic_108528-08 sun4u sparc SUNW,Ultra-5_10
SunOS foofighter 5.8 Generic_108528-11 sun4u sparc SUNW,Ultra-5_10
Linux gods 2.4.2-2smp #1 SMP Sun Apr 8 20:21:34 EDT 2001 i686 unknown
SunOS gonefishing 5.8 Generic_108528-11 sun4u sparc SUNW,Ultra-5_10
SunOS tank.red.iplanet.com 5.8 Generic_108528-11 sun4u sparc SUNW,Ultra-4

I removed the printfs, and rebuilt all clients and the server as it was
at the tip first, and did 4 baseline runs, with and without -D on the client.

full		strsclnt -N
full-D		strsclnt -N -D 
restart 	strsclnt
restart-D	strsclnt -D

The -noclose entries were generated after removing the code to disable
Nagle delay in ssl_SecureClose().  The -always entries were generated
after removing the NSS_DISABLE_NAGLE_DELAYS ifdef in ssl_DefSend().
All runs are with the zone allocator engaged.

Results with full handshakes are basically a wash.  Restart runs were
about 1.5% faster in the -always runs.  So at least for Solaris,
removing the close code, and engaging the always code yields the best
selfserv stress performance.

Solaris was the platform we were concerned about when this bug
was reopened.  Apparently, setting NODELAY isn't hurting Solaris
as it did previously, and in fact restarts on this platform would
benefit from disabling delays always rather than just in closing.

Comment 22

16 years ago
We should investigate why Nagle is getting triggered on restart handshakes.
A TCP trace showing one side or the other sending a short packet followed by a
second set of data would be most helpful.

Comment 23

16 years ago
I think I found it by code inspection.  ssl_EmulateTransmitFile() will send the
headers as a short packet, triggering Nagle.

Comment 24

16 years ago
Filed bug 152205 on the ssl_EmulateTransmitFile() issue.  I found that issue was
in the non-unix code, so it is unlikely to be what we are seeing here.  The TCP
trace would still be most helpful.
(Assignee)

Comment 25

16 years ago
Created attachment 89421 [details]
always-close-never runs on soupnazi

I spoke with Nelson, and resolved to repeat the benchmarking runs
on soupnazi, filling in a 3x3 table of results which are attached.
Names indicate when Nagle delay was disabled, server first.

	always	Nagle delays disabled on the command line (-D).
	close	Nagle delays disabled for close only.
	never	Nagle delays never disabled (close code removed).

For example, the "restart-always-close" run is with selfserv disabling
on the command line, and all the clients disabling on close only.

The server and all clients were using the tip.	Original benchmarking
results on Solaris showed a gain when delay was disabled on either the set
of clients or the server, both sides disabled delay on the command line.
This was nearly a year ago now, and only on restarts.

As you can see the "always-always" runs are not slower.  I've been
unable to reproduce the anomaly that sparked this investigation.

I repeated the exercise on iws-perf, another 4 cpu Solaris machine on
another subnet.  Results made the soupnazi results look even
more like a wash because "fastest" and "slowest" runs shifted,
and again "always-always" runs did not stand out.
(Assignee)

Comment 26

16 years ago
Nelson verified with breakpoints on read and write, that
writes are properly coalesced.  I've been unable to reproduce
the anomaly on Solaris of poor performance when Nagle delays
were disabled on the command lines of both the server and 
all clients.

It's apparent that the default behavior, disabling delay
at close time, improves performance slightly and is appropriate.
Disabling always is not nice for the network and doesn't impact
performance.

Closing this bug.
Status: NEW → RESOLVED
Last Resolved: 17 years ago → 16 years ago
Resolution: --- → FIXED