Open Bug 307576 Opened 19 years ago Updated 2 years ago

socket test case hangs on any failure

Categories

(NSPR :: NSPR, defect)

4.6.1
Sun
Solaris
defect

Tracking

(Not tracked)

People

(Reporter: jason.m.reid, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

 
On Solaris machines without IPv6, socket hangs until terminated if
there is no IPv6 interface.

mace[svbld]:/share/builds/mccrel3/security/securitytip/builds/20050908.1/wozzeck_Solaris8/mozilla/nsprpub/SunOS5.8_DBG.OBJ/pr/tests>
./socket
TCP Client/Server Test - IPv4/Ipv4
'PR_NewSem' is obsolete. Use 'locks & condition variables' instead.
'PR_WaitSem' is obsolete. Use 'locks & condition variables' instead.
'PR_PostSem' is obsolete. Use 'locks & condition variables' instead.
TCP_Socket_Client_Server_Test: 1 Server  5 Clients  5 connections_per_client
                             : 10 messages_per_connection 1024
bytes_per_messageTCP_Socket_Client_Server_Test Passed
TCP Client/Server Test - IPv6/Ipv4
TCP_Socket_Client_Server_Test: 1 Server  5 Clients  5 connections_per_client
                             : 10 messages_per_connection 1024
bytes_per_messageTCP_Socket_Client_Server_Test Passed
TCP Client/Server Test - IPv4/Ipv6
TCP_Socket_Client_Server_Test: 1 Server  5 Clients  5 connections_per_client
                             : 10 messages_per_connection 1024
bytes_per_messageTCP_Socket_Client_Server_Test Passed
TCP Client/Server Test - IPv6/Ipv6
PR_Connect failed: (-5980, 128)
PR_Connect failed: (-5980, 128)
PR_Connect failed: (-5980, 128)
PR_Connect failed: (-5980, 128)
PR_Connect failed: (-5980, 128)
^C

QA Contact: wtchang → nspr
Assignee: wtchang → glen.beasley
/nsprpub/pr/include/prerr.h, line 108 -- 
#define PR_NETWORK_UNREACHABLE_ERROR (-5980L)

http://lxr.mozilla.org/nspr/source/nsprpub/pr/tests/socket.c#691

690         if (PR_Connect(sockfd, &netaddr,PR_INTERVAL_NO_TIMEOUT) < 0){
691             fprintf(stderr, "PR_Connect failed: (%ld, %ld)\n",
692                     PR_GetError(), PR_GetOSError());
693             failed_already=1;
694             return;   <<-- BUG, causes hang
695         }

The problem is that this counted thread exits (returns from outermost 
function in thread) without decrementing the thread count.  So the thread 
count never goes to zero.

Functions TCP_server and TransmitFile_Server do this correctly. 
All error paths "goto exit", and exit cleans up properly.  
That's exactly what should be happening in function UDP_server, 
TCP_Client, UDP_Client, and TransmitFile_Server.
Just to clarify, ALL the returns in UDP_server, TCP_Client, UDP_Client, and TransmitFile_Server need to be changed to goto exit, and the label exit needs
to be made to do the right thing in all cases (e.g don't free null pointers).
Thank Nelson, I will implement the fix and ask Wan-teh to reveiw.  I just need to 
finish the current escalation I'm working on.  
Status: NEW → ASSIGNED
Summary: socket test case hangs due to not detecting no IPv6 on Solaris → socket test case hangs on any failure
Depends on: 414067
The main goal of this patch is for the test "socket" to not hang during the nightly QA; remove compiler warnings; and provide clearer errors messages. 

added goto/exit statements in the functions to provide cleanup on failures. 

Error messages now start with their "function name" rather than the program description "prsocket_test". 

I added appropriate cast to the fprintf 
statements to remove compiler warnings. 

In function "TCP_Server" I added a timeout value on PR_Accept of 15 seconds for 
tests that fail such as our Solaris 8 QA machine IPv6 failures:

TCP_Socket_Client_Server_Test Passed
TCP Client/Server Test - IPv6/Ipv6
TCP_Client: PR_Connect failed: (-5980, 128)
TCP_Client: PR_Connect failed: (-5980, 128)
TCP_Client: PR_Connect failed: (-5980, 128)
TCP_Client: PR_Connect failed: (-5980, 128)
TCP_Client: PR_Connect failed: (-5980, 128)
TCP_Server: ERROR - PR_Accept() timed out!
TCP_Server: ERROR - PR_Accept failed
TCP_Socket_Client_Server_Test: 1 Server  5 Clients  5 connections_per_client
                             : 10 messages_per_connection 1024 bytes_per_message TCP_Socket_Client_Server_Test failed



I added the function rm_dash_r to remove a file or a directory recursively.
rm_dash_r is used by the function Socket_Misc_Test in case the previous run
of the program socket failed in Socket_Misc_Test and did not remove it's test 
directory. see bug 414067
Attachment #299784 - Flags: review?(wtc)
Attachment #299784 - Flags: review?(wtc) → review?(nelson)
Comment on attachment 299784 [details] [diff] [review]
cleanup of socket.c

This is a rather classic example of a mega-patch that contains a 
combination of several other smaller patches that are unrelated to 
each other.  If they had been done as separate patches, some of 
them would have gotten r+, but since they're all together in one 
patch, and one part of that gets r-, then the whole things gets r-
and none of the pieces can be committed soon.  

The big problem with this patch is all the casting to long (or 
unsigned long) done for purposes of printf.  There are other 
smaller problems.  

1.  (nit)  In readn(PRFileDesc *sockfd...  we see

>+#ifdef WINNT    
> 	int err;
>+#endif

Move the declaration of this variable down into the only basic
block that uses it, which is in its own ifdef.  
Then the declaration won't need its own separate ifdef.

2. Casting pointers to unsigned longs.

>         DPRINTF(("thread = 0x%lx: calling PR_Recv, bytes = %d\n",
>-            PR_GetCurrentThread(), rem));
>+            (unsigned long int) PR_GetCurrentThread(), rem));

Here we see the first of many casts that cast a pointer to an 
unsigned long.  That will truncate the pointer on some platforms
(Win64 for one).  Please don't cast ANY pointers to unsigned longs
for printing.  There is an ANSI c printf format specifically for 
printing pointers of type void *.  It is "%p".  Please use it.  
In that example above, the patch would be:

>-        DPRINTF(("thread = 0x%lx: calling PR_Recv, bytes = %d\n",
>-            PR_GetCurrentThread(), rem));
>-        DPRINTF(("thread = %p: calling PR_Recv, bytes = %d\n",
>+            (void *)PR_GetCurrentThread(), rem));

Please change ALL the places where you cast pointers to cast to 
void *, and to use %p instead of %lx.


3.  casting char * to int * or long * and then dereferencing it.

>-        DPRINTF(("Serve_Client [0x%lx]: inbuf[0] = 0x%lx\n",PR_GetCurrentThread(),
>-            (*((int *) in_buf->data))));

>+        DPRINTF(("Serve_Client [0x%lx]: inbuf[0] = 0x%lx\n",
>+            (unsigned long int) PR_GetCurrentThread(),
>+            (*((unsigned long int *) in_buf->data))));

in_bug->data is a char *.  Apparently here the desire is to print out the first 4
or 8 bytes of the data buffer in hex.  But the code you see there will print out
the bytes in reverse order on little-endian CPUs, and may crash on systems where
the buffer pointer is not already aligned on an int or long boundary.  

Better to print out the bytes separately in hex, e.g. 

>+        DPRINTF(("Serve_Client [%p]: inbuf[0] = 0x%02x%02x%02x502x\n",
>+            (void *) PR_GetCurrentThread(),
>+            in_buf->data[0],in_buf->data[1],in_buf->data[2],in_buf->data[3],))));

Please change all the places that cast a char * to an int* or long * and then 
dereference it to use the above technique.

4.  fprintfs for errors that don't print the error codes.
There are lots of places in the patch similar to this one:

> 	if ((sockfd = PR_OpenTCPSocket(server_domain)) == NULL) {
>-        fprintf(stderr,"prsocket_test: PR_NewTCPSocket failed\n");
>+        fprintf(stderr,"TCP_Server: PR_OpenTCPSocket failed\n");

If this program fails, the developer who looks at the failure to diagnose it
is going to want to know: what was the error code returned?  The code shown
does not print the error code, and the patch doesn't improve that.

So, I suggest changing all of those "fprintf(stderr," statements that occur
after a function reports failure, and that do NOT report any error code, 
into calls to a function that will print the strings AND print the error codes,.
e.g. something sorta like this:

static void prErr(const char * msg)
{
    int prerr = (int)PR_GetError();
    int oserr = (int)PR_GetOSError();
    fprintf(stderr,"%s (%d,%d)\n", msg, prerr, oserr);
}

and then 

>-        fprintf(stderr,"prsocket_test: PR_NewTCPSocket failed\n");
>+        prErr("TCP_Server: PR_OpenTCPSocket failed");
   Note that I removed the trailing newline --------^

5. Casting variables declared with NSPR integer types to long

>-    DPRINTF(("TCP_Server: PR_BIND netaddr.inet.ip = 0x%lx, netaddr.inet.port = %d\n",
>-        netaddr.inet.ip, netaddr.inet.port));
>+    DPRINTF(("TCP_Server: PR_BIND netaddr.inet.ip = 0x%lx, "
>+        "netaddr.inet.port = %d\n",
>+        (unsigned long int) netaddr.inet.ip, netaddr.inet.port));

netaddr.inet.ip is a PRUint32, which is always the size of an int, but not always the 
size of a long.  You *could* cast it to an unsigned int, and then print it with %x instead
of %lx.  But I suggest that you use NSPR's printf function rather than the native printf
function here.  NSPR's printf function solves all these problems for NSPR types.
With NSPR printf functions:
   %d   always matches PRIntn type
   %ld  always matches PRInt32 type
   %lld always matches PRInt64 type
This avoids a LOT of ugly casting, and avoids possible size mismatches that casting may cause.
It would also simplify cases like this:

>     printf("%2ld Server %2ld Clients %2ld connections_per_client\n",1l,
>-        num_tcp_clients, num_tcp_connections_per_client);
>+        (long int) num_tcp_clients, (long int) num_tcp_connections_per_client);
>     printf("%30s %2ld messages_per_connection %4ld bytes_per_message\n",":",
>-        num_tcp_mesgs_per_connection, tcp_mesg_size);
>+        (long int) num_tcp_mesgs_per_connection, (long int) tcp_mesg_size);


6.  reducing code duplicated in both if and else paths.
There are several places where the new code adds duplicate gotos and/or printfs
in both the "if" path and the "else" path.  For example:

>+                if (PR_NOT_CONNECTED_ERROR != PR_GetError()) {
>+                    /* connection no longer connected */
>+                    goto exit;
>+                } else {
>+                    fprintf(stderr,"TCP_Client writen ERROR: (%d, %d)\n",
>+                        (int) PR_GetError(), (int) PR_GetOSError());
>+                        failed_already=1;
>+                        goto exit;
>+                }

That can be reduced to 

>+                if (PR_NOT_CONNECTED_ERROR == PR_GetError()) {
>+                    fprintf(stderr,"TCP_Client writen ERROR: (%d, %d)\n",
>+                        (int) PR_GetError(), (int) PR_GetOSError());
>+                        failed_already=1;
>+                }
>+                goto exit;


Likewise 

>+        if (rm_dash_r(TEST_DIR) == 0) {
>+            if ((PR_MkDir(TEST_DIR, 0777)) < 0) {
>+                fprintf(stderr, "Socket_Misc_Test: "
>+                                "failed to create dir %s\n",TEST_DIR);
>+                failed_already=1;
>+                return -1;
>+            }
>+        } else {
>+            fprintf(stderr, "Socket_Misc_Test: "
>+                            "failed to create dir %s\n",TEST_DIR);
>+            failed_already=1;
>+            return -1;
>+        }

can be reduced to

>+        if (rm_dash_r(TEST_DIR) != 0) ||
>+            PR_MkDir(TEST_DIR, 0777) < 0) {
>+            fprintf(stderr, "Socket_Misc_Test: "
>+                            "failed to create dir %s\n",TEST_DIR);
>+            failed_already=1;
>+            return -1;
>+        }


7.  potential buffer overflow 
In new function rm_dash_r we see

>+    char filename[240];

>+        while((entry = PR_ReadDir(dir, PR_SKIP_BOTH)) != NULL) {
>+            sprintf(filename, "%s/%s", path, entry->name);

The combined length of the strings "path" and "entry->name" could 
exceed 240 characters.  There are several ways to avoid this.
I recommend using PR_smprintf, which returns a malloc'ed buffer.
Or, you could use PR_snprintf with a much larger buffer.
Attachment #299784 - Flags: review?(nelson) → review-
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: glenbeasley → nobody
Status: ASSIGNED → NEW
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: