Closed Bug 70408 Opened 19 years ago Closed 19 years ago

Mac regularly fails to connect to server (PR_Poll fix)

Categories

(Core :: Networking: HTTP, defect)

PowerPC
Mac System 9.x
defect
Not set

Tracking

()

VERIFIED FIXED
Future

People

(Reporter: jrgmorrison, Assigned: sfraser_bugs)

References

()

Details

(Keywords: perf)

Attachments

(3 files)

Overview Description: 


  The Mac, for some time now, has occasionally thrown up an alert 
  "The operation timed out when attempting to contact jrgm.mcom.com"
  while running the page loading test.

  However, the server is fully reachable (e.g., another Linux or Windows box
  on the same subnet running the same test at the same time will never throw
  up this alert).

  Since builds of Feb 22nd, this has gone from occasional to much more common
  (visiting 200 URLs will cause this error to occur 2 or 3 times). Note that
  Feb 22nd is the day after dougt's big channel changes.

  The good news is that the Mac is on average running this test more quickly
  (about 10% faster). Perhaps these two facts (the speedup and the increase in
  how often this error occurs) are related.


Steps to Reproduce: 
1) go to http://jrgm.mcom.com/page-loader/loader.pl
2) take the default values on that form and hit submit
3) wait until it happens 

Actual Results:   says it can't connect when server is A-OK
Expected Results: should connect

Reproducibility: between 1 in 100, and one in 70 running this test

Unfortunately, waiting until the alert is spun up may be too late to
reconstruct why the connection failed (but it's a start).

Note: the pages are loaded by doing JS Location.href = "aURL", which is
similar but not identical to the code path for clicking on a link or hitting
the back button.

This is a minor nuisance when running the page loading test (must be watched),
but I expect that real users are occasionally getting bogus failures, so 
it would be good to track down why this is happening.
s/often/regularly/. Anyone have a good adjective for "1 in 70"?
Keywords: perf
Summary: Mac often fails to connect to server, running page loading test → Mac regularly fails to connect to server, running page loading test
Target Milestone: --- → Future
This is enough of a show-stopper for me to warrant more attention, because it 
often prevents me from getting page load performance data.
adding keyword nsbeta1
Keywords: nsbeta1
Here's the chunk of log where things fail:

3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this=
d4b4138], aSelectFlags=1, CurrentState=5.
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is 
in WaitReadWrite state [readtype=1 writetype=0 status=80470007].
3[9afdbd8]: nsSocketTransport: doReadWrite [this=d4b4138, aSelectFlags=1, 
mReadRequest=d225320, mWriteRequest=0
3[9afdbd8]: nsSocketReadRequest: [this=d225320] inside OnRead.
3[9afdbd8]: nsSocketReadRequest: [this=d225320] calling listener [offset=39420, 
count=8192]
3[9afdbd8]: nsSocketIS: PR_Read(count=1540) returned -1
3[9afdbd8]: nsSocketIS: PR_Read() failed with PR_WOULD_BLOCK_ERROR
3[9afdbd8]: nsSocketReadRequest: [this=d225320] read 0 bytes [offset=39420]
3[9afdbd8]: nsSocketTransport: doReadWrite [readstatus=80470007 writestatus=0 
readsuspend=0 writesuspend=0 mSelectFlags=5]
3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this=
d4b4138], mStatus = 80470007, CurrentState=5, mSelectFlags=5

3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this=
d73edf4], aSelectFlags=1, CurrentState=5.
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d73edf4] is 
in WaitReadWrite state [readtype=1 writetype=0 status=80470007].
3[9afdbd8]: nsSocketTransport: doReadWrite [this=d73edf4, aSelectFlags=1, 
mReadRequest=d400d68, mWriteRequest=0
3[9afdbd8]: nsSocketTransport: [this=d73edf4] completing read request due to 
cancelation
3[9afdbd8]: nsSocketTransport: CompleteAsyncRead [this=d73edf4]
3[9afdbd8]: nsSocketTransport: doReadWrite [readstatus=0 writestatus=0 
readsuspend=0 writesuspend=0 mSelectFlags=4]
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d73edf4] is 
in Done state.
3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this=
d73edf4], mStatus = 0, CurrentState=4, mSelectFlags=4

3[9afdbd8]: nsSocketTransport: CheckForTimeout() [jrgm.mcom.com:80 d4b4138].	
TIMED OUT... Idle interval: 120003 CurrentState=5
3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this=
d4b4138], aSelectFlags=0, CurrentState=7.
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is 
in Timeout state.
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is 
in Error state.
3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is 
in Done state.
3[9afdbd8]: nsSocketTransport: CompleteAsyncRead [this=d4b4138]
3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this=
d4b4138], mStatus = 0, CurrentState=3, mSelectFlags=4

mCurrentState is eSocketState_WaitReadWrite.
Note these lines:

3[9afdbd8]: nsSocketIS: PR_Read(count=1540) returned -1
3[9afdbd8]: nsSocketIS: PR_Read() failed with PR_WOULD_BLOCK_ERROR

these happen with or without the PR_Poll fix. They don't appear in a log from 
Linux.
This seems like another NSPR networking issue on Mac. When we've read all the 
available data, the next OTRcv (called from PR_Read) can return either 
kOTNoDataErr (there is no more data, but we haven't got an T_ORDREL yet) or 
kOTOutStateErr (we got the T_ORDREL already). We currently return an error if 
OTRcv returns kOTNoDataErr. Maybe we should just return 0.
Returning 0 when we get kOTNoDataErr is bad. Don't do that.

However, I do find that PR_Poll is lying about data availability. We are 
reporting that a socket is ready for reading even when we have not verified that 
data is available. This happens in GetState(), which is called from _MD_poll. It 
returns readReady based on fd->secret->md.readReady. fd->secret->md.readReady is 
set because the last read succeeded, and we haven't had a T_ORDREL yet. Maybe we 
should do an OTCountDataBytes() like we do in the blocking case?
Hacky "YMMV" patch:

Index: mozilla/nsprpub/pr/src/md/mac/macsockotpt.c
===================================================================
RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c,v
retrieving revision 3.15.2.7
diff -b -u -2 -r3.15.2.7 macsockotpt.c
--- macsockotpt.c	2001/02/07 01:02:27	3.15.2.7
+++ macsockotpt.c	2001/03/08 05:43:14
@@ -1659,6 +1659,9 @@
 {
     OTResult resultOT;
+    size_t   availableData;
     
-	*readReady = fd->secret->md.readReady;
+    resultOT = OTCountDataBytes((EndpointRef)fd->secret->md.osfd, &
availableData);
+
+	*readReady = fd->secret->md.readReady && (availableData > 0);
 	*exceptReady = fd->secret->md.exceptReady;

This patch seems to do very good things to page load times, reducing them by 
about 400ms on averate (2700 -> 2300ms or so). Makes sense, because it avoids 
extra PR_Reads and handling.
Assignee: darin → gordon
And no more spurious timeouts to boot! :-]
Blocks: 71668
cc beard
Summary: Mac regularly fails to connect to server, running page loading test → Mac regularly fails to connect to server (PR_Poll fix)
Patrick, what broke?  This looks like it is essentially identical to part of the 
checkin I made for Simon earlier this evening (around 11pm).
I tried copying the patch and applying it with MacCVS, and the patch failed. I 
assume it might be cause you already included it, but it didn't look like that to 
me.
i couldn't apply the original code as included in the bug (maybe a linefeed 
problem copying it out and patch choked).

so this didn't get checked in?
It did get checked in after all. I just confirmed that.
Taking
Assignee: gordon → sfraser
Fixed.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
I haven't seen the Timed out error when running jrgm's test for some time now

markinh verified
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.