Closed Bug 70408 Opened 24 years ago Closed 24 years ago

Mac regularly fails to connect to server (PR_Poll fix)

Categories

(Core :: Networking: HTTP, defect)

PowerPC
Mac System 9.x
defect
Not set
normal

Tracking

()

VERIFIED FIXED
Future

People

(Reporter: jrgmorrison, Assigned: sfraser_bugs)

References

()

Details

(Keywords: perf)

Attachments

(3 files)

Overview Description: The Mac, for some time now, has occasionally thrown up an alert "The operation timed out when attempting to contact jrgm.mcom.com" while running the page loading test. However, the server is fully reachable (e.g., another Linux or Windows box on the same subnet running the same test at the same time will never throw up this alert). Since builds of Feb 22nd, this has gone from occasional to much more common (visiting 200 URLs will cause this error to occur 2 or 3 times). Note that Feb 22nd is the day after dougt's big channel changes. The good news is that the Mac is on average running this test more quickly (about 10% faster). Perhaps these two facts (the speedup and the increase in how often this error occurs) are related. Steps to Reproduce: 1) go to http://jrgm.mcom.com/page-loader/loader.pl 2) take the default values on that form and hit submit 3) wait until it happens Actual Results: says it can't connect when server is A-OK Expected Results: should connect Reproducibility: between 1 in 100, and one in 70 running this test Unfortunately, waiting until the alert is spun up may be too late to reconstruct why the connection failed (but it's a start). Note: the pages are loaded by doing JS Location.href = "aURL", which is similar but not identical to the code path for clicking on a link or hitting the back button. This is a minor nuisance when running the page loading test (must be watched), but I expect that real users are occasionally getting bogus failures, so it would be good to track down why this is happening.
s/often/regularly/. Anyone have a good adjective for "1 in 70"?
Keywords: perf
Summary: Mac often fails to connect to server, running page loading test → Mac regularly fails to connect to server, running page loading test
Target Milestone: --- → Future
This is enough of a show-stopper for me to warrant more attention, because it often prevents me from getting page load performance data.
adding keyword nsbeta1
Keywords: nsbeta1
Here's the chunk of log where things fail: 3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this= d4b4138], aSelectFlags=1, CurrentState=5. 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is in WaitReadWrite state [readtype=1 writetype=0 status=80470007]. 3[9afdbd8]: nsSocketTransport: doReadWrite [this=d4b4138, aSelectFlags=1, mReadRequest=d225320, mWriteRequest=0 3[9afdbd8]: nsSocketReadRequest: [this=d225320] inside OnRead. 3[9afdbd8]: nsSocketReadRequest: [this=d225320] calling listener [offset=39420, count=8192] 3[9afdbd8]: nsSocketIS: PR_Read(count=1540) returned -1 3[9afdbd8]: nsSocketIS: PR_Read() failed with PR_WOULD_BLOCK_ERROR 3[9afdbd8]: nsSocketReadRequest: [this=d225320] read 0 bytes [offset=39420] 3[9afdbd8]: nsSocketTransport: doReadWrite [readstatus=80470007 writestatus=0 readsuspend=0 writesuspend=0 mSelectFlags=5] 3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this= d4b4138], mStatus = 80470007, CurrentState=5, mSelectFlags=5 3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this= d73edf4], aSelectFlags=1, CurrentState=5. 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d73edf4] is in WaitReadWrite state [readtype=1 writetype=0 status=80470007]. 3[9afdbd8]: nsSocketTransport: doReadWrite [this=d73edf4, aSelectFlags=1, mReadRequest=d400d68, mWriteRequest=0 3[9afdbd8]: nsSocketTransport: [this=d73edf4] completing read request due to cancelation 3[9afdbd8]: nsSocketTransport: CompleteAsyncRead [this=d73edf4] 3[9afdbd8]: nsSocketTransport: doReadWrite [readstatus=0 writestatus=0 readsuspend=0 writesuspend=0 mSelectFlags=4] 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d73edf4] is in Done state. 3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this= d73edf4], mStatus = 0, CurrentState=4, mSelectFlags=4 3[9afdbd8]: nsSocketTransport: CheckForTimeout() [jrgm.mcom.com:80 d4b4138]. TIMED OUT... Idle interval: 120003 CurrentState=5 3[9afdbd8]: nsSocketTransport: Entering Process() [host=jrgm.mcom.com:80 this= d4b4138], aSelectFlags=0, CurrentState=7. 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is in Timeout state. 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is in Error state. 3[9afdbd8]: nsSocketTransport: Transport [host=jrgm.mcom.com:80 this=d4b4138] is in Done state. 3[9afdbd8]: nsSocketTransport: CompleteAsyncRead [this=d4b4138] 3[9afdbd8]: nsSocketTransport: Leaving Process() [host=jrgm.mcom.com:80 this= d4b4138], mStatus = 0, CurrentState=3, mSelectFlags=4 mCurrentState is eSocketState_WaitReadWrite. Note these lines: 3[9afdbd8]: nsSocketIS: PR_Read(count=1540) returned -1 3[9afdbd8]: nsSocketIS: PR_Read() failed with PR_WOULD_BLOCK_ERROR these happen with or without the PR_Poll fix. They don't appear in a log from Linux.
This seems like another NSPR networking issue on Mac. When we've read all the available data, the next OTRcv (called from PR_Read) can return either kOTNoDataErr (there is no more data, but we haven't got an T_ORDREL yet) or kOTOutStateErr (we got the T_ORDREL already). We currently return an error if OTRcv returns kOTNoDataErr. Maybe we should just return 0.
Returning 0 when we get kOTNoDataErr is bad. Don't do that. However, I do find that PR_Poll is lying about data availability. We are reporting that a socket is ready for reading even when we have not verified that data is available. This happens in GetState(), which is called from _MD_poll. It returns readReady based on fd->secret->md.readReady. fd->secret->md.readReady is set because the last read succeeded, and we haven't had a T_ORDREL yet. Maybe we should do an OTCountDataBytes() like we do in the blocking case?
Hacky "YMMV" patch: Index: mozilla/nsprpub/pr/src/md/mac/macsockotpt.c =================================================================== RCS file: /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c,v retrieving revision 3.15.2.7 diff -b -u -2 -r3.15.2.7 macsockotpt.c --- macsockotpt.c 2001/02/07 01:02:27 3.15.2.7 +++ macsockotpt.c 2001/03/08 05:43:14 @@ -1659,6 +1659,9 @@ { OTResult resultOT; + size_t availableData; - *readReady = fd->secret->md.readReady; + resultOT = OTCountDataBytes((EndpointRef)fd->secret->md.osfd, & availableData); + + *readReady = fd->secret->md.readReady && (availableData > 0); *exceptReady = fd->secret->md.exceptReady; This patch seems to do very good things to page load times, reducing them by about 400ms on averate (2700 -> 2300ms or so). Makes sense, because it avoids extra PR_Reads and handling.
Assignee: darin → gordon
And no more spurious timeouts to boot! :-]
Blocks: 71668
cc beard
Summary: Mac regularly fails to connect to server, running page loading test → Mac regularly fails to connect to server (PR_Poll fix)
Patrick, what broke? This looks like it is essentially identical to part of the checkin I made for Simon earlier this evening (around 11pm).
I tried copying the patch and applying it with MacCVS, and the patch failed. I assume it might be cause you already included it, but it didn't look like that to me.
i couldn't apply the original code as included in the bug (maybe a linefeed problem copying it out and patch choked). so this didn't get checked in?
It did get checked in after all. I just confirmed that.
Taking
Assignee: gordon → sfraser
Fixed.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
I haven't seen the Timed out error when running jrgm's test for some time now markinh verified
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: