Closed Bug 23709 Opened 25 years ago Closed 25 years ago

[talkback]Crash in nsSocketTransport::OnFound on home.netscape.com cnn.com

Categories

(Core :: Networking, defect, P1)

x86
Windows 95
defect


VERIFIED FIXED

People

(Reporter: alan-lists, Assigned: gordon)

References

Details

(Keywords: crash, top100, Whiteboard: [PDT+] 2/16/2000)

Attachments

(1 file)

Not sure when this started, but it was in the last 2 weeks. Now whenever I start Mozilla it crashes in NECKO.DLL. After testing and testing I found I do not crash if I just run mozilla -mail; I can play and have no problems in mail. Then I set Mozilla to load a blank page and tried again. Mozilla loaded just fine. Then I started loading HTML pages that I had on my local D drive. I was able to load all sorts of local pages, jumping around with no problem. Again, as soon as I tell Mozilla to go to ANY site off my machine I get the crash. I could then rerun Mozilla and look at local stuff with no problem, crashing only on outside stuff. I am sure the console window won't help, but I am adding it just in case.
    nNCL: registering deferred (0)
    WEBSHELL+ = 1
    WEBSHELL+ = 2
    nsXULKeyListenerImpl::Init()
    WEBSHELL+ = 3
    WEBSHELL+ = 4
    Setting content window
    browser.startup.page = 0
    startpage = about:blank
    Document about:blank loaded successfully
    Document: Done (2.31 secs)
    got a request
    WEBSHELL+ = 5
    FindShortcut: in='www.mozilla.org' out='null'
    =============================================
    MOZILLA caused an invalid page fault in module NECKO.DLL at 014f:60507e5b.
    Registers:
    EAX=8b2307d0 CS=014f EIP=60507e5b EFLGS=00010202
    EBX=00000000 SS=0157 ESP=020afd8c EBP=020afdcc
    ECX=0139caf0 DS=0157 ESI=01357df8 FS=0f87
    EDX=8165a9cc ES=0157 EDI=0139caf0 GS=0000
    Bytes at CS:EIP: ff 30 56 e8 c9 31 00 00 83 c4 0c eb 05 bb 05 40
    Stack dump:
    00000004 00000001 006e2420 00000000 6050a57a 01357db8 00000000 014e8340
    006e2428 00000001 006e2420 00000000 6050a98f 020afdd4 00008e42 020afe1e
By the way, what build were you running? M12, a nightly, roll your own? Just to remove this as a possible cause: what happens if you completely blow away the old .\Seamonkey (for installer) or .\bin (for nightly) directory and then reinstall to a clean directory? (This may be what you do anyway, but I did see another bug report which, although not for necko.dll, did involve a component loading problem that was caused by old cruft in the .\components directory.)
Ooops, I should have mentioned the builds. I have tried the Win32 builds of 01/09/00, 01/10/00, and 01/11/00 (this one was the win32 installer build). There may have been some problems before those builds, but I don't remember right now. When I try a new build I always delete the mozilla .dat file in the windows directory, and also the users50 directory, and the entire \bin\ directory with the actual program.
Not sure if this is related or not, but on the 1/13/99 nightly build I realized I am also getting this just before I crash: JavaScript Error: uncaught exception: [Exception... "Illegal value" code: "-2147024809" nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)" location: "chrome://sidebar/content/sidebarOverlay.js Line: 201"] I am going crazy not being able to test Mozilla. Anything anyone would suggest to check? Version numbers, etc.?
Severity: normal → critical
Target Milestone: M14
It sounds like this bug and bug 24008 are dupes or at least very close. I have a Windows 95 P120 with 64MB RAM.
Jud: you seen/seeing this?
So far this bug is a WORKSFORME. tever, can you try to reproduce this reliably? Thx.
On bug 24008, which I think is a dup, lchiang@netscape.com commented "If you need to see this, contact suresh@netscape.com". Is there anything else I can do at my end to help find it? I don't have VC6.0, but if pointing me to a debug build would help I can try it also.
I believe this is the same as or related to the problem I have been having with the nightly builds for the last or two (not sure when it first started happening). My setup:
    Build: 2000011708 (and earlier builds but not M12)
    OS: Windows 95 - 4.00.950 B
    Platform: Pentium w/ 64 MB RAM
I deleted the /bin directory from the previous build as well as the MOZREGISTRY.DAT file, and the Moz profile before installing. I invoke MOZILLA.EXE without any command options. It does not crash on www.mozilla.org but on other URLs like www.slashdot.org. It crashes at different stages in rendering of the page, sometimes not even crashing at all. It always crashes with an invalid page fault, but in any of three possible modules, with the following debug info:
    MOZILLA caused an invalid page fault in module NECKO.DLL at 014f:604f7f31.
    Registers:
    EAX=0000000c CS=014f EIP=604f7f31 EFLGS=00010202
    EBX=00000000 SS=0157 ESP=015dfd8c EBP=015dfdcc
    ECX=01733330 DS=0157 ESI=01733408 FS=32a7
    EDX=8163c674 ES=0157 EDI=01733330 GS=0000
    Bytes at CS:EIP: ff 30 56 e8 83 31 00 00 83 c4 0c eb 05 bb 05 40
    Stack dump:
    00000004 00000001 00e66d14 00000000 604fa622 017333c8 00000000 01730d50
    00e66d1c 00000001 00e66d14 00000000 604faa16 015dfdd4 00008e42 015dfe1e

    MOZILLA caused an invalid page fault in module WS2_32.DLL at 014f:00661c27.
    Registers:
    EAX=00000000 CS=014f EIP=00661c27 EFLGS=00010246
    EBX=00736314 SS=0157 ESP=016fff1c EBP=016fff38
    ECX=014a6768 DS=0157 ESI=00736304 FS=2aaf
    EDX=00000000 ES=0157 EDI=014a6750 GS=0000
    Bytes at CS:EIP: 89 06 89 46 04 89 46 0c 89 45 f0 39 01 74 08 ff
    Stack dump:
    0066e3d8 0125c25c 014a6750 016fff6c 7800cc32 00000009 00000038 016fff5c
    00661a4a 00736304 00000400 014a6750 0125c230 004115bc 0066e3d8 00000000

    MOZILLA caused an invalid page fault in module MSVCRT.DLL at 014f:780016b2.
    Registers:
    EAX=743d6574 CS=014f EIP=780016b2 EFLGS=00010297
    EBX=00000000 SS=0157 ESP=015dfd74 EBP=015dfd7c
    ECX=00000001 DS=0157 ESI=743d6570 FS=433f
    EDX=00000000 ES=0157 EDI=0125da98 GS=0000
    Bytes at CS:EIP: 8b 44 8e fc 89 44 8f fc 8d 04 8d 00 00 00 00 03
    Stack dump:
    0125da98 011f5640 015dfdcc 604f7f39 0125da98 743d6570 00000004 00000001
    00de4cc4 00000000 604fa622 0125da58 00000000 014c4480 00de4ccc 00000001
The DOS shell usually says something like the following:
    -->snipped<---
    WEBSHELL+ = 5
    FindShortcut: in='www.slashdot.org' out='null'
    nsLayoutHistoryState::GetState, ERROR getting History state for the key
    nsLayoutHistoryState::GetState, ERROR getting History state for the key
    WEBSHELL+ = 6
More info, including stack trace, found in bug http://bugzilla.mozilla.org/show_bug.cgi?id=24008. I am not marking these as dupes and will leave it up to QA contact or Eng to do so.
I am getting the same crash as specified by Darrel on 1/17. I am consistently getting the crash. I have deleted everything that has to do with moz and reinstalled and it still crashes. I have tried builds from today (20000121) and from yesterday.
We had made the assumption that this bug was a dupe of bug 24008, but we never marked it as such (being cautious). Well, warren fixed bug 24008 on 1/21/2000. It should have landed in the late M14 build on 1/21/2000. I pulled the 1/22/2000 build and still have the crash in Necko.dll I mentioned above. Anyone have any ideas about this?
adding myself to cc list, excuse the spam. Expecting additional comments from another user who has been seeing the same thing since earlier this month.
I am using MS DUN 1.3 128version, on a Compaq (Dec '97) Cyrix G180 CPU (180 MHz), 48 MB (4 shared), with Win95 OSR 2B (FAT32). The build from 9:40 1/22/2000 crashes if I attempt to access any site not set at startup (i.e. start with mozilla.org, and anything else such as mozillazine.org or slashdot.org crashes; start blank, and mozillazine.org crashes). This started happening around the 8th of Jan. I have been deleting the moz*.dat, bin & users50 directories. Converting the profile or using mozprofile makes no difference. I usually use RamBoost v1.6 and have even tried turning it off, but no change.
I still can not reproduce this on any of my win 95 machines. Will get ahold of Suresh for assistance.
Tever, I hope to get my hands on a debug build for an M13 build with full circle to see if I can get more info. So far I have not been able to get a debug build, so I may have to wait till M13 Full Circle.
I still see the crash on loading certain web pages like home.netscape.com, www.cnn.com. Some pages load fine. Build used: 2000-01-25-14-M13 on Win 95, 133 MHz, 64 MB RAM. Stack Trace: (Incident Id: 4440221)
    Call Stack: (Signature = nsSocketTransport::OnFound 8ba136db)
    nsSocketTransport::OnFound [d:\builds\seamonkey\mozilla\netwerk\base\src\nsSocketTransport.cpp, line 1470]
    nsDNSLookup::CallOnFound [d:\builds\seamonkey\mozilla\netwerk\dns\src\nsDnsService.cpp, line 297]
    nsDNSEventProc [d:\builds\seamonkey\mozilla\netwerk\dns\src\nsDnsService.cpp, line 394]
    KERNEL32.DLL + 0x3663 (0xbff73663)
    KERNEL32.DLL + 0x228e0 (0xbff928e0)
    0x01208e3c
Weird, from that stack crawl it looks like it's dying inside of a PR_Log call in nsSocketTransport::OnFound(), but I don't see how that's possible. Do you have logging turned on? Where are you located? Can I come see your machine?
*** Bug 24008 has been marked as a duplicate of this bug. ***
I just got the final M13 build with Full Circle in it. Ran it, crashed and sent the info off. I don't know who checks the Full Circle info or how it is passed on to people. In the description area I put the bug number along with info on who it was assigned to, the QA contact on this, etc. I hope you all can get this and it will help.
asj's stack trace looks like suresh@netscape.com's Incident ID 4498637:
    nsSocketTransport::OnFound [d:\builds\seamonkey\mozilla\netwerk\base\src\nsSocketTransport.cpp, line 1470]
    nsDNSLookup::CallOnFound [d:\builds\seamonkey\mozilla\netwerk\dns\src\nsDnsService.cpp, line 297]
    nsDNSEventProc [d:\builds\seamonkey\mozilla\netwerk\dns\src\nsDnsService.cpp, line 394]
    KERNEL32.DLL + 0x3663 (0xbff73663)
    KERNEL32.DLL + 0x228e0 (0xbff928e0)
    0x01e48e3c
Looks like he had maybe gotten low on virtual memory. That might help to explain the randomness of this failure that folks are seeing.
    Operating System: Windows 95 4.0 build 67109814
    Service Pack: -
    Physical Memory: 64.0 MB
    Memory Status:      Available    Total
    Physical Memory:    1.8 MB       64.0 MB
    Page File:          236.1 MB     278.2 MB
    Virtual Memory:     1996.1 MB    2044.0 MB
    Screen Information: 1600 x 1200, 16 bits per pixel
Keywords: beta1
Summary: Crash in NECKO.DLL → top100 Crash in NECKO.DLL home.netscape.com cnn.com
Summary: top100 Crash in NECKO.DLL home.netscape.com cnn.com → [top100][talkback]Crash in nsSocketTransport::OnFound on home.netscape.com cnn.com
can someone try visiting an IP address using a build on win95? Any IP addr will do, here's sun's 192.18.97.195. I think we're having buffer alloc problems in the dns service.
re-assigning to gordon. Here's the PR_LOG stmt that is failing (I'm assuming we're failing here; maybe a bad assumption, but I have nothing else to go on):
    PR_LOG(gSocketLog, PR_LOG_DEBUG,
           ("nsSocketTransport::OnFound(...) [%s:%d %x]."
            " DNS lookup succeeded => %s (%d.%d.%d.%d)\n",
            mHostName, mPort, this,
            aHostEnt->hostEnt.h_name,
            mNetAddress.inet.ip & 0xff,
            (mNetAddress.inet.ip >> 8) & 0xff,
            (mNetAddress.inet.ip >> 16) & 0xff,
            (mNetAddress.inet.ip >> 24) & 0xff));
The only real variable here that could choke a printf would be if aHostEnt->hostEnt.h_name wasn't null terminated. I checked the IP address specific code and it seems to be doing the right thing (always null terminating). However, aHostEnt->bufLen is *always* some bogus number. I've fixed (haven't checked in, gordon can you?) the IP addr case:
    Index: src/nsDnsService.cpp
    ===================================================================
    RCS file: /cvsroot/mozilla/netwerk/dns/src/nsDnsService.cpp,v
    retrieving revision 1.27
    diff -r1.27 nsDnsService.cpp
    679c679
    < PRIntn bufLen = PR_NETDB_BUF_SIZE;
    ---
    > PRIntn bufLen = hostentry->bufLen = PR_NETDB_BUF_SIZE;
But the non-IP addr case is still bogus.
Assignee: gagan → gordon
Jud: I don't think we're crashing in the log statement -- the linenumber must be wrong. PR_LOG expands into an if (<enabled>) { <then print> } kind of thing, and they don't have logging turned on so this code isn't getting executed. The only other thing in this method is the memcpy -- that's got to be the problem.
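To illustrate the point about the log statement, here is a minimal sketch of a logging macro that expands to an `if (<enabled>) { <print> }` shape. The names `SKETCH_LOG`, `gSketchLogEnabled`, and `DescribeHost` are hypothetical stand-ins invented for this example and are not NSPR's actual definitions; the sketch only shows that when the flag is off, the arguments are never evaluated, so a bad pointer in the argument list could not be dereferenced there.

```cpp
#include <cstdio>

// Hypothetical stand-in for a log module's "enabled" state (not NSPR's API).
bool gSketchLogEnabled = false;

// Counts how often the log arguments are actually evaluated.
int gArgumentEvaluations = 0;

const char* DescribeHost() {
    ++gArgumentEvaluations;
    return "www.example.com";
}

// The argument list is wrapped in an extra pair of parentheses, PR_LOG style,
// and only reaches printf when the flag is set.
#define SKETCH_LOG(args)            \
    do {                            \
        if (gSketchLogEnabled) {    \
            std::printf args;       \
        }                           \
    } while (0)

// Returns how many times the log arguments were evaluated for one log call.
int EvaluationsForOneLogCall(bool enabled) {
    gSketchLogEnabled = enabled;
    gArgumentEvaluations = 0;
    SKETCH_LOG(("DNS lookup succeeded => %s\n", DescribeHost()));
    return gArgumentEvaluations;
}
```

With the flag off, `EvaluationsForOneLogCall(false)` reports zero evaluations, which is consistent with the argument that the crash has to be elsewhere in the method.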
agreed. we must be trying to copy more than we should be. I'm not able to see anything obvious in the dns code. I'd like to know if it's repro w/ IP addresses (different code path in dns) before anyone kills themself trying to verify dns host ent copy code and the joyous bufalloc stuff.
As requested I tested with M13 Full Circle with IP addresses. It did not crash loading Sun's (192.18.97.195) or Netscape's (205.188.247.66) site with the IP address. It did crash loading CNN's page (207.25.71.246). I sent off the M13 Full Circle test data referencing this bug. On a side note, I found that just before I crash my console screen fills with the following error over and over: nsLayoutHistoryState::GetState, ERROR getting History state for the key nsLayoutHistoryState::GetState, ERROR getting History state for the key Hope it helps
Okay, Jud and I have a handle on this now. We just need to reorder the fields in nsHostEnt so that the bufLen and bufPtr fields aren't overwritten when WSAAsyncGetHostByName fills in the data. Jud and I are exchanging diffs. I can check in the fix when the tree opens.
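The field reordering can be pictured with a small, self-contained sketch. Everything here is an assumption for illustration: the two structs are hypothetical and are not the real nsHostEnt, and the "resolver" is simulated by a memset that writes from the start of the result area to the end of the struct. Under that assumption, bookkeeping fields placed after the result area get clobbered, while fields placed before it survive.

```cpp
#include <cstddef>
#include <cstring>

const int kResultSize = 64;

// Hypothetical "before" layout: bookkeeping sits after the result area.
struct TrailingBookkeeping {
    char result[kResultSize];
    int bufLen;
    char* bufPtr;
};

// Hypothetical "after" layout: bookkeeping sits before the result area.
struct LeadingBookkeeping {
    int bufLen;
    char* bufPtr;
    char result[kResultSize];
};

// Simulates a resolver that fills everything from the result area to the end
// of the struct, then reports whether bufLen still holds its original value.
template <typename Entry>
bool BufLenSurvivesFill(std::size_t resultOffset) {
    Entry entry;
    std::memset(&entry, 0, sizeof(entry));
    entry.bufLen = kResultSize;
    unsigned char* base = reinterpret_cast<unsigned char*>(&entry);
    std::memset(base + resultOffset, 0xFF, sizeof(Entry) - resultOffset);
    return entry.bufLen == kResultSize;
}

bool TrailingLayoutSurvives() {
    return BufLenSurvivesFill<TrailingBookkeeping>(
        offsetof(TrailingBookkeeping, result));
}

bool LeadingLayoutSurvives() {
    return BufLenSurvivesFill<LeadingBookkeeping>(
        offsetof(LeadingBookkeeping, result));
}
```

In this simulation only the layout with the bookkeeping fields first keeps `bufLen` intact, which matches the kind of reordering described above.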
Status: NEW → ASSIGNED
Putting on PDT+ radar for beta1.
Whiteboard: [PDT+]
Adding "crash" keyword to all known open crasher bugs.
Keywords: crash
Fix was checked in last night, so we should be able to verify it in today's build when it is ready. Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
I hate to say this.... I really hate to say this, but I'm not sure this is fixed. I had several different errors on the first set of tests, so I restarted my computer and ran today's latest 1/28/99 win32 build with nothing else. This time it crashed before the page loaded and I got TWO Dr Watson errors.
    First: MOZILLA caused an invalid page fault in module NECKO.DLL at 014f:605587e7.
    Second: MOZILLA caused an invalid page fault in module GKPARSER.DLL at 014f:602b69a9.
Then I restarted again and ran another set of tests. This time the sites all almost completely load when the crash occurs.
    Some Top 100 site (forget the name): MOZILLA caused an invalid page fault in module WS2_32.DLL at 014f:008a1c27.
    CNN: MOZILLA caused an invalid page fault in module NECKO.DLL at 014f:605587e7.
    www.icq.com: MOZILLA caused an invalid page fault in module NECKO.DLL at 014f:605587e7.
The other people that saw this bug before, are you still getting the crash? Anything I can do to help?
This is really ugly. The bug Jud and I fixed yesterday was a serious problem related to this section of code, but it appears to have had no impact on this bug. It seems an x86 register is getting trashed within a series of nine instructions. We aren't calling any functions, and memory is not trashed (I can manually retrace the steps the computer "should" have taken, and end up with the correct result). The problem is intermittent; most DNS lookups complete just fine. All DNS lookups go through this code. I have forced several engineers to look at this crash, and no one has an explanation yet. I believe evil spirits may be the cause. It happens on Suresh's Compaq 5133 Deskpro. Alan, what kind of machine are you on? Tom, have you been able to reproduce this yet? I'm reopening this bug, and looking for an x86 expert.
Status: RESOLVED → REOPENED
It's not an MP machine, is it? Can you see who's setting the register value, and map that back to the high-level code that's being executed? Maybe it's a compiler bug.
It's not an MP machine. It doesn't appear to be a compiler bug because the code executes correctly the vast majority of the time. It may be possible that VStudio is displaying memory in a bogus way, but it seems strange that the bogus memory would look so correct. I'll put some additional sanity checks in for debug purposes to try and verify what VStudio is telling me. I'll also take a look at the Talkback reports to see if they give me more accurate information.
Sorry for spam, interested in this bug.
I am running a P120, Win95, 64 megs of RAM with MS DUN 1.3, etc. As far as brands, it is a generic type thing, starting out as a Midwest Micro machine. I later replaced a bad motherboard with an Achme Motherboard #5156 (by Micro-Star International MSI) PCI TX4 w/ 512 K cache and Award BIOS. I am running 1 onboard serial port and using an extra board for a serial port on a higher IRQ (for my PalmPilot). Not that it matters, but just in case, it is a SIIG Fast EIDE Controller. I have most functions on the board disabled. Are you now thinking it is a hardware issue vs. software? Curious how many others were using MS DUN (Dial-Up Networking) 1.3 for Win 95. I saw another person with this crash mention DUN 1.3.
re CCing myself, as asj was kind enough to remove me accidentally...
I am getting a similar crash on my machine: Win95, AMD 333 MHz, 64 MB. Mail works well most of the time, but I get crashes regularly loading web pages. I can load some simple pages, but all other pages generate a crash in Necko.dll at some stage during the page load. I can load local pages without problems. Also I tried loading Sun's page; it loads perfectly when using the IP address but crashes during load when using www.sun.com. The figures that go with the crash are below:
    MOZILLA caused an invalid page fault in module NECKO.DLL at 0137:60547f17.
    Registers:
    EAX=65726464 CS=0137 EIP=60547f17 EFLGS=00010202
    EBX=00000000 SS=013f ESP=00cbfd8c EBP=00cbfdcc
    ECX=010bf100 DS=013f ESI=010be068 FS=4a87
    EDX=816588f4 ES=013f EDI=010bf100 GS=0000
    Bytes at CS:EIP: ff 30 56 e8 07 32 00 00 83 c4 0c eb 05 bb 05 40
    Stack dump:
    00000004 00000001 007433e8 00000000 6054a61a 010be028 00000000 01085f10
    007433f0 00000001 007433e8 00000000 6054a9b4 00cbfdd4 00008e42 00cbfe1e
Scenario 1/29/2000: power on machine, dial ISP, start mozilla 13:07 01/28. mozilla.org loads OK, www.weather.com OK; click on the current temperatures link: crash at 014f:605587de in NECKO.DLL. I know that Mozilla & Communicator are completely different, but they are still using the same OS & dialer, so... In 4.7 or earlier I would get crashes in RNR20.DLL, which AFAIK has to do with DNS addresses. Can the installer provide an RNR20.DLL file? Can someone provide a link to get this file (currently mine is 4.10.15110)? Different versions on different machines could be part of this problem. Any suggestions? Thanks, Tom
Sunday AM scenario: power on, dial ISP, start mozilla, mozilla.org OK. Use IP address for weather (206.151.166.121): page loads OK, > 170 seconds. Click on current temperatures: no crash, but temps not shown after a long time. Go to ISP mail maint screen, delete spam OK, go to bugzilla, post this. Build Id: 20000012812. Makes me think timing (activity on Internet) is part of the problem??? Thanks, Tom
*** Bug 25791 has been marked as a duplicate of this bug. ***
Just an update on the circumstances under which Mozilla crashes for me (see previous post): I now find it crashes in module NECKO.DLL at 014f:605587e7 (different location) and occasionally in modules WS2_32.DLL at 014f:00661c27 and MSVCRT.DLL at 014f:780016b2 as it did previously. Thinking that this might be related to Windows 95 only (is anyone with Windows 98 experiencing this?), I tried several different things. I have MS Winsock 2 installed. I can't find an elegant way of going back to the original Winsock, short of reinstalling Windows, so I can't figure out if this is the cause of the problem. Is anyone running MS Winsock 1 (i.e. the one that is initially installed under Windows 95) and experiencing this problem? I've installed three patches to Winsock 2: vipup11, vipup20, and vtcpup20. Rebooted and it still crashes. I tried toggling DNS caching on and off, see MS knowledge base article Q174614 (don't flame me if this is an IE and not a Winsock issue). It still crashes, but it seems to take longer to crash when DNS caching is off (perhaps coincidentally). Initially I was crashing using an ethernet connection that uses a client manager with the PPPoE protocol. I also tried using a PPP dial-up connection, but it still crashes, although again it seemed to take longer (i.e. I might have to try two or three different sites before it crashes). This info might be totally useless but I thought it might help.
Sorry, that was as of build 2000012808. All user and platform information is the same as previously posted.
Gordon, no, I still can not reproduce this on my machines. I tried again using 2 fairly minimal Win 95 machines in the lab. Also tried stressing the virtual memory: no crash like the one described. Checked today's 01/31 build and an older one.
Clearing FIXED resolution due to reopen.
Resolution: FIXED → ---
Build ID: 2000013111 crashes Necko.dll at 014f:6055880c which is a slight change from 605587de...
calling all folks who see this bug to add info about connection speed and other info about their network configuration.
Notice that these crashes are all from Win95 with the same build number (OSR2??). Perhaps there is something peculiar to that build (or DUN 1.3???). I do not see this on Win98.
Ok, make room for me too. I'm on Win95 OSR2, DUN 1.3 update, Winsock 2 update, various other updates, no IE whatsoever. Pentium 200MHz MMX, 64MB RAM, 56K modem that connects all the time at 44K. I get crashes in those exact same files. I can browse around mozilla.org to my heart's content, but leaving it and going somewhere else kills Mozilla 99.9% of the time, and not always at the same point in loading the page. On only a couple of occasions has the page loaded completely, but that is quite rare and it would die on the next link clicked. I had some success stopping the page before it completely loaded to avoid the crash. Also, because I'm still on Win95 I've been religious about updates that seemed important, so there might be another update involved that is causing problems besides the DUN 1.3 (and WS2?) updates that it seems most of us having this problem have applied. I will continue to send Talkback reports as it happens, but it's pretty much the same ol' thing over and over and makes Mozilla virtually useless on my Win95 system at home, so I'd love to try anything the developers would like me to in order to test this. I have done it enough to feel confident there are no other files that it's crashing in besides msvcrt.dll, rdf.dll, ws2_32.dll, and necko.dll. Let me know how I can help.
Well, I seem to be a counterpoint: I am running OSR2 (4.00.950B) and I do *not* crash. In fact, the current builds are *very* stable (for me). I recently blew up my hard drive, and this spare disk is a pretty stock install of Win95 on an IBM Thinkpad from Apr 97 (it actually hadn't been booted since early '98; it was sitting on a shelf). I have not upgraded DUN (rnaapp.exe 4.00.1111) or WINSOCK (wsock32.dll 4.00.1111). IE4.0+ has never been installed on this computer (with all its unknown collection of DLLs). I am using 28K dialup, PCMCIA modem, with DHCP, without WINS. [I'll mention for completeness that I run Netware as well, but that can't be relevant]. I'd be happy to provide any other info you need. John.
Thanks everyone for all the help on this. The configuration information is proving to be very interesting. I still haven't worked out the connection between the susceptible configurations and register EAX getting trashed. We have a machine in house that can reproduce the problem, and I'm poking it to see precisely what conditions are necessary for the crash. I'll post more later this afternoon. Thanks again.
I realized I had not put my connection speed in my previous information about my system. I am on a 56K modem with strange lines, so the connection goes anywhere from 28.8 to 33.6. (I hope they will fix the lines soon.)
It really seems to be a Win95 problem. My Mozilla crashes every 3rd web page or so. I'm using Win95 (4.00.950b) with updated msdun13, vtcpup20, vipup20. So it may be somehow connected to these updates. Good luck
Win95 4.00.95B (FAT32), 128byte version of DUN 1.3 with associated updates, usually connected at 31200 (33600 modem), 44MB RAM available to OS
I have M13 (zipfile and fullcircle) and win95b (with all the required updates: vtcupd etc., winsock 2.2), a DSL connection, and I can't get the bloody thing to load one page :( It crashes a lot in *necko.dll*. 48MB of RAM. I've tried this also with some M14 nightlies.
*** Bug 25102 has been marked as a duplicate of this bug. ***
Rick Potts and I have verified that WSAAsyncGetHostByName occasionally posts a notification indicating the results are complete, BEFORE it has filled out the hostent. nsSocketTransport::OnFound() tries to dereference garbage, but by the time we look at the data structures in the debugger, they have all been nicely fixed up. Also, inserting printf()'s into OnFound() alters the timing, making the bug "go away", or much more rare. We are continuing to investigate the boundaries of the problem, and develop possible solutions. We hope to have more information tomorrow.
Last night Rick investigated a bit further and refined the current hypothesis. I was able to confirm it this afternoon. The root of the problem is the version of winsock on the troubled platforms always returns 1 for calls to WSAAsyncGetHostByName, when it should be returning a unique ID that can be used to identify which lookup has completed. Thus when we have multiple lookups outstanding, we have no way of knowing which lookup completes. The nsDNSService therefore picks the first lookup in the list, which may or may not be the correct one, and may or may not be complete yet. Of course by the time the debugger kicks in, all outstanding lookups have completed, so the data structures all look fine. We need to identify which versions of winsock have this "feature".
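The failure mode described above can be reproduced in a small simulation. This is a sketch under stated assumptions, not the nsDNSService code: `PendingLookup`, `FindByHandle`, and the handle-issuing lambda are hypothetical names, and the "broken resolver" is modelled simply by handing out the value 1 for every request. It shows that picking the first entry whose handle matches selects the wrong, still-unfilled lookup when handles are not unique.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one outstanding asynchronous lookup.
struct PendingLookup {
    int handle;
    std::string host;
    bool filled;  // true once the resolver has written the result
};

// Returns the index of the first pending lookup with this handle, or -1.
int FindByHandle(const std::vector<PendingLookup>& pending, int handle) {
    for (std::size_t i = 0; i < pending.size(); ++i) {
        if (pending[i].handle == handle) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Issues two lookups, completes only the second, and reports whether the
// completion notification maps back to that second, filled-in lookup.
bool CompletionResolvesToCorrectLookup(bool uniqueHandles) {
    std::vector<PendingLookup> pending;
    int nextHandle = 1;
    auto issue = [&](const std::string& host) {
        // The simulated broken resolver returns 1 for every request.
        int handle = uniqueHandles ? nextHandle++ : 1;
        pending.push_back(PendingLookup{handle, host, false});
        return handle;
    };

    issue("home.netscape.com");
    int second = issue("www.cnn.com");
    pending[1].filled = true;  // only the second lookup has finished

    int index = FindByHandle(pending, second);
    return index == 1 && pending[index].filled;
}
```

With unique handles the completion is attributed to the right entry; with the constant handle the first, unfilled entry is chosen, which is the situation in which a caller would go on to read a result that has not been written yet.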
Priority: P3 → P1
qfecheck.exe reports:
    UPD970624R1 Windows Socket API Update:
        Winsock.dll 4.10.0.1511
        Wsock32.dll 4.10.0.1511
    UPD971126B1 TCP Driver Update
        VTCP.386 4.10.0.1657
Build 2000020214, last Necko.dll crash was 014f:60558807 EAX=7373654d
I pulled version numbers (right click) for every winsock or tcp/ip driver I could think of.
    c:\windows\winsock.dll 4.10.1656
    c:\windows\system\wsock.vxd 4.10.1656
    c:\windows\system\wsock2.vxd 4.10.1656
    c:\windows\system\wsock32.dll 4.10.1656
    c:\windows\system\wsock32n.dll 5.2.0.2
    c:\windows\system\vtcp.386 4.10.1657
    c:\windows\system\vnbt.386 4.10.1658
    c:\windows\system\vip.386 4.10.1657
adding top100 keyword.
Keywords: top100
Winsock version 4.10.1656 (windows95B 4.00.1111)
I don't know if this is any use to anybody, but I've had problems before with winsock 2 (not related to mozilla or netscape). The way to remove winsock 2 is to: restart in DOS mode, cd /windows/ws2bakup, then run ws2bakup.bat and reboot.
Summary: [top100][talkback]Crash in nsSocketTransport::OnFound on home.netscape.com cnn.com → [talkback]Crash in nsSocketTransport::OnFound on home.netscape.com cnn.com
Yes... It appears that both versions of Winsock2 for Win95 (1511 and 1656) have a broken WSAAsyncGetHostByName(...). For both of these versions, it appears that the HANDLE that is returned by WSAAsyncGetHostByName(...) is *always* 1. Of course this makes managing multiple outstanding requests impossible :-( The easy fix is to *remove* winsock2 :-) I have not been able to reproduce this problem on Win98, WinNT 4.0 or Win95 running old winsocks (i.e. not Winsock 2.0). I also looked at the code for Communicator 4... The code is quite different because it maintains a local DNS cache. However, there is a secondary validity check for (hostent_h_name != NULL) that appears to minimize the problem :-)
Okay, my version numbers are below. Note however that I am one of those getting an error at 0137:... not 014f:.... (others are in Bug 25102, which has been marked as a dupe of this bug). I have marked the files whose version numbers differ from the similar listing filed previously.
    c:\windows\winsock.dll 4.10.1998 (different!)
    c:\windows\system\wsock.vxd 4.10.1656
    c:\windows\system\wsock2.vxd 4.10.1656
    c:\windows\system\wsock32.dll 4.10.1656
    c:\windows\system\wsock32n.dll 5.1.0.2 (different)
    c:\windows\system\vtcp.386 4.10.1657
    c:\windows\system\vnbt.386 4.10.1658
    c:\windows\system\vip.386 4.10.1658 (different)
*** Bug 25431 has been marked as a duplicate of this bug. ***
Whiteboard: [PDT+] → [PDT+] 2/10/2000
potts and I talked about not relying on the HANDLE as the index into the lookup entry table. instead we could do a strcmp on the actual host returned, against our hosts in the lookup table we cache. if we find a match, we're covered. This "solves" the winsock2 problem, maybe not the crash???
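The string-compare idea could look roughly like the sketch below. The names `PendingHost` and `FindByHostName` are hypothetical, and the code assumes the name reported on completion is byte-for-byte the name that was requested; a resolver may report a canonical name that differs from the one asked for, in which case an exact strcmp would find no match.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical entry in the table of outstanding lookups.
struct PendingHost {
    const char* requestedName;
    bool done;
};

// Returns the index of the first unfinished lookup whose requested name
// matches the name reported by the resolver, or -1 if there is none.
int FindByHostName(const std::vector<PendingHost>& pending,
                   const char* returnedName) {
    for (std::size_t i = 0; i < pending.size(); ++i) {
        if (!pending[i].done &&
            std::strcmp(pending[i].requestedName, returnedName) == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Example table with two outstanding lookups, used for illustration only.
std::vector<PendingHost> ExamplePendingTable() {
    return {PendingHost{"home.netscape.com", false},
            PendingHost{"www.cnn.com", false}};
}
```

For the example table, looking up "www.cnn.com" finds the second entry, while a differing spelling such as "cnn.com" finds nothing, which is the main weakness of matching on the name alone.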
Could all of this nonsense possibly be due to the fact that the windows dns code isn't thread safe? See bug 27496.
No. Winsock2 is broken on Win95. The thread safety issue is a separate problem.
Jud, string compares would only work if we are coalescing multiple requests for dns lookups for the same hostname. Otherwise we have the same problem as winsock2 on win95 where identical HANDLEs are returned for different lookups. The two solutions that I've discussed with Rick are either using a range of event message IDs to identify which lookup has completed, or simply test WSAAsyncGetHostByName to see if it returns unique HANDLEs and revert to synchronous PR_GetHostByName if it doesn't. I think gracefully degrading to synchronous on win95 systems with winsock2 installed is probably the best approach. We should probably have the dns thread make the call to PR_GetHostByName so that the socket transport can continue working.
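A rough sketch of the "probe for unique HANDLEs, else go synchronous" approach follows. It is not the checked-in fix: `LookupMode`, `IssueLookupFn`, and `ChooseLookupMode` are hypothetical names, the resolver is injected as a callback so the example stays portable, and the probe host names are placeholders. The zero check assumes a Winsock-style convention where 0 means the asynchronous request could not be started; in real code the two probe requests would also have to be cancelled or drained afterwards.

```cpp
#include <functional>
#include <string>

enum class LookupMode { Async, Synchronous };

// Hypothetical callback that starts an asynchronous lookup and returns its
// handle; 0 is assumed to mean the request could not be started.
using IssueLookupFn = std::function<int(const std::string&)>;

// Issues two probe lookups and keeps the asynchronous path only if both
// requests started and came back with distinct handles.
LookupMode ChooseLookupMode(const IssueLookupFn& issue) {
    int first = issue("probe-one.invalid");
    int second = issue("probe-two.invalid");
    if (first == 0 || second == 0) {
        return LookupMode::Synchronous;
    }
    return (first != second) ? LookupMode::Async : LookupMode::Synchronous;
}

// Convenience wrappers that run the probe against two simulated resolvers.
LookupMode ModeForWellBehavedResolver() {
    int next = 0;
    return ChooseLookupMode([&next](const std::string&) { return ++next; });
}

LookupMode ModeForConstantHandleResolver() {
    return ChooseLookupMode([](const std::string&) { return 1; });
}
```

A resolver that hands out increasing handles keeps the asynchronous mode, while one that always returns 1 is routed to the synchronous fallback, which is the graceful degradation described above.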
Whiteboard: [PDT+] 2/10/2000 → [PDT+] 2/16/2000
Fix checked in last night.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Suresh verifies this to be working on his Win95 system using build 2000022108. Marking verified.
Status: RESOLVED → VERIFIED