Closed Bug 102227
Opened 23 years ago
Closed 23 years ago

N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]

Tracking

(Not tracked)

Status:

VERIFIED FIXED

People

(Reporter: leif, Assigned: leif)

References

Details

(Keywords: crash, topcrash, Whiteboard: [PDT+])

Crash Data

Attachments

(3 files, 1 obsolete file)

Disassembler output around the suspected crasher (Leif Hedstrom, 23 years ago; 6.80 KB, text/plain)
stack trace of reproduced crash (Dan Mosedale (:dmosedale, :dmose), 23 years ago; 31.99 KB, text/plain)
Possible fix, v1 (Leif Hedstrom, 23 years ago; 2.34 KB, patch)
Potential fix, v2 (Leif Hedstrom, 23 years ago; 2.74 KB, patch); dmosedale: review+, Bienvenu: superreview+

Leif Hedstrom

Assignee

Description

23 years ago

We have a few Talkback reports indicating that we are crashing on line 852 in nsLDAPConnection.cpp. The stack is:

    nsLDAPConnection::OnFound [d:\builds\seamonkey\mozilla\directory\xpcom\base\src\nsLDAPConnection.cpp, line 852]
    XPTC_InvokeByIndex [d:\builds\seamonkey\mozilla\xpcom\reflect\xptcall\src\md\win32\xptcinvoke.cpp, line 139]
    EventHandler [d:\builds\seamonkey\mozilla\xpcom\proxy\src\nsProxyEvent.cpp, line 515]
    PL_HandleEvent [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 591]

The relevant code is:

    NS_IMETHODIMP
    nsLDAPConnection::OnFound(nsISupports *aContext,
                              const char* aHostName,
                              nsHostEnt *aHostEnt)
    {
        PRUint32 index = 0;
        PRNetAddr netAddress;
        char addrbuf[64];

        // Do we have a proper host entry? If not, set the internal DNS
        // status to indicate that host lookup failed.
        //
        if (!aHostEnt->hostEnt.h_addr_list || !aHostEnt->hostEnt.h_addr_list[0]) {
            mDNSStatus = NS_ERROR_UNKNOWN_HOST;
            return NS_ERROR_UNKNOWN_HOST;
        }

        // Make sure our address structure is initialized properly
        //
        memset(&netAddress, 0, sizeof(netAddress));
        PR_SetNetAddr(PR_IpAddrAny, PR_AF_INET6, 0, &netAddress);

I can't think of any reason why we'd sometimes crash on this call to |memset()|, and I've not been able to reproduce it either. I'm kind of stumped about how to debug this problem; I don't understand how |netAddress| could fail to be correctly allocated on the stack.

-- Leif

Leif Hedstrom

Assignee

Updated

23 years ago

Status: NEW → ASSIGNED

Leif Hedstrom

Assignee

Comment 1

23 years ago

From a talkback report:

    x86 Registers:
    EAX: 00060003  EBX: 60e32b60  ECX: 02a9afcc  EDX: 606864b4
    ESI: 02b0a954  EDI: 00000000  ESP: 0012fc28  EBP: 0012fc90
    EIP: 6068332e
    cf PF af zf sf of IF df nt RF vm
    IOPL: 0  CS: 001b  DS: 0023  SS: 0023  ES: 0023  FS: 0038  GS: 0000

    cmp [eax],edi
    60683330 0f84d9000000  je 6068340f
    60683336 6a20          push 0x20
    60683338 8d45e0        lea eax,[ebp-0x20]
    6068333b 57            push edi
    6068333c 50            push eax
    6068333d e89a200000    call 606853dc
    60683342 8d45e0        lea eax,[ebp-0x20]
    60683345 50            push eax
    60683346 57            push edi
    60683347 6a17          push 0x17
    60683349 6a01          push 0x1
    6068334b ff15dc29dccc  call dword ptr [ccdc29dc]

Dan Mosedale (:dmosedale, :dmose)

Comment 2

23 years ago

*** Bug 102567 has been marked as a duplicate of this bug. ***

Dan Mosedale (:dmosedale, :dmose)

Comment 3

23 years ago

I just ran into this on my Linux box running a branch build. Talkback ID is 36186399.

    x86 Registers:
    EAX: 09fec8cc  EBX: 41337130  ECX: 0000266e  EDX: 41336998
    ESI: 00000003  EDI: 09fece90  ESP: bffff1bc  EBP: bffff298
    EIP: 4132fd02
    cf pf af zf sf of IF df nt RF vm
    IOPL: 0  CS: 0023  DS: 002b  SS: 002b  ES: 002b  FS: 0000  GS: 0007

    Code Around the PC:
    4132fd02 833900          cmp dword ptr [ecx],0x0
    4132fd05 7519            jnz 4132fd20
    4132fd07 8b4508          mov eax,[ebp+0x8]
    4132fd0a c7404c1e004b80  mov dword ptr [eax+0x4c],0x804b001e
    4132fd11 b81e004b80      mov eax,0x804b001e
    4132fd16 e945010000      jmp 4132fe60
    4132fd1b 90              nop
    4132fd1c 8d742600        lea esi,[esi]
    4132fd20 6a6c            push 0x6c

Leif Hedstrom

Assignee

Comment 4

23 years ago

Attached file Disassembler output around the suspected crasher

Leif Hedstrom

Assignee

Comment 5

23 years ago

After looking at this some more, both Mose and I are not convinced that the Talkback report is pointing at the correct line. In fact, we suspect the crasher might be at around line 845:

    if (!aHostEnt->hostEnt.h_addr_list || !aHostEnt->hostEnt.h_addr_list[0]) {

We've been able to reproduce a crasher on this exact line, where |aHostEnt->hostEnt.h_addr_list| is non-null but points into never-never land (or Uranus, as mose would say). The first half of the |if()| passes, and the second half dereferences the bad pointer, which causes a segfault. It's still unclear how or why this structure is getting corrupted. Does anyone have suggestions as to whether:

a) I'm not testing the |aHostEnt| structure properly for "correctness";
b) something could cause the DNS service (or possibly the proxy code) to corrupt the host data; or
c) this is a corruption of the stack itself, making our |aHostEnt| point into the void somehow?

Thanks!

-- Leif

gordon

Comment 6

23 years ago

You might try adding assertions to nsDNSRequest::FireStop() to ascertain whether or not the hostent is corrupt at that point. I presume that aHostEnt is !nil, but I don't see a test for that.
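Gordon's suggestion can be sketched as a small sanity check asserted at the hand-off point, for example just before the listener is notified. This is a minimal sketch, assuming the POSIX `struct hostent` from `<netdb.h>`; `HostEntLooksSane` is a hypothetical helper, not Mozilla code. It can only catch a structure whose fields disagree with each other: a pointer into freed memory may still look sane, and reading it is undefined behavior in any case.

```cpp
#include <netdb.h>       // struct hostent (POSIX)
#include <sys/socket.h>  // AF_INET, AF_INET6

// Hypothetical debugging helper (not part of Mozilla): returns false when a
// hostent is obviously inconsistent. It cannot detect a dangling pointer.
bool HostEntLooksSane(const hostent* h) {
    if (h == nullptr) return false;                  // aHostEnt itself is null
    if (h->h_addr_list == nullptr) return false;     // first half of the if()
    if (h->h_addr_list[0] == nullptr) return false;  // second half: the crash site
    // Address length must match the address family.
    switch (h->h_addrtype) {
        case AF_INET:  return h->h_length == 4;
        case AF_INET6: return h->h_length == 16;
        default:       return false;
    }
}
```

In a debug build one would write something like `assert(HostEntLooksSane(&aHostEnt->hostEnt));` where the DNS code fires the callback, and again at the top of OnFound, to see on which side of the proxy the data goes bad.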

Dan Mosedale (:dmosedale, :dmose)

Comment 7

23 years ago

Attached file stack trace of reproduced crash

Dan Mosedale (:dmosedale, :dmose)

Comment 8

23 years ago

OK, so I noticed that in my builds, the crash happens more often when there is an error dialog, after I select the error item. Additionally, just for grins, I tried recompiling nsLDAPConnection.cpp using PROXY_SYNC rather than PROXY_ASYNC. Interestingly, on one occasion when I saw the core dump with this PROXY_SYNC code, I also saw an assertion from nsDNSRequest::Cancel:

    NS_ASSERTION(!PR_CLIST_IS_EMPTY(this), "request is not queue on lookup");

This is making me wonder if ::Cancel is sometimes getting called after the lookup has already finished. Is this allowable semantics?

Dan Mosedale (:dmosedale, :dmose)

Comment 9

23 years ago

gordon: correct, aHostEnt is not nil. I tried adding the assertions you suggested, and the hostent is NOT corrupt just before the call to OnFound. So this may be proxy, xptcall, or other event-queue lossage of some sort.

Dan Mosedale (:dmosedale, :dmose)

Comment 10

23 years ago

OK, so I see what's going on here. The DNS service calls OnFound with a pointer to some private data. It then assumes that once OnFound returns, the private data is no longer needed, and sets the nsCOMPtr holding it to nsnull. However, with an asynchronous proxy the call returns as soon as the event has been posted, so the data may not actually have been used yet.

So I think we can work around this in the short term by using a synchronous proxy. (Maybe I was mistaken when I thought it still dumped core with the sync proxy before, because it doesn't now.) Long term, I'd propose that nsIDNSListener hand back refcounted data directly, rather than just a pointer into a privately refcounted object.

I'm still seeing the assertion I mentioned before with PROXY_SYNC; does anyone know what's up with this?
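The lifetime problem described in the previous comment can be reproduced in miniature with a plain C++ worker thread. This is a sketch under stated assumptions: `EventQueue`, `HostEntry`, `ReadAsyncRefcounted` and `ReadSync` are hypothetical names standing in for the XPCOM proxy, `nsHostEnt` and the listener call; they are not Mozilla's API. The asynchronous variant follows the long-term proposal (the queued task holds its own reference), and the synchronous variant mirrors the PROXY_SYNC workaround (the caller waits, so a raw pointer is only read while its owner keeps it alive).

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for nsHostEnt: data the DNS service owns privately.
struct HostEntry {
    std::string firstAddress;
};

// Runs posted tasks in order on one background thread, loosely like the
// event queue an asynchronous proxy call is delivered through.
class EventQueue {
public:
    EventQueue() : worker_([this] { Run(); }) {}
    ~EventQueue() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining tasks are drained before the thread exits
    }
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // shutting down, nothing left
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist first
};

// Asynchronous: posts the read and returns immediately. The task owns a
// reference, so the caller dropping its own reference is harmless.
std::future<std::string> ReadAsyncRefcounted(EventQueue& q,
                                             std::shared_ptr<const HostEntry> entry) {
    auto promise = std::make_shared<std::promise<std::string>>();
    std::future<std::string> result = promise->get_future();
    q.Post([promise, entry] { promise->set_value(entry->firstAddress); });
    return result;
}

// Synchronous: blocks until the task has run, so the raw pointer is only
// dereferenced while the caller still guarantees the entry is alive.
std::string ReadSync(EventQueue& q, const HostEntry* entry) {
    auto promise = std::make_shared<std::promise<std::string>>();
    std::future<std::string> result = promise->get_future();
    q.Post([promise, entry] { promise->set_value(entry->firstAddress); });
    return result.get();
}

// The bug pattern, for contrast (do NOT do this):
//   q.Post([raw = entry.get()] { use(raw->firstAddress); });  // returns at once
//   entry = nullptr;  // owner drops its reference; the queued task now dangles
```

In the asynchronous case the caller can reset its `shared_ptr` right after posting and the queued task still reads valid data, which is the guarantee a raw pointer into a privately refcounted object cannot give once its owner lets go.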

Dan Mosedale (:dmosedale, :dmose)

Comment 11

23 years ago

The assertion is happening when the nsLDAPConnection destructor calls mDNSRequest->Cancel. It's not clear to me why this is happening, however: I added some logging, and nsLDAPConnection::OnStopLookup is getting called, and that function zeroes out mDNSRequest.
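One possible answer to the "Cancel after the lookup already finished" question is to track each request's state explicitly, so that a late Cancel() (for example, from a destructor) is a harmless no-op rather than an assertion. The sketch below uses hypothetical names (`LookupRequest`, `Complete`, `Cancel`) and is not nsDNSRequest's real interface or what the DNS service actually implemented; it only illustrates that semantics.

```cpp
#include <mutex>

// Hypothetical request type: Cancel() and Complete() race safely, and
// whichever runs first decides the outcome.
class LookupRequest {
public:
    enum class State { Queued, Completed, Canceled };

    // Called by the resolver when the lookup finishes. Returns false if the
    // request was already canceled, meaning no callback should fire.
    bool Complete() {
        std::lock_guard<std::mutex> lock(mu_);
        if (state_ != State::Queued) return false;
        state_ = State::Completed;
        return true;
    }

    // Safe to call at any time. Returns true only if this call actually
    // canceled a pending lookup; after completion it does nothing.
    bool Cancel() {
        std::lock_guard<std::mutex> lock(mu_);
        if (state_ != State::Queued) return false;
        state_ = State::Canceled;
        return true;
    }

    State state() const {
        std::lock_guard<std::mutex> lock(mu_);
        return state_;
    }

private:
    mutable std::mutex mu_;
    State state_ = State::Queued;
};
```

With this design the destructor can call Cancel() unconditionally, and the question of whether OnStopLookup has already cleared the pointer no longer decides between correct behavior and an assertion.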

Jaime Rodriguez, Jr.

Updated

23 years ago

Keywords: crash, nsbranch+

Leif Hedstrom

Assignee

Comment 12

23 years ago

Attached patch Possible fix, v1 (obsolete)

Leif Hedstrom

Assignee

Comment 13

23 years ago

Comment on attachment 52290: Possible fix, v1

This patch is missing one part; posting a new one soon.

Attachment #52290 - Attachment is obsolete: true

Leif Hedstrom

Assignee

Comment 14

23 years ago

Attached patch Potential fix, v2

Leif Hedstrom

Assignee

Comment 15

23 years ago

Requesting SR= and R= on the v2 patch. It's tested on all three platforms. -- Leif

Dan Mosedale (:dmosedale, :dmose)

Comment 16

23 years ago

Comment on attachment 52295: Potential fix, v2

r=dmose@netscape.com

Attachment #52295 - Flags: review+

David :Bienvenu

Comment 17

23 years ago

Comment on attachment 52295: Potential fix, v2

sr=bienvenu

Attachment #52295 - Flags: superreview+

Leif Hedstrom

Assignee

Comment 18

23 years ago

Checked in on trunk. Richi P.: can you maybe try a "trunk" build on Monday or so, and see if this fixes your problem? Thanks, -- Leif

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Richi Plana

Comment 19

23 years ago

I'm using build 2001100503 on win32 right now. Unfortunately, a lot has happened since I sent that bug report. One of the major changes is that I deleted my user profile and started from scratch (some changes a few weeks back caused Mozilla installers to **** on me). With this build, Mozilla doesn't seem to crash anymore when doing an LDAP lookup. I'll bang on it some more and see what happens. I'll also download a build on Monday and see if that makes any difference.

Richi Plana

Comment 20

23 years ago

Sorry ... spoke too soon. It's still happening on 2001100503 win32 (I just noticed that the Platform heading for this bug report says Linux only). The behavior is erratic. As near as I can tell, one of three things happens:

1) I start Mozilla, compose a message, type in a few characters, and it SIGSEGVs (the win32 equivalent, at least).
2) I start Mozilla, do some stuff, compose a message, and type in a few characters; some entries from the personal dictionary show up, and at the bottom there is an error entry reporting problems with the LDAP server. I try a different sequence of letters, and next thing I know, LDAP is working.
3) LDAP works fine.

Once LDAP lookup starts to work, though, I can't seem to make it break again without restarting Mozilla. Will check again on Monday.

Leif Hedstrom

Assignee

Comment 21

23 years ago

What was the timestamp on the file you downloaded? The fix wasn't checked in until around 7pm, so I suspect you won't see it in any builds until Saturday morning at the earliest. -- Leif

Richi Plana

Comment 22

23 years ago

Finally! On win32 mozilla 2001100610 (timestamp 06-Oct-2001 14:06), doing LDAP lookups isn't crashing like before. Of course, there's very little traffic on the LAN so the environment is unlike that when I experienced it before, but it looks good so far.

Leif Hedstrom

Assignee

Comment 23

23 years ago

Requesting PDT for checkin on 0.9.4 branch. -- Leif

Whiteboard: PDT

yulian chang

Comment 24

23 years ago

Verified with 20011008 trunk build on Windows 2000. LDAP autocomplete works fine against the following servers:

Hostname: 208.12.37.50, Base DN: dc=mcom,dc=com
Hostname: 208.12.36.22, Base DN: o=Airius.com
Hostname: 208.12.37.103, Base DN: o=mcom.com

QA Contact: olgac → yulian

Jaime Rodriguez, Jr.

Updated

23 years ago

Whiteboard: PDT → [PDT+]

Jaime Rodriguez, Jr.

Comment 25

23 years ago

pls check this into the branch - PDT+

Leif Hedstrom

Assignee

Comment 26

23 years ago

Checked in on 0.9.4 branch -- Leif

Dan Mosedale (:dmosedale, :dmose)

Comment 27

23 years ago

*** Bug 103868 has been marked as a duplicate of this bug. ***

Christopher Blizzard (:blizzard)

Comment 28

23 years ago

Re-open to get into the 0.9.5 branch.

Blocks: 101793

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Leif Hedstrom

Assignee

Comment 29

23 years ago

Checked in on 0.9.5 branch

Status: REOPENED → RESOLVED

Closed: 23 years ago → 23 years ago

Resolution: --- → FIXED

greer

Comment 30

23 years ago

We still show four incidents on the Trunk as recently as 10-04. Can we check it in? Adding info for talkback tracking. This was a topcrasher on the branch. Changing platform to reflect that this was/is happening on Windows and Linux.

Keywords: topcrash

OS: Linux → All

Hardware: All → PC

Summary: Segfault in OnFound in nsLDAPConnection → N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]

lchiang

Comment 31

23 years ago

Tom, do you see this on the topcrash report for the 094 branch and 095 branch after 10-9? Thanks.

Dan Mosedale (:dmosedale, :dmose)

Comment 32

23 years ago

greer: re-read the comments in the bug, and you'll see that the fix wasn't checked in until late on 10/5, so it's not surprising that there are crashes on 10/4.

greer

Comment 33

23 years ago

Talkback data shows no incidents with this signature after 10/9. Marking VERIFIED fixed.

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

14 years ago

Crash Signature: [@ nsLDAPConnection::OnFound]