N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]

Status

--
major
VERIFIED FIXED
17 years ago
17 years ago

People

(Reporter: leif, Assigned: leif)

Tracking

({crash, topcrash})

Details

(Whiteboard: [PDT+], crash signature)

Attachments

(3 attachments, 1 obsolete attachment)

(Assignee)

Description

17 years ago
We have a few Talkback reports indicating that we are crashing on line 852 in
nsLDAPConnection.cpp. The stack is

nsLDAPConnection::OnFound
[d:\builds\seamonkey\mozilla\directory\xpcom\base\src\nsLDAPConnection.cpp, line
852]
XPTC_InvokeByIndex
[d:\builds\seamonkey\mozilla\xpcom\reflect\xptcall\src\md\win32\xptcinvoke.cpp,
line 139]
EventHandler [d:\builds\seamonkey\mozilla\xpcom\proxy\src\nsProxyEvent.cpp, line
515]
PL_HandleEvent [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 591]


The relevant code is:

NS_IMETHODIMP
nsLDAPConnection::OnFound(nsISupports *aContext, 
                          const char* aHostName,
                          nsHostEnt *aHostEnt) 
{
    PRUint32 index = 0;
    PRNetAddr netAddress;
    char addrbuf[64];

    // Do we have a proper host entry? If not, set the internal DNS
    // status to indicate that host lookup failed.
    //
    if (!aHostEnt->hostEnt.h_addr_list || !aHostEnt->hostEnt.h_addr_list[0]) {
        mDNSStatus = NS_ERROR_UNKNOWN_HOST;

        return NS_ERROR_UNKNOWN_HOST;
    }
    
    // Make sure our address structure is initialized properly
    //
    memset(&netAddress, 0, sizeof(netAddress));
    PR_SetNetAddr(PR_IpAddrAny, PR_AF_INET6, 0, &netAddress);


I can't think of any reason why we'd sometimes crash on this call to |memset()|,
and I haven't been able to reproduce it either. I'm stumped as to how to debug
this problem; I don't understand how |netAddress| could fail to be correctly
allocated on the stack.

-- Leif
(Assignee)

Updated

17 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 1

17 years ago
From a talkback report:

x86 Registers:
EAX: 00060003 EBX: 60e32b60 ECX: 02a9afcc EDX: 606864b4
ESI: 02b0a954 EDI: 00000000 ESP: 0012fc28 EBP: 0012fc90
EIP: 6068332e cf PF af zf sf of IF df nt RF vm   IOPL: 0
CS: 001b DS: 0023 SS: 0023 ES: 0023 FS: 0038 GS: 0000

cmp     [eax],edi
60683330 0f84d9000000     je      6068340f
60683336 6a20             push    0x20
60683338 8d45e0           lea     eax,[ebp-0x20]
6068333b 57               push    edi
6068333c 50               push    eax
6068333d e89a200000       call    606853dc
60683342 8d45e0           lea     eax,[ebp-0x20]
60683345 50               push    eax
60683346 57               push    edi
60683347 6a17             push    0x17
60683349 6a01             push    0x1
6068334b ff15dc29dccc     call    dword ptr [ccdc29dc]
*** Bug 102567 has been marked as a duplicate of this bug. ***
I just ran into this on my linux box running a branch build.  Talkback ID is
36186399.

x86 Registers:
EAX: 09fec8cc EBX: 41337130 ECX: 0000266e EDX: 41336998
ESI: 00000003 EDI: 09fece90 ESP: bffff1bc EBP: bffff298
EIP: 4132fd02 cf pf af zf sf of IF df nt RF vm   IOPL: 0
CS: 0023 DS: 002b SS: 002b ES: 002b FS: 0000 GS: 0007

 Code Around the PC: 
4132fd02 833900           cmp     dword ptr [ecx],0x0
4132fd05 7519             jnz     4132fd20
4132fd07 8b4508           mov     eax,[ebp+0x8]
4132fd0a c7404c1e004b80   mov     dword ptr [eax+0x4c],0x804b001e
4132fd11 b81e004b80       mov     eax,0x804b001e
4132fd16 e945010000       jmp     4132fe60
4132fd1b 90               nop
4132fd1c 8d742600         lea     esi,[esi]
4132fd20 6a6c             push    0x6c
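
A note on the Linux dump above: the constant 0x804b001e that is stored and then returned looks like NS_ERROR_UNKNOWN_HOST, which would place the faulting |cmp dword ptr [ecx],0x0| in the |h_addr_list| test rather than in the later |memset()| call. The sketch below re-derives that value. The layout constants are my recollection of Mozilla's nsError.h and should be treated as assumptions; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values, mirroring NS_ERROR_MODULE_BASE_OFFSET and
// NS_ERROR_MODULE_NETWORK from nsError.h.
constexpr std::uint32_t kModuleBaseOffset = 0x45;
constexpr std::uint32_t kModuleNetwork    = 6;

// Failure bit | (module + base offset) << 16 | per-module code.
// Hypothetical stand-in for the NS_ERROR_GENERATE_FAILURE macro.
constexpr std::uint32_t GenerateFailure(std::uint32_t module,
                                        std::uint32_t code) {
    return (1u << 31) | ((module + kModuleBaseOffset) << 16) | code;
}

// NS_ERROR_UNKNOWN_HOST is assumed to be network-module failure number 30.
constexpr std::uint32_t kUnknownHost = GenerateFailure(kModuleNetwork, 30);
```

If those assumptions hold, |kUnknownHost| equals the immediate operand seen in the two |mov| instructions at 4132fd0a and 4132fd11.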
(Assignee)

Comment 4

17 years ago
Created attachment 51817 [details]
Disassembler output around the suspected crasher
(Assignee)

Comment 5

17 years ago
After looking at this some more, both Mose and I are not convinced that the
Talkback report is pointing at the correct line. In fact, we suspect the crasher
might be at around line 845:

   if (!aHostEnt->hostEnt.h_addr_list || !aHostEnt->hostEnt.h_addr_list[0]) {


We've been able to reproduce a crasher on this exact line, where
|aHostEnt->hostEnt.h_addr_list| is non-null but points into never-never land
(or Uranus as mose would say), and we crash on the second half of the |if()|
statement. This causes a segfault.

It's still unclear how this structure is getting corrupted, or why. Does anyone
have suggestions as to a) whether I'm not testing the |aHostEnt| structure
properly for "correctness", b) what could cause the DNS service (or possibly the
proxy code) to corrupt the host data, or c) whether this is a corruption on the
stack itself, making our |aHostEnt| point into the void somehow?

Thanks!

-- Leif

Comment 6

17 years ago
You might try adding assertions to nsDNSRequest::FireStop() to ascertain whether
or not the hostent is corrupt at that point.

I presume that aHostEnt is !nil, but I don't see a test for that.
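
For illustration, a minimal sketch of what the missing test could look like. The two structs are hypothetical stand-ins that model only the fields touched by the check, not the real nsHostEnt, and |HasUsableAddress| is an invented name. Note that a check like this only catches null pointers; it cannot tell a valid pointer from one that points at freed memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real types; only the fields used by the
// check in OnFound are modeled.
struct HostEntSketch   { char **h_addr_list; };
struct NsHostEntSketch { HostEntSketch hostEnt; };

// Returns true only when there is at least one address entry to read.
// Each pointer is tested before it is dereferenced, starting with aHostEnt.
bool HasUsableAddress(const NsHostEntSketch *aHostEnt) {
    return aHostEnt != nullptr
        && aHostEnt->hostEnt.h_addr_list != nullptr
        && aHostEnt->hostEnt.h_addr_list[0] != nullptr;
}
```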
Created attachment 51993 [details]
stack trace of reproduced crash
OK, so I noticed that in my builds, the crash happens more often when
there is an error dialog, after I select the error item.  Additionally, just for
grins, I tried recompiling nsLDAPConnection.cpp using PROXY_SYNC rather than
PROXY_ASYNC.  Interestingly, once when I saw the core dump with this PROXY_SYNC
code, I saw an assertion from nsDNSRequest::Cancel:

      NS_ASSERTION(!PR_CLIST_IS_EMPTY(this), "request is not queue on lookup");

This is making me wonder if ::Cancel is sometimes getting called after the
lookup has already finished.  Is this allowable semantics?


gordon: correct, aHostEnt is not nil.  I tried adding the assertions you
suggested, and the hostent is NOT corrupt just before the call to OnFound.
So this may be proxy or xptcall or other event queue lossage of some sort.
OK, so I see what's going on here.  The DNS service is calling OnFound back with
a pointer to some private data.  Then, it assumes that once OnFound returns,
there's no need for the private data any more, and sets the nsCOMPtr holding it
to nsnull.  
However, in the case of an asynchronous proxy, the data may not have actually
been used yet.

So I think we can work around this in the short term by using a synchronous
proxy (maybe I was mistaken when I thought it still dumped core before with the
sync proxy, because it's not now).
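
To make the lifetime problem concrete, here is a small stand-alone sketch. It is not the Mozilla code: the queue, the struct and the function name are all hypothetical, and a |std::weak_ptr| stands in for the raw pointer so that the stale access can be observed safely instead of crashing.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <string>

struct HostData { std::string address; };

// Hypothetical event queue standing in for the proxy's target thread.
using EventQueue = std::queue<std::function<void()>>;

// Mimics the service: give the listener a non-owning view of private data,
// then drop the data as soon as the notification call returns.
// Returns whether the data was still alive when the listener actually ran.
bool LookupAndNotify(bool async) {
    EventQueue queue;
    bool aliveWhenUsed = false;
    auto privateData = std::make_shared<HostData>(HostData{"192.0.2.1"});
    std::weak_ptr<HostData> view = privateData;  // non-owning, like a raw pointer

    auto onFound = [&aliveWhenUsed, view]() {
        aliveWhenUsed = !view.expired();
    };

    if (async) {
        queue.push(onFound);  // async proxy: just post the event and return
    } else {
        onFound();            // sync proxy: listener runs before we return
    }

    privateData.reset();      // service: "OnFound returned, I'm done with this"

    while (!queue.empty()) {  // the target thread gets around to the event
        queue.front()();
        queue.pop();
    }
    return aliveWhenUsed;
}
```

With |async| set to false the listener sees live data; with |async| set to true the data is already gone by the time the queued event runs, which is the situation described above.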

Long term, I'd propose the nsIDNSListener should hand back refcounted data
directly, rather than just a pointer into a privately refcounted object.
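
A rough sketch of that idea, using |std::shared_ptr| as a stand-in for XPCOM refcounting. All names here are invented for the example and say nothing about what the eventual interface should look like. Because the queued event holds its own reference, the record survives the service releasing its copy.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <string>

struct HostRecord { std::string address; };

// Hypothetical queue of events waiting to run on the listener's thread.
using PendingEvents = std::queue<std::function<void()>>;

// The queued event captures an owning reference to the record, so the
// service can release its own reference right after posting the event.
std::string DeliverOwned() {
    PendingEvents queue;
    std::string seen;
    auto record = std::make_shared<HostRecord>(HostRecord{"192.0.2.1"});

    queue.push([&seen, record]() { seen = record->address; });

    record.reset();  // service drops its reference; the event still holds one

    while (!queue.empty()) {
        queue.front()();
        queue.pop();
    }
    return seen;
}
```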

I'm still seeing the assertion I mentioned before with PROXY_SYNC, anyone know
what's up with this?
The assertion is happening when the nsLDAPConnection destructor calls
mDNSRequest->Cancel.  It's not clear to me why this is happening, however: I
added some logging, and nsLDAPConnection::OnStopLookup is getting called, and
that function zeroes out mDNSRequest.

Updated

17 years ago
Keywords: crash, nsbranch+
(Assignee)

Comment 12

17 years ago
Created attachment 52290 [details] [diff] [review]
Possible fix, v1
(Assignee)

Comment 13

17 years ago
Comment on attachment 52290 [details] [diff] [review]
Possible fix, v1

This patch is missing one part, posting a new one soon.
Attachment #52290 - Attachment is obsolete: true
(Assignee)

Comment 14

17 years ago
Created attachment 52295 [details] [diff] [review]
Potential fix, v2
(Assignee)

Comment 15

17 years ago
Requesting SR= and R= on the v2 patch. It's tested on all three platforms.

-- Leif

Comment 17

17 years ago
Comment on attachment 52295 [details] [diff] [review]
Potential fix, v2

sr=bienvenu
Attachment #52295 - Flags: superreview+
(Assignee)

Comment 18

17 years ago
Checked in on trunk. Richi P.: can you maybe try a "trunk" build on Monday or
so, and see if this fixes your problem?

Thanks,

-- Leif
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED

Comment 19

17 years ago
I'm using build 2001100503 on win32 right now. Unfortunately, a lot has happened
since I sent that bug report. One of the major changes is that I deleted my User
profile and started from scratch (some changes a few weeks back caused Mozilla
installers to **** on me).

With this build, Mozilla doesn't seem to crash anymore when doing an LDAP
lookup. I'll bang on it some more and see what happens. I'll also download a
build on Monday and see if that makes any difference as well.

Comment 20

17 years ago
Sorry ... spoke too soon. It's still happening on 2001100503 win32 (I just
noticed on the Platform heading for this bug report, it says Linux only).

The behavior is erratic. Near as I can tell, one of three things happens:

1) I start Mozilla, compose a message, type in a few chars. and it SIGSEGVs (the
win32 equivalent, at least)

2) I start Mozilla, do some stuff, compose a message, type in a few chars. and
some entries in the personal dictionary will show up, and at the bottom there is
an error entry reporting problems with the LDAP server. I try a different
sequence of letters and next thing I know, LDAP is working.

3) LDAP works fine.

Once LDAP lookup starts to work, though, I can't seem to make it break again
without restarting Mozilla.

Will check again on Monday.
(Assignee)

Comment 21

17 years ago
What was the timestamp on the file you downloaded? The fix wasn't checked in
until around 7pm, so I suspect you won't see the fix in any builds until
earliest Saturday morning.

-- Leif

Comment 22

17 years ago
Finally!

On win32 mozilla 2001100610 (timestamp 06-Oct-2001 14:06), doing LDAP lookups
isn't crashing like before. Of course, there's very little traffic on the LAN so
the environment is unlike that when I experienced it before, but it looks good
so far.
(Assignee)

Comment 23

17 years ago
Requesting PDT for checkin on 0.9.4 branch.

-- Leif
Whiteboard: PDT

Comment 24

17 years ago
Verified with 20011008 trunk build on Windows 2000.
LDAP auto complete works fine against the following servers:

Hostname: 208.12.37.50
Base DN: dc=mcom,dc=com

Hostname: 208.12.36.22
Base DN: o=Airius.com

Hostname: 208.12.37.103
Base DN: o=mcom.com
QA Contact: olgac → yulian

Updated

17 years ago
Whiteboard: PDT → [PDT+]

Comment 25

17 years ago
pls check this into the branch - PDT+
(Assignee)

Comment 26

17 years ago
Checked in on 0.9.4 branch

-- Leif
*** Bug 103868 has been marked as a duplicate of this bug. ***
Re-open to get into the 0.9.5 branch.
Blocks: 101793
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 29

17 years ago
Checked in on 0.9.5 branch
Status: REOPENED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED

Comment 30

17 years ago
We still show four incidents on the Trunk as recently as 10-04. Can we check it 
in? 

Adding info for talkback tracking. This was a topcrasher on the branch. 
Changing platform to reflect that this was/is happening on Windows and Linux.
Keywords: topcrash
OS: Linux → All
Hardware: All → PC
Summary: Segfault in OnFound in nsLDAPConnection → N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]

Comment 31

17 years ago
Tom, do you see this on the topcrash report for the 094 branch and 095 branch
after 10-9?  Thanks.  
greer: re-read the comments in the bug, and you'll see that the fix wasn't
checked in until late on 10/5, so it's not surprising that there are crashes on
10/4.

Comment 33

17 years ago
Talkback data shows no incidents with this signature after 10/9.
Marking VERIFIED fixed.
Status: RESOLVED → VERIFIED
Crash Signature: [@ nsLDAPConnection::OnFound]