Closed Bug 15604 Opened 21 years ago Closed 20 years ago

startup crash in PrepareAndDispatch (solaris/native optimized)

Categories

(Core :: XPConnect, defect, P3, critical)

Sun
Solaris
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: tor, Assigned: rich.burridge)

References

Details

(Keywords: crash)

Attachments

(2 files)

apprunner crashes on startup in an optimized build (--disable-debug
--enable-optimize) on Solaris/native(4.2) with a 10/5 morning CVS pull.
The same source tree built in normal debug mode runs fine.  The crash
happens quite early:

(dbx) run
Running: apprunner
(process id 13599)
Warning: MOZILLA_FIVE_HOME not set.
Reading symbolic information for libjsloader.so
Reading symbolic information for libxpconnect.so
t@1 (l@1) signal SEGV (no mapping at the fault address) in PrepareAndDispatch at
0xff203ce8
0xff203ce8: PrepareAndDispatch+0x000c:  ld      [%i0], %l0
Current function is PL_HashTableEnumerateEntries
  368               rv = (*f)(he, n, arg);
(dbx) where
current thread: t@1
  [1] PrepareAndDispatch(0x80000000, 0x40000, 0xffbef0a0, 0x6c2d8, 0xb2468,
0xfeccf444), at 0xff203ce8
  [2] 0xff204854(0x80000000, 0x4, 0xffbef0a0, 0x6c2d8, 0xfed25524, 0xff203f44),
at 0xff204853
  [3] mozJSComponentLoader::AutoRegisterComponent(0x80000000, 0x57458, 0x0,
0xffbef0bc, 0xfed25128, 0x3f128), at 0xfed13e0c
  [4] mozJSComponentLoader::RegisterComponentsInDir(0x3f128, 0x80000000,
0x6c260, 0xfed25128, 0x57430, 0x0), at 0xfed137c4
  [5] AutoRegister_enumerate(0x57c88, 0x3f128, 0xffbef2ec, 0xff1eeb54,
0xff3dc9d0, 0x5a), at 0xff1eeb84
  [6] _hashEnumerate(0x1, 0x0, 0xffbef280, 0xff3dd2c8, 0xff3dc9d0, 0x6a), at
0xff1c40a4
=>[7] PL_HashTableEnumerateEntries(ht = 0x28980, f = 0xff1c4090 =
&_hashEnumerate(), arg = 0xffbef280), line 368 in "plhash.c"
  [8] nsHashtable::Enumerate(0x28320, 0xff1eeb54, 0xffbef2ec, 0x2a950,
0xff221aec, 0x17), at 0xff1c4544
  [9] nsComponentManagerImpl::AutoRegister(0x6c260, 0x0, 0x0, 0xff221aec,
0x80000000, 0x2a950), at 0xff1ef0c8
  [10] nsComponentManager::AutoRegister(0x0, 0x0, 0x11a1c, 0x56b48, 0x0, 0x0),
at 0xff1f48b8
  [11] NS_SetupRegistry_1(0x0, 0xff22bb04, 0x56b48, 0x2766c, 0xfee3ac90, 0x3f8),
at 0x1496c
  [12] main1(0x1, 0xffbef64c, 0x80000000, 0xfee37be4, 0x26370, 0x2a950), at
0x12e6c
  [13] main(0x1, 0xffbef64c, 0x26370, 0x26400, 0x0, 0x0), at 0x1392c
(dbx) p f
f = 0xff1c4090 = &_hashEnumerate()
(dbx) p *he
*he = {
    next    = (nil)
    keyHash = 1954228146U
    key     = 0x57c88
    value   = 0x3f128
}
(dbx) p n
n = 0
(dbx) p arg
arg = 0xffbef280
Assignee: jband → rogerl
This is rogerl's code.

Shaver's JSComponents stuff causes xpconnect calls to happen early. But whatever
is wrong was probably already wrong, just not right up front.

Is object layout different for release builds? Are there compiler flag issues?
Blocks: 16654
Status: NEW → ASSIGNED
Blocks: 16950
Blocks: 17432
Blocks: 17907
Severity: major → critical
Here's a simpler test case (I have a fix already, just waiting for the tree)

function f() { re = /,/; re.exec("8,7"); return "|"; }
reg = /X/g
str = "QQXAAAXBBBXCCC";
print(str.replace(reg, f));
Nuts, wrong bug. Sorry.
Target Milestone: M13 → M14
Not going to make M13 for these.
QA Contact: cbegle → rginda
Updating QA Contact.
Adding "crash" keyword to all known open crasher bugs.
Keywords: crash
Punt.
Target Milestone: M14 → M15
Assignee: rogerl → drapeau
Status: ASSIGNED → NEW
Per Clayton, bulk moving Solaris bugs to Sun.
[richb - 16th March 2000]
The problem here only occurs with the Sun native compiler (version 4.2 or 5.0)
when compiling
.../mozilla/xpcom/reflect/xptcall/src/md/unix/xptcstubs_sparc_solaris.cpp
with -O (really -O2). If this file is compiled with -g, nothing, -O1, -O3 or
-O4, there is no problem.

The problem is incorrect assembly code is generated for the various 
STUB_ENTRY(n)'s with -O2. With -O1 (or the other compile options that work),
the assembler for each Stubn() entry looks something like:

0xfee62494: Stub3       :       save    %sp, -0x60, %sp
0xfee62498: Stub3+0x0004:       call    _PROCEDURE_LINKAGE_TABLE_+0x2d0c [PLT]
0xfee6249c: Stub3+0x0008:       mov     0x3, %o0
0xfee624a0: Stub3+0x000c:       ret
0xfee624a4: Stub3+0x0010:       restore %g0, %o0, %o0
0xfee624a8: Stub3+0x0014:       unimp   0x0

When compiled with -O2, the assembly code for the Stubn()'s is:

0xfee62414: Stub3       :       mov     %o7, %g1
0xfee62418: Stub3+0x0004:       mov     0x3, %o0
0xfee6241c: Stub3+0x0008:       call    _PROCEDURE_LINKAGE_TABLE_+0x2d0c [PLT]
0xfee62420: Stub3+0x000c:       mov     %g1, %o7

As you can see the stack pointer is not properly saved and restored, and there
is no proper return.

This would appear to be a Sun native compiler bug, and I will attempt to
isolate this into a small example with fails, but for now a simple workaround
has been found (see attachment
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=6613).

Re-assigning to Rich Burridge, working on the Solaris port of Netscape 6.
Assignee: drapeau → rich.burridge
Status: NEW → ASSIGNED
[richb - 20th Match 2000]
I submitted a bug against the Sun compiler for the failure to do the right
thing when compiled with -O2. Here's the comments from the evaluation:

"The code for the Stub##n() routines calls SharedStub().
The SharedStub() routine is an assembly language routine
that thinks it knows where to find things other than it's arguments.
Tail-call optimizations among other things break the assumptions
built into SharedStub().

I suggest you re-write the Stub##n() routines in assembly language,
so that they don't break.
I also suggest that you use the optimized code produced
by the compiler for those routines,
though that will necessitate rewriting SharedStub() too.

Of course, it looks like this whole mass of code it very non-portable.
It should probably be re-written to use <stdarg.h>.

I found the code for SharedStub() at:
http://lxr.mozilla.org/seamonkey/source/xpcom/reflect/xptcall/src/md/unix/xptcstubs_asm_sparc_solaris.s

I'm going to close this as not a bug."

The evaluator also sent me the following comments by private email:

"I found:

http://lxr.mozilla.org/seamonkey/source/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_asm_sparc_solaris.s

It has the wrong value for the "save" instruction, should be -(64+32).
The SUNW version has the same problem.
Look in /usr/include/sys/stack.h for what should happen.

I found:

http://lxr.mozilla.org/seamonkey/source/xpcom/reflect/xptcall/src/md/unix/xptcstubs_asm_sparc_solaris.s

This code is bogus.
It thinks it can look in the register window and find the arguments to
it's caller -- it can't in the presence of tail-call optimizations.

It also has the same problem with the "save" instruction."

I will now look into adjusting the SPARC assembly code to do the right thing.
[richb - 21st March 2000]
I've just been trying out the suggested code from the Sun compiler guy.
His version of SharedStub (in xptcstubs_asm_sparc_solaris.s) looks like:

SharedStub:
        st      %o1, [%fp + 72]     ! store the args
        st      %o2, [%fp + 76]
        st      %o3, [%fp + 80]
        st      %o4, [%fp + 84]
        st      %o5, [%fp + 88]

                    ! %o0 already has "this" or "that"?
        ld      [%fp + 68], %o1     ! get "n" aka index
        add     %fp, 72, %o2        ! form pointer to args

    save    %sp, -96, %sp       ! need a stack frame,
                    ! or PrepareAndDispatch may overwrite
                    ! the args vector

    mov %i0, %o0
    mov %i1, %o1
        call    PrepareAndDispatch  ! tail call
    mov %i2, %o2

    ret
    restore %g0, %o0, %o0

and he suggested making the stubs look like:

 or  %g0,3,%g1      ! where "n" is 3, save it until %o reg's available
 st  %g1, [%fp+68]  ! you could probably use %g2 instead of [%fp+68]
 or  %g0,%o7,%g1
 call    SharedStub ! (tail call)
 or  %g0,%g1,%o7

Now there are two problems with this stub definition. As it currently stands
the stub needs to be defined via #define in xptcstubs_sparc_solaris.cpp. This
needs to return an nsresult. The second problem is that the "3" is fine for
Stub3(), but the Stub4() needs 4, and so on.

As an experiment I tried changing the STUB_ENTRY #define to:

#define STUB_ENTRY(n) \
nsresult nsXPTCStubBase::Stub##n() \
{ \
    register nsresult result = 0; \
    asm("   or  %g0, 3, %g1"); \
    asm("   st  %g1, [%fp+68]"); \
    asm("   or  %g0, %o7, %g1"); \
    asm("   call    SharedStub"); \
    asm("   or  %g0, %g1, %o7"); \
    asm("   mov %o0, %i0"); \
    return result; \
}

and used the above SharedStub definition. This compiled, built and ran
(Solaris CC 5.0 compilers) for -O1, -O2, -O3 and -O4. 

I also tried it with gcc (-O2) and it compiled, built and ran fine.

If I just compiled xptcstubs_sparc_solaris.cpp with -g (Sun CC 5.0 compiler),
then when I ran it, it segv'ed with:

current thread: t@5
=>[1] PrepareAndDispatch(self = (nil), methodIndex = 3U, args = 0xfd9d1a50), 
line 51 in "xptcstubs_sparc_solaris.cpp"
  [2] XPTC_InvokeByIndex(0x0, 0x3, 0xfd9d1a50, 0xfdc277e8, 0x2454a8, 0x2662c0), 
at 0xfee6659c
  [3] nsXPTCStubBase::Stub3(this = 0x3), line 5 in "xptcstubsdef.inc"
  [4] nsFileTransport::Process(0x80000000, 0xfd9d1bdc, 0xfdc277e8, 0x177430, 
0x0, 0x290c0), at 0xfdad6134
  [5] nsFileTransport::Run(0x177430, 0x0, 0x177434, 0x80000000, 0x10b1f0, 0x1), 
at 0xfdad5cd0
  [6] nsThreadPoolRunnable::Run(0x0, 0x177434, 0x10cb68, 0xfeaddfe0, 0x0, 
0x10e0a9), at 0xfee565a4
  [7] nsThread::Main(0x10d338, 0xfeae5418, 0x2, 0xfeadc55c, 0x8, 0x110a80), at 
0xfee550e8
  [8] 0xff0b9ed4(0x110a80, 0xff0d3bdc, 0x0, 0x110a80, 0x0, 0xfe955dc8), at 
0xff0b9ed3

I sent my concerns back to the compiler guy. Here are his comments:

    "What you are trying to do needs to be written in assembly language.
    It cannot be expressed in C++, not even as "asm()".

    The unfortunate part is that this will force you to
    include the name mangling in the assembly language,
    so it will not be portable across different C++ compilers.

    Also, you forgot to convert the "3" to "n" in the first "asm()"."

If I did this, then it wouldn't fit in with the generic 
#include "xptcstubsdef.inc" approach that is being currently used for multi
platform support.

This is getting beyond my level of expertise, and the compiler guys level of
interest (which is fair enough; he's been *very* helpful seeing this isn't
really his problem).

The simplest fix/hack seems to be to use the existing 
xptcstubs_asm_sparc_solaris.s code (even though it might be bogus) and
make the #define for STUB_ENTRY in xptcstubs_sparc_solaris.cpp to be:

#define STUB_ENTRY(n) \
nsresult nsXPTCStubBase::Stub##n() \
{ \
    int retval = SharedStub(n);     \
    return retval;                  \
}

which I believe prevents the tail call optimisation occuring.

Alternatively perhaps we (myself and Ashutosh Kulkarni, a member of the
Blackwood group here who has done SPARC assembly code in a previous life)
can meet with you (Chris and Roger; originators of the assembler code), to
see if we can come up with a better (less hacky) solution.

Comments please.
[richb - 22nd March 2000]
I thought I added these comments yesterday. Hmmmn...

The compiler guy came back with another suggestion:

    "The following code won't work to turn off tail-call optimization.
     But it gave me another idea, below.

     > #define STUB_ENTRY(n) \
     > nsresult nsXPTCStubBase::Stub##n() \
     > { \
     >     int retval = SharedStub(n);     \
     >     return retval;                  \
     > } 

    "You could try the following, with the original verion of SharedStub()
     plus the fix for the save instr.

     #define STUB_ENTRY(n) \
     nsresult nsXPTCStubBase::Stub##n() \
     { \ 
         int dummy; /* defeat tail-call optimization */ \
         return SharedStub(n, &dummy); \
     }   

     That should defeat the tail call optimization,
     without much extra code (probably one instruction)."

His change (plus the fix for the extern declaration for SharedStub in
xptcstubs_sparc_solaris.cpp, which now reads:

extern "C" int SharedStub(int, int*);

compiles, builds and runs (with the Sun CC 5.0 compiler). I tried -O2, -g
and -O4. It also compiles, builds and successfully runs it with gcc with -O2.
These are the only variants I've tried. I've no reason to believe the other
optimisation levels won't be successful as well.

The sets of diffs for this change are at:

http://bugzilla.mozilla.org/showattachment.cgi?attach_id=6752

This affects:

xptcinvoke_asm_sparc_solaris_SUNW.s
xptcstubs_asm_sparc_solaris.s
xptcstubs_sparc_solaris.cpp

This fix also includes the adjustment to the save instruction for the
XPTC_InvokeByIndex routine in xptcinvoke_asm_sparc_solaris_SUNW.s.

This would appear to be an acceptable solution.

Anybody care to comment on this?  You are all awfully quiet out there.
Rich,

Sorry about the silence. I think a number of us have been watching your 
well documented progress but just keeping quiet for lack of intelligent things 
to say. I did contact rogerl a few days ago to be sure that he was available to 
help. He answered that he would. Unfortunately, I believe that he's been off 
work for a bit. He really is the best person on our end to comment on the 
specific changes. FWIW, mcafee is going on sabbatical soon (if he is not already 
gone).

I certainly don't mind any of these changes that use otherwise unnecessary 
variables to defeat the optimizer. If it works then I think this is better than 
rewriting the stubs in asm.

I don't know how much of a rush anyone is in to get this checked in. I'd vote on 
waiting for review from rogerl just to get another set of knowlegable eyes 
looking at the solution. But I certainly trust you if you say it works.

I really apreciate you doing this work!

John.
Reports of my absence have been exaggerated, but I have been digging out from 
under the JS bugs reported while I was gone last week, sorry not to 
have replied sooner.

Despite jband's flattery, my assistance in this can only be minimal, my exposure 
to sparc assembly & ABI extends to this one hack, period, so I'm going 
to bow to the "compiler guy's" assertions here about defeating the tail-call 
optimization. The latest fix that's been proposed seems perfectly fine to me, 
and since it's working I'd say let's get it in.
[richb - 23rd March 2000]
While I'm on vacation, one of my group is going to checkout this fix with the
Sun 4.2 compiler as well. There might be one more .s file that needs the save
instruction adjustment too. More on this when I get back (4/3/00).
[richb - 4th April 2000]
Back from vacation. Just catching up on things.

The compiler guy came back with the following:

  "I just realized there are a couple of "niceties" that are not
   in the code for XPTC_InvokeByIndex() and probably the code
   for SharedStub() too.

   At the end of the assembly language routine,
   put:

	   .size	XPTC_InvokeByIndex, .-XPTC_InvokeByIndex
	   .type	XPTC_InvokeByIndex, #function

   That way they show up correctly in the ELF symbol table
   (e.g. with "nm").

   Also, for the internal labels, like "invoke",
   prefix them with a ".", as in ".invoke",
   that way they become local and don't show up in the symbol table."

----

Margaret Chan did some further investigation of file usage for SW 5.0
compilers vs SW 4.2 and gcc compilers. Here are her comments:

  "Your theory about WS4.2 & GCC picks up one file and WS5.0 picks up the 
   other is correct.  This is what I've observed from my build:

   1.  5.0 build - picks up xptcinvoke_asm_sparc_solaris_SUNW.s

       I did a full build for this. At the end of the build, I've got these 
       4 files in the 
       ../mozilla/xpcom/reflect/xptcall/src/md/unix directory:
    
       xptcinvoke_asm_sparc_solaris_SUNW.o  xptcstubs_asm_sparc_solaris.o
       xptcinvoke_sparc_solaris.o           xptcstubs_sparc_solaris.o
    
    
   2.  4.2 build - picks up xptcinvoke_asm_sparc_solaris_GCC.s
   
       I only did a gmake in the directory mentioned in #1 above.
       At the end of the build, I've got these 4 files in the directory:
       
       xptcinvoke_asm_sparc_solaris_GCC.o   xptcstubs_asm_sparc_solaris.o
       xptcinvoke_sparc_solaris.o           xptcstubs_sparc_solaris.o
    
   3.  GCC build - picks up xptcinvoke_asm_sparc_solaris_GCC.s

       The 4 files in the directory after the build is the same as #2 above.
    
   So it appears that you'll have to apply your fix to yet another .s file for 
   4.2/GCC build."

----

Taking all of this into account, I've generated yet another set of diffs for
the various files. This can be found at:

  http://bugzilla.mozilla.org/showattachment.cgi?attach_id=7232

If you (the Mozilla folks) are agreeable, I'd like to check these changes in, 
when leaf allocates me a CVS login id for the CVS contributor forms I submitted
about ten days ago. Leaf?

[richb - 5th April 2000]
I've now got a CVS checkin account (thanks Leaf!). Apologies for the
niave question, but can somebody tell me please how I now proceed to
check these changes in? Do I have to get approval from two reviewers
(the module owner)? Thanks.
rich... you should talk to your voucher (george, i think) about what the
development rules are.

Basically, look at the top of tinderbox
http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey
for the most current policy.

We're in lockdown mode right now, so i'm not going to be allowing checkins
unless they are regression fixes, memory leak fixes, or other entropy-fighting
checkins.

basically, you need to have changes thoroughly tested on at least one platform,
preferrably more, and reviewed by someone with some expertise in the area of
code that you're touching.
Leaf overstates somewhat, I think, in that 
- testing your changes on more than one platform would be somewhat non-sensical,
and
- we have a tradition of taking clearly-platform-specific patches during
milestone lockdown, because they present a very low risk to all other platforms
and they tend to be for platforms that didn't work in the first place

(As a general description of how we handle milestone lockdown, though, his blurb
is predictably great.)
[richb - 5th April 2000]
Okay. Thanks Leaf/Mike! I've just sent some mail to jband asking who the
reviewer should be (Roger?). When /if I get the go ahead from him, I'll
check the changes back. 
*** Bug 8373 has been marked as a duplicate of this bug. ***
Looks cool - let's get it in, r=rogerl
[richb - 7th Apr 2000]
Fix checked in for M15.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
No longer blocks: 17432
No longer blocks: 17907
You need to log in before you can comment on or make changes to this bug.