Closed Bug 14263 Opened 25 years ago Closed 25 years ago

[DOGFOOD] Linux/Alpha: Assertion failure: 0 == rv, at ptsynch.c:168

Categories

(NSPR :: NSPR, defect, P3)

DEC
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: niles, Assigned: jband_mozilla)

References

Details

(Whiteboard: [PDT-])

Attachments

(2 files)

This bug prevents Mozilla-19990917 from running at all on Linux/Alpha.

I get:
> ./apprunner
nsNativeComponentLoader: autoregistering /home/niles/mozilla/dist/bin/components
nsNativeComponentLoader: autoregistering succeeded
nsUnixToolkitService: Unknown toolkit ' '.  Using 'gtk'.
nsUnixToolkitService: Using 'gtk' for the Toolkit.
NS_SetupRegistry() MOZ_TOOLKIT=gtk, WIDGET_DLL=libwidget_gtk.so,
GFX_DLL=libgfx_gtk.so
started appcores
GFX: dpi=100 t2p=0.0694444 p2t=14.4 depth=16
Using '/home/niles/mozilla/dist/bin' as the resource: base
initialized appshell
ProfileName : mozProfile
ProfileDir  : /home/niles/.mozilla/mozProfile
Initialized app shell component {4a85a5d0-cddd-11d2-b7f6-00805f05ffa5},
rv=0x00000000
Got the event queue from the service
Calling gdk_input_add with event queue
Assertion failure: 0 == rv, at ptsynch.c:168

when I try to run Mozilla on Linux/Alpha.

I've tried to run gdb on it, but I can't get GDB to use my
environment variables, so component registration fails.
I've tried "set environment" to no avail.
(Am I being stupid? How do I do this?)  Perhaps this
is a bug in GDB for Linux/Alpha.  This is what I get:
(gdb) run
Starting program: /home/niles/mozilla/dist/bin/./simplebrowser
Assertion: "Cannot obtain unix toolkit service." (rv == NS_OK) at file
../../../../webshell/tests/viewer/nsSetupRegistry.cpp, line 285
Break: at file ../../../../webshell/tests/viewer/nsSetupRegistry.cpp, line 285
NS_SetupRegistry() MOZ_TOOLKIT=error, WIDGET_DLL=error, GFX_DLL=error
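
One thing that may be worth trying for the environment question above: gdb normally starts the program through the shell named in $SHELL, so a shell startup file can reset variables before the program sees them. A minimal sketch, assuming a Bourne-style shell, is to write the settings into a gdb command file and let gdb print the environment it will actually use. The helper name write_gdb_env_script, the file name env.gdb, and the variable names in the usage comment are all assumptions for illustration; the report does not say which variables are needed.

```shell
# Sketch: write a gdb command file that sets variables inside gdb itself.
# write_gdb_env_script is a hypothetical helper name.
write_gdb_env_script() {
    out=$1
    shift
    : > "$out"                      # start with an empty command file
    for pair in "$@"; do
        name=${pair%%=*}            # text before the first '='
        value=${pair#*=}            # text after the first '='
        printf 'set environment %s = %s\n' "$name" "$value" >> "$out"
    done
    # Ask gdb to print the environment the debugged program will start with.
    printf 'show environment\n' >> "$out"
}

# Usage (variable names are assumptions, not taken from the report):
#   write_gdb_env_script env.gdb "MOZILLA_FIVE_HOME=$PWD" "LD_LIBRARY_PATH=$PWD"
#   gdb -x env.gdb ./simplebrowser
```

If the values shown by gdb are right but the program still sees different ones, the shell startup file is a likely suspect.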


I put in a bunch of print statements and I've found that
this problem happens when PR_Lock(lock) gets called with
lock set to some crazy pointer value that seems well outside
the range of the program space.  But without a debugger I
can't tell where it's being called from.  Obviously, this pretty
much indicates it's not an NSPR bug, but I didn't know where else
to send it.
Assignee: srinivas → chofmann
Summary: Linux/Alpha: Assertion failure: 0 == rv, at ptsynch.c:168 → Linux/Alpha: Assertion failure: 0 == rv, at ptsynch.c:168
Assigning the bug to Chris Hofmann.
Chris, can you please re-assign this appropriately?
If I run mozilla under the debugger I get the behavior listed in Bug #14259.
I would not mark this as a duplicate yet, as I believe there are two separate bugs
happening here.
Assignee: chofmann → dp
niles, can you try a build closer to current?
dp might be able to see something in the stack trace, but let's
see if it has already been fixed first.
Oh, I guess I should have made it clear that I just tried it with yesterday's
nightly build and it still behaved the same.  I'm not sure how to give you
more info, since if I run it under gdb it behaves like Bug #14259.
I believe this is an SMP+(create .mozilla config) bug.  I got the
exact same problem when I tried to run the M10 binary on an
SMP x86 machine with no .mozilla directory.  I rebooted in non-SMP
mode and it ran fine.  Once the .mozilla directory was present
it ran fine in SMP mode too!  Do you have any Linux/SMP machines which
you can test M10 with?  Make sure you delete the .mozilla directory.
Please post back with the results.  This seems like a more serious bug
than I first thought if it affects x86/SMP/Linux as well as Alpha/SMP/Linux.
It seems that the threads are getting out of order.
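
The reproduction recipe above (no .mozilla directory, then start the browser on an SMP machine) can be sketched as a small sh helper. It moves the directory aside instead of deleting it, so an existing profile is not lost. That choice, the helper name run_without_profile, and the backup suffix are assumptions for illustration, not part of the report.

```shell
# Sketch: run a command with no ~/.mozilla directory present.
# run_without_profile is a hypothetical helper name.
run_without_profile() {
    profile_dir="$HOME/.mozilla"
    if [ -d "$profile_dir" ]; then
        # Keep the old profile under a different name rather than deleting it.
        mv "$profile_dir" "$profile_dir.bak.$$"
    fi
    # Run the command given as arguments; its exit status is returned.
    "$@"
}

# Usage: run_without_profile ./apprunner
```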
Status: NEW → ASSIGNED
Summary: Linux/Alpha: Assertion failure: 0 == rv, at ptsynch.c:168 → [DOGFOOD] Linux/Alpha: Assertion failure: 0 == rv, at ptsynch.c:168
Target Milestone: M12
Could you attach the xpcom log for the failing case? Maybe we can see where things
get out of hand. To get a log:

setenv NSPR_LOG_MODULES nsComponentManager:5
setenv NSPR_LOG_FILE xpcom.log
apprunner

<now you should have a log file xpcom.log> For every new run, please delete the
log file.
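
The commands above are csh syntax. For Bourne-style shells (sh, bash), where setenv is not available, the same setup can be sketched roughly as follows. The helper name run_with_xpcom_log is hypothetical; the variable names and values are the ones given above.

```shell
# Sketch for sh/bash users; run_with_xpcom_log is a hypothetical helper name.
run_with_xpcom_log() {
    # Delete any log left over from a previous run, as requested above.
    rm -f xpcom.log
    # Set the two NSPR logging variables only for this one command.
    env NSPR_LOG_MODULES=nsComponentManager:5 NSPR_LOG_FILE=xpcom.log "$@"
}

# Usage: run_with_xpcom_log ./apprunner
```

Passing the variables through env keeps them out of the rest of the shell session, so later runs are not logged by accident.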
1024[12010cf00]: nsComponentManager:
FindFactory({be761f00-a3b0-11d2-996c-0080c7cb1081})
1024[12010cf00]:                found (null) as 120120ff0 in factory cache.
1024[12010cf00]:                Factory CreateInstance() succeeded.
1024[12010cf00]:                Factory CreateInstance() FAILED.

This part sounds scary. Why does Factory CreateInstance() FAIL? Why are there
two Factory CreateInstance() outputs one after another?

Thanks for the log output.
Whiteboard: [PDT-]
Current thinking is that we won't be able to support 64 bit processors by beta 1.
This is silly: 64-bit Mozilla is built and run by many members of the Mozilla
community every day.  Bugs are found and fixed, usually based on patches that
the 64-bit platform champions submit.  Given NSPR, there is little excuse for
breaking 64-bit ANSI-C platforms.  But as dp points out, this bug is likely a
thread safety problem, not a 64-bit bug.

/be
DP had the following clarification, saying that it is not just an Alpha (64bit)
bug, but rather a multi-cpu problem.  I think the position of the PDT was that
this was not critical for 99.99% of the inhouse use.  I would certainly agree
that this is a crasher to handle by FCS, and as soon as possible in the beta
cycle, but it will not stop the bulk of the day-in-day-out dogfood use.  If
Brendan thinks that the threading infrastructure is horked, then we have a
porkjockey problem... but otherwise this seems like a more obscure bug than the
many crashers that we have categorized as dogfood.  The following is DP's email
commentary:

Just wanted to give you the full scoop on this bug. This isn't just
alpha. This is for multiprocessor linux too. This is a symptom of
threading problems. Charlie Manske said he sees freezes on his multicpu
Windows machine. Threading problems first show up on multicpu machines. They
affect others under weird circumstances.

1. This is a symptom of incorrect threading happening. Won't know where
it will bite us.

2. We know multicpu linux machines will not eat dogfood.

(1) is the reason I marked it dogfood.
xpconnect isn't being built as a component on alpha.  The pthread problem is
because JS isn't getting initialized properly.  I've attached a patch that fixes
my box here.
Blocks: 13601
*** Bug 14259 has been marked as a duplicate of this bug. ***
Assignee: dp → jband
Status: ASSIGNED → NEW
xpconnect isn't a component. msw@gimp.org claims that if it becomes one, this bug
is solved.

Thanks a zillion to msw@gimp.org (I am Ccing you on this bug)
I've checked in the Makefile.in patch to build xpconnect as a component.  I've
had another Linux/Alpha user confirm that his build from after the fix works.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
I'm not seeing these problems anymore in my Alpha builds.  Assume fixed?
Target Milestone: M12 → ---