Closed Bug 151577 Opened 23 years ago Closed 22 years ago

Mozilla (> 0.9.7) doesn't run on Familiar (ARM) Linux

Categories

(NSPR :: NSPR, defect)

Other
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dr, Assigned: wtc)

Attachments

(1 file, 1 obsolete file)

I'm trying to run Mozilla on the iPaq, which uses an ARM architecture. The operating system on my iPaq is Familiar 0.5.2 (http://familiar.handhelds.org). I have been successful in cross-compiling Mozilla for ARM, but there is a problem with more recent releases that prevents the compiled code from executing.
Facts:
- All my builds are embed builds (created by embedding/config/Makefile). I execute using ./run-mozilla ./TestGtkEmbed.
- I have been able to get the 0.9.6 release to run successfully on the iPaq, and I'm working on putting together some documentation on how I managed to do it.
- The 1.0 release does not work. I get exceptions from nsThread.cpp when I execute the program. The gtk UI comes up after that, but when I try to load a page, I die with a segfault.
- The 0.9.9 release doesn't work either (I tried it because it was the first release to include freetype support). In this case, I don't get the exception from nsThread, but the program seems to die in the same way.
I'm having trouble tracking down the actual location of the crash, because gdb complains about not being able to grok how dlls are loaded. What I'm going to try to do instead, is build 0.9.7 (and maybe 0.9.8) to nail down a timeframe in which the problem started occurring. There aren't many changes committed to XPCOM and NSPR threads during that time, so (assuming the problem does lie in threading code) it might help us see what the cause of the trouble is.
One other quick comment: Mozilla 0.9.6 works on the iPaq with no code modification at all, besides the patch in bug 33364. That is, the code is completely portable. The only trick was getting my environment and .mozconfig set up correctly. (Also, CC'ing some folks who've been involved with mozilla/ARM in other bugs. Apologies if you don't want to be added here - just fishing for people who might have looked at this already).
First, look at nspr and xpcom changes in between 0.9.6 and 0.9.7.
Two things:
1. Ignore my comment about bug 33364. The change I needed to make was similar, in the file xpcom/reflect/xptcall/src/md/unix/Makefile (change armv4l and arm32 to arm%).
2. Mozilla 0.9.7 seems to work. I'll try now with 0.9.8.
One more thing about 0.9.7. I get the following assertion:
    nsNativeComponentLoader not thread-safe: 'owningThread == NS_CurrentThread()', file nsDebug.cpp, line 528
It doesn't seem to cause any problems here, though.
Did you take a look at bug 9519, bug 87965 and bug 106864? I only have experience with a 'full system' build of mozilla (starting from 0.9.8) for arm (and the result is a fully functional mozilla!). The assertion in comment 4 should be harmless.
Question: which compiler do you use to compile mozilla? I've experienced segmentation faults with a 'plain' gcc-3.0.4 because of 'floating point' registers being incorrectly reloaded (or at least, their emulation).
Jeroen: I'm using a build of gcc-2.95.3 which I downloaded from arm.linux.org.uk (ftp://ftp.arm.linux.org.uk/pub/armlinux/toolchain). Also, I looked at those bugs you mentioned - bug 106864 might be the culprit for me. I'll try applying your patch there and see what happens... Also, I've finished the first round of my search. 0.9.7 works for me, and 0.9.8 does not. So that means I've narrowed down the segfault to code changed between December 21, 2001 and February 4, 2002. I'm going to go hunt through bonsai for relevant changes committed between those dates.
The patch in bug 106864 doesn't fix 0.9.8 for me. It fixes the strings problem it was intended to fix, but that never caused me any grief in my embed build. Jeroen: Would you be able to try creating an embed build of your own, to see if it works? (To do this, cd into embedding/config and make. This results in unstripped binaries located in dist/Embed). I'm wondering if the problem is particular to embed builds...
->NSPR. This doesn't look like it's in XPCOM threads, judging from checkins to mozilla/xpcom/threads between 12/21/2001 and 2/4/2002.
Assignee: rpotts → wtc
Component: Threading → NSPR
Product: Browser → NSPR
QA Contact: rpotts → wtc
Version: other → 4.2
Looks like there are several checkins to NSPR threads contributed by jeff@NerdOne.com during the 0.9.7 - 0.9.8 timeframe. (CC'ing Jeff). Jeff, any idea what I ought to be looking for, or why I might be experiencing this problem?
Summary: Mozilla doesn't run on ARM arch → Mozilla doesn't run on Familiar (ARM) Linux
sorry, i can't be of help. i was tracking down a bunch of memory leaks. and i was only running on win32.
I've tested mozilla-1.0 release sources, without any patches, using gcc-3.0.4 (containing a patch concerning floating point reloads) and glibc-2.2.5. This is on an xscale based Xingu board. TestGtkEmbed seems to work fine: I succeeded in displaying slashdot when exporting my X windows to a linux pc, and the 'streams' test works fine on the platform itself.
*Whooff* <-- the sound of me throwing my hands in the air, being stumped.
Well, TestGtkEmbed works for me on my netwinder, mostly. Built on a Debian/ARM woody system (gcc-2.95.4 Debian prerelease/glibc-2.2.5-6). The only problem I face is that PSM does *not* work. Any https:// site will crash mozilla with:
    Assertion failure: lock != NULL, at ptsynch.c:206
The problem appears to be in NSPR, in nsprpub/pr/src/misc/prdtoa.c:Balloc. The PR_Lock seems to go off without a hitch, but the PR_Unlock dies. Jeroen, does PSM work on your build? If so, what gcc/glibc patches for ARM did you use?
That assertion failure means the 'freelist_lock' in prdtoa.c is a null pointer. Since PR_Lock goes off without a hitch but PR_Unlock dies, and _PR_CleanupDtoa is the only function that sets freelist_lock to NULL, this implies that _PR_CleanupDtoa is called, which in turn implies that PR_Cleanup is called. This conclusion doesn't make sense to me because PR_Cleanup is only called before an application terminates and only some applications call PR_Cleanup.
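The sequence inferred above can be replayed in a minimal C sketch, for illustration only. It does not reproduce prdtoa.c: sketch_lock_t and the sketch_* functions are hypothetical names, and a plain flag stands in for a real PRLock. It only shows how a lock call can succeed and the matching unlock can then find a null pointer if a cleanup routine runs in between.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock object; a flag replaces the real mutex. */
typedef struct {
    int held;
} sketch_lock_t;

/* Plays the role of freelist_lock: one globally shared lock pointer. */
static sketch_lock_t *freelist_lock = NULL;

/* Stand-in for library initialisation: allocate the lock once. */
void sketch_init(void)
{
    if (freelist_lock == NULL)
        freelist_lock = calloc(1, sizeof *freelist_lock);
}

/* Stand-in for the cleanup path, the only place that resets the pointer. */
void sketch_cleanup(void)
{
    free(freelist_lock);
    freelist_lock = NULL;
}

/* Returns 0 on success, -1 if the lock pointer is null. */
int sketch_lock(void)
{
    if (freelist_lock == NULL)
        return -1;
    freelist_lock->held = 1;
    return 0;
}

/* Returns 0 on success, -1 where a debug build would assert "lock != NULL". */
int sketch_unlock(void)
{
    if (freelist_lock == NULL)
        return -1;
    freelist_lock->held = 0;
    return 0;
}

/* Replays lock -> cleanup -> unlock. Returns 1 if the lock succeeded and the
 * unlock then saw a null pointer, 0 otherwise. */
int sketch_replay(void)
{
    int locked, unlocked;

    sketch_init();
    locked = sketch_lock();
    sketch_cleanup();            /* the lock is torn down in between */
    unlocked = sketch_unlock();
    return locked == 0 && unlocked == -1;
}
```

Calling sketch_replay() from a small test program should return 1. The sketch says nothing about what would actually run the cleanup path in the middle of a call on ARM, which is the open question in this comment.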
I don't have PSM enabled... I'll try to spin a build where it is enabled...
I don't have PSM enabled either, since it's a pain to cross-compile. My major problem is just that the first page load results in a segfault (in releases after 0.9.7). I'm having some other difficulties as well: one I'm working on right now is, in my "working" builds, I can't seem to submit forms. That's a different issue, though.
Blocks: 152955
I've attached a console log in bug 152955, which might help give a bigger-picture view of everything that's going wrong in my 0.9.7 build.
Summary: Mozilla doesn't run on Familiar (ARM) Linux → Mozilla (> 0.9.7) doesn't run on Familiar (ARM) Linux
Dan, maybe you should focus your effort on mozilla-1.0, as this source tree contains all patches which are needed for arm. Then we could try to focus on the differences you and I seem to get... When you get a 'segfault' for the 1.0 release, is it inside mozilla or inside one of the libraries? (Ever tried to debug with gdb?) Do you get different results when exporting the display to a remote pc and not exporting the display? Did you try with gcc-3.0.4? (Don't forget to add something like <http://gcc.gnu.org/ml/gcc-patches/2002-03/msg00248.html>. Note: this was not the final patch going into cvs, but it fixes the problem.)
Jeroen: I wish I could use mozilla 1.0! It's pretty embarrassing at this point, filing bugs against 0.9.7 :) As for your suggestions:
- I've tried to debug with GDB. There are two difficulties with that. One is that the iPaq has such limited resources that an unstripped build won't fit! But I can get around that by stripping all the binaries except those I need to debug. The other, more major problem, is that threads don't work in Familiar Linux's build of GDB (see http://handhelds.org/bugzilla/show_bug.cgi?id=161). So when I tried to track down where the segfault was, I found I was unable to trace into any threads. Pretty useless, huh?!
- I get the same results when exporting the display to my desktop as I do when displaying on the iPaq screen. Tried that one already :)
- As for gcc-3.0.4... That was a sad story. I tried very hard to build it as a cross-compiler, but never managed to get it to completely build. So after two weeks of banging my head against that problem, I gave up and downloaded a pre-built 2.95.3 from ftp://ftp.arm.linux.org.uk/pub/armlinux/toolchain. Do you have your 3.0.4 cross-compiling, or is it a native compiler running on a desktop ARM machine? If it's an i386->ARM cross-compiler, I'd be very grateful if you could send me a copy! Also, regarding the gcc floating point patch: is this the final patch that went in?: http://gcc.gnu.org/ml/gcc-patches/2002-03/msg00829.html
Anyway, there is definitely *some* problem that cropped up between 0.9.7 and 0.9.8. But if you have 1.0 running on ARM, then I'm really pretty stumped. Maybe it's a mozilla bug, maybe it's a compiler bug, maybe it's a libraries bug...
For your curiosity, here's the gdb problem I'm having:
    (gdb) run
    Starting program: /usr/local/Embed/./TestGtkEmbed
    warning: Unable to find dynamic linker breakpoint function.
    GDB will be unable to debug shared library initializers
    and track explicitly loaded dynamic code.
    ...
    Cannot access memory at address 0x40016df0
This actually appears to be different from the thread issue I mentioned (which I also recall having seen) but, dollars to doughnuts, it's again a gdb bug, and not mozilla's problem. This happened using mozilla 1.0 (source release), on gdb 5.0 (5.0-3-fam1).
Woohah! I managed to get a new gdb from http://handhelds.org/bugzilla/show_bug.cgi?id=161. I ran Mozilla in it, and lo and behold, the Stack Trace of Justice! Most of the binaries here are stripped. (Only TestGtkEmbed and libgtkembedmoz.so are unstripped). Now that I can see what DLL's I'm in, I'll crank out another stack trace with those binaries unstripped. Also, I'm able to "cont" past this first crash, and experience several more:
    loading url sleepy.at
    Program received signal SIG32, Real-time event 32.
    0x4047c82c in sigsuspend () from /lib/libc.so.6
    (gdb) cont
    Continuing.
    Program received signal SIG32, Real-time event 32.
    0x4047c82c in sigsuspend () from /lib/libc.so.6
    (gdb) cont
    Continuing.
    Warning: MOZILLA_FIVE_HOME not set.
    Program received signal SIG32, Real-time event 32.
    0x4047c82c in sigsuspend () from /lib/libc.so.6
    (gdb) cont
    Continuing.
    open_uri_cb http://sleepy.at/
    load_started_cb
    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x4050a5d4 in write () from /lib/libc.so.6
Each of the first three SIG32s involves necko trying to start a thread. The first (attached) is in nsIOService::Init, the next is in nsDNSService::Init, and the last is in nsHttpHandler::Init (trying to start a timer). I'll attach each full stack trace when I have unstripped Necko, XPCOM and NSPR.
What is the output of "ldd TestGtkEmbed"? By the way, the stack trace doesn't look right.
    #0 0x4047c82c in sigsuspend () from /lib/libc.so.6
    #1 0x40436244 in pthread_getconcurrency () from /lib/libpthread.so.0
    #2 0x404357d0 in pthread_create () from /lib/libpthread.so.0
    #3 0x4041130c in PR_Select () from ./libnspr4.so
    #4 0x404115dc in PR_CreateThread () from ./libnspr4.so
PR_CreateThread does not call PR_Select. So it's not clear how much we can trust this stack trace.
Agh, yuck! Sorry about the extra newlines. Minicom isn't too friendly with the X clipboard. "ldd" returns what you'd expect it to:
    root@midget2 /usr/local/Embed -> ldd ./TestGtkEmbed
    ./TestGtkEmbed:
    libgtkembedmoz.so => ./libgtkembedmoz.so (0x4001f000)
    libgtksuperwin.so => ./libgtksuperwin.so (0x40073000)
    libdl.so.2 => /lib/libdl.so.2 (0x40081000)
    libmozjs.so => ./libmozjs.so (0x4008c000)
    libxpcom.so => ./libxpcom.so (0x40178000)
    libplds4.so => ./libplds4.so (0x403b4000)
    libplc4.so => ./libplc4.so (0x403bf000)
    libnspr4.so => ./libnspr4.so (0x403cc000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x4042c000)
    libc.so.6 => /lib/libc.so.6 (0x4044a000)
    libgtk-1.2.so.0 => /usr/lib/libgtk-1.2.so.0 (0x40566000)
    libgdk-1.2.so.0 => /usr/lib/libgdk-1.2.so.0 (0x406db000)
    libgmodule-1.2.so.0 => /usr/lib/libgmodule-1.2.so.0 (0x4071c000)
    libglib-1.2.so.0 => /usr/lib/libglib-1.2.so.0 (0x40727000)
    libXi.so.6 => /usr/X11R6/lib/libXi.so.6 (0x40759000)
    libXext.so.6 => /usr/X11R6/lib/libXext.so.6 (0x40768000)
    libX11.so.6 => /usr/X11R6/lib/libX11.so.6 (0x4077e000)
    libm.so.6 => /lib/libm.so.6 (0x40855000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
I'm about to attach a gdb log -- maybe the stack trace from that will make more sense...
libpthread.so is before libc.so, so the library linking order is correct.
Unfortunately I wasn't able to get Necko unstripped also - not enough space in ROM on this damn iPaq (see also http://handhelds.org/bugzilla/show_bug.cgi?id=418)!! I'll see if I can manage to squeeze it on somehow anyway.
Attachment #88709 - Attachment is obsolete: true
wtc: Any particular variables I should be peeking into? Or does everything look correct now? ... Should I try again with necko unstripped?
Is it possible that this value isn't kosher on my architecture? (ptthread.c, line 357):
    if (0 == stackSize) stackSize = (64 * 1024); /* default == 64K */
I'm guessing this is probably fine because, looking in bonsai at changes to ptthread.c between 21 Dec 2001 and 4 Feb 2002, this line was not changed. I just snooped into this because seeing stackSize go from 0 to 65536 made me suspicious.
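One way to check whether 64K is acceptable on a given target is to compare it against the minimum thread stack size the C library reports. The sketch below does that with sysconf(_SC_THREAD_STACK_MIN) and the PTHREAD_STACK_MIN macro, both standard POSIX. The function names and SKETCH_DEFAULT_STACK are made up for this example and are not NSPR code, and nothing here says what Familiar's glibc actually reports.

```c
#include <limits.h>
#include <stddef.h>
#include <unistd.h>

/* Same value as the default in the quoted line; the macro name is hypothetical. */
#define SKETCH_DEFAULT_STACK ((size_t)(64 * 1024))

/* Mirrors the quoted logic: a requested size of 0 becomes the 64K default. */
size_t sketch_effective_stack(size_t requested)
{
    return requested == 0 ? SKETCH_DEFAULT_STACK : requested;
}

/* Returns the minimum thread stack size reported by the platform,
 * or 0 if neither sysconf nor PTHREAD_STACK_MIN provides one. */
size_t sketch_min_stack(void)
{
    long min = sysconf(_SC_THREAD_STACK_MIN);

    if (min > 0)
        return (size_t)min;
#ifdef PTHREAD_STACK_MIN
    return (size_t)PTHREAD_STACK_MIN;
#else
    return 0;
#endif
}

/* Returns 1 if size meets the reported minimum (or no minimum is known),
 * 0 if the platform reports a larger minimum. */
int sketch_stack_ok(size_t size)
{
    size_t min = sketch_min_stack();

    return min == 0 || size >= min;
}
```

Running sketch_stack_ok(sketch_effective_stack(0)) on the target itself, not on the cross-compile host, would show whether the default clears the reported minimum there. A passing check does not prove 64K is enough for the code that runs on the thread; it only rules out the size being rejected outright.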
Another shot-in-the-dark, which maybe Jeroen could answer: might something be wrong with memcpy? Wondering if bug 118135 introduced the problem... Seems highly unlikely, but I'm trying to guess what sorts of things might not be threadsafe.
Has anyone tried any other threaded apps on the ARM?
I must confess that I am confused about the stack traces :( Shouldn't you just ignore the SIG32? And what about the 'SIGTRAP': did you put a breakpoint in the gdk library? (I guess not, but that's what the stacktrace seems to indicate...)
Ok, I did a little more research into the signals I'm seeing, and you're right, I shouldn't be seeing any of them. They don't indicate any problem in threads or necko. References:
- http://www.mozilla.org/unix/debugging-faq.html (of course)
- http://sources.redhat.com/ml/gdb/2000-q1/msg00329.html
- http://sources.redhat.com/ml/gdb/2000-q1/msg00336.html
- http://sources.redhat.com/ml/bug-gdb/1999-10/msg00058.html
- http://www.advogato.org/person/drunen/ ("Seems pthreads is using this to change thread contexts, and gdb doesn't seem to get it.")
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0012.1/0232.html
etc. etc. etc.
Just using "prun" doesn't deal with the SIG32s, so I've added "handle SIG32 nostop noprint pass" to my .gdbinit. That works, of course, and gets me straight to the SIGTRAP. The SIGTRAP comes up every time, but in different places in the code. And no, I have not set any breakpoints in GDK, or anywhere else. Blizzard: I know you dealt with this problem a while back. Can you tell me how I'm supposed to get around it? That'll bring me a big step closer to the holy grail of a Useful Stack Trace. As for "other threaded apps on the ARM," Mozilla 0.9.7 does run for me (though with many bugs). I assume that answers the question :)
gdb should be handling the signals. This means that either gdb is screwed up or there's no thread debugging library on the arm that works. You shouldn't need that SIG32 crap anywhere.
Ok, gdb was screwed up: I needed a newer glibc to go with the homebrewed gdb. So I'm getting results now that seem definitive. But they're weird. Apparently I'm getting a segfault in nsCSSValue::GetUnit(). I don't have an unstripped content DLL, unfortunately, because I can't load it onto the iPaq (not enough room). But nsCSSValue::GetUnit() is a one-liner:
    nsCSSUnit GetUnit(void) const { return mUnit; };
The other weird thing here is that the function which supposedly calls GetUnit() is CSSStyleRuleImpl::MapRuleInfoInto(). That function should indirectly call GetUnit by way of one of the Map*ForDeclaration functions, but I can't see which one. Here's the stack I've got:
    Program received signal SIGSEGV, Segmentation fault.
    0x415d1760 in ?? () from /usr/local/mozilla/components/libgkcontent.so
    (gdb) shar content
    (no debugging symbols found)...Loaded symbols for /usr/local/mozilla/components/libgkcontent.so
    (gdb) where
    #0 0x415d1760 in nsCSSValue::GetUnit () from /usr/local/mozilla/components/libgkcontent.so
    #1 0x417b0724 in CSSStyleRuleImpl::MapRuleInfoInto () from /usr/local/mozilla/components/libgkcontent.so
    #2 0x417af038 in CSSStyleRuleImpl::MapRuleInfoInto () from /usr/local/mozilla/components/libgkcontent.so
    #3 0x41ac7c3c in nsRuleNode::WalkRuleTree () from /usr/local/mozilla/components/libgkcontent.so
    #4 0x41ac72cc in nsRuleNode::GetBorderData () from /usr/local/mozilla/components/libgkcontent.so
    #5 0x41ad1e70 in nsRuleNode::GetStyleData () from /usr/local/mozilla/components/libgkcontent.so
    #6 0x41b00ed4 in nsStyleContext::GetStyleData () from /usr/local/mozilla/components/libgkcontent.so
    #7 0x423e41e8 in ?? () from /usr/local/mozilla/components/libgklayout.so
I'm wondering if it might be a compiler bug relating to inlining...?
Ok, I spent some time and have what seems to be a working GCC 3.1 cross-compiler (with binutils 2.12.1, glibc 2.2.5). By "seems to be working," I mean that it builds "hello world" using namespaces and iostreams... Not exactly a rigorous test, and I suspect Mozilla will be significantly more taxing on the compiler. Anyway, to make a long story short, I'm trying to cross-compile Mozilla with GCC 3.1 now. We'll see what happens.
If you're using GCC 3.1 you might have to play with the name mangling in the xptstub code. You also might need to mess with the xptcall code since I'll bet the ABI has changed.
And just as a start, it would be a good idea to first try it at -O1 and only later at -O2, -O3 or -Os. (With gcc-3.0.4 I experienced one small glitch compiling nsDOMClassInfo at -O2, resulting in some XUL information not being available. Compiling this file at -O1 resolved the problem.)
Well, here are my results from yesterday: The build finished successfully (without --enable-optimize). The build also runs, to the extent that it ran with gcc 2.95.3 (I didn't have to hack any of the xpt stuff). But it dies on the first page-load, just like it did with the old gcc. Worse yet, I can't seem to get any remotely useful stack trace for the crash:
    Program received signal SIGSEGV, Segmentation fault.
    0x0015100c in ?? ()
    (gdb) where
    #0 0x0015100c in ?? ()
    Cannot access memory at address 0x0
My glibc configuration is:
    --host=arm-linux --enable-add-ons=linuxthreads
My gcc configuration is:
    --target=arm-linux --enable-languages=c,c++
These also include --prefix, --with-headers, and --with-local-prefix, of course. Jeroen: I looked at the configure bits you sent me... I'm wondering, should I also be using:
    --with-cpu=strongarm110 --without-fp --with-softfloat-support=internal --enable-threads=posix
or is that unnecessary for me?
--with-cpu=strongarm110 -> yes (don't try xscale binaries on an ipaq ;) )
--without-fp -> no: use --with-fp (default)
--with-softfloat-support=internal -> not important without softfloat (leave it out)
--enable-threads=posix -> yes: needed for c++ with multiple threads
The softfloat part is only useful if your _complete_ system is compiled for softfloat. The normal distributions (like familiar) let the kernel do the work of emulating floating point.
So, has anyone looked into the ptsynch assert in comment #14? I ran into some of the ARM Linux people, and got some debugging ideas from them. However, it was all for naught. Has anyone else gotten PSM working on the ARM, and if so, what was their toolchain setup? I have this hunch that this might be a glibc/linuxthreads bug. Potentially identical to bug #14263, maybe?
Hi Mark, I'm wondering if it might be best to open another bug for PSM on ARM. My trouble is simply that Mozilla post-0.9.7 doesn't work *at all*. Perhaps the bugs are related, but perhaps not... I don't think my problem is a toolchain issue, if that helps you... I've used gcc 2.95.3 and 3.1, and glibc 2.2.3 and 2.2.5. On the other hand, if you're seeing a problem in PR_Unlock then you might want to have a look at the changes that jeff@NerdOne.com contributed to stop threads from leaking memory. They were committed on 27 December 2001. See bug 96112, bug 96122, bug 96197, bug 96198 and bug 96199. I don't really have any idea if these are the cause of the problem, though. The code added all looks more or less straightforward. The only thing I'm wary about in any of the patches is the change from PR_MALLOC() to malloc() in some places in 96122, but that's only a gut reaction... I guess the only way to really test would be to back out the changes and see if it works. Oh, actually, here's an idea for you: Since Mozilla 0.9.7 works for me, perhaps you could try compiling that and seeing if PSM works in it. Regarding compilation of PSM, you should have a look at bug 104541.
Mark: I also should mention that there are a whole bunch of other changes that were committed to NSPR threads between 0.9.7 and 0.9.8 (December 2001 - February 2002). Those are all equally worth looking at, assuming that PSM in 0.9.7 works for you.
what is the current status on this? Is this something that is still an issue? (trying to help focus)
WORKSFORME under Familiar 0.7 (and 0.7.1). For more info, see http://www.mozilla.org/projects/minimo/
Of course, you have to have the right toolchain ;)
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → WORKSFORME
re: PR_dtoa, see bug 209814