Closed Bug 102113 Opened 23 years ago Closed 23 years ago

nsCompressedCharMap crashes during startup on 64bit Solaris.

Core :: Layout, defect, critical (Sun, Solaris)
Status: RESOLVED FIXED
Target Milestone: mozilla0.9.6
Reporter: pavlov, Assigned: bstell
Keywords: 64bit
Attachments: 1 file, 1 obsolete file

stack trace:
=>[1] nsCompressedCharMap::SetChar(this = 0xffffffff7fff4d10, aChar = 338U),
line 156 in "nsCompressedCharMap.cpp"
  [2] InitGlobals(), line 810 in "nsFontMetricsGTK.cpp"
  [3] nsFontMetricsGTK::Init(this = 0x10050a490, aFont = STRUCT, aLangGroup =
0x100407380, aContext = 0x100441250), line 1035 in "nsFontMetricsGTK.cpp"
  [4] nsFontCache::GetMetricsFor(this = 0x100506bc0, aFont = STRUCT, aLangGroup
= 0x100407380, aMetrics = (nil)), line 568 in "nsDeviceContext.cpp"
  [5] DeviceContextImpl::GetMetricsFor(this = 0x100441250, aFont = STRUCT,
aLangGroup = 0x100407380, aMetrics = (nil)), line 233 in "nsDeviceContext.cpp"
  [6] ComputeLineHeight(aRenderingContext = 0x100508910, aStyleContext =
0x1004ffd18), line 2140 in "nsHTMLReflowState.cpp"
  [7] nsHTMLReflowState::CalcLineHeight(aPresContext = 0x10040c290,
aRenderingContext = 0x100508910, aFrame = 0x1004ffd80), line 2182 in
"nsHTMLReflowState.cpp"
  [8] nsBlockReflowState::nsBlockReflowState(this = 0xffffffff7fffa140,
aReflowState = STRUCT, aPresContext = 0x10040c290, aFrame = 0x1004ffd80,
aMetrics = STRUCT, aBlockMarginRoot = 4194304), line 156 in
"nsBlockReflowState.cpp"
  [9] nsBlockFrame::Reflow(this = 0x1004ffd80, aPresContext = 0x10040c290,
aMetrics = STRUCT, aReflowState = STRUCT, aStatus = 0), line 693 in
"nsBlockFrame.cpp"

I am seeing this on a build on Solaris 8 built with Forte 6U2 with -xarch=v9
In LXR, there's nothing at line 156 in the current version of nsCompressedCharMap.cpp
http://lxr.mozilla.org/seamonkey/source/gfx/src/nsCompressedCharMap.cpp

Which version of Mozilla are you using?
it ends up being line 171 because of the license changes.  there haven't been any
other changes to the file... I will update my tree, though, so my line numbers
will be right.
Pav: is sheep a 64 bit system?

If not is there a system I can build/debug on?
yeah, sheep (can be) a 64bit system.

add /opt/64bit/bin at the beginning of PATH and /opt/64bit/lib to the beginning
of LD_LIBRARY_PATH (this is where I installed 64bit glib/gtk/libIDL libraries on
sheep)
then set CC to "cc -xarch=v9" and CXX to "CC -xarch=v9" and ASFLAGS="-xarch=v9"
run configure as you normally would, and build.. when it is done, you'll have a
64bit build.  dbx/workshop work as normal.
okay, made the indicated changes and I have started a build
It seems to be failing to find a 64 bit thread locking routine.

rm -f libmozjs.so
CC -xarch=v9 -I/usr/openwin/include -mt  -DDEBUG -DDEBUG_ -DTRACING -g -G
-Qoption ld -z,muldefs -h libmozjs.so -o libmozjs.so  jsapi.o jsarena.o
jsarray.o jsatom.o jsbool.o jscntxt.o jsdate.o jsdbgapi.o jsdhash.o jsdtoa.o
jsemit.o jsexn.o jsfun.o jsgc.o jshash.o jsinterp.o jslock.o jslog2.o jslong.o
jsmath.o jsnum.o jsobj.o jsopcode.o jsparse.o jsprf.o jsregexp.o jsscan.o
jsscope.o jsscript.o jsstr.o jsutil.o jsxdrapi.o prmjtime.o lock_SunOS.o  
-xildoff    -lm -lposix4 -ldl -lnsl -lsocket -L../../dist/bin
-L/builds/bstell/mozilla/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl 
-lsocket -ldl -lm    
ld: fatal: file lock_SunOS.o: wrong ELF class: ELFCLASS32
ld: fatal: File processing errors. No output written to libmozjs.so
is this build on top of another build or a fresh tree?  sun's cache might be
getting confused if this is on top of another build.  I would recommend doing a
'gmake -f client.mk distclean' on the tree.
I did a "gmake -f client.mk distclean" then a "./configure" before the build
Looks like lock_SunOS.s wasn't built using -xarch=v9.  Can you double check
ASFLAGS in config/autoconf.mk and make sure it was set.  Also, can you check the
compile line in the log to see how lock_SunOS.o was built?

config/autoconf.mk:

  ASFLAGS         =  -K PIC -L -P -D_ASM -D__STDC__=0
/usr/ccs/bin/as -o lock_SunOS.o -K PIC -L -P -D_ASM -D__STDC__=0  lock_SunOS.s
Ok, that's the problem.  Re-reading your previous comment, I don't see where
CC/CXX/ASFLAGS were passed into the build.  You need to either add those
settings to your mozconfig or pass them on the ./configure line.

Add:
CC="cc -xarch=v9"
CXX="CC -xarch=v9"
ASFLAGS="-xarch=v9"

to ~/.mozconfig

or

env CC="cc -xarch=v9" CXX="CC -xarch=v9" ASFLAGS="-xarch=v9" ./configure
okay, I finally have a build.
okay, I'll set "-g" in CFLAGS and CCFLAGS and see if I get debug symbols
bstell: mark it assigned if you agree to work on it.
Status: NEW → ASSIGNED
Here is the error:

  signal BUS (invalid address alignment) 

Looks like the array needs to be 64 bit aligned.

bstell, sorry I didn't catch this 64-bit impurity in review.  RISCs generally
require natural alignment.  The only way to ensure it is with a union around the
array of PRUint16s.  That'll cost an extra "u." member name and dot operator,
but no big deal.

/be
Blocks: 101793
Oh, and (of course) round up to a 0 mod 8 byte boundary when allocating from the
map -- is that going to waste too much space?  We have to 0 mod 4 align for
uint32 access, already.

/be
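The rounding mentioned above can be sketched as follows. This is only an illustration of "0 mod 4" versus "0 mod 8" allocation offsets; `RoundUpTo` and `PaddingFor` are hypothetical helpers, not functions from nsCompressedCharMap.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the patch): round a byte offset up to the
// next multiple of `alignment`, which must be a power of two.
constexpr std::size_t RoundUpTo(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Bytes of padding the rounding adds; at most alignment - 1 per allocation.
constexpr std::size_t PaddingFor(std::size_t offset, std::size_t alignment) {
  return RoundUpTo(offset, alignment) - offset;
}

// An offset of 36 bytes is already 0 mod 4 but needs 4 bytes of padding
// to become 0 mod 8.
static_assert(RoundUpTo(36, 4) == 36, "already 0 mod 4");
static_assert(RoundUpTo(36, 8) == 40, "padded to 0 mod 8");
```

Under these assumptions, moving from 4-byte to 8-byte rounding wastes at most 4 additional bytes per allocation.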
Attachment 52160 forces the map into 16 bit access. This stops the crash.

A complete fix would probably involve typing the memory arrays (both
stack and heap) to ALU_TYPE and doing casts for all the 16 bit accesses.

At present the 64 bit version runs, the profile manager looks okay, 
but the pages are completely blank. Not even images show. I believe 
this is unrelated but it prevents me from verifying this patch.
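A minimal sketch of what "16 bit access" means here, assuming a page of 16 16-bit entries as described elsewhere in this bug. `CopyPage16` and `kPageLen` are illustrative names, not the code in attachment 52160.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one page holds 16 entries of 16 bits each.
const std::size_t kPageLen = 16;

// Copy one page using only 16-bit loads and stores. Any pointer into a
// uint16_t array is suitably aligned for this, so neither pointer has to
// sit on a 32-bit or 64-bit boundary.
void CopyPage16(std::uint16_t* dst, const std::uint16_t* src) {
  for (std::size_t i = 0; i < kPageLen; ++i) {
    assert(dst[i] == 0 && "this page should be unused");
    dst[i] = src[i];
  }
}
```

The trade-off is more memory operations per page than an ALU-wide copy, in exchange for having no alignment requirement beyond that of the array itself.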

Target Milestone: --- → mozilla0.9.5
This close to 0.9.4 branch I'd prefer to get the simplest fix in.
local files display but remote URLs do not
failing to display remote URLs is probably a separate bug
When I click the off-line icon I get this error:

###!!! ASSERTION: Should have thread when shutting down.: 'Not Reached', file
nsSocketTransportService.cpp, line 733
###!!! Break: at file nsSocketTransportService.cpp, line 733
JavaScript error: 
 line 0: uncaught exception: [Exception... "Component returned failure code:
0x80004005 (NS_ERROR_FAILURE) [nsIIOService.offline]"  nsresult: "0x80004005
(NS_ERROR_FAILURE)"  location: "JS frame ::
chrome://communicator/content/utilityOverlay.js :: toggleOfflineStatus :: line
69"  data: no]
Comment on attachment 52160
patch; force all to use 16 bit access

r=pavlov
Attachment #52160 - Flags: review+
Comment on attachment 52160
patch; force all to use 16 bit access

Good for 0.9.5, sr=brendan@mozilla.org.

Please leave this bug open so we can look into wider memory accesses for 0.9.6 trunk.
Attachment #52160 - Flags: superreview+
Comment on attachment 52160
patch; force all to use 16 bit access

a=asa (on behalf of drivers) for checkin to 0.9.5.
Attachment #52160 - Flags: approval+
Did this get checked into the m0.9.5 branch?
If so, please move it to m0.9.6 if you want to keep it open.
Target Milestone: mozilla0.9.5 → mozilla0.9.6
Target Milestone: mozilla0.9.6 → mozilla0.9.7
from bobbell@zk3.dec.com in bug 108950:

> nsCompressedCharMap::SetChars accesses unaligned memory.  With the default
> settings on Tru64 UNIX, Tru64 UNIX correctly detects this, corrects it, and
> prints a warning message.  However, Tru64 UNIX can also be set to crash with
> this behavior, and it is technically incorrect.
> 
> The problem was discovered using a recent nightly build.  Lines 298 and 299 of
> nsCompressedCharMap.cpp are at fault.  They read:
>     NS_ASSERTION(page[i]==0, "this page should be unused");
>     page[i] = aPage[i];
> 
> page (from my crash dump) is a pointer on a four byte boundary.  This is
> because it is an offset into mCCMap, which is an array of 16-bit data types.  
> However, page is a pointer to ALU_TYPE, which on Tru64 UNIX is a 64-bit data 
> type.  Thus, page is not properly aligned.

What I do not understand is where the misalignment comes from. (I would 
appreciate anyone pointing out what I am missing or where I am mistaken).

The pages are each 16 shorts (32 bytes), so the page-to-page distance should 
preserve the ALU boundary alignment of the start of the map.

In the section around line 299, the code that accesses the page does so 
in ALU-sized groups, so it should keep the same ALU boundary alignment as 
the start of the page.

Doesn't malloc return memory that is aligned to the largest ALU size?
If not how could the code safely alloc space for the largest ALU?

Is the base CCMap address on a 4 byte boundary?
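The reasoning above can be checked with a small sketch: because a page is 32 bytes, every page address has the same remainder modulo 8 as the base of the map. So the pages are aligned only if the base is. The function names and the sample base addresses are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// From the comment above: each page is 16 shorts, i.e. 32 bytes.
constexpr std::size_t kPageBytes = 16 * sizeof(std::uint16_t);

// Address of page `n`, given the address of the start of the map.
constexpr std::uintptr_t PageAddress(std::uintptr_t base, std::size_t n) {
  return base + n * kPageBytes;
}

// Remainder of an address modulo the access width; 0 means aligned.
constexpr std::uintptr_t Misalignment(std::uintptr_t addr, std::size_t width) {
  return addr % width;
}

// True if the first `pages` pages all share the base's remainder.
bool PagesShareBaseResidue(std::uintptr_t base, std::size_t pages,
                           std::size_t width) {
  for (std::size_t n = 0; n < pages; ++n) {
    if (Misalignment(PageAddress(base, n), width) !=
        Misalignment(base, width)) {
      return false;
    }
  }
  return true;
}
```

With a hypothetical base of 0x1000 every page is 8-byte aligned; with a base of 0x1004 every page is off by 4, which is the situation the last question above asks about.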

I'm seeing this problem again with a tip v9 build using WS5 .

(/opt/SUNWspro/WS5.0/bin/sparcv9/dbx) where
current thread: t@1
=>[1] nsCompressedCharMap::SetChar(this = 0xffffffff7fff5e80, aChar = 338U),
line 223 in "nsCompressedCharMap.cpp"
  [2] InitGlobals(), line 825 in "nsFontMetricsGTK.cpp"
  [3] nsFontMetricsGTK::Init(this = 0x1004e9ec0, aFont = STRUCT, aLangGroup =
0x100420340, aContext = 0x100441430), line 1050 in "nsFontMetricsGTK.cpp"
  [4] nsFontCache::GetMetricsFor(this = 0x1004e5a50, aFont = STRUCT, aLangGroup
= 0x100420340, aMetrics = (nil)), line 631 in "nsDeviceContext.cpp"
  [5] DeviceContextImpl::GetMetricsFor(this = 0x100441430, aFont = STRUCT,
aLangGroup = 0x100420340, aMetrics = (nil)), line 266 in "nsDeviceContext.cpp"
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsHTMLReflowState.o"
dbx: warning: see `help pathmap'
  [6] ComputeLineHeight(0x1004e9340, 0x1004e0360, 0xffffffff7fffb104,
0xffffffff74d9ef8c, 0x0, 0xffffffff74d7a808), at 0xffffffff74e1521c
  [7] nsHTMLReflowState::CalcLineHeight(0x1004417b0, 0x1004e9340, 0x1004e03c0,
0x0, 0x0, 0x0), at 0xffffffff74e1550c
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsBlockReflowState.o"
  [8] nsBlockReflowState::nsBlockReflowState(0xffffffff7fffb038,
0xffffffff7fffb4b8, 0x1004417b0, 0x1004e03c0, 0xffffffff7fffb5c8, 0x400000), at
0xffffffff74d9ef8c
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsBlockFrame.o"
  [9] nsBlockFrame::Reflow(0xffffffff7fffb5c8, 0x1004417b0, 0xffffffff7fffb5c8,
0xffffffff7fffb4b8, 0xffffffff7fffbc04, 0xffffffff755926c0), at 0xffffffff74d7ebb4
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsContainerFrame.o"
  [10] nsContainerFrame::ReflowChild(0x100489e90, 0x1004e03c0, 0x1004417b0,
0xffffffff7fffb5c8, 0xffffffff7fffb4b8, 0x0), at 0xffffffff74db3324
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsHTMLFrame.o"
  [11] CanvasFrame::Reflow(0x0, 0x0, 0xffffffff7fffbc04, 0xffffffff7fffb7f8,
0xffffffff7fffbc04, 0x2), at 0xffffffff74e055bc
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsBoxToBlockAdaptor.o"
  [12] nsBoxToBlockAdaptor::Reflow(0x1004e02d0, 0xffffffff7fffc920, 0x1004417b0,
0xffffffff7fffbbc0, 0xffffffff7fffcc98, 0xffffffff7fffbc04), at 0xffffffff75164ebc
  [13] nsBoxToBlockAdaptor::DoLayout(0x0, 0x0, 0x76c, 0x76c, 0x1,
0xffffffff75164588), at 0xffffffff7516435c
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsBox.o"
  [14] nsBox::Layout(0x1004e02d0, 0xffffffff7fffc920, 0xffffffff7fffbff8, 0x0,
0x0, 0x2), at 0xffffffff751553fc
dbx: warning: can't find file
"/space/home/cls/src/moz/main/obj-opt-ws-64-g/layout/build/nsScrollBoxFrame.o"


*** Bug 108950 has been marked as a duplicate of this bug. ***
In response to Brian Stell's comment #30:
> Doesn't malloc return memory that is aligned to the largest ALU size?
> If not how could the code safely alloc space for the largest ALU?
> Is the base CCMap address on a 4 byte boundary?

malloc() does indeed return memory that is aligned so that it can be used by any
data type (which here would make it 64-bit aligned).  However, here the memory
is not being explicitly malloc'ed.  The definition of the class
nsCompressedCharMap includes:
  protected:
    PRUint16 mUsedLen;   // in PRUint16
    PRUint16 mAllOnesPage;
    PRUint16 mCCMap[CCMAP_MAX_LEN];

Thus, mCCMap is only guaranteed to be aligned for PRUint16 access.

I believe what the Compaq cxx compiler is doing internally is aligning mUsedLen
on a 64-bit boundary (either intentionally or by chance), which puts mCCMap only
four bytes (two PRUint16's) later, which is not on a 64-bit boundary.

From some debug printfs I added:
  mCCMap @ 0x11fff30f4
  page_offset == 0x40
  page ==  0x11fff3174
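The printed values can be sanity-checked with a few compile-time assertions. One assumption is labeled in the code: `page_offset` is taken to count PRUint16 entries rather than bytes, which is the reading that makes the three printed numbers agree.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the debug output above.
constexpr std::uint64_t kMapAddr    = 0x11fff30f4ULL;  // mCCMap
constexpr std::uint64_t kPageOffset = 0x40;            // page_offset
constexpr std::uint64_t kPageAddr   = 0x11fff3174ULL;  // page

// Assumption: page_offset is an index in 16-bit entries (2 bytes each).
constexpr std::uint64_t ComputedPageAddr() {
  return kMapAddr + kPageOffset * sizeof(std::uint16_t);
}

static_assert(ComputedPageAddr() == kPageAddr,
              "page == mCCMap + page_offset entries");
static_assert(kMapAddr % 8 == 4, "map starts 4 bytes past a 64-bit boundary");
static_assert(kPageAddr % 8 == 4, "so the page is also 4 bytes off");
static_assert(kPageAddr % 4 == 0, "the page is still 4-byte aligned");
```

This matches the earlier report that page sits on a four byte boundary: fine for 32-bit access, not for a 64-bit ALU_TYPE.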
thanks for the insight

this I can fix
bobbell: could you try this patch?

thanks
Attachment #52160 - Attachment is obsolete: true
the patch was made in the gfx directory
Target Milestone: mozilla0.9.7 → mozilla0.9.6
Comment on attachment 57165
patch; use a union to make the C++ object align the map on the largest ALU

Why not give that ALU_TYPE dummy; member the canonical (and less insulting :-)
name, namely 'align'?

sr=brendan@mozilla.org in any event.

/be
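For readers without the attachment, here is a rough sketch of the union technique, not the patch itself. `ALU_TYPE` is assumed to be a 64-bit integer, `CCMAP_MAX_LEN` is given a placeholder value, and the union member names `u` and `align` follow the suggestions made in this bug rather than the checked-in code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::uint64_t ALU_TYPE;         // assumption: 64-bit ALU word
const std::size_t CCMAP_MAX_LEN = 256;  // placeholder, not the real value

// Layout as quoted earlier in this bug: the map follows two 16-bit members,
// so it is only guaranteed to be aligned for 16-bit access.
struct PlainLayout {
  std::uint16_t mUsedLen;
  std::uint16_t mAllOnesPage;
  std::uint16_t mCCMap[CCMAP_MAX_LEN];
};

// Union approach: the ALU_TYPE member is never read or written. It only
// raises the union's alignment requirement to that of ALU_TYPE, and the
// compiler inserts whatever padding is needed before the union.
struct AlignedLayout {
  std::uint16_t mUsedLen;
  std::uint16_t mAllOnesPage;
  union {
    std::uint16_t mCCMap[CCMAP_MAX_LEN];
    ALU_TYPE align;
  } u;
};

static_assert(offsetof(AlignedLayout, u) % alignof(ALU_TYPE) == 0,
              "the map now starts on an ALU_TYPE boundary");
```

The cost is the extra `u.` when naming the map (for example `obj.u.mCCMap[i]`) plus a few bytes of padding after the two 16-bit members.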
Attachment #57165 - Flags: superreview+
bstell, can you get r= and then mail drivers@mozilla.org for a= to check in for
0.9.6?  Thanks,

/be
okay, after only 4 hours I have a 64 bit build on sheep and it crashes without
the patch and runs with that patch. (This is so weird: both my linux systems
are still horked from the network upgrade :( so I'm using my Win98 system
to display the Solaris client.)
Comment on attachment 57165
patch; use a union to make the C++ object align the map on the largest ALU

I don't see any problem with the patch. r=shanjian
Attachment #57165 - Flags: review+
a=blizzard on behalf of drivers for 0.9.6
Keywords: mozilla0.9.6+
checked into 0.9.6 branch
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED