Closed Bug 127455 Opened 23 years ago Closed 22 years ago

Crashes with segmentation fault on some complex pages

Categories

(Core :: Graphics: ImageLib, defect)

Platform: DEC
OS: Linux
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
mozilla0.9.9

People

(Reporter: mach1, Assigned: jesup)


Details

(Keywords: 64bit, crash)

Attachments

(7 files, 1 obsolete file)

The browser crashes when loading some complex pages like www.netscape.com on a
build of the browser made on an Alpha Linux system on 23-FEB-2002. A
segmentation fault is reported, but no core dump is created. This problem seems
to occur frequently on browser builds performed at various times over the past 2
or 3 months. A build made on 3-FEB-2002 seems to work fine. I will attach a copy
of the error messages produced by a run of the browser.
This is a listing of all the error messages produced by the browser leading up
to the failure while loading www.netscape.com.
Severity: normal → critical
Keywords: crash
Mozilla/5.0 (X11; U; Linux alpha; en-US; rv:0.9.8) Gecko/20020205
worksforme (binary RPMs)

The error messages you see are really warnings and nothing to worry about
(except for "Segmentation fault"!).  Can you produce a stacktrace (did you build
with symbols)?

if you download the page and view it locally, does it still crash?  If so, you
might try to whittle the page down to whatever is causing Mozilla to be unhappy.
I am not well versed in the Mozilla build options. I constructed the browser as
follows:

cvs checkout -f mozilla/client.mk
cd mozilla
gmake -f client.mk

I then run the browser as follows:

cd dist/bin
./mozilla

The browser comes up with the usual www.mozilla.org/start/ page and functions as
expected on the simple pages of the site. When the url is set to
http://www.netscape.com the browser dies. In the past (months ago) a core dump
would usually be produced in a situation like this, but not now.

If you can give more explicit guidance how to get a stack trace in this
situation I will give it a try. I will see what I can discover about the failing
page.
the default build options include symbols.  you can use gdb (or ddd) to get a
stack trace.  Basically:

./mozilla -g http://www.netscape.com
-----------------------------------
(gdb) run
[do whatever you need to do to get a crash]
(gdb) bt              (backtrace)
---------------------------------------

more details (especially if you don't have lots of RAM) are available at:
http://www.mozilla.org/unix/debugging-faq.html
-> ImageLib
Assignee: asa → pavlov
Component: Browser-General → ImageLib
QA Contact: doronr → tpreston
tor, rjesup?  Are these your changes?
Could you do the following commands in gdb when you get the crash?

  where
  info locals
  p *this
  p *dest
This looks like my code from the location, though we'd need to see why we were
in there - tor modified some of the prologue code for DrawToImage.  I suspect
that there's some sort of boundary bug in my code with 64-bit, or possibly a
compiler bug.  Nothing obvious from inspection.

Please also do these commands:
p d
p s
p dst
p src
p aMask
p j
p aDWidth
p x
p alphaPixels
I spent more time looking at it.  I can't see any 64-bit issues from inspection.
mach1: It looks like the image was an ad in the original page.  could you attach
the image (netscape.gif)?  Mozilla doesn't like certain animated .gif images
(bug 94336).  You might have grabbed one of those, but we don't see it because
we get different images.
You might want to try the patch in bug 94336 and see if the problem goes away.
The page and image load and display fine using a browser built on 03-FEB-2002.
Mozilla is running into something similar to bug 119042. That bug involves a
64-bit gcc compiler error that results from mixing ints and unsigned ints in
array index expressions.  

--------------------
(gdb) info locals
src = (PRUint8 *) 0x220a2d1dc <Address 0x220a2d1dc out of bounds>
rgbPtr = (PRUint8 *) 0x120a2cfd0 ""

(note the extra 0x1 in the high 32 bits of src here)
-----------------------------------

I didn't see anything like this with simple pointer arithmetic in bug 119042,
but the result is the same.  I will attach a sample C++ program that exhibits
the bug.
-----------------------------
% g++ -o testbug testbug.cpp
% testbug
0 0x11ffff880
1 0x11ffff878
2 0x11ffff5a8
3 0x11ffff870
4 0x11ffff868
5 0x11ffff860
7 0x11ffffdf8
8 0x11ffffdf0
9 0x31ffffde8
10 0x51ffffde0
------------------
note that in this case, using an unsigned int does not help.
this is with gcc 2.96-87 (on Alpha-Linux, RedHat 7.1)
This bug would make Mozilla on Alpha virtually unusable.  You might switch to using:

src = src + rgbStride - 3*8*iterations;

simply to avoid the bug.  I reported bug 119042 on Red Hat's Bugzilla, but didn't
get much response.  I will ping them again when their Bugzilla comes back up
(currently down for maintenance).
I'll attach a patch as per your comment that should avoid the compiler bug. 
Shouldn't hurt perf on other platforms.  Thanks for the analysis!
Status: UNCONFIRMED → NEW
Ever confirmed: true
I did this for src, dst, and also alpha and dstAlpha (even though they didn't
show up in the gdb output as wrong).  This should cause no slowdown for other
compilers, I believe.  We'll need verification that this actually fixes the
problem, though.
Comment on attachment 71411 [details] [diff] [review]
Patch to sidestep Alpha compiler bug

r=pavlov
Attachment #71411 - Flags: review+
In the patch, shouldn't the line

    dst = dest + dest->mRowBytes - 3*8*iterations;

be

    dst = dst + dest->mRowBytes - 3*8*iterations;

?

Also, I would suggest that the behavior seen here is not a compiler bug. In the
expression:

   src += rgbStride - 3*8*iterations;

the type of 

   3*8*iterations

is (int) (32 bits on the Alpha) and the type of

   rgbStride - 3*8*iterations

is (unsigned int) (also 32 bits on the Alpha) according to operator precedence.

In the specific case at hand, 3*8*iterations is equal to 360 which can be
promoted to (unsigned int) type with no problem. Now

   rgbStride - 360

is equivalent to the unsigned difference 356 - 360 which, due to underflow,
results in a very large positive value. It is this value that is scaled and
added to src that ultimately results in the segmentation fault. The compiler
generates the proper code for this interpretation. If it is proper to expect

   rgbStride - 3*8*iterations

to be negative, then it seems that the proposed patch is required for any
platform because the type of:

   src + rgbStride

is (unsigned short *) and thus the subtraction is performed on a pointer value
rather than on an unsigned integer value.

However, I do believe the sample C++ test program does demonstrate a compiler
bug. It is an optimization failure in the fourth test case:

   src += rgbStride - 3*8*15;

The compiler generates code that first scales rgbStride and then subtracts the
constant 720 from the result. This avoids the underflow, but I would assert it
is an improper interpretation of the expression based on operator precedence.
Oops! In my last comment, I erred in my description of the code produced by
the compiler for the fourth test case. The generated code actually scales
rgbStride, adds it to src and then subtracts 720. It is the addition of
rgbStride * 2 to the address that actually avoids the underflow when the 720 is
subtracted.
dest -> dst
   that's correct, I fixed it last night

Thanks for the more detailed analysis - I had assumed that Alpha had 64-bit ints.

I've recoded this as follows to avoid any possible unsigned underflows:

        // at end of each line, bump pointers.  Use wordy code because of bug 127455
        dst = (dst - 3*8*iterations) + dest->mRowBytes;
        src = (src - 3*8*iterations) + rgbStride;
        alpha = (alpha - iterations) + alphaStride;
        dstAlpha += dest->mAlphaRowBytes;

I'll attach this.
Attached patch: Updated patch (Splinter Review)
Attachment #71411 - Attachment is obsolete: true
rgbStride - 3*8*iterations evaluates to -4.  g++ can handle this part.

the problem is that it has to promote that to an unsigned long int so that it
can do pointer arithmetic.  The compiler does not recognize that it might be
negative, and so you wind up adding a huge number instead of subtracting a small
one.

I can see this bug if I do 

src += expr;
or
src = src + (expr);

(where expr involves at least one computation of variables, including one
unsigned int).  So, you are right when you say that it depends on the order of
operations.
Randell: will the new code have any problems considering what I just figured
out?  If you have 

src += expr

and you know that expr will always be negative, you can fix it by doing

src -= -expr

if you go with:

src = src + expr1 - expr2 * expr3

then each addend (expr1 and expr2 * expr3) must be positive.
for comment #26, it should say:
(where expr evaluates to a negative number and involves at least...
The solution is to come to grips with the fact that if you expect the right side
to sometimes be negative then you should use signed arithmetic. Remember, all of
the assignment operators have lower precedence than almost everything else. In
the case at hand, the subtraction is performed as an unsigned operation because
rgbStride is unsigned. The compiler rightly promotes 3*8*iterations to be
unsigned as well, and the resulting type is rightly inferred by the compiler to be
unsigned. If the difference is intended to be negative sometimes, then signed
arithmetic must be forced. For example:

   src += (signed int) rgbStride - 3*8*iterations;

The compiler is doing exactly what it is being told to do in this case. Is what
it's being told to do what is meant? If rgbStride is not expected to be greater
than 2^31 then the cast should work fine.

printf ("%d", -rgbStride);
> -356
note that the negative unsigned int is properly converted to a signed int.
src += -rgbStride

causes the bug.  The problem is that it treats the signed RHS as unsigned for
the purpose of promoting it to a 64-bit long int.  That is silly.  Intel gcc does
not have this problem because it does not promote the RHS to 64 bits, not because
it makes different assumptions about signed/unsigned.

see also comments 40, 42, 64 and 66 of bug 119042
The -rgbStride example works because of how the unary - operator is defined and
the fact that the %d field specifier interprets its formal parameter as a signed
integer.

You are correct that the specific problem does not occur on an Intel platform
because it does not promote the result to 64 bits. My point is only that the
compiler is behaving properly based on the definition of the language. Specifically:

   src += rgbStride - 3*8*iterations;

is equivalent to:

   src = src + (rgbStride - 3*8*iterations);

not:

   src = src + rgbStride - 3*8*iterations;

which is equivalent to:

   src = (src + rgbStride) - 3*8*iterations;

The distinction is subtle, but significant when using mixed-mode arithmetic and
implicit type coercion.
Ok.  I believe my code in the latest patch is correct (if a bit paranoid about
order of operations - the previous patch would have worked too from what mach1
says).  Could someone please test it on an Alpha and verify, and then r=/sr= the
patch?
Keywords: patch
Jakub at Redhat (gcc maintainer) says this is not a compiler bug:

> p += -j
> where -j is unsigned int is the same as
> p = p + ((unsigned int) (-j));
> you need to cast it to some signed type before it is converted to ptrdiff_t.

and

> i686 is 32-bit target where sizeof(int) == sizeof(void *).
> That's not the case on Alpha, nor IA-64, nor Sparc64, ...

My testcase works the same on Alpha-OSF and Alpha-Linux.  This (Mozilla) bug
should probably show up on any 64-bit machine.

mach1:  you were right all along!
(can you test out Randell's patch?  I don't have a current source tree)
Yes, I will test the patch and report the result.
I have applied the patch, rebuilt the browser, and tested it on my edited local
copy of the failing page and browsed several sites that have caused problems in
the past--everything seems to work just fine.
Ok, patch has been tested.  I'll take this one too, and check in once we have
r/sr/a (unless you object, pav).  Reviewers?
Assignee: pavlov → rjesup
Target Milestone: --- → mozilla0.9.9
Maybe I'm missing something here, but why is the increment becoming negative?
Because iterations is rounded up (you can't do 1/2 an iteration) and we bump dst
and src every iteration.
But we work from top-to-bottom, left-to-right, so the "bump" should always
be positive.
If you check the patch, you'll see that we now bump 8 pixels at a time (and
process 8).  That means that dst and src get bumped each iteration by 8*3, even
if the number of pixels to handle in the last iteration is less than 8.  In
those cases, dst and src will be past where they need to start for the next line.

Trust me, the code works and is _fast_.  The 64-bit overflow is a true bug due
to order-of-operation and types.

Still targeting 0.9.9 for this nasty crasher (blows 64-bit builds out of the
water).  I plan to ask for permission to check into the branch once I have r/sr.
Status: NEW → ASSIGNED
keyword +64bit please
Comment on attachment 71517 [details] [diff] [review]
Updated patch

sr=tor
Attachment #71517 - Flags: superreview+
Keywords: 64bit
nsbeta1- per ADT triage. This can still be checked in with drivers' approval.
Keywords: nsbeta1 → nsbeta1-
*** Bug 130726 has been marked as a duplicate of this bug. ***
Comment on attachment 71517 [details] [diff] [review]
Updated patch

r=blizzard

a=blizzard on behalf of drivers for checkin to the trunk.
Attachment #71517 - Flags: review+
Attachment #71517 - Flags: approval+
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED