Closed Bug 521549 Opened 15 years ago Closed 13 years ago

Crash [@ qcms_transform_data_rgba_out_lut_sse2], [@ qcms_transform_data_rgb_out_lut_sse2]

Categories

(Core :: Graphics: Color Management, defect)

x86
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -

People

(Reporter: bc, Assigned: swsnyder)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(3 files, 3 obsolete files)

Attached file stacks
When running the jsreftests on 32bit linux centos5 vm with a debug mozilla-central build I see a crash after loading the js1_5/Regress/regress-351116.js test at 

#6  qcms_transform_data_rgba_out_lut_sse2 (transform=0xb64ddd0, 
    src=0x99519c1 "", dest=0x99519c1 "", length=16)
    at gfx/qcms/transform-sse2.c:152

Running the test by itself does not crash. But the crash occurs regularly at this point when running the full test.

My local profile on that machine also crashes at startup in a related place at 

qcms_transform_data_rgb_out_lut_sse2 (transform=0x9121b20, src=0xbf816a0d "\020\020\020 \033\022\t\f\r%Gշ�%@\231\020\t|j\201%G�%@[\205\030%G�%@@q\n\t\001", 
    dest=0xbf816a0d "\020\020\020 \033\022\t\f\r%Gշ�%@\231\020\t|j\201%G�%@[\205\030%G�%@@q\n\t\001", length=1) at gfx/qcms/transform-sse2.c:41

A fresh profile however does not crash. The crash still occurs in safe mode.

I don't know how to reproduce the second crash other than with the "bad" profile, but to reproduce the first crash.

1. apply bug-515440.patch to disable a failing js test (at least until I check it in).

2. make -C $OBJDIR jstestbrowser where $OBJDIR is your debug objdir.

This appears to have been added in 
Bug 512865. qcms: Improve SSE2 performance, add SSE support.
Attached patch bug-515440.patch (obsolete) — Splinter Review
sayrer: This patch should land on mozilla-central and tracemonkey. Would it cause you problems if it was landed on both at the same time? Or would it be better to land on tracemonkey and let you handle it during the next merge to mozilla-central?
Attachment #405644 - Attachment is obsolete: true
Attachment #407122 - Flags: review?(sayrer)
Attachment #407122 - Flags: review?(sayrer) → review+
http://hg.mozilla.org/tracemonkey/rev/4175f7e34648
disable js1_5/Regress/regress-351116.js for debug 32bit linux browser

to test this crash you must re-enable this test.
This stack is one we've been seeing on shutdown (intermittent, but frequently) on the Thunderbird Bloat tests.

It is slightly different to the other stacks but I'm guessing it is the same issue as I am pretty sure the intermittent crashes started the same time as when bug 512865 landed.
This is crashing frequently on debug tests on tinderbox. I think this may start affecting Thunderbird devs as soon as we start actively working on trunk.

Also I suspect this could be the cause of SeaMonkey's crashes on Linux recorded in bug 520707 that started on the same day as bug 512865.

Therefore I think this should block 1.9.3 as it is a clear reproducible crash for debug builds at least... and we need a stable environment for our Linux developers.
Severity: normal → critical
blocking2.0: --- → ?
Whiteboard: [orange]
(In reply to comment #5)
> This is crashing frequently on debug tests on tinderbox.

Mark, can you point me toward doc that explains how the debug tests differ from a standard build?  Put another way, how do I build Thunderbird/SeaMonkey in such a way that this crash can be reproduced?

Thanks.
Never mind.  Found it.

https://developer.mozilla.org/en/MailNews/Leak_And_Bloat_Tests
(In reply to comment #6)
> (In reply to comment #5)
> > This is crashing frequently on debug tests on tinderbox.
> 
> Mark, can you point me toward doc that explains how the debug tests differ from
> a standard build?  Put another way, how do I build Thunderbird/SeaMonkey in
> such a way that this crash can be reproduced?

For Thunderbird we're doing slightly different tests to the Firefox Leak & Bloat tests. Unfortunately I need to update the documentation on how to run these, thankfully it is quite simple.

If you need info on how to pull & build Thunderbird, see https://developer.mozilla.org/En/Simple_Thunderbird_build

This is the mozconfig that we're using for the bloat tests:

http://hg.mozilla.org/build/buildbot-configs/file/c230f470c116/thunderbird/linux/comm-central/debug/mozconfig

once you've got a build, go into the src directory, into mailnews/performance/bloat and run:

./runtest.py --objdir=/path/to/objdir

replacing the path as appropriate, e.g. /home/cltbld/build/objdir-tb

If you want to try and reproduce the actual actions rather than run the automated test, then for Thunderbird it is:

1) Start up with minimal profile (one account, most things turned off including contacting the server on startup).
2) Open and close compose window
3) Open and close address book window
4) Exit Thunderbird

I haven't tried it myself yet as I need to set up a new VM for that, but they are the steps for the mail leak test.


As I said I'm not sure SeaMonkey is definitely the same crash, but afaik their leak & bloat tests are run in exactly the same way as Firefox - unfortunately I don't know where we keep details on that. This is the mozconfig used by SeaMonkey:

http://hg.mozilla.org/build/buildbot-configs/file/c230f470c116/seamonkey/linux/comm-central-trunk/debug/mozconfig
make jstestbrowser in your objdir on mozilla-central will also show this crash
Mark:

The bloat test seems to run correctly (windows opened, closed) until the end, where I get this line:

/home/steve/tmp/comm-central/objdir-tb/mozilla/dist/bin/run-mozilla.sh: line 131: 10088 Segmentation fault      "$prog" ${1+"$@"}
TEST-UNEXPECTED-FAIL | runtest.py | Exited with code 139 during test run

It always crashes at the same place, in NS_StackWalk().  No references to the QCMS code being blamed.

I commented-out the QCMS code that selects the SSE2 path, rebuilt, and see the same behavior.

This on a 32-bit Fedora 10 (GCC v4.3.2) system.

Bob:

$ cd -
/home/steve/tmp/comm-central
$ cd objdir-tb/
$ make jstestbrowser
make: *** No rule to make target `jstestbrowser'.  Stop.

The name of the test suggests that a SeaMonkey build is required, explaining why it's not found in my Thunderbird build.  True.
Firefox build required: mozilla-central.
I am seeing crash on mozilla-central debug build @ qcms_transform_data_rgb_out_lut_sse2 on startup on a centos5 32bit vm.
(In reply to comment #12)
> I am seeing crash on mozilla-central debug build @
> qcms_transform_data_rgb_out_lut_sse2 on startup on a centos5 32bit vm.

I've been unsuccessful in reproducing this on Fedora, but I've got CentOS 5 too.

Are you on CentOS v5.4, fully updated?  More specifically, are you building with GCC package gcc-4.1.2-46.el5_4.1 and binutils-2.17.50.0.6-12.el5 ?

By "debug build" do you mean just having "export MOZ_DEBUG_SYMBOLS=1" in your mozconfig file, or is there more to it than that?

Given that you're in a VM, I assume that you do in fact have SSE2 support in your environment.  That is, it is likely a misalignment issue, not an attempt to execute instructions unsupported by your CPU.
(In reply to comment #13)

> Are you on CentOS v5.4, fully updated?  More specifically, are you building
> with GCC package gcc-4.1.2-46.el5_4.1 and binutils-2.17.50.0.6-12.el5 ?

yes.

> 
> By "debug build" do you mean just having "export MOZ_DEBUG_SYMBOLS=1" in your
> mozconfig file, or is there more to it than that?
> 

mk_add_options MOZ_CO_PROJECT=browser

ac_add_options --enable-application="browser"
ac_add_options --disable-optimize
ac_add_options --enable-debugger-info-modules=all

ac_add_options --enable-debug
ac_add_options --enable-tests
ac_add_options --enable-default-toolkit=cairo-gtk2
ac_add_options --disable-static
ac_add_options --enable-shared

ac_add_options --disable-crashreporter
ac_add_options --disable-installer
ac_add_options --disable-jemalloc
ac_add_options --with-valgrind
ac_add_options --enable-valgrind
ac_add_options --enable-logrefcnt

> Given that you're in a VM, I assume that you do in fact have SSE2 support in
> your environment.  That is, it is likely a misalignment issue, not an attempt
> to execute instructions unsupported by your CPU.

how do I check for SSE2 support?
(In reply to comment #14)

> how do I check for SSE2 support?

Examine the output of "cat /proc/cpuinfo" for SSE2 in the feature flags.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.80GHz
stepping        : 6
cpu MHz         : 1798.521
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips        : 3597.04
has both sse and sse2
OK, here's the root of the problem:

transform.c: At top level:
transform.c:1321: warning: ‘__force_align_arg_pointer__’ attribute directive ignored

Googling that text gets a whole lot of complaints from Wine and libpixman users complaining that they can't build with their GCC v4.1.x compilers for this very reason.

In our case we (sometimes) enter the transformation functions with the stack mis-aligned, causing a segfault on the first attempt to write to a local __m128 variable:

    const __m128 mat0  = _mm_load_ps(mat[0]);

Confirmation that __force_align_arg_pointer__ is not doing it's job of forcing stack alignment:

    0x014acce5 <qcms_transform_data+0>:     push   %ebp
    0x014acce6 <qcms_transform_data+1>:     mov    %esp,%ebp
    0x014acce8 <qcms_transform_data+3>:     push   %esi
    0x014acce9 <qcms_transform_data+4>:     sub    $0x14,%esp
    0x014accec <qcms_transform_data+7>:     mov    0x8(%ebp),%eax
    0x014accef <qcms_transform_data+10>:    mov    0x7c(%eax),%esi
    0x014accf2 <qcms_transform_data+13>:    mov    0x10(%ebp),%edx
    0x014accf5 <qcms_transform_data+16>:    mov    0xc(%ebp),%ecx
    0x014accf8 <qcms_transform_data+19>:    mov    0x14(%ebp),%eax
    0x014accfb <qcms_transform_data+22>:    mov    %eax,0xc(%esp)
    0x014accff <qcms_transform_data+26>:    mov    %edx,0x8(%esp)
    0x014acd03 <qcms_transform_data+30>:    mov    %ecx,0x4(%esp)
    0x014acd07 <qcms_transform_data+34>:    mov    0x8(%ebp),%eax
    0x014acd0a <qcms_transform_data+37>:    mov    %eax,(%esp)
    0x014acd0d <qcms_transform_data+40>:    call   *%esi
    0x014acd0f <qcms_transform_data+42>:    add    $0x14,%esp
    0x014acd12 <qcms_transform_data+45>:    pop    %esi
    0x014acd13 <qcms_transform_data+46>:    pop    %ebp
    0x014acd14 <qcms_transform_data+47>:    ret    

We don't see similar segfaults from our copy of libpixman because in Linux we use the system pango libraries which, in RHEL5/CentOS5, don't have SSE/SSE2 support.

No fix yet, but at least the problem isn't a mystery anymore.
Jeff:

In brief testing this patch seems to fix the problem.  Before I take the time to test it more rigorously (multiple compilers, OSs, etc.), though, I have a question: is this scheme too clunky to be considered for acceptance?

The scheme is that for versions of GCC prior to v4.2 local 128-bit variables are encapulated in a run-time-aligned structure.  We decide at build time whether the local vars are accessed as stand-alone variables or as members of the structure, with a macro ( LSVAR(x) ) hiding the decision.

If you're OK with this general course of action, I'll submit a second patch and request a formal review of it.
As described above:

The scheme is that for versions of GCC prior to v4.2 local 128-bit variables
are encapsulated in a run-time-aligned structure.  We decide at build time
whether the local vars are accessed as stand-alone variables or as members of
the structure, with a macro ( LSVAR(x) ) hiding the decision.

Addendum:

It turns out that the above wasn't enough.  With optimization completely disabled GCC v4.1.2 does a _mm_load_ss() by zeroing the register, storing it to an intermediate temporary location, and overwriting the low 32 bits with the specified float before storing it in the destination specified in the source code.  That temporary location is on the stack.  Given that we cannot guarantee stack alignment I added the lowest optimization to the make file, ifdef'd to GCC.  That causes the compiler to skip the intermediate step and write the intrinsic output direct to specified (aligned) destination.  The added switch is specified before the standard compiler switches, overriding it in non-debug builds.  This is what the use of intermediate storage looks like:

    LSVAR(vec_g) = _mm_load_ss(&igtbl_g[src[1]]);

        xorps   %xmm0, %xmm0    # tmp325
        movaps  %xmm0, -776(%ebp)       # tmp325, D.6803
        movl    -752(%ebp), %eax        # __F, __F
        movl    %eax, -776(%ebp)        # __F, D.6803
        movaps  -776(%ebp), %xmm0       # D.6803, D.6804
        movl    -860(%ebp), %eax        # lsptr, lsptr
        movaps  %xmm0, 112(%eax)        # D.6643, <variable>.vec_g

Ramifications of the patch:

1. Building with GCC versions prior to v4.2 will gen code that runs a little slower due to the additional pointer referencing.

2. The SSE1- and SSE2-specific source files can't be built with optimization wholly disabled.  (Better would be to add the optimization only for <4.2 GCC but I don't see that the make file can know the version of GCC being used.)
Assignee: nobody → swsnyder
Attachment #411680 - Attachment is obsolete: true
Attachment #412351 - Flags: review?(jmuizelaar)
I'm not sure that all these contortions are worth it. Trunk is currently built with gcc 4.3. So the problem doesn't even occur with those builds. Perhaps we should just not build the sse files with gcc < 4.2 or force optimization and hope nothing goes wrong.
(In reply to comment #20)
> I'm not sure that all these contortions are worth it. Trunk is currently built
> with gcc 4.3. So the problem doesn't even occur with those builds. Perhaps we
> should just not build the sse files with gcc < 4.2 or force optimization and
> hope nothing goes wrong.

Ah so that's why the Thunderbird tinderboxes are failing with this (and possibly the SeaMonkey ones) and not the Firefox ones, because afaik Thunderbird ones are still on gcc 4.1.

Personally I think whichever version of gcc is used, the build shouldn't fail, or the prerequisites raised so that devs don't spend ages working out why their builds are failing.

Although, having said that, I will now go and raise a bug for seeing if gcc 4.3 will fix the Thunderbird boxes.
(In reply to comment #20)
> I'm not sure that all these contortions are worth it.

Me either.  Especially since older compilers would still provide the same functionality through the non-SIMD code path.  And the number of people who build with GCC versions prior to v4.2 will only get smaller with time.  

On the other hand, a contemporary Linux distribution (RHEL5/CentOS5) still uses v4.1.2 as the standard compiler.  On still another other hand, RHEL5/CentOS also offers GCC v4.4 as a non-standard installation option.  <shrug>

I looked for some Mozilla-mandated minimum version, but only found an advisement to use GCC v3.4 or newer when building.
Until this patch gets reviewed/lands, is there any workaround for this? I experience this on CentOS 5 where I cannot update to GCC 4.2 (there is no rpm available for that so far).
(In reply to comment #23)
> Until this patch gets reviewed/lands, is there any workaround for this? I
> experience this on CentOS 5 where I cannot update to GCC 4.2 (there is no rpm
> available for that so far).

Sure. Just make sse_version_available() in gfx/qcms/transform.c unconditionally return 0;
I updated my vms to gcc 4.4. Hopefully that won't cause more problems.
Any progress since last month?
I think the resolution to this is going to be use a gcc that supports __force_align_arg_pointer__ or avoid compiling in the sse code.
FTR the Thunderbird tinderboxes are now building with gcc 4.3.3 (same as FF boxes) and of the 9 or so builds today, we've not had any fail yet.
No longer blocks: CcMcBuildIssues
Comment on attachment 412351 [details] [diff] [review]
Prevent <4.2 GCC from generating illegal SIMD addressing

I don't think we need this anymore.
Attachment #412351 - Flags: review?(jmuizelaar)
We should fix this somehow, but it doesn't need to block.
blocking2.0: ? → -
Blocks: 438871
Crash Signature: [@ qcms_transform_data_rgba_out_lut_sse2] [@ qcms_transform_data_rgb_out_lut_sse2]
We use the attribute unconditionally now. This shouldn't be a problem anymore.
Status: NEW → RESOLVED
Crash Signature: [@ qcms_transform_data_rgba_out_lut_sse2] [@ qcms_transform_data_rgb_out_lut_sse2] → [@ qcms_transform_data_rgba_out_lut_sse2] [@ qcms_transform_data_rgb_out_lut_sse2]
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 407122 [details] [diff] [review]
patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser
[Checked in: Comment 33]

http://hg.mozilla.org/mozilla-central/rev/4175f7e34648
Attachment #407122 - Attachment description: patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser → patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser [Checked in: Comment 33]
Attachment #412351 - Attachment is obsolete: true
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: