Closed Bug 799295 Opened 8 years ago Closed 8 years ago

v8 crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up

Categories

(Core :: JavaScript Engine, defect)

18 Branch
x86
Linux
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
Tracking Status
firefox17 --- unaffected
firefox18 + fixed
firefox19 --- fixed
firefox-esr17 --- unaffected

People

(Reporter: philor, Assigned: glandium)

References

Details

(Keywords: crash, reproducible, Whiteboard: [qa-])

Crash Data

Attachments

(1 file)

Since today's merge to aurora, every Linux32 PGO dromaeo_css has crashed like https://tbpl.mozilla.org/php/getParsedLog.php?id=15934446&tree=Mozilla-Aurora

Thread 0 (crashed)
 0  libxul.so!js::gc::MarkKind [Heap.h:118a3b748323 : 1017 + 0x0]
    eip = 0x02936b85   esp = 0xbfff2f50   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0x82000000   eax = 0x00000000   ecx = 0x82000000
    edx = 0xbfff2fbc   efl = 0x00010286
    Found by: given as instruction pointer in context
 1  libxul.so!js::gc::MarkValueInternal [Marking.cpp:118a3b748323 : 387 + 0x10]
    eip = 0x02936edd   esp = 0xbfff2f90   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0x00000000   edi = 0xbfff3048
    Found by: call frame info
 2  libxul.so!js::ion::MarkIonActivations [IonFrames.cpp:118a3b748323 : 503 + 0xc]
    eip = 0x029b8a44   esp = 0xbfff2fd0   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0xbfff3048
    Found by: call frame info
 3  libxul.so!js::MarkRuntime [jsgc.cpp:118a3b748323 : 2610 + 0xb]
    eip = 0x02853f6f   esp = 0xbfff30b0   ebp = 0xab4c6e50   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0xb3d3a000
    Found by: call frame info
 4  libxul.so!IncrementalCollectSlice [jsgc.cpp:118a3b748323 : 3434 + 0x4]
    eip = 0x0285d9e3   esp = 0xbfff3150   ebp = 0x00000015   ebx = 0x02e56f34
    esi = 0xbfff3274   edi = 0xbfff326c
    Found by: call frame info
 5  libxul.so!GCCycle [jsgc.cpp:118a3b748323 : 4533 + 0xf]
    eip = 0x02860c64   esp = 0xbfff32b0   ebp = 0x00000000   ebx = 0x02e56f34
    esi = 0xbfff32dc   edi = 0x00000000
    Found by: call frame info
 6  libxul.so!Collect [jsgc.cpp:118a3b748323 : 4647 + 0x23]
    eip = 0x02858500   esp = 0xbfff3300   ebp = 0x00000001   ebx = 0x02e56f34
    esi = 0xb3d3a1bc   edi = 0xb3d3a000
    Found by: call frame info
 7  libxul.so!js_InvokeOperationCallback [jscntxt.cpp:118a3b748323 : 1132 + 0x29]
    eip = 0x0283c964   esp = 0xbfff3360   ebp = 0x00000031   ebx = 0x9b5d8dc0
    esi = 0xac0bf790   edi = 0x00000067
    Found by: call frame info
 8  libxul.so!js_HandleExecutionInterrupt [jscntxt.cpp:118a3b748323 : 1156 + 0x7]
    eip = 0x0283c98c   esp = 0xbfff3390   ebp = 0x00000031   ebx = 0x9b5d8dc0
    esi = 0x00000000   edi = 0x00000067
    Found by: call frame info
 9  libxul.so!js::ion::InterruptCheck [VMFunctions.cpp:118a3b748323 : 387 + 0xb]
    eip = 0x01f39b4f   esp = 0xbfff33b0   ebp = 0x00000031   ebx = 0x9b5d8dc0


That same code (since this is the merge) did not crash on mozilla-central, PGO or non-PGO. The one obvious difference is that on aurora we don't do --enable-profiling. In theory https://tbpl.mozilla.org/?tree=Profiling is supposed to let us get away with that, but it doesn't run Talos (or PGO).

Questions that a try push can answer: if we push m-c to try with the --enable-profiling removed from the mozconfig, will that be enough to crash, or do we also need to add mk_add_options MOZ_PGO=1 to the mozconfig to get the crash?

Aurora's also closed because of bug 799277, but if that gets fixed, this is still a tree-closer.
Forgot to cc the poor unfortunate who will have to decide whether we're willing to open aurora without Linux dromaeo, assuming bug 799277 gets fixed first.
This seems reasonably important, so here's the push with PGO enabled:
https://tbpl.mozilla.org/?tree=Try&rev=3ebe093955e1
With profiling disabled and PGO off, there's no crash.
With profiling disabled and PGO on, there's a crash. Sorry, this is going to suck for you.
I guess this is an argument against not running both Linux64 and Linux32...
Crash Signature: [@ js::gc::MarkKind]
Keywords: crash
Blocks: 785991
Depends on: 799685
Status: NEW → ASSIGNED
QA Contact: dvander
After thinking about this more and talking with members of the JS team, I think we should stop testing Linux PGO builds.

All evidence suggests that PGO is endlessly buggy. We've wasted plenty of developer man-hours tracking down PGO-only crashes, only to find that a specific function has been optimized wrong and must be explicitly deoptimized. There's really nothing good about this situation, littering the codebase with untestable compiler bug workaround hints that may or may not randomly become obsolete. (I'm also sceptical about how beneficial PGO is, but I won't argue that here.)

PGO deeply and zealously modifies calling conventions, function bodies, and large callgraphs. We've seen it cause test failures and top crashes that are very difficult to debug. That's even on *Windows*, where Microsoft has thrown a huge amount of resources into their C++ compiler. I'm not convinced GCC's implementation will be better, and from what I've heard, PGO has been historically buggy in compilers in general.

It is possible there is a real bug here. If so, we'll see it one way or another, likely in a context that is more easily reproducible and easier to debug.

So, I say we stop testing Linux PGO, free up releng resources and developer time, and get rid of these spurious bugs :)
Assignee: general → dvander
QA Contact: dvander
(In reply to David Anderson [:dvander] from comment #7)
> So, I say we stop testing Linux PGO, free up releng resources and developer
> time, and get rid of these spurious bugs :)

+1 from RelEng I think :)  Can you propose this in the newsgroups?
(In reply to Chris AtLee [:catlee] from comment #8)
> (In reply to David Anderson [:dvander] from comment #7)
> > So, I say we stop testing Linux PGO, free up releng resources and developer
> > time, and get rid of these spurious bugs :)
> 
> +1 from RelEng I think :)  Can you propose this in the newsgroups?

Sure, which newsgroups?
(In reply to David Anderson [:dvander] from comment #9)
> Sure, which newsgroups?

mozilla.dev.platform
I've hidden this test on aurora now. There's no point in starring this on every run.
News from various fronts:

It's not actually every run on aurora, only most: https://tbpl.mozilla.org/?tree=Mozilla-Aurora&noignore=1&rev=564500b98bac&jobname=Rev3%20Fedora%2012%20mozilla-aurora%20pgo%20talos%20dromaeojs caught a green that made me wonder if something accidentally "fixed" it, but it didn't.

Something in https://hg.mozilla.org/projects/profiling/pushloghtml?startID=1382&endID=1383 (bug 804636? bug 800568? bug 800063? bug 778948? something less obvious?) on the trunk caused the dromaeo_css crash to become less frequent, but added a similar crash

Thread 0 (crashed)
 0  libxul.so!ReadAllocation [Registers.h:6ed9ca8c36bd : 112 + 0x4]
    eip = 0x01ebdcf4   esp = 0xbf94abd0   ebp = 0xbf94ace0   ebx = 0x02df5100
    esi = 0xbf94aca8   edi = 0x00000008   eax = 0x00000000   ecx = 0x0000001f
    edx = 0x00000000   efl = 0x00010286
    Found by: given as instruction pointer in context
 1  libxul.so!js::ion::MarkIonActivations(JSRuntime*, JSTracer*) [IonFrames.cpp:6ed9ca8c36bd : 506 + 0xa]
    eip = 0x0295243b   esp = 0xbf94ac30   ebp = 0xbf94ace0   ebx = 0x02df5100
    esi = 0xb3e3a100   edi = 0xbf94aca8
    Found by: call frame info
 2  libxul.so!js::MarkRuntime [jsgc.cpp:6ed9ca8c36bd : 2515 + 0xb]
    eip = 0x027eebaf   esp = 0xbf94ad10   ebp = 0xa5eef274   ebx = 0x02df5100
    esi = 0xb3e3a100   edi = 0xb3e3a000
    Found by: call frame info
 3  libxul.so!IncrementalCollectSlice [jsgc.cpp:6ed9ca8c36bd : 3351 + 0x4]
    eip = 0x027f79f3   esp = 0xbf94adb0   ebp = 0x00000015   ebx = 0x02df5100
    esi = 0xbf94aec0   edi = 0xbf94aeb8
    Found by: call frame info
 4  libxul.so!GCCycle [jsgc.cpp:6ed9ca8c36bd : 4447 + 0xf]
    eip = 0x027facd4   esp = 0xbf94af00   ebp = 0x00000000   ebx = 0x02df5100
    esi = 0xbf94af28   edi = 0x00000000
    Found by: call frame info
 5  libxul.so!Collect [jsgc.cpp:6ed9ca8c36bd : 4561 + 0x23]
    eip = 0x027f2829   esp = 0xbf94af50   ebp = 0x00000001   ebx = 0x02df5100
    esi = 0xb3e3a1c0   edi = 0xb3e3a000
    Found by: call frame info
 6  libxul.so!js_InvokeOperationCallback(JSContext*) [jscntxt.cpp:6ed9ca8c36bd : 1205 + 0x29]
    eip = 0x027d6994   esp = 0xbf94afb0   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xa4409990   edi = 0x9ef64010
    Found by: call frame info
 7  libxul.so!js_HandleExecutionInterrupt(JSContext*) [jscntxt.cpp:6ed9ca8c36bd : 1229 + 0x7]
    eip = 0x027d69bc   esp = 0xbf94afe0   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xffffff87   edi = 0x9ef64010
    Found by: call frame info
 8  libxul.so!js::ion::InterruptCheck(JSContext*) [VMFunctions.cpp:6ed9ca8c36bd : 387 + 0xb]
    eip = 0x01ec40cb   esp = 0xbf94b000   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xffffff87   edi = 0x9ef64010
    Found by: call frame info
 9  0x305e773
    eip = 0x0305e774   esp = 0xbf94b020   ebp = 0xffffff81   ebx = 0x00000003

in test_MochiKit-DOM-Safari.html. So far, https://tbpl.mozilla.org/?tree=Profiling&rev=6ed9ca8c36bd makes that look permaorange, though it could also only be 99 of 100 like the aurora dromaeo_css crash.

That sets a more firm "do something" deadline of November 19th, since we'll be a little less sanguine about just hiding mochitest-1 on aurora-19.
Recommend untracking for Firefox 18 - this bug is self-inflicted by using experimental compiler features, and we don't seem to have any data on whether it affects Linux users.
Assignee: dvander → general
Status: ASSIGNED → NEW
Taras: you drove PGO on Linux in bug 559964, because it was "a huge(30%) startup improvement on Linux" - you okay with just dropping it rather than fixing this bustage?
We might consider using a newer GCC version, to see if this bug goes away. It looks like we're two behind, 4.5 versus 4.7.
Depending on tracking bug for 19-nov-2012 migration work.  Please unblock if this is incorrect.
Blocks: 786550
I guess it's merge-day in the sense that when 18 migrates to mozilla-beta, this will break that test suite there, and we will just hide it there, too?
Summary: dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on mozilla-aurora → dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up
(In reply to David Anderson [:dvander] from comment #13)
> Recommend untracking for Firefox 18 - this bug is self-inflicted by using
> experimental compiler features, and we don't seem to have any data on
> whether it affects Linux users.

Not tracking for FF18 given this comment .
We hid the dromaeojs job on Aurora, back when we thought this was just something that was going to be fixed in the first couple of days after the merge; turns out that was probably tree mismanagement on our part. It's now unhidden.
What does it actually mean to untrack this bug?

Tracking+ for an unfiled bug to stop doing PGO on Linux on all trunk trees, and for another unfiled bug to stop doing PGO on Linux on Aurora, and for another unfiled bug to stop doing PGO on Linux on Beta at the next merge, and for another unfiled bug to stop doing PGO on Linux on Release at the merge after that, and for another unfiled bug to switch release automation to not do PGO on Linux at that same merge?

Or tracking+ for an unfiled bug to just drop the dromaeojs talos suite on trunk and on each branch as 18 merges through it, assuming that we only crash in that talos suite, and not anywhere else?
https://tbpl.mozilla.org/php/getParsedLog.php?id=16918557&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=16919019&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=16898917&tree=Profiling
Summary: dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up → dromaeo_css crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up
Actually, I don't see anything here that is actionable by releng.  I'm removing the dependency on our migration work tracking bug.
No longer blocks: 786550
m-c PGO no-profiling MOZ_PGO_OPTIMIZE_FLAGS=$MOZ_OPTIMIZE_FLAGS
https://tbpl.mozilla.org/?tree=Try&rev=d1314a03a613
(In reply to Phil Ringnalda (:philor) from comment #23)
> What does it actually mean to untrack this bug?

We're trying to get to a resolution in the original thread https://groups.google.com/d/msg/mozilla.dev.platform/hud9aK0qZWI/lVvYys-U_nYJ.
(In reply to Phil Ringnalda (:philor) from comment #23)
> Or tracking+ for an unfiled bug to just drop the dromaeojs talos suite on
> trunk and on each branch as 18 merges through it, assuming that we only
> crash in that talos suite, and not anywhere else?

The discussion on dev-platform has played out, and we can just disable the failing tests on all branches for Linux PGO (no reason to wait for FF18 to merge to each branch).
(In reply to Karl Tomlinson (:karlt) from comment #37)
> m-c PGO no-profiling MOZ_PGO_OPTIMIZE_FLAGS=$MOZ_OPTIMIZE_FLAGS
> https://tbpl.mozilla.org/?tree=Try&rev=d1314a03a613

That seems to leave only PGO as the difference between pass and fail.

(In reply to Phil Ringnalda (:philor) from comment #38)
> If the regressionwindow isn't
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=d042ad078f43&tochange=fdfaef738a00 I'll be astonished.

fdfaef738a00 and back to dd9554c236dc at least demonstrate the bug.
The older revisions that I picked either didn't compile, or hit unknown run_tests.py options or PerfConfigurator.py errors.
Depends on: 812076
Blocks: 810705
So, this is not a crash in dromaeo_css, but a crash in v8. And if I download e.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux/1352918270/firefox-18.0a2.en-US.linux-i686.tar.bz2 and go to http://v8.googlecode.com/svn/data/benchmarks/v7/run.html, I can reproduce.
Summary: dromaeo_css crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up → v8 crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up
(In reply to Mike Hommey [:glandium] from comment #51)
> Even better with a nightly

Well, that's an aurora build, actually. ;-)

Thanks for reproducing, though, that's great!

Can you track this down to a regression range in 18.0a1 nightly by bisecting nightly builds?
So, from what I can see in a debugger, this looks very much like a stack corruption bug. The interesting part is that it started happening the day aurora became 18, and wasn't happening on 18 central. The even more interesting part is that it stopped happening since we merged 19 to aurora, and, obviously, moved to beta.
It never happened on central because on central, we build with --enable-profiling, which enables frame pointers, which changes the stack layout, since frame pointers are kept on stack. This may well have contributed to hide the bug.

I'm inclined to think there's an ionmonkey bug hiding in there, and it was fixed on m-c during the 19 cycle.
This is interesting: while i can't manage to get a crash out of a non pgo build, i *do* get a busy script popup during a v8 run, which i'd normally never get.
(In reply to Mike Hommey [:glandium] from comment #55)
> This is interesting: while i can't manage to get a crash out of a non pgo
> build, i *do* get a busy script popup during a v8 run, which i'd normally
> never get.

A non pgo build from the profiling branch, so without frame pointers.
(In reply to Mike Hommey [:glandium] from comment #56)
> (In reply to Mike Hommey [:glandium] from comment #55)
> > This is interesting: while i can't manage to get a crash out of a non pgo
> > build, i *do* get a busy script popup during a v8 run, which i'd normally
> > never get.
> 
> A non pgo build from the profiling branch, so without frame pointers.

In fact, that also happens on nightly, but it takes more tries. Each new run of v8 runs the regexp test slower and slower, until the busy script popup shows up. It takes about 5 attempts here to get from a score of about 1000 to about 250.

The crashes on aurora seem to happen during the EarleyBoyer test, though, which is before regexp.
(In reply to Mike Hommey [:glandium] from comment #53)
> It never happened on central because on central, we build with
> --enable-profiling, which enables frame pointers, which changes the stack
> layout, since frame pointers are kept on stack. This may well have
> contributed to hide the bug.

Hrm, is there some other way we can track it down to a checkin range (one that isn't full 6 weeks)?

(In reply to Mike Hommey [:glandium] from comment #57)
> In fact, that also happens on nightly, but it takes more tries. Each new run
> of v8 runs the regexp test slower and slower, until the busy script popup
> shows up. It takes about 5 attempts here to get from a score of about 1000
> to about 250.

Even if that's a different bug, this sounds like something we *really* should have on file and should investigate.
When running under valgrind, the first warning i get when running v8 is:
==12930== Use of uninitialised value of size 4
==12930==    at 0x64F5661: ??? (in libxul.so)

Sadly, I don't know how to tell valgrind to get the symbols from the files i downloaded from the symbol server, but resolving manually tells me this is http://hg.mozilla.org/releases/mozilla-aurora/file/149ad5c9f94d/js/src/ion/LIR.h#l94

This doesn't look very good.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #58)
> Hrm, is there some other way we can track it down to a checkin range (one
> that isn't full 6 weeks)?

Apart from using try to do pgo builds without frame pointer on older checkins, I have no idea. Let's start by testing the ionmonkey merge changeset...
Does this only happen on PGO builds?
Yes.
(In reply to Mike Hommey [:glandium] from comment #60)
> Let's start by testing the ionmonkey merge changeset...

See comment 49.  I expect its possible to patch the issues with older revisions there, but I'd guess a better first step would be to have someone who knows the code look at the valgrind warnings.  Perhaps a try build with stripping disabled might make that easier.

(In reply to Ehsan Akhgari [:ehsan] from comment #62)
> Does this only happen on PGO builds?

AFA we K.  See comment 49.
Heck, you can see comment 5, for that matter. It's PGO-only, and that's the reason I added PGO builds to the Profiling branch, which we didn't have at the time of the IM merge.
So it seems like the best bet is for an IonMonkey developer to bisect the range...
Additional data point: https://hg.mozilla.org/try/rev/fdfaef738a00 , which is the fixup changeset after the ionmonkey merge *does* have the problem.
(In reply to Mike Hommey [:glandium] from comment #57)
> In fact, that also happens on nightly, but it takes more tries. Each new run
> of v8 runs the regexp test slower and slower, until the busy script popup
> shows up. It takes about 5 attempts here to get from a score of about 1000
> to about 250.

We're tracking regexp performance in bug 806646, which is on track to be fixed soon. Right now we're pretty sad on this test, but I don't think it's related here.

> The crashes on aurora seem to happen during the EarleyBoyer test, though,
> which is before regexp.

Another thing you could try is seeing if this reproduces using gcc-4.7.
Some of the valgrind errors from a rebuild of current beta with pgo, --enable-valgrind, --disable-install-strip and --disable-elf-hack:
==22533== Use of uninitialised value of size 4
==22533==    at 0x62F60A1: CanEncodeInfoInHeader(js::ion::LAllocation const&, unsigned int*) (LIR.h:94)
==22533==    by 0x2BB: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F6073: CanEncodeInfoInHeader(js::ion::LAllocation const&, unsigned int*) (Safepoints.cpp:192)
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B8A: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:392)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B99: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:395)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844BA3: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:398)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B8A: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:392)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B99: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:395)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844BA3: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:398)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Use of uninitialised value of size 4
==22533==    at 0x583FEE0: ReadAllocation(js::ion::IonFrameIterator const&, js::ion::LAllocation const*) (IonFrames.cpp:434)
==22533==    by 0x1CD: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
Thanks to the valgrind output, it ended being easy to spot what is wrong:

01a0c070 <_ZL21CanEncodeInfoInHeaderRKN2js3ion11LAllocationEPj>:
 1a0c070:	56                   	push   %esi
 1a0c071:	89 c6                	mov    %eax,%esi
 1a0c073:	83 ec 08             	sub    $0x8,%esp
 1a0c076:	8b 08                	mov    (%eax),%ecx
 1a0c078:	f6 c1 01             	test   $0x1,%cl
 1a0c07b:	75 1b                	jne    1a0c098 <_ZL21CanEncodeInfoInHeaderRKN2js3ion11LAllocationEPj+0x28>
(snip)
 1a0c098:	8b 0e                	mov    (%esi),%ecx
 1a0c09a:	8b 54 24 04          	mov    0x4(%esp),%edx
 1a0c09e:	c1 f9 04             	sar    $0x4,%ecx
 1a0c0a1:	89 0a                	mov    %ecx,(%edx)
(snip)

0x1a0c0a1 above corresponds to 0x62F60A1 in valgrind's output. Essentially, what is happening is that a value is taken from the stack at 0x1a0c09a that was never stored...

This appears to be a miscompilation from gcc 4.5 with PGO enabled, but I was able to get a similarly broken code out of it without PGO with the following patch:
diff --git a/js/src/ion/Safepoints.cpp b/js/src/ion/Safepoints.cpp
--- a/js/src/ion/Safepoints.cpp
+++ b/js/src/ion/Safepoints.cpp
@@ -8,6 +8,7 @@
 #include "Safepoints.h"
 #include "IonSpewer.h"
 #include "LIR.h"
+#include "mozilla/Likely.h"
 
 using namespace js;
 using namespace ion;
@@ -190,7 +191,7 @@ AllocationToPartKind(const LAllocation &
 static inline bool
 CanEncodeInfoInHeader(const LAllocation &a, uint32 *out)
 {
-    if (a.isGeneralReg()) {
+    if (MOZ_LIKELY(a.isGeneralReg())) {
         *out = a.toGeneralReg()->reg().code();
         return true;
     }

If I manually patch a broken libxul.so with the function machine code from a working build, I get a non-crashy build, and valgrind never complains.

It looks like, from succinct local testing, that gcc 4.7 would generate proper code. Although a notable difference between gcc 4.7 and gcc 4.5 is that gcc 4.7 inlines the function in most cases where gcc 4.5 doesn't. I'm testing a try build replacing inline with MOZ_ALWAYS_INLINE on that function, and see if the miscompilation is worked around this way.
Assignee: general → mh+mozilla
Comment on attachment 684474 [details] [diff] [review]
Work around gcc 4.5 miscompilation of CanEncodeInfoInHeader by always inlining it

Review of attachment 684474 [details] [diff] [review]:
-----------------------------------------------------------------

Nice find! Small nit, comments in this file should use //
Attachment #684474 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/d8e4f06198dc
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 684474 [details] [diff] [review]
Work around gcc 4.5 miscompilation of CanEncodeInfoInHeader by always inlining it

[Approval Request Comment]
Bug caused by (feature/regressing bug #): gcc 4.5 miscompiling ionmonkey
User impact if declined: Pretty much random crashes when js is involved
Testing completed (on m-c, etc.): heavily tested on try, although only tested on linux x86.
Risk to taking this patch (and alternatives if risky): the patch in itself doesn't change much. It changes an "inline" to an "MOZ_ALWAYS_INLINE", which only forces compilers to inline the function. There is a tiny possibility that other compilers start miscompiling as a result, but it's quite unlikely: clang is already inlining the function, so forcing an inline won't change that, and i suspect msvc does too.
String or UUID changes made by this patch: none.
Attachment #684474 - Flags: approval-mozilla-beta?
Attachment #684474 - Flags: approval-mozilla-aurora?
Attachment #684474 - Flags: approval-mozilla-beta?
Attachment #684474 - Flags: approval-mozilla-beta+
Attachment #684474 - Flags: approval-mozilla-aurora?
Attachment #684474 - Flags: approval-mozilla-aurora+
dromaeojs unhidden on beta
Keywords: verifyme
Keywords: verifyme
Whiteboard: [qa-]
Duplicate of this bug: 810705
You need to log in before you can comment on or make changes to this bug.