v8 crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up

RESOLVED FIXED

Status

()

Core
JavaScript Engine
--
blocker
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: philor, Assigned: glandium)

Tracking

({crash, reproducible})

18 Branch
x86
Linux
crash, reproducible
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox17 unaffected, firefox18+ fixed, firefox19 fixed, firefox-esr17 unaffected)

Details

(Whiteboard: [qa-], crash signature)

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Since today's merge to aurora, every Linux32 PGO dromaeo_css has crashed like https://tbpl.mozilla.org/php/getParsedLog.php?id=15934446&tree=Mozilla-Aurora

Thread 0 (crashed)
 0  libxul.so!js::gc::MarkKind [Heap.h:118a3b748323 : 1017 + 0x0]
    eip = 0x02936b85   esp = 0xbfff2f50   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0x82000000   eax = 0x00000000   ecx = 0x82000000
    edx = 0xbfff2fbc   efl = 0x00010286
    Found by: given as instruction pointer in context
 1  libxul.so!js::gc::MarkValueInternal [Marking.cpp:118a3b748323 : 387 + 0x10]
    eip = 0x02936edd   esp = 0xbfff2f90   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0x00000000   edi = 0xbfff3048
    Found by: call frame info
 2  libxul.so!js::ion::MarkIonActivations [IonFrames.cpp:118a3b748323 : 503 + 0xc]
    eip = 0x029b8a44   esp = 0xbfff2fd0   ebp = 0xbfff3080   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0xbfff3048
    Found by: call frame info
 3  libxul.so!js::MarkRuntime [jsgc.cpp:118a3b748323 : 2610 + 0xb]
    eip = 0x02853f6f   esp = 0xbfff30b0   ebp = 0xab4c6e50   ebx = 0x02e56f34
    esi = 0xb3d3a100   edi = 0xb3d3a000
    Found by: call frame info
 4  libxul.so!IncrementalCollectSlice [jsgc.cpp:118a3b748323 : 3434 + 0x4]
    eip = 0x0285d9e3   esp = 0xbfff3150   ebp = 0x00000015   ebx = 0x02e56f34
    esi = 0xbfff3274   edi = 0xbfff326c
    Found by: call frame info
 5  libxul.so!GCCycle [jsgc.cpp:118a3b748323 : 4533 + 0xf]
    eip = 0x02860c64   esp = 0xbfff32b0   ebp = 0x00000000   ebx = 0x02e56f34
    esi = 0xbfff32dc   edi = 0x00000000
    Found by: call frame info
 6  libxul.so!Collect [jsgc.cpp:118a3b748323 : 4647 + 0x23]
    eip = 0x02858500   esp = 0xbfff3300   ebp = 0x00000001   ebx = 0x02e56f34
    esi = 0xb3d3a1bc   edi = 0xb3d3a000
    Found by: call frame info
 7  libxul.so!js_InvokeOperationCallback [jscntxt.cpp:118a3b748323 : 1132 + 0x29]
    eip = 0x0283c964   esp = 0xbfff3360   ebp = 0x00000031   ebx = 0x9b5d8dc0
    esi = 0xac0bf790   edi = 0x00000067
    Found by: call frame info
 8  libxul.so!js_HandleExecutionInterrupt [jscntxt.cpp:118a3b748323 : 1156 + 0x7]
    eip = 0x0283c98c   esp = 0xbfff3390   ebp = 0x00000031   ebx = 0x9b5d8dc0
    esi = 0x00000000   edi = 0x00000067
    Found by: call frame info
 9  libxul.so!js::ion::InterruptCheck [VMFunctions.cpp:118a3b748323 : 387 + 0xb]
    eip = 0x01f39b4f   esp = 0xbfff33b0   ebp = 0x00000031   ebx = 0x9b5d8dc0


That same code (since this is the merge) did not crash on mozilla-central, PGO or non-PGO. The one obvious difference is that on aurora we don't do --enable-profiling. In theory https://tbpl.mozilla.org/?tree=Profiling is supposed to let us get away with that, but it doesn't run Talos (or PGO).

Questions that a try push can answer: if we push m-c to try with the --enable-profiling removed from the mozconfig, will that be enough to crash, or do we also need to add mk_add_options MOZ_PGO=1 to the mozconfig to get the crash?

Aurora's also closed because of bug 799277, but if that gets fixed, this is still a tree-closer.
(Reporter)

Comment 1

5 years ago
Forgot to cc the poor unfortunate who will have to decide whether we're willing to open aurora without Linux dromaeo, assuming bug 799277 gets fixed first.
This seems reasonably important, so here's the push with PGO enabled:
https://tbpl.mozilla.org/?tree=Try&rev=3ebe093955e1
With profiling disabled and PGO off, there's no crash.
(Reporter)

Comment 5

5 years ago
With profiling disabled and PGO on, there's a crash. Sorry, this is going to suck for you.
I guess this is an argument against not running both Linux64 and Linux32...

Updated

5 years ago
Crash Signature: [@ js::gc::MarkKind]
Keywords: crash

Updated

5 years ago
Blocks: 785991
Blocks: 799434

Updated

5 years ago
tracking-firefox17: --- → +

Updated

5 years ago
tracking-firefox17: + → ---
tracking-firefox18: --- → +
Status: NEW → ASSIGNED
QA Contact: dvander
After thinking about this more and talking with members of the JS team, I think we should stop testing Linux PGO builds.

All evidence suggests that PGO is endlessly buggy. We've wasted plenty of developer man-hours tracking down PGO-only crashes, only to find that a specific function has been optimized wrong and must be explicitly deoptimized. There's really nothing good about this situation, littering the codebase with untestable compiler bug workaround hints that may or may not randomly become obsolete. (I'm also sceptical about how beneficial PGO is, but I won't argue that here.)

PGO deeply and zealously modifies calling conventions, function bodies, and large callgraphs. We've seen it cause test failures and top crashes that are very difficult to debug. That's even on *Windows*, where Microsoft has thrown a huge amount of resources into their C++ compiler. I'm not convinced GCC's implementation will be better, and from what I've heard, PGO has been historically buggy in compilers in general.

It is possible there is a real bug here. If so, we'll see it one way or another, likely in a context that is more easily reproducible and easier to debug.

So, I say we stop testing Linux PGO, free up releng resources and developer time, and get rid of these spurious bugs :)
Assignee: general → dvander
QA Contact: dvander
(In reply to David Anderson [:dvander] from comment #7)
> So, I say we stop testing Linux PGO, free up releng resources and developer
> time, and get rid of these spurious bugs :)

+1 from RelEng I think :)  Can you propose this in the newsgroups?
(In reply to Chris AtLee [:catlee] from comment #8)
> (In reply to David Anderson [:dvander] from comment #7)
> > So, I say we stop testing Linux PGO, free up releng resources and developer
> > time, and get rid of these spurious bugs :)
> 
> +1 from RelEng I think :)  Can you propose this in the newsgroups?

Sure, which newsgroups?
(In reply to David Anderson [:dvander] from comment #9)
> Sure, which newsgroups?

mozilla.dev.platform
I've hidden this test on aurora now. There's no point in starring this on every run.
(Reporter)

Comment 12

5 years ago
News from various fronts:

It's not actually every run on aurora, only most: https://tbpl.mozilla.org/?tree=Mozilla-Aurora&noignore=1&rev=564500b98bac&jobname=Rev3%20Fedora%2012%20mozilla-aurora%20pgo%20talos%20dromaeojs caught a green that made me wonder if something accidentally "fixed" it, but it didn't.

Something in https://hg.mozilla.org/projects/profiling/pushloghtml?startID=1382&endID=1383 (bug 804636? bug 800568? bug 800063? bug 778948? something less obvious?) on the trunk caused the dromaeo_css crash to become less frequent, but added a similar crash

Thread 0 (crashed)
 0  libxul.so!ReadAllocation [Registers.h:6ed9ca8c36bd : 112 + 0x4]
    eip = 0x01ebdcf4   esp = 0xbf94abd0   ebp = 0xbf94ace0   ebx = 0x02df5100
    esi = 0xbf94aca8   edi = 0x00000008   eax = 0x00000000   ecx = 0x0000001f
    edx = 0x00000000   efl = 0x00010286
    Found by: given as instruction pointer in context
 1  libxul.so!js::ion::MarkIonActivations(JSRuntime*, JSTracer*) [IonFrames.cpp:6ed9ca8c36bd : 506 + 0xa]
    eip = 0x0295243b   esp = 0xbf94ac30   ebp = 0xbf94ace0   ebx = 0x02df5100
    esi = 0xb3e3a100   edi = 0xbf94aca8
    Found by: call frame info
 2  libxul.so!js::MarkRuntime [jsgc.cpp:6ed9ca8c36bd : 2515 + 0xb]
    eip = 0x027eebaf   esp = 0xbf94ad10   ebp = 0xa5eef274   ebx = 0x02df5100
    esi = 0xb3e3a100   edi = 0xb3e3a000
    Found by: call frame info
 3  libxul.so!IncrementalCollectSlice [jsgc.cpp:6ed9ca8c36bd : 3351 + 0x4]
    eip = 0x027f79f3   esp = 0xbf94adb0   ebp = 0x00000015   ebx = 0x02df5100
    esi = 0xbf94aec0   edi = 0xbf94aeb8
    Found by: call frame info
 4  libxul.so!GCCycle [jsgc.cpp:6ed9ca8c36bd : 4447 + 0xf]
    eip = 0x027facd4   esp = 0xbf94af00   ebp = 0x00000000   ebx = 0x02df5100
    esi = 0xbf94af28   edi = 0x00000000
    Found by: call frame info
 5  libxul.so!Collect [jsgc.cpp:6ed9ca8c36bd : 4561 + 0x23]
    eip = 0x027f2829   esp = 0xbf94af50   ebp = 0x00000001   ebx = 0x02df5100
    esi = 0xb3e3a1c0   edi = 0xb3e3a000
    Found by: call frame info
 6  libxul.so!js_InvokeOperationCallback(JSContext*) [jscntxt.cpp:6ed9ca8c36bd : 1205 + 0x29]
    eip = 0x027d6994   esp = 0xbf94afb0   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xa4409990   edi = 0x9ef64010
    Found by: call frame info
 7  libxul.so!js_HandleExecutionInterrupt(JSContext*) [jscntxt.cpp:6ed9ca8c36bd : 1229 + 0x7]
    eip = 0x027d69bc   esp = 0xbf94afe0   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xffffff87   edi = 0x9ef64010
    Found by: call frame info
 8  libxul.so!js::ion::InterruptCheck(JSContext*) [VMFunctions.cpp:6ed9ca8c36bd : 387 + 0xb]
    eip = 0x01ec40cb   esp = 0xbf94b000   ebp = 0xffffff81   ebx = 0x00000003
    esi = 0xffffff87   edi = 0x9ef64010
    Found by: call frame info
 9  0x305e773
    eip = 0x0305e774   esp = 0xbf94b020   ebp = 0xffffff81   ebx = 0x00000003

in test_MochiKit-DOM-Safari.html. So far, https://tbpl.mozilla.org/?tree=Profiling&rev=6ed9ca8c36bd makes that look permaorange, though it could also only be 99 of 100 like the aurora dromaeo_css crash.

That sets a more firm "do something" deadline of November 19th, since we'll be a little less sanguine about just hiding mochitest-1 on aurora-19.
Recommend untracking for Firefox 18 - this bug is self-inflicted by using experimental compiler features, and we don't seem to have any data on whether it affects Linux users.
Assignee: dvander → general
Status: ASSIGNED → NEW
(Reporter)

Comment 14

5 years ago
Taras: you drove PGO on Linux in bug 559964, because it was "a huge(30%) startup improvement on Linux" - you okay with just dropping it rather than fixing this bustage?
We might consider using a newer GCC version, to see if this bug goes away. It looks like we're two behind, 4.5 versus 4.7.
Depending on tracking bug for 19-nov-2012 migration work.  Please unblock if this is incorrect.
Blocks: 786550
(Reporter)

Comment 17

5 years ago
I guess it's merge-day in the sense that when 18 migrates to mozilla-beta, this will break that test suite there, and we will just hide it there, too?
Summary: dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on mozilla-aurora → dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up
(In reply to David Anderson [:dvander] from comment #13)
> Recommend untracking for Firefox 18 - this bug is self-inflicted by using
> experimental compiler features, and we don't seem to have any data on
> whether it affects Linux users.

Not tracking for FF18 given this comment .

Updated

5 years ago
tracking-firefox18: + → -
(Reporter)

Comment 19

5 years ago
We hid the dromaeojs job on Aurora, back when we thought this was just something that was going to be fixed in the first couple of days after the merge; turns out that was probably tree mismanagement on our part. It's now unhidden.
(Reporter)

Comment 23

5 years ago
What does it actually mean to untrack this bug?

Tracking+ for an unfiled bug to stop doing PGO on Linux on all trunk trees, and for another unfiled bug to stop doing PGO on Linux on Aurora, and for another unfiled bug to stop doing PGO on Linux on Beta at the next merge, and for another unfiled bug to stop doing PGO on Linux on Release at the merge after that, and for another unfiled bug to switch release automation to not do PGO on Linux at that same merge?

Or tracking+ for an unfiled bug to just drop the dromaeojs talos suite on trunk and on each branch as 18 merges through it, assuming that we only crash in that talos suite, and not anywhere else?
tracking-firefox18: - → ?
(Reporter)

Comment 24

5 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=16918557&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=16919019&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=16898917&tree=Profiling
Summary: dromaeo_css crashing every run on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up → dromaeo_css crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up
status-firefox17: --- → unaffected
status-firefox18: --- → affected
status-firefox19: --- → affected
status-firefox-esr17: --- → unaffected
Actually, I don't see anything here that is actionable by releng.  I'm removing the dependency on our migration work tracking bug.
No longer blocks: 786550
m-c PGO no-profiling MOZ_PGO_OPTIMIZE_FLAGS=$MOZ_OPTIMIZE_FLAGS
https://tbpl.mozilla.org/?tree=Try&rev=d1314a03a613
Keywords: regressionwindow-wanted
(In reply to Phil Ringnalda (:philor) from comment #23)
> What does it actually mean to untrack this bug?

We're trying to get to a resolution in the original thread https://groups.google.com/d/msg/mozilla.dev.platform/hud9aK0qZWI/lVvYys-U_nYJ.
(In reply to Phil Ringnalda (:philor) from comment #23)
> Or tracking+ for an unfiled bug to just drop the dromaeojs talos suite on
> trunk and on each branch as 18 merges through it, assuming that we only
> crash in that talos suite, and not anywhere else?

The discussion on dev-platform has played out, and we can just disable the failing tests on all branches for Linux PGO (no reason to wait for FF18 to merge to each branch).
tracking-firefox18: ? → +
(In reply to Karl Tomlinson (:karlt) from comment #37)
> m-c PGO no-profiling MOZ_PGO_OPTIMIZE_FLAGS=$MOZ_OPTIMIZE_FLAGS
> https://tbpl.mozilla.org/?tree=Try&rev=d1314a03a613

That seems to leave only PGO as the difference between pass and fail.

(In reply to Phil Ringnalda (:philor) from comment #38)
> If the regressionwindow isn't
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=d042ad078f43&tochange=fdfaef738a00 I'll be astonished.

fdfaef738a00 and back to dd9554c236dc at least demonstrate the bug.
The older revisions that I picked either didn't compile, or hit unknown run_tests.py options or PerfConfigurator.py errors.
Keywords: regressionwindow-wanted
(Reporter)

Updated

5 years ago
Depends on: 812076
(Reporter)

Updated

5 years ago
Blocks: 810705
(Assignee)

Comment 50

5 years ago
So, this is not a crash in dromaeo_css, but a crash in v8. And if I download e.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-linux/1352918270/firefox-18.0a2.en-US.linux-i686.tar.bz2 and go to http://v8.googlecode.com/svn/data/benchmarks/v7/run.html, I can reproduce.
Summary: dromaeo_css crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up → v8 crashing most runs on Linux32 PGO, in js::gc::MarkKind, on Gecko 18 and up

Comment 52

5 years ago
(In reply to Mike Hommey [:glandium] from comment #51)
> Even better with a nightly

Well, that's an aurora build, actually. ;-)

Thanks for reproducing, though, that's great!

Can you track this down to a regression range in 18.0a1 nightly by bisecting nightly builds?
Keywords: regressionwindow-wanted, reproducible
(Assignee)

Comment 53

5 years ago
So, from what I can see in a debugger, this looks very much like a stack corruption bug. The interesting part is that it started happening the day aurora became 18, and wasn't happening on 18 central. The even more interesting part is that it stopped happening since we merged 19 to aurora, and, obviously, moved to beta.
It never happened on central because on central, we build with --enable-profiling, which enables frame pointers, which changes the stack layout, since frame pointers are kept on stack. This may well have contributed to hide the bug.

I'm inclined to think there's an ionmonkey bug hiding in there, and it was fixed on m-c during the 19 cycle.
(Assignee)

Comment 55

5 years ago
This is interesting: while i can't manage to get a crash out of a non pgo build, i *do* get a busy script popup during a v8 run, which i'd normally never get.
(Assignee)

Comment 56

5 years ago
(In reply to Mike Hommey [:glandium] from comment #55)
> This is interesting: while i can't manage to get a crash out of a non pgo
> build, i *do* get a busy script popup during a v8 run, which i'd normally
> never get.

A non pgo build from the profiling branch, so without frame pointers.
(Assignee)

Comment 57

5 years ago
(In reply to Mike Hommey [:glandium] from comment #56)
> (In reply to Mike Hommey [:glandium] from comment #55)
> > This is interesting: while i can't manage to get a crash out of a non pgo
> > build, i *do* get a busy script popup during a v8 run, which i'd normally
> > never get.
> 
> A non pgo build from the profiling branch, so without frame pointers.

In fact, that also happens on nightly, but it takes more tries. Each new run of v8 runs the regexp test slower and slower, until the busy script popup shows up. It takes about 5 attempts here to get from a score of about 1000 to about 250.

The crashes on aurora seem to happen during the EarleyBoyer test, though, which is before regexp.

Comment 58

5 years ago
(In reply to Mike Hommey [:glandium] from comment #53)
> It never happened on central because on central, we build with
> --enable-profiling, which enables frame pointers, which changes the stack
> layout, since frame pointers are kept on stack. This may well have
> contributed to hide the bug.

Hrm, is there some other way we can track it down to a checkin range (one that isn't full 6 weeks)?

(In reply to Mike Hommey [:glandium] from comment #57)
> In fact, that also happens on nightly, but it takes more tries. Each new run
> of v8 runs the regexp test slower and slower, until the busy script popup
> shows up. It takes about 5 attempts here to get from a score of about 1000
> to about 250.

Even if that's a different bug, this sounds like something we *really* should have on file and should investigate.
(Assignee)

Comment 59

5 years ago
When running under valgrind, the first warning i get when running v8 is:
==12930== Use of uninitialised value of size 4
==12930==    at 0x64F5661: ??? (in libxul.so)

Sadly, I don't know how to tell valgrind to get the symbols from the files i downloaded from the symbol server, but resolving manually tells me this is http://hg.mozilla.org/releases/mozilla-aurora/file/149ad5c9f94d/js/src/ion/LIR.h#l94

This doesn't look very good.
(Assignee)

Comment 60

5 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #58)
> Hrm, is there some other way we can track it down to a checkin range (one
> that isn't full 6 weeks)?

Apart from using try to do pgo builds without frame pointer on older checkins, I have no idea. Let's start by testing the ionmonkey merge changeset...

Comment 62

5 years ago
Does this only happen on PGO builds?
(Reporter)

Comment 63

5 years ago
Yes.
(In reply to Mike Hommey [:glandium] from comment #60)
> Let's start by testing the ionmonkey merge changeset...

See comment 49.  I expect its possible to patch the issues with older revisions there, but I'd guess a better first step would be to have someone who knows the code look at the valgrind warnings.  Perhaps a try build with stripping disabled might make that easier.

(In reply to Ehsan Akhgari [:ehsan] from comment #62)
> Does this only happen on PGO builds?

AFA we K.  See comment 49.
(Reporter)

Comment 65

5 years ago
Heck, you can see comment 5, for that matter. It's PGO-only, and that's the reason I added PGO builds to the Profiling branch, which we didn't have at the time of the IM merge.

Comment 66

5 years ago
So it seems like the best bet is for an IonMonkey developer to bisect the range...
(Assignee)

Comment 67

5 years ago
Additional data point: https://hg.mozilla.org/try/rev/fdfaef738a00 , which is the fixup changeset after the ionmonkey merge *does* have the problem.
(In reply to Mike Hommey [:glandium] from comment #57)
> In fact, that also happens on nightly, but it takes more tries. Each new run
> of v8 runs the regexp test slower and slower, until the busy script popup
> shows up. It takes about 5 attempts here to get from a score of about 1000
> to about 250.

We're tracking regexp performance in bug 806646, which is on track to be fixed soon. Right now we're pretty sad on this test, but I don't think it's related here.

> The crashes on aurora seem to happen during the EarleyBoyer test, though,
> which is before regexp.

Another thing you could try is seeing if this reproduces using gcc-4.7.
(Assignee)

Comment 69

5 years ago
Some of the valgrind errors from a rebuild of current beta with pgo, --enable-valgrind, --disable-install-strip and --disable-elf-hack:
==22533== Use of uninitialised value of size 4
==22533==    at 0x62F60A1: CanEncodeInfoInHeader(js::ion::LAllocation const&, unsigned int*) (LIR.h:94)
==22533==    by 0x2BB: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F6073: CanEncodeInfoInHeader(js::ion::LAllocation const&, unsigned int*) (Safepoints.cpp:192)
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B8A: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:392)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B99: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:395)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844BA3: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:398)
==22533==    by 0x7B44557: ???
==22533==    by 0x37F: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B8A: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:392)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844B99: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:395)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Conditional jump or move depends on uninitialised value(s)
==22533==    at 0x5844BA3: PartFromStream(js::ion::CompactBufferReader&, NunboxPartKind, unsigned int) (Safepoints.cpp:398)
==22533==    by 0x7B44557: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
==22533== 
==22533== Use of uninitialised value of size 4
==22533==    at 0x583FEE0: ReadAllocation(js::ion::IonFrameIterator const&, js::ion::LAllocation const*) (IonFrames.cpp:434)
==22533==    by 0x1CD: ???
==22533==  Uninitialised value was created by a stack allocation
==22533==    at 0x62F61F5: js::ion::SafepointWriter::writeNunboxParts(js::ion::LSafepoint*) (Safepoints.cpp:209)
(Assignee)

Comment 71

5 years ago
Thanks to the valgrind output, it ended being easy to spot what is wrong:

01a0c070 <_ZL21CanEncodeInfoInHeaderRKN2js3ion11LAllocationEPj>:
 1a0c070:	56                   	push   %esi
 1a0c071:	89 c6                	mov    %eax,%esi
 1a0c073:	83 ec 08             	sub    $0x8,%esp
 1a0c076:	8b 08                	mov    (%eax),%ecx
 1a0c078:	f6 c1 01             	test   $0x1,%cl
 1a0c07b:	75 1b                	jne    1a0c098 <_ZL21CanEncodeInfoInHeaderRKN2js3ion11LAllocationEPj+0x28>
(snip)
 1a0c098:	8b 0e                	mov    (%esi),%ecx
 1a0c09a:	8b 54 24 04          	mov    0x4(%esp),%edx
 1a0c09e:	c1 f9 04             	sar    $0x4,%ecx
 1a0c0a1:	89 0a                	mov    %ecx,(%edx)
(snip)

0x1a0c0a1 above corresponds to 0x62F60A1 in valgrind's output. Essentially, what is happening is that a value is taken from the stack at 0x1a0c09a that was never stored...

This appears to be a miscompilation from gcc 4.5 with PGO enabled, but I was able to get a similarly broken code out of it without PGO with the following patch:
diff --git a/js/src/ion/Safepoints.cpp b/js/src/ion/Safepoints.cpp
--- a/js/src/ion/Safepoints.cpp
+++ b/js/src/ion/Safepoints.cpp
@@ -8,6 +8,7 @@
 #include "Safepoints.h"
 #include "IonSpewer.h"
 #include "LIR.h"
+#include "mozilla/Likely.h"
 
 using namespace js;
 using namespace ion;
@@ -190,7 +191,7 @@ AllocationToPartKind(const LAllocation &
 static inline bool
 CanEncodeInfoInHeader(const LAllocation &a, uint32 *out)
 {
-    if (a.isGeneralReg()) {
+    if (MOZ_LIKELY(a.isGeneralReg())) {
         *out = a.toGeneralReg()->reg().code();
         return true;
     }

If I manually patch a broken libxul.so with the function machine code from a working build, I get a non-crashy build, and valgrind never complains.

It looks like, from succinct local testing, that gcc 4.7 would generate proper code. Although a notable difference between gcc 4.7 and gcc 4.5 is that gcc 4.7 inlines the function in most cases where gcc 4.5 doesn't. I'm testing a try build replacing inline with MOZ_ALWAYS_INLINE on that function, and see if the miscompilation is worked around this way.
(Assignee)

Comment 72

5 years ago
Created attachment 684474 [details] [diff] [review]
Work around gcc 4.5 miscompilation of CanEncodeInfoInHeader by always inlining it

https://tbpl.mozilla.org/?tree=Try&rev=c0b53bdc7f7a
Attachment #684474 - Flags: review?(dvander)
(Assignee)

Updated

5 years ago
Assignee: general → mh+mozilla
Comment on attachment 684474 [details] [diff] [review]
Work around gcc 4.5 miscompilation of CanEncodeInfoInHeader by always inlining it

Review of attachment 684474 [details] [diff] [review]:
-----------------------------------------------------------------

Nice find! Small nit, comments in this file should use //
Attachment #684474 - Flags: review?(dvander) → review+
(Assignee)

Comment 74

5 years ago
https://hg.mozilla.org/mozilla-central/rev/d8e4f06198dc
Status: NEW → RESOLVED
Last Resolved: 5 years ago
tracking-firefox18: + → ?
Keywords: regressionwindow-wanted
Resolution: --- → FIXED
(Assignee)

Comment 75

5 years ago
Comment on attachment 684474 [details] [diff] [review]
Work around gcc 4.5 miscompilation of CanEncodeInfoInHeader by always inlining it

[Approval Request Comment]
Bug caused by (feature/regressing bug #): gcc 4.5 miscompiling ionmonkey
User impact if declined: Pretty much random crashes when js is involved
Testing completed (on m-c, etc.): heavily tested on try, although only tested on linux x86.
Risk to taking this patch (and alternatives if risky): the patch in itself doesn't change much. It changes an "inline" to an "MOZ_ALWAYS_INLINE", which only forces compilers to inline the function. There is a tiny possibility that other compilers start miscompiling as a result, but it's quite unlikely: clang is already inlining the function, so forcing an inline won't change that, and i suspect msvc does too.
String or UUID changes made by this patch: none.
Attachment #684474 - Flags: approval-mozilla-beta?
Attachment #684474 - Flags: approval-mozilla-aurora?

Updated

5 years ago
tracking-firefox18: ? → +

Updated

5 years ago
Attachment #684474 - Flags: approval-mozilla-beta?
Attachment #684474 - Flags: approval-mozilla-beta+
Attachment #684474 - Flags: approval-mozilla-aurora?
Attachment #684474 - Flags: approval-mozilla-aurora+
(Assignee)

Comment 77

5 years ago
https://hg.mozilla.org/releases/mozilla-beta/rev/725506e4c115
status-firefox18: affected → fixed
dromaeojs unhidden on beta
(Assignee)

Updated

5 years ago
status-firefox19: affected → fixed

Updated

5 years ago
Keywords: verifyme

Updated

5 years ago
Keywords: verifyme

Updated

5 years ago
Whiteboard: [qa-]
(Reporter)

Updated

5 years ago
Duplicate of this bug: 810705
You need to log in before you can comment on or make changes to this bug.