Closed Bug 1233666 Opened 9 years ago Closed 9 years ago

Linux pgo M-e10s(bc2) on mozilla-aurora: application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]

Categories

(Core :: mozglue, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
e10s + ---
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: xidorn, Assigned: xidorn)

References

Details

Attachments

(2 files)

See https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=Linux%20pgo%20Mochitest%20e10s%20Mochitest%20e10s%20Browser%20Chrome%20M-e10s%28bc2%29&tochange=1ba4f6d91e63

All of the Linux pgo M-e10s(bc2) failures are incorrectly marked as intermittent bug 1113930, but actually they are permafail and are different from bug 1113930.
Attachment #8699936 - Flags: review?(mh+mozilla)
Comment on attachment 8699936 [details]
MozReview Request: Bug 1233666 - Remove hacks for getting frame pointer for x86/x64 gcc.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/28511/diff/1-2/
Attached file builtin function test
I used this code to test the behavior of __builtin_frame_address. It seems to work fine on both x86 and x64, with optimization both on and off.
See Also: → 1113930
It seems the test chunking is different between opt build and pgo build even if I use pgo settings via config overriding. Although the M-e10s(bc2) in my try push passed, but I don't see the crashing test to be actually in it.

philor, could you help testing the patch with Linux pgo build?
Flags: needinfo?(philringnalda)
I don't think it can be done (you need to have 5 chunks like we do for bc-e10s on PGO, rather than the 7 of opt, but for that the number of chunks is set by buildbot, which actually has an assert that you aren't trying to pass in a different number from in-tree mozharness configs), but I also don't think it needs to be done: when I push trunk-as-aurora simulations, I get that crash 100% of the time in bc-e10s-7 instead of -2, that being where browser/base/content/test/general/ winds up running.

That seems to mean either that we walk some stack during shutdown of running the tests in that directory, and then print absolutely nothing from walking it unless we crash while walking it, or I guess maybe (and much more unlikely) that your patch completely and silently breaks stack walking. Either way, I have every reason to believe that the results of your patch would look exactly the same if you were running 5 chunks as they do with 7, it would just be green on 2/5 rather than being green on 7/7.
Flags: needinfo?(philringnalda)
Comment on attachment 8699936 [details]
MozReview Request: Bug 1233666 - Remove hacks for getting frame pointer for x86/x64 gcc.

https://reviewboard.mozilla.org/r/28511/#review25615

::: mozglue/misc/StackWalk.cpp:899
(Diff revision 1)
> -  void** bp;
> +  void** bp = (void**)__builtin_frame_address(0);

It's interesting to note that the x86_64 alternative code (there actually was a branch for x86_64 with a movl %%rbp) was removed in the seemingly unrelated bug 326594. Note that AFAICT, this actually doesn't change the generated code, so I wouldn't expect this patch to actually do anything in practice. I'm r+ing anyways because I don't see this being hurtful (looks like the bug this was supposed to circumvent is long gone), but I don't expect it to fix failures either.
Attachment #8699936 - Flags: review?(mh+mozilla) → review+
(In reply to Mike Hommey [:glandium] from comment #7)
> Comment on attachment 8699936 [details]
> MozReview Request: Bug 1233666 - Remove hacks for getting frame pointer for
> x86/x64 gcc.
> 
> ..., but I don't expect it to fix failures either.

According to comment 6, it fixes:
> when I push trunk-as-aurora simulations, I get that crash 100% of the time in bc-e10s-7 instead of -2
but my try push shows all green on this test.
If your try push is a "simple" push of mozilla-central to try, with PGO enabled, then it doesn't match aurora PGO builds. Aurora doesn't build with --enable-profiling, which means it builds with -fomit-frame-pointer, while m-c builds with -fno-omit-frame-pointer. Which means that bp has a valid value on m-c, and not on aurora.

Looking at the disassembly of MozStackWalk on your try push, it looks like:
 8051102:       55                      push   %ebp
 8051103:       89 e5                   mov    %esp,%ebp
 8051105:       53                      push   %ebx
 8051106:       e8 45 f9 ff ff          call   8050a50 <__x86.get_pc_thunk.bx>
 805110b:       81 c3 9d 11 02 00       add    $0x2119d,%ebx
 8051111:       83 ec 24                sub    $0x24,%esp
 8051114:       e8 97 fe ff ff          call   8050fb0 <StackWalkInitCriticalAddress()>
 8051119:       8b 45 14                mov    0x14(%ebp),%eax
 805111c:       8b 55 10                mov    0x10(%ebp),%edx
 805111f:       8b 4d 0c                mov    0xc(%ebp),%ecx
 8051122:       89 6c 24 10             mov    %ebp,0x10(%esp)
 8051126:       89 44 24 0c             mov    %eax,0xc(%esp)
 805112a:       8b 45 08                mov    0x8(%ebp),%eax
 805112d:       c7 44 24 14 ff ff ff    movl   $0xffffffff,0x14(%esp)
 8051134:       ff 
 8051135:       89 54 24 08             mov    %edx,0x8(%esp)
 8051139:       89 4c 24 04             mov    %ecx,0x4(%esp)
 805113d:       89 04 24                mov    %eax,(%esp)
 8051140:       e8 55 ff ff ff          call   805109a <mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)>
 8051145:       83 c4 24                add    $0x24,%esp
 8051148:       5b                      pop    %ebx
 8051149:       5d                      pop    %ebp
 805114a:       c3                      ret    

On an aurora nightly, it looks like:
08051102 <MozStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, unsigned int, void*)>:
 8051102:       53                      push   %ebx
 8051103:       e8 48 f9 ff ff          call   8050a50 <__x86.get_pc_thunk.bx>
 8051108:       81 c3 a0 11 02 00       add    $0x211a0,%ebx
 805110e:       83 ec 28                sub    $0x28,%esp
 8051111:       e8 9a fe ff ff          call   8050fb0 <StackWalkInitCriticalAddress()>
 8051116:       8b 54 24 3c             mov    0x3c(%esp),%edx
 805111a:       8b 4c 24 38             mov    0x38(%esp),%ecx
 805111e:       89 e8                   mov    %ebp,%eax
 8051120:       89 44 24 10             mov    %eax,0x10(%esp)
 8051124:       8b 44 24 34             mov    0x34(%esp),%eax
 8051128:       89 54 24 0c             mov    %edx,0xc(%esp)
 805112c:       8b 54 24 30             mov    0x30(%esp),%edx
 8051130:       c7 44 24 14 ff ff ff    movl   $0xffffffff,0x14(%esp)
 8051137:       ff 
 8051138:       89 4c 24 08             mov    %ecx,0x8(%esp)
 805113c:       89 44 24 04             mov    %eax,0x4(%esp)
 8051140:       89 14 24                mov    %edx,(%esp)
 8051143:       e8 52 ff ff ff          call   805109a <mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)>
 8051148:       83 c4 28                add    $0x28,%esp
 805114b:       5b                      pop    %ebx
 805114c:       c3                      ret    


Note how ebp is not filled with esp in the latter, and note how both pass %ebp as argument to FramePointerStackWalk in 0x10(%esp) (albeit indirectly in the latter)
And yes, this is actually a dupe of bug 1113930.
No, it's not. It's actually a permafail, which bug 1113930 isn't. And this happens before the first iteration, but bug 1113930 happens in some later iteration.
I pushed the aurora with PGO enabled to try. But I admit that I'm not sure that would behave exactly the same as if it is in aurora tree because of other infra settings. But :philor said pushing aurora head to try directly also showed the same permafail.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #11)
> No, it's not. It's actually a permafail, which bug 1113930 isn't. And this
> happens before the first iteration, but bug 1113930 happens in some later
> iteration.

Read my comment in bug 1113930. They both are the same problem.
Which comment exactly? I think I've read all of them, and I don't think there's anything covers how should we fix the initial value of bp passed into FramePointerStackWalk, except that you state we may need a check before the first dereference.

Yes, a check before the first dereference would close this crash, but having a correct initial value is better, no?
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #12)
> I pushed the aurora with PGO enabled to try.

That's interestingly true... but even though StackWalk.cpp was built with -fomit-frame-pointer, MozStackWalk contains the mov %esp, %ebp from the disassembly I pasted... So I guess using __builtin_frame_address forces gcc to fill %ebp with a valid value. So the first iteration would work, but other iterations would fail randomly on bug 1113930. OTOH, fixing bug 1113930 would *also* fix this one, independently of the __buitin_frame_address change.
(In reply to Mike Hommey [:glandium] from comment #15)
> (In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #12)
> > I pushed the aurora with PGO enabled to try.
> 
> OTOH, fixing bug
> 1113930 would *also* fix this one, independently of the
> __buitin_frame_address change.

That depends on how we fix bug 1113930. With my current patch, no, because no check is performed before the first dereference. And I don't actually think we need, if __buitin_frame_address can reliable give us the correct result.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #16)
> (In reply to Mike Hommey [:glandium] from comment #15)
> > (In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #12)
> > > I pushed the aurora with PGO enabled to try.
> > 
> > OTOH, fixing bug
> > 1113930 would *also* fix this one, independently of the
> > __buitin_frame_address change.
> 
> That depends on how we fix bug 1113930. With my current patch, no, because
> no check is performed before the first dereference. And I don't actually
> think we need, if __buitin_frame_address can reliable give us the correct
> result.

FramePointerStackWalk is a "public" API, and is used outside of StackWalk.cpp. The only other use is supposedly safe, but it's better not to assume that.
(In reply to Mike Hommey [:glandium] from comment #17)
> FramePointerStackWalk is a "public" API, and is used outside of
> StackWalk.cpp. The only other use is supposedly safe, but it's better not to
> assume that.

OTOH, there is no reasonable way to enforce that, short of making the caller give a pointer to the stack start, which might be as wrong as bp, so... meh. The patch here is r+ anyways, it's just not complete without bug 1113930.
It's complete to fix the bustage on aurora anyway.

Confirmed that without this patch, aurora on try also busted on M-e10s bc7:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a3a96215e45
https://hg.mozilla.org/integration/mozilla-inbound/rev/61e79fef8e0c44d06b0e9e377ec19602bd269f2c
Bug 1233666 - Remove hacks for getting frame pointer for x86/x64 gcc. r=glandium
Assignee: nobody → quanxunzhen
Comment on attachment 8699936 [details]
MozReview Request: Bug 1233666 - Remove hacks for getting frame pointer for x86/x64 gcc.

Approval Request Comment
[Feature/regressing bug #]: unknown, permafail after merge
[User impact if declined]: n/a
[Describe test coverage new/current, TreeHerder]: n/a
[Risks and why]: very low risk as it is only related to stack walker and it just removes some no longer needed hacks
[String/UUID change made/needed]: n/a
Attachment #8699936 - Flags: approval-mozilla-aurora?
Keywords: leave-open
Comment on attachment 8699936 [details]
MozReview Request: Bug 1233666 - Remove hacks for getting frame pointer for x86/x64 gcc.

Major pain for Sheriff, taking it.
Attachment #8699936 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/61e79fef8e0c
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: