Closed Bug 1113930 Opened 10 years ago Closed 8 years ago

Intermittent browser_ruleview_content_02.js,browser_markupview_links_01.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]

Categories

(Core :: Gecko Profiler, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: RyanVM, Assigned: glandium)

References

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(2 files, 2 obsolete files)

16:48:22 INFO - 155 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There are 3 headers for inherited rules
16:48:22 INFO - 156 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The first header is correct
16:48:22 INFO - 157 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The second header is correct
16:48:22 INFO - 158 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The third header is correct
16:48:22 INFO - 159 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There are 4 rules in the view
16:48:22 INFO - 160 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct
16:48:22 INFO - 161 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected
16:48:22 INFO - 162 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected
16:48:22 INFO - 163 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct
16:48:22 INFO - 164 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected
16:48:22 INFO - 165 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected
16:48:22 INFO - 166 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct
16:48:22 INFO - 167 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected
16:48:22 INFO - 168 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected
16:48:22 INFO - 169 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct
16:48:22 INFO - 170 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected
16:48:22 INFO - 171 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected
16:48:22 INFO - 172 INFO Leaving test
16:48:22 INFO - 173 ERROR TEST-UNEXPECTED-FAIL | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | application terminated with exit code 11
16:48:22 INFO - runtests.py | Application ran for: 0:07:33.138440
16:48:22 INFO - zombiecheck | Reading PID log: /tmp/tmpQHmEQ7pidlog
16:48:36 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/5a7d2fee-8218-21c4-1505e452-765698ba.dmp
16:48:36 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/5a7d2fee-8218-21c4-1505e452-765698ba.extra
16:48:36 WARNING - PROCESS-CRASH | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]
16:48:36 INFO - Crash dump filename: /tmp/tmpkwtrvU.mozrunner/minidumps/5a7d2fee-8218-21c4-1505e452-765698ba.dmp
16:48:36 INFO - Operating system: Linux
16:48:36 INFO - 0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
16:48:36 INFO - CPU: x86
16:48:36 INFO - GenuineIntel family 6 model 62 stepping 4
16:48:36 INFO - 1 CPU
16:48:36 INFO - Crash reason: SIGSEGV
16:48:36 INFO - Crash address: 0xffffff88
16:48:36 INFO - Thread 0 (crashed)
16:48:36 INFO - 0 libxul.so!mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*) [nsStackWalk.cpp:315c981e3f7d : 894 + 0x0]
16:48:36 INFO - eip = 0xb2718204 esp = 0xbfbf79a0 ebp = 0xbfbf79b8 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x00000052 edi = 0xffffff88 eax = 0xffffff88 ecx = 0xb76658ac
16:48:36 INFO - edx = 0xffffffae efl = 0x00210202
16:48:36 INFO - Found by: given as instruction pointer in context
16:48:36 INFO - 1 libxul.so!nsTraceRefcnt::WalkTheStack(_IO_FILE*) [nsTraceRefcnt.cpp:315c981e3f7d : 963 + 0x12]
16:48:36 INFO - eip = 0xb271b7cd esp = 0xbfbf79c0 ebp = 0xbfbf79e8 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0xb7664d9c edi = 0xbfbf7a34
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 2 libxul.so!NS_DebugBreak [nsDebugImpl.cpp:315c981e3f7d : 448 + 0xf]
16:48:36 INFO - eip = 0xb26fb7d3 esp = 0xbfbf79f0 ebp = 0xbfbf7e38 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0xb7664d9c edi = 0xbfbf7a34
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 3 libxul.so!nsStyleContext::~nsStyleContext() [nsStyleContext.cpp:315c981e3f7d : 90 + 0x20]
16:48:36 INFO - eip = 0xb3dd5939 esp = 0xbfbf7e40 ebp = 0xbfbf7e88 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x9e8463e0 edi = 0x8d049400
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 4 libxul.so!nsStyleContext::Destroy() [nsStyleContext.cpp:315c981e3f7d : 941 + 0x8]
16:48:36 INFO - eip = 0xb3dd5987 esp = 0xbfbf7e90 ebp = 0xbfbf7eb8 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x8d049400 edi = 0x9e8463e0
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 5 libxul.so!nsStyleContext::Release() [nsStyleContext.h:315c981e3f7d : 93 + 0x8]
16:48:36 INFO - eip = 0xb2fff03d esp = 0xbfbf7ec0 ebp = 0xbfbf7ee8 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x9e8463e0 edi = 0x00000000
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 6 libxul.so!nsTArray_Impl<nsRefPtr<nsStyleContext>, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned int, unsigned int) [nsRefPtr.h:315c981e3f7d : 59 + 0xe]
16:48:36 INFO - eip = 0xb3eab9c9 esp = 0xbfbf7ef0 ebp = 0xbfbf7f28 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x00000000 edi = 0x95ba9050
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 7 libxul.so!nsTArray_Impl<nsRefPtr<nsStyleContext>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:315c981e3f7d : 1406 + 0x9]
16:48:36 INFO - eip = 0xb3eaba2d esp = 0xbfbf7f30 ebp = 0xbfbf7f48 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9050
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 8 libxul.so!nsTransformedTextRun::~nsTransformedTextRun() [nsTArray.h:315c981e3f7d : 785 + 0x7]
16:48:36 INFO - eip = 0xb3eaba8c esp = 0xbfbf7f50 ebp = 0xbfbf7f78 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9050
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 9 libxul.so!nsTransformedTextRun::~nsTransformedTextRun() [nsTextRunTransformations.h:315c981e3f7d : 95 + 0x8]
16:48:36 INFO - eip = 0xb3eabac2 esp = 0xbfbf7f80 ebp = 0xbfbf7f98 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9000
16:48:36 INFO - Found by: call frame info
16:48:36 INFO - 10 libxul.so!nsTextFrame::ClearTextRun(nsTextFrame*, nsTextFrame::TextRunType) [nsTextFrame.cpp:315c981e3f7d : 4402 + 0x7]
16:48:36 INFO - eip = 0xb3f424ff esp = 0xbfbf7fa0 ebp = 0xbfbf7fd8 ebx = 0xb6e9003c
16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9000
16:48:36 INFO - Found by: call frame info
Just commented a similar case in bug 1111415 comment 2, with the same crash address.

(In reply to Cameron McCormack (:heycam) from comment #2)
> I have a reproducible FramePointerStackWalk crash with a patch queue applied
> to m-c, but I'm not sure whether it is caused by my patches yet or an
> underlying problem revealed by them.  My crash address is different from the
> other two mentioned above, though: fffff88 (on Linux32).  dmajor pointed out
> that is a special value used by JIT code, and indeed the stack walking
> crashes at a frame within JIT code.
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b02537901def
I seem to hit this really easily on my uplift simulation Try pushes (2 out of 3 so far). Hopefully that's not a sign of things to come on Monday :\
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
This has been permafailing on Aurora since the below push yesterday:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=0104fc2803cb

Nothing in there stands out as to why, though. One of the FHR patches maybe?
Status: RESOLVED → REOPENED
Flags: needinfo?(alessio.placitelli)
Resolution: WORKSFORME → ---
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #32)
> This has been permafailing on Aurora since the below push yesterday:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> aurora&revision=0104fc2803cb
> 
> Nothing in there stands out as to why, though. One of the FHR patches maybe?

I think that the FHR/Telemetry patches there shouldn't cause this issue: we didn't uplift any native code changes and those changes lived quite a bit on m-c.
Flags: needinfo?(alessio.placitelli)
Bug 1177278 seems more likely.
(In reply to Georg Fritzsche [:gfritzsche] from comment #35)
> Bug 1177278 seems more likely.

Try says otherwise.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e551441dde15

Guess I'll keep bisecting...
FWIW, all of the other non-FHR revs in that push affect either Windows or Android only.
Looks like it's from the test changes in bug 1137355:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c8dff23c85c

Will backout and add comments to that bug.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
I had some fun trying to figure out this one, checking what in bug 1137355 was triggering the crash. After quite some tinkering, here's what I found.

I tried to remove all the interesting stuff from the offending test, and it kept failing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d9ea7c2a0e2

After some more testing, I tried to simply add an empty test to browser.ini, without applying bug 1137355. To my surprise, it made e10 bc1 fail: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bfd31195f39

To rule out timing issue, the test file was renamed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=602bc35ff27a (still failing)

Finally, applying bug 1137355 and skipping the test in e10s makes it look good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=839f4989676b
This is a plugin crash apparently in stackwalking:
https://treeherder.mozilla.org/logviewer.html#?job_id=10046886&repo=try

(Benjamin Smedberg  [:bsmedberg] from bug 1137355, comment #54)
> Looking at the stack, I think there is an xlib error or warning triggering a
> stackwalk which is crashing.
> 
> {"thread": "ProcessReader", "level": "info", "pid": 1801, "source":
> "mochitest", "time": 1438298106400, "action": "log", "exc_info": false,
> "message": "Xlib:  extension \"RANDR\" missing on display \":0\"."}
> {"thread": "ProcessReader", "level": "info", "pid": 1801, "source":
> "mochitest", "time": 1438298108809, "action": "log", "exc_info": false,
> "message": "\u0007[NPAPI 2197] ###!!! ASSERTION: plug removed: 'glib
> assertion', file
> /builds/slave/m-aurora-lx-000000000000000000/build/src/toolkit/xre/
> nsSigHandlers.cpp, line 138"}
> 
> I don't know what this assertion means in practice.

Per the above i don't see a meaningful relation with bug 1137355, so i think we'll just skip that test on e10s aurora for now.
Jim, could this be related to e10s plugin work?
Flags: needinfo?(jmathies)
If adding a blank test to the order of things causes this, does't that point to the previous test as a potential culprit?

1.10  [browser_datareporting_notification.js]
1.11  skip-if = !datareporting
1.12 +[browser_z_datachoices_notification.js]
1.13 +skip-if = !datareporting
1.14  [browser_devedition.js]

In any case I'm find with disabling a single test on e10s, we have a number of these, eventually get back to it.

I also see a lot of 'Services.DataReporting.Policy' spew in the logs right before the crash.
Flags: needinfo?(jmathies)
Thanks Jim. Adding the blank test at the beginning of browser.ini and removing "browser_datareporting_notification.js" (the potential culprit) doesn't make the process crash.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=343b23ef2b48

Adding the blank at the beginning of browser.ini keeping the datareporting test, makes it crash:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f2a56b6b72f
setting to reopen since this is kind of perma failing on aurora
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Intermittent browser_ruleview_content_02.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)] → Intermittent browser_ruleview_content_02.js,browser_markupview_links_01.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]
Blocks: 1132501
No longer blocks: 1132501
This seems to be a pretty high-frequent failure which should be fixed asap.

There are two reasons for this intermittent:
1. the test (browser_markupview_links_01.js) constantly triggers an assertion "why should we have flushed style again?" in nsComputedDOMStyle::UpdateCurrentStyleSources;
2. the stack walker intermittently crashes in FramePointerStackWalk

Fixing either one should fix this intermittent.

glandium, could you have a look at the stack walker crash?

heycam, could you have a look at the assertion there?
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(cam)
The stackwalker crash looks like the a straightforward case where the stack walker is too trusting of what's on the stack. Short of rewriting it, there's not much that can be done here IMHO. Nathan, what do you think?
Flags: needinfo?(mh+mozilla) → needinfo?(nfroyd)
(In reply to Mike Hommey [:glandium] from comment #171)
> The stackwalker crash looks like the a straightforward case where the stack
> walker is too trusting of what's on the stack. Short of rewriting it,
> there's not much that can be done here IMHO. Nathan, what do you think?

Looks plausible; the crash address is 0xffffff88, something like what got mentioned in comment 4 as "special values for the JIT".  grep'ing through the source doesn't turn up anything obvious here, though...ni? to Jan for possible insight here.

One easy way to make the stack walker more paranoid would be to record the stack pointer when the stack walker gets initialized, and reject any addresses during stack walking that fall above that point.  (Or have the stackwalker figure out the topmost possible stack address each time, since it can get called on different threads?  I think pthreads provides a sort of API for this, but I'm not sure that Windows does, and even the pthreads API has some non-portability associated with it...?)
Flags: needinfo?(nfroyd) → needinfo?(jdemooij)
One thing that might be nice would be to have the order for assertion handling be "print the failed assertion, try to get the stack, print the stack or crash while trying" instead of "try to get the stack, print the assertion and the stack or crash while trying to get the stack and never print the assertion." Among the things lumped in here is a shutdown assertion which has been permaorange on aurora for months, which we could fix-or-downgrade if we knew what it was, but all we know is that the stack for the assertion crashes stackwalker.
Actually the assertion is printed. The log has line:
> ###!!! ASSERTION: why should we have flushed style again?: 'mPresShell && currentGeneration == mPresShell->GetPresContext()->GetRestyleGeneration()', file /builds/slave/m-aurora-lx-d-0000000000000000/build/src/layout/style/nsComputedDOMStyle.cpp, line 723
before the stack dump.

It seems to me that the assertion actually always happen, but no one cares (it is ignored by the test runner, why?) You can see for example https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b5dcc2d9ccd7&filter-job_type_symbol=dt1&selectedJob=18632228 and find there is also the same assertions inside, although not crashed.

It also seems to me that this crash happens only on linux32 because we only run the tests on linux32. I cannot find log of these tests anywhere else.
The interpretation of having hit an assertion is controlled by the harness: reftests, crashtests, and mochitest-plain have assertion counts and exceeding them is a test failure, because dbaron counted up the assertions for every single test at a point in time in the past, and then made exceeding those counts fatal. Then he looked at how many assertion failures happen in browser-chrome tests (which is what devtools-chrome is), and found more productive things to do than count them.

And that's the assertion in browser_markupview_links_01.js, which is not permaorange, merely tiresome; I'm talking about the shutdown assertion in e10s bc2 on aurora, http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-aurora-linux-pgo/1450309979/mozilla-aurora_ubuntu32_vm_test_pgo-mochitest-e10s-browser-chrome-2-bm06-tests1-linux32-build112.txt.gz, which is permaorange and has been since... Octoberish, I think, but I see now that I've looked closer that it isn't an NS_ASSERTION, we're just in DEBUG_BREAK having hit a glib assertion, and good luck to us.
This has got to be the same bug as bug 1224822 which BenWa already has plans to fix.
Yeah, I believe they are the same. Probably we should duplicate that bug to this... but at least, +see also
See Also: → 1224822
(In reply to Nathan Froyd [:froydnj] from comment #172)
> Looks plausible; the crash address is 0xffffff88, something like what got
> mentioned in comment 4 as "special values for the JIT".  grep'ing through
> the source doesn't turn up anything obvious here, though...ni? to Jan for
> possible insight here.

On 32-bit platforms, 0xffffff88 is a js::Value type tag used for objects (see JSVAL_TAG_OBJECT in js/public/Value.h) It's not JIT-specific but these values definitely show up a lot in JIT stack frames and registers, so it's not surprising the stack walker hits it.

The stack walker not working for JIT frames is not surprising; debuggers like gdb also often can't unwind past JIT frames.
Flags: needinfo?(jdemooij)
Note that there actually is a complete stack info after the crashing unwinding. At the point our stack walker crashes, that stack info shows 0xffffff88 for ebp as well, and its next frame is found via "stack scanning".

I don't think it makes much sense for us to implement whatever the "scanning" here is (if anyone is interested, it would be good to have as well, though), but I guess it still makes sense to just avoid continuing unwinding when ridiculous address is detected. I believe incomplete stack info is better than crash, when we don't really want it to crash.
I expect this patch to fix most of similiar intermittents, including this bug and bug 1224822. However there are some crashes with the same signature from pgo builds, which probably cannot be fixed here, because in those cases, there are even nothing unwound at all from our stack walker.

We should probably file a separate bug for those crashes. I suspect in those cases, the stack is broken from the very beginning. We may want to use the write() trick to check whether the initial bp really points to a valid address to avoid that kind of crashes.
Hmmm, the crashes from pgo builds actually seem to be permafail from mozilla-aurora. That would definitely need a new bug.
See Also: → 1233666
Comment on attachment 8699884 [details]
MozReview Request: Bug 1113930 - Add frame size santy check in frame pointer stack walk.

https://reviewboard.mozilla.org/r/28473/#review25605

::: mozglue/misc/StackWalk.cpp:1064
(Diff revision 1)
> +  // we can assume that bp > sp

... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h)

Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15)

::: mozglue/misc/StackWalk.cpp:1070
(Diff revision 1)
> +  static MOZ_CONSTEXPR size_t kMaxFrameSize = 16 * 1024 * 1024;

const is better here than MOZ_CONSTEXPR.

::: mozglue/misc/StackWalk.cpp:1074
(Diff revision 1)
> +  // (On some *nix systems, we may want to use system call write() to

you mean mincore() instead of write().

::: mozglue/misc/StackWalk.cpp:1077
(Diff revision 1)
> +  if (current < stackEnd - kMaxFrameSize &&
> +      next > current + kMaxFrameSize) {
> +    return false;

next > current + kMaxFrameSize ...
  considering by now, we know that current < next <=  stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize.

So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check.
Attachment #8699884 - Flags: review?(mh+mozilla)
https://reviewboard.mozilla.org/r/28473/#review25605

> ... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h)
> 
> Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15)

I don't think we want to address the alternative direction issue in this bug. I believe that would need nontrivial work, and not only at this place.

> you mean mincore() instead of write().

No. I exactly mean write(). write() returns EFAULT if its source address is invalid for the process on at least Linux, OS X, and FreeBSD, according to their document. However, mincore() seems to be a Linux-only syscall.

> next > current + kMaxFrameSize ...
>   considering by now, we know that current < next <=  stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize.
> 
> So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check.

Effectively this is the only change in this patch, which means if this change doesn't make sense, the patch should not take effect at all. Given this patch does fix the intermittent, you must be wrong.

I don't know how can you conclude that the second condition is useless, but I'd like to explain the meaning of them. Actually I want the second one, which checks the frame size between current and next. However, current + kMaxFrameSize itself may overflow and wrap to a small value, which would incorrectly mark a valid frame size invalid. The first condition is to avoid that.
Blocks: 1132501
I'm trying to land bug 1132501, but it actually makes this a permafail. I might apply this patch locally and do a try push and just see if it fixes it. If it helps anyone to reproduce it locally you can apply my patch in that bug.

See my try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3b488972186e

Thanks so much for fixing this.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #190)
> https://reviewboard.mozilla.org/r/28473/#review25605
> 
> > ... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h)
> > 
> > Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15)
> 
> I don't think we want to address the alternative direction issue in this
> bug. I believe that would need nontrivial work, and not only at this place.
> 
> > you mean mincore() instead of write().
> 
> No. I exactly mean write(). write() returns EFAULT if its source address is
> invalid for the process on at least Linux, OS X, and FreeBSD, according to
> their document. However, mincore() seems to be a Linux-only syscall.
> 
> > next > current + kMaxFrameSize ...
> >   considering by now, we know that current < next <=  stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize.
> > 
> > So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check.
> 
> Effectively this is the only change in this patch, which means if this
> change doesn't make sense, the patch should not take effect at all. Given
> this patch does fix the intermittent, you must be wrong.
> 
> I don't know how can you conclude that the second condition is useless, but
> I'd like to explain the meaning of them. Actually I want the second one,
> which checks the frame size between current and next. However, current +
> kMaxFrameSize itself may overflow and wrap to a small value, which would
> incorrectly mark a valid frame size invalid. The first condition is to avoid
> that.

So your intent is really to test current - next > kMaxFrameSize, which won't overflow.

Anyways, let's back a little. The crash is on line 1067, which reads:
  void** next = (void**)*bp;

with an access error on address 0xffffff88.

This either happens on the first iteration of the loop, which means that the bad value comes from the caller, or from a subsequent iteration, which means the the bad value would come from `next` from a previous iteration.
The latter means that value of `next` would happily go through the already existing test "next <= bp || next > aStackEnd".

So, assuming it's not happening on the first iteration, because otherwise the patch wouldn't be able to fix anything at all, that actually suggests aStackEnd is greater than 0xffffff88 !

aStackEnd is the 6th argument to the function, and looking at the disassembly from your try build without the fix, its value is... not put in a register at any time, so it's simply not in the stack trace from the crash.

But it's either derived from __libc_stack_end, or 0xffffffff. The former is randomized with ASLR and is almost impossible to be larger than 0xffffff88. But the disassembly of MozStackwalk says 0xffffffff is what is hardcoded.

So here is the real problem, and why the fix "works": aStackEnd is not remotely the actual end of the stack. Now, since its value is pretty much a constant 0xffffffff, that also means that your patch actually breaks stack walking, because for valid frames, current is *very* likely to be way below 16 megabytes from the end of the address space, failing your overflow test.

So:
a) there's a bug in the detection code to use __libc_stack_end with glibc
b) even with a) fixed, that still doesn't make aStackEnd the end of the stack on OSX.

a) is likely to be fixed by adding #include <features.h>
b) could be addressed by using bp + kMaxFrameSize (modulo overflow) as aStackEnd when FramePointerStackWalk is called, which is a better approximation than 0xffffffff.

That still assumes the original bp is right, too, so there's still a need for some check before the first dereference, which, btw, would essentially supersede the patch from bug 1111415.
(In reply to Mike Hommey [:glandium] from comment #192)
> a) is likely to be fixed by adding #include <features.h>

No, it doesn't seem so. I tried to add "#include <stdlib.h>" which includes <features.h>, but still fails.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3259724d0fdd
(In reply to Mike Hommey [:glandium] from comment #192)
> b) could be addressed by using bp + kMaxFrameSize (modulo overflow) as
> aStackEnd when FramePointerStackWalk is called, which is a better
> approximation than 0xffffffff.

It's not. I don't expect the whole stack is always smaller than 16MB. If we want this way, that should probably be 256MB or something.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #193)
> (In reply to Mike Hommey [:glandium] from comment #192)
> > a) is likely to be fixed by adding #include <features.h>
> 
> No, it doesn't seem so. I tried to add "#include <stdlib.h>" which includes
> <features.h>, but still fails.
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=3259724d0fdd

That build is still using 0xffffffff.
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #194)
> (In reply to Mike Hommey [:glandium] from comment #192)
> > b) could be addressed by using bp + kMaxFrameSize (modulo overflow) as
> > aStackEnd when FramePointerStackWalk is called, which is a better
> > approximation than 0xffffffff.
> 
> It's not. I don't expect the whole stack is always smaller than 16MB. If we
> want this way, that should probably be 256MB or something.

I actually expect it to always be smaller than 8MB. I'm also sure we can find a way to get the actual size of the stack on OSX.
(In reply to Mike Hommey [:glandium] from comment #196)
> I actually expect it to always be smaller than 8MB. I'm also sure we can
> find a way to get the actual size of the stack on OSX.

Err, I meant the actual location of the end of the stack
(In reply to Mike Hommey [:glandium] from comment #197)
> (In reply to Mike Hommey [:glandium] from comment #196)
> > I actually expect it to always be smaller than 8MB. I'm also sure we can
> > find a way to get the actual size of the stack on OSX.
> 
> Err, I meant the actual location of the end of the stack

pthread_get_stackaddr_np(pthread_self())
Could you take this bug?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #195)
> (In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #193)
> > (In reply to Mike Hommey [:glandium] from comment #192)
> > > a) is likely to be fixed by adding #include <features.h>
> > 
> > No, it doesn't seem so. I tried to add "#include <stdlib.h>" which includes
> > <features.h>, but still fails.
> > 
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=3259724d0fdd
> 
> That build is still using 0xffffffff.

The problem is that bug 989499 moved the __libc_stack_end stuff in a #if MOZ_STACKWALK_SUPPORTS_MACOSX block.
Yeah, at this point I might as well take it.
Assignee: nobody → mh+mozilla
Flags: needinfo?(mh+mozilla)
Bug 989499 unfortunately moved it to a OSX-only section, which broke using
__libc_stack_end, which is Linux-specific (glibc, really).
Attachment #8699884 - Attachment is obsolete: true
Attachment #8700956 - Flags: review?(nfroyd)
Also, in the unlikely case neither __libc_stack_end nor pthread_get_stackaddr_np
are available, error out at compile time, because it's not supposed to happen,
apart if something like what bug 989499 did happens again.
Attachment #8700958 - Flags: review?(nfroyd)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=017963c604f6
(this was with a wrong part 2, but it doesn't matter, since it didn't touch the linux code)
(In reply to Mike Hommey [:glandium] from comment #203)
> Created attachment 8700958 [details] [diff] [review]
> Use the actual stack end address on OSX for the stack walker
> 
> Also, in the unlikely case neither __libc_stack_end nor
> pthread_get_stackaddr_np
> are available, error out at compile time, because it's not supposed to
> happen,
> apart if something like what bug 989499 did happens again.

... except it happens on android x86, because bionic...
Attachment #8700958 - Flags: review?(nfroyd)
Comment on attachment 8700956 [details] [diff] [review]
Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section

Review of attachment 8700956 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch.
Attachment #8700956 - Flags: review?(nfroyd) → review+
Also, in the unlikely case none of the supported methods is available, error
out at compile time, because it's not supposed to happen, apart if something
like what bug 989499 did happens again.

Note this is somewhat derived from js/src/jsnativestack.cpp's js::GetNativeStackBaseImpl, which does more. Ideally, we'd just move that to mozglue, but I'll leave that to a followup.
Attachment #8700958 - Attachment is obsolete: true
Attachment #8701243 - Flags: review?(nfroyd)
Flags: needinfo?(cam)
Comment on attachment 8701243 [details] [diff] [review]
Use the actual stack end address on x86 OSX and Android for the stack walker

Review of attachment 8701243 [details] [diff] [review]:
-----------------------------------------------------------------

Guess the BSDs get to add their own fixes?
Attachment #8701243 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #210)
> Guess the BSDs get to add their own fixes?

The MozStackWalk function that is being modified is in a #if X86_OR_PPC && (MOZ_STACKWALK_SUPPORTS_MACOSX || MOZ_STACKWALK_SUPPORTS_LINUX) block, where

#define X86_OR_PPC (defined(__i386) || defined(PPC) || defined(__ppc__))

#define MOZ_STACKWALK_SUPPORTS_MACOSX \
  (defined(XP_DARWIN) && \
   (defined(__i386) || defined(__ppc__) || defined(HAVE__UNWIND_BACKTRACE)))

#define MOZ_STACKWALK_SUPPORTS_LINUX \
  (defined(linux) && \
   ((defined(__GNUC__) && (defined(__i386) || defined(PPC))) || \
    defined(HAVE__UNWIND_BACKTRACE)))

Apart if BSDs have a #define linux, they won't hit this function.
https://hg.mozilla.org/mozilla-central/rev/500961db039b
https://hg.mozilla.org/mozilla-central/rev/acdd7908ca0f
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Comment on attachment 8700956 [details] [diff] [review]
Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: it fixes a frequent intermittent crash which has been in aurora
[Describe test coverage new/current, TreeHerder]: n/a
[Risks and why]: very low risk as it is only related to stack walker, and it is just a compile-time feature detection code movement, which would have been rejected by the compiler if it were wrong.
[String/UUID change made/needed]: n/a
Attachment #8700956 - Flags: approval-mozilla-aurora?
Comment on attachment 8700956 [details] [diff] [review]
Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section

Fix a frequent intermittent crash, taking it.
Attachment #8700956 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: