1113930 - Intermittent browser_ruleview_content_02.js,browser_markupview_links_01.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]

Reporter

Description

•

11 years ago

16:48:22 INFO - 155 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There are 3 headers for inherited rules 16:48:22 INFO - 156 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The first header is correct 16:48:22 INFO - 157 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The second header is correct 16:48:22 INFO - 158 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The third header is correct 16:48:22 INFO - 159 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There are 4 rules in the view 16:48:22 INFO - 160 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct 16:48:22 INFO - 161 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected 16:48:22 INFO - 162 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected 16:48:22 INFO - 163 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct 16:48:22 INFO - 164 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected 16:48:22 INFO - 165 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected 16:48:22 INFO - 166 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct 16:48:22 INFO - 167 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected 16:48:22 INFO - 168 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected 16:48:22 INFO - 169 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | The rule's selector is correct 16:48:22 INFO - 170 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property name, as expected 16:48:22 INFO - 171 INFO TEST-PASS | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | There's only one property value, as expected 16:48:22 INFO - 172 INFO Leaving test 16:48:22 INFO - 173 ERROR TEST-UNEXPECTED-FAIL | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | application terminated with exit code 11 16:48:22 INFO - runtests.py | Application ran for: 0:07:33.138440 16:48:22 INFO - zombiecheck | Reading PID log: /tmp/tmpQHmEQ7pidlog 16:48:36 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/5a7d2fee-8218-21c4-1505e452-765698ba.dmp 16:48:36 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/5a7d2fee-8218-21c4-1505e452-765698ba.extra 16:48:36 WARNING - PROCESS-CRASH | browser/devtools/styleinspector/test/browser_ruleview_content_02.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)] 16:48:36 INFO - Crash dump filename: /tmp/tmpkwtrvU.mozrunner/minidumps/5a7d2fee-8218-21c4-1505e452-765698ba.dmp 16:48:36 INFO - Operating system: Linux 16:48:36 INFO - 0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686 16:48:36 INFO - CPU: x86 16:48:36 INFO - GenuineIntel family 6 model 62 stepping 4 16:48:36 INFO - 1 CPU 16:48:36 INFO - Crash reason: SIGSEGV 16:48:36 INFO - Crash address: 0xffffff88 16:48:36 INFO - Thread 0 (crashed) 16:48:36 INFO - 0 libxul.so!mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*) [nsStackWalk.cpp:315c981e3f7d : 894 + 0x0] 16:48:36 INFO - eip = 0xb2718204 esp = 0xbfbf79a0 ebp = 0xbfbf79b8 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x00000052 edi = 0xffffff88 eax = 0xffffff88 ecx = 0xb76658ac 16:48:36 INFO - edx = 0xffffffae efl = 0x00210202 16:48:36 INFO - Found by: given as instruction pointer in context 16:48:36 INFO - 1 libxul.so!nsTraceRefcnt::WalkTheStack(_IO_FILE*) [nsTraceRefcnt.cpp:315c981e3f7d : 963 + 0x12] 16:48:36 INFO - eip = 0xb271b7cd esp = 0xbfbf79c0 ebp = 0xbfbf79e8 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0xb7664d9c edi = 0xbfbf7a34 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 2 libxul.so!NS_DebugBreak [nsDebugImpl.cpp:315c981e3f7d : 448 + 0xf] 16:48:36 INFO - eip = 0xb26fb7d3 esp = 0xbfbf79f0 ebp = 0xbfbf7e38 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0xb7664d9c edi = 0xbfbf7a34 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 3 libxul.so!nsStyleContext::~nsStyleContext() [nsStyleContext.cpp:315c981e3f7d : 90 + 0x20] 16:48:36 INFO - eip = 0xb3dd5939 esp = 0xbfbf7e40 ebp = 0xbfbf7e88 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x9e8463e0 edi = 0x8d049400 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 4 libxul.so!nsStyleContext::Destroy() [nsStyleContext.cpp:315c981e3f7d : 941 + 0x8] 16:48:36 INFO - eip = 0xb3dd5987 esp = 0xbfbf7e90 ebp = 0xbfbf7eb8 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x8d049400 edi = 0x9e8463e0 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 5 libxul.so!nsStyleContext::Release() [nsStyleContext.h:315c981e3f7d : 93 + 0x8] 16:48:36 INFO - eip = 0xb2fff03d esp = 0xbfbf7ec0 ebp = 0xbfbf7ee8 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x9e8463e0 edi = 0x00000000 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 6 libxul.so!nsTArray_Impl<nsRefPtr<nsStyleContext>, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned int, unsigned int) [nsRefPtr.h:315c981e3f7d : 59 + 0xe] 16:48:36 INFO - eip = 0xb3eab9c9 esp = 0xbfbf7ef0 ebp = 0xbfbf7f28 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x00000000 edi = 0x95ba9050 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 7 libxul.so!nsTArray_Impl<nsRefPtr<nsStyleContext>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:315c981e3f7d : 1406 + 0x9] 16:48:36 INFO - eip = 0xb3eaba2d esp = 0xbfbf7f30 ebp = 0xbfbf7f48 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9050 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 8 libxul.so!nsTransformedTextRun::~nsTransformedTextRun() [nsTArray.h:315c981e3f7d : 785 + 0x7] 16:48:36 INFO - eip = 0xb3eaba8c esp = 0xbfbf7f50 ebp = 0xbfbf7f78 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9050 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 9 libxul.so!nsTransformedTextRun::~nsTransformedTextRun() [nsTextRunTransformations.h:315c981e3f7d : 95 + 0x8] 16:48:36 INFO - eip = 0xb3eabac2 esp = 0xbfbf7f80 ebp = 0xbfbf7f98 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9000 16:48:36 INFO - Found by: call frame info 16:48:36 INFO - 10 libxul.so!nsTextFrame::ClearTextRun(nsTextFrame*, nsTextFrame::TextRunType) [nsTextFrame.cpp:315c981e3f7d : 4402 + 0x7] 16:48:36 INFO - eip = 0xb3f424ff esp = 0xbfbf7fa0 ebp = 0xbfbf7fd8 ebx = 0xb6e9003c 16:48:36 INFO - esi = 0x95ba9000 edi = 0x95ba9000 16:48:36 INFO - Found by: call frame info

Comment hidden (Legacy TBPL/Treeherder Robot)

Cameron McCormack (:heycam)

Comment 4

•

11 years ago

Just commented a similar case in bug 1111415 comment 2, with the same crash address. (In reply to Cameron McCormack (:heycam) from comment #2) > I have a reproducible FramePointerStackWalk crash with a patch queue applied > to m-c, but I'm not sure whether it is caused by my patches yet or an > underlying problem revealed by them. My crash address is different from the > other two mentioned above, though: fffff88 (on Linux32). dmajor pointed out > that is a special value used by JIT code, and indeed the stack walking > crashes at a frame within JIT code. > > https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b02537901def

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 8

•

11 years ago

I seem to hit this really easily on my uplift simulation Try pushes (2 out of 3 so far). Hopefully that's not a sign of things to come on Monday :\

Comment hidden (Legacy TBPL/Treeherder Robot)

Ed Morley [:emorley]

Comment 22

•

10 years ago

Inactive; closing (see bug 1180138).

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WORKSFORME

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 32

•

10 years ago

This has been permafailing on Aurora since the below push yesterday: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=0104fc2803cb Nothing in there stands out as to why, though. One of the FHR patches maybe?

Status: RESOLVED → REOPENED

Flags: needinfo?(alessio.placitelli)

Resolution: WORKSFORME → ---

Comment hidden (Legacy TBPL/Treeherder Robot)

Alessio Placitelli [:Dexter]

Comment 34

•

10 years ago

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #32) > This has been permafailing on Aurora since the below push yesterday: > https://treeherder.mozilla.org/#/jobs?repo=mozilla- > aurora&revision=0104fc2803cb > > Nothing in there stands out as to why, though. One of the FHR patches maybe? I think that the FHR/Telemetry patches there shouldn't cause this issue: we didn't uplift any native code changes and those changes lived quite a bit on m-c.

Flags: needinfo?(alessio.placitelli)

Georg Fritzsche [:gfritzsche]

Comment 35

•

10 years ago

Bug 1177278 seems more likely.

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 37

•

10 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #35) > Bug 1177278 seems more likely. Try says otherwise. https://treeherder.mozilla.org/#/jobs?repo=try&revision=e551441dde15 Guess I'll keep bisecting...

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 38

•

10 years ago

FWIW, all of the other non-FHR revs in that push affect either Windows or Android only.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 39

•

10 years ago

Looks like it's from the test changes in bug 1137355: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c8dff23c85c Will backout and add comments to that bug.

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

10 years ago

Resolution: FIXED → WORKSFORME

Alessio Placitelli [:Dexter]

•

10 years ago

Thanks Jim. Adding the blank test at the beginning of browser.ini and removing "browser_datareporting_notification.js" (the potential culprit) doesn't make the process crash. https://treeherder.mozilla.org/#/jobs?repo=try&revision=343b23ef2b48 Adding the blank at the beginning of browser.ini keeping the datareporting test, makes it crash: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f2a56b6b72f

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Intermittent Failures Robot)

Carsten Book [:Tomcat]

Comment 146

•

10 years ago

setting to reopen since this is kind of perma failing on aurora

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Comment hidden (Intermittent Failures Robot)

Phil Ringnalda (:philor)

Updated

•

10 years ago

Summary: Intermittent browser_ruleview_content_02.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)] → Intermittent browser_ruleview_content_02.js,browser_markupview_links_01.js | application crashed [@ mozilla::FramePointerStackWalk(void (*)(unsigned int, void*, void*, void*), unsigned int, unsigned int, void*, void**, void*)]

Comment hidden (Intermittent Failures Robot)

James Long (:jlongster)

Updated

•

10 years ago

Blocks: 1132501

James Long (:jlongster)

Updated

•

10 years ago

No longer blocks: 1132501

Comment hidden (Intermittent Failures Robot)

Xidorn Quan [:xidorn] UTC+10

Comment 170

•

10 years ago

This seems to be a pretty high-frequent failure which should be fixed asap. There are two reasons for this intermittent: 1. the test (browser_markupview_links_01.js) constantly triggers an assertion "why should we have flushed style again?" in nsComputedDOMStyle::UpdateCurrentStyleSources; 2. the stack walker intermittently crashes in FramePointerStackWalk Fixing either one should fix this intermittent. glandium, could you have a look at the stack walker crash? heycam, could you have a look at the assertion there?

Flags: needinfo?(mh+mozilla)

Flags: needinfo?(cam)

Mike Hommey [:glandium]

Assignee

Comment 171

•

10 years ago

The stackwalker crash looks like the a straightforward case where the stack walker is too trusting of what's on the stack. Short of rewriting it, there's not much that can be done here IMHO. Nathan, what do you think?

Flags: needinfo?(mh+mozilla) → needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 172

•

10 years ago

(In reply to Mike Hommey [:glandium] from comment #171) > The stackwalker crash looks like the a straightforward case where the stack > walker is too trusting of what's on the stack. Short of rewriting it, > there's not much that can be done here IMHO. Nathan, what do you think? Looks plausible; the crash address is 0xffffff88, something like what got mentioned in comment 4 as "special values for the JIT". grep'ing through the source doesn't turn up anything obvious here, though...ni? to Jan for possible insight here. One easy way to make the stack walker more paranoid would be to record the stack pointer when the stack walker gets initialized, and reject any addresses during stack walking that fall above that point. (Or have the stackwalker figure out the topmost possible stack address each time, since it can get called on different threads? I think pthreads provides a sort of API for this, but I'm not sure that Windows does, and even the pthreads API has some non-portability associated with it...?)

Flags: needinfo?(nfroyd) → needinfo?(jdemooij)

Comment hidden (Intermittent Failures Robot)

Phil Ringnalda (:philor)

Comment 174

•

10 years ago

One thing that might be nice would be to have the order for assertion handling be "print the failed assertion, try to get the stack, print the stack or crash while trying" instead of "try to get the stack, print the assertion and the stack or crash while trying to get the stack and never print the assertion." Among the things lumped in here is a shutdown assertion which has been permaorange on aurora for months, which we could fix-or-downgrade if we knew what it was, but all we know is that the stack for the assertion crashes stackwalker.

Xidorn Quan [:xidorn] UTC+10

Comment 175

•

10 years ago

Actually the assertion is printed. The log has line: > ###!!! ASSERTION: why should we have flushed style again?: 'mPresShell && currentGeneration == mPresShell->GetPresContext()->GetRestyleGeneration()', file /builds/slave/m-aurora-lx-d-0000000000000000/build/src/layout/style/nsComputedDOMStyle.cpp, line 723 before the stack dump. It seems to me that the assertion actually always happen, but no one cares (it is ignored by the test runner, why?) You can see for example https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b5dcc2d9ccd7&filter-job_type_symbol=dt1&selectedJob=18632228 and find there is also the same assertions inside, although not crashed. It also seems to me that this crash happens only on linux32 because we only run the tests on linux32. I cannot find log of these tests anywhere else.

Phil Ringnalda (:philor)

Comment 176

•

10 years ago

The interpretation of having hit an assertion is controlled by the harness: reftests, crashtests, and mochitest-plain have assertion counts and exceeding them is a test failure, because dbaron counted up the assertions for every single test at a point in time in the past, and then made exceeding those counts fatal. Then he looked at how many assertion failures happen in browser-chrome tests (which is what devtools-chrome is), and found more productive things to do than count them. And that's the assertion in browser_markupview_links_01.js, which is not permaorange, merely tiresome; I'm talking about the shutdown assertion in e10s bc2 on aurora, http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-aurora-linux-pgo/1450309979/mozilla-aurora_ubuntu32_vm_test_pgo-mochitest-e10s-browser-chrome-2-bm06-tests1-linux32-build112.txt.gz, which is permaorange and has been since... Octoberish, I think, but I see now that I've looked closer that it isn't an NS_ASSERTION, we're just in DEBUG_BREAK having hit a glib assertion, and good luck to us.

James Long (:jlongster)

Comment 177

•

10 years ago

This has got to be the same bug as bug 1224822 which BenWa already has plans to fix.

•

10 years ago

Attached file MozReview Request: Bug 1113930 - Add frame size santy check in frame pointer stack walk. (obsolete) — Details

Review commit: https://reviewboard.mozilla.org/r/28473/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/28473/

Attachment #8699884 - Flags: review?(mh+mozilla)

Xidorn Quan [:xidorn] UTC+10

Comment 183

•

10 years ago

Compare https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0bcc16dc308 (before) and https://treeherder.mozilla.org/#/jobs?repo=try&revision=6267a14144da (after)

Xidorn Quan [:xidorn] UTC+10

Comment 184

•

10 years ago

I expect this patch to fix most of similiar intermittents, including this bug and bug 1224822. However there are some crashes with the same signature from pgo builds, which probably cannot be fixed here, because in those cases, there are even nothing unwound at all from our stack walker. We should probably file a separate bug for those crashes. I suspect in those cases, the stack is broken from the very beginning. We may want to use the write() trick to check whether the initial bp really points to a valid address to avoid that kind of crashes.

Xidorn Quan [:xidorn] UTC+10

Comment 185

•

10 years ago

Hmmm, the crashes from pgo builds actually seem to be permafail from mozilla-aurora. That would definitely need a new bug.

Xidorn Quan [:xidorn] UTC+10

Updated

•

10 years ago

Comment 189

•

10 years ago

Comment on attachment 8699884 [details] MozReview Request: Bug 1113930 - Add frame size santy check in frame pointer stack walk. https://reviewboard.mozilla.org/r/28473/#review25605 ::: mozglue/misc/StackWalk.cpp:1064 (Diff revision 1) > + // we can assume that bp > sp ... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h) Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15) ::: mozglue/misc/StackWalk.cpp:1070 (Diff revision 1) > + static MOZ_CONSTEXPR size_t kMaxFrameSize = 16 * 1024 * 1024; const is better here than MOZ_CONSTEXPR. ::: mozglue/misc/StackWalk.cpp:1074 (Diff revision 1) > + // (On some *nix systems, we may want to use system call write() to you mean mincore() instead of write(). ::: mozglue/misc/StackWalk.cpp:1077 (Diff revision 1) > + if (current < stackEnd - kMaxFrameSize && > + next > current + kMaxFrameSize) { > + return false; next > current + kMaxFrameSize ... considering by now, we know that current < next <= stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize. So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check.

Attachment #8699884 - Flags: review?(mh+mozilla)

Xidorn Quan [:xidorn] UTC+10

Comment 190

•

10 years ago

https://reviewboard.mozilla.org/r/28473/#review25605 > ... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h) > > Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15) I don't think we want to address the alternative direction issue in this bug. I believe that would need nontrivial work, and not only at this place. > you mean mincore() instead of write(). No. I exactly mean write(). write() returns EFAULT if its source address is invalid for the process on at least Linux, OS X, and FreeBSD, according to their document. However, mincore() seems to be a Linux-only syscall. > next > current + kMaxFrameSize ... > considering by now, we know that current < next <= stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize. > > So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check. Effectively this is the only change in this patch, which means if this change doesn't make sense, the patch should not take effect at all. Given this patch does fix the intermittent, you must be wrong. I don't know how can you conclude that the second condition is useless, but I'd like to explain the meaning of them. Actually I want the second one, which checks the frame size between current and next. However, current + kMaxFrameSize itself may overflow and wrap to a small value, which would incorrectly mark a valid frame size invalid. The first condition is to avoid that.

James Long (:jlongster)

Updated

•

10 years ago

Blocks: 1132501

James Long (:jlongster)

Comment 191

•

10 years ago

I'm trying to land bug 1132501, but it actually makes this a permafail. I might apply this patch locally and do a try push and just see if it fixes it. If it helps anyone to reproduce it locally you can apply my patch in that bug. See my try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3b488972186e Thanks so much for fixing this.

Mike Hommey [:glandium]

Assignee

Comment 192

•

10 years ago

(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #190) > https://reviewboard.mozilla.org/r/28473/#review25605 > > > ... except when the stack is going upwards instead of downwards, which is the case on e.g. hppa... I know you're merely copying moving this from FramePointerStackWalk, but could you handle both stack growth directions? (see how JS_STACK_GROWTH_DIRECTION is defined in js/src/jscpucfg.h) > > > > Please also add a comment about what (next & 3) is checking (which is stack alignment ; btw, this should probably be next & 7 on 64 bit platforms, although on x64 it could even be next & 15) > > I don't think we want to address the alternative direction issue in this > bug. I believe that would need nontrivial work, and not only at this place. > > > you mean mincore() instead of write(). > > No. I exactly mean write(). write() returns EFAULT if its source address is > invalid for the process on at least Linux, OS X, and FreeBSD, according to > their document. However, mincore() seems to be a Linux-only syscall. > > > next > current + kMaxFrameSize ... > > considering by now, we know that current < next <= stackEnd, the only way for this condition to be true is for stackEnd - current > kMaxFrameSize... which is exactly the same as current < stackEnd - kMaxFrameSize. > > > > So there's really only one condition here, and it's on the current pointer being right... but we have already read it, so it is, in fact, too late to do this check. > > Effectively this is the only change in this patch, which means if this > change doesn't make sense, the patch should not take effect at all. Given > this patch does fix the intermittent, you must be wrong. > > I don't know how can you conclude that the second condition is useless, but > I'd like to explain the meaning of them. Actually I want the second one, > which checks the frame size between current and next. However, current + > kMaxFrameSize itself may overflow and wrap to a small value, which would > incorrectly mark a valid frame size invalid. The first condition is to avoid > that. So your intent is really to test current - next > kMaxFrameSize, which won't overflow. Anyways, let's back a little. The crash is on line 1067, which reads: void** next = (void**)*bp; with an access error on address 0xffffff88. This either happens on the first iteration of the loop, which means that the bad value comes from the caller, or from a subsequent iteration, which means the the bad value would come from `next` from a previous iteration. The latter means that value of `next` would happily go through the already existing test "next <= bp || next > aStackEnd". So, assuming it's not happening on the first iteration, because otherwise the patch wouldn't be able to fix anything at all, that actually suggests aStackEnd is greater than 0xffffff88 ! aStackEnd is the 6th argument to the function, and looking at the disassembly from your try build without the fix, its value is... not put in a register at any time, so it's simply not in the stack trace from the crash. But it's either derived from __libc_stack_end, or 0xffffffff. The former is randomized with ASLR and is almost impossible to be larger than 0xffffff88. But the disassembly of MozStackwalk says 0xffffffff is what is hardcoded. So here is the real problem, and why the fix "works": aStackEnd is not remotely the actual end of the stack. Now, since its value is pretty much a constant 0xffffffff, that also means that your patch actually breaks stack walking, because for valid frames, current is *very* likely to be way below 16 megabytes from the end of the address space, failing your overflow test. So: a) there's a bug in the detection code to use __libc_stack_end with glibc b) even with a) fixed, that still doesn't make aStackEnd the end of the stack on OSX. a) is likely to be fixed by adding #include <features.h> b) could be addressed by using bp + kMaxFrameSize (modulo overflow) as aStackEnd when FramePointerStackWalk is called, which is a better approximation than 0xffffffff. That still assumes the original bp is right, too, so there's still a need for some check before the first dereference, which, btw, would essentially supersede the patch from bug 1111415.

Xidorn Quan [:xidorn] UTC+10

•

10 years ago

(In reply to Mike Hommey [:glandium] from comment #197) > (In reply to Mike Hommey [:glandium] from comment #196) > > I actually expect it to always be smaller than 8MB. I'm also sure we can > > find a way to get the actual size of the stack on OSX. > > Err, I meant the actual location of the end of the stack pthread_get_stackaddr_np(pthread_self())

Xidorn Quan [:xidorn] UTC+10

Comment 199

•

10 years ago

Could you take this bug?

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Assignee

Comment 200

•

10 years ago

(In reply to Mike Hommey [:glandium] from comment #195) > (In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #193) > > (In reply to Mike Hommey [:glandium] from comment #192) > > > a) is likely to be fixed by adding #include <features.h> > > > > No, it doesn't seem so. I tried to add "#include <stdlib.h>" which includes > > <features.h>, but still fails. > > > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=3259724d0fdd > > That build is still using 0xffffffff. The problem is that bug 989499 moved the __libc_stack_end stuff in a #if MOZ_STACKWALK_SUPPORTS_MACOSX block.

Mike Hommey [:glandium]

Assignee

Comment 201

•

10 years ago

Yeah, at this point I might as well take it.

Assignee: nobody → mh+mozilla

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Assignee

Comment 202

•

10 years ago

Attached patch Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section — Details — Splinter Review

Bug 989499 unfortunately moved it to a OSX-only section, which broke using __libc_stack_end, which is Linux-specific (glibc, really).

Attachment #8699884 - Attachment is obsolete: true

Attachment #8700956 - Flags: review?(nfroyd)

Mike Hommey [:glandium]

Assignee

Comment 203

•

10 years ago

Attached patch Use the actual stack end address on OSX for the stack walker (obsolete) — Details — Splinter Review

Also, in the unlikely case neither __libc_stack_end nor pthread_get_stackaddr_np are available, error out at compile time, because it's not supposed to happen, apart if something like what bug 989499 did happens again.

Attachment #8700958 - Flags: review?(nfroyd)

Mike Hommey [:glandium]

Assignee

Comment 204

•

10 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=017963c604f6 (this was with a wrong part 2, but it doesn't matter, since it didn't touch the linux code)

Mike Hommey [:glandium]

Assignee

Comment 205

•

10 years ago

(In reply to Mike Hommey [:glandium] from comment #203) > Created attachment 8700958 [details] [diff] [review] > Use the actual stack end address on OSX for the stack walker > > Also, in the unlikely case neither __libc_stack_end nor > pthread_get_stackaddr_np > are available, error out at compile time, because it's not supposed to > happen, > apart if something like what bug 989499 did happens again. ... except it happens on android x86, because bionic...

Mike Hommey [:glandium]

Assignee

Updated

•

10 years ago

Attachment #8700958 - Flags: review?(nfroyd)

Nathan Froyd [:froydnj]

Comment 206

•

10 years ago

Comment on attachment 8700956 [details] [diff] [review] Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section Review of attachment 8700956 [details] [diff] [review]: ----------------------------------------------------------------- Nice catch.

Attachment #8700956 - Flags: review?(nfroyd) → review+

Mike Hommey [:glandium]

Assignee

Comment 207

•

10 years ago

Attached patch Use the actual stack end address on x86 OSX and Android for the stack walker — Details — Splinter Review

Also, in the unlikely case none of the supported methods is available, error out at compile time, because it's not supposed to happen, apart if something like what bug 989499 did happens again. Note this is somewhat derived from js/src/jsnativestack.cpp's js::GetNativeStackBaseImpl, which does more. Ideally, we'd just move that to mozglue, but I'll leave that to a followup.

Attachment #8700958 - Attachment is obsolete: true

Attachment #8701243 - Flags: review?(nfroyd)

Cameron McCormack (:heycam)

Updated

•

•

10 years ago

Comment on attachment 8700956 [details] [diff] [review] Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section Approval Request Comment [Feature/regressing bug #]: unknown [User impact if declined]: it fixes a frequent intermittent crash which has been in aurora [Describe test coverage new/current, TreeHerder]: n/a [Risks and why]: very low risk as it is only related to stack walker, and it is just a compile-time feature detection code movement, which would have been rejected by the compiler if it were wrong. [String/UUID change made/needed]: n/a

Attachment #8700956 - Flags: approval-mozilla-aurora?

Sylvestre Ledru [:Sylvestre]

Updated

•

10 years ago

status-firefox45: --- → affected

Sylvestre Ledru [:Sylvestre]

Comment 218

•

10 years ago

Comment on attachment 8700956 [details] [diff] [review] Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section Fix a frequent intermittent crash, taking it.

Attachment #8700956 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 219

•

10 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/e9feef7c35f7 https://hg.mozilla.org/releases/mozilla-aurora/rev/a02979f26a2f

status-firefox45: affected → fixed

Comment hidden (Intermittent Failures Robot)

MozReview Request: Bug 1113930 - Add frame size santy check in frame pointer stack walk. 10 years ago Xidorn Quan [:xidorn] UTC+10 58 bytes, text/x-review-board-request		Details
Move __libc_stack_end related code block from StackWalk.cpp in a non-OSX section 10 years ago Mike Hommey [:glandium] 1.85 KB, patch	froydnj : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Use the actual stack end address on OSX for the stack walker 10 years ago Mike Hommey [:glandium] 1.47 KB, patch		Details \| Diff \| Splinter Review
Use the actual stack end address on x86 OSX and Android for the stack walker 10 years ago Mike Hommey [:glandium] 2.96 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review