Closed
Bug 1395708
Opened 7 years ago
Closed 7 years ago
stylo: Crash in AnyToAnyBlock
Categories
(Core :: CSS Parsing and Computation, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox55 | --- | unaffected |
firefox56 | --- | unaffected |
firefox57 | --- | fixed |
People
(Reporter: mccr8, Assigned: bholley)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
2.74 KB,
patch
|
jseward
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-c0dcbb11-52c2-4755-9005-f5a3e0170831. ============================================================= The crash stack is not very useful (AnyToAnyBlock in module vImage off on some random thread), but every crash report I looked at has Stylo on the main thread, somewhere in geckoservo::glue::Servo_TraverseSubtree. There are 16 of these crashes, which isn't a huge amount, but is a decent amount for OSX. These are all on OSX 10.11. vImage is "a high-performance image processing framework" on OSX according to a Google search I did. https://developer.apple.com/library/content/documentation/Performance/Conceptual/vImage/Introduction/Introduction.html
Reporter | ||
Comment 1•7 years ago
|
||
The first build this showed up in is 20170827100428. If that's right, then the regressing change set would be in here somewhere: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f0b519d13e9dd945b37b235db1981d5ffb2199f&tochange=f819969d7619f01e806e2685b8b3196f64624551
Comment 2•7 years ago
|
||
Could they be stack overflows? We dropped the thread-pool stack size a while ago, and I've experienced some in debug builds very recently.
Comment 3•7 years ago
|
||
(And the changeset doing that is on that regression range)
Assignee | ||
Comment 4•7 years ago
|
||
So, the theory that this bug and bug 1394878 are related to stack exhaustion is a good one. Looking through reports for both bugs, the crashing address always seems to be pretty close to rsp. So that would imply that we're running out of stack space (which is plausible, given that we have 128k stacks on the worker threads). We could bump the stack size, or increase the safety margin, or both, but it would be good to understand the problem first. There are no stacks for this bug, but there are stacks for bug 1394878. Oddly, the stacks on those bugs don't look especially deep. There tends to be some amount of nested calls to style::parallel::traverse_nodes (which is certainly suspicious), but each of those stack frames (at least on my machine) are 1760 bytes. So the only way that would bust the stack would be if the stuff underneath them (in particular the native lib stuff) was especially stack-hungry. I think this is an answer that can be determined from minidumps, but I don't have the permissions to see them nor the expertise to evaluate them. Ted said he'd take a look tomorrow.
Flags: needinfo?(ted)
Comment 5•7 years ago
|
||
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4) > There are no stacks for this bug, but there are stacks for bug 1394878. > Oddly, the stacks on those bugs don't look especially deep. There tends to > be some amount of nested calls to style::parallel::traverse_nodes (which is > certainly suspicious), but each of those stack frames (at least on my > machine) are 1760 bytes. So the only way that would bust the stack would be > if the stuff underneath them (in particular the native lib stuff) was > especially stack-hungry. FWIW I debugged a local debug build stack overflow, and it was inside the stuff for computing animation rules. That can get relatively deep, going through effectCompositor, etc, and then calling back into servo.
Comment 6•7 years ago
|
||
The dump in comment 0 has some issues. I hacked around it a bit and still couldn't get anything that looks actually useful out of it. The stack pointer in the crashing thread (thread 40) does seem to be in valid memory, but the memory region associated with it in the minidump looks weird: region[40] MDMemoryDescriptor start_of_memory_range = 0x70000268ae80 memory.data_size = 0x180 memory.rva = 0x70368 It's only 384 bytes long. Also, it overlaps with the region immediately before it, which happens to be the stack memory for the previous thread (thread 39): region[39] MDMemoryDescriptor start_of_memory_range = 0x700002688708 memory.data_size = 0x28f8 memory.rva = 0x06 d5 a0 That makes me think that maybe this stack overflowed by quite a bit, and the stack pointer wound up overflowing into another thread's stack and then things broke. How plausible does that sound?
Flags: needinfo?(ted)
Assignee | ||
Comment 7•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6) > That makes me think that maybe this stack overflowed by quite a bit, and the > stack pointer wound up overflowing into another thread's stack and then > things broke. How plausible does that sound? That certainly sounds plausible. If the OS is lining up stack regions right next to each other in address space with no red zone om between, then a stack overflow would just clobber another stack. That sounds pretty terrifying, really - I'd been assuming we'd mostly catch stack overflows with predictable memory faults, but clobbering another stack is a much bigger deal. Do you think this is a real concern? Given that we do have stacks for bug 1394878, I would like to analyze the problem from that angle as well. Can you do the analysis requested in comment 4?
Flags: needinfo?(ted)
Assignee | ||
Updated•7 years ago
|
Priority: -- → P1
Assignee | ||
Comment 8•7 years ago
|
||
Assigning to myself since I'm driving the investigation here.
Assignee: nobody → bobbyholley
Comment 9•7 years ago
|
||
There are guard pages on OS X, but AFAIK they're only one page in size so it's possible for code that creates very large stack frames to overflow them (c.f. the recently published "stack clash" exploit).
Assignee | ||
Comment 10•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9) > There are guard pages on OS X, but AFAIK they're only one page in size so > it's possible for code that creates very large stack frames to overflow them > (c.f. the recently published "stack clash" exploit). Ok. Hopefully the analysis from comment 4 will explain why we manage to overflow the stack with so few frames, which may give us a hint as to how this could happen.
Comment 12•7 years ago
|
||
Using the defaults pub const STYLE_THREAD_STACK_SIZE_KB: usize = 128; pub const STACK_SAFETY_MARGIN_KB: usize = 45; I get stack overflows on a pretty harmless looking page https://www.theregister.co.uk, having built with no optimisation for rust code, "-Og" for C++ code, and using LLVM 4.0.1 and gcc-6.4.1 on F25 x86_64. Doubling them both gets it back on the road. Would it be wise to increase the stack size for 57?
Assignee | ||
Comment 13•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #12) > Would it be wise to increase the stack size for 57? Yes, that is the plan, but first I want to understand how such a small number of stack frames is causing overflow, per comment 4. In particular, I want to know if the styling stack frames are much larger than what I'm measuring, or if the OS stack frames are large, since it impacts how we tune the parameters. So still need ted here.
Assignee | ||
Comment 14•7 years ago
|
||
https://gist.github.com/luser/a80af7e97fbb10a18daf0a7ed801efdc The interesting bits are: 96 vImage!vLinearTransform_8to16Q12_vec 96 vImage!vMinH_Interleaved 272 vImage!vMinV_Interleaved 128 vImage!vRotate_90_ARGB_8888 72064 vImage!vRotate_90_Planar_UInt16 33440 vImage!vRotate_90_Planar_UInt8 480 CoreGraphics 48 CoreGraphics 256 CoreGraphics 144 CoreGraphics 144 CoreGraphics 80 AppKit!convertColorToColorSpaceNamed 80 AppKit!-[NSDynamicSystemColor 64 XUL!nsLookAndFeel::NativeGetColor(mozilla::LookAndFeel::ColorID, 80 XUL!nsXPLookAndFeel::GetColorImpl(mozilla::LookAndFeel::ColorID, 48 XUL!Gecko_GetLookAndFeelSystemColor 32 XUL!style::values::specified::color::{{impl}}::to_computed_value 48 XUL!style::properties::longhands::color::cascade_property 2272 XUL!style::properties::cascade 192 XUL!style::style_resolver::{{impl}}::cascade_style<style::gecko::wrapper::GeckoElement> 128 XUL!style::style_resolver::{{impl}}::resolve_primary_style<style::gecko::wrapper::GeckoElement> 160 XUL!style::style_resolver::{{impl}}::resolve_style<style::gecko::wrapper::GeckoElement> 2032 XUL!style::traversal::compute_style<style::gecko::wrapper::GeckoElement> 1712 XUL!style::parallel::traverse_nodes<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly,smallvec::Drain<style::dom::SendNode<style::gecko::wrapper::GeckoNode>>> 1744 XUL!style::parallel::traverse_nodes<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly,smallvec::Drain<style::dom::SendNode<style::gecko::wrapper::GeckoNode>>>
Flags: needinfo?(ted)
Assignee | ||
Comment 15•7 years ago
|
||
So anyway, that solves the puzzle. The color conversion routines jointly consume ~105k of stack space, which is insane. We could try to pre-cache those values, but it's probably safer to just increase the stack size and safety margin. As long as we keep our recursion limit tight, this should generally only cost us address space, not actual committed pages.
Assignee | ||
Comment 16•7 years ago
|
||
Also, ted verified that the stack frame for vRotate_90_Planar_UInt16 is just a few hundred bytes on OSX 10.12, which explains why we only see crashes on OSX 10.11.
Assignee | ||
Comment 17•7 years ago
|
||
MozReview-Commit-ID: Fv9DttU20hM
Attachment #8905254 -
Flags: review?(jseward)
Comment 18•7 years ago
|
||
Nice. It's always quite satisfying when there's a simple answer that explains the symptoms. Thanks Ted!
Comment 19•7 years ago
|
||
For the record, I ran the minidump from bug 1394878 comment 0 through minidump_stackwalk locally, which produces output like this (trimmed to just the crashing thread): https://gist.github.com/luser/1d9cf875c4c183f94d2d9540fb7b3564 Then I fed that through a little Python script to just subtract the previous frame's frame pointer from the current one for each frame on the stack, and print it out with the last stack's function name attached: https://gist.github.com/luser/99cbcb8c68bdd3d606e827f40321668a (This probably would have been more straightforward to write using the Breakpad C++ API, but then I would have had to write C++ for it.)
Comment 20•7 years ago
|
||
Comment on attachment 8905254 [details] [diff] [review] Increase stack safety margin for stylo. v1 Review of attachment 8905254 [details] [diff] [review]: ----------------------------------------------------------------- ::: servo/components/style/parallel.rs @@ +53,2 @@ > /// > +pub const STACK_SAFETY_MARGIN_KB: usize = 168; You've actually added 123K here, not 128K. Is that intended, given the comment above ("we added it all to the safety margin")?
Attachment #8905254 -
Flags: review?(jseward) → review+
Assignee | ||
Comment 21•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #20) > Comment on attachment 8905254 [details] [diff] [review] > Increase stack safety margin for stylo. v1 > > Review of attachment 8905254 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: servo/components/style/parallel.rs > @@ +53,2 @@ > > /// > > +pub const STACK_SAFETY_MARGIN_KB: usize = 168; > > You've actually added 123K here, not 128K. Is that intended, > given the comment above ("we added it all to the safety margin")? I added 123k to the 45k that was currently being used, but the comment was intended to be relative to the 40k used in the documented measurement. The fact that we used 45k for a while is a detail that doesn't need to be preserved.
Assignee | ||
Comment 22•7 years ago
|
||
https://github.com/servo/servo/pull/18412
Assignee | ||
Comment 23•7 years ago
|
||
https://hg.mozilla.org/integration/autoland/rev/95543c372e9f9def5a42d71d4d5356917a031a89 Calixte, can you verify (in a few days) that this stops the crash reports?
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(cdenizet)
Resolution: --- → FIXED
Comment 24•7 years ago
|
||
:bholley, sure
Updated•7 years ago
|
status-firefox55:
--- → unaffected
status-firefox56:
--- → unaffected
status-firefox57:
--- → fixed
status-firefox-esr52:
--- → unaffected
Target Milestone: --- → mozilla57
Assignee | ||
Comment 26•7 years ago
|
||
Looks like the crashes stopped.
Status: RESOLVED → VERIFIED
Flags: needinfo?(cdenizet)
You need to log in
before you can comment on or make changes to this bug.
Description
•