Closed Bug 1395708 Opened 7 years ago Closed 7 years ago

stylo: Crash in AnyToAnyBlock

Categories

(Core :: CSS Parsing and Computation, defect, P1)

Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 --- fixed

People

(Reporter: mccr8, Assigned: bholley)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-c0dcbb11-52c2-4755-9005-f5a3e0170831.
=============================================================

The crash stack is not very useful (AnyToAnyBlock in module vImage off on some random thread), but every crash report I looked at has Stylo on the main thread, somewhere in geckoservo::glue::Servo_TraverseSubtree.

There are 16 of these crashes, which isn't a huge amount, but is a decent amount for OSX. These are all on OSX 10.11.

vImage is "a high-performance image processing framework" on OSX according to a Google search I did. https://developer.apple.com/library/content/documentation/Performance/Conceptual/vImage/Introduction/Introduction.html
The first build this showed up in is 20170827100428. If that's right, then the regressing change set would be in here somewhere: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f0b519d13e9dd945b37b235db1981d5ffb2199f&tochange=f819969d7619f01e806e2685b8b3196f64624551
Could they be stack overflows? We dropped the thread-pool stack size a while ago, and I've experienced some in debug builds very recently.
(And the changeset doing that is on that regression range)
So, the theory that this bug and bug 1394878 are related to stack exhaustion is a good one. Looking through reports for both bugs, the crashing address always seems to be pretty close to rsp. So that would imply that we're running out of stack space (which is plausible, given that we have 128k stacks on the worker threads). We could bump the stack size, or increase the safety margin, or both, but it would be good to understand the problem first.

There are no stacks for this bug, but there are stacks for bug 1394878. Oddly, the stacks on those bugs don't look especially deep. There tends to be some amount of nested calls to style::parallel::traverse_nodes (which is certainly suspicious), but each of those stack frames (at least on my machine) are 1760 bytes. So the only way that would bust the stack would be if the stuff underneath them (in particular the native lib stuff) was especially stack-hungry.

I think this is an answer that can be determined from minidumps, but I don't have the permissions to see them nor the expertise to evaluate them. Ted said he'd take a look tomorrow.
Flags: needinfo?(ted)
See Also: → 1394878
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> There are no stacks for this bug, but there are stacks for bug 1394878.
> Oddly, the stacks on those bugs don't look especially deep. There tends to
> be some amount of nested calls to style::parallel::traverse_nodes (which is
> certainly suspicious), but each of those stack frames (at least on my
> machine) are 1760 bytes. So the only way that would bust the stack would be
> if the stuff underneath them (in particular the native lib stuff) was
> especially stack-hungry.

FWIW I debugged a local debug build stack overflow, and it was inside the stuff for computing animation rules.

That can get relatively deep, going through effectCompositor, etc, and then calling back into servo.
The dump in comment 0 has some issues. I hacked around it a bit and still couldn't get anything that looks actually useful out of it. The stack pointer in the crashing thread (thread 40) does seem to be in valid memory, but the memory region associated with it in the minidump looks weird:
region[40]
MDMemoryDescriptor
  start_of_memory_range = 0x70000268ae80
  memory.data_size      = 0x180
  memory.rva            = 0x70368

It's only 384 bytes long. Also, it overlaps with the region immediately before it, which happens to be the stack memory for the previous thread (thread 39):
region[39]
MDMemoryDescriptor
  start_of_memory_range = 0x700002688708
  memory.data_size      = 0x28f8
  memory.rva            = 0x06 d5 a0

That makes me think that maybe this stack overflowed by quite a bit, and the stack pointer wound up overflowing into another thread's stack and then things broke. How plausible does that sound?
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> That makes me think that maybe this stack overflowed by quite a bit, and the
> stack pointer wound up overflowing into another thread's stack and then
> things broke. How plausible does that sound?

That certainly sounds plausible. If the OS is lining up stack regions right next to each other in address space with no red zone om between, then a stack overflow would just clobber another stack. That sounds pretty terrifying, really - I'd been assuming we'd mostly catch stack overflows with predictable memory faults, but clobbering another stack is a much bigger deal. Do you think this is a real concern?

Given that we do have stacks for bug 1394878, I would like to analyze the problem from that angle as well. Can you do the analysis requested in comment 4?
Flags: needinfo?(ted)
Priority: -- → P1
Assigning to myself since I'm driving the investigation here.
Assignee: nobody → bobbyholley
There are guard pages on OS X, but AFAIK they're only one page in size so it's possible for code that creates very large stack frames to overflow them (c.f. the recently published "stack clash" exploit).
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> There are guard pages on OS X, but AFAIK they're only one page in size so
> it's possible for code that creates very large stack frames to overflow them
> (c.f. the recently published "stack clash" exploit).

Ok. Hopefully the analysis from comment 4 will explain why we manage to overflow the stack with so few frames, which may give us a hint as to how this could happen.
Using the defaults

  pub const STYLE_THREAD_STACK_SIZE_KB: usize = 128;
  pub const STACK_SAFETY_MARGIN_KB: usize = 45;

I get stack overflows on a pretty harmless looking page
https://www.theregister.co.uk, having built with no optimisation
for rust code, "-Og" for C++ code, and using LLVM 4.0.1 and gcc-6.4.1
on F25 x86_64.  Doubling them both gets it back on the road.

Would it be wise to increase the stack size for 57?
(In reply to Julian Seward [:jseward] from comment #12)
> Would it be wise to increase the stack size for 57?

Yes, that is the plan, but first I want to understand how such a small number of stack frames is causing overflow, per comment 4. In particular, I want to know if the styling stack frames are much larger than what I'm measuring, or if the OS stack frames are large, since it impacts how we tune the parameters.

So still need ted here.
https://gist.github.com/luser/a80af7e97fbb10a18daf0a7ed801efdc

The interesting bits are:

96 vImage!vLinearTransform_8to16Q12_vec
96 vImage!vMinH_Interleaved
272 vImage!vMinV_Interleaved
128 vImage!vRotate_90_ARGB_8888
72064 vImage!vRotate_90_Planar_UInt16
33440 vImage!vRotate_90_Planar_UInt8
480 CoreGraphics
48 CoreGraphics
256 CoreGraphics
144 CoreGraphics
144 CoreGraphics
80 AppKit!convertColorToColorSpaceNamed
80 AppKit!-[NSDynamicSystemColor
64 XUL!nsLookAndFeel::NativeGetColor(mozilla::LookAndFeel::ColorID,
80 XUL!nsXPLookAndFeel::GetColorImpl(mozilla::LookAndFeel::ColorID,
48 XUL!Gecko_GetLookAndFeelSystemColor
32 XUL!style::values::specified::color::{{impl}}::to_computed_value
48 XUL!style::properties::longhands::color::cascade_property
2272 XUL!style::properties::cascade
192 XUL!style::style_resolver::{{impl}}::cascade_style<style::gecko::wrapper::GeckoElement>
128 XUL!style::style_resolver::{{impl}}::resolve_primary_style<style::gecko::wrapper::GeckoElement>
160 XUL!style::style_resolver::{{impl}}::resolve_style<style::gecko::wrapper::GeckoElement>
2032 XUL!style::traversal::compute_style<style::gecko::wrapper::GeckoElement>
1712 XUL!style::parallel::traverse_nodes<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly,smallvec::Drain<style::dom::SendNode<style::gecko::wrapper::GeckoNode>>>
1744 XUL!style::parallel::traverse_nodes<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly,smallvec::Drain<style::dom::SendNode<style::gecko::wrapper::GeckoNode>>>
Flags: needinfo?(ted)
So anyway, that solves the puzzle. The color conversion routines jointly consume ~105k of stack space, which is insane. We could try to pre-cache those values, but it's probably safer to just increase the stack size and safety margin. As long as we keep our recursion limit tight, this should generally only cost us address space, not actual committed pages.
Also, ted verified that the stack frame for vRotate_90_Planar_UInt16 is just a few hundred bytes on OSX 10.12, which explains why we only see crashes on OSX 10.11.
MozReview-Commit-ID: Fv9DttU20hM
Attachment #8905254 - Flags: review?(jseward)
Nice. It's always quite satisfying when there's a simple answer that explains the symptoms. Thanks Ted!
For the record, I ran the minidump from bug 1394878 comment 0 through minidump_stackwalk locally, which produces output like this (trimmed to just the crashing thread):
https://gist.github.com/luser/1d9cf875c4c183f94d2d9540fb7b3564

Then I fed that through a little Python script to just subtract the previous frame's frame pointer from the current one for each frame on the stack, and print it out with the last stack's function name attached:
https://gist.github.com/luser/99cbcb8c68bdd3d606e827f40321668a

(This probably would have been more straightforward to write using the Breakpad C++ API, but then I would have had to write C++ for it.)
Blocks: 1394878
Comment on attachment 8905254 [details] [diff] [review]
Increase stack safety margin for stylo. v1

Review of attachment 8905254 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +53,2 @@
>  ///
> +pub const STACK_SAFETY_MARGIN_KB: usize = 168;

You've actually added 123K here, not 128K.  Is that intended,
given the comment above ("we added it all to the safety margin")?
Attachment #8905254 - Flags: review?(jseward) → review+
(In reply to Julian Seward [:jseward] from comment #20)
> Comment on attachment 8905254 [details] [diff] [review]
> Increase stack safety margin for stylo. v1
> 
> Review of attachment 8905254 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/parallel.rs
> @@ +53,2 @@
> >  ///
> > +pub const STACK_SAFETY_MARGIN_KB: usize = 168;
> 
> You've actually added 123K here, not 128K.  Is that intended,
> given the comment above ("we added it all to the safety margin")?

I added 123k to the 45k that was currently being used, but the comment was intended to be relative to the 40k used in the documented measurement. The fact that we used 45k for a while is a detail that doesn't need to be preserved.
https://hg.mozilla.org/integration/autoland/rev/95543c372e9f9def5a42d71d4d5356917a031a89

Calixte, can you verify (in a few days) that this stops the crash reports?
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(cdenizet)
Resolution: --- → FIXED
:bholley, sure
Target Milestone: --- → mozilla57
Looks like the crashes stopped.
Status: RESOLVED → VERIFIED
Flags: needinfo?(cdenizet)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: