Closed
Bug 1368302
Opened 8 years ago
Closed 8 years ago
stylo: possible stack overflow when processing very deep DOM
Categories
(Core :: CSS Parsing and Computation, defect)
Core
CSS Parsing and Computation
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: jseward, Assigned: jseward)
References
Details
Attachments
(4 files, 3 obsolete files)
Whilst working with variants of the bloom-basic test for Stylo perf
profiling, I noticed that it is easy to cause Firefox to segfault on
DOMs with a depth of more than about 1500 elements. I don't know whether
this is really a Stylo issue, but it seems like a possible DOS and
so I am reporting it.
Unfortunately the debuginfo for libxul.so is somehow not right
(rustc problem?) and GDB won't show the fault point at all. Valgrind
manages just one frame:
nodes at depth 2000: 1
nodes at depth 2001: 0
total nodes: 10380
--27771-- sync signal handler: signal=11, si_code=2, EIP=0x142e8b48, eip=0x101c2a1bc7, from kernel
--27771-- SIGSEGV: si_code=2 faultaddr=0x28300e68 tid=27 ESP=0x28300e68 seg=0x28300000-0x28300fff
--27771-- delivering signal 11 (SIGSEGV):2 to thread 27
--27771-- push_signal_frame (thread 27): signal 11
==27771== at 0x142E8B48: style::properties::StyleBuilder::mutate_box (/checkout/src/libcore/mem.rs:334)
Segmentation fault (core dumped)
GDB offers this:
nodes at depth 1999: 1
nodes at depth 2000: 1
nodes at depth 2001: 0
total nodes: 10380
QQQ iteration 1
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
Missing separate debuginfos, use: dnf debuginfo-install gtk3-3.22.12-2.fc25.x86_64
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
(gdb) where
Python Exception <class 'gdb.error'> Frame is invalid.:
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
STR: load my-bloom-basic.html (attached)
Assignee | ||
Comment 1•8 years ago
|
||
Assignee | ||
Comment 2•8 years ago
|
||
Assignee | ||
Comment 3•8 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #0)
> (gdb) where
> Python Exception <class 'gdb.error'> Frame is invalid.:
> Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea
> [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
Nathan, have you seen anything like that before? This is built with
"clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9 2017-04-24)"
on Fedora 25.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 4•8 years ago
|
||
It's interesting that Valgrind can't unwind out of it, because AFAIK
it doesn't care (at least for unwinding) about the DIEs at all, to
the extent of not even trying to read them. That says to me that the
stack might be trashed.
![]() |
||
Comment 5•8 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #3)
> (In reply to Julian Seward [:jseward] from comment #0)
> > (gdb) where
> > Python Exception <class 'gdb.error'> Frame is invalid.:
> > Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea
> > [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
>
> Nathan, have you seen anything like that before? This is built with
> "clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9
> 2017-04-24)"
> on Fedora 25.
I haven't seen anything like this. It seems pretty unusual that there'd be invalid dwarf generated by clang, so perhaps this is confusion somewhere else?
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 6•8 years ago
|
||
I used objcopy to remove the .debug_info section on the basis that
that's where the DIEs that GDB doesn't like, came from:
objcopy --remove-section=.debug_info libxul.so-ORIG libxul.so
and like that I can at least get a stack trace. It does look
like a stack overflow caused by recursion in
style::parallel::traverse_nodes::h08dccb9d87aa4e5f. There are
about 1750 such frames on the stack. See attachment.
Assignee | ||
Comment 7•8 years ago
|
||
Comment 8•8 years ago
|
||
In general, we don't treat stack overflows as critical security issues, since they're just DoS. Also, stylo isn't shipping to users yet. I can't seem to unhide this bug though. Julian, can you do it?
I think what we want here is just to detect the recursion depth in the tail-call stuff in parallel.rs and eliminate the tail-call optimization if we get too big (say, 500?). Julian, want to whip up a patch and retest?
Component: DOM → CSS Parsing and Computation
Flags: needinfo?(jseward)
Summary: Stylo: possible stack overflow when processing very deep DOM → stylo: possible stack overflow when processing very deep DOM
Comment 9•8 years ago
|
||
Julian asked me to unhide this bug pointing at comment 8.
Group: core-security
Assignee | ||
Comment 10•8 years ago
|
||
Assignee: nobody → jseward
Flags: needinfo?(jseward)
Assignee | ||
Comment 11•8 years ago
|
||
Bobby, how does that look? Without it, Stylo segfaults for trees more
than about 1700 deep. With it, I can process trees up to at least 4500
deep. Deeper than that and my JS tree-generation routines give up due
to too much recursion, but that's a different problem.
I am vaguely concerned that this assertion at the end of |traverse_nodes|
might fail
// If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
// directly.
debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call());
because the equivalence (if there ever was one, and if that is indeed
asserting it) "exactly one child == we're doing a tail call" now no longer
holds. I haven't tested with a debug build. Any insight on that?
Flags: needinfo?(bobbyholley)
Comment 12•8 years ago
|
||
Comment on attachment 8872651 [details] [diff] [review]
Limits recursion depth to 500, per comment 8
Review of attachment 8872651 [details] [diff] [review]:
-----------------------------------------------------------------
::: servo/components/style/parallel.rs
@@ +226,5 @@
> impl DispatchMode {
> fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +static MAX_RECURSION_DEPTH: u32 = 500;
Nit: I'd make this a usize (here and where it's passed as an arg), since that's the convention for counters, though I'd be curious if you have any particular insight on the tradeoffs.
@@ +232,4 @@
> #[inline]
> fn traverse_nodes<'a, 'scope, E, D>(nodes: NodeList<E::ConcreteNode>,
> + mode_orig: DispatchMode,
> + recursion_depth_orig: u32,
Nit: I would make both of these |mut| and nix the _orig business.
@@ +242,5 @@
> D: DomTraversal<E>,
> {
> debug_assert!(!nodes.is_empty());
>
> + // Disable the tailcall optimisation after a given depth, so as to avoid
Nit: en-US spelling of optimization.
@@ +259,5 @@
> } else {
> // The caller isn't done yet. Append to the queue and return synchronously.
> scope.spawn(move |scope| {
> let nodes = nodes;
> + top_down_dom(&nodes, recursion_depth,
This is wrong - we don't want to propagate the depth into scope.spawn() calls, because we'll unwind the stack before processing those.
@@ +274,5 @@
> let boxed = chunk.iter().cloned().collect::<Vec<_>>().into_boxed_slice();
> let traversal_data_copy = traversal_data.clone();
> scope.spawn(move |scope| {
> let b = boxed;
> + top_down_dom(&*b, recursion_depth, root,
Same here.
@@ +283,5 @@
> }
>
> // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
> // directly.
> debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call());
You can fix this by doing the following:
(1) Rename MAX_RECURSION_DEPTH to RECURSION_DEPTH_LIMIT
(2) Make the mode adjustment >= instead of >.
(3) Change this to debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call() || recursion_depth == RECURSION_DEPTH_LIMIT);
Attachment #8872651 -
Flags: review-
Attachment #8872651 -
Flags: feedback+
Updated•8 years ago
|
Flags: needinfo?(bobbyholley)
Comment 13•8 years ago
|
||
(Worth re-testing your testcase with those fixes, and also doing a try push with |-b do -p linux64-stylo -u all -t none| before landing).
Assignee | ||
Comment 14•8 years ago
|
||
Addresses all comments in comment 12. Verified working up to depth 5500.
This sets a depth limit of 500. I also tried with 1000 but some testing with
bloom-basic variants didn't show any change to the level of achieved
parallelism (nor CPU) so I stayed with 500 so as to be conservative.
Looks OK on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cec3f616bf699490e57ebb99b33b0a963548210
One detail: the recursion_depth incrementer at the start of traverse_nodes
turned out more complex than I expected, due to the need to cause it to stick
at exactly RECURSION_DEPTH_LIMIT so as to make the previously discussed
assertion happy. Not sure if there's a simpler way to do that.
Attachment #8872651 -
Attachment is obsolete: true
Attachment #8872985 -
Flags: review?(bobbyholley)
Comment 15•8 years ago
|
||
Comment on attachment 8872985 [details] [diff] [review]
bug1368302-2.cset
Review of attachment 8872985 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with those three changes.
::: servo/components/style/parallel.rs
@@ +226,5 @@
> impl DispatchMode {
> fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +static RECURSION_DEPTH_LIMIT: usize = 500;
Make this const instead.
@@ +254,5 @@
> + }
> + else {
> + debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT);
> + mode = DispatchMode::NotTailCall;
> + }
This is more complicated than it needs to be. Let's make this:
recursion_depth += 1;
debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT);
if recursion_depth == RECURSION_DEPTH_LIMIT {
mode = DispatchMode::NotTailCall;
}
@@ +293,5 @@
> // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
> // directly.
> + debug_assert_eq!(first_chunk.is_some(),
> + mode.is_tail_call()
> + || recursion_depth == RECURSION_DEPTH_LIMIT);
Looking at this again, you actually don't need to modify this assertion at all, because |mode| will already be fixed up by the time first_chunk is computed. Please revert this change.
Attachment #8872985 -
Flags: review?(bobbyholley) → review+
Comment 16•8 years ago
|
||
I think we may actually not need the patch in this bug, because I think I need to remove the tail call optimization entirely in bug 1365692.
Comment 17•8 years ago
|
||
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> I think we may actually not need the patch in this bug, because I think I
> need to remove the tail call optimization entirely in bug 1365692.
The tail call optimization is still there, but in reduced form. So I think we'll still want to land this on top of that bug once it lands.
Priority: -- → P2
Comment 18•8 years ago
|
||
This patch needs rebasing. Also worth re-measuring that we crash without the patch and don't crash with it.
Flags: needinfo?(jseward)
Assignee | ||
Comment 19•8 years ago
|
||
Rebased, and takes into account comment 15, to the extent that it is still
relevant.
I reduced the threshold from 500 to 200 in this version, having observed
with GDB that the current parallel.rs will fail with a stack overflow at
only 600 levels of recursion. So setting it to 200 seems like a
conservative margin -- in particular given that we're assuming that limits
that are OK on Linux also carry over to other platforms, particularly
Windows. IIUC almost no real pages will have a DOM that deep, so 200 will
not have any perf impact for real pages.
I verified that this successfully handles a 5000-deep chain tree DOM, that
definitely segfaults without the patch.
Attachment #8872985 -
Attachment is obsolete: true
Flags: needinfo?(jseward)
Assignee | ||
Updated•8 years ago
|
Attachment #8880300 -
Flags: review?(bobbyholley)
Comment 20•8 years ago
|
||
Comment on attachment 8880300 [details] [diff] [review]
bug1368302-3.cset
Review of attachment 8880300 [details] [diff] [review]:
-----------------------------------------------------------------
::: servo/components/style/parallel.rs
@@ +240,5 @@
> impl DispatchMode {
> fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +const RECURSION_DEPTH_LIMIT: usize = 200;
Please add a describing the measurements involved in setting this.
@@ +261,5 @@
> // This is a tail call from the perspective of the caller. However, we only
> // want to actually dispatch the job as a tail call if there's nothing left
> // in our local queue. Otherwise we need to return to it to maintain proper
> + // breadth-first ordering. We also need to take care to avoid stack
> + // overflow due to excessive tail recursion. See bug 1368302.
Add a comment here indicating that this limit isn't observable to content, since we're still totally correct, we just skip the tail call optimization for one iteration.
@@ +270,5 @@
> + && !pool.current_thread_has_pending_tasks().unwrap()
> + } else {
> + debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT);
> + false
> + };
Nit: I think we could clean this up now a bit as follows:
debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT);
let may_dispatch_tail = mode.is_tail_call() &&
recursion_depth != RECURSION_DEPTH_LIMIT &&
!pool.current_thread_has_pending_tasks().unwrap();
@@ +277,5 @@
> // case we can pass the SmallVec directly and avoid extra allocation.
> if nodes.len() <= WORK_UNIT_MAX {
> let work = nodes.iter().cloned().collect::<WorkUnit<E::ConcreteNode>>();
> if may_dispatch_tail {
> + top_down_dom(&work, recursion_depth, root,
And then just pass recursion_depth + 1 here.
Attachment #8880300 -
Flags: review?(bobbyholley) → review+
Comment 21•8 years ago
|
||
Just a reminder to land this. :-)
Assignee | ||
Comment 22•8 years ago
|
||
Attachment #8880300 -
Attachment is obsolete: true
Updated•8 years ago
|
Priority: P2 → --
Assignee | ||
Comment 23•8 years ago
|
||
This was merged to m-i/m-c some time back:
changeset: 366942:8d137703e636
user: Julian Seward <jseward@acm.org>
date: Fri Jun 30 03:01:08 2017 -0700
summary: servo: Merge #17563 - Bug 1368302 - stylo: possible stack overflow when processing very dee… (from julian-seward1:master); r=bholley
Can we close it now?
Comment 24•8 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #23)
> This was merged to m-i/m-c some time back:
>
> changeset: 366942:8d137703e636
> user: Julian Seward <jseward@acm.org>
> date: Fri Jun 30 03:01:08 2017 -0700
> summary: servo: Merge #17563 - Bug 1368302 - stylo: possible stack
> overflow when processing very dee… (from julian-seward1:master); r=bholley
>
> Can we close it now?
Yep, thanks!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
status-firefox57:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•