Closed Bug 1368302 Opened 7 years ago Closed 7 years ago

stylo: possible stack overflow when processing very deep DOM

Categories

(Core :: CSS Parsing and Computation, defect)


Tracking


RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(4 files, 3 obsolete files)

Whilst working with variants of the bloom-basic test for Stylo perf profiling, I noticed that it is easy to cause Firefox to segfault on DOMs with a depth of more than about 1500 elements. I don't know whether this is really a Stylo issue, but it seems like a possible DoS and so I am reporting it.

Unfortunately the debuginfo for libxul.so is somehow not right (rustc problem?) and GDB won't show the fault point at all. Valgrind manages just one frame:

nodes at depth 2000: 1
nodes at depth 2001: 0
total nodes: 10380
--27771-- sync signal handler: signal=11, si_code=2, EIP=0x142e8b48, eip=0x101c2a1bc7, from kernel
--27771-- SIGSEGV: si_code=2 faultaddr=0x28300e68 tid=27 ESP=0x28300e68 seg=0x28300000-0x28300fff
--27771-- delivering signal 11 (SIGSEGV):2 to thread 27
--27771-- push_signal_frame (thread 27): signal 11
==27771==    at 0x142E8B48: style::properties::StyleBuilder::mutate_box (/checkout/src/libcore/mem.rs:334)
Segmentation fault (core dumped)

GDB offers this:

nodes at depth 1999: 1
nodes at depth 2000: 1
nodes at depth 2001: 0
total nodes: 10380
QQQ iteration 1
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
Missing separate debuginfos, use: dnf debuginfo-install gtk3-3.22.12-2.fc25.x86_64
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
(gdb) where
Python Exception <class 'gdb.error'> Frame is invalid.:
Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]

STR: load my-bloom-basic.html (attached)
Attached file my-bloom-basic.html
Attached file my-util.js
(In reply to Julian Seward [:jseward] from comment #0)
> (gdb) where
> Python Exception <class 'gdb.error'> Frame is invalid.:
> Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea
> [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]

Nathan, have you seen anything like that before? This is built with "clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9 2017-04-24)" on Fedora 25.
Flags: needinfo?(nfroyd)
It's interesting that Valgrind can't unwind out of it, because AFAIK Valgrind doesn't use the DIEs at all for unwinding -- it doesn't even try to read them. That suggests to me that the stack might be trashed.
(In reply to Julian Seward [:jseward] from comment #3)
> (In reply to Julian Seward [:jseward] from comment #0)
> > (gdb) where
> > Python Exception <class 'gdb.error'> Frame is invalid.:
> > Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea
> > [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so]
>
> Nathan, have you seen anything like that before? This is built with
> "clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9
> 2017-04-24)" on Fedora 25.

I haven't seen anything like this. It seems pretty unusual that there'd be invalid dwarf generated by clang, so perhaps this is confusion somewhere else?
Flags: needinfo?(nfroyd)
I used objcopy to remove the .debug_info section, on the basis that that's where the DIEs that GDB doesn't like came from:

objcopy --remove-section=.debug_info libxul.so-ORIG libxul.so

With that done I can at least get a stack trace. It does look like a stack overflow caused by recursion in style::parallel::traverse_nodes::h08dccb9d87aa4e5f. There are about 1750 such frames on the stack. See attachment.
Attached file Crash backtrace
In general, we don't treat stack overflows as critical security issues, since they're just a DoS. Also, stylo isn't shipping to users yet. I can't seem to unhide this bug though. Julian, can you do it?

I think what we want here is just to track the recursion depth in the tail-call stuff in parallel.rs and disable the tail-call optimization if we get too deep (say, 500?). Julian, want to whip up a patch and retest?
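The idea in the previous paragraph can be sketched in isolation. This is a hypothetical standalone model, not the actual parallel.rs code: `traverse_chain` stands in for the traversal, and `Dispatch::NeedsQueue` marks the point where the real code would hand the remaining work to rayon's queue instead of recursing further.

```rust
// Sketch: recurse directly (the "tail call" path) until a fixed depth
// limit, then bail out so the caller can queue the rest, letting the
// stack unwind. Names are illustrative, borrowed loosely from the bug.
const RECURSION_DEPTH_LIMIT: usize = 500;

#[derive(Debug, PartialEq)]
enum Dispatch {
    FinishedDirectly,  // the whole chain fit under the limit
    NeedsQueue(usize), // nodes left over when the limit was hit
}

fn traverse_chain(remaining: usize, depth: usize) -> Dispatch {
    if remaining == 0 {
        return Dispatch::FinishedDirectly;
    }
    if depth >= RECURSION_DEPTH_LIMIT {
        // Too deep: stop recursing. The real code would dispatch the
        // remaining nodes through the thread pool here, resetting the
        // effective stack depth to zero.
        return Dispatch::NeedsQueue(remaining);
    }
    traverse_chain(remaining - 1, depth + 1)
}

fn main() {
    // A short chain completes directly; a 10000-deep chain hits the cap
    // with 9500 nodes still to process.
    assert_eq!(traverse_chain(100, 0), Dispatch::FinishedDirectly);
    assert_eq!(traverse_chain(10_000, 0), Dispatch::NeedsQueue(9_500));
}
```

With the cap in place the native stack never grows past roughly the limit, regardless of DOM depth, which is the whole point of the fix.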
Component: DOM → CSS Parsing and Computation
Flags: needinfo?(jseward)
Summary: Stylo: possible stack overflow when processing very deep DOM → stylo: possible stack overflow when processing very deep DOM
Julian asked me to unhide this bug, pointing at comment 8.
Group: core-security
Assignee: nobody → jseward
Flags: needinfo?(jseward)
Bobby, how does that look? Without it, Stylo segfaults for trees more than about 1700 deep. With it, I can process trees up to at least 4500 deep. Deeper than that and my JS tree-generation routines give up due to too much recursion, but that's a different problem.

I am vaguely concerned that this assertion at the end of |traverse_nodes| might fail:

// If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
// directly.
debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call());

because the equivalence (if there ever was one, and if that is indeed what it asserts) "exactly one child == we're doing a tail call" no longer holds. I haven't tested with a debug build. Any insight on that?
Flags: needinfo?(bobbyholley)
Comment on attachment 8872651 [details] [diff] [review]
Limits recursion depth to 500, per comment 8

Review of attachment 8872651 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +226,5 @@
> impl DispatchMode {
>     fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +static MAX_RECURSION_DEPTH: u32 = 500;

Nit: I'd make this a usize (here and where it's passed as an arg), since that's the convention for counters, though I'd be curious if you have any particular insight on the tradeoffs.

@@ +232,4 @@
> #[inline]
> fn traverse_nodes<'a, 'scope, E, D>(nodes: NodeList<E::ConcreteNode>,
> +                                    mode_orig: DispatchMode,
> +                                    recursion_depth_orig: u32,

Nit: I would make both of these |mut| and nix the _orig business.

@@ +242,5 @@
>     D: DomTraversal<E>,
> {
>     debug_assert!(!nodes.is_empty());
>
> +    // Disable the tailcall optimisation after a given depth, so as to avoid

Nit: en-US spelling of optimization.

@@ +259,5 @@
>     } else {
>         // The caller isn't done yet. Append to the queue and return synchronously.
>         scope.spawn(move |scope| {
>             let nodes = nodes;
> +            top_down_dom(&nodes, recursion_depth,

This is wrong - we don't want to propagate the depth into scope.spawn() calls, because we'll unwind the stack before processing those.

@@ +274,5 @@
>         let boxed = chunk.iter().cloned().collect::<Vec<_>>().into_boxed_slice();
>         let traversal_data_copy = traversal_data.clone();
>         scope.spawn(move |scope| {
>             let b = boxed;
> +            top_down_dom(&*b, recursion_depth, root,

Same here.

@@ +283,5 @@
>     }
>
>     // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
>     // directly.
>     debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call());

You can fix this by doing the following:

(1) Rename MAX_RECURSION_DEPTH to RECURSION_DEPTH_LIMIT.
(2) Make the mode adjustment >= instead of >.
(3) Change this to:

debug_assert_eq!(first_chunk.is_some(),
                 mode.is_tail_call() || recursion_depth == RECURSION_DEPTH_LIMIT);
Attachment #8872651 - Flags: review-
Attachment #8872651 - Flags: feedback+
Flags: needinfo?(bobbyholley)
(Worth re-testing your testcase with those fixes, and also doing a try push with |-b do -p linux64-stylo -u all -t none| before landing).
Attached patch bug1368302-2.cset (obsolete) — Splinter Review
Addresses all comments in comment 12. Verified working up to depth 5500.

This sets a depth limit of 500. I also tried with 1000, but some testing with bloom-basic variants didn't show any change in the level of achieved parallelism (or CPU), so I stayed with 500 so as to be conservative.

Looks OK on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cec3f616bf699490e57ebb99b33b0a963548210

One detail: the recursion_depth incrementer at the start of traverse_nodes turned out more complex than I expected, due to the need to make it stick at exactly RECURSION_DEPTH_LIMIT so as to keep the previously discussed assertion happy. Not sure if there's a simpler way to do that.
Attachment #8872651 - Attachment is obsolete: true
Attachment #8872985 - Flags: review?(bobbyholley)
Comment on attachment 8872985 [details] [diff] [review]
bug1368302-2.cset

Review of attachment 8872985 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with those three changes.

::: servo/components/style/parallel.rs
@@ +226,5 @@
> impl DispatchMode {
>     fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +static RECURSION_DEPTH_LIMIT: usize = 500;

Make this const instead.

@@ +254,5 @@
> +    }
> +    else {
> +        debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT);
> +        mode = DispatchMode::NotTailCall;
> +    }

This is more complicated than it needs to be. Let's make this:

recursion_depth += 1;
debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT);
if recursion_depth == RECURSION_DEPTH_LIMIT {
    mode = DispatchMode::NotTailCall;
}

@@ +293,5 @@
>     // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
>     // directly.
> +    debug_assert_eq!(first_chunk.is_some(),
> +                     mode.is_tail_call()
> +                     || recursion_depth == RECURSION_DEPTH_LIMIT);

Looking at this again, you actually don't need to modify this assertion at all, because |mode| will already be fixed up by the time first_chunk is computed. Please revert this change.
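The incrementer suggested in the review above can be modelled as a small pure function. This is a hypothetical sketch, not the patch itself: `enter_traverse` plays the role of the first few lines of traverse_nodes, and the point is that mode flips to NotTailCall exactly when the depth reaches the limit, so the depth never needs to exceed it.

```rust
// Sketch of the reviewed incrementer: bump the depth on entry, and once
// it hits the limit force NotTailCall so the next dispatch goes through
// the work queue (which restarts the depth at zero) instead of recursing.
const RECURSION_DEPTH_LIMIT: usize = 500;

#[derive(Clone, Copy, Debug, PartialEq)]
enum DispatchMode {
    TailCall,
    NotTailCall,
}

fn enter_traverse(mut depth: usize, mut mode: DispatchMode) -> (usize, DispatchMode) {
    depth += 1;
    // Callers stop tail-calling once mode is NotTailCall, so in debug
    // builds the depth can never legitimately exceed the limit.
    debug_assert!(depth <= RECURSION_DEPTH_LIMIT);
    if depth == RECURSION_DEPTH_LIMIT {
        mode = DispatchMode::NotTailCall;
    }
    (depth, mode)
}

fn main() {
    // Simulate a long tail-call chain: the mode flips exactly at the limit.
    let (mut depth, mut mode) = (0, DispatchMode::TailCall);
    while mode == DispatchMode::TailCall {
        let next = enter_traverse(depth, mode);
        depth = next.0;
        mode = next.1;
    }
    assert_eq!(depth, RECURSION_DEPTH_LIMIT);
}
```

Because the mode is adjusted before first_chunk would be computed, the original assertion `first_chunk.is_some() == mode.is_tail_call()` keeps holding, which is why the review asks for the assertion change to be reverted.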
Attachment #8872985 - Flags: review?(bobbyholley) → review+
I think we may actually not need the patch in this bug, because I think I need to remove the tail call optimization entirely in bug 1365692.
Depends on: 1365692
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> I think we may actually not need the patch in this bug, because I think I
> need to remove the tail call optimization entirely in bug 1365692.

The tail call optimization is still there, but in reduced form. So I think we'll still want to land this on top of that bug once it lands.
Priority: -- → P2
This patch needs rebasing. Also worth re-measuring that we crash without the patch and don't crash with it.
Flags: needinfo?(jseward)
Attached patch bug1368302-3.cset (obsolete) — Splinter Review
Rebased, and takes into account comment 15, to the extent that it is still relevant.

I reduced the threshold from 500 to 200 in this version, having observed with GDB that the current parallel.rs will fail with a stack overflow at only 600 levels of recursion. So setting it to 200 seems like a conservative margin -- in particular given that we're assuming that limits that are OK on Linux also carry over to other platforms, particularly Windows. IIUC almost no real pages have a DOM that deep, so 200 will not have any perf impact on real pages.

I verified that this successfully handles a 5000-deep chain-tree DOM, which definitely segfaults without the patch.
Attachment #8872985 - Attachment is obsolete: true
Flags: needinfo?(jseward)
Attachment #8880300 - Flags: review?(bobbyholley)
Comment on attachment 8880300 [details] [diff] [review]
bug1368302-3.cset

Review of attachment 8880300 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +240,5 @@
> impl DispatchMode {
>     fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
> }
>
> +const RECURSION_DEPTH_LIMIT: usize = 200;

Please add a comment describing the measurements involved in setting this.

@@ +261,5 @@
>     // This is a tail call from the perspective of the caller. However, we only
>     // want to actually dispatch the job as a tail call if there's nothing left
>     // in our local queue. Otherwise we need to return to it to maintain proper
> +    // breadth-first ordering. We also need to take care to avoid stack
> +    // overflow due to excessive tail recursion. See bug 1368302.

Add a comment here indicating that this limit isn't observable to content, since we're still totally correct, we just skip the tail call optimization for one iteration.

@@ +270,5 @@
> +        && !pool.current_thread_has_pending_tasks().unwrap()
> +    } else {
> +        debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT);
> +        false
> +    };

Nit: I think we could clean this up a bit now, as follows:

debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT);
let may_dispatch_tail = mode.is_tail_call() &&
                        recursion_depth != RECURSION_DEPTH_LIMIT &&
                        !pool.current_thread_has_pending_tasks().unwrap();

@@ +277,5 @@
>     // case we can pass the SmallVec directly and avoid extra allocation.
>     if nodes.len() <= WORK_UNIT_MAX {
>         let work = nodes.iter().cloned().collect::<WorkUnit<E::ConcreteNode>>();
>         if may_dispatch_tail {
> +            top_down_dom(&work, recursion_depth, root,

And then just pass recursion_depth + 1 here.
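The cleaned-up predicate suggested in the review above can be sketched in isolation. This is an illustrative model, not the actual parallel.rs: rayon's pool query is replaced by a plain boolean parameter (the real code calls pool.current_thread_has_pending_tasks()), and the limit of 200 matches the final patch.

```rust
// Sketch of the final dispatch check: tail-call directly only when the
// caller asked for it, the recursion depth is below the limit, and the
// thread's local work queue is empty.
const RECURSION_DEPTH_LIMIT: usize = 200;

#[derive(Clone, Copy)]
enum DispatchMode {
    TailCall,
    NotTailCall,
}

impl DispatchMode {
    fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) }
}

fn may_dispatch_tail(mode: DispatchMode, recursion_depth: usize, queue_has_pending: bool) -> bool {
    debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT);
    mode.is_tail_call() &&
        recursion_depth != RECURSION_DEPTH_LIMIT &&
        !queue_has_pending
}

fn main() {
    assert!(may_dispatch_tail(DispatchMode::TailCall, 0, false));
    // At the limit, or with queued work, or when not a tail call,
    // the job goes through rayon instead, which unwinds the stack.
    assert!(!may_dispatch_tail(DispatchMode::TailCall, RECURSION_DEPTH_LIMIT, false));
    assert!(!may_dispatch_tail(DispatchMode::TailCall, 10, true));
    assert!(!may_dispatch_tail(DispatchMode::NotTailCall, 0, false));
}
```

Note the limit is not observable to content: skipping the tail call for one iteration only changes scheduling, not the traversal result.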
Attachment #8880300 - Flags: review?(bobbyholley) → review+
Just a reminder to land this. :-)
Blocks: 1376883
Attached patch Rebased, FTRSplinter Review
Attachment #8880300 - Attachment is obsolete: true
Priority: P2 → --
This was merged to m-i/m-c some time back:

changeset:   366942:8d137703e636
user:        Julian Seward <jseward@acm.org>
date:        Fri Jun 30 03:01:08 2017 -0700
summary:     servo: Merge #17563 - Bug 1368302 - stylo: possible stack overflow when processing very dee… (from julian-seward1:master); r=bholley

Can we close it now?
(In reply to Julian Seward [:jseward] from comment #23)
> This was merged to m-i/m-c some time back:
>
> changeset:   366942:8d137703e636
> user:        Julian Seward <jseward@acm.org>
> date:        Fri Jun 30 03:01:08 2017 -0700
> summary:     servo: Merge #17563 - Bug 1368302 - stylo: possible stack
> overflow when processing very dee… (from julian-seward1:master); r=bholley
>
> Can we close it now?

Yep, thanks!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED