Closed Bug 1368302 Opened 8 years ago Closed 8 years ago

stylo: possible stack overflow when processing very deep DOM

Tracking

()

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

firefox57

---

fixed

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(4 files, 3 obsolete files)

my-bloom-basic.html 8 years ago Julian Seward [:jseward] 739 bytes, text/plain		Details
my-util.js 8 years ago Julian Seward [:jseward] 4.17 KB, text/plain		Details
Crash backtrace 8 years ago Julian Seward [:jseward] 6.68 KB, text/plain		Details
Limits recursion depth to 500, per comment 8 8 years ago Julian Seward [:jseward] 6.86 KB, patch	bholley : review- bholley : feedback+	Details \| Diff \| Splinter Review
bug1368302-2.cset 8 years ago Julian Seward [:jseward] 7.21 KB, patch	bholley : review+	Details \| Diff \| Splinter Review
bug1368302-3.cset 8 years ago Julian Seward [:jseward] 6.67 KB, patch	bholley : review+	Details \| Diff \| Splinter Review
Rebased, FTR 8 years ago Julian Seward [:jseward] 7.01 KB, patch		Details \| Diff \| Splinter Review

Julian Seward [:jseward]

Assignee

Description

•

8 years ago

Whilst working with variants of the bloom-basic test for Stylo perf profiling, I noticed that it is easy to cause Firefox to segfault on DOMs with a depth of more than about 1500 elements. I don't know whether this is really a Stylo issue, but it seems like a possible DOS and so I am reporting it. Unfortunately the debuginfo for libxul.so is somehow not right (rustc problem?) and GDB won't show the fault point at all. Valgrind manages just one frame: nodes at depth 2000: 1 nodes at depth 2001: 0 total nodes: 10380 --27771-- sync signal handler: signal=11, si_code=2, EIP=0x142e8b48, eip=0x101c2a1bc7, from kernel --27771-- SIGSEGV: si_code=2 faultaddr=0x28300e68 tid=27 ESP=0x28300e68 seg=0x28300000-0x28300fff --27771-- delivering signal 11 (SIGSEGV):2 to thread 27 --27771-- push_signal_frame (thread 27): signal 11 ==27771== at 0x142E8B48: style::properties::StyleBuilder::mutate_box (/checkout/src/libcore/mem.rs:334) Segmentation fault (core dumped) GDB offers this: nodes at depth 1999: 1 nodes at depth 2000: 1 nodes at depth 2001: 0 total nodes: 10380 QQQ iteration 1 Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] Missing separate debuginfos, use: dnf debuginfo-install gtk3-3.22.12-2.fc25.x86_64 Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] (gdb) where Python Exception <class 'gdb.error'> Frame is invalid.: Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] STR: load my-bloom-basic.html (attached)

Julian Seward [:jseward]

Assignee

Comment 1

•

8 years ago

Attached file my-bloom-basic.html — Details

Julian Seward [:jseward]

Assignee

Comment 2

•

8 years ago

Attached file my-util.js — Details

Julian Seward [:jseward]

Assignee

Comment 3

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #0) > (gdb) where > Python Exception <class 'gdb.error'> Frame is invalid.: > Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea > [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] Nathan, have you seen anything like that before? This is built with "clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9 2017-04-24)" on Fedora 25.

Flags: needinfo?(nfroyd)

Julian Seward [:jseward]

Assignee

Comment 4

•

8 years ago

It's interesting that Valgrind can't unwind out of it, because AFAIK it doesn't care (at least for unwinding) about the DIEs at all, to the extent of not even trying to read them. That says to me that the stack might be trashed.

Nathan Froyd [:froydnj]

Comment 5

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #3) > (In reply to Julian Seward [:jseward] from comment #0) > > (gdb) where > > Python Exception <class 'gdb.error'> Frame is invalid.: > > Dwarf Error: Cannot find DIE at 0x2bce02d8 referenced from DIE at 0x2b7bc8ea > > [in module /space/MOZ/MC-STYLO/gcc-Og-nondebug-stylo/dist/bin/libxul.so] > > Nathan, have you seen anything like that before? This is built with > "clang version 3.9.0 (tags/RELEASE_390/final)" and "rustc 1.17.0 (56124baa9 > 2017-04-24)" > on Fedora 25. I haven't seen anything like this. It seems pretty unusual that there'd be invalid dwarf generated by clang, so perhaps this is confusion somewhere else?

Flags: needinfo?(nfroyd)

Julian Seward [:jseward]

Assignee

Comment 6

•

8 years ago

I used objcopy to remove the .debug_info section on the basis that that's where the DIEs that GDB doesn't like, came from: objcopy --remove-section=.debug_info libxul.so-ORIG libxul.so and like that I can at least get a stack trace. It does look like a stack overflow caused by recursion in style::parallel::traverse_nodes::h08dccb9d87aa4e5f. There are about 1750 such frames on the stack. See attachment.

Julian Seward [:jseward]

Assignee

Comment 7

•

8 years ago

Attached file Crash backtrace — Details

Bobby Holley (:bholley)

Comment 8

•

8 years ago

In general, we don't treat stack overflows as critical security issues, since they're just DoS. Also, stylo isn't shipping to users yet. I can't seem to unhide this bug though. Julian, can you do it? I think what we want here is just to detect the recursion depth in the tail-call stuff in parallel.rs and eliminate the tail-call optimization if we get too big (say, 500?). Julian, want to whip up a patch and retest?

Component: DOM → CSS Parsing and Computation

Flags: needinfo?(jseward)

Summary: Stylo: possible stack overflow when processing very deep DOM → stylo: possible stack overflow when processing very deep DOM

Frederik Braun [:freddy]

Comment 9

•

8 years ago

Julian asked me to unhide this bug pointing at comment 8.

Group: core-security

Julian Seward [:jseward]

Assignee

Comment 10

•

8 years ago

Attached patch Limits recursion depth to 500, per comment 8 (obsolete) — Details — Splinter Review

Assignee: nobody → jseward

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Assignee

Comment 11

•

8 years ago

Bobby, how does that look? Without it, Stylo segfaults for trees more than about 1700 deep. With it, I can process trees up to at least 4500 deep. Deeper than that and my JS tree-generation routines give up due to too much recursion, but that's a different problem. I am vaguely concerned that this assertion at the end of |traverse_nodes| might fail // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom // directly. debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call()); because the equivalence (if there ever was one, and if that is indeed asserting it) "exactly one child == we're doing a tail call" now no longer holds. I haven't tested with a debug build. Any insight on that?

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Comment 12

•

8 years ago

Comment on attachment 8872651 [details] [diff] [review] Limits recursion depth to 500, per comment 8 Review of attachment 8872651 [details] [diff] [review]: ----------------------------------------------------------------- ::: servo/components/style/parallel.rs @@ +226,5 @@ > impl DispatchMode { > fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) } > } > > +static MAX_RECURSION_DEPTH: u32 = 500; Nit: I'd make this a usize (here and where it's passed as an arg), since that's the convention for counters, though I'd be curious if you have any particular insight on the tradeoffs. @@ +232,4 @@ > #[inline] > fn traverse_nodes<'a, 'scope, E, D>(nodes: NodeList<E::ConcreteNode>, > + mode_orig: DispatchMode, > + recursion_depth_orig: u32, Nit: I would make both of these |mut| and nix the _orig business. @@ +242,5 @@ > D: DomTraversal<E>, > { > debug_assert!(!nodes.is_empty()); > > + // Disable the tailcall optimisation after a given depth, so as to avoid Nit: en-US spelling of optimization. @@ +259,5 @@ > } else { > // The caller isn't done yet. Append to the queue and return synchronously. > scope.spawn(move |scope| { > let nodes = nodes; > + top_down_dom(&nodes, recursion_depth, This is wrong - we don't want to propagate the depth into scope.spawn() calls, because we'll unwind the stack before processing those. @@ +274,5 @@ > let boxed = chunk.iter().cloned().collect::<Vec<_>>().into_boxed_slice(); > let traversal_data_copy = traversal_data.clone(); > scope.spawn(move |scope| { > let b = boxed; > + top_down_dom(&*b, recursion_depth, root, Same here. @@ +283,5 @@ > } > > // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom > // directly. > debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call()); You can fix this by doing the following: (1) Rename MAX_RECURSION_DEPTH to RECURSION_DEPTH_LIMIT (2) Make the mode adjustment >= instead of >. (3) Change this to debug_assert_eq!(first_chunk.is_some(), mode.is_tail_call() || recursion_depth == RECURSION_DEPTH_LIMIT);

Attachment #8872651 - Flags: review-

Attachment #8872651 - Flags: feedback+

Bobby Holley (:bholley)

Updated

•

8 years ago

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Comment 13

•

8 years ago

(Worth re-testing your testcase with those fixes, and also doing a try push with |-b do -p linux64-stylo -u all -t none| before landing).

Julian Seward [:jseward]

Assignee

Comment 14

•

8 years ago

Attached patch bug1368302-2.cset (obsolete) — Details — Splinter Review

Addresses all comments in comment 12. Verified working up to depth 5500. This sets a depth limit of 500. I also tried with 1000 but some testing with bloom-basic variants didn't show any change to the level of achieved parallelism (nor CPU) so I stayed with 500 so as to be conservative. Looks OK on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cec3f616bf699490e57ebb99b33b0a963548210 One detail: the recursion_depth incrementer at the start of traverse_nodes turned out more complex than I expected, due to the need to cause it to stick at exactly RECURSION_DEPTH_LIMIT so as to make the previously discussed assertion happy. Not sure if there's a simpler way to do that.

Attachment #8872651 - Attachment is obsolete: true

Attachment #8872985 - Flags: review?(bobbyholley)

Bobby Holley (:bholley)

Comment 15

•

8 years ago

Comment on attachment 8872985 [details] [diff] [review] bug1368302-2.cset Review of attachment 8872985 [details] [diff] [review]: ----------------------------------------------------------------- r=me with those three changes. ::: servo/components/style/parallel.rs @@ +226,5 @@ > impl DispatchMode { > fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) } > } > > +static RECURSION_DEPTH_LIMIT: usize = 500; Make this const instead. @@ +254,5 @@ > + } > + else { > + debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT); > + mode = DispatchMode::NotTailCall; > + } This is more complicated than it needs to be. Let's make this: recursion_depth += 1; debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT); if recursion_depth == RECURSION_DEPTH_LIMIT { mode = DispatchMode::NotTailCall; } @@ +293,5 @@ > // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom > // directly. > + debug_assert_eq!(first_chunk.is_some(), > + mode.is_tail_call() > + || recursion_depth == RECURSION_DEPTH_LIMIT); Looking at this again, you actually don't need to modify this assertion at all, because |mode| will already be fixed up by the time first_chunk is computed. Please revert this change.

Attachment #8872985 - Flags: review?(bobbyholley) → review+

Bobby Holley (:bholley)

Comment 16

•

8 years ago

I think we may actually not need the patch in this bug, because I think I need to remove the tail call optimization entirely in bug 1365692.

Bobby Holley (:bholley)

Updated

•

8 years ago

Depends on: 1365692

Bobby Holley (:bholley)

Comment 17

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16) > I think we may actually not need the patch in this bug, because I think I > need to remove the tail call optimization entirely in bug 1365692. The tail call optimization is still there, but in reduced form. So I think we'll still want to land this on top of that bug once it lands.

Priority: -- → P2

Bobby Holley (:bholley)

Comment 18

•

8 years ago

This patch needs rebasing. Also worth re-measuring that we crash without the patch and don't crash with it.

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Assignee

Comment 19

•

8 years ago

Attached patch bug1368302-3.cset (obsolete) — Details — Splinter Review

Rebased, and takes into account comment 15, to the extent that it is still relevant. I reduced the threshold from 500 to 200 in this version, having observed with GDB that the current parallel.rs will fail with a stack overflow at only 600 levels of recursion. So setting it to 200 seems like a conservative margin -- in particular given that we're assuming that limits that are OK on Linux also carry over to other platforms, particularly Windows. IIUC almost no real pages will have a DOM that deep, so 200 will not have any perf impact for real pages. I verified that this successfully handles a 5000-deep chain tree DOM, that definitely segfaults without the patch.

Attachment #8872985 - Attachment is obsolete: true

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Assignee

Updated

•

8 years ago

Attachment #8880300 - Flags: review?(bobbyholley)

Bobby Holley (:bholley)

Comment 20

•

8 years ago

Comment on attachment 8880300 [details] [diff] [review] bug1368302-3.cset Review of attachment 8880300 [details] [diff] [review]: ----------------------------------------------------------------- ::: servo/components/style/parallel.rs @@ +240,5 @@ > impl DispatchMode { > fn is_tail_call(&self) -> bool { matches!(*self, DispatchMode::TailCall) } > } > > +const RECURSION_DEPTH_LIMIT: usize = 200; Please add a describing the measurements involved in setting this. @@ +261,5 @@ > // This is a tail call from the perspective of the caller. However, we only > // want to actually dispatch the job as a tail call if there's nothing left > // in our local queue. Otherwise we need to return to it to maintain proper > + // breadth-first ordering. We also need to take care to avoid stack > + // overflow due to excessive tail recursion. See bug 1368302. Add a comment here indicating that this limit isn't observable to content, since we're still totally correct, we just skip the tail call optimization for one iteration. @@ +270,5 @@ > + && !pool.current_thread_has_pending_tasks().unwrap() > + } else { > + debug_assert!(recursion_depth == RECURSION_DEPTH_LIMIT); > + false > + }; Nit: I think we could clean this up now a bit as follows: debug_assert!(recursion_depth <= RECURSION_DEPTH_LIMIT); let may_dispatch_tail = mode.is_tail_call() && recursion_depth != RECURSION_DEPTH_LIMIT && !pool.current_thread_has_pending_tasks().unwrap(); @@ +277,5 @@ > // case we can pass the SmallVec directly and avoid extra allocation. > if nodes.len() <= WORK_UNIT_MAX { > let work = nodes.iter().cloned().collect::<WorkUnit<E::ConcreteNode>>(); > if may_dispatch_tail { > + top_down_dom(&work, recursion_depth, root, And then just pass recursion_depth + 1 here.

Attachment #8880300 - Flags: review?(bobbyholley) → review+

Bobby Holley (:bholley)

Comment 21

•

8 years ago

Just a reminder to land this. :-)

Bobby Holley (:bholley)

Updated

•

8 years ago

Blocks: 1376883

Julian Seward [:jseward]

Assignee

Comment 22

•

8 years ago

Attached patch Rebased, FTR — Details — Splinter Review

Attachment #8880300 - Attachment is obsolete: true

Bobby Holley (:bholley)

Updated

•

8 years ago

Priority: P2 → --

Julian Seward [:jseward]

Assignee

Comment 23

•

8 years ago

This was merged to m-i/m-c some time back: changeset: 366942:8d137703e636 user: Julian Seward <jseward@acm.org> date: Fri Jun 30 03:01:08 2017 -0700 summary: servo: Merge #17563 - Bug 1368302 - stylo: possible stack overflow when processing very dee… (from julian-seward1:master); r=bholley Can we close it now?

Bobby Holley (:bholley)

Comment 24

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #23) > This was merged to m-i/m-c some time back: > > changeset: 366942:8d137703e636 > user: Julian Seward <jseward@acm.org> > date: Fri Jun 30 03:01:08 2017 -0700 > summary: servo: Merge #17563 - Bug 1368302 - stylo: possible stack > overflow when processing very dee… (from julian-seward1:master); r=bholley > > Can we close it now? Yep, thanks!

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox57: --- → fixed

You need to log in before you can comment on or make changes to this bug.