Closed Bug 1370107 Opened 3 years ago Closed 3 years ago

stylo: shrink Rule and store all heap-allocated selector data inline

Categories

(Core :: CSS Parsing and Computation, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

References

(Blocks 1 open bug)

Details

Attachments

(13 files, 3 obsolete files)

1.38 KB, patch
emilio
: review+
manishearth
: review+
Details | Diff | Splinter Review
7.40 KB, patch
emilio
: review+
Details | Diff | Splinter Review
36.25 KB, patch
emilio
: review+
Details | Diff | Splinter Review
31.18 KB, patch
emilio
: review+
Details | Diff | Splinter Review
22.90 KB, patch
emilio
: review+
Details | Diff | Splinter Review
5.47 KB, patch
emilio
: review+
Details | Diff | Splinter Review
10.68 KB, patch
emilio
: review+
Details | Diff | Splinter Review
20.45 KB, patch
emilio
: review+
Details | Diff | Splinter Review
1.87 KB, patch
emilio
: review+
Details | Diff | Splinter Review
5.08 KB, patch
emilio
: review+
Details | Diff | Splinter Review
6.42 KB, patch
manishearth
: review+
Nika
: review+
Details | Diff | Splinter Review
9.18 KB, patch
manishearth
: review+
Nika
: review+
Details | Diff | Splinter Review
6.90 KB, patch
bholley
: review+
Details | Diff | Splinter Review
Chrome still beats sequential stylo on some configurations of bloom-basic [1]. This stumped me for a while, until I realized that our inline Rule size is twice the size of Chrome's on 64-bit platforms (64 bytes instead of 32). This is due to the wide layout of Arcslice (3 words instead of 1), as well as storing specificity_and_flags in the non-heap-allocated section of the selector.

Another suboptimal thing is that we currently store an Arc<Box<[Component]>>, which costs us an extra heap allocation just because the type system doesn't allow passing a slice to Arc::new. We can fix this with some deep magic on our custom Arc.

I have working patches for all of this. The two things I have left to do are:
* Add some more magic to Arc to avoid storing the fat Arc pointers inline in Rule (basically teaching Arc how to fetch the slice size from the heap-allocated section). This will take us from 48 to 32 bytes, putting us at parity with Chrome.
* Fix up the unit tests.

I'll work on the above items shortly, but there's enough going on in these patches that I wanted to get them into emilio's queue sooner rather than later. :-)

[1] Specifically, with the selector count increased and the element count decreased relative to what's currently in the repo.
MozReview-Commit-ID: J9y9cl0rpxX
Attachment #8874250 - Flags: review?(emilio+bugs)
In the upcoming patches we'll eliminate the existing Selector and SelectorInner,
and ComplexSelector will become an Arc-managed |Selector|. So the tuple-indexed
setup is just temporary.

MozReview-Commit-ID: 4CcOYeHGUED
Attachment #8874251 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: 5mipXnjgSED
Attachment #8874252 - Flags: review?(emilio+bugs)
The refcounting is still internal. We'll fix that up next.

MozReview-Commit-ID: CTxZNaR3Qgj
Attachment #8874253 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: KLqmdRKygO0
Attachment #8874254 - Flags: review?(emilio+bugs)
This is important to make the selector immutable, which needs to happen when we stick it in an Arc.

MozReview-Commit-ID: BaMbOEbYC3D
Attachment #8874255 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: C3btN8Jw9sJ
Attachment #8874256 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: H15cG5RBtZH
Attachment #8874257 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: hq0jYrx8Sg
Attachment #8874258 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: DszMC031Xlj
Attachment #8874259 - Flags: review?(emilio+bugs)
Attachment #8874250 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874251 [details] [diff] [review]
Part 2 - Hoist specificity_and_flags into ComplexSelector (soon to be Selector). v1

Review of attachment 8874251 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/selectors/parser.rs
@@ +362,5 @@
>  /// We store selectors internally left-to-right (in parsing order), but the
>  /// canonical iteration order is right-to-left (selector matching order). The
>  /// iterators abstract over these details.
>  #[derive(Clone, Eq, PartialEq)]
> +pub struct ComplexSelector<Impl: SelectorImpl>(ArcSlice<Component<Impl>>, u32);

I'd prefer if these had names, but I suppose this is going to change substantially, so fine.
Attachment #8874251 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874252 [details] [diff] [review]
Part 3 - Move the ancestor hashes out of Selector. v1

Review of attachment 8874252 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/stylist.rs
@@ +443,5 @@
>              match *rule {
>                  CssRule::Style(ref locked) => {
>                      let style_rule = locked.read_with(&guard);
>                      self.num_declarations += style_rule.block.read_with(&guard).len();
> +                    for selector_and_hashes in &style_rule.selectors.0 {

Are you sure you want to keep the hashes around in the style rule itself? It seems like a waste of memory, what's the reasoning for that?
Attachment #8874253 - Flags: review?(emilio+bugs) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #13)
> Comment on attachment 8874252 [details] [diff] [review]
> Part 3 - Move the ancestor hashes out of Selector. v1
> 
> Review of attachment 8874252 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/stylist.rs
> @@ +443,5 @@
> >              match *rule {
> >                  CssRule::Style(ref locked) => {
> >                      let style_rule = locked.read_with(&guard);
> >                      self.num_declarations += style_rule.block.read_with(&guard).len();
> > +                    for selector_and_hashes in &style_rule.selectors.0 {
> 
> Are you sure you want to keep the hashes around in the style rule itself? It
> seems like a waste of memory, what's the reasoning for that?

That's the existing behavior (since currently the hashes are stored within Selector). It's basically a space/time tradeoff between the size of the parsed selectors and computation time for stylist::update. We could consider changing it, but I'd rather keep that orthogonal to this bug.
Attachment #8874252 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874254 [details] [diff] [review]
Part 5 - Stop slicing selectors when noting dependencies, and match with an offset instead. v1

Review of attachment 8874254 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/selectors/matching.rs
@@ +337,5 @@
>  
>  /// Matches a selector, fast-rejecting against a bloom filter.
>  #[inline(always)]
>  pub fn matches_selector<E, F>(selector: &Selector<E::Impl>,
> +                              offset: usize,

Ditto here, I think this would be cleaner without the offset.

@@ +358,5 @@
>  }
>  
>  /// Matches a complex selector.
>  pub fn matches_complex_selector<E, F>(complex_selector: &Selector<E::Impl>,
> +                                      offset: usize,

Could we instead always pass the iterator here? I don't see a reason why the offset would be needed.

::: servo/components/selectors/parser.rs
@@ +362,5 @@
>  
> +    pub fn iter_from(&self, offset: usize) -> SelectorIter<Impl> {
> +        // Note: selectors are stored left-to-right but logical order is right-to-left.
> +        let slice = self.0.as_ref();
> +        let iter = slice[0..(slice.len() - offset)].iter().rev();

nit: No need to specify the zero: slice[..(slice.len() - offset)].iter().rev()

::: servo/components/style/selector_map.rs
@@ +422,1 @@
>                 -> Option<Atom> {

nit: preexisting, but if while you're here you fix up the alignment, it'd be rad :)
Attachment #8874254 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874255 [details] [diff] [review]
Part 6 - Move around specificity computation so that we know it by the time we mint the Selector. v1

Review of attachment 8874255 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/selectors/parser.rs
@@ +913,5 @@
>  /// selector : simple_selector_sequence [ combinator simple_selector_sequence ]* ;
>  ///
>  /// `Err` means invalid selector.
>  fn parse_selector<P, Impl>(parser: &P, input: &mut CssParser) -> Result<Selector<Impl>, ()>
>      where P: Parser<Impl=Impl>, Impl: SelectorImpl

This function is quite useless now, could you nuke it?

@@ +985,5 @@
>          }
>          sequence.push(Component::Combinator(combinator));
>      }
>  
> +

nit: Extra newline.
Attachment #8874255 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874256 [details] [diff] [review]
Part 7 - Move stylearc into a separate crate. v1

Review of attachment 8874256 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/lib.rs
@@ +75,2 @@
>  #[cfg(feature = "servo")] #[macro_use] extern crate serde_derive;
> +pub extern crate servo_arc;

Does this need to be pub?
Attachment #8874256 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +182,5 @@
> +        let ptr: *mut ArcInner<HeaderSlice<H, T>>;
> +        unsafe {
> +            // Allocate the buffer. We use Vec because the underlying allocation
> +            // machinery isn't available in stable Rust.
> +            let mut vec = Vec::<u8>::with_capacity(size);

can you guarantee this buffer is going to be properly aligned? I _think_ it will, but I'm not 100% sure about whether rust does something smart there...

@@ +193,5 @@
> +            // only to the number of elements in the dynamically-sized portion of
> +            // the type, so the value will be the same whether it points to a [T]
> +            // or something else with a [T] as its last member.
> +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, T>>;

Are you sure this doesn't break any aliasing rules? This is somewhat scary, and you're pointing to two different types with the same pointer... Could you check with some rust people about this?

And in any case... You don't seem to use `fake_slice` anywhere else, why can't you just use buffer as *mut ArcInner<HeaderSlice<H, T>>?

@@ +209,5 @@
> +            debug_assert!(current as *mut u8 == buffer.offset(size as isize));
> +        }
> +
> +        // Return the fat Arc.
> +        debug_assert!(size_of::<Self>() == size_of::<usize>() * 2, "The Arc will be fat");

nit: debug_assert_eq?
Comment on attachment 8874258 [details] [diff] [review]
Part 9 - Use dynamically-sized Arcs for Selector. v1

Review of attachment 8874258 [details] [diff] [review]:
-----------------------------------------------------------------

Assuming the previous patch passes the language lawyer tests, r=me on this one.

::: servo/components/selectors/parser.rs
@@ +323,5 @@
>  /// We store selectors internally left-to-right (in parsing order), but the
>  /// canonical iteration order is right-to-left (selector matching order). The
>  /// iterators abstract over these details.
>  #[derive(Clone, Eq, PartialEq)]
> +pub struct Selector<Impl: SelectorImpl>(Arc<HeaderSlice<SpecificityAndFlags, Component<Impl>>>);

nit: I'd prefer Selector to be a proper struct rather than a newtype, but no big deal.

@@ +374,5 @@
>      }
>  
>      pub fn iter_from(&self, offset: usize) -> SelectorIter<Impl> {
>          // Note: selectors are stored left-to-right but logical order is right-to-left.
> +        let iter = self.0.slice[0..(self.0.slice.len() - offset)].iter().rev();

nit: as in the previous one, you can remove the zero here.
Attachment #8874258 - Flags: review?(emilio+bugs) → review+
Attachment #8874259 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +193,5 @@
> +            // only to the number of elements in the dynamically-sized portion of
> +            // the type, so the value will be the same whether it points to a [T]
> +            // or something else with a [T] as its last member.
> +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, T>>;

Oh, never mind, I see that's the key for the fat pointer stuff... Still worth checking the magic with some rust people.
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on the general approach. Please do check with someone to verify you're not introducing UB here.
Attachment #8874257 - Flags: review?(emilio+bugs) → feedback+
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

Would like eddyb or niko to have a look at this code too.

::: servo/components/servo_arc/lib.rs
@@ +182,5 @@
> +        let ptr: *mut ArcInner<HeaderSlice<H, T>>;
> +        unsafe {
> +            // Allocate the buffer. We use Vec because the underlying allocation
> +            // machinery isn't available in stable Rust.
> +            let mut vec = Vec::<u8>::with_capacity(size);

This is my concern too.

I'm also not sure what guarantees we have about the headerslice struct's layout.

One way to circumvent this is to never actually access the allocation via the fields of HeaderSlice, instead use HeaderSlice as a crutch for the DST to work, and do everything else via pointer arithmetic. Won't be great, though.

@@ +193,5 @@
> +            // only to the number of elements in the dynamically-sized portion of
> +            // the type, so the value will be the same whether it points to a [T]
> +            // or something else with a [T] as its last member.
> +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, T>>;

This shouldn't break aliasing rules.
Attachment #8874257 - Flags: review+
Niko, could you have a look at attachment 8874257 [details] [diff] [review]?
Flags: needinfo?(nmatsakis)
Actually, another way to not have to deal with padding is to do this following bit of dark magic:

let fake_slice = slice::from_raw_parts(some *const T, len);
let fake_header_slice = fake_slice as *const [T] as *const HeaderSlice<H, T> as &HeaderSlice<_,_>;
let size = mem::size_of_val(fake_header_slice); // returns size of the DST
let vec: Vec<u8> = Vec::with_capacity(size);

Then you can set_len it, directly convert the vec to an uninitialized Box<HeaderSlice>, and manually set the fields, so you don't have to worry about any padding that exists.

I'm not sure if this is UB (writing to uninit is not, IIRC, but the fake slice stuff is iffy). But it would solve the padding problem.
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +169,5 @@
> +        use ::std::mem::size_of;
> +        assert!(size_of::<T>() != 0, "Need to think about ZST");
> +
> +        // Compute all the sizes for the allocation.
> +        let num_items = items.len();

I think we should store the refcount within the header. Basically, internally store a `HeaderSlice<(Refcount, H), T>`. Which you can call DSTArcInner or something. Right now we're using a `*const HeaderSlice<H, T>`, which is not actually what's being stored.
Flags: needinfo?(michael)
Attached patch Part 11 - Implement ThinArc. v1 (obsolete) — Splinter Review
MozReview-Commit-ID: LB3bs9ed2WC
Attachment #8874301 - Flags: review?(manishearth)
Attachment #8874301 - Flags: feedback?(emilio+bugs)
MozReview-Commit-ID: Axvq0rbqA7Y
Attachment #8874302 - Flags: review?(emilio+bugs)
Comment on attachment 8874301 [details] [diff] [review]
Part 11 - Implement ThinArc. v1

Review of attachment 8874301 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +541,5 @@
> +            ptr: thin_to_thick(self.ptr)
> +        };
> +
> +        // Expose the transient Arc to the callback, which may clone it if it wants.
> +        let result = f(&transient);

If the function panics, the Arc will run a destructor. Are we okay with that?

If not we should probably place it inside a NoDrop. NoDrop has a slight overhead, though, except on nightly (where it uses unions, which are stable on nightly but have to ride the trains)

@@ +570,5 @@
> +        let _ = Arc::from_thin(ThinArc { ptr: self.ptr });
> +    }
> +}
> +
> +impl<H, T> Arc<HeaderSlice<HeaderWithLength<H>, T>> {

We don't actually end up using the length of the fat pointer -- is there a point of using a fat ptr in the first place? Or just because we want the Arc API?

@@ +578,5 @@
> +        assert!(a.header.length == a.slice.len(),
> +                "Length needs to be correct for ThinArc to work");
> +        let fat_ptr: *mut ArcInner<HeaderSlice<HeaderWithLength<H>, T>> = a.ptr;
> +        mem::forget(a);
> +        let thin_ptr: *mut atomic::AtomicUsize = unsafe { &mut (*fat_ptr).count };

why .count? That field could be reordered.
Attachment #8874301 - Flags: review?(manishearth) → review+
Comment on attachment 8874302 [details] [diff] [review]
Part 12 - Use ThinArc in Selector. v1

Review of attachment 8874302 [details] [diff] [review]:
-----------------------------------------------------------------

Can you elaborate in why is this needed, and what's the effective difference between ThinArc<H, T> and Arc<HeaderSlice<H, T>>?

(Sorry, got an exam today, so I don't have all the time to dig)

If you could write it in the commit message, or in the docs, that'd be even better :)
Comment on attachment 8874257 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v1

Review of attachment 8874257 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +173,5 @@
> +        let num_items = items.len();
> +        let refcount_size = size_of::<atomic::AtomicUsize>();
> +        let header_size = size_of::<H>();
> +        let single_item_size = size_of::<T>();
> +        let padding = size_of::<(H, T)>() - header_size - single_item_size;

I don't think it's safe to assert that `size_of::<(H, T)>()` == `size_of::<HeaderSlice<H, T>>()` when there is one element. `(H, T)` is `repr(rust)` and allowed to do crazy stuff for layout, and/or rearrange fields. I am not sure how that would happen right now but theoretically it could at least.

In contrast `HeaderSlice` is required to keep the `slice` field at the end, due to it being dynamically sized.

If you instead had:
```
struct HeaderSlice<H, T: ?Sized> {
    pub header: H,
    pub slice: T,
}
```

and then instead asked for `size_of::<HeaderSlice<H, [T; 1]>>` (and produced <H, [T]> instead of <H, T> from `from_header_and_iterator`), then I think the layout would be guaranteed to match due to Coercion.

It's probably fine, and I doubt they ever won't be the same, but.

@@ +174,5 @@
> +        let refcount_size = size_of::<atomic::AtomicUsize>();
> +        let header_size = size_of::<H>();
> +        let single_item_size = size_of::<T>();
> +        let padding = size_of::<(H, T)>() - header_size - single_item_size;
> +        let items_size = single_item_size * num_items;

This can overflow, ideally we should do checked math here.

@@ +175,5 @@
> +        let header_size = size_of::<H>();
> +        let single_item_size = size_of::<T>();
> +        let padding = size_of::<(H, T)>() - header_size - single_item_size;
> +        let items_size = single_item_size * num_items;
> +        let size = refcount_size + header_size + padding + items_size;

This expression requires that there is no padding between the refcount, and the header, which is correct most of the time on most platforms. 

Perhaps document this assumption, and assert that the alignment of `H` is less than or equal to the size and align of `AtomicUsize`?

@@ +182,5 @@
> +        let ptr: *mut ArcInner<HeaderSlice<H, T>>;
> +        unsafe {
> +            // Allocate the buffer. We use Vec because the underlying allocation
> +            // machinery isn't available in stable Rust.
> +            let mut vec = Vec::<u8>::with_capacity(size);

I don't know if this is intended to be run in both servo and gecko, or just in gecko.

If we're just running in gecko, every allocation which we make goes through `mozjemalloc`, meaning that we actually have 2 options here:
1) We directly call into mozjemalloc, and get the memory, or
2) We use this vec strategy, and our knowledge of what alignment that mozjemalloc gives to all allocations (I _think_ it's 16, but glandium would know better).

If we're being run in servo, then no, we aren't guaranteed the alignment is correct from my understanding. As our struct's final alignment will probably be `align_of::<usize>` due to the `AtomicUsize` at the beginning (we would want to assert this), we could divide the backing buffer by `usize` elements, and round up.

@@ +184,5 @@
> +            // Allocate the buffer. We use Vec because the underlying allocation
> +            // machinery isn't available in stable Rust.
> +            let mut vec = Vec::<u8>::with_capacity(size);
> +            vec.set_len(size);
> +            let buffer: *mut u8 = &mut (*Box::into_raw(vec.into_boxed_slice()))[0];

Why this indexing and &mut-ing dance? You could just write:

    Box::into_raw(box.into_boxed_slice()) as *mut u8

As wide *mut [u8] pointers can be decomposed into thin *mut u8 pointers with `as`.

@@ +193,5 @@
> +            // only to the number of elements in the dynamically-sized portion of
> +            // the type, so the value will be the same whether it points to a [T]
> +            // or something else with a [T] as its last member.
> +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, T>>;

I _think_ this should be OK. The fact that doing casts from *mut [T] to *mut Struct<[T]> is allowed surprised me a bit, but we certainly won't be able to remove it if/when we decide to change DSTs, so I think we can rely on it here.

@@ +199,5 @@
> +            // Write the data.
> +            ptr::write(&mut ((*ptr).count), atomic::AtomicUsize::new(1));
> +            ptr::write(&mut ((*ptr).data.header), header);
> +            let mut current: *mut T = &mut (*ptr).data.slice[0];
> +            for item in items {

ExactSizeIterator doesn't guarantee that there are actually `len` items in the iterator. You should do bounds checks in this loop, for both the iterator producing too many, and too few, items.

There's an unstable `TrustedLen` trait which can be used to actually check that the length is correct.

@@ +209,5 @@
> +            debug_assert!(current as *mut u8 == buffer.offset(size as isize));
> +        }
> +
> +        // Return the fat Arc.
> +        debug_assert!(size_of::<Self>() == size_of::<usize>() * 2, "The Arc will be fat");

nit: Why not use `assert_eq!()` here? `size_of` is known at compile time, and thus this branch will be completely optimized out in optimized builds anyway.
Attachment #8874257 - Flags: feedback+
Flags: needinfo?(michael)
Comment on attachment 8874301 [details] [diff] [review]
Part 11 - Implement ThinArc. v1

Review of attachment 8874301 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +508,5 @@
> +    }
> +}
> +
> +pub struct ThinArc<H, T> {
> +    ptr: *mut ArcInner<(HeaderWithLength<H>, [T; 1])>,

Again from earlier I don't know if (H, T) has the same layout as HeaderSlice. I think that making HeaderSlice take a `T: ?Sized` and using either `[T; 1]` or `[T]` is probably a better bet and simplifies a bunch of code. Both in this patch and the other one I looked at.

@@ +521,5 @@
> +fn thin_to_thick<H, T>(thin: *mut ArcInner<(HeaderWithLength<H>, [T; 1])>)
> +    -> *mut ArcInner<HeaderSlice<HeaderWithLength<H>, T>>
> +{
> +    let len = unsafe { (*thin).data.0.length };
> +    let fake_slice: &mut [T] = unsafe {

You should be able to just write `fake_slice: *mut [T]` and avoid one of the casts below I think.

@@ +541,5 @@
> +            ptr: thin_to_thick(self.ptr)
> +        };
> +
> +        // Expose the transient Arc to the callback, which may clone it if it wants.
> +        let result = f(&transient);

In gecko we don't have to deal with exception safety, but if this is also running in Servo, we do need to worry about that.

NoDrop is definitely the most performant option right now IMO. I wouldn't be surprised if llvm got rid of the overhead anyway when the value is stored on the stack.

@@ +570,5 @@
> +        let _ = Arc::from_thin(ThinArc { ptr: self.ptr });
> +    }
> +}
> +
> +impl<H, T> Arc<HeaderSlice<HeaderWithLength<H>, T>> {

I think the reasoning behind this is to use the code from part 8 to perform the allocation to avoid unnecessary code duplication.
Attachment #8874301 - Flags: feedback+
> Can you elaborate in why is this needed, and what's the effective difference
> between ThinArc<H, T> and Arc<HeaderSlice<H, T>>?

The difference is `size_of::<ThinArc<H, T>>() == size_of::<usize>()` while `size_of::<Arc<HeaderSlice<H, T>> == size_of::<usize>() * 2`. Other than that they are basically the same.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #15)
> Comment on attachment 8874254 [details] [diff] [review]
> Part 5 - Stop slicing selectors when noting dependencies, and match with an
> offset instead. v1
> 
> Review of attachment 8874254 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/selectors/matching.rs
> @@ +337,5 @@
> >  
> >  /// Matches a selector, fast-rejecting against a bloom filter.
> >  #[inline(always)]
> >  pub fn matches_selector<E, F>(selector: &Selector<E::Impl>,
> > +                              offset: usize,
> 
> Ditto here, I think this would be cleaner without the offset.
> 
> @@ +358,5 @@
> >  }
> >  
> >  /// Matches a complex selector.
> >  pub fn matches_complex_selector<E, F>(complex_selector: &Selector<E::Impl>,
> > +                                      offset: usize,
> 
> Could we instead always pass the iterator here? I don't see a reason why the
> offset would be needed.

The reason is that I wanted to avoid calling .iter() unconditionally. I'm not 100% confident that LLVM will always optimize out the call in the case where we fast-reject, and calling .iter() (with my eventual patches in this bug) will require dereferencing the selector pointer to get the length, which will result in a cache miss. So I want to play it safe, at least for now.

> 
> ::: servo/components/selectors/parser.rs
> @@ +362,5 @@
> >  
> > +    pub fn iter_from(&self, offset: usize) -> SelectorIter<Impl> {
> > +        // Note: selectors are stored left-to-right but logical order is right-to-left.
> > +        let slice = self.0.as_ref();
> > +        let iter = slice[0..(slice.len() - offset)].iter().rev();
> 
> nit: No need to specify the zero: slice[..(slice.len() -
> offset)].iter().rev()

Fixed.

> 
> ::: servo/components/style/selector_map.rs
> @@ +422,1 @@
> >                 -> Option<Atom> {
> 
> nit: preexisting, but if while you're here you fix up the alignment, it'd be
> rad :)

Sure. :-)
Comment on attachment 8874302 [details] [diff] [review]
Part 12 - Use ThinArc in Selector. v1

Review of attachment 8874302 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, of course I missed the size_of.rs change, which is why I was confused :)

Thanks for pointing that out Michael.
Attachment #8874302 - Flags: review?(emilio+bugs) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #16)
> Comment on attachment 8874255 [details] [diff] [review]
> Part 6 - Move around specificity computation so that we know it by the time
> we mint the Selector. v1
> 
> Review of attachment 8874255 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/selectors/parser.rs
> @@ +913,5 @@
> >  /// selector : simple_selector_sequence [ combinator simple_selector_sequence ]* ;
> >  ///
> >  /// `Err` means invalid selector.
> >  fn parse_selector<P, Impl>(parser: &P, input: &mut CssParser) -> Result<Selector<Impl>, ()>
> >      where P: Parser<Impl=Impl>, Impl: SelectorImpl
> 
> This function is quite useless now, could you nuke it?

I'll do this as a patch on top.

> 
> @@ +985,5 @@
> >          }
> >          sequence.push(Component::Combinator(combinator));
> >      }
> >  
> > +
> 
> nit: Extra newline.

Fixed.


(In reply to Emilio Cobos Álvarez [:emilio] from comment #17)
> Does this need to be pub?

It does, for the re-export.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #33)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #15)
> > Comment on attachment 8874254 [details] [diff] [review]
> > Part 5 - Stop slicing selectors when noting dependencies, and match with an
> > offset instead. v1
> > 
> > Review of attachment 8874254 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: servo/components/selectors/matching.rs
> > @@ +337,5 @@
> > >  
> > >  /// Matches a selector, fast-rejecting against a bloom filter.
> > >  #[inline(always)]
> > >  pub fn matches_selector<E, F>(selector: &Selector<E::Impl>,
> > > +                              offset: usize,
> > 
> > Ditto here, I think this would be cleaner without the offset.
> > 
> > @@ +358,5 @@
> > >  }
> > >  
> > >  /// Matches a complex selector.
> > >  pub fn matches_complex_selector<E, F>(complex_selector: &Selector<E::Impl>,
> > > +                                      offset: usize,
> > 
> > Could we instead always pass the iterator here? I don't see a reason why the
> > offset would be needed.
> 
> The reason is that I wanted to avoid calling .iter() unconditionally. I'm
> not 100% confident that LLVM will always optimize out the call in the case
> where we fast-reject, and calling .iter() (with my eventual patches in this
> bug) will require dereferencing the selector pointer to get the length,
> which will result in a cache miss. So I want to play it safe, at least for
> now.

I see, can you add a comment on this?
(In reply to Emilio Cobos Álvarez [:emilio] from comment #36)
> > The reason is that I wanted to avoid calling .iter() unconditionally. I'm
> > not 100% confident that LLVM will always optimize out the call in the case
> > where we fast-reject, and calling .iter() (with my eventual patches in this
> > bug) will require dereferencing the selector pointer to get the length,
> > which will result in a cache miss. So I want to play it safe, at least for
> > now.
> 
> I see, can you add a comment on this?

Done.
Priority: -- → P1
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> Chrome still beats sequential stylo on some configurations of bloom-basic
> [1]. This stumped me for a while, until I realized that our inline Rule size
> is twice the size of Chrome's on 64-bit platforms (64 bytes instead of 32).
> This is due to the wide layout of Arcslice (3 words instead of 1), as well
> as storing specificity_and_flags in the non-heap-allocated section of the
> selector.

Err, this was incorrect (I forgot that the bloom hashes are 16 bytes rather than 8). Chrome is 40, and that's where these patches take us. This gives us a significant speedup on bloom-basic.

> 
> Another suboptimal thing is that we currently store an
> Arc<Box<[Component]>>, which costs us an extra heap allocation just because
> the type system doesn't allow passing a slice to Arc::new. We can fix this
> with some deep magic on our custom Arc.

From a rough measurement on 100x myspace, this seems to give us a ~10% speedup in parsing by reducing mallocs. \o/

I'll dig into the feedback on the servo_arc stuff now.
(In reply to Michael Layzell [:mystor] from comment #30)
> Comment on attachment 8874257 [details] [diff] [review]
> Part 8 - Add support for dynamically-sized types in servo_arc. v1
> 
> Review of attachment 8874257 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/servo_arc/lib.rs
> @@ +173,5 @@
> > +        let num_items = items.len();
> > +        let refcount_size = size_of::<atomic::AtomicUsize>();
> > +        let header_size = size_of::<H>();
> > +        let single_item_size = size_of::<T>();
> > +        let padding = size_of::<(H, T)>() - header_size - single_item_size;
> 
> I don't think it's safe to assert that `size_of::<(H, T)>()` ==
> `size_of::<HeaderSlice<H, T>>()` when there is one element. `(H, T)` is
> `repr(rust)` and allowed to do crazy stuff for layout, and/or rearrange
> fields. I am not sure how that would happen right now but theoretically it
> could at least.

On IRC we decided to go with Manish's size_of_val approach.

> 
> In contrast `HeaderSlice` is required to keep the `slice` field at the end,
> due to it being dynamically sized.
> 
> If you instead had:
> ```
> struct HeaderSlice<H, T: ?Sized> {
>     pub header: H,
>     pub slice: T,
> }
> ```
> 
> and then instead asked for `size_of::<HeaderSlice<H, [T; 1]>>` (and produced
> <H, [T]> instead of <H, T> from `from_header_and_iterator`), then I think
> the layout would be guaranteed to match due to Coercion.
> 
> It's probably fine, and I doubt they ever won't be the same, but.
> 
> @@ +174,5 @@
> > +        let refcount_size = size_of::<atomic::AtomicUsize>();
> > +        let header_size = size_of::<H>();
> > +        let single_item_size = size_of::<T>();
> > +        let padding = size_of::<(H, T)>() - header_size - single_item_size;
> > +        let items_size = single_item_size * num_items;
> 
> This can overflow, ideally we should do checked math here.

Obsolete per above.

> 
> @@ +175,5 @@
> > +        let header_size = size_of::<H>();
> > +        let single_item_size = size_of::<T>();
> > +        let padding = size_of::<(H, T)>() - header_size - single_item_size;
> > +        let items_size = single_item_size * num_items;
> > +        let size = refcount_size + header_size + padding + items_size;
> 
> This expression requires that there is no padding between the refcount, and
> the header, which is correct most of the time on most platforms. 
> 
> Perhaps document this assumption, and assert that the alignment of `H` is
> less than or equal to the size and align of `AtomicUsize`?

Obsolete per above.


> 
> @@ +182,5 @@
> > +        let ptr: *mut ArcInner<HeaderSlice<H, T>>;
> > +        unsafe {
> > +            // Allocate the buffer. We use Vec because the underlying allocation
> > +            // machinery isn't available in stable Rust.
> > +            let mut vec = Vec::<u8>::with_capacity(size);
> 
> I don't know if this is intended to be run in both servo and gecko, or just
> in gecko.

Both.

> 
> If we're just running in gecko, every allocation which we make goes through
> `mozjemalloc`, meaning that we actually have 2 options here:
> 1) We directly call into mozjemalloc, and get the memory, or
> 2) We use this vec strategy, and our knowledge of what alignment that
> mozjemalloc gives to all allocations (I _think_ it's 16, but glandium would
> know better).
> 
> If we're being run in servo, then no, we aren't guaranteed the alignment is
> correct from my understanding. As our struct's final alignment will probably
> be `align_of::<usize>` due to the `AtomicUsize` at the beginning (we would
> want to assert this), we could divide the backing buffer by `usize`
> elements, and round up.

Yes, I think that's the right approach. I'll go ahead and do that.

> Why this indexing and &mut-ing dance? You could just write:
> 
>     Box::into_raw(box.into_boxed_slice()) as *mut u8
> 
> As wide *mut [u8] pointers can be decomposed into thin *mut u8 pointers with
> `as`.

Oh, I thought I'd tried that. Thanks.

> 
> @@ +193,5 @@
> > +            // only to the number of elements in the dynamically-sized portion of
> > +            // the type, so the value will be the same whether it points to a [T]
> > +            // or something else with a [T] as its last member.
> > +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> > +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, T>>;
> 
> I _think_ this should be OK. The fact that doing casts from *mut [T] to *mut
> Struct<[T]> is allowed surprised me a bit, but we certainly won't be able to
> remove it if/when we decide to change DSTs, so I think we can rely on it
> here.
> 
> @@ +199,5 @@
> > +            // Write the data.
> > +            ptr::write(&mut ((*ptr).count), atomic::AtomicUsize::new(1));
> > +            ptr::write(&mut ((*ptr).data.header), header);
> > +            let mut current: *mut T = &mut (*ptr).data.slice[0];
> > +            for item in items {
> 

> ExactSizeIterator doesn't guarantee that there are actually `len` items in
> the iterator.

Well, that's the API contract, but I guess the point is that implementing the API doesn't require unsafe code, so we can't trust it.

> You should do bounds checks in this loop, for both the
> iterator producing too many, and too few, items.

I'll iterate using num_items, with a .next().expect() within the loop, and an assert!(items.next().is_none()) after the loop.


> There's an unstable `TrustedLen` trait which can be used to actually check
> that the length is correct.

Yeah, unstable doesn't help us here.

> 
> @@ +209,5 @@
> > +            debug_assert!(current as *mut u8 == buffer.offset(size as isize));
> > +        }
> > +
> > +        // Return the fat Arc.
> > +        debug_assert!(size_of::<Self>() == size_of::<usize>() * 2, "The Arc will be fat");
> 
> nit: Why not use `assert_eq!()` here? `size_of` is known at compile time,
> and thus this branch will be completely optimized out in optimized builds
> anyway.

Ok, I guess I'm comfortable relying on that.
(In reply to Michael Layzell [:mystor] from comment #31)
> Comment on attachment 8874301 [details] [diff] [review]
> Part 11 - Implement ThinArc. v1
> 
> Review of attachment 8874301 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/servo_arc/lib.rs
> @@ +508,5 @@
> > +    }
> > +}
> > +
> > +pub struct ThinArc<H, T> {
> > +    ptr: *mut ArcInner<(HeaderWithLength<H>, [T; 1])>,
> 
> Again from earlier I don't know if (H, T) has the same layout as
> HeaderSlice. I think that making HeaderSlice take a `T: ?Sized` and using
> either `[T; 1]` or `[T]` is probably a better bet and simplifies a bunch of
> code. Both in this patch and the other one I looked at.

Good idea - I've gone ahead and done that. Please re-check if there are any other simplifications I've missed as a result of this.

> 
> @@ +521,5 @@
> > +fn thin_to_thick<H, T>(thin: *mut ArcInner<(HeaderWithLength<H>, [T; 1])>)
> > +    -> *mut ArcInner<HeaderSlice<HeaderWithLength<H>, T>>
> > +{
> > +    let len = unsafe { (*thin).data.0.length };
> > +    let fake_slice: &mut [T] = unsafe {
> 
> You should be able to just write `fake_slice: *mut [T]` and avoid one of the
> casts below I think.

Done.

> 
> @@ +541,5 @@
> > +            ptr: thin_to_thick(self.ptr)
> > +        };
> > +
> > +        // Expose the transient Arc to the callback, which may clone it if it wants.
> > +        let result = f(&transient);
> 
> In gecko we don't have to deal with exception safety, but if this is also
> running in Servo, we do need to worry about that.
> 
> NoDrop is definitely the most performant option right now IMO. I wouldn't be
> surprised if llvm got rid of the overhead anyway when the value is stored on
> the stack.

I've used NoDrop.

> 
> @@ +570,5 @@
> > +        let _ = Arc::from_thin(ThinArc { ptr: self.ptr });
> > +    }
> > +}
> > +
> > +impl<H, T> Arc<HeaderSlice<HeaderWithLength<H>, T>> {
> 
> I think the reasoning behind this is to use the code from part 8 to perform
> the allocation to avoid unnecessary code duplication.

I don't understand this comment.
I think we probably don't need Niko's input at this point.
Flags: needinfo?(nmatsakis)
MozReview-Commit-ID: H15cG5RBtZH
Attachment #8874596 - Flags: review?(michael)
Attachment #8874596 - Flags: review?(manishearth)
Attachment #8874257 - Attachment is obsolete: true
MozReview-Commit-ID: LB3bs9ed2WC
Attachment #8874597 - Flags: review?(michael)
Attachment #8874597 - Flags: review?(manishearth)
Attachment #8874301 - Attachment is obsolete: true
Attachment #8874301 - Flags: feedback?(emilio+bugs)
Attachment #8874596 - Flags: review?(manishearth) → review+
Attachment #8874597 - Flags: review?(manishearth) → review+
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Comment on attachment 8874596 [details] [diff] [review]
Part 8 - Add support for dynamically-sized types in servo_arc. v2

Review of attachment 8874596 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +442,5 @@
> +    pub fn from_header_and_iter<I>(header: H, mut items: I) -> Self
> +        where I: Iterator<Item=T> + ExactSizeIterator
> +    {
> +        use ::std::mem::size_of;
> +        assert!(size_of::<T>() != 0, "Need to think about ZST");

I _think_ ZSTs will work fine here, but I understand your desire to hold off :)

@@ +477,5 @@
> +            // machinery isn't available in stable Rust.
> +            //
> +            // To avoid alignment issues, we allocate words rather than bytes,
> +            // rounding up to the nearest word size.
> +            let words_to_allocate = divide_rounding_up(size, size_of::<usize>());

It might be good to assert that align_of::<T> <= align_of::<usize>, just in case someone wants to put some overaligned simd struct in one of these.

@@ +480,5 @@
> +            // rounding up to the nearest word size.
> +            let words_to_allocate = divide_rounding_up(size, size_of::<usize>());
> +            let mut vec = Vec::<usize>::with_capacity(words_to_allocate);
> +            vec.set_len(words_to_allocate);
> +            let buffer = Box::into_raw(vec.into_boxed_slice()) as *mut usize as *mut u8;

There's no reason to cast the buffer to *mut u8, as you immediately cast it to *mut T to create the slice.

@@ +491,5 @@
> +            // or something else with a [T] as its last member.
> +            let fake_slice: &mut [T] = slice::from_raw_parts_mut(buffer as *mut T, num_items);
> +            ptr = fake_slice as *mut [T] as *mut ArcInner<HeaderSlice<H, [T]>>;
> +
> +            // Write the data.

Add a comment here explaining that if the iterator panics, or any of your expects fail, it's safe because the uninitialized memory will just leak.
Attachment #8874596 - Flags: review?(michael) → review+
You generally have to do some special casing for the allocation semantics of multiple ZSTs (Vec does some stuff, the nomicon has text on this IIRC). I'm not sure if it will be necessary for this case sicne we're using size_of_val anyway, but why bother. Don't allocate ZSTs :)

> There's no reason to cast the buffer to *mut u8, as you immediately cast it to *mut T to create the slice.
Comment on attachment 8874597 [details] [diff] [review]
Part 11 - Implement ThinArc. v2

Review of attachment 8874597 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/servo_arc/lib.rs
@@ +580,5 @@
> +    }
> +}
> +
> +impl<H, T> Deref for ThinArc<H, T> {
> +    type Target = HeaderSlice<HeaderWithLength<H>, [T]>;

I feel like this code would be a lot more readable if you defined a type alias for this type which you use everywhere. Perhaps `HeaderSliceLen<H, [T]>`?

@@ +649,5 @@
> +
> +    impl Drop for Canary {
> +        fn drop(&mut self) {
> +            unsafe {
> +                match *self {

Why the match? Can't you write `(*self.0).fetch_add(1, SeqCst)` instead?
Attachment #8874597 - Flags: review?(michael) → review+
MozReview-Commit-ID: AghLMEcbUVC
Attachment #8874981 - Flags: review+
Whoops, that was a premature version.

Note that we need to keep buffer as *u8 to make the assertion at the bottom work.

MozReview-Commit-ID: AghLMEcbUVC
Attachment #8874995 - Flags: review+
Attachment #8874981 - Attachment is obsolete: true
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #50)
> Created attachment 8874995 [details] [diff] [review]
> Part 13 - Address followup review comments. r=mystor
> 
> Whoops, that was a premature version.
> 
> Note that we need to keep buffer as *u8 to make the assertion at the bottom
> work.
> 
> MozReview-Commit-ID: AghLMEcbUVC

https://github.com/servo/servo/pull/17197
You need to log in before you can comment on or make changes to this bug.