stylo: getting inline style can be a lot slower than Gecko

NEW
Assigned to

Status

()

Core
CSS Parsing and Computation
P1
normal
4 months ago
3 months ago

People

(Reporter: bz, Assigned: SimonSapin)

Tracking

(Blocks: 1 bug)

Trunk
Points:
---

Firefox Tracking Flags

(firefox55 affected)

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

4 months ago
Created attachment 8857157 [details]
Testcase for .style.top getter performance

See attached testcase.  stylo (tip as of last night) is about 50% slower than Gecko on it (and Gecko is not _that_ fast to start with).
(Reporter)

Comment 1

4 months ago
Created attachment 8857162 [details]
Testcase for .style.top getter performance
Attachment #8857157 - Attachment is obsolete: true
(Reporter)

Comment 2

4 months ago
Looking at a profile, of the time under CSS2PropertiesBinding::get_top, we have the following:

1) 10% allocating when we try to convert the nsAString to a JSString.  That's because we
   have an nsAutoString and it's short, so we're not ending up with a shareable stringbuffer here.
   But that should be OK, since we should use its stack buffer, at least.
2) 1% the Truncate() call in nsDOMCSSDeclaration::GetPropertyValue.  This is silly-ish,
   and we should fix it, but no different from Gecko.
3) 5% style::properties::declaration_block::PropertyDeclarationBlock::get::h4e6c8a146e9f2cdb.  
   Some of this (about 2%) is style::properties::PropertyDeclaration::id::h0201de0187550b7e
4) 9% under Gecko_AppendUTF8toString.  Some of this is the SetCapacity/MutatePrep overhead,
   but about 2/3 is actual self time in AppendUTF8toUTF16.  Between that and the Gecko_FinalizeCString
   call and the various function call overhead for all this stuff, about 11% of the time is
   spent under
   _$LT$core..fmt..Write..write_fmt..Adapter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..fmt..Write$GT$::write_str::h57d3f775b39431e1
   which I think is the Write::fmt trait impl bit we do for nsAString.  And that's just for the _unit_, I think, because....

5) Another 17% is spent under _$LT$core..fmt..Write..write_fmt..Adapter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..fmt..Write$GT$::write_str::h57d3f775b39431e1 when it's called from core::fmt::Formatter::write_formatted_parts::hffeaaefff09eb6a7 which is itself called from core::fmt::float_to_decimal_common::hf1456822c549f2d1.  Again, some of this is the MutatePrep bit (6%) and some is the actual copying/conversion stuff.

6) 12% is under core::num::flt2dec::strategy::grisu::format_shortest_opt::he2118afa44ac680d,
   which is the actual float to string conversion, I assume.  This seems to be a bit faster,
   if I'm measuring this right, than the corresponding
   double_conversion::DoubleToStringConverter::ToPrecision call on the Gecko side,
   which is moderately surprising.  I wonder whether they do equivalent things.  But I'm not
   going to complain, if they do!

7) 4% _platform_bzero called from core::fmt::float_to_decimal_common::hf1456822c549f2d1

8) 8% self time in core::fmt::float_to_decimal_common::hf1456822c549f2d1.  Not sure what that is yet.

In terms of broader picture, on the servo side Servo_DeclarationBlock_GetPropertyValueById is about 74% of the time under get_top.  On the Gecko side, mozilla::css::Declaration::AppendValueToString is about 62% of the time under get_top.  Also on the Gecko side, get_top is about 94% of the mainthread time.

So if the Gecko mainthread time is 100ms, then Declaration::AppendValueToString takes about 59ms of that.  The corresponding Servo wall-clock time over here would be 150ms, with get_top at 96% of that and hence Servo_DeclarationBlock_GetPropertyValueById taking about 106ms, or about 2x as long as the Gecko-side equivalent.  And the actual float-to-string conversion (which is 40-50% on the Gecko side, depending on whether you include the time to copy the ASCII buffer into the nsAString) is 12-30% on the Servo side, again depending on whether you include that time.

Anyway, the upshot is that the way we're doing the string machinery here is the most obvious thing hurting us.  But also style::properties::declaration_block::PropertyDeclarationBlock::get bit at 5% is a problem.  The comparable Gecko code is nsCSSCompressedDataBlock::ValueFor and is 1.5% or so, of a number that is only 1/2 to 2/3 as big...

Granted, nsCSSCompressedDataBlock::ValueFor is O(N) in number of declarations stored, so this is a best case for it.  But looks to me like PropertyDeclarationBlock::get is likewise O(N); just with a much larger constant.
(Reporter)

Comment 3

4 months ago
I should note that a similar testcase that sets s.display = "block" and then reads s.display in a loop seems to be a smaller performance regression over Gecko, albeit still a performance regression.
(Reporter)

Comment 4

4 months ago
So https://users.rust-lang.org/t/excessive-stack-usage-for-formatting/8023/2 says "Note that core::fmt has basically been never optimized, so improvements are of course always welcome!"  That's great, but a bit sad if this is something we're using in things that are known hot on the web.  ;)

I _think_ the _platform_bzero comes from this bit in float_to_decimal_common in https://doc.rust-lang.org/nightly/src/core/fmt/float.rs.html:

  let mut buf = [0; 1024]; // enough for f32 and f64

I filed https://github.com/rust-lang/rust/issues/41234 and will poke at it a bit, for at least this part.
(Assignee)

Comment 5

4 months ago
Potentially useful:

https://github.com/dtolnay/dtoa
https://github.com/dtolnay/itoa
(Reporter)

Comment 6

4 months ago
Worth trying, I guess.  The actual double-to-bytes thing above is pretty fast, like I said (and seems to be using the same grisu algorithm as the dtoa thing linked).  The question is how the surrounding code behaves...
(Assignee)

Comment 7

4 months ago
I’m only intended to point out in passing the existence of these crates since they came to mind when I glanced at this bug. They might help a bit, but only with part of the overhead. So they might not be the first thing to look at.
Assignee: nobody → simon.sapin
Priority: -- → P1
(Reporter)

Comment 8

4 months ago
So one thing that's obviously bad here is that Formatter::write_formatted_parts ends up doing a buf.write_str() for each part, each of which does a Gecko_AppendUTF8toString call.  In this testcase, the value being output is "10" and that has _three_ parts: empty string, "1", and "0".  So we end up with three calls to Gecko_AppendUTF8toString to append this "10".

I expect the relevant parts are sign, nonzero parts of integer part, nonzero parts of mantissa, trailing zeroes.

If I use "11px" instead of "10px", I only get two parts: empty string and "11".  If I use "10.5px", I get four parts: empty string, "10", ".", "5".

So at a glance, the biggest win might be to put all the stuff into a single buffer servo-side (including the unit!) and then do a single Gecko_AppendUTF8toString.  Something like a u8 vec with some stack storage to handle common cases without allocating seems like it should work.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #8)
> So at a glance, the biggest win might be to put all the stuff into a single
> buffer servo-side (including the unit!) and then do a single
> Gecko_AppendUTF8toString.  Something like a u8 vec with some stack storage
> to handle common cases without allocating seems like it should work.

This seems reasonable, given that it's what we do for nsTSubstring::AppendFloat:

http://dxr.mozilla.org/mozilla-central/source/xpcom/string/nsTSubstring.cpp#1058
(Reporter)

Comment 10

4 months ago
https://github.com/rust-lang/rust/issues/41259 filed on the zeroing thing in rust's stdlib.

I poked a bit the "stuff into a single buffer" solution.  Sadly, we still end up with a bunch of function calls and memmoves and stuff, as far as I can tell.  On irc people were suggesting that using write!() is probably not a good idea here in general.

Now that I've read through this stuff a bunch, the lib linked in comment 5 does look useful: it does a single write_all for the floating point value.  In our case, we'd end up with that, plus the write to put in the "px", which is better than the 4 calls we get now.

I'm going to attach the patches I was experimenting with today; we may want some of them no matter what.
(Reporter)

Comment 11

4 months ago
Created attachment 8857667 [details] [diff] [review]
Support for as_slice and fmt::Write for smallvec, plus a slightly faster extend_from_slice
(Reporter)

Comment 12

4 months ago
Created attachment 8857670 [details] [diff] [review]
geckolib change to use smallvec here
(Reporter)

Comment 13

4 months ago
Created attachment 8857671 [details] [diff] [review]
Possibly-desirable patch to avoid the FFI call when writing empty string to an nsAString
(Reporter)

Comment 14

4 months ago
Created attachment 8857672 [details] [diff] [review]
Cargo bits to test that stuff, for my future reference

For what it's worth, doing these bits helped some; the slowdown went from 50% of Gecko time to 25-30% of Gecko time.  I do think using the other dtoa would help some too, because it's much more write-friendly.
(Reporter)

Comment 15

4 months ago
One other note: I'm not sure how high-quality the code in those repos linked in comment 5 is.  dtoa (as a concept, not this specific implementation) is ... finicky.  Lots of edge cases, some of which may not matter to us here.
(Reporter)

Comment 16

4 months ago
Oh, last note.  If we _do_ decide to stick with the stdlib bits, calling flt2dec::to_shortest_str manually and then being smarter about its output and how it's dumped into the fmt::Write is in theory an option, but in practice it's not stable API, and I'm not sure what servo's/stylo's policy on that sort of thing is.
(Assignee)

Comment 17

4 months ago
> it's not stable API,

Yes, the core::num::flt2dec module is #![unstable]. It looks like it would even be private, if not for the technical constraint of being unit-tested from another crate.


> and I'm not sure what servo's/stylo's policy on that sort of thing is.

Servo uses many unstable features for historical reasons but we generally try to remove their usage little by little, in order to make it easier to upgrade to new compiler versions. Firefox however (and so any crate used by Stylo) can build with stable Rust (so without any #[unstable] feature), and I believe we’d like to keep it that way. We already get complaints when we increment rustc_min_version in build/moz.configure/rust.configure too often…


If the standard library has code that is currently not exposed but could be useful, we can make a case to expose it. But going through the RFC process and stabilization and riding the trains takes time, so at best it’s a couple months until stuff proposed now reaches Rust stable.

Another option is to copy the relevant code into a separate crate that we could use without waiting, but then we’d need to maintain that code separately. For example if a future Rust version adds an optimization to this code, we wouldn’t profit from it just by upgrading the compiler and would have to port it manually.
(In reply to Simon Sapin (:SimonSapin) from comment #17)
> Another option is to copy the relevant code into a separate crate that we
> could use without waiting, but then we’d need to maintain that code
> separately. For example if a future Rust version adds an optimization to
> this code, we wouldn’t profit from it just by upgrading the compiler and
> would have to port it manually.

I think we should do this, and add an action item to reevaluate after we ship stylo. If it looks like we'll be using it for the foreseeable future, we can do the RFC stuff after shipping and then deprecate the crates when that's all done.
You need to log in before you can comment on or make changes to this bug.