Closed Bug 1355599 Opened 7 years ago Closed 7 years ago

stylo: getting inline style can be a lot slower than Gecko

Categories

(Core :: CSS Parsing and Computation, defect, P2)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox55 --- affected

People

(Reporter: bzbarsky, Assigned: SimonSapin)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 1 obsolete file)

See attached testcase.  stylo (tip as of last night) is about 50% slower than Gecko on it (and Gecko is not _that_ fast to start with).
Attachment #8857157 - Attachment is obsolete: true
Looking at a profile, of the time under CSS2PropertiesBinding::get_top, we have the following:

1) 10% allocating when we try to convert the nsAString to a JSString.  That's because we
   have an nsAutoString and it's short, so we're not ending up with a shareable stringbuffer here.
   But that should be OK, since we should use its stack buffer, at least.
2) 1% the Truncate() call in nsDOMCSSDeclaration::GetPropertyValue.  This is silly-ish,
   and we should fix it, but no different from Gecko.
3) 5% style::properties::declaration_block::PropertyDeclarationBlock::get::h4e6c8a146e9f2cdb.  
   Some of this (about 2%) is style::properties::PropertyDeclaration::id::h0201de0187550b7e
4) 9% under Gecko_AppendUTF8toString.  Some of this is the SetCapacity/MutatePrep overhead,
   but about 2/3 is actual self time in AppendUTF8toUTF16.  Between that and the Gecko_FinalizeCString
   call and the various function call overhead for all this stuff, about 11% of the time is
   spent under
   _$LT$core..fmt..Write..write_fmt..Adapter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..fmt..Write$GT$::write_str::h57d3f775b39431e1
   which I think is the Write::fmt trait impl bit we do for nsAString.  And that's just for the _unit_, I think, because....

5) Another 17% is spent under _$LT$core..fmt..Write..write_fmt..Adapter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..fmt..Write$GT$::write_str::h57d3f775b39431e1 when it's called from core::fmt::Formatter::write_formatted_parts::hffeaaefff09eb6a7 which is itself called from core::fmt::float_to_decimal_common::hf1456822c549f2d1.  Again, some of this is the MutatePrep bit (6%) and some is the actual copying/conversion stuff.

6) 12% is under core::num::flt2dec::strategy::grisu::format_shortest_opt::he2118afa44ac680d,
   which is the actual float to string conversion, I assume.  This seems to be a bit faster,
   if I'm measuring this right, than the corresponding
   double_conversion::DoubleToStringConverter::ToPrecision call on the Gecko side,
   which is moderately surprising.  I wonder whether they do equivalent things.  But I'm not
   going to complain, if they do!

7) 4% _platform_bzero called from core::fmt::float_to_decimal_common::hf1456822c549f2d1

8) 8% self time in core::fmt::float_to_decimal_common::hf1456822c549f2d1.  Not sure what that is yet.

In terms of broader picture, on the servo side Servo_DeclarationBlock_GetPropertyValueById is about 74% of the time under get_top.  On the Gecko side, mozilla::css::Declaration::AppendValueToString is about 62% of the time under get_top.  Also on the Gecko side, get_top is about 94% of the mainthread time.

So if the Gecko mainthread time is 100ms, then Declaration::AppendValueToString takes about 59ms of that.  The corresponding Servo wall-clock time over here would be 150ms, with get_top at 96% of that and hence Servo_DeclarationBlock_GetPropertyValueById taking about 106ms, or about 2x as long as the Gecko-side equivalent.  And the actual float-to-string conversion (which is 40-50% on the Gecko side, depending on whether you include the time to copy the ASCII buffer into the nsAString) is 12-30% on the Servo side, again depending on whether you include that time.

Anyway, the upshot is that the way we're doing the string machinery here is the most obvious thing hurting us.  But also style::properties::declaration_block::PropertyDeclarationBlock::get bit at 5% is a problem.  The comparable Gecko code is nsCSSCompressedDataBlock::ValueFor and is 1.5% or so, of a number that is only 1/2 to 2/3 as big...

Granted, nsCSSCompressedDataBlock::ValueFor is O(N) in number of declarations stored, so this is a best case for it.  But looks to me like PropertyDeclarationBlock::get is likewise O(N); just with a much larger constant.
I should note that a similar testcase that sets s.display = "block" and then reads s.display in a loop seems to be a smaller performance regression over Gecko, albeit still a performance regression.
So https://users.rust-lang.org/t/excessive-stack-usage-for-formatting/8023/2 says "Note that core::fmt has basically been never optimized, so improvements are of course always welcome!"  That's great, but a bit sad if this is something we're using in things that are known hot on the web.  ;)

I _think_ the _platform_bzero comes from this bit in float_to_decimal_common in https://doc.rust-lang.org/nightly/src/core/fmt/float.rs.html:

  let mut buf = [0; 1024]; // enough for f32 and f64

I filed https://github.com/rust-lang/rust/issues/41234 and will poke at it a bit, for at least this part.
Worth trying, I guess.  The actual double-to-bytes thing above is pretty fast, like I said (and seems to be using the same grisu algorithm as the dtoa thing linked).  The question is how the surrounding code behaves...
I’m only intended to point out in passing the existence of these crates since they came to mind when I glanced at this bug. They might help a bit, but only with part of the overhead. So they might not be the first thing to look at.
Assignee: nobody → simon.sapin
Priority: -- → P1
So one thing that's obviously bad here is that Formatter::write_formatted_parts ends up doing a buf.write_str() for each part, each of which does a Gecko_AppendUTF8toString call.  In this testcase, the value being output is "10" and that has _three_ parts: empty string, "1", and "0".  So we end up with three calls to Gecko_AppendUTF8toString to append this "10".

I expect the relevant parts are sign, nonzero parts of integer part, nonzero parts of mantissa, trailing zeroes.

If I use "11px" instead of "10px", I only get two parts: empty string and "11".  If I use "10.5px", I get four parts: empty string, "10", ".", "5".

So at a glance, the biggest win might be to put all the stuff into a single buffer servo-side (including the unit!) and then do a single Gecko_AppendUTF8toString.  Something like a u8 vec with some stack storage to handle common cases without allocating seems like it should work.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #8)
> So at a glance, the biggest win might be to put all the stuff into a single
> buffer servo-side (including the unit!) and then do a single
> Gecko_AppendUTF8toString.  Something like a u8 vec with some stack storage
> to handle common cases without allocating seems like it should work.

This seems reasonable, given that it's what we do for nsTSubstring::AppendFloat:

http://dxr.mozilla.org/mozilla-central/source/xpcom/string/nsTSubstring.cpp#1058
https://github.com/rust-lang/rust/issues/41259 filed on the zeroing thing in rust's stdlib.

I poked a bit the "stuff into a single buffer" solution.  Sadly, we still end up with a bunch of function calls and memmoves and stuff, as far as I can tell.  On irc people were suggesting that using write!() is probably not a good idea here in general.

Now that I've read through this stuff a bunch, the lib linked in comment 5 does look useful: it does a single write_all for the floating point value.  In our case, we'd end up with that, plus the write to put in the "px", which is better than the 4 calls we get now.

I'm going to attach the patches I was experimenting with today; we may want some of them no matter what.
For what it's worth, doing these bits helped some; the slowdown went from 50% of Gecko time to 25-30% of Gecko time.  I do think using the other dtoa would help some too, because it's much more write-friendly.
One other note: I'm not sure how high-quality the code in those repos linked in comment 5 is.  dtoa (as a concept, not this specific implementation) is ... finicky.  Lots of edge cases, some of which may not matter to us here.
Oh, last note.  If we _do_ decide to stick with the stdlib bits, calling flt2dec::to_shortest_str manually and then being smarter about its output and how it's dumped into the fmt::Write is in theory an option, but in practice it's not stable API, and I'm not sure what servo's/stylo's policy on that sort of thing is.
> it's not stable API,

Yes, the core::num::flt2dec module is #![unstable]. It looks like it would even be private, if not for the technical constraint of being unit-tested from another crate.


> and I'm not sure what servo's/stylo's policy on that sort of thing is.

Servo uses many unstable features for historical reasons but we generally try to remove their usage little by little, in order to make it easier to upgrade to new compiler versions. Firefox however (and so any crate used by Stylo) can build with stable Rust (so without any #[unstable] feature), and I believe we’d like to keep it that way. We already get complaints when we increment rustc_min_version in build/moz.configure/rust.configure too often…


If the standard library has code that is currently not exposed but could be useful, we can make a case to expose it. But going through the RFC process and stabilization and riding the trains takes time, so at best it’s a couple months until stuff proposed now reaches Rust stable.

Another option is to copy the relevant code into a separate crate that we could use without waiting, but then we’d need to maintain that code separately. For example if a future Rust version adds an optimization to this code, we wouldn’t profit from it just by upgrading the compiler and would have to port it manually.
(In reply to Simon Sapin (:SimonSapin) from comment #17)
> Another option is to copy the relevant code into a separate crate that we
> could use without waiting, but then we’d need to maintain that code
> separately. For example if a future Rust version adds an optimization to
> this code, we wouldn’t profit from it just by upgrading the compiler and
> would have to port it manually.

I think we should do this, and add an action item to reevaluate after we ship stylo. If it looks like we'll be using it for the foreseeable future, we can do the RFC stuff after shipping and then deprecate the crates when that's all done.
Priority: P1 → --
Priority: -- → P2
https://github.com/servo/rust-cssparser/pull/173 switched to entirely different float serialization code which doesn’t appear to use zeroing:

https://github.com/dtolnay/dtoa/blob/0.4.2/src/dtoa.rs#L465

    let mut buffer: [u8; 24] = mem::uninitialized();
Depends on: 1394792
We’re now about 9% slower, from 50% when this bug was opened.

----

I profiled the test case from comment 1 with one zero added to the number of iterations (so 10 million), on hg commit 632e42dca494.

The median of 5 run (as computed by the test case itself) with the profiler disabled was 1670 ms without Stylo, 1822 ms with (9.1% slower, 15.2 ns/iteration difference).

Now looking at profiles (different runs), the deepest relevant function present in both is nsCSSDOMDeclaration::GetPropertyValue. The profiler gives it 1243 ms without Stylo, 1322 ms (6.3% slower, 7.9 ns/iter diff) with. Some notable parallels under that:

* Stylo spends 42 ms (42 samples, 4.2 ns/iteration) in PropertyId::from_nscsspropertyid. This is a conversion that is unnecessary in the old style system.
* No locking at all vs 48 samples in SharedLock::read
* 14 samples in nsCSSCompressedDataBlock::ValueFor vs 35 samples in PropertyDeclarationBlock::get
* 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css
(In reply to Simon Sapin (:SimonSapin) from comment #20)
> We’re now about 9% slower, from 50% when this bug was opened.

Awesome work!

> 
> ----
> 
> I profiled the test case from comment 1 with one zero added to the number of
> iterations (so 10 million), on hg commit 632e42dca494.
> 
> The median of 5 run (as computed by the test case itself) with the profiler
> disabled was 1670 ms without Stylo, 1822 ms with (9.1% slower, 15.2
> ns/iteration difference).
> 
> Now looking at profiles (different runs), the deepest relevant function
> present in both is nsCSSDOMDeclaration::GetPropertyValue. The profiler gives
> it 1243 ms without Stylo, 1322 ms (6.3% slower, 7.9 ns/iter diff) with. Some
> notable parallels under that:
> 
> * Stylo spends 42 ms (42 samples, 4.2 ns/iteration) in
> PropertyId::from_nscsspropertyid. This is a conversion that is unnecessary
> in the old style system.

Will we be able to eliminate this conversion when we drop the old style system? If so, waiting a few cycles to fix this is probably fine.

> * No locking at all vs 48 samples in SharedLock::read

This seems like low-hanging fruit. Given that Gecko uses a single lock for the entire runtime, and given that CSSOM only happens on the main thread, I don't think the added safety at this callsite justifies a measurable performance impact. Let's make this callsite acquire the lock unsafely if #[cfg(not(debug_assertions))]

> * 14 samples in nsCSSCompressedDataBlock::ValueFor vs 35 samples in
> PropertyDeclarationBlock::get

Why do we need to take different paths?

> * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css

Any optimizations we could borrow?
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> Will we be able to eliminate this conversion when we drop the old style
> system? If so, waiting a few cycles to fix this is probably fine.

For that we’d need to have C++ code under layout/style/ include a header file generated by the style crate’s build script. I don’t know how difficult that would be from a build system point of view.


> > * No locking at all vs 48 samples in SharedLock::read
> 
> This seems like low-hanging fruit. Given that Gecko uses a single lock for
> the entire runtime, and given that CSSOM only happens on the main thread, I
> don't think the added safety at this callsite justifies a measurable
> performance impact. Let's make this callsite acquire the lock unsafely if
> #[cfg(not(debug_assertions))]

For what it’s worth we’re talking about 4.8 nanoseconds per access, ~2% of the time for only accessing a value and not doing anything with it. Wouldn’t this be within noise level of any real workload?


> > * 14 samples in nsCSSCompressedDataBlock::ValueFor vs 35 samples in
> > PropertyDeclarationBlock::get
> 
> Why do we need to take different paths?

These are lookup methods of a C++ data structure and a Rust data structure, respectively.


> > * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css
> 
> Any optimizations we could borrow?

To make the C++ code (not used in Stylo) faster? Maybe we could replace mfbt/double-conversion with the dtoa-short crate, but that sounds out of scope for this bug.
(In reply to Simon Sapin (:SimonSapin) from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > > * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css
> > 
> > Any optimizations we could borrow?
> 
> To make the C++ code (not used in Stylo) faster? Maybe we could replace
> mfbt/double-conversion with the dtoa-short crate, but that sounds out of
> scope for this bug.

Is the intention of dtoa-short to do all the work in nsTSubstring_CharT::AppendFloat/FormatWithoutTrailingZeroes?  I would be totally willing to borrow dtoa-short to replace all that grotty code.

Replacing double-conversion with dtoa would be desirable, but that's a much longer term sort of project.
Flags: needinfo?(simon.sapin)
For the purpose of CSS serialization, yes. Outside of that I don’t know.
Flags: needinfo?(simon.sapin)
(In reply to Nathan Froyd [:froydnj] from comment #23)
> (In reply to Simon Sapin (:SimonSapin) from comment #22)
> > (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > > > * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css
> > > 
> > > Any optimizations we could borrow?
> > 
> > To make the C++ code (not used in Stylo) faster? Maybe we could replace
> > mfbt/double-conversion with the dtoa-short crate, but that sounds out of
> > scope for this bug.
> 
> Is the intention of dtoa-short to do all the work in
> nsTSubstring_CharT::AppendFloat/FormatWithoutTrailingZeroes?  I would be
> totally willing to borrow dtoa-short to replace all that grotty code.

It intends to do the work of AppendFloat, and probably FormatWithoutTrailingZeroes (no API for that, but there is an internal function which accepts precision which we can expose if necessary), but I bet there are lots of untested edge cases (at least dtoa-short doesn't handle Inf and NaN at the moment), so replacing it in wider range may need more work.

> Replacing double-conversion with dtoa would be desirable, but that's a much
> longer term sort of project.

double-conversion is already a library wrapping dtoa, actually.
(In reply to Simon Sapin (:SimonSapin) from comment #22)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #21)
> > Will we be able to eliminate this conversion when we drop the old style
> > system? If so, waiting a few cycles to fix this is probably fine.
> 
> For that we’d need to have C++ code under layout/style/ include a header
> file generated by the style crate’s build script. I don’t know how difficult
> that would be from a build system point of view.
> 
> 
> > > * No locking at all vs 48 samples in SharedLock::read
> > 
> > This seems like low-hanging fruit. Given that Gecko uses a single lock for
> > the entire runtime, and given that CSSOM only happens on the main thread, I
> > don't think the added safety at this callsite justifies a measurable
> > performance impact. Let's make this callsite acquire the lock unsafely if
> > #[cfg(not(debug_assertions))]
> 
> For what it’s worth we’re talking about 4.8 nanoseconds per access, ~2% of
> the time for only accessing a value and not doing anything with it. Wouldn’t
> this be within noise level of any real workload?

I agree it isn't worth bending over backwards for microbenchmark. But we're talking about a couple of lines of code to eliminate 30% of our deficit wrt Gecko on this microbenchmark. And the safety checks aren't really buying us anything, because Gecko only does CSSOM on one thread, and if it did it on other threads there would be gobs and gobs of C++ code that would break spectacularly.

So I think we should just add an unsafe _unchecked getter and and use it in opt builds in this FFI function.

> 
> 
> > > * 14 samples in nsCSSCompressedDataBlock::ValueFor vs 35 samples in
> > > PropertyDeclarationBlock::get
> > 
> > Why do we need to take different paths?
> 
> These are lookup methods of a C++ data structure and a Rust data structure,
> respectively.
> 
> 
> > > * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css
> > 
> > Any optimizations we could borrow?
> 
> To make the C++ code (not used in Stylo) faster? Maybe we could replace
> mfbt/double-conversion with the dtoa-short crate, but that sounds out of
> scope for this bug.

Whoops, I misread the numbers and thought that C++ was faster. Nice job Rust! :-)
> For that we’d need to have C++ code under layout/style/ include a header file generated
> by the style crate’s build script.

Or just ensure, with static assertions, that the actual numeric values of the two ids line up.

That said, long-term we want to have one source of canonical metainformation about style props.  Right now we have nsCSSPropList.h and the various mako files, and they are kind of being kept in sync.  It would be better to have only one thing...

> Wouldn’t this be within noise level of any real workload?

Yes, probably.  We do need to consider benchmark performance, not just real performance, unfortunately, because people use both when making performance judgements.  :(  What that means for this particular case, I can't tell you.

>* 14 samples in nsCSSCompressedDataBlock::ValueFor vs 35 samples in PropertyDeclarationBlock::get

I filed bug 1396874 on this.

> * 886 samples in nsAString::AppendFloat vs 649 samples in f32::to_css

That's 240 samples in favor of stylo.  The previous items add up to 111 samples in favor of Gecko.  Gecko is faster by 152 samples, so that's 240 - 111 + 152 = 281 samples unaccounted for.  That's assuming that we believe our sample numbers are precise enough to start with, which they very well might not be...

Just for scale, Safari us 5x faster than Gecko here, while Chrome is 40% slower than Gecko.  So being ~10% slower is probably ok in terms of just shipping stylo, but once we've removed the old style system it might be worth reprofiling and seeing what else we can trim, including in parts of the C++ code that were there to deal with the old style system but can be simplified with stylo...
Alright. Simon, can you measure/land using an unsafe lock acquisition in the relevant FFI function? After that, I think we can resolve this bug.
Flags: needinfo?(simon.sapin)
I think we should not make that change, but if others feel strongly I won’t block it.
Flags: needinfo?(simon.sapin)
(In reply to Simon Sapin (:SimonSapin) from comment #29)
> I think we should not make that change, but if others feel strongly I won’t
> block it.

https://github.com/servo/servo/pull/18386
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Did we ever end up doing something about the bad append patterns that comment 8 mentions?  Or do we need a followup?
Flags: needinfo?(simon.sapin)
If I understand correctly comment 8 is about the standard library’s float serialization code. We’ve entirely stopped using that for CSS serialization. Instead we use dtoa-short which has a stack buffer and makes a single write_str() call to the generic stream:

https://github.com/upsuper/dtoa-short/blob/1f6173ab94/src/lib.rs#L62-L66
Flags: needinfo?(simon.sapin)
Ah, perfect, thanks!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: