1331843 - stylo: Servo CSS stylesheet parsing is 35% slower than Gecko's on myspace Talos testcase

Reporter

Description

•

7 years ago

I did some quick profiling on Talos based on shing's results in bug 1331578. I focused on the myspace testcase, where Stylo is 155% slower than stock Gecko.

One piece of low-hanging fruit I found was CSS parsing overhead. Servo spends 153 ms parsing the stylesheet: https://clptr.io/2iFRZ0k

While gecko spends 23 ms parsing the same stylesheet: https://clptr.io/2iFQPSP

There's no reason I'm aware of that we shouldn't be able to hit parity here.

Bobby Holley (:bholley)

Reporter

Comment 1

•

7 years ago

Err, this is a better view of the Servo CSS parser profile: https://clptr.io/2iFL2N3

Bobby Holley (:bholley)

Reporter

Comment 2

•

7 years ago

https://github.com/servo/servo/issues/15060
and
https://github.com/servo/rust-cssparser/pull/112

May help here.

In general, optimizing this is not super urgent, since it is unlikely to impact major design decisions, and is easy to separate out when measuring other parts of performance. Still, we should pick off any low hanging fruit we see sooner rather than later.

Bobby Holley (:bholley)

Reporter

Comment 3

•

7 years ago

Manish says https://github.com/servo/servo/pull/15065 may also help here.

Simon Sapin (:SimonSapin)

Comment 4

•

7 years ago

There’s certainly more to do, but how does this look with https://github.com/servo/rust-cssparser/pull/112 ?

Simon Sapin (:SimonSapin)

Comment 5

•

7 years ago

(https://github.com/servo/rust-cssparser/pull/112 which has now merged and is published on crates.io as version 0.7.3.)

Bobby Holley (:bholley)

Reporter

Comment 6

•

7 years ago

(In reply to Simon Sapin (:SimonSapin) from comment #5)
> (https://github.com/servo/rust-cssparser/pull/112 which has now merged and
> is published on crates.io as version 0.7.3.)

This hasn't hit incubator yet, so we'll need to wait for the next sync to test.

Cameron McCormack (:heycam)

Comment 7

•

7 years ago

This should be on incubator now.

Bobby Holley (:bholley)

Reporter

Comment 8

•

7 years ago

Those patches seem to have helped a lot. Parse performance is now only 3x slower than Gecko (~65ms):
https://clptr.io/2kP22Wz

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Summary: stylo: Servo CSS stylesheet parsing is 6-7 times slower than Gecko's on myspace Talos testcase → stylo: Servo CSS stylesheet parsing is ~3 times slower than Gecko's on myspace Talos testcase

Bobby Holley (:bholley)

Reporter

Comment 9

•

7 years ago

A bottom-up profile of the time in Servo_StyleSheet_FromUTF8Bytes is at [1]. Deduplication tops the list.


[1] https://new.cleopatra.io/public/feb22a9ae08424495559089a5d0d022b920c4d5c/calltree/?callTreeFilters=prefix-0123456789aE9cdefyUaBnijkEamnyZvQ4Q5Q6xD7xD8xD9xDaxJaxDcxDdxDh&invertCallstack&range=0.4828_0.5749&thread=0

Manish Goregaokar [:manishearth]

Comment 10

•

7 years ago

I suspect using with_capacity instead of `let mut deduplicated = Vec::new();` might help the deduplication issue.

In general we need to box the heaviest properties so that we can drastically reduce the impact of a copy here.

Bobby Holley (:bholley)

Reporter

Comment 11

•

7 years ago

Preallocation definitely helps a bit, shaving off about 3ms: https://clptr.io/2kOT63w

But it seems like there's still a fair amount of memmov traffic coming from deduplication.

Manish Goregaokar [:manishearth]

Comment 12

•

7 years ago

Yeah, the memmove traffic is because we're unconditionally copying each decl even when there's no duplication. And our declarations are *large*, we need to box some of them up. It may be preferable to iterate twice instead; the first time only reading the vector.

Manish Goregaokar [:manishearth]

Comment 13

•

7 years ago

Hm, a problem with iterating twice is that that will double the codesize of this function. Though we can make it work with some conditionals if we need to.

Even better would be to have PropertyDeclaration::parse mark seens and whether or not duplicate properties exist, and reuse this data to skip the deduplicate call in the common case.

Manish Goregaokar [:manishearth]

Comment 14

•

7 years ago

Filed https://github.com/servo/servo/issues/15320

Made PR https://github.com/servo/servo/pull/15321 to pull in the easy fix of preallocation

Manish Goregaokar [:manishearth]

Comment 15

•

7 years ago

Oh, also filed https://github.com/servo/servo/issues/15322

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1339894

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Assignee: nobody → bobbyholley

Priority: -- → P1

Simon Sapin (:SimonSapin)

Comment 16

•

7 years ago

One thing that I *guess* could explain some of the parsing performance difference compared to Gecko is the cssparser::Parser::try method. It records the current position in the Unicode input, executes a closure, and if that closure returns an error "rewinds" to the recorded position. This is used a lot in Servo to implement "<foo> | <bar>" alternatives in CSS grammars.

It does make e.g. adding a new property easier, but it has a cost: the tokenzier ends up doing the same work multiple times. We could have a one-token cache, but that feels like a band-aid. Better yet would be to avoid rewinding entirely. Currently, rewinding is sort-of necessary because the only way to access a token is to consume it and advance the current position. If we separate "access the current token" and "move to the next token", we should be able to refactor Servo to not need rewinding, and remove the `try` method.

However taking ownership of a token (and the `Cow<str>`s inside it, to avoid copies) would become more difficult. It might be beneficial to get rid of these `Cow`s and make tokens implement `Copy`. At the moment they only own buffers when a string is not as in the source, because the source contains backslash-escapes. We could replace that with on-demand unescaping (based on Iterator or/and Display), and make cssparser truly zero-copy.

I want to try these ideas but it’s a time investment, especially to port parsing code in Servo.

Boris Zbarsky [:bzbarsky]

Comment 17

•

7 years ago

> We could have a one-token cache

So for what it's worth, Gecko's setup here is that the API looks like this: GetToken/UngetToken.  GetToken gets a token.  UngetToken marks the current token as "not gotten", so the next GetToken call will just return it.  You can't UngetToken twice in a row.  So this is effectively a one-token cache under the hood.

Bobby Holley (:bholley)

Reporter

Comment 18

•

7 years ago

With SimonSapin's recent fixes (and a new machine), I get 23 ms for Gecko and 31 ms for Servo, so Servo is still 35% slower, but we're closing in on it.

I'll dig into the profile a bit more after the call.

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Summary: stylo: Servo CSS stylesheet parsing is ~3 times slower than Gecko's on myspace Talos testcase → stylo: Servo CSS stylesheet parsing is 35% slower than Gecko's on myspace Talos testcase

Bobby Holley (:bholley)

Reporter

Comment 19

•

7 years ago

So I reprofiled, and 97% of the time is spent in parse_qualified_rule [1]. Simon, is that consistent with the analysis in comment 16?


[1] Everything inside is inline, so I can't get more granular than that, even though I inflated the overhead by parsing each stylesheet 100 times.

Flags: needinfo?(simon.sapin)

Simon Sapin (:SimonSapin)

Comment 20

•

7 years ago

> 97% of the time is spent in parse_qualified_rule

That’s not surprising, and unfortunately not very informative. Qualified rules are any rule that doesn’t start with an @-keyword, which includes style rules. So the parsing of the selector list and the declarations for one given style rule is all under one call to parse_qualified_rule.

Flags: needinfo?(simon.sapin)

Bobby Holley (:bholley)

Reporter

Comment 21

•

7 years ago

Here's the updated profile with more stuff out-of-line: https://perfht.ml/2mXhal2

Emilio Cobos Álvarez (:emilio)

Comment 22

•

7 years ago

Seems like a decent amount of it is spent in url parsing. We should try to get that fixed I guess, stuff like https://github.com/servo/servo/issues/13778 is pretty sad.

Bobby Holley (:bholley)

Reporter

Comment 23

•

7 years ago

To test Simon's theory from comment 16, I wrote up a patch to measure the wasted work. The basic idea is that, if the parse callback returns failure, we do the work a second time in an out-of-line function which we can then measure with the profiler.

The patch is here: https://pastebin.mozilla.org/8982050
The resulting profile is here: https://perfht.ml/2nqoweF

This shows us spending 82 ms in wasted_work, and 3012 ms total time in Servo_StyleSheet_FromUTF8Bytes, which is 2930 after subtracting out the redundant set of calls in wasted_work.

This means that the wasted work accounts for about 3% of total parse time, which is about 12% of the total remaining difference between Gecko and Servo. So it's definitely worth fixing, but may not be worth a boil-the-ocean refactor right now.

There's still bigger fish to fry in there somewhere.

Simon Sapin (:SimonSapin)

Comment 24

•

7 years ago

Regarding URL parsing, I maintain that we should skip rust-url entirely for `data:` URLs. Emilio’s attempt at this ended up not landing, but I don’t remember why exactly.

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1347408

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1347435

Simon Sapin (:SimonSapin)

Comment 25

•

7 years ago

> we should skip rust-url entirely for `data:` URLs

… or even for all URLs, in Stylo. (See discussion in Bug 1347435.)

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1347719

Bobby Holley (:bholley)

Reporter

Comment 26

•

7 years ago

Simon is looking at this.

Here's a breakdown I did of a couple of profiles: https://docs.google.com/spreadsheets/d/1TAoakwrYnb337enEzFK3tFPOrpmUJ31hxAaX0kaP_5A/edit?usp=sharing

Some ideas:
* Either shrink tokens or redesign the API to copy them less. Bug 1347408 may or may not play a role here. If it looks like a lot of work to redesign the API, we should try to get confidence beforehand that it's going to help.
* Avoid vec doubling by using a temporary smallvec when parsing property declarations and selectors.

It's also worth considering that, per discussions elsewhere, it's likely better for caching efficiency (during matching) to eliminate the linked list of selectors if we can, and use a Vec instead.

Assignee: bobbyholley → simon.sapin

Simon Sapin (:SimonSapin)

Comment 27

•

7 years ago

Bobby, I reproduced your benchmark at mozilla-central commit aaa0cd3bd620 (hg) / b09c8505d957 (git), and saw Stylo ~14% faster than Gecko. This is on Linux x64 with Intel i7-5600U. Could you try at the same commit, to see how much of the difference with your results is actual Stylo improvements v.s. differences in our setups?

Gecko opt: https://perfht.ml/2osXWVT, 907 ms in mozilla::css::Loader::ParseSheet
    6 runs: 907, 933, 908, 886, 880, 923. Average: 906ms

Stylo opt: https://perfht.ml/2oPZTrR, 818 ms in Servo_StyleSheet_FromUTF8Bytes
    6 runs: 818, 771, 748, 772, 764, 795. Average: 778 ms

With CPU frequency scaling disabled:
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Bobby Holley (:bholley)

Reporter

Comment 28

•

7 years ago

Interesting! My results still reproduce. I'm on a new-gen MBP 2.9Ghz i7-6920HQ, OSX - so generally similar hardware.

The big difference that I can see is that allocation / doubling is basically negligible in your profile, whereas it's quite significant in mine (mostly in doubling when appending things to vecs, hence my suggestion in comment 26). I'm not sure why this is - but we should figure it out, and then try to replicate those characteristics for other platforms.

That said, the overhead in Parser::next looks pretty comparable between my profile and yours (250 on the profile you linked me, whereas mine varies between 230 and 210), so that's definitely still something worth looking into.

Boris Zbarsky [:bzbarsky]

Comment 29

•

7 years ago

Memory allocator performance may vary somewhat widely by OS.  free() is _definitely_ much more expensive on Mac than elsewhere...  I can't speak for malloc().

Bobby Holley (:bholley)

Reporter

Comment 30

•

7 years ago

So, I am seeing:
150ms in malloc_zone_malloc
40ms in malloc_zone_realloc
10ms in free

So free isn't really the bottleneck.

Bobby Holley (:bholley)

Reporter

Comment 31

•

7 years ago

Julian, would also be able to help us look into this? This is in the (sequential) parsing phase, but it's still hot and it would be really good if we could figure out (a) why we're spending so much more time malloc-ing on osx than linux and (b) How to spend less time in malloc on osx.

Flags: needinfo?(jseward)

Simon Sapin (:SimonSapin)

Comment 32

•

7 years ago

Aren’t we using jemalloc on both platforms?

Boris Zbarsky [:bzbarsky]

Comment 33

•

7 years ago

We are.  free() still has to do a bunch of extra work on mac to deal with some things the Mac OS system libraries expect from free() that most other OSes don't have to provide.

I have no idea why _malloc_ would take more time on Mac specifically, offhand.

Bobby Holley (:bholley)

Reporter

Comment 34

•

7 years ago

Manish pointed out in the meeting that it's probably because jemalloc acquires a lot of mutexes, which are very slow on mac (at least IIRC).

So this means that we should probably just focus on reducing the number of calls to malloc/realloc. We can probably do this by:
(a) Using a SmallVec when parsing PDB blocks, which should eliminate the 40ms of realloc overhead, and
(b) Rearrange the Selector representation to avoid the linked list, which should also help with cache locality. See [1]

I'll file dependent bugs.

[1] https://github.com/servo/servo/pull/15890#issuecomment-288279774

Flags: needinfo?(jseward)

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1353239

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1353240

Bobby Holley (:bholley)

Reporter

Comment 35

•

7 years ago

If those three things give us the wins we expect without bad side effects, I imagine we'd have fast enough parser performance to ship without regression, at least on this testcase.

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1354215

Boris Zbarsky [:bzbarsky]

Updated

•

7 years ago

Blocks: 1354566

Manish Goregaokar [:manishearth]

Updated

•

7 years ago

Blocks: stylo

Simon Sapin (:SimonSapin)

Updated

•

7 years ago

Depends on: 1354989

Simon Sapin (:SimonSapin)

Updated

•

7 years ago

Depends on: 1355028

Simon Sapin (:SimonSapin)

Comment 37

•

7 years ago

I’ve submitted https://github.com/servo/servo/pull/16954 which will hopefully reduce `memmove` traffic during parsing.

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Priority: P1 → --

Bobby Holley (:bholley)

Reporter

Comment 38

•

7 years ago

We're still slower at parsing on a lot of the sites that I look at. Simon, what's the current status?

Flags: needinfo?(simon.sapin)

Priority: -- → P2

Simon Sapin (:SimonSapin)

Comment 39

•

7 years ago

My work on this is on hold this week for Houdini and CSSWG meetings which I’m helping host in Paris.

Per bug 1384192 comment 5 I was/will be looking at making line number counting happen during tokenization to avoid looking at every byte again. The tokenizer would maintain the current line number and start position of the current line. These would need to be stored for the purpose of Parser::reset but are not necessary for example for Parser::slice_from. So the SourcePosition / SourceLocation / PreciseParseError types may need a redesign.

Flags: needinfo?(simon.sapin)

Bobby Holley (:bholley)

Reporter

Updated

•

7 years ago

Depends on: 1386398

Bobby Holley (:bholley)

Reporter

Comment 40

•

7 years ago

Per [1] this no longer blocks shipping. We should still pursue promising optimizations when they arise.

[1] https://docs.google.com/document/d/1Tj0HJtJY_KcLdSX-hQCHKx2RrD2sqUYbnXu-3NvRZNU/edit?usp=sharing

Priority: P2 → P4

Chris Peterson [:cpeterson]

Comment 41

•

7 years ago

status-firefox57=wontfix unless someone thinks this bug should block 57

status-firefox57: --- → wontfix

Simon Sapin (:SimonSapin)

Updated

•

3 years ago

Assignee: simon.sapin → nobody

BMO Automation

Updated

•

2 years ago

Severity: normal → S3