Open
Bug 1331843
Opened 7 years ago
Updated 2 years ago
stylo: Servo CSS stylesheet parsing is 35% slower than Gecko's on myspace Talos testcase
Categories
(Core :: CSS Parsing and Computation, defect, P4)
Tracking
NEW

| | Tracking | Status |
|---|---|---|
| firefox57 | --- | wontfix |
People
(Reporter: bholley, Unassigned)
References
(Blocks 2 open bugs)
Details
I did some quick profiling on Talos based on shing's results in bug 1331578. I focused on the myspace testcase, where Stylo is 155% slower than stock Gecko. One piece of low-hanging fruit I found was CSS parsing overhead.

Servo spends 153 ms parsing the stylesheet: https://clptr.io/2iFRZ0k

Gecko spends 23 ms parsing the same stylesheet: https://clptr.io/2iFQPSP

There's no reason I'm aware of that we shouldn't be able to hit parity here.
Reporter
Comment 1•7 years ago
Err, this is a better view of the Servo CSS parser profile: https://clptr.io/2iFL2N3
Reporter
Comment 2•7 years ago
https://github.com/servo/servo/issues/15060 and https://github.com/servo/rust-cssparser/pull/112 may help here. In general, optimizing this is not super urgent, since it is unlikely to impact major design decisions, and is easy to separate out when measuring other parts of performance. Still, we should pick off any low-hanging fruit we see sooner rather than later.
Reporter
Comment 3•7 years ago
Manish says https://github.com/servo/servo/pull/15065 may also help here.
Comment 4•7 years ago
There’s certainly more to do, but how does this look with https://github.com/servo/rust-cssparser/pull/112 ?
Comment 5•7 years ago
(https://github.com/servo/rust-cssparser/pull/112 which has now merged and is published on crates.io as version 0.7.3.)
Reporter
Comment 6•7 years ago
(In reply to Simon Sapin (:SimonSapin) from comment #5)
> (https://github.com/servo/rust-cssparser/pull/112 which has now merged and is published on crates.io as version 0.7.3.)

This hasn't hit incubator yet, so we'll need to wait for the next sync to test.
Comment 7•7 years ago
This should be on incubator now.
Reporter
Comment 8•7 years ago
Those patches seem to have helped a lot. Parse performance is now only 3x slower than Gecko (~65ms): https://clptr.io/2kP22Wz
Reporter
Updated•7 years ago
Summary: stylo: Servo CSS stylesheet parsing is 6-7 times slower than Gecko's on myspace Talos testcase → stylo: Servo CSS stylesheet parsing is ~3 times slower than Gecko's on myspace Talos testcase
Reporter
Comment 9•7 years ago
A bottom-up profile of the time in Servo_StyleSheet_FromUTF8Bytes is at [1]. Deduplication tops the list. [1] https://new.cleopatra.io/public/feb22a9ae08424495559089a5d0d022b920c4d5c/calltree/?callTreeFilters=prefix-0123456789aE9cdefyUaBnijkEamnyZvQ4Q5Q6xD7xD8xD9xDaxJaxDcxDdxDh&invertCallstack&range=0.4828_0.5749&thread=0
Comment 10•7 years ago
I suspect using with_capacity instead of `let mut deduplicated = Vec::new();` might help the deduplication issue. In general we need to box the heaviest properties so that we can drastically reduce the impact of a copy here.
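A minimal sketch of the preallocation idea (the function name and the use of `u32` as a stand-in for real property declarations are hypothetical; Servo's actual deduplication works over `PropertyDeclaration` values):

```rust
// Hypothetical sketch: preallocate the output vector so deduplication
// does not trigger repeated grow-and-copy cycles in the allocator.
fn deduplicate(declarations: &[u32]) -> Vec<u32> {
    // `with_capacity` reserves space for the worst case (no duplicates)
    // up front, instead of starting empty and doubling on each overflow.
    let mut deduplicated = Vec::with_capacity(declarations.len());
    for &decl in declarations {
        if !deduplicated.contains(&decl) {
            deduplicated.push(decl);
        }
    }
    deduplicated
}
```

With `Vec::new()`, pushing n items can cost O(log n) reallocations plus the memmove traffic of each doubling; `with_capacity` trades a slightly larger worst-case allocation for a single allocator call.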
Reporter
Comment 11•7 years ago
Preallocation definitely helps a bit, shaving off about 3ms: https://clptr.io/2kOT63w But it seems like there's still a fair amount of memmove traffic coming from deduplication.
Comment 12•7 years ago
Yeah, the memmove traffic is because we're unconditionally copying each decl even when there's no duplication. And our declarations are *large*, we need to box some of them up. It may be preferable to iterate twice instead; the first time only reading the vector.
Comment 13•7 years ago
Hm, a problem with iterating twice is that it will double the code size of this function. Though we can make it work with some conditionals if we need to. Even better would be to have PropertyDeclaration::parse mark which properties have been seen and whether any duplicate properties exist, and reuse that data to skip the deduplicate call in the common case.
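The "mark seen properties" idea could look something like this sketch (the type and its names are hypothetical, and it assumes property ids fit in a 64-bit mask; Servo's real property set is larger and would need a wider bitset):

```rust
// Hypothetical sketch of comment 13's idea: record which property ids were
// seen during parsing, and only run the expensive deduplication pass when a
// duplicate was actually observed (the rare case).
struct SeenTracker {
    seen: u64,            // one bit per property id; assumes < 64 ids here
    any_duplicate: bool,  // set as soon as a property repeats
}

impl SeenTracker {
    fn new() -> Self {
        SeenTracker { seen: 0, any_duplicate: false }
    }

    // Called once per parsed declaration.
    fn record(&mut self, property_id: u8) {
        let bit = 1u64 << property_id;
        if self.seen & bit != 0 {
            self.any_duplicate = true; // a dedup pass will be needed
        }
        self.seen |= bit;
    }

    // In the common case this returns false and dedup is skipped entirely.
    fn needs_dedup(&self) -> bool {
        self.any_duplicate
    }
}
```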
Comment 14•7 years ago
Filed https://github.com/servo/servo/issues/15320
Made PR https://github.com/servo/servo/pull/15321 to pull in the easy fix of preallocation.
Comment 15•7 years ago
Oh, also filed https://github.com/servo/servo/issues/15322
Reporter
Updated•7 years ago
Assignee: nobody → bobbyholley
Priority: -- → P1
Comment 16•7 years ago
One thing that I *guess* could explain some of the parsing performance difference compared to Gecko is the cssparser::Parser::try method. It records the current position in the Unicode input, executes a closure, and if that closure returns an error "rewinds" to the recorded position. This is used a lot in Servo to implement "<foo> | <bar>" alternatives in CSS grammars. It does make e.g. adding a new property easier, but it has a cost: the tokenizer ends up doing the same work multiple times.

We could have a one-token cache, but that feels like a band-aid. Better yet would be to avoid rewinding entirely. Currently, rewinding is sort-of necessary because the only way to access a token is to consume it and advance the current position. If we separate "access the current token" from "move to the next token", we should be able to refactor Servo to not need rewinding, and remove the `try` method.

However, taking ownership of a token (and the `Cow<str>`s inside it, to avoid copies) would become more difficult. It might be beneficial to get rid of these `Cow`s and make tokens implement `Copy`. At the moment they only own buffers when a string is not as in the source, because the source contains backslash-escapes. We could replace that with on-demand unescaping (based on Iterator and/or Display), and make cssparser truly zero-copy.

I want to try these ideas but it's a time investment, especially to port parsing code in Servo.
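The "separate access from advance" idea can be sketched with a tiny peekable tokenizer (all names here are hypothetical, and the token stream is pre-lexed for brevity; real cssparser tokenizes lazily from the source text):

```rust
// Hypothetical sketch of comment 16: dispatch `<foo> | <bar>` alternatives
// by *peeking* at the current token, instead of consuming it inside a
// closure, failing, and rewinding (which re-tokenizes the same input).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Token { Ident, Number, Comma }

struct Tokens<'a> {
    input: &'a [Token],
    pos: usize,
}

impl<'a> Tokens<'a> {
    fn new(input: &'a [Token]) -> Self {
        Tokens { input, pos: 0 }
    }
    // Access the current token without consuming it.
    fn peek(&self) -> Option<Token> {
        self.input.get(self.pos).copied()
    }
    // Move past the current token.
    fn advance(&mut self) {
        self.pos += 1;
    }
}

// No saved position, no closure, no rewind: the branch is decided up front.
fn parse_value(tokens: &mut Tokens) -> Result<&'static str, ()> {
    match tokens.peek() {
        Some(Token::Ident) => { tokens.advance(); Ok("ident") }
        Some(Token::Number) => { tokens.advance(); Ok("number") }
        _ => Err(()),
    }
}
```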
Comment 17•7 years ago
> We could have a one-token cache
So for what it's worth, Gecko's setup here is that the API looks like this: GetToken/UngetToken. GetToken gets a token. UngetToken marks the current token as "not gotten", so the next GetToken call will just return it. You can't call UngetToken twice in a row. So this is effectively a one-token cache under the hood.
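A minimal sketch of that GetToken/UngetToken discipline (the `Scanner` type is hypothetical, and tokens are simplified to integers; Gecko's real scanner is C++):

```rust
// Hypothetical sketch of Gecko's one-token cache from comment 17:
// UngetToken marks the last token as "not gotten", and the next GetToken
// re-delivers it. Calling UngetToken twice in a row is not allowed.
struct Scanner {
    data: Vec<i32>,
    pos: usize,
    ungot: bool, // true when the last token has been "ungotten"
}

impl Scanner {
    fn new(data: Vec<i32>) -> Self {
        Scanner { data, pos: 0, ungot: false }
    }

    fn get_token(&mut self) -> Option<i32> {
        if self.ungot {
            self.ungot = false;
            // Re-deliver the cached (previous) token without re-scanning.
            return self.data.get(self.pos - 1).copied();
        }
        let tok = self.data.get(self.pos).copied();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }

    fn unget_token(&mut self) {
        assert!(!self.ungot, "cannot UngetToken twice in a row");
        assert!(self.pos > 0, "no token to unget");
        self.ungot = true;
    }
}
```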
Reporter
Comment 18•7 years ago
With SimonSapin's recent fixes (and a new machine), I get 23 ms for Gecko and 31 ms for Servo, so Servo is still 35% slower, but we're closing in on it. I'll dig into the profile a bit more after the call.
Reporter
Updated•7 years ago
Summary: stylo: Servo CSS stylesheet parsing is ~3 times slower than Gecko's on myspace Talos testcase → stylo: Servo CSS stylesheet parsing is 35% slower than Gecko's on myspace Talos testcase
Reporter
Comment 19•7 years ago
So I reprofiled, and 97% of the time is spent in parse_qualified_rule [1]. Simon, is that consistent with the analysis in comment 16? [1] Everything inside is inline, so I can't get more granular than that, even though I inflated the overhead by parsing each stylesheet 100 times.
Flags: needinfo?(simon.sapin)
Comment 20•7 years ago
> 97% of the time is spent in parse_qualified_rule
That’s not surprising, and unfortunately not very informative. Qualified rules are any rule that doesn’t start with an @-keyword, which includes style rules. So the parsing of the selector list and the declarations for one given style rule is all under one call to parse_qualified_rule.
Flags: needinfo?(simon.sapin)
Reporter
Comment 21•7 years ago
Here's the updated profile with more stuff out-of-line: https://perfht.ml/2mXhal2
Comment 22•7 years ago
Seems like a decent amount of it is spent in url parsing. We should try to get that fixed I guess, stuff like https://github.com/servo/servo/issues/13778 is pretty sad.
Reporter
Comment 23•7 years ago
To test Simon's theory from comment 16, I wrote up a patch to measure the wasted work. The basic idea is that, if the parse callback returns failure, we do the work a second time in an out-of-line function which we can then measure with the profiler.

The patch is here: https://pastebin.mozilla.org/8982050
The resulting profile is here: https://perfht.ml/2nqoweF

This shows us spending 82 ms in wasted_work, and 3012 ms total time in Servo_StyleSheet_FromUTF8Bytes, which is 2930 ms after subtracting out the redundant set of calls in wasted_work. This means that the wasted work accounts for about 3% of total parse time, which is about 12% of the total remaining difference between Gecko and Servo. So it's definitely worth fixing, but may not be worth a boil-the-ocean refactor right now. There's still bigger fish to fry in there somewhere.
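The measurement trick described above can be sketched as follows (all names are hypothetical, and a trivial integer parse stands in for CSS parsing; the actual patch is in the pastebin link):

```rust
// Hypothetical sketch of comment 23's measurement: when a parse attempt
// fails, redo it in an #[inline(never)] function so a sampling profiler
// attributes the duplicated ("wasted") work to a single visible symbol.
fn try_parse(input: &str) -> Result<u32, ()> {
    input.parse::<u32>().map_err(|_| ())
}

#[inline(never)]
fn wasted_work(input: &str) -> Result<u32, ()> {
    try_parse(input)
}

fn parse_with_measurement(input: &str) -> Result<u32, ()> {
    let result = try_parse(input);
    if result.is_err() {
        // Repeat the failed work out of line, purely so it shows up in
        // profiles; the result is discarded.
        let _ = wasted_work(input);
    }
    result
}
```

Subtracting the out-of-line function's self time from the total then gives an upper bound on how much the rewinding actually costs.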
Comment 24•7 years ago
Regarding URL parsing, I maintain that we should skip rust-url entirely for `data:` URLs. Emilio’s attempt at this ended up not landing, but I don’t remember why exactly.
Comment 25•7 years ago
> we should skip rust-url entirely for `data:` URLs

… or even for all URLs, in Stylo. (See discussion in Bug 1347435.)
Reporter
Comment 26•7 years ago
Simon is looking at this. Here's a breakdown I did of a couple of profiles: https://docs.google.com/spreadsheets/d/1TAoakwrYnb337enEzFK3tFPOrpmUJ31hxAaX0kaP_5A/edit?usp=sharing

Some ideas:
* Either shrink tokens or redesign the API to copy them less. Bug 1347408 may or may not play a role here. If it looks like a lot of work to redesign the API, we should try to get confidence beforehand that it's going to help.
* Avoid vec doubling by using a temporary smallvec when parsing property declarations and selectors.

It's also worth considering that, per discussions elsewhere, it's likely better for caching efficiency (during matching) to eliminate the linked list of selectors if we can, and use a Vec instead.
Assignee: bobbyholley → simon.sapin
Comment 27•7 years ago
Bobby, I reproduced your benchmark at mozilla-central commit aaa0cd3bd620 (hg) / b09c8505d957 (git), and saw Stylo ~14% faster than Gecko. This is on Linux x64 with an Intel i7-5600U. Could you try at the same commit, to see how much of the difference with your results is actual Stylo improvement vs. differences in our setups?

Gecko opt: https://perfht.ml/2osXWVT, 907 ms in mozilla::css::Loader::ParseSheet
6 runs: 907, 933, 908, 886, 880, 923. Average: 906 ms

Stylo opt: https://perfht.ml/2oPZTrR, 818 ms in Servo_StyleSheet_FromUTF8Bytes
6 runs: 818, 771, 748, 772, 764, 795. Average: 778 ms

With CPU frequency scaling disabled: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Reporter
Comment 28•7 years ago
Interesting! My results still reproduce. I'm on a new-gen MBP 2.9Ghz i7-6920HQ, OSX - so generally similar hardware. The big difference that I can see is that allocation / doubling is basically negligible in your profile, whereas it's quite significant in mine (mostly in doubling when appending things to vecs, hence my suggestion in comment 26). I'm not sure why this is - but we should figure it out, and then try to replicate those characteristics for other platforms. That said, the overhead in Parser::next looks pretty comparable between my profile and yours (250 on the profile you linked me, whereas mine varies between 230 and 210), so that's definitely still something worth looking into.
Comment 29•7 years ago
Memory allocator performance may vary somewhat widely by OS. free() is _definitely_ much more expensive on Mac than elsewhere... I can't speak for malloc().
Reporter
Comment 30•7 years ago
So, I am seeing:
150 ms in malloc_zone_malloc
40 ms in malloc_zone_realloc
10 ms in free

So free isn't really the bottleneck.
Reporter
Comment 31•7 years ago
Julian, would you also be able to help us look into this? This is in the (sequential) parsing phase, but it's still hot, and it would be really good if we could figure out (a) why we're spending so much more time malloc-ing on OSX than Linux, and (b) how to spend less time in malloc on OSX.
Flags: needinfo?(jseward)
Comment 32•7 years ago
Aren’t we using jemalloc on both platforms?
Comment 33•7 years ago
We are. free() still has to do a bunch of extra work on mac to deal with some things the Mac OS system libraries expect from free() that most other OSes don't have to provide. I have no idea why _malloc_ would take more time on Mac specifically, offhand.
Reporter
Comment 34•7 years ago
Manish pointed out in the meeting that it's probably because jemalloc acquires a lot of mutexes, which are very slow on Mac (at least IIRC). So this means that we should probably just focus on reducing the number of calls to malloc/realloc. We can probably do this by:
(a) Using a SmallVec when parsing PDB blocks, which should eliminate the 40ms of realloc overhead, and
(b) Rearranging the Selector representation to avoid the linked list, which should also help with cache locality. See [1].

I'll file dependent bugs.

[1] https://github.com/servo/servo/pull/15890#issuecomment-288279774
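A std-only illustration of the SmallVec idea in (a) (the `DeclBuf` type, its capacity, and the use of `u32` as a stand-in for declarations are all hypothetical; the real code would use the `smallvec` crate rather than hand-rolling this):

```rust
// Hypothetical sketch: accumulate a declaration block in a stack buffer,
// touching the heap allocator only when a block overflows the inline
// capacity. For small blocks (the common case) this makes zero malloc or
// realloc calls.
const INLINE: usize = 8;

struct DeclBuf {
    inline: [u32; INLINE], // stack storage for the common case
    len: usize,            // total number of entries pushed
    spill: Vec<u32>,       // heap storage, used only past INLINE entries
}

impl DeclBuf {
    fn new() -> Self {
        DeclBuf { inline: [0; INLINE], len: 0, spill: Vec::new() }
    }

    fn push(&mut self, decl: u32) {
        if self.len < INLINE {
            self.inline[self.len] = decl; // no allocator call
        } else {
            self.spill.push(decl); // rare case: heap allocation
        }
        self.len += 1;
    }

    // Produce the final heap vector with a single, exact-sized allocation.
    fn to_vec(&self) -> Vec<u32> {
        let mut out = Vec::with_capacity(self.len);
        out.extend_from_slice(&self.inline[..self.len.min(INLINE)]);
        out.extend_from_slice(&self.spill);
        out
    }
}
```

This is exactly the trade the comment describes: fewer trips through jemalloc (and its mutexes) at the cost of a little stack space per parse.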
Flags: needinfo?(jseward)
Reporter
Comment 35•7 years ago
If those three things give us the wins we expect without bad side effects, I imagine we'd have fast enough parser performance to ship without regression, at least on this testcase.
Comment 37•7 years ago
I’ve submitted https://github.com/servo/servo/pull/16954 which will hopefully reduce `memmove` traffic during parsing.
Reporter
Updated•7 years ago
Priority: P1 → --
Reporter
Comment 38•7 years ago
We're still slower at parsing on a lot of the sites that I look at. Simon, what's the current status?
Flags: needinfo?(simon.sapin)
Priority: -- → P2
Comment 39•7 years ago
My work on this is on hold this week for Houdini and CSSWG meetings, which I'm helping host in Paris.

Per bug 1384192 comment 5, I was/will be looking at making line number counting happen during tokenization, to avoid looking at every byte again. The tokenizer would maintain the current line number and the start position of the current line. These would need to be stored for the purpose of Parser::reset, but are not necessary for, e.g., Parser::slice_from. So the SourcePosition / SourceLocation / PreciseParseError types may need a redesign.
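The line-counting-during-tokenization idea can be sketched like this (the `Tokenizer` shown is hypothetical and advances byte by byte for clarity; a real tokenizer would fold this into its existing byte loop and handle `\r` and form feeds too):

```rust
// Hypothetical sketch of comment 39: the tokenizer maintains the current
// line number and the byte offset where that line starts, so a source
// location is available in O(1) without re-scanning every byte.
struct Tokenizer<'a> {
    input: &'a str,
    pos: usize,
    current_line: u32, // incremented as newlines are consumed
    line_start: usize, // byte offset where the current line begins
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        Tokenizer { input, pos: 0, current_line: 1, line_start: 0 }
    }

    // Consume one byte, updating line bookkeeping as a side effect.
    fn advance_byte(&mut self) {
        if self.input.as_bytes()[self.pos] == b'\n' {
            self.current_line += 1;
            self.line_start = self.pos + 1;
        }
        self.pos += 1;
    }

    // (line, column) of the current position, both 1-based, in O(1).
    fn location(&self) -> (u32, u32) {
        (self.current_line, (self.pos - self.line_start) as u32 + 1)
    }
}
```

Rewinding via something like Parser::reset would then have to restore `(pos, current_line, line_start)` together, which is why the position types might need a redesign.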
Flags: needinfo?(simon.sapin)
Reporter
Comment 40•7 years ago
Per [1] this no longer blocks shipping. We should still pursue promising optimizations when they arise. [1] https://docs.google.com/document/d/1Tj0HJtJY_KcLdSX-hQCHKx2RrD2sqUYbnXu-3NvRZNU/edit?usp=sharing
Priority: P2 → P4
Comment 41•7 years ago
status-firefox57=wontfix unless someone thinks this bug should block 57
status-firefox57: --- → wontfix
Updated•3 years ago
Assignee: simon.sapin → nobody
Updated•2 years ago
Severity: normal → S3