I did some quick profiling on Talos based on shing's results in bug 1331578. I focused on the myspace testcase, where Stylo is 155% slower than stock Gecko. One piece of low-hanging fruit I found was CSS parsing overhead. Servo spends 153 ms parsing the stylesheet: https://clptr.io/2iFRZ0k While gecko spends 23 ms parsing the same stylesheet: https://clptr.io/2iFQPSP There's no reason I'm aware of that we shouldn't be able to hit parity here.
Err, this is a better view of the Servo CSS parser profile: https://clptr.io/2iFL2N3
https://github.com/servo/servo/issues/15060 and https://github.com/servo/rust-cssparser/pull/112 May help here. In general, optimizing this is not super urgent, since it is unlikely to impact major design decisions, and is easy to separate out when measuring other parts of performance. Still, we should pick off any low hanging fruit we see sooner rather than later.
Manish says https://github.com/servo/servo/pull/15065 may also help here.
There’s certainly more to do, but how does this look with https://github.com/servo/rust-cssparser/pull/112 ?
(https://github.com/servo/rust-cssparser/pull/112 which has now merged and is published on crates.io as version 0.7.3.)
(In reply to Simon Sapin (:SimonSapin) from comment #5) > (https://github.com/servo/rust-cssparser/pull/112 which has now merged and > is published on crates.io as version 0.7.3.) This hasn't hit incubator yet, so we'll need to wait for the next sync to test.
This should be on incubator now.
Those patches seem to have helped a lot. Parse performance is now only 3x slower than Gecko (~65ms): https://clptr.io/2kP22Wz
A bottom-up profile of the time in Servo_StyleSheet_FromUTF8Bytes is at . Deduplication tops the list.  https://new.cleopatra.io/public/feb22a9ae08424495559089a5d0d022b920c4d5c/calltree/?callTreeFilters=prefix-0123456789aE9cdefyUaBnijkEamnyZvQ4Q5Q6xD7xD8xD9xDaxJaxDcxDdxDh&invertCallstack&range=0.4828_0.5749&thread=0
I suspect using with_capacity instead of `let mut deduplicated = Vec::new();` might help the deduplication issue. In general we need to box the heaviest properties so that we can drastically reduce the impact of a copy here.
Preallocation definitely helps a bit, shaving off about 3ms: https://clptr.io/2kOT63w But it seems like there's still a fair amount of memmov traffic coming from deduplication.
Yeah, the memmove traffic is because we're unconditionally copying each decl even when there's no duplication. And our declarations are *large*, we need to box some of them up. It may be preferable to iterate twice instead; the first time only reading the vector.
Hm, a problem with iterating twice is that that will double the codesize of this function. Though we can make it work with some conditionals if we need to. Even better would be to have PropertyDeclaration::parse mark seens and whether or not duplicate properties exist, and reuse this data to skip the deduplicate call in the common case.
Filed https://github.com/servo/servo/issues/15320 Made PR https://github.com/servo/servo/pull/15321 to pull in the easy fix of preallocation
Oh, also filed https://github.com/servo/servo/issues/15322
One thing that I *guess* could explain some of the parsing performance difference compared to Gecko is the cssparser::Parser::try method. It records the current position in the Unicode input, executes a closure, and if that closure returns an error "rewinds" to the recorded position. This is used a lot in Servo to implement "<foo> | <bar>" alternatives in CSS grammars. It does make e.g. adding a new property easier, but it has a cost: the tokenzier ends up doing the same work multiple times. We could have a one-token cache, but that feels like a band-aid. Better yet would be to avoid rewinding entirely. Currently, rewinding is sort-of necessary because the only way to access a token is to consume it and advance the current position. If we separate "access the current token" and "move to the next token", we should be able to refactor Servo to not need rewinding, and remove the `try` method. However taking ownership of a token (and the `Cow<str>`s inside it, to avoid copies) would become more difficult. It might be beneficial to get rid of these `Cow`s and make tokens implement `Copy`. At the moment they only own buffers when a string is not as in the source, because the source contains backslash-escapes. We could replace that with on-demand unescaping (based on Iterator or/and Display), and make cssparser truly zero-copy. I want to try these ideas but it’s a time investment, especially to port parsing code in Servo.
> We could have a one-token cache So for what it's worth, Gecko's setup here is that the API looks like this: GetToken/UngetToken. GetToken gets a token. UngetToken marks the current token as "not gotten", so the next GetToken call will just return it. You can't UngetToken twice in a row. So this is effectively a one-token cache under the hood.
With SimonSapin's recent fixes (and a new machine), I get 23 ms for Gecko and 31 ms for Servo, so Servo is still 35% slower, but we're closing in on it. I'll dig into the profile a bit more after the call.
So I reprofiled, and 97% of the time is spent in parse_qualified_rule . Simon, is that consistent with the analysis in comment 16?  Everything inside is inline, so I can't get more granular than that, even though I inflated the overhead by parsing each stylesheet 100 times.
> 97% of the time is spent in parse_qualified_rule That’s not surprising, and unfortunately not very informative. Qualified rules are any rule that doesn’t start with an @-keyword, which includes style rules. So the parsing of the selector list and the declarations for one given style rule is all under one call to parse_qualified_rule.
Here's the updated profile with more stuff out-of-line: https://perfht.ml/2mXhal2
Seems like a decent amount of it is spent in url parsing. We should try to get that fixed I guess, stuff like https://github.com/servo/servo/issues/13778 is pretty sad.
To test Simon's theory from comment 16, I wrote up a patch to measure the wasted work. The basic idea is that, if the parse callback returns failure, we do the work a second time in an out-of-line function which we can then measure with the profiler. The patch is here: https://pastebin.mozilla.org/8982050 The resulting profile is here: https://perfht.ml/2nqoweF This shows us spending 82 ms in wasted_work, and 3012 ms total time in Servo_StyleSheet_FromUTF8Bytes, which is 2930 after subtracting out the redundant set of calls in wasted_work. This means that the wasted work accounts for about 3% of total parse time, which is about 12% of the total remaining difference between Gecko and Servo. So it's definitely worth fixing, but may not be worth a boil-the-ocean refactor right now. There's still bigger fish to fry in there somewhere.
Regarding URL parsing, I maintain that we should skip rust-url entirely for `data:` URLs. Emilio’s attempt at this ended up not landing, but I don’t remember why exactly.
> we should skip rust-url entirely for `data:` URLs … or even for all URLs, in Stylo. (See discussion in Bug 1347435.)
Simon is looking at this. Here's a breakdown I did of a couple of profiles: https://docs.google.com/spreadsheets/d/1TAoakwrYnb337enEzFK3tFPOrpmUJ31hxAaX0kaP_5A/edit?usp=sharing Some ideas: * Either shrink tokens or redesign the API to copy them less. Bug 1347408 may or may not play a role here. If it looks like a lot of work to redesign the API, we should try to get confidence beforehand that it's going to help. * Avoid vec doubling by using a temporary smallvec when parsing property declarations and selectors. It's also worth considering that, per discussions elsewhere, it's likely better for caching efficiency (during matching) to eliminate the linked list of selectors if we can, and use a Vec instead.
Bobby, I reproduced your benchmark at mozilla-central commit aaa0cd3bd620 (hg) / b09c8505d957 (git), and saw Stylo ~14% faster than Gecko. This is on Linux x64 with Intel i7-5600U. Could you try at the same commit, to see how much of the difference with your results is actual Stylo improvements v.s. differences in our setups? Gecko opt: https://perfht.ml/2osXWVT, 907 ms in mozilla::css::Loader::ParseSheet 6 runs: 907, 933, 908, 886, 880, 923. Average: 906ms Stylo opt: https://perfht.ml/2oPZTrR, 818 ms in Servo_StyleSheet_FromUTF8Bytes 6 runs: 818, 771, 748, 772, 764, 795. Average: 778 ms With CPU frequency scaling disabled: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Interesting! My results still reproduce. I'm on a new-gen MBP 2.9Ghz i7-6920HQ, OSX - so generally similar hardware. The big difference that I can see is that allocation / doubling is basically negligible in your profile, whereas it's quite significant in mine (mostly in doubling when appending things to vecs, hence my suggestion in comment 26). I'm not sure why this is - but we should figure it out, and then try to replicate those characteristics for other platforms. That said, the overhead in Parser::next looks pretty comparable between my profile and yours (250 on the profile you linked me, whereas mine varies between 230 and 210), so that's definitely still something worth looking into.
Memory allocator performance may vary somewhat widely by OS. free() is _definitely_ much more expensive on Mac than elsewhere... I can't speak for malloc().
So, I am seeing: 150ms in malloc_zone_malloc 40ms in malloc_zone_realloc 10ms in free So free isn't really the bottleneck.
Julian, would also be able to help us look into this? This is in the (sequential) parsing phase, but it's still hot and it would be really good if we could figure out (a) why we're spending so much more time malloc-ing on osx than linux and (b) How to spend less time in malloc on osx.
Aren’t we using jemalloc on both platforms?
We are. free() still has to do a bunch of extra work on mac to deal with some things the Mac OS system libraries expect from free() that most other OSes don't have to provide. I have no idea why _malloc_ would take more time on Mac specifically, offhand.
Manish pointed out in the meeting that it's probably because jemalloc acquires a lot of mutexes, which are very slow on mac (at least IIRC). So this means that we should probably just focus on reducing the number of calls to malloc/realloc. We can probably do this by: (a) Using a SmallVec when parsing PDB blocks, which should eliminate the 40ms of realloc overhead, and (b) Rearrange the Selector representation to avoid the linked list, which should also help with cache locality. See  I'll file dependent bugs.  https://github.com/servo/servo/pull/15890#issuecomment-288279774
If those three things give us the wins we expect without bad side effects, I imagine we'd have fast enough parser performance to ship without regression, at least on this testcase.
20 days ago