I did some quick profiling on Talos based on shing's results in bug 1331578. I focused on the myspace testcase, where Stylo is 155% slower than stock Gecko. One piece of low-hanging fruit I found was CSS parsing overhead. Servo spends 153 ms parsing the stylesheet (https://clptr.io/2iFRZ0k), while Gecko spends 23 ms parsing the same stylesheet (https://clptr.io/2iFQPSP). There's no reason I'm aware of that we shouldn't be able to hit parity here.
Err, this is a better view of the Servo CSS parser profile: https://clptr.io/2iFL2N3
https://github.com/servo/servo/issues/15060 and https://github.com/servo/rust-cssparser/pull/112 may help here. In general, optimizing this is not super urgent, since it is unlikely to impact major design decisions, and it is easy to separate out when measuring other parts of performance. Still, we should pick off any low-hanging fruit we see sooner rather than later.
Manish says https://github.com/servo/servo/pull/15065 may also help here.
There’s certainly more to do, but how does this look with https://github.com/servo/rust-cssparser/pull/112 ?
(https://github.com/servo/rust-cssparser/pull/112 which has now merged and is published on crates.io as version 0.7.3.)
(In reply to Simon Sapin (:SimonSapin) from comment #5) > (https://github.com/servo/rust-cssparser/pull/112 which has now merged and > is published on crates.io as version 0.7.3.) This hasn't hit incubator yet, so we'll need to wait for the next sync to test.
This should be on incubator now.
Those patches seem to have helped a lot. Parse performance is now only 3x slower than Gecko (~65ms): https://clptr.io/2kP22Wz
A bottom-up profile of the time in Servo_StyleSheet_FromUTF8Bytes is at https://new.cleopatra.io/public/feb22a9ae08424495559089a5d0d022b920c4d5c/calltree/?callTreeFilters=prefix-0123456789aE9cdefyUaBnijkEamnyZvQ4Q5Q6xD7xD8xD9xDaxJaxDcxDdxDh&invertCallstack&range=0.4828_0.5749&thread=0 Deduplication tops the list.
I suspect using `with_capacity` instead of `let mut deduplicated = Vec::new();` might help the deduplication issue. In general, we need to box the heaviest properties so that we can drastically reduce the cost of a copy here.
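To illustrate the preallocation idea: a minimal sketch, where `Declaration` is a stand-in tuple rather than Servo's actual `PropertyDeclaration`, and `deduplicate` is an illustrative function, not the real code. Reserving room for the worst case (no duplicates) up front avoids the repeated grow-and-memmove that an empty `Vec::new()` incurs as it fills.

```rust
// Hypothetical stand-in for a property declaration: (property id, value).
type Declaration = (u32, u32);

// Illustrative dedup sketch: keep one entry per property id, letting a
// later declaration overwrite an earlier one for the same property.
fn deduplicate(declarations: &[Declaration]) -> Vec<Declaration> {
    // Worst case is no duplicates, so reserve room for every declaration
    // up front instead of growing (and memmove-ing) the vector repeatedly.
    let mut deduplicated: Vec<Declaration> = Vec::with_capacity(declarations.len());
    for &decl in declarations {
        if let Some(existing) = deduplicated.iter_mut().find(|d| d.0 == decl.0) {
            *existing = decl; // later declaration for the same property wins
        } else {
            deduplicated.push(decl);
        }
    }
    deduplicated
}
```

Note this only removes reallocation traffic; each declaration is still copied once, which is why boxing the large ones still matters.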
Preallocation definitely helps a bit, shaving off about 3 ms: https://clptr.io/2kOT63w But it seems like there's still a fair amount of memmove traffic coming from deduplication.
Yeah, the memmove traffic is because we're unconditionally copying each declaration even when there's no duplication. And our declarations are *large*; we need to box some of them up. It may be preferable to iterate twice instead, with the first pass only reading the vector.
Hm, a problem with iterating twice is that it will double the code size of this function. Though we can make it work with some conditionals if we need to. Even better would be to have PropertyDeclaration::parse mark which properties have been seen and whether any duplicate properties exist, and reuse that data to skip the deduplicate call in the common case.
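A sketch of that last idea, with made-up names (`DeclarationBlock`, the tuple `Declaration` type, and the `seen` set are all stand-ins, not Servo's real types): record at parse time whether any property id repeats, so the deduplication pass (and all its copying) only runs when a duplicate was actually seen.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a property declaration: (property id, value).
type Declaration = (u32, u32);

struct DeclarationBlock {
    declarations: Vec<Declaration>,
    has_duplicates: bool,
}

impl DeclarationBlock {
    fn new() -> Self {
        DeclarationBlock { declarations: Vec::new(), has_duplicates: false }
    }

    // Record each declaration as it is parsed; `seen` tracks property ids
    // so we know up front whether deduplication will be needed at all.
    fn push(&mut self, decl: Declaration, seen: &mut HashSet<u32>) {
        if !seen.insert(decl.0) {
            self.has_duplicates = true;
        }
        self.declarations.push(decl);
    }

    // Only pay the deduplication cost when a duplicate was actually seen.
    fn deduplicate(&mut self) {
        if !self.has_duplicates {
            return; // common case: no copying at all
        }
        // Keep the *last* occurrence of each property, since later
        // declarations win.
        let mut kept = HashSet::new();
        let mut result = Vec::with_capacity(self.declarations.len());
        for &decl in self.declarations.iter().rev() {
            if kept.insert(decl.0) {
                result.push(decl);
            }
        }
        result.reverse();
        self.declarations = result;
    }
}
```

The early return is the point: stylesheets with no duplicate declarations in a block (the overwhelmingly common case) never touch the dedup path.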
Filed https://github.com/servo/servo/issues/15320 Made PR https://github.com/servo/servo/pull/15321 to pull in the easy fix of preallocation
Oh, also filed https://github.com/servo/servo/issues/15322
One thing that I *guess* could explain some of the parsing performance difference compared to Gecko is the cssparser::Parser::try method. It records the current position in the Unicode input, executes a closure, and if that closure returns an error, "rewinds" to the recorded position. This is used a lot in Servo to implement "<foo> | <bar>" alternatives in CSS grammars. It does make e.g. adding a new property easier, but it has a cost: the tokenizer ends up doing the same work multiple times.

We could have a one-token cache, but that feels like a band-aid. Better yet would be to avoid rewinding entirely. Currently, rewinding is sort of necessary because the only way to access a token is to consume it and advance the current position. If we separate "access the current token" from "move to the next token", we should be able to refactor Servo to not need rewinding, and remove the `try` method. However, taking ownership of a token (and the `Cow<str>`s inside it, to avoid copies) would become more difficult.

It might be beneficial to get rid of these `Cow`s and make tokens implement `Copy`. At the moment they only own buffers when a string is not exactly as it appears in the source, because the source contains backslash escapes. We could replace that with on-demand unescaping (based on Iterator and/or Display), and make cssparser truly zero-copy.

I want to try these ideas, but it's a time investment, especially porting the parsing code in Servo.
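To make the rewinding cost concrete, here is a minimal sketch of the save-position/rewind pattern. This is NOT cssparser's real API; `try_parse` and the toy `expect_keyword` "tokenizer" are invented for illustration. The point is that everything the failed closure tokenized gets re-scanned by the next alternative.

```rust
// Illustrative parser with an explicit byte position into the input.
struct Parser<'a> {
    input: &'a str,
    position: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { input, position: 0 }
    }

    // Run `f`; on failure, rewind to the saved position so the caller can
    // try the next alternative. All the tokenization `f` did is wasted.
    fn try_parse<T>(&mut self, f: impl FnOnce(&mut Self) -> Result<T, ()>) -> Result<T, ()> {
        let saved = self.position;
        let result = f(self);
        if result.is_err() {
            self.position = saved;
        }
        result
    }

    // Toy stand-in for the tokenizer: consume a literal keyword.
    fn expect_keyword(&mut self, kw: &str) -> Result<(), ()> {
        if self.input[self.position..].starts_with(kw) {
            self.position += kw.len();
            Ok(())
        } else {
            Err(())
        }
    }
}
```

Parsing a grammar like "inherit | auto" against the input `auto` runs (and rewinds) the `inherit` arm before the `auto` arm succeeds, scanning the same input twice; with deeply nested alternatives the repeated work compounds.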
> We could have a one-token cache So for what it's worth, Gecko's setup here is that the API looks like this: GetToken/UngetToken. GetToken gets a token. UngetToken marks the current token as "not gotten", so the next GetToken call will just return it. You can't UngetToken twice in a row. So this is effectively a one-token cache under the hood.
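A rough Rust sketch of that GetToken/UngetToken shape, purely for illustration (Gecko's actual scanner is C++ and tokenizes lazily; the `Vec` of pre-split tokens here is a stand-in): pushing back is just a flag, so the next get returns the same token without re-tokenizing, and pushing back twice in a row is a bug.

```rust
// One-token pushback cache in the style of Gecko's GetToken/UngetToken.
struct Tokenizer {
    tokens: Vec<&'static str>, // stand-in for a real lazy token stream
    index: usize,
    pushed_back: bool,
}

impl Tokenizer {
    fn new(tokens: Vec<&'static str>) -> Self {
        Tokenizer { tokens, index: 0, pushed_back: false }
    }

    fn get_token(&mut self) -> Option<&'static str> {
        if self.pushed_back {
            // Return the previously gotten token without re-scanning.
            self.pushed_back = false;
            return self.tokens.get(self.index - 1).copied();
        }
        let tok = self.tokens.get(self.index).copied();
        if tok.is_some() {
            self.index += 1;
        }
        tok
    }

    // Mark the current token as "not gotten". Calling this twice in a row
    // is invalid, which is what makes this a *one*-token cache.
    fn unget_token(&mut self) {
        assert!(!self.pushed_back, "can't UngetToken twice in a row");
        self.pushed_back = true;
    }
}
```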
With SimonSapin's recent fixes (and a new machine), I get 23 ms for Gecko and 31 ms for Servo, so Servo is still 35% slower, but we're closing in on it. I'll dig into the profile a bit more after the call.
So I reprofiled, and 97% of the time is spent in parse_qualified_rule. Simon, is that consistent with the analysis in comment 16? Everything inside is inlined, so I can't get more granular than that, even though I inflated the overhead by parsing each stylesheet 100 times.
> 97% of the time is spent in parse_qualified_rule That’s not surprising, and unfortunately not very informative. Qualified rules are any rule that doesn’t start with an @-keyword, which includes style rules. So the parsing of the selector list and the declarations for one given style rule is all under one call to parse_qualified_rule.
Here's the updated profile with more stuff out-of-line: https://perfht.ml/2mXhal2
Seems like a decent amount of it is spent in URL parsing. We should try to get that fixed, I guess; stuff like https://github.com/servo/servo/issues/13778 is pretty sad.
To test Simon's theory from comment 16, I wrote up a patch to measure the wasted work. The basic idea is that, if the parse callback returns failure, we do the work a second time in an out-of-line function which we can then measure with the profiler. The patch is here: https://pastebin.mozilla.org/8982050 The resulting profile is here: https://perfht.ml/2nqoweF This shows us spending 82 ms in wasted_work, and 3012 ms total time in Servo_StyleSheet_FromUTF8Bytes, which is 2930 after subtracting out the redundant set of calls in wasted_work. This means that the wasted work accounts for about 3% of total parse time, which is about 12% of the total remaining difference between Gecko and Servo. So it's definitely worth fixing, but may not be worth a boil-the-ocean refactor right now. There's still bigger fish to fry in there somewhere.
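For the record, the arithmetic works out as follows. The 2300 ms Gecko baseline is my own assumption (the 23 ms from comment 19, times the 100 parse iterations); the helper name is made up for this note.

```rust
// Sanity-check the "about 3%" and "about 12%" figures from the profile.
fn wasted_work_fractions() -> (f64, f64) {
    let wasted = 82.0; // ms measured in wasted_work
    let total = 3012.0; // ms total in Servo_StyleSheet_FromUTF8Bytes
    let servo = total - wasted; // 2930 ms after removing the redundant calls
    let gecko = 23.0 * 100.0; // assumed baseline: comment 19's 23 ms x 100 runs

    let share_of_parse = wasted / servo; // ~0.028, the "about 3%" figure
    let share_of_gap = wasted / (total - gecko); // ~0.115, the "about 12%" figure
    (share_of_parse, share_of_gap)
}
```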
Regarding URL parsing, I maintain that we should skip rust-url entirely for `data:` URLs. Emilio’s attempt at this ended up not landing, but I don’t remember why exactly.
> we should skip rust-url entirely for `data:` URLs … or even for all URLs, in Stylo. (See discussion in Bug 1347435.)