This is sort of a continuation of bug 1344136 but the parameters are different enough that I think a different bug is a good idea. Looking at a profile of the testcase in bug 1344131, stylo spends about 12% of its time in memmove, called mostly from declaration_block::parse_one_declaration and glue::set_property. Another 4% is the UTF16-to-UTF8 copy of the to-be-parsed string. The rest of the stylo-specific time is various parsing code. I put a profile at https://perf-html.io/public/ca2114c54af0c8a917d476430772b924b2fa9046/calltree/?thread=2 in case something in there jumps out to someone more familiar with the servo CSS parser/tokenizer. If not, I can dig some more and try to do more comparable head-to-head profiles vs Gecko to see which parts are slower in stylo.
2 months ago
There's probably significant overlap with SimonSapin's work in bug 1331843.
Happy to reprofile once that lands.
> Another 4% is the UTF16-to-UTF8 copy of the to-be-parsed string. On the output side of this conversion/copy, rust-cssparser’s tokenizer it its current incarnation fundamentally needs UTF-8 (Rust &str) for its input. Making it generic to support UTF-16 input would be a significant amount of work. On the input side, is it really UTF-16 rather than Latin-1? IIRC SpiderMonkey can store JS strings as Latin-1 internally when they don’t contain code units beyond U+00FF, which is the case of "10px". Based on chatting with heycam, it sounds like the WebIDL bindings layer would first convert Latin-1 to UTF-16, then the CSSOM implementation for Stylo converts that to UTF-8. If we add a WebIDL annotation that a given string parameter is wanted in UTF-8, we could make that one conversion instead of two. We only want UTF-8 when a string is going to Stylo/Servo though, so doing this may be tricky as long as we have two style systems. Going further, we can avoid converting/copying entirely and use std::str::from_utf8_unchecked if the string happens to only contain ASCII code units (and is stored as Latin-1). However, systematically scanning Latin-1 inputs to check whether they’re ASCII also has a cost. So ideally, SpiderMonkey would track this for us with a "is ASCII" / "might not be ASCII" bit kept in Latin-1 strings. (This is about CSSOM, I’ve filed Bug 1354989 separately for stylesheets.)
> Making it generic to support UTF-16 input would be a significant amount of work. Right. I'm not sure what we can do about this overhead so far. > On the input side, is it really UTF-16 rather than Latin-1? It's UTF-16 (or WTF-16, I guess; more on this below) once it gets past the bindings layer. The original JS string is in fact Latin-1 in this case. We _could_ do something with bindings here to produce UTF-8 instead of UTF-16, yes. As you note, that's easier to do once we don't have to worry about two style systems, or once we're ok taking a perf hit on the non-stylo style system. > Going further, we can avoid converting/copying entirely and use std::str::from_utf8_unchecked > if the string happens to only contain ASCII code units As things stand, bindings will make a copy no matter what. They do right now. So if we implemented the "bindings output UTF-8" thing, ideally we would not need any more copies or scanning of the string or whatnot. That said, does rust-cssparser actually expect UTF-8, or WTF-8? That is, what happens for surrogate code points in the input? Should bindings be outputting WTF-8 given unpaired surrogates in the WTF-16 input, or converting unpaired surrogates to replacement char, or something else? All the encoding stuff could use a separate bug or three to track it, obviously. ;)
I’ve submitted https://github.com/servo/servo/pull/16954 which will hopefully reduce `memmove` traffic during parsing.