Johnny suggested this a few weeks ago. Given that CSS parsing is turning out to hot, we should investigate whether there are any bits of work we do that can be trivially parallelized on top of rayon. Not urgent to look into this right now, just getting it on file as a potential perf win so that we don't forget about it.
It'd be quite hard taking @import and the loader into account, but the good part is that @import is only taken into account at the beginning of the stylesheet, so we could start parallelizing once we've passed the import section I believe.
Why is CSS parsing with stylo so much hotter than with Gecko's style system?
(In reply to Nathan Froyd [:froydnj] from comment #2) > Why is CSS parsing with stylo so much hotter than with Gecko's style system? Well, in part because of dumb stuff, see bug 1331843. A lot of that has been fixed, and I haven't had time to remeasure yet. However, the other piece here is that parallelizing stuff like parsing is a heck of a lot easier to do safely in Rust than in C++. So even once we get to perf parity with Gecko, we can potentially parallelize stuff that would have been too daunting to consider parallelizing in Gecko.
Multiple stylesheets can be parsed in parallel of each other trivially. Within one stylesheet it’s less easy. Several ideas come to mind, but I don’t know which of them if any would be an overall improvement. Before we try any of them, though, I think there’s still room to improve single-threaded parsing performance. I’ve written https://bugzilla.mozilla.org/show_bug.cgi?id=1331843#c16 about this. https://github.com/haxney/speculate is a research project for a speculative tokenizer based on rust-cssparser. It makes multiple threads start tokenizing from arbitrary points in the Unicode input assuming that that point is not inside a comment or quoted string. If that assumption turns out to be wrong, these results are thrown away and that part of the input is tokenized again. Results seemed promising, but this was never integrated in a parser that does more than tokenizing. Also this is written in 2013 Rust and would need to be ported. We could have a pipeline with a tokenizer thread sending tokens over a channel to another thread for the rest of parsing. (With crossbeam scoped threads to deal with lifetimes.) Finally, we could have one thread find the start and end of each top-level rule in a stylesheet, and other threads tokenize/parse them in parallel. This top-level thing still needs to look for comments, quoted strings, and backslash escapes. But maybe that’s doable with less work (and tighter inner loops) than full tokenization. This would only be effective if we can find the end of a rule several times faster than we can fully parse it, which remains to be seen. Doing this more fine-grained than top-level rules (up to single declarations) is also possible, but more/smaller work items for a thread pool might mean more synchronization cost.
Thanks for the detailed analysis Simon. We should definitely look into parallelizing multiple stylesheets, which would be the lowest-hanging fruit if we could teach the Gecko machinery to handle async-parsed stylesheets. My intuition tells me that we could probably identify top-level rules quite a bit faster than a full parse. The fast tokenizer could maintain pretty minimal state and would need to maintain it only when one of a very small number of characters is found. Would take some time to prove out and implement though.
So a few things: 1) The "find the rule boundaries" idea was my first thought. It's definitely worth testing. 2) Gecko already has a concept of "async parsing" to some extent: obviously @import can happen async and a sheet is not considered fully ready until all of those happen. Rejiggering this a bit to allow servo to return a "sheet is not done parsing" state would not be too hard. We'd need to support cloning of still-being-parsed sheets, but as long as the cloning is just a COW addref of the actual data structure still being parsed into, that should be ok. We explicitly disallow access to rules of sheets which are still "being parsed" (including having the imports still loaded), so that wouldn't be an issue. So if the servo side is ready, making the gecko side work should be pretty straightforward. Maybe half a day's work for me.
We just discussed this on IRC. I think we should do it - it would be a huge win in some cases for not very much work.
Created attachment 8865063 [details] [diff] [review] Always load stylesheets asynchronously when possible. WIP MozReview-Commit-ID: LGWWTaMtUAW
So, I wrote up a patch (attached) to see what happens when I try to always parse stylesheets async in Gecko (ignoring the servo bits). The idea would be to get that into the tree, and then use the newfound flexibility to parse things asynchronously for Servo. However, it turned try orange . I haven't had time to dig in yet, but before I do I thought it was worth asking bz whether he thinks that doing this ought to work, or whether I should change my approach.  https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe71b2b46abdafd8b61d0c2180ffe54260fea5e2
Doing this ought to work, subject, like I said, to fixing the cycle-detection in LoadChildSheet. There's one obvious conceptual bug in the attached patch: it does nothing to keep onload from firing while the runnable is live. Spot-checking a few tests that failed, they all failed because of that: the test's onload fired before the sheet was actually parsed and applied. Adding BlockOnload()/UnblockOnload() calls as needed should help there.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #10) > Doing this ought to work, subject, like I said, to fixing the > cycle-detection in LoadChildSheet. Is there actually anything that needs fixing? It seems to me that the important part is just that the data for the stylesheet currently being parsed is top of the stack (so that dependent loads can find it). I think my patch preserves that, no? > > There's one obvious conceptual bug in the attached patch: it does nothing to > keep onload from firing while the runnable is live. Spot-checking a few > tests that failed, they all failed because of that: the test's onload fired > before the sheet was actually parsed and applied. Adding > BlockOnload()/UnblockOnload() calls as needed should help there. Ah, good catch. Trying with that: https://treeherder.mozilla.org/#/jobs?repo=try&revision=762052e789892fc91988ee4f1d8b761e73d2411d
> Is there actually anything that needs fixing? Hmm. For purposes of this patch, probably not, since the parse is still completed sync while mParsingDatas is modified. For actual parallelized parsing, more work will be needed.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #12) > > Is there actually anything that needs fixing? > > Hmm. For purposes of this patch, probably not, since the parse is still > completed sync while mParsingDatas is modified. For actual parallelized > parsing, more work will be needed. And to confirm, that work is just passing a ref to the parent stylesheet when loading the child, so that any child sheet loads _it_ triggers can find the appropriate chain to walk?
We pass the parent sheet to LoadChildSheet already. But that's not enough to do the check we're doing via HaveAncestorDataWithURI right now. Specifically, due to sheet sharing, our parent sheet may be effectively being loaded from multiple parents, and we want to check all of them (and all their ancestors) for the recursion. That is, because of sharing, sheets do not form a forest when @import is used; instead we have a DAG in which a single node might have multiple "parents". And we are trying to enforce the "A" part of "DAG", which means we need to check all things that are our "ancestors" in the DAG.