513149 - Speed up css parsing(Add telemetry probes to measure extent of problem)

It shouldn't take very much time once I get done with bug 497495, but I may have to move a lot of code around. I would prefer to make the 3.6 call after we know how invasive it ends up being. I was considering fixing bunch of places where we don't tokenize CSS exactly to spec at the same time, but I don't *have* to do that, and it might be safer not to. Regarding performance benefits, Taras told me that on our current Fennec platforms we take between half a second and a full second to process all the CSS that we load on startup. I think I can easily cut at least 50% off of that.

Boris Zbarsky [:bzbarsky]

Comment 3

•

15 years ago

I'd be interested in the impact on the microbenchmark at https://bugzilla.mozilla.org/attachment.cgi?id=383408 as well.

Alfred Kayser

Comment 4

•

15 years ago

Other bugs on performance improvements on CSS parsing are: bug 337287 - wacky parsing of comments in URLs bug 399223 - make aToken a member of nsCSSScanner instead of passing as function argument bug 455839 - Small cleanup in error handling in nsCSSParser/nsCSSScanner bug 483977 - nsCSSScanner: Simplify IsDigit: Ready for checkin! I would recommend to at least checkin bug 483977 before doing major work, and the others to may be take into account into the 'next generation parser'.

Dão Gottwald [:dao]

Updated

•

15 years ago

Keywords: perf

Benedict Hsieh [:benedict]

Comment 5

•

15 years ago

Zack, wanted to let you know that I took a shot at changing the CSSScanner to call ReadSegments on the input stream, patch is in https://bugzilla.mozilla.org/show_bug.cgi?id=512272. (Patch is a little messy right now, but the change isn't large at all.) I needed to do this for my other changes to make sense. Anyways, it's not close to landing or anything, I just wanted to let you know because you said you'd be doing something similar.

Zack Weinberg (:zwol)

Comment 6

•

15 years ago

(In reply to comment #4) > Other bugs on performance improvements on CSS parsing are: > bug 337287 - wacky parsing of comments in URLs I'm going to rewrite url() parsing to work exactly as specified. > bug 399223 - make aToken a member of nsCSSScanner instead of passing as > function argument I still don't think this is a good change. > bug 455839 - Small cleanup in error handling in nsCSSParser/nsCSSScanner dbaron asked you in that bug to describe how you tested it, and you never answered. Also, I'm thinking of moving all the error handling to the parser.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 7

•

15 years ago

(In reply to comment #6) > (In reply to comment #4) > > Other bugs on performance improvements on CSS parsing are: > > bug 337287 - wacky parsing of comments in URLs > > I'm going to rewrite url() parsing to work exactly as specified. I think all the known bugs there have been fixed, and making it work "exactly as specified" would probably just slow it down without any detectable benefit.

Zack Weinberg (:zwol)

Comment 8

•

15 years ago

(In reply to comment #7) > > I think all the known bugs there have been fixed, and making it work "exactly > as specified" would probably just slow it down without any detectable benefit. I won't do it if it winds up hurting perf, but my main goal is to get rid of the Next/NextURL distinction and the associated wacky buffer resynchronization stuff.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 9

•

15 years ago

(In reply to comment #8) > I won't do it if it winds up hurting perf, but my main goal is to get rid of > the Next/NextURL distinction and the associated wacky buffer resynchronization > stuff. I think getting rid of what you call the "wacky buffer resynchronization stuff" requires introducing the ability to do arbitrary backtracking into the scanner, which we don't want to have to do. I might be ok with removing the Next/NextURL distinction, depending on what the patch looks like.

Zack Weinberg (:zwol)

Comment 10

•

15 years ago

(In reply to comment #9) > > I think getting rid of what you call the "wacky buffer resynchronization stuff" > requires introducing the ability to do arbitrary backtracking into the scanner, > which we don't want to have to do. That's where those tokenization changes I proposed for css2.1 the other month come in. Also, if we switch to a flex-style scanner we get arbitrary backup for free.

Zack Weinberg (:zwol)

Comment 11

•

15 years ago

(well, not for *free*, looks like a flex scanner with arbitrary backup is 5-10ns/token slower than one without.)

Alfred Kayser

Comment 12

•

15 years ago

I am not really convinced that replacing a custom, optimized, tuned to the specific aspects of CSS wacky things like url(...) parser with a flex generated parser would improve performance. A long time ago, about 15 years, I graduated on a language for data description (something like XML/DTD/XSD's), and learned there that the normal yacc/lex combo is useful to get a working parser, but that a handcoded parser specifically build for the syntax is much and much faster. As for the tests for error reporting, I had no clue to do that, and removing one parameter from the error functions is not that critical. And I also tried to make a url parser to the spec, but then stumbled across de-facto standards issues (are comments allowed or not, and what does an opening comment brace within an url() really mean and such. The most important thing is ensure that existing tests still work.

Zack Weinberg (:zwol)

Comment 13

•

15 years ago

Attached file preliminary scanner timings — Details

Here's some preliminary timing results. I benchmarked four scanners: * "moz-wide-base": The existing CSS scanner, with line/column tracking and all token interpretation removed (that is, it doesn't fill in any of the fields of nsCSSToken except mType). * "moz-base": Same as above, except I changed PRUnichar to 'char' - this approximates the benefit of switching from UTF16 to UTF8 in the parser. * "flex-base": The flex scanner from CSS 2.1 section 4. This does no token interpretation either, and is UTF8-based. * "flex-nobackup": The section 4 scanner modified according to my proposal at <http://lists.w3.org/Archives/Public/www-style/2009Aug/0104.html> to have no backing-up (this changes the tokenization of an unclosed url() token, so we can't do it unilaterally). Each of these was run 50 times on five test inputs, each of which consisted of exactly 10^7 tokens (as measured by the "flex-base" scanner - the other three do not tokenize quite the same): * "realistic": The repeated concatenation of all our UA CSS plus some of the styling used by a real-world website. * "declarations": Thousands of property declarations generated by Jesse's CSS fuzzer, with no surrounding material. * "selectors": Thousands of selector constructs, again generated by Jesse's fuzzer, each followed by {}. * "identifiers": Millions of randomly generated identifiers, 0.5% of which contained \-sequences. * "keywords": Millions of random selections from the complete CSS keyword list (nsCSSKeywordList.h + nsCSSPropList.h + nsColorNameList.h). I measured user plus system time with getrusage() after each complete run. The top chart is in nanoseconds per code point consumed (two bytes for moz-wide-base, one byte for the other three) and the bottom chart is in nanoseconds per token consumed. Further left is faster. The dot shows the mean of all 50 runs, and the error bars are a bootstrapped 95% confidence interval for the mean. These are preliminary results, because they don't count token interpretation or parsing time. All it's doing is classifying each token into the categories in section 4 of css2.1 and then throwing that information away. Still, this tells us something useful: if we hold everything *except* token classification constant, switching from utf16 to utf8 would not buy us much, but a flex-type scanner would, especially if we can push the CSS2.1 changes through that would allow that scanner not to have to back up. Unfortunately, I cannot yet produce numbers for a flex-type scanner operating on utf16, because flex (the program) is steeped in 8-bit-ness. I may try to persuade Chris Jones' Python lexer generator to do it. Lemme know if anyone wants to see the code and test inputs.

Assignee: nobody → zweinberg

Status: NEW → ASSIGNED

Zack Weinberg (:zwol)

Comment 14

•

15 years ago

Attached file scanner timings with keyword recognizer — Details

This is a slightly crazy idea I had: this is the same timing chart as the previous attachment, but instead of the two "moz" scanners there's a new flex-based scanner which is the same as "flex-nobackup" except it also embeds recognition of all CSS keywords into the DFA. That, naturally, slows down keyword recognition (and thus has a visible impact on "declarations" and "realistic" parsing as well) but realize that this is *instead of* looking every identifier up in a hash table, and does not require us to copy and store the keywords. So when compared to a scanner that does token interpretation, this may well be a win.

Alfred Kayser

Comment 15

•

15 years ago

Looks promising! Looking forward to test a completed CSS scanner.

Brendan Eich [:brendan]

Comment 16

•

15 years ago

DFA recognition of keywords is righteous. We do an ad-hoc generated code-trie for JS keywords, FYI. http://mxr.mozilla.org/mozilla-central/source/js/src/jskwgen.cpp /be

Zack Weinberg (:zwol)

Comment 17

•

15 years ago

Further notes. Everything I posted last week was using toy programs that only did a tiny piece of the job - just token classification. The existing test program layout/html/tests/ParseCSS.cpp can be used to profile the complete parser, although running it is a bit of a pain: MOZILLA_FIVE_HOME=$(cd ../../../dist/bin && pwd) \ LD_LIBRARY_PATH=$(cd ../../../dist/lib && pwd) \ ./ParseCSS \ $(cd ../../../../experiments/css-parser/testdata && pwd)/realistic.css (It's ridiculous that nsILocalFile cannot handle relative paths. And I haven't figured out how to build XPCOM as a static library. And you get link errors if you try to compile this program in a tree configured with --enable-libxul. And nsCSSLoader hardwires the character encoding of files passed to LoadSheetSync as UTF-8, so I can't easily compare performance with and without conversion.) I modified ParseCSS to print the elapsed CPU time on exit, and I also hacked up a variation called ScanCSS that just runs the scanner, not the parser. For the same input ("realistic.css", which is actual CSS ganked from our UA style sheets and a website I happened to have lying around, then reduplicated many times) I get these times: ScanCSS 1.496 seconds ParseCSS 6.184 seconds This says that the parser, not the scanner, is the bottleneck - the parser is taking three times longer to run than the scanner, for the same input. Drilling down a little with valgrind (callgrind skin): ScanCSS 6,383,828,294 instructions executed ParseCSS 15,978,647,543 instructions executed The parser accounts for only 60% or so of instructions executed, but ... ScanCSS 1,319,959 D1 misses ParseCSS 37,543,122 D1 misses ... 96% of data cache misses! This is not totally fair to ParseCSS, though; about 11 million of those misses are coming from teardown of the very large style sheet that we build up, and other XPCOM shutdown overhead. It's interesting to know that nsHtml5NamedCharacters::releaseStatics()' gigantic call to operator delete[] accounts for 20% of all D1 read misses in this run, but it's not really *relevant*. I modified ParseCSS again so it calls the operating system _exit() immediately after LoadSheetSync() returns, thus skipping all teardown. It's also useful to know that three-fourths of the data read misses for ScanCSS are coming from the dynamic loader. XPCOM and JS initialization takes another 10% or so. Conversion from ASCII to UTF-16 comes in just behind that, at about 6%. The scanner itself accounts for only 2%. Data write misses are less ridiculous, but it's still the dynamic loader and XPCOM overhead taking up more than half the total, then UTF-16 conversion, and then the scanner proper, at 6%. Basically, I don't think the scanner has a data cache problem, but it would be worth getting rid of the conversion overhead. Now, looking at the parser (with teardown cut out), the things that dominate D1 read misses are hash table lookups, atom manipulation (nsAtomListUtils::IsMember appears to do a linear search; unfortunately I don't know from this analysis how long the lists are) and property data block construction. For D1 write misses, it's all about memory allocation, with malloc() eating 41% of them. nsCSSExpandedDataBlock::Compress is the hungriest function in the parser proper. UTF-16 conversion actually has larger footprint in the full parser: 12% of D1 read and 22% of D1 write misses. Presumably this is because it's competing for cache space with the data structure being built up by the parser. It's maybe also worth mentioning that the worst offenders in terms of instruction cache misses are ParseSelectorGroup, ParseDeclarationBlock, and ParseVariant. ---- So, conclusions: The biggest things we could do here are get rid of character set conversion, optimize the data structures, and make the atom manipulation that we do more efficient. I'm somewhat less sanguine about switching the parser over to work in UTF-8 after seeing how important atom ops are -- we're not gonna be able to change *that* API to UTF-8, after all. But it still might be worth trying. I wonder how hard it would be to get rid of expanded data blocks and do all the work directly on compressed data blocks. Scanner work is still relevant, but only insofar as it speeds up the parser. For instance, using a DFA to identify CSS keywords might be worthwhile, since we're spending a lot of time looking them up in the hash table. (And that would, of course, entail using a DFA for everything else in the scanner.)

Boris Zbarsky [:bzbarsky]

Comment 18

•

15 years ago

> I don't know from this analysis how long the lists are The callers of nsAtomListUtils::IsMember in the css parser are almost certainly nsCSSPseudoElements::IsPseudoElement (list has 11 elements) and/or nsCSSPseudoClasses::IsPseudoClass (list has 52 elements). > we're not gonna be able to change *that* API to UTF-8 Atoms are stored in UTF-8. Given a UTF-8 string, looking up an atom is just a hashtable lookup. Given a UTF-16 string, it's also a hashtable lookup, but with on-the-fly conversion to utf-8 to compute the hashcode and do the compare, basically. I believe atomizing a string is fastest if one has an nsACString which is UTF-8. I agree that the memory behavior of compressed data blocks is one of the bigger overheads during parsing; I've been thinking about ways to try to get them off the fast path in inline style a bit... The thing is, I suspect their use reduces data cache misses during actual style computation, not to mention reducing memory usage. Doing all the work on compressed blocks is a bit difficult, since the compressed block allocates all its data as a single contiguous allocation. This requires knowing up front how big the allocation should be.

Zack Weinberg (:zwol)

Comment 19

•

15 years ago

Retitling -- I may still do a machine generated parser, but it looks like the biggest available win will be from reducing the number of times we have to copy CSS property-value pairs around.

Summary: Speed up css parsing by using a machine generated lexer → Speed up css parsing

Zack Weinberg (:zwol)

Comment 20

•

15 years ago

Attached image microbenchmark for current patch series — Details

I've been developing what I thought of as preliminary clean-up patches in a bunch of different bugs. The attached PNG (large, but not huge, I hope) shows what they do to parser performance on a microbenchmark of my own invention (modified from one of Boris's). The X-axis is the patch series (0 being yesterday's unmodified trunk, 10 being all patches applied) and the Y-axis is microseconds per token consumed. Eyeball-wise, the patches seem to be an overall win for ParseSheet, which is the most important of course, but more equivocal for everything else. What I think we're looking at is wildly variable overhead for the various entry points, and some of the patches make the overhead worse, even though they mostly improve per-token performance. It's also worrisome and frustrating that *cleanups* have such visible effects on overhead.

Boris Zbarsky [:bzbarsky]

Comment 21

•

15 years ago

> It's also worrisome and frustrating that *cleanups* have such visible effects > on overhead. At least in part because the microbenchmarks are testing things we've really worked on speeding up, so just adding things like function call overhead on those codepaths can affect things noticeably...

Zack Weinberg (:zwol)

Comment 22

•

15 years ago

Attached image updated microbenchmark results — Details

Fed up with my results jumping all over the place due to DOM overhead, I added a bunch of benchmark hooks to nsIDOMWindowUtils and repeated my tests. I've also rearranged the patch series extensively, so that the patches that are expected to be positive are in front, and the patches that are known to slow things down are in the back. Some of the expected-to-speed-up patches, unfortunately, depend on the known-to-slow-down patches right now so I have removed them from the queue temporarily. Here are the results. The leftmost boxplot in each row of this chart is a baseline measurement with no patches and therefore always has its median line at 0%. Everything else in a row is relative to that point. As you go rightward, we add patches to the stack. The first two patches (decom parser, decom loader) are clearly a win, except possibly for the ParseColor test at the very top; I don't know what's going on there, but ParseColor is really cheap anyway so maybe it's not to worry about. The third patch (dont track column) *should* be an across-the-board win because it allows the scanner to do strictly less work, but clearly it isn't; my next action is going to be running that under Shark and hopefully I can make sense of the results this time (since the DOM overhead won't be cluttering everything up). I'm impressed that I can get 3-5% speedups *just* from deCOM/devirt patches. We should be doing more of those.

Brendan Eich [:brendan]

Comment 23

•

15 years ago

(In reply to comment #22) > I'm impressed that I can get 3-5% speedups *just* from deCOM/devirt patches. > We should be doing more of those. Everything old, including this news flash, is new again -- so shout it from the rooftops if need be. Virtual methods, when overused at high frequency, hurt. They also increase the control flow integrity attack surface. roc, do you still have helpers deCOMtaminating? Are they using Taras's static analysis -> code rewriting tools? /be

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 24

•

15 years ago

No. People who showed up saying they want to do deCOMtamination very rarely actually produced anything.

Zack Weinberg (:zwol)

Comment 25

•

15 years ago

I'm not surprised; deCOM is finicky and tedious. I don't want to do any more of those patches without mechanical assistance. :-)

Brendan Eich [:brendan]

Comment 26

•

15 years ago

We pull the newsgroup trigger now but I have everyone here, cc'ed, and the topic is relatively hot. Taras, what's the state of the art for static-analysis-tooled deCOMtamination? /be

(dormant account)

Reporter

Comment 27

•

15 years ago

(In reply to comment #26) > We pull the newsgroup trigger now but I have everyone here, cc'ed, and the > topic is relatively hot. Taras, what's the state of the art for > static-analysis-tooled deCOMtamination? Nada. deCOMtamination was too general of a goal when I worked on it so instead of focused on simpler stuff like outparamdel. If someone is willing spec out exactly which parts of deCOM we can mechanize, I can write up a tool. My experience so far has been disappointing. Every deCOM I looked at looked needed some domain-specific code reshuffling. If someone who is more experienced at deCOM such as zwol or roc can suggest specific steps to automate, we may get somewhere.

Brendan Eich [:brendan]

Comment 28

•

15 years ago

We want simpler steps instead of no steps, but it does seem like there could be a new attempt to synthesize and mechanize. Zwol, could you write such a spec? Maybe an interactive spec sketch with taras on IRC? OK, I'll stop hijacking this bug. /be

(dormant account)

Reporter

Comment 29

•

15 years ago

(In reply to comment #28) > We want simpler steps instead of no steps, but it does seem like there could be > a new attempt to synthesize and mechanize. Zwol, could you write such a spec? > Maybe an interactive spec sketch with taras on IRC? OK, I'll stop hijacking > this bug. Sounds like a good idea for workweek session.</hijack>

Zack Weinberg (:zwol)

Comment 30

•

15 years ago

I tell you what, I file a new bug 545052 under Rewriting and Analysis (and another bug 545050 under XPCOM) and cc a bunch of people, but I don't know who everyone should be cced, so you add more people, Brendan?

Brendan Eich [:brendan]

Comment 31

•

15 years ago

Sure -- I'll help cc: people. This is worth a try. /be

Zack Weinberg (:zwol)

Comment 32

•

15 years ago

I'm'a mention bug 105431 and its now-much-longer list of dependencies here, too.

Dietrich Ayala (:dietrich)

Updated

•

15 years ago

Whiteboard: [ts] → [ts][2010Q1]

Dietrich Ayala (:dietrich)

Updated

•

15 years ago

Blocks: 447581

Dietrich Ayala (:dietrich)

Updated

•

15 years ago

Depends on: 544112, 523496

Zack Weinberg (:zwol)

Updated

•

15 years ago

Depends on: 553456

Marco Castelluccio [:marco]

Updated

•

13 years ago

Whiteboard: [ts][2010Q1] → [ts][2010Q1][Snappy]

(dormant account)

Reporter

Updated

•

13 years ago

Summary: Speed up css parsing → Speed up css parsing(Add telemetry probes to measure extent of problem)

Whiteboard: [ts][2010Q1][Snappy] → [ts][2010Q1][Snappy:P2]

Benjamin Smedberg

Comment 33

•

8 years ago

Resetting owner to default per Zack's request.

Assignee: zackw → nobody

Status: ASSIGNED → NEW

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 34

•

2 years ago

The severity field for this bug is relatively low, S3. However, the bug has 16 votes and 50 CCs.
:emilio, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 35

•

2 years ago

Needs up-to-date profiling.

Flags: needinfo?(emilio)

Priority: -- → P3

Marco Castelluccio [:marco]

Updated

•

2 years ago

Severity: S3 → --

Type: defect → enhancement

preliminary scanner timings 15 years ago Zack Weinberg (:zwol) 17.03 KB, application/pdf		Details
scanner timings with keyword recognizer 15 years ago Zack Weinberg (:zwol) 14.32 KB, application/pdf		Details
microbenchmark for current patch series 15 years ago Zack Weinberg (:zwol) 22.66 KB, image/png		Details
updated microbenchmark results 15 years ago Zack Weinberg (:zwol) 19.85 KB, image/png		Details