Open Bug 513149 Opened 15 years ago Updated 1 year ago

Speed up css parsing(Add telemetry probes to measure extent of problem)

Categories

(Core :: CSS Parsing and Computation, enhancement, P3)

x86
Linux
enhancement

Tracking

()

People

(Reporter: taras.mozilla, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: perf, Whiteboard: [ts][2010Q1][Snappy:P2])

Attachments

(4 files)

Zack says he has a prototype and this is next on his list of things to do.
Whiteboard: ts
Whiteboard: ts → [ts]
Zack, thanks for working on this. Do you expect this to land in time for 3.6?
Assignee: zweinberg → nobody
Component: DOM: CSS Object Model → Style System (CSS)
QA Contact: general → style-system
Blocks: 91242
It shouldn't take very much time once I get done with bug 497495, but I may have to move a lot of code around.  I would prefer to make the 3.6 call after we know how invasive it ends up being.

I was considering fixing bunch of places where we don't tokenize CSS exactly to spec at the same time, but I don't *have* to do that, and it might be safer not to.

Regarding performance benefits, Taras told me that on our current Fennec platforms we take between half a second and a full second to process all the CSS that we load on startup.  I think I can easily cut at least 50% off of that.
I'd be interested in the impact on the microbenchmark at https://bugzilla.mozilla.org/attachment.cgi?id=383408 as well.
Other bugs on performance improvements on CSS parsing are:
bug 337287 - wacky parsing of comments in URLs
bug 399223 - make aToken a member of nsCSSScanner instead of passing as function argument
bug 455839 - Small cleanup in error handling in nsCSSParser/nsCSSScanner 
bug 483977 - nsCSSScanner: Simplify IsDigit: Ready for checkin!

I would recommend to at least checkin bug 483977 before doing major work,
and the others to may be take into account into the 'next generation parser'.
Keywords: perf
Zack, wanted to let you know that I took a shot at changing the CSSScanner to call ReadSegments on the input stream, patch is in https://bugzilla.mozilla.org/show_bug.cgi?id=512272. (Patch is a little messy right now, but the change isn't large at all.) I needed to do this for my other changes to make sense.

Anyways, it's not close to landing or anything, I just wanted to let you know because you said you'd be doing something similar.
(In reply to comment #4)
> Other bugs on performance improvements on CSS parsing are:
> bug 337287 - wacky parsing of comments in URLs

I'm going to rewrite url() parsing to work exactly as specified.

> bug 399223 - make aToken a member of nsCSSScanner instead of passing as
> function argument

I still don't think this is a good change.

> bug 455839 - Small cleanup in error handling in nsCSSParser/nsCSSScanner 

dbaron asked you in that bug to describe how you tested it, and you never answered.  Also, I'm thinking of moving all the error handling to the parser.
(In reply to comment #6)
> (In reply to comment #4)
> > Other bugs on performance improvements on CSS parsing are:
> > bug 337287 - wacky parsing of comments in URLs
> 
> I'm going to rewrite url() parsing to work exactly as specified.

I think all the known bugs there have been fixed, and making it work "exactly as specified" would probably just slow it down without any detectable benefit.
(In reply to comment #7)
> 
> I think all the known bugs there have been fixed, and making it work "exactly
> as specified" would probably just slow it down without any detectable benefit.

I won't do it if it winds up hurting perf, but my main goal is to get rid of the Next/NextURL distinction and the associated wacky buffer resynchronization stuff.
(In reply to comment #8)
> I won't do it if it winds up hurting perf, but my main goal is to get rid of
> the Next/NextURL distinction and the associated wacky buffer resynchronization
> stuff.

I think getting rid of what you call the "wacky buffer resynchronization stuff" requires introducing the ability to do arbitrary backtracking into the scanner, which we don't want to have to do.

I might be ok with removing the Next/NextURL distinction, depending on what the patch looks like.
(In reply to comment #9)
> 
> I think getting rid of what you call the "wacky buffer resynchronization stuff"
> requires introducing the ability to do arbitrary backtracking into the scanner,
> which we don't want to have to do.

That's where those tokenization changes I proposed for css2.1 the other month come in.  Also, if we switch to a flex-style scanner we get arbitrary backup for free.
(well, not for *free*, looks like a flex scanner with arbitrary backup is 5-10ns/token slower than one without.)
I am not really convinced that replacing a custom, optimized, tuned to the specific aspects of CSS wacky things like url(...) parser with a flex generated parser would improve performance. 
A long time ago, about 15 years, I graduated on a language for data description (something like XML/DTD/XSD's), and learned there that the normal yacc/lex combo is useful to get a working parser, but that a handcoded parser specifically build for the syntax is much and much faster.

As for the tests for error reporting, I had no clue to do that, and removing one parameter from the error functions is not that critical.

And I also tried to make a url parser to the spec, but then stumbled across de-facto standards issues (are comments allowed or not, and what does an opening comment brace within an url() really mean and such. The most important thing is ensure that existing tests still work.
Here's some preliminary timing results.  I benchmarked four scanners:

* "moz-wide-base": The existing CSS scanner, with line/column tracking and all token interpretation removed (that is, it doesn't fill in any of the fields of nsCSSToken except mType).
* "moz-base": Same as above, except I changed PRUnichar to 'char' - this approximates the benefit of switching from UTF16 to UTF8 in the parser.
* "flex-base": The flex scanner from CSS 2.1 section 4.  This does no token interpretation either, and is UTF8-based.
* "flex-nobackup": The section 4 scanner modified according to my proposal at <http://lists.w3.org/Archives/Public/www-style/2009Aug/0104.html> to have no backing-up (this changes the tokenization of an unclosed url() token, so we can't do it unilaterally).

Each of these was run 50 times on five test inputs, each of which consisted of exactly 10^7 tokens (as measured by the "flex-base" scanner - the other three do not tokenize quite the same):

* "realistic": The repeated concatenation of all our UA CSS plus some of the styling used by a real-world website.
* "declarations": Thousands of property declarations generated by Jesse's CSS fuzzer, with no surrounding material.
* "selectors": Thousands of selector constructs, again generated by Jesse's fuzzer, each followed by {}.
* "identifiers": Millions of randomly generated identifiers, 0.5% of which contained \-sequences.
* "keywords": Millions of random selections from the complete CSS keyword list (nsCSSKeywordList.h + nsCSSPropList.h + nsColorNameList.h).

I measured user plus system time with getrusage() after each complete run.  The top chart is in nanoseconds per code point consumed (two bytes for moz-wide-base, one byte for the other three) and the bottom chart is in nanoseconds per token consumed.  Further left is faster.  The dot shows the mean of all 50 runs, and the error bars are a bootstrapped 95% confidence interval for the mean.

These are preliminary results, because they don't count token interpretation or parsing time.  All it's doing is classifying each token into the categories in section 4 of css2.1 and then throwing that information away.  Still, this tells us something useful: if we hold everything *except* token classification constant, switching from utf16 to utf8 would not buy us much, but a flex-type scanner would, especially if we can push the CSS2.1 changes through that would allow that scanner not to have to back up.

Unfortunately, I cannot yet produce numbers for a flex-type scanner operating on utf16, because flex (the program) is steeped in 8-bit-ness.  I may try to persuade Chris Jones' Python lexer generator to do it.

Lemme know if anyone wants to see the code and test inputs.
Assignee: nobody → zweinberg
Status: NEW → ASSIGNED
This is a slightly crazy idea I had: this is the same timing chart as the previous attachment, but instead of the two "moz" scanners there's a new flex-based scanner which is the same as "flex-nobackup" except it also embeds recognition of all CSS keywords into the DFA.

That, naturally, slows down keyword recognition (and thus has a visible impact on "declarations" and "realistic" parsing as well) but realize that this is *instead of* looking every identifier up in a hash table, and does not require us to copy and store the keywords.  So when compared to a scanner that does token interpretation, this may well be a win.
Looks promising!
Looking forward to test a completed CSS scanner.
DFA recognition of keywords is righteous. We do an ad-hoc generated code-trie for JS keywords, FYI.

http://mxr.mozilla.org/mozilla-central/source/js/src/jskwgen.cpp

/be
Further notes.  Everything I posted last week was using toy programs
that only did a tiny piece of the job - just token classification.
The existing test program layout/html/tests/ParseCSS.cpp can be used
to profile the complete parser, although running it is a bit of a
pain:

MOZILLA_FIVE_HOME=$(cd ../../../dist/bin && pwd) \
LD_LIBRARY_PATH=$(cd ../../../dist/lib && pwd) \
 ./ParseCSS \
 $(cd ../../../../experiments/css-parser/testdata && pwd)/realistic.css

(It's ridiculous that nsILocalFile cannot handle relative paths.
And I haven't figured out how to build XPCOM as a static library.
And you get link errors if you try to compile this program in a tree
configured with --enable-libxul.  And nsCSSLoader hardwires the
character encoding of files passed to LoadSheetSync as UTF-8, so I
can't easily compare performance with and without conversion.)

I modified ParseCSS to print the elapsed CPU time on exit, and I also
hacked up a variation called ScanCSS that just runs the scanner, not
the parser.  For the same input ("realistic.css", which is actual CSS
ganked from our UA style sheets and a website I happened to have lying
around, then reduplicated many times) I get these times:

ScanCSS  1.496 seconds
ParseCSS 6.184 seconds

This says that the parser, not the scanner, is the bottleneck - the
parser is taking three times longer to run than the scanner, for the
same input.

Drilling down a little with valgrind (callgrind skin):

ScanCSS   6,383,828,294 instructions executed
ParseCSS 15,978,647,543 instructions executed

The parser accounts for only 60% or so of instructions executed, but ...

ScanCSS   1,319,959  D1 misses
ParseCSS 37,543,122  D1 misses

... 96% of data cache misses!  This is not totally fair to ParseCSS,
though; about 11 million of those misses are coming from teardown of
the very large style sheet that we build up, and other XPCOM shutdown
overhead.  It's interesting to know that
nsHtml5NamedCharacters::releaseStatics()' gigantic call to operator
delete[] accounts for 20% of all D1 read misses in this run, but it's
not really *relevant*.  I modified ParseCSS again so it calls the
operating system _exit() immediately after LoadSheetSync() returns,
thus skipping all teardown.

It's also useful to know that three-fourths of the data read misses
for ScanCSS are coming from the dynamic loader.  XPCOM and JS
initialization takes another 10% or so.  Conversion from ASCII to
UTF-16 comes in just behind that, at about 6%.  The scanner itself
accounts for only 2%.  Data write misses are less ridiculous, but it's
still the dynamic loader and XPCOM overhead taking up more than half
the total, then UTF-16 conversion, and then the scanner proper, at 6%.
Basically, I don't think the scanner has a data cache problem, but it
would be worth getting rid of the conversion overhead.

Now, looking at the parser (with teardown cut out), the things that
dominate D1 read misses are hash table lookups, atom manipulation
(nsAtomListUtils::IsMember appears to do a linear search;
unfortunately I don't know from this analysis how long the lists are)
and property data block construction.  For D1 write misses, it's all
about memory allocation, with malloc() eating 41% of them.
nsCSSExpandedDataBlock::Compress is the hungriest function in the
parser proper.

UTF-16 conversion actually has larger footprint in the full parser:
12% of D1 read and 22% of D1 write misses.  Presumably this is because
it's competing for cache space with the data structure being built up
by the parser.

It's maybe also worth mentioning that the worst offenders in terms of
instruction cache misses are ParseSelectorGroup,
ParseDeclarationBlock, and ParseVariant.

----

So, conclusions: The biggest things we could do here are get rid of
character set conversion, optimize the data structures, and make the
atom manipulation that we do more efficient.  I'm somewhat less
sanguine about switching the parser over to work in UTF-8 after seeing
how important atom ops are -- we're not gonna be able to change *that*
API to UTF-8, after all.  But it still might be worth trying.

I wonder how hard it would be to get rid of expanded data blocks and
do all the work directly on compressed data blocks.

Scanner work is still relevant, but only insofar as it speeds up the
parser.  For instance, using a DFA to identify CSS keywords might be
worthwhile, since we're spending a lot of time looking them up in the
hash table.  (And that would, of course, entail using a DFA for
everything else in the scanner.)
> I don't know from this analysis how long the lists are

The callers of nsAtomListUtils::IsMember in the css parser are almost certainly nsCSSPseudoElements::IsPseudoElement (list has 11 elements) and/or nsCSSPseudoClasses::IsPseudoClass (list has 52 elements).

> we're not gonna be able to change *that* API to UTF-8

Atoms are stored in UTF-8.  Given a UTF-8 string, looking up an atom is just a hashtable lookup.  Given a UTF-16 string, it's also a hashtable lookup, but with on-the-fly conversion to utf-8 to compute the hashcode and do the compare, basically.

I believe atomizing a string is fastest if one has an nsACString which is UTF-8.

I agree that the memory behavior of compressed data blocks is one of the bigger overheads during parsing; I've been thinking about ways to try to get them off the fast path in inline style a bit...  The thing is, I suspect their use reduces data cache misses during actual style computation, not to mention reducing memory usage.  Doing all the work on compressed blocks is a bit difficult, since the compressed block allocates all its data as a single contiguous allocation.  This requires knowing up front how big the allocation should be.
Retitling -- I may still do a machine generated parser, but it looks like the biggest available win will be from reducing the number of times we have to copy CSS property-value pairs around.
Summary: Speed up css parsing by using a machine generated lexer → Speed up css parsing
I've been developing what I thought of as preliminary clean-up patches in a bunch of different bugs.  The attached PNG (large, but not huge, I hope) shows what they do to parser performance on a microbenchmark of my own invention (modified from one of Boris's).  The X-axis is the patch series (0 being yesterday's unmodified trunk, 10 being all patches applied) and the Y-axis is microseconds per token consumed.

Eyeball-wise, the patches seem to be an overall win for ParseSheet, which is the most important of course, but more equivocal for everything else.  What I think we're looking at is wildly variable overhead for the various entry points, and some of the patches make the overhead worse, even though they mostly improve per-token performance.  It's also worrisome and frustrating that *cleanups* have such visible effects on overhead.
> It's also worrisome and frustrating that *cleanups* have such visible effects
> on overhead.

At least in part because the microbenchmarks are testing things we've really worked on speeding up, so just adding things like function call overhead on those codepaths can affect things noticeably...
Fed up with my results jumping all over the place due to DOM overhead, I added a bunch of benchmark hooks to nsIDOMWindowUtils and repeated my tests.  I've also rearranged the patch series extensively, so that the patches that are expected to be positive are in front, and the patches that are known to slow things down are in the back.  Some of the expected-to-speed-up patches, unfortunately, depend on the known-to-slow-down patches right now so I have removed them from the queue temporarily.  Here are the results.

The leftmost boxplot in each row of this chart is a baseline measurement with no patches and therefore always has its median line at 0%.  Everything else in a row is relative to that point.  As you go rightward, we add patches to the stack.

The first two patches (decom parser, decom loader) are clearly a win, except possibly for the ParseColor test at the very top; I don't know what's going on there, but ParseColor is really cheap anyway so maybe it's not to worry about.  The third patch (dont track column) *should* be an across-the-board win because it allows the scanner to do strictly less work, but clearly it isn't; my next action is going to be running that under Shark and hopefully I can make sense of the results this time (since the DOM overhead won't be cluttering everything up).

I'm impressed that I can get 3-5% speedups *just* from deCOM/devirt patches.  We should be doing more of those.
(In reply to comment #22)
> I'm impressed that I can get 3-5% speedups *just* from deCOM/devirt patches. 
> We should be doing more of those.

Everything old, including this news flash, is new again -- so shout it from the rooftops if need be.

Virtual methods, when overused at high frequency, hurt. They also increase the control flow integrity attack surface.

roc, do you still have helpers deCOMtaminating? Are they using Taras's static analysis -> code rewriting tools?

/be
No. People who showed up saying they want to do deCOMtamination very rarely actually produced anything.
I'm not surprised; deCOM is finicky and tedious.  I don't want to do any more of those patches without mechanical assistance. :-)
We pull the newsgroup trigger now but I have everyone here, cc'ed, and the topic is relatively hot. Taras, what's the state of the art for static-analysis-tooled deCOMtamination?

/be
(In reply to comment #26)
> We pull the newsgroup trigger now but I have everyone here, cc'ed, and the
> topic is relatively hot. Taras, what's the state of the art for
> static-analysis-tooled deCOMtamination?

Nada. deCOMtamination was too general of a goal when I worked on it so instead of focused on simpler stuff like outparamdel. If someone is willing spec out exactly which parts of deCOM we can mechanize, I can write up a tool.

My experience so far has been disappointing. Every deCOM I looked at looked needed some domain-specific code reshuffling. If someone who is more experienced at deCOM such as zwol or roc can suggest specific steps to automate, we may get somewhere.
We want simpler steps instead of no steps, but it does seem like there could be a new attempt to synthesize and mechanize. Zwol, could you write such a spec? Maybe an interactive spec sketch with taras on IRC? OK, I'll stop hijacking this bug.

/be
(In reply to comment #28)
> We want simpler steps instead of no steps, but it does seem like there could be
> a new attempt to synthesize and mechanize. Zwol, could you write such a spec?
> Maybe an interactive spec sketch with taras on IRC? OK, I'll stop hijacking
> this bug.

Sounds like a good idea for workweek session.</hijack>
I tell you what, I file a new bug 545052 under Rewriting and Analysis (and another bug 545050 under XPCOM) and cc a bunch of people, but I don't know who everyone should be cced, so you add more people, Brendan?
Sure -- I'll help cc: people. This is worth a try.

/be
I'm'a mention bug 105431 and its now-much-longer list of dependencies here, too.
Whiteboard: [ts] → [ts][2010Q1]
Blocks: 447581
Depends on: 544112, 523496
Depends on: 553456
Whiteboard: [ts][2010Q1] → [ts][2010Q1][Snappy]
Summary: Speed up css parsing → Speed up css parsing(Add telemetry probes to measure extent of problem)
Whiteboard: [ts][2010Q1][Snappy] → [ts][2010Q1][Snappy:P2]
Resetting owner to default per Zack's request.
Assignee: zackw → nobody
Status: ASSIGNED → NEW
Severity: normal → S3

The severity field for this bug is relatively low, S3. However, the bug has 16 votes and 50 CCs.
:emilio, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Needs up-to-date profiling.

Flags: needinfo?(emilio)
Priority: -- → P3
Severity: S3 → --
Type: defect → enhancement
You need to log in before you can comment on or make changes to this bug.