Bug 506090 - [HTML5][Patch] Named character reference tokenization performance regression
Status: RESOLVED FIXED
Keywords: perf
Product: Core
Classification: Components
Component: HTML: Parser
Version: unspecified
Hardware: All
OS: All
Priority: P2
Severity: normal
Target Milestone: ---
Assigned To: Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03)
URL: http://people.mozilla.org/~bnewman/bu...
Depends on: 501082
Blocks: 543458
Reported: 2009-07-23 14:02 PDT by Ben Newman (:bnewman) (:benjamn)
Modified: 2010-03-17 01:36 PDT
CC: 8 users


Attachments
WIP: C++ (392.57 KB, patch)
2010-02-10 04:59 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03)
WIP: Java (182.22 KB, patch)
2010-02-10 05:01 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03)
The mozilla-central diff (392.57 KB, patch)
2010-02-23 01:28 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03)
mozilla+ben: review+
html-parser diff (183.98 KB, patch)
2010-02-23 01:33 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03)
mozilla+ben: review+

Description Ben Newman (:bnewman) (:benjamn) 2009-07-23 14:02:03 PDT
Tokenizing a character reference such as &AElig; using the HTML5 parser takes time linear in the length of the list of character references.  We should be able to do better.

Compared with the HTML4 parser, the underlying task has gotten harder in two ways: first, there were only ~300 HTML4 entities as opposed to the 2100+ HTML5 named character references; second, the HTML5 spec requires the parser to perform a longest prefix match, whereas the HTML4 parser simply consumed the identifier following a '&' and gave back a character if that identifier named a known entity.

The first factor shouldn't matter if we choose a data structure and/or search algorithm that supports O(log n) or O(1) lookup.  The second factor can be overcome, I suspect, with a bit of cleverness.

Amdahl's Law obliges me to point out that character reference tokenization is almost certainly not the bulk of parsing any real-world webpage, so this may not be a super-high priority, but we're going to want to fix it sooner or later.
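[Editorial note: to make the longest-prefix requirement concrete, here is a minimal sketch over a hypothetical three-entry table; the naive O(n) scan mirrors the cost model being reported, not the code that eventually landed.]

```java
// Sketch only: the real table has 2100+ entries; these three names
// ("not", "not;", "notin;") are real HTML5 names used for illustration.
import java.util.Arrays;
import java.util.List;

public class LongestPrefixDemo {
    static final List<String> NAMES = Arrays.asList("not", "not;", "notin;");

    // Return the longest table entry that is a prefix of the input
    // following '&', or null if none matches. Scanning the whole list
    // like this is linear in the table size, which is the regression.
    static String longestPrefixMatch(String input) {
        String best = null;
        for (String name : NAMES) {
            if (input.startsWith(name)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "&notin;" resolves to the full name "notin;" ...
        System.out.println(longestPrefixMatch("notin;"));
        // ... but "&notit" must fall back to the shorter legacy "not".
        System.out.println(longestPrefixMatch("notit"));
    }
}
```

The fallback case is what HTML4 parsing never had to handle: the match is not "the identifier after '&'" but the longest known name that prefixes it.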
Comment 1 Ben Newman (:bnewman) (:benjamn) 2009-07-23 14:21:32 PDT
See this URL for an illustration of the problem:
http://people.mozilla.org/~bnewman/bugs/506090/
Comment 2 Ben Newman (:bnewman) (:benjamn) 2009-07-23 14:26:50 PDT
Relevant part of the HTML5 spec:
http://dev.w3.org/html5/spec/syntax.html#tokenizing-character-references
Comment 3 Ben Newman (:bnewman) (:benjamn) 2009-07-23 14:33:40 PDT
Relevant part of Tokenizer.java:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/javasrc/Tokenizer.java#l4037

And the corresponding part of parser/html/nsHtml5Tokenizer.cpp:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/nsHtml5Tokenizer.cpp#l2214

It's worth mentioning that this issue should be possible to resolve within the Java source files.  In particular, this regression is *not* a problem with the Java-to-C++ translator, nor does it have anything to do with differences between the performance characteristics of Java and C++.
Comment 4 Zack Weinberg (:zwol) 2009-07-28 11:21:28 PDT
Tries are good for longest prefix match with O(1) string comparisons.
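[Editorial note: a minimal trie sketch of this suggestion. This is an illustration with a pointer-chasing HashMap, not the static-array layout the eventual patch uses; the point is that each input character costs one node hop, with no whole-string comparisons.]

```java
import java.util.HashMap;
import java.util.Map;

public class EntityTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // non-null if a known name ends at this node
    }

    final Node root = new Node();

    void add(String name, String value) {
        Node n = root;
        for (char c : name.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.value = value;
    }

    // Walk the trie one character at a time, remembering the value of
    // the last node that completed a known name: that value is the
    // longest prefix match when the walk stops.
    String longestPrefixMatch(String input) {
        Node n = root;
        String best = null;
        for (int i = 0; i < input.length(); i++) {
            n = n.children.get(input.charAt(i));
            if (n == null) break;
            if (n.value != null) best = n.value;
        }
        return best;
    }
}
```

For example, after `add("not;", "\u00AC")` and `add("notin;", "\u2209")`, walking "notit" stops after "noti" and yields whatever name last completed along the way.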
Comment 5 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2009-08-07 05:28:33 PDT
Looks like avoiding premature optimization didn't work all the way here. :-(

I think a trie laid out into static const arrays ahead of compile time is the most proper way to fix this (and the same mechanism could be used for optimizing the doctype quirkiness check, too, even though that check runs at most once per parse). 

However, using binary search to locate the initial search window might be good enough with the current data structures.
Comment 6 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2009-11-11 05:17:28 PST
The first level of search should definitely be a mere array lookup by using the input character as the array index. However, I wonder if trie-ish things make sense for the second character in the entity name. Maybe using binary search + linear search for window widening would be good enough for the subsequent levels.
Comment 7 Ben Newman (:bnewman) (:benjamn) 2009-11-11 10:40:54 PST
Almost all entries in the table end with a ';', and it occurs to me that we don't need the full complexity of the longest-prefix logic for such character references, because ';' can only be the last character of a reference.  Instead, we could save characters starting after the '&' and look those characters up in a simple hash set when we encounter a ';'.  If there's no match, or if it takes longer to find the ';' than the maximum length of any character reference, then we could fall back to the smaller table of prefix candidates (the entries without ';'s).
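[Editorial note: a sketch of this proposal, not the code that landed. The three names in the set and the MAX_LEN of 32 are placeholders; the real table has 2100+ names and its own maximum length.]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SemicolonFastPath {
    // Placeholder subset of the names that end in ';'.
    static final Set<String> WITH_SEMICOLON =
        new HashSet<>(Arrays.asList("amp;", "lt;", "notin;"));
    // Placeholder for the length of the longest name in the full table.
    static final int MAX_LEN = 32;

    // Buffer characters after '&' until a ';' is seen, then do one hash
    // lookup. Returns the matched name, or null to signal falling back
    // to the smaller table of semicolonless prefix candidates.
    static String tryFastPath(String afterAmpersand) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < afterAmpersand.length() && i < MAX_LEN; i++) {
            char c = afterAmpersand.charAt(i);
            if (c == '&' || c == '<') break; // cannot continue a reference
            buf.append(c);
            if (c == ';') {
                String candidate = buf.toString();
                return WITH_SEMICOLON.contains(candidate) ? candidate : null;
            }
        }
        return null; // no ';' within range: use the prefix fallback
    }
}
```

The buffering cost is bounded by the longest name, and the fallback table (names without ';') is small enough that even linear prefix matching over it would be cheap.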
Comment 8 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2009-11-11 23:47:02 PST
(In reply to comment #7)
> Almost all entries in the table end with a ';', and it occurs to me that we
> don't need the full complexity of the longest-prefix logic for such character
> references, because ';' can only be the last character of a reference. 
> Instead, we could save characters starting after the '&' and look those
> characters up in a simple hash set when we encounter a ';'.  If there's no
> match, or if it takes longer to find the ';' than the maximum length of any
> character reference, then would could fall back to the smaller table of prefix
> candidates (the entries without ';'s).

This could work. Of course, the search needs to also terminate on < and &.

(In reply to comment #6)
> The first level of search should definitely be a mere array lookup by using the
> input character as the array index. However, I wonder if trie-ish things make
> sense for the second character in the entity name. Maybe using binary search +
> linear search for window widening would be good enough for the subsequent
> levels.

I looked into this more. If the first level is a mere array lookup, it doesn't make sense to use binary search plus linear window widening for the second level, because it would do more compares in many cases than linear narrowing.

Array lookup on the first level and then using the current code would be very reasonable for &amp; and &nbsp;. Unfortunately, MathML defines 12 named characters whose names start with "lt" and 12 with "gt", so at ';' after "lt" or "gt", the 'hi' index would need to travel over 12 entries.

Maybe the least intrusive fix is the following:
 1) Use array lookup for the first level of window narrowing.
 2) Whenever the window narrowing occurs subsequently, first advance lo and then check if hi has to become equal to lo (by looking at lo+1) and only do a linear search with hi if not.

This would yield trie-like performance for &amp;, &lt;, &gt; and &nbsp; (but not &quot;) but would avoid having to buffer input up to the length of the longest name for &ltxxxxxxxxxxxxxxxxxxxxxxx, &gtxxxxxxxxxxxxxxxxxxxxxxx, &ampxxxxxxxxxxxxxxxxxxxxxxx and &xxxxxxxxxxxxxxxxxxxxxx.
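[Editorial note: step 2 above can be sketched as follows, over a hypothetical six-entry slice of the sorted name table. The real patch works on generated static arrays and assumes step 1's first-character array lookup has already produced the initial [lo, hi] window.]

```java
public class WindowNarrowing {
    // Hypothetical sorted slice; entries sharing a prefix are contiguous.
    static final String[] NAMES =
        { "gt", "gt;", "gtdot;", "lt", "lt;", "ltdot;" };

    // Narrow [lo, hi] so every remaining entry has character c at
    // position pos. Returns null when the window becomes empty.
    static int[] narrow(int lo, int hi, int pos, char c) {
        while (lo <= hi && (NAMES[lo].length() <= pos
                || NAMES[lo].charAt(pos) != c)) {
            lo++;                       // advance lo linearly, as now
        }
        if (lo > hi) return null;
        // The proposed check: peek at lo+1. If it no longer matches,
        // hi snaps to lo and the linear scan with hi is skipped.
        if (lo == hi || NAMES[lo + 1].length() <= pos
                || NAMES[lo + 1].charAt(pos) != c) {
            return new int[] { lo, lo };
        }
        while (NAMES[hi].length() <= pos || NAMES[hi].charAt(pos) != c) {
            hi--;                       // otherwise narrow hi linearly
        }
        return new int[] { lo, hi };
    }
}
```

With this slice, narrowing the "lt" window {3, 5} on ';' at position 2 snaps straight to {4, 4} ("lt;") without hi ever scanning past "ltdot;", which is the trie-like behavior described above.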
Comment 9 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-09 04:30:18 PST
Taking as part of tp4 work.
Comment 10 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-10 04:59:05 PST
Created attachment 426230 [details] [diff] [review]
WIP: C++
Comment 11 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-10 05:01:37 PST
Created attachment 426232 [details] [diff] [review]
WIP: Java

This patch brings character reference tokenization perf to a level similar to raw text tokenization.
Comment 12 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-23 01:28:41 PST
Created attachment 428395 [details] [diff] [review]
The mozilla-central diff
Comment 13 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-23 01:31:54 PST
>  2) Whenever the window narrowing occurs subsequently, first advance lo and
>  then check if hi has to become equal to lo (by looking at lo+1) and only do 
>  linear search with hi if not.

This check turned out to be unnecessary for meeting the objective of getting character reference tokenization to the perf level of tokenizing an equal amount of raw text.
Comment 14 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-02-23 01:33:28 PST
Created attachment 428397 [details] [diff] [review]
html-parser diff
Comment 15 Ben Newman (:bnewman) (:benjamn) 2010-03-15 11:35:10 PDT
Comment on attachment 428397 [details] [diff] [review]
html-parser diff

>+        // Java initializes arrays to zero. Zero is our magic value for no hilo
>+        // value.
>+        int[][] hiLoTable = new int[0x7B][52];

Make these numbers a little less magical?  For example,

          int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];

or similar?
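[Editorial note: the suggested char arithmetic is exactly equivalent to the magic numbers, which a quick check confirms: 'z' is 0x7A, and the A-Z plus a-z ranges together hold 52 characters.]

```java
public class TableBounds {
    public static void main(String[] args) {
        // First dimension: indexable directly by any char up to 'z'.
        System.out.println('z' + 1 == 0x7B);
        // Second dimension: one slot per ASCII letter, upper and lower.
        System.out.println('Z' - 'A' + 1 + 'z' - 'a' + 1 == 52);
    }
}
```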
Comment 16 Ben Newman (:bnewman) (:benjamn) 2010-03-15 12:15:50 PDT
Comment on attachment 428395 [details] [diff] [review]
The mozilla-central diff

>-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
>-static PRUnichar const NAME_##N[] = { CHARS };
>+static PRInt32 const HILO_ACCEL_65[] = {
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  12386493,
>+  0,
---8<---

I don't know a good way to make these data readable, so you might as well save yourself and anyone else reading the file some scrolling by separating the array elements with ", " instead of ",\n".

>+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
>+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
---8<---

It's also regrettable that this file no longer reads like data, since the first two characters of each reference are missing.

Could you maybe include the first two characters in a comment?  For instance,

  NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)

and include a comment at the top of the file pointing to the explanatory comment in Tokenizer.java?
Comment 17 Ben Newman (:bnewman) (:benjamn) 2010-03-15 12:17:50 PDT
I realize that my suggestions in comment #16 actually require changes to the java patch, but the resulting C++ code was what caught my attention.
Comment 18 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-10-03) 2010-03-17 01:36:24 PDT
(In reply to comment #15)
> (From update of attachment 428397 [details] [diff] [review])
> >+        // Java initializes arrays to zero. Zero is our magic value for no hilo
> >+        // value.
> >+        int[][] hiLoTable = new int[0x7B][52];
> 
> Make these numbers a little less magical?  For example,
> 
>           int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];
> 
> or similar?

Fixed.

(In reply to comment #16)
> (From update of attachment 428395 [details] [diff] [review])
> >-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
> >-static PRUnichar const NAME_##N[] = { CHARS };
> >+static PRInt32 const HILO_ACCEL_65[] = {
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  12386493,
> >+  0,
> ---8<---
> 
> I don't know a good way to make these data readable, so you might as well save
> yourself and anyone else reading the file some scrolling by separating the
> array elements with ", " instead of ",\n".

Fixed.

> >+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
> >+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
> ---8<---
> 
> It's also regrettable that this file no longer reads like data, since the first
> two characters of each reference are missing.
> 
> Could you maybe include the first two characters in a comment?  For instance,
> 
>   NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> 
> and include a comment at the top of the file pointing to the explanatory
> comment in Tokenizer.java?

Fixed.

Thanks!

http://hg.mozilla.org/mozilla-central/rev/42ccfcc82e1e
