[HTML5][Patch] Named character reference tokenization performance regression

RESOLVED FIXED

Status


Core
HTML: Parser
P2
normal
RESOLVED FIXED
8 years ago
7 years ago

People

(Reporter: bnewman, Assigned: hsivonen)

Tracking

({perf})

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(2 attachments, 2 obsolete attachments)

(Reporter)

Description

8 years ago
Tokenizing a character reference such as &AElig; using the HTML5 parser takes time linear in the length of the list of named character references.  We should be able to do better.

Compared with the HTML4 parser, the underlying task has gotten harder in two ways: first, there were only ~300 HTML4 entities as opposed to the 2100+ HTML5 named character references; second, the HTML5 spec requires the parser to perform a longest prefix match, whereas the HTML4 parser simply consumed the identifier following a '&' and gave back a character if that identifier named a known entity.

The first factor shouldn't matter if we choose a data structure and/or search algorithm that supports O(log n) or O(1) lookup.  The second factor can be overcome, I suspect, with a bit of cleverness.

Amdahl's Law obliges me to point out that character reference tokenization is almost certainly not the bulk of parsing any real-world webpage, so this may not be a super-high priority, but we're going to want to fix it sooner or later.
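For a sense of what the reporter is asking for: a toy sketch in which a sorted array plus binary search replaces the linear scan, giving O(log n) membership tests. The six-entry table below is a hypothetical stand-in for the real 2100+ entry list, not the parser's actual data structure.

```java
import java.util.Arrays;

public class RefLookup {
    // Hypothetical miniature of the reference table; must be kept sorted
    // for Arrays.binarySearch to be valid.
    static final String[] NAMES = { "AElig", "amp", "gt", "lt", "nbsp", "quot" };

    // O(log n) membership test via binary search over the sorted table,
    // versus the O(n) linear scan that caused the regression.
    static boolean isKnown(String name) {
        return Arrays.binarySearch(NAMES, name) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isKnown("amp"));   // true
        System.out.println(isKnown("bogus")); // false
    }
}
```

Binary search alone does not solve the longest-prefix-match requirement, which is why the comments below converge on window narrowing instead.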
(Reporter)

Comment 1

8 years ago
See this URL for an illustration of the problem:
http://people.mozilla.org/~bnewman/bugs/506090/
(Reporter)

Comment 2

8 years ago
Relevant part of the HTML5 spec:
http://dev.w3.org/html5/spec/syntax.html#tokenizing-character-references
(Reporter)

Comment 3

8 years ago
Relevant part of Tokenizer.java:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/javasrc/Tokenizer.java#l4037

And the corresponding part of parser/html/nsHtml5Tokenizer.cpp:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/nsHtml5Tokenizer.cpp#l2214

It's worth mentioning that it should be possible to resolve this issue within the Java source files.  In particular, this regression is *not* a problem with the Java-to-C++ translator, nor does it have anything to do with differences between the performance characteristics of Java and C++.
Comment 4

8 years ago
Tries are good for longest prefix match with O(1) string comparisons.
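A minimal trie sketch of that idea (not the structure the eventual patch uses, which lays the table out in flat static arrays): each input character is consumed once, so matching does O(1) work per character with no whole-string comparisons, and the longest complete name seen so far is remembered per the spec's longest-prefix rule. The two entries below are a toy subset.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityTrie {
    static final class Node {
        Map<Character, Node> next = new HashMap<>();
        String value; // non-null if a reference name ends at this node
    }

    final Node root = new Node();

    void add(String name, String value) {
        Node n = root;
        for (char c : name.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.value = value;
    }

    // Value of the longest prefix of input that names a known reference,
    // or null if no prefix matches.
    String longestMatch(String input) {
        Node n = root;
        String best = null;
        for (int i = 0; i < input.length(); i++) {
            n = n.next.get(input.charAt(i));
            if (n == null) break;
            if (n.value != null) best = n.value;
        }
        return best;
    }

    public static void main(String[] args) {
        EntityTrie t = new EntityTrie();
        t.add("not", "\u00ac");    // &not (legacy, no semicolon)
        t.add("notin;", "\u2209"); // &notin;
        // "notit" falls off the trie after "noti", so the longest
        // complete prefix "not" wins.
        System.out.println("\u00ac".equals(t.longestMatch("notit"))); // true
    }
}
```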
(Assignee)

Comment 5

8 years ago
Looks like avoiding premature optimization didn't work all the way here. :-(

I think a trie laid out into static const arrays ahead of compile time is the most proper way to fix this (and the same mechanism could be used for optimizing the doctype quirkiness check, too, even though that check runs at most once per parse). 

However, using binary search to locate the initial search window might be good enough with the current data structures.
(Assignee)

Comment 6

8 years ago
The first level of search should definitely be a mere array lookup by using the input character as the array index. However, I wonder if trie-ish things make sense for the second character in the entity name. Maybe using binary search + linear search for window widening would be good enough for the subsequent levels.
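A sketch of that first level: dispatch on the character after '&' by plain array indexing, yielding the lo/hi window of the sorted name table whose entries start with that letter. The window bounds below are invented for illustration; the real patch generates them from the full reference list.

```java
public class FirstLevel {
    // windows[c] = {lo, hi} indices into the sorted name table, or null
    // if no reference name starts with c. Sized to cover ASCII up to 'z'.
    static final int[][] windows = new int['z' + 1][];
    static {
        windows['a'] = new int[] { 0, 41 };    // hypothetical "aacute".."awint" range
        windows['g'] = new int[] { 700, 780 }; // hypothetical "gE".."gvnE" range
    }

    static int[] windowFor(char first) {
        return first < windows.length ? windows[first] : null;
    }

    public static void main(String[] args) {
        System.out.println(windowFor('a') != null); // true: a window exists for 'a'
        System.out.println(windowFor('~') == null); // true: '~' is past 'z'
    }
}
```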
(Reporter)

Comment 7

8 years ago
Almost all entries in the table end with a ';', and it occurs to me that we don't need the full complexity of the longest-prefix logic for such character references, because ';' can only be the last character of a reference.  Instead, we could save characters starting after the '&' and look those characters up in a simple hash set when we encounter a ';'.  If there's no match, or if it takes longer to find the ';' than the maximum length of any character reference, then we could fall back to the smaller table of prefix candidates (the entries without ';'s).
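A minimal sketch of this scheme (not what eventually landed): buffer the name up to ';' and test it against a hash set, leaving only the semicolon-less legacy references to the prefix-match fallback. The set contents and MAX_NAME_LEN below are placeholders, and the early termination on '&' and '<' comes from comment 8.

```java
import java.util.HashSet;
import java.util.Set;

public class SemicolonLookup {
    // Placeholder subset of the ';'-terminated reference names (without
    // the trailing ';'); the real set would hold ~2000 entries.
    static final Set<String> WITH_SEMI = new HashSet<>();
    static final int MAX_NAME_LEN = 32; // placeholder for the longest spec name
    static {
        WITH_SEMI.add("amp");
        WITH_SEMI.add("notin");
    }

    // Returns the matched name, or null if the caller must fall back to
    // the small table of prefix candidates.
    static String tryExact(String input) {
        for (int i = 0; i < input.length() && i <= MAX_NAME_LEN; i++) {
            char c = input.charAt(i);
            if (c == ';') {
                String name = input.substring(0, i);
                return WITH_SEMI.contains(name) ? name : null;
            }
            if (c == '&' || c == '<') return null; // terminate early (comment 8)
        }
        return null; // ran out of input or exceeded the maximum length
    }

    public static void main(String[] args) {
        System.out.println(tryExact("amp;rest")); // amp
        System.out.println(tryExact("bogus;"));   // null
    }
}
```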
(Assignee)

Comment 8

8 years ago
(In reply to comment #7)
> Almost all entries in the table end with a ';', and it occurs to me that we
> don't need the full complexity of the longest-prefix logic for such character
> references, because ';' can only be the last character of a reference. 
> Instead, we could save characters starting after the '&' and look those
> characters up in a simple hash set when we encounter a ';'.  If there's no
> match, or if it takes longer to find the ';' than the maximum length of any
> character reference, then we could fall back to the smaller table of prefix
> candidates (the entries without ';'s).

This could work. Of course, the search needs to also terminate on < and &.

(In reply to comment #6)
> The first level of search should definitely be a mere array lookup by using the
> input character as the array index. However, I wonder if trie-ish things make
> sense for the second character in the entity name. Maybe using binary search +
> linear search for window widening would be good enough for the subsequent
> levels.

I looked into this more. If the first level is a mere array lookup, it doesn't make sense to use binary search plus linear window widening for the second level, because it would do more compares in many cases than linear narrowing.

Array lookup on the first level and then using the current code would be very reasonable for &amp; and &nbsp;. Unfortunately, MathML defines 12 named characters whose names start with "lt" and 12 with "gt", so at ';' after "lt" or "gt", the 'hi' index would need to travel over 12 entries.

Maybe the least intrusive fix is the following:
 1) Use array lookup for the first level of window narrowing.
 2) Whenever the window narrowing occurs subsequently, first advance lo and then check if hi has to become equal to lo (by looking at lo+1) and only do a linear search with hi if not.

This would yield trie-like performance for &amp;, &lt;, &gt; and &nbsp; (but not &quot;) but would avoid having to buffer input up to the length of the longest name for &ltxxxxxxxxxxxxxxxxxxxxxxx, &gtxxxxxxxxxxxxxxxxxxxxxxx, &ampxxxxxxxxxxxxxxxxxxxxxxx and &xxxxxxxxxxxxxxxxxxxxxx.
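The lo/hi window narrowing this builds on can be sketched as follows, over a toy sorted slice of "gt"-prefixed names (the real table is generated from the spec). Each consumed character advances lo past names that compare low and pulls hi past names that compare high, so the window only ever shrinks, and the longest complete name seen is remembered per the spec's longest-prefix rule.

```java
public class WindowNarrow {
    // Toy sorted slice; ';' (0x3B) sorts before letters, so "gt;" precedes "gtcc;".
    static final String[] NAMES = { "gt", "gt;", "gtcc;", "gtcir;", "gtdot;" };

    static String longestMatch(String input) {
        int lo = 0, hi = NAMES.length - 1, matched = -1;
        for (int pos = 0; pos < input.length(); pos++) {
            char c = input.charAt(pos);
            // Advance lo past names that are exhausted or compare low at pos.
            while (lo <= hi && (NAMES[lo].length() <= pos || NAMES[lo].charAt(pos) < c)) lo++;
            // Pull hi past names that compare high at pos.
            while (hi >= lo && NAMES[hi].length() > pos && NAMES[hi].charAt(pos) > c) hi--;
            if (lo > hi) break; // window empty: no further match possible
            // A complete name ending exactly here sorts lowest, i.e. at lo.
            if (NAMES[lo].length() == pos + 1 && NAMES[lo].charAt(pos) == c) matched = lo;
        }
        return matched >= 0 ? NAMES[matched] : null;
    }

    public static void main(String[] args) {
        System.out.println(longestMatch("gt;"));  // gt;
        System.out.println(longestMatch("gtxx")); // gt
    }
}
```

On "gtxx" the window empties at the third character, and the remembered match "gt" is returned, which is the trie-like behavior for the short names described above.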
(Assignee)

Updated

8 years ago
Priority: -- → P2
(Assignee)

Comment 9

7 years ago
Taking as part of tp4 work.
Assignee: nobody → hsivonen
Blocks: 543458
Status: NEW → ASSIGNED
(Assignee)

Comment 10

7 years ago
Created attachment 426230 [details] [diff] [review]
WIP: C++
(Assignee)

Comment 11

7 years ago
Created attachment 426232 [details] [diff] [review]
WIP: Java

This patch brings character reference tokenization perf to a level similar to raw text tokenization.
(Assignee)

Comment 12

7 years ago
Created attachment 428395 [details] [diff] [review]
The mozilla-central diff
Attachment #426230 - Attachment is obsolete: true
Attachment #428395 - Flags: review?(bnewman)
(Assignee)

Comment 13

7 years ago
>  2) Whenever the window narrowing occurs subsequently, first advance lo and
>  then check if hi has to become equal to lo (by looking at lo+1) and only do
>  a linear search with hi if not.

This check turned out to be unnecessary for meeting the objective of getting character reference tokenization to the perf level of tokenizing the same amount of plain text.
(Assignee)

Comment 14

7 years ago
Created attachment 428397 [details] [diff] [review]
html-parser diff
Attachment #426232 - Attachment is obsolete: true
Attachment #428397 - Flags: review?(bnewman)
(Assignee)

Updated

7 years ago
Summary: [HTML5] Named character reference tokenization performance regression → [HTML5][Patch] Named character reference tokenization performance regression
(Reporter)

Updated

7 years ago
Depends on: 501082
(Reporter)

Comment 15

7 years ago
Comment on attachment 428397 [details] [diff] [review]
html-parser diff

>+        // Java initializes arrays to zero. Zero is our magic value for no hilo
>+        // value.
>+        int[][] hiLoTable = new int[0x7B][52];

Make these numbers a little less magical?  For example,

          int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];

or similar?
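For the record, the suggested expressions evaluate to exactly the review's magic numbers:

```java
public class MagicNumbers {
    public static void main(String[] args) {
        // 'z' is 0x7A, so 'z' + 1 == 0x7B (123): one slot per ASCII code
        // point up to and including 'z'.
        System.out.println(('z' + 1) == 0x7B); // true
        // 26 uppercase plus 26 lowercase letters == 52 second-level slots.
        System.out.println(('Z' - 'A' + 1 + 'z' - 'a' + 1) == 52); // true
    }
}
```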
Attachment #428397 - Flags: review?(bnewman) → review+
(Reporter)

Updated

7 years ago
Attachment #428395 - Flags: review?(bnewman) → review+
(Reporter)

Comment 16

7 years ago
Comment on attachment 428395 [details] [diff] [review]
The mozilla-central diff

>-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
>-static PRUnichar const NAME_##N[] = { CHARS };
>+static PRInt32 const HILO_ACCEL_65[] = {
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  12386493,
>+  0,
---8<---

I don't know a good way to make these data readable, so you might as well save yourself and anyone else reading the file some scrolling by separating the array elements with ", " instead of ",\n".

>+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
>+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
---8<---

It's also regrettable that this file no longer reads like data, since the first two characters of each reference are missing.

Could you maybe include the first two characters in a comment?  For instance,

  NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)

and include a comment at the top of the file pointing to the explanatory comment in Tokenizer.java?
(Reporter)

Comment 17

7 years ago
I realize that my suggestions in comment #16 actually require changes to the java patch, but the resulting C++ code was what caught my attention.
(Assignee)

Comment 18

7 years ago
(In reply to comment #15)
> (From update of attachment 428397 [details] [diff] [review])
> >+        // Java initializes arrays to zero. Zero is our magic value for no hilo
> >+        // value.
> >+        int[][] hiLoTable = new int[0x7B][52];
> 
> Make these numbers a little less magical?  For example,
> 
>           int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];
> 
> or similar?

Fixed.

(In reply to comment #16)
> (From update of attachment 428395 [details] [diff] [review])
> >-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
> >-static PRUnichar const NAME_##N[] = { CHARS };
> >+static PRInt32 const HILO_ACCEL_65[] = {
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  12386493,
> >+  0,
> ---8<---
> 
> I don't know a good way to make these data readable, so you might as well save
> yourself and anyone else reading the file some scrolling by separating the
> array elements with ", " instead of ",\n".

Fixed.

> >+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
> >+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
> ---8<---
> 
> It's also regrettable that this file no longer reads like data, since the first
> two characters of each reference are missing.
> 
> Could you maybe include the first two characters in a comment?  For instance,
> 
>   NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> 
> and include a comment at the top of the file pointing to the explanatory
> comment in Tokenizer.java?

Fixed.

Thanks!

http://hg.mozilla.org/mozilla-central/rev/42ccfcc82e1e
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED