As an artifact of its SAX heritage, the HTML5 tokenizer batches character data transfers from the input buffer into the tree builder's accumulation buffer, flushing runs of characters via memcpy. However, the tokenizer has already read each of those characters once by the time it flushes. If writing to the accumulation buffer one character at a time were cheaper than the amortized per-character cost of the memcpy (a second read plus a write), it would be worthwhile to write characters into the accumulation buffer one by one. This would mean that the tokenizer/tree builder boundary could not become fully virtual for sanitization layers and the like, since the per-character write needs to be inlineable. This would probably require bug 489820 and bug 489821 as prerequisites.
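To make the two strategies concrete, here is a minimal sketch, not the parser's actual code. All names (AccumulationBuffer, appendRun, appendChar, tokenizeBatched, tokenizePerChar) are hypothetical, and '<' stands in for any character that ends a run of character data; it is simply dropped here. The sketch only shows the shape of the two write paths and makes no claim about which one is faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the tree builder's accumulation buffer.
class AccumulationBuffer {
 public:
  // Batched path: copy a whole run of characters with one memcpy.
  // The source characters are read a second time here.
  void appendRun(const char16_t* src, std::size_t len) {
    assert(src != nullptr || len == 0);
    if (len == 0) return;
    std::size_t old = buf_.size();
    buf_.resize(old + len);
    std::memcpy(buf_.data() + old, src, len * sizeof(char16_t));
  }

  // Per-character path: small and non-virtual, so the compiler is free
  // to inline it into the tokenizer loop.
  inline void appendChar(char16_t c) { buf_.push_back(c); }

  std::u16string str() const {
    return std::u16string(buf_.begin(), buf_.end());
  }

 private:
  std::vector<char16_t> buf_;
};

// Strategy A (current behavior, simplified): scan to the end of a run,
// then flush the run in one call.
std::u16string tokenizeBatched(const std::u16string& input) {
  AccumulationBuffer out;
  std::size_t runStart = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i] == u'<') {  // assumption: '<' ends a character run
      out.appendRun(input.data() + runStart, i - runStart);
      runStart = i + 1;
    }
  }
  out.appendRun(input.data() + runStart, input.size() - runStart);
  return out.str();
}

// Strategy B (proposed): write each character as soon as it is read.
std::u16string tokenizePerChar(const std::u16string& input) {
  AccumulationBuffer out;
  for (char16_t c : input) {
    if (c != u'<') out.appendChar(c);
  }
  return out.str();
}
```

Both functions produce the same output. The difference is that strategy B only pays off if appendChar can be inlined, which is why it conflicts with putting a virtual interface between the tokenizer and the tree builder.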