Closed Bug 489820 Opened 11 years ago Closed 4 years ago

[HTML5] Avoid repeated conditionals upon buffer append by sizing buffers for worst case

Categories

(Core :: DOM: HTML Parser, enhancement, P4)

Other Branch
enhancement

Tracking


RESOLVED FIXED
mozilla43
Tracking Status
firefox43 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 3 open bugs)

Details

Attachments

(4 files, 17 obsolete files)

5.42 KB, patch
hsivonen
: review+
Details | Diff | Splinter Review
9.05 KB, patch
hsivonen
: review+
Details | Diff | Splinter Review
15.00 KB, patch
wchen
: review+
Details | Diff | Splinter Review
34.99 KB, patch
wchen
: review+
Details | Diff | Splinter Review
Various buffers are grown dynamically on a per-append basis. This involves an available-space check on every append.

A given call into the tokenizer cannot cause the buffers to grow by more than the length of the input buffer. Thus, it would be possible to use more RAM but execute fewer conditionals by reserving buffer space for the current size plus the length of the tokenizer input buffer.
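The trade-off described above can be sketched as follows. This is a minimal illustration with invented names, not the actual nsHtml5* code: the left-hand approach branches on every append, the right-hand approach branches once per tokenizer call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: per-append capacity checks vs. reserving
// the worst case (current size + input length) once per tokenizer call.
class CharBuffer {
public:
  // Per-append check: one branch for every character appended.
  void AppendChecked(char16_t c) {
    if (mLen == mBuf.size()) {
      mBuf.resize(mBuf.empty() ? 16 : mBuf.size() * 2);
    }
    mBuf[mLen++] = c;
  }

  // Worst-case reservation: a single branch per tokenizer call.  In the
  // simple model of comment 0, one call cannot grow the buffer by more
  // than the input length, so one resize up front suffices.
  void ReserveWorstCase(std::size_t aInputLength) {
    if (mLen + aInputLength > mBuf.size()) {
      mBuf.resize(mLen + aInputLength);
    }
  }
  void AppendUnchecked(char16_t c) { mBuf[mLen++] = c; }

  std::size_t Length() const { return mLen; }

private:
  std::vector<char16_t> mBuf;
  std::size_t mLen = 0;
};
```

The cost is potential over-allocation; the benefit is that the hot append path has no conditional at all.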
Blocks: 489822
Priority: -- → P4
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
http://hg.mozilla.org/mozilla-central/rev/a0f0fde99844
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The fix here was broken. See bug 555462.

It was broken because it didn't take into account that there can be appendable data baked into the current tokenizer state (either a failed character reference in strBuf or something like a single '<' as part of the state itself).

Rather than trying to make things more clever here, I think the right fix for now is to back this out instead of digging the hole deeper. Then, if we really want this later, it should be re-attempted as a well-isolated landing.
Depends on: 555462
Note to self: This solution needs special casing for the U+0000 → U+FFFD mapping if the tokenizer is ever changed to operate on UTF-8 instead of UTF-16.
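The reason for that note: the U+0000 → U+FFFD replacement preserves length in UTF-16 (both are one code unit) but expands in UTF-8 (one byte becomes three). A small sketch of the encoding arithmetic, assuming nothing about the parser itself:

```cpp
#include <cstddef>

// Code-unit counts per scalar value.  These are plain UTF-8/UTF-16
// encoding facts, not parser code: U+0000 is 1 byte in UTF-8 while its
// replacement U+FFFD is 3 bytes, so the 0x00 -> U+FFFD mapping can grow
// a UTF-8 buffer by up to 3x; in UTF-16 both occupy one code unit.
constexpr std::size_t Utf8Bytes(char32_t c) {
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}
constexpr std::size_t Utf16Units(char32_t c) {
  return c < 0x10000 ? 1 : 2;
}
```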
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 695313
Blocks: 569328
Blocks: 1027585
Blocks: 1029671
Depends on: 1173818
The patch for bug 1176668 goes between part 1 and part 2 here, because I thought part 1 here had broken NCR handling. By the time I realized I had found an ancient bug, the patches were already entangled enough that it's not worthwhile to disentangle them.
Status: REOPENED → ASSIGNED
Attachment #8625268 - Flags: review?(wchen)
Attachment #8625269 - Flags: review?(wchen)
Blocks: 1176681
Comment on attachment 8625277 [details] [diff] [review]
m-c part 3: Size the buffers to worst case before each tokenization call

Oops. Gecko-specific code in part 3 is still wrong.
Attachment #8625277 - Flags: review?(wchen)
So the key insight here is that (when the input to the HTML parser is UTF-16 and the output is UTF-16, as is the case in Gecko) the HTML parsing algorithm never expands strings: it either keeps their length the same or contracts them. Specifically, various markup does not end up in subsequent buffers at all, a single character reference is always at least 4 UTF-16 code units long in the input and at most two UTF-16 code units in the output, and while U+0000 can turn into U+FFFD, the two are equally long (one code unit) in UTF-16.

Part 3 of the patch set still crashes, because I forgot about View Source. Oops...
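The contraction invariant claimed above can be spot-checked with a few hand-picked markup/output pairs. The table below is illustrative only, not the parser's own data, and the helper names are invented:

```cpp
#include <cstddef>

// (input as written, character data it produces); the claim is that the
// output is never longer than the input in UTF-16 code units.
struct Sample {
  const char16_t* input;
  const char16_t* output;
};

inline std::size_t Units(const char16_t* s) {
  std::size_t n = 0;
  while (s[n]) ++n;
  return n;
}

inline bool NeverExpands(const Sample* aSamples, std::size_t aCount) {
  for (std::size_t i = 0; i < aCount; ++i) {
    if (Units(aSamples[i].output) > Units(aSamples[i].input)) {
      return false;
    }
  }
  return true;
}
```

For example, "&amp;" is 5 code units in and 1 out, "&#x1F600;" is 9 in and 2 out (a surrogate pair), and tag markup like "<b>" contributes no character data at all.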
It's a bit unfortunate that this makes tokenizeBuffer() a footgun in the sense that you must consistently call another method right before it or memory unsafety ensues. Still, this solution seems to be the most sensible one given two realities: the existing code was written to constraints that assumed the allocator subsystem would take care of stopping the parser when needed, and that part of the "infallible" allocator plan has since been dropped.
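One hypothetical way to blunt that footgun is an API shape that pairs the reservation with the tokenize call so callers cannot skip it. All names below are invented for the sketch; this is not the Gecko nsHtml5Tokenizer API:

```cpp
#include <cstddef>

// Make the unchecked tokenize entry point private and expose only a
// method that performs the worst-case reservation first, so the two can
// never be decoupled by a caller.
class SafeTokenizer {
 public:
  // Returns false on allocation failure instead of proceeding unsafely.
  bool TokenizeBuffer(const char16_t* aBuf, std::size_t aLength) {
    if (!EnsureBufferSpace(aLength)) {
      return false;
    }
    TokenizeBufferUnchecked(aBuf, aLength);
    return true;
  }

  std::size_t Consumed() const { return mConsumed; }

 private:
  bool EnsureBufferSpace(std::size_t) { return true; }  // stub for the sketch
  void TokenizeBufferUnchecked(const char16_t*, std::size_t aLength) {
    mConsumed += aLength;  // stand-in for real tokenization
  }
  std::size_t mConsumed = 0;
};
```

The design choice is the one the comment above alludes to: with infallible allocation no longer stopping the parser, the boolean return becomes the signal that tokenization must not proceed.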
Also: I have noticed the opportunity to save memory by unifying the buffer in the tokenizer and the buffer in the tree builder. However, this patch queue is risky enough as-is and there's no way I'd get the unification done before I go out of office for 3 weeks, so I'm leaving the unification as a follow-up.
Attachment #8625261 - Attachment is obsolete: true
Attachment #8625261 - Flags: review?(wchen)
Comment on attachment 8625679 [details] [diff] [review]
htmlparser repo Part 1: Consolidate strBuf and longStrBuf (comments fixed in a follow-up), v2

Moving the buffer consolidation to bug 559303 where it belongs to make it easier to explain the order in which this patch queue needs to land.
Attachment #8625679 - Attachment is obsolete: true
Attachment #8625680 - Attachment is obsolete: true
Attachment #8625269 - Attachment is obsolete: true
Attachment #8625269 - Flags: review?(wchen)
Attachment #8625268 - Attachment is obsolete: true
Attachment #8625268 - Flags: review?(wchen)
(In reply to Henri Sivonen (:hsivonen) from comment #6)
> The patch for bug 1176668 goes between part 1 and part 2 here, because I
> thought part 1 here had broken NCR handling. By the time I realized I had
> found an ancient bug, the patches were already entangled enough that it's
> not worthwhile to disentangle them.

This no longer applies. Now the order is simply: bug 559303, bug 1176668, this bug.
Attachment #8625701 - Flags: review?(wchen)
Attachment #8625702 - Flags: review?(wchen)
Attachment #8625703 - Flags: review?(wchen)
Comment on attachment 8625705 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, vN

See bug 1029671 comment 36 for the big picture of the whole queue and about landing.
Attachment #8625705 - Flags: review?(wchen)
Attachment #8625701 - Flags: review?(wchen) → review+
Attachment #8625702 - Flags: review?(wchen) → review+
Comment on attachment 8625703 [details] [diff] [review]
htmlparser repo part 2: Size the buffers to worst case before each tokenization call

Review of attachment 8625703 [details] [diff] [review]:
-----------------------------------------------------------------

I took a look at the crashes and it seems like it's essentially the same problem as comment 2, and the insight of comment 13 doesn't hold.

The W2 test that is causing the crash looks like this (I've simplified the example):

var s = "<script></a";
for (var i=0; i<s.length; i++) {
  document.write(s[i]);
}

Here is what I think is happening: it's writing to the tokenizer one character at a time, so we don't grow the buffer by more than 2 characters (mozilla::RoundUpPow2(1)). At the same time, this is causing a change in tokenizer state. When the tokenizer consumes the final "a", it's in a state that will cause it to output "</a", but the buffer can't hold that many characters and overflows.
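The failure mode can be modeled with a toy version of the script-data end-tag handling (invented names, not the real tokenizer): inside <script>, '<' and '/' are buffered as tokenizer state, and a single further character of input can flush all three at once, so sizing the buffer by the current input length alone undercounts the worst case.

```cpp
#include <string>

// Toy reproduction of the crash scenario: feeding one code unit of
// input can emit up to three code units of output once pending state
// ("</") is flushed.
struct ScriptDataToy {
  std::u16string stateBuf;  // "</" held as tokenizer state

  // Returns the characters emitted for one input character; the result
  // may be longer than one code unit.
  std::u16string Feed(char16_t c) {
    if (c == u'<' || (c == u'/' && stateBuf == u"<")) {
      stateBuf += c;  // buffered, nothing emitted yet
      return u"";
    }
    std::u16string out = stateBuf;  // flush pending "</"
    out += c;
    stateBuf.clear();
    return out;
  }
};
```

The correct per-call bound therefore has to include the code units already baked into the tokenizer state, not just the length of the incoming buffer.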
Attachment #8625703 - Flags: review?(wchen) → review-
Comment on attachment 8625705 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, vN

Review of attachment 8625705 [details] [diff] [review]:
-----------------------------------------------------------------

::: parser/html/nsHtml5TreeBuilderCppSupplement.h
@@ +894,5 @@
>  
>  void
>  nsHtml5TreeBuilder::accumulateCharacters(const char16_t* aBuf, int32_t aStart, int32_t aLength)
>  {
> +  memcpy(charBuffer + charBufferLen, aBuf + aStart, sizeof(char16_t) * aLength);

We overflow here. There isn't an assertion here, which is why you weren't getting anything from debug builds.
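A minimal sketch of the kind of assertion that would have caught this before the unchecked memcpy. The field names mirror the quoted code, but the struct below is a standalone illustration, not the actual nsHtml5TreeBuilder:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assert the invariant that worst-case sizing is supposed to establish
// before doing the unchecked copy, so a sizing bug trips immediately in
// debug builds instead of silently corrupting memory.
struct ToyTreeBuilder {
  char16_t* charBuffer;
  int32_t charBufferLen;
  int32_t charBufferCapacity;

  void AccumulateCharacters(const char16_t* aBuf, int32_t aStart,
                            int32_t aLength) {
    // The invariant the worst-case reservation is meant to guarantee:
    assert(charBufferLen + aLength <= charBufferCapacity);
    memcpy(charBuffer + charBufferLen, aBuf + aStart,
           sizeof(char16_t) * aLength);
    charBufferLen += aLength;
  }
};
```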
Attachment #8625705 - Flags: review?(wchen) → review-
Comment on attachment 8626084 [details] [diff] [review]
m-c part 3 - undo part of part 2 to avoid crashing on web-platform-tests-2

Review of attachment 8626084 [details] [diff] [review]:
-----------------------------------------------------------------

This patch shouldn't be necessary if you fix the previously mentioned bug.
Attachment #8626084 - Flags: review?(wchen) → review-
Attachment #8626085 - Flags: review?(wchen)
Attachment #8626084 - Attachment is obsolete: true
Attachment #8626085 - Attachment is obsolete: true
Thanks. Attaching patch 1 with an updated hg comment to keep track of stuff.
Attachment #8625701 - Attachment is obsolete: true
Attachment #8644266 - Flags: review+
Thanks for finding the bug! Attaching revised part 2.
Attachment #8625703 - Attachment is obsolete: true
Attachment #8644268 - Flags: review?(wchen)
Attachment #8644269 - Flags: review?(wchen)
Comment on attachment 8644268 [details] [diff] [review]
htmlparser repo part 2: Size the buffers to worst case before each tokenization call, revised

Looks good. Patch introduces some new trailing whitespace which should be removed.
Attachment #8644268 - Flags: review?(wchen) → review+
Comment on attachment 8644269 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, revised

Trailing whitespace
Attachment #8644269 - Flags: review?(wchen) → review+
Thanks. Landed with the trailing whitespace fixed. I think I'll file a follow-up to zap all trailing whitespace from the Java code, after which I can make my editor enforce it.
https://hg.mozilla.org/mozilla-central/rev/c5c201250f2a
https://hg.mozilla.org/mozilla-central/rev/bb89c4d2d004
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Depends on: 1286911
Depends on: 1309834