Closed Bug 1434429 Opened 6 years ago Closed 6 years ago

More rounds of TokenStream adjustments for UTF-8 tokenizing/parsing

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(7 files)

11.38 KB, patch
till
: review+
Details | Diff | Splinter Review
17.53 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.00 KB, patch
arai
: review+
Details | Diff | Splinter Review
8.48 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.55 KB, patch
arai
: review+
Details | Diff | Splinter Review
9.75 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.45 KB, patch
arai
: review+
Details | Diff | Splinter Review
      No description provided.
I find myself wanting TokenStream::error to actually use the current immediate offset soon, and changing how these places work is necessary to have that.  Plus, these error messages are mildly improvable.
Attachment #8946815 - Flags: review?(till)
The inheritance structure of TokenStream is coming increasingly to roughly mirror that of Parser.  Unsurprisingly, given the two need to be merged pretty much (but not now).
Attachment #8946817 - Flags: review?(arai.unmht)
There are no tests for this because there are no behavior changes.  It *happens* that right now, badchar only happens for something that's wrong at the *start* of a token, so the ungets here just revert to exactly the same position.  But, this is clearer, uses the instantaneous offset (as one would expect), and it provides a new primitive that will be useful as other places come to use better report locations as they shift away from reportError.
Attachment #8946824 - Flags: review?(arai.unmht)
Ostensible UTF-16 source text in JS *can* have mispaired surrogates and other junk in it, per ECMAScript rules.  But ECMAScript doesn't directly process UTF-8, only UTF-16 -- so the implicit process of conversion from malformed UTF-8 to this UTF-16-alike input should produce an error earlier in the compilation chain.

So, when tokenizing UTF-8 and we get something malformed, technically we're in a place that ECMAScript says nothing about.  So *not* allowing malformed code point sequences and goo is the most desirable thing to do.  That requires changing the existing multi-unit code point consumption signature to allow errors, even if no errors are reported in the case of processing UTF-16.

Note that bz says that Gecko will only hand the JS engine properly encoded data (because it does that validation step internally, as it *should*).  So none of this should be web-observable, once the browser starts invoking UTF-8-specific parsing entrypoints.
Attachment #8946830 - Flags: review?(arai.unmht)
It doesn't seem impossible to me that in the long run, we can write fully code-unit-length-generic code and never have to think about this goo anywhere in tokenizing.  But we'll see.
Attachment #8946832 - Flags: review?(arai.unmht)
Attachment #8946817 - Flags: review?(arai.unmht) → review+
Attachment #8946821 - Flags: review?(arai.unmht) → review+
Comment on attachment 8946825 [details] [diff] [review]
Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf

Review of attachment 8946825 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1325,5 @@
>  
> +    auto restoreNextRawCharAddress =
> +        MakeScopeExit([this, originalAddress]() {
> +            this->userbuf.setAddressOfNextRawChar(originalAddress);
> +        });

oh, this class is nice :)
I think we have some other places that this can simplify.
Attachment #8946825 - Flags: review?(arai.unmht) → review+
Attachment #8946824 - Flags: review?(arai.unmht) → review+
Attachment #8946830 - Flags: review?(arai.unmht) → review+
Comment on attachment 8946832 [details] [diff] [review]
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness

Review of attachment 8946832 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1304,5 @@
> +
> +    if (!tokenbuf.append(units[0]))
> +        return false;
> +
> +    return numUnits == 1 || tokenbuf.append(units[1]);

I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2 statements.
since "true" has different meaning for them.
Attachment #8946832 - Flags: review?(arai.unmht) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/024f70aeba70
Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/112eaf00632c
Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/45c9102825ab
Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/21c79fd76133
Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it).  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/95e07a79f4b2
Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point.  r=arai
Extricated several of these for landing, still a couple to go after the last review.
Keywords: leave-open
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b190c5a3dab2
Followup bustage fix (?) for gcc (and maybe other?) compiler bustage.  Worked in recent clang...  r=boogstage in a CLOSED TREE
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e0faca62e86
Comment out an assertion whose condition is disliked by certain versions of gcc for entirely mysterious reasons, to be debugged later.  r=bustage in a still-CLOSED TREE
Backout by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/63fe40a76a25
Backed out 7 changesets for build bustages on regress-618572.js and TokenStream.h on a CLOSED TREE
Comment on attachment 8946815 [details] [diff] [review]
Give better errors for unterminated/EOF'd string/template literals

Review of attachment 8946815 [details] [diff] [review]:
-----------------------------------------------------------------

r=me
Attachment #8946815 - Flags: review?(till) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/049cc7799b53
Use MakeScopeExit to reset userbuf offset in TokenStreamSpecific::putIdentInTokenbuf.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/a9cb684274da
Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/d353d4a61cb0
Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/1bf0b12b992f
Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it).  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/229b3ea6530f
Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point.  r=arai
(In reply to Tooru Fujisawa [:arai] from comment #9)
> I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2
> statements.
> since "true" has different meaning for them.

Yeah, I dithered on that before posting.  Will change when landing.
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b03e86fb95a
Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals.  r=till
https://hg.mozilla.org/integration/mozilla-inbound/rev/dcde99a8fa6c
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness.  r=arai
Blocks: 1433909
Priority: -- → P2
Depends on: 1436150
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11ead800137e
Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals.  r=till
https://hg.mozilla.org/integration/mozilla-inbound/rev/e4213d1706e9
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness.  r=arai
Did all patches for this issue land? (I'm waiting for this bug to be closed so I can proceed with bug 1433909).
Flags: needinfo?(jwalden+bmo)
Oh, whoops.  Yes, it all landed.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jwalden+bmo)
Keywords: leave-open
Resolution: --- → FIXED
(In reply to Jeff Walden [:Waldo] from comment #24)
> Oh, whoops.  Yes, it all landed.

Thanks! :-)
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: