Closed Bug 1434429 Opened 6 years ago Closed 6 years ago

More rounds of TokenStream adjustments for UTF-8 tokenizing/parsing

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla60

Tracking Flags:

Tracking

Status

firefox60

---

fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(7 files)

Give better errors for unterminated/EOF'd string/template literals 6 years ago Jeff Walden [:Waldo] 11.38 KB, patch	till : review+	Details \| Diff \| Splinter Review
Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy 6 years ago Jeff Walden [:Waldo] 17.53 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase 6 years ago Jeff Walden [:Waldo] 6.00 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it) 6 years ago Jeff Walden [:Waldo] 8.48 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf 6 years ago Jeff Walden [:Waldo] 3.55 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point 6 years ago Jeff Walden [:Waldo] 9.75 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness 6 years ago Jeff Walden [:Waldo] 4.45 KB, patch	arai : review+	Details \| Diff \| Splinter Review

Jeff Walden [:Waldo]

Assignee

Description

•

6 years ago

      No description provided.

Jeff Walden [:Waldo]

Assignee

Comment 1

•

6 years ago

Attached patch Give better errors for unterminated/EOF'd string/template literals — Details — Splinter Review

I find myself wanting TokenStream::error to actually use the current immediate offset soon, and changing how these places work is necessary to have that.  Plus, these error messages are mildly improvable.

Attachment #8946815 - Flags: review?(till)

Jeff Walden [:Waldo]

Assignee

Comment 2

•

6 years ago

Attached patch Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy — Details — Splinter Review

The inheritance structure of TokenStream is coming increasingly to roughly mirror that of Parser.  Unsurprisingly, given the two need to be merged pretty much (but not now).

Attachment #8946817 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 3

•

6 years ago

Attached patch Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase — Details — Splinter Review

Attachment #8946821 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 4

•

6 years ago

Attached patch Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it) — Details — Splinter Review

There are no tests for this because there are no behavior changes.  It *happens* that right now, badchar only happens for something that's wrong at the *start* of a token, so the ungets here just revert to exactly the same position.  But, this is clearer, uses the instantaneous offset (as one would expect), and it provides a new primitive that will be useful as other places come to use better report locations as they shift away from reportError.

Attachment #8946824 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 5

•

6 years ago

Attached patch Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf — Details — Splinter Review

Attachment #8946825 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 6

•

6 years ago

Attached patch Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point — Details — Splinter Review

Ostensible UTF-16 source text in JS *can* have mispaired surrogates and other junk in it, per ECMAScript rules.  But ECMAScript doesn't directly process UTF-8, only UTF-16 -- so the implicit process of conversion from malformed UTF-8 to this UTF-16-alike input should produce an error earlier in the compilation chain.

So, when tokenizing UTF-8 and we get something malformed, technically we're in a place that ECMAScript says nothing about.  So *not* allowing malformed code point sequences and goo is the most desirable thing to do.  That requires changing the existing multi-unit code point consumption signature to allow errors, even if no errors are reported in the case of processing UTF-16.

Note that bz says that Gecko will only hand the JS engine properly encoded data (because it does that validation step internally, as it *should*).  So none of this should be web-observable, once the browser starts invoking UTF-8-specific parsing entrypoints.

Attachment #8946830 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 7

•

6 years ago

Attached patch Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness — Details — Splinter Review

It doesn't seem impossible to me that in the long run, we can write fully code-unit-length-generic code and never have to think about this goo anywhere in tokenizing.  But we'll see.

Attachment #8946832 - Flags: review?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8946817 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8946821 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 8

•

6 years ago

Comment on attachment 8946825 [details] [diff] [review]
Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf

Review of attachment 8946825 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1325,5 @@
>  
> +    auto restoreNextRawCharAddress =
> +        MakeScopeExit([this, originalAddress]() {
> +            this->userbuf.setAddressOfNextRawChar(originalAddress);
> +        });

oh, this class is nice :)
I think we have some other places that this can simplify.

Attachment #8946825 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8946824 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8946830 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 9

•

6 years ago

Comment on attachment 8946832 [details] [diff] [review]
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness

Review of attachment 8946832 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1304,5 @@
> +
> +    if (!tokenbuf.append(units[0]))
> +        return false;
> +
> +    return numUnits == 1 || tokenbuf.append(units[1]);

I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2 statements.
since "true" has different meaning for them.

Attachment #8946832 - Flags: review?(arai.unmht) → review+

Pulsebot

Comment 10

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/024f70aeba70
Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/112eaf00632c
Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/45c9102825ab
Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/21c79fd76133
Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it).  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/95e07a79f4b2
Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point.  r=arai

Jeff Walden [:Waldo]

Assignee

Comment 11

•

6 years ago

Extricated several of these for landing, still a couple to go after the last review.

Keywords: leave-open

Pulsebot

Comment 12

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b190c5a3dab2
Followup bustage fix (?) for gcc (and maybe other?) compiler bustage.  Worked in recent clang...  r=boogstage in a CLOSED TREE

Pulsebot

Comment 13

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e0faca62e86
Comment out an assertion whose condition is disliked by certain versions of gcc for entirely mysterious reasons, to be debugged later.  r=bustage in a still-CLOSED TREE

Pulsebot

Comment 14

•

6 years ago

Backout by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/63fe40a76a25
Backed out 7 changesets for build bustages on regress-618572.js and TokenStream.h on a CLOSED TREE

Till Schneidereit [:till]

Comment 15

•

6 years ago

Comment on attachment 8946815 [details] [diff] [review]
Give better errors for unterminated/EOF'd string/template literals

Review of attachment 8946815 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

Attachment #8946815 - Flags: review?(till) → review+

Pulsebot

Comment 16

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/049cc7799b53
Use MakeScopeExit to reset userbuf offset in TokenStreamSpecific::putIdentInTokenbuf.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/a9cb684274da
Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/d353d4a61cb0
Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/1bf0b12b992f
Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it).  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/229b3ea6530f
Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point.  r=arai

Jeff Walden [:Waldo]

Assignee

Comment 17

•

6 years ago

(In reply to Tooru Fujisawa [:arai] from comment #9)
> I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2
> statements.
> since "true" has different meaning for them.

Yeah, I dithered on that before posting.  Will change when landing.

Arthur Iakab [arthur_iakab]

Comment 18

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/049cc7799b53
https://hg.mozilla.org/mozilla-central/rev/a9cb684274da
https://hg.mozilla.org/mozilla-central/rev/d353d4a61cb0
https://hg.mozilla.org/mozilla-central/rev/1bf0b12b992f
https://hg.mozilla.org/mozilla-central/rev/229b3ea6530f

Pulsebot

Comment 19

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b03e86fb95a
Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals.  r=till
https://hg.mozilla.org/integration/mozilla-inbound/rev/dcde99a8fa6c
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness.  r=arai

Pulsebot

Comment 20

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/835b96c27ded
Another followup test-fix.  r=red

André Bargull [:anba]

Updated

•

6 years ago

Blocks: 1433909

Nicolas B. Pierron [:nbp]

Updated

•

6 years ago

status-firefox60: --- → fix-optional

Priority: -- → P2

Nicolas B. Pierron [:nbp]

Updated

•

6 years ago

status-firefox60: fix-optional → affected

Priority: P2 → P1

Jeff Walden [:Waldo]

Assignee

Updated

•

6 years ago

Depends on: 1436150

Pulsebot

Comment 21

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11ead800137e
Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals.  r=till
https://hg.mozilla.org/integration/mozilla-inbound/rev/e4213d1706e9
Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness.  r=arai

Noemi Erli[:noemi_erli]

Comment 22

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/11ead800137e
https://hg.mozilla.org/mozilla-central/rev/e4213d1706e9

André Bargull [:anba]

Comment 23

•

6 years ago

Did all patches for this issue land? (I'm waiting for this bug to be closed so I can proceed with bug 1433909).

Flags: needinfo?(jwalden+bmo)

Jeff Walden [:Waldo]

Assignee

Comment 24

•

6 years ago

Oh, whoops.  Yes, it all landed.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Flags: needinfo?(jwalden+bmo)

Keywords: leave-open

Resolution: --- → FIXED

André Bargull [:anba]

Comment 25

•

6 years ago

(In reply to Jeff Walden [:Waldo] from comment #24)
> Oh, whoops.  Yes, it all landed.

Thanks! :-)

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox60: affected → fixed

Target Milestone: --- → mozilla60

You need to log in before you can comment on or make changes to this bug.