Closed Bug 1476866 Opened Last year Closed Last year

More wrangling of tokenizing code into place for concise UTF-8 specializations

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(15 files)

2.12 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.12 KB, patch
arai
: review+
Details | Diff | Splinter Review
13.71 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.38 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.38 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.78 KB, patch
arai
: review+
Details | Diff | Splinter Review
13.02 KB, patch
arai
: review+
Details | Diff | Splinter Review
8.56 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.92 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.57 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.59 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.97 KB, patch
arai
: review+
Details | Diff | Splinter Review
10.33 KB, patch
arai
: review+
Details | Diff | Splinter Review
7.31 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.21 KB, patch
arai
: review+
Details | Diff | Splinter Review
Another grab-bag bug for the latest round of fixes.

None of these yet require any UTF-8 decoding capabilities, but a good number of these patches have direct complements in my tree that define operations for UTF-8.  Iᴛ's ʜᴀᴘᴘᴇɴɪɴɢ!  🎆
It's mildly better for patching, IMO, to separate out this class-addition from the patch that actually starts adding anything interesting to it.
Attachment #8993215 - Flags: review?(arai.unmht)
The UTF-8 version will have to be fallible, so its implementation actually *will* live entirely in TokenStreamChars, eventually.
Attachment #8993216 - Flags: review?(arai.unmht)
The next patch adds another user of this, FYI.
Attachment #8993217 - Flags: review?(arai.unmht)
Okay, *technically* this encodes some UTF-8 awareness in it, but only the barest amount for two code points.  :-)
Attachment #8993218 - Flags: review?(arai.unmht)
Minor readability improvement so that a comparison of a raw unit to '\r' and '\n' later on can more clearly not be missing a Unicode separator-check (and not have a useless '\r' check because the unit was normalized).
Attachment #8993220 - Flags: review?(arai.unmht)
I considered making the character a template parameter to constant-ize it, but 1) probably overkill, 2) that might have required template-qualified member lookup which was absolutely not gonna happen.
Attachment #8993221 - Flags: review?(arai.unmht)
Somewhat unbelievably, this function is actually going to have *two* callers very soon.  o_O  That and a rename, 'cause this name obviously is for exactly one use.
Attachment #8993222 - Flags: review?(arai.unmht)
This function *also* will have a fresh caller quite soon, for producing an error when there's malformed UTF-8 in one of several possible ways.  And because that case is rather weird, it seems much better to funnel through another set of calls to something that looks like computeErrorMetadata, than to try to reuse the existing ones.
Attachment #8993224 - Flags: review?(arai.unmht)
While technically this could include invalid UTF-16 in it if we wanted right now, it seems sensible to do to this, the same thing *has* to be done to UTF-8 -- make it only include valid Unicode text.
Attachment #8993225 - Flags: review?(arai.unmht)
Oddly, including only valid UTF-16 makes this *trickier* to implement than for UTF-8, where we can just find the earliest-start and then scan forward to the first non-trailing code unit.
Attachment #8993226 - Flags: review?(arai.unmht)
The UTF-8 hack in here is rather elegant, IMO.  (I have patches in my tree that remove it, but they depend on a UTF-8 decoding function existing.)
Attachment #8993227 - Flags: review?(arai.unmht)
It was almost a pleasant surprise to recognize the exact same algorithm works for both here.
Attachment #8993229 - Flags: review?(arai.unmht)
Attachment #8993213 - Flags: review?(arai.unmht) → review+
Comment on attachment 8993215 [details] [diff] [review]
Add a SpecializedTokenStreamCharsBase class to the hierarchy as a clear collection point for functions entirely specific to UTF-8 or UTF-16 tokenizing only

Review of attachment 8993215 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +87,5 @@
>   *
> + * == SpecializedTokenStreamCharsBase<CharT> → TokenStreamCharsBase<CharT> ==
> + *
> + * Certain tokenizing functionality is specific to a single character type.
> + * For example, the char16_t encoding recognizes no coding errors, because lone

"char16_t encoding" sounds a bit strange.
can't we just use "UTF-16" ?
Attachment #8993215 - Flags: review?(arai.unmht) → review+
Comment on attachment 8993216 [details] [diff] [review]
Move consumeRestOfSingleLineComment into TokenStreamChars (and the infallible UTF-16 implementation into SpecializedTokenStreamCharsBase) so that it can be easily specialized for UTF-8

Review of attachment 8993216 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1226,5 @@
>  
> +    /** Consume a known, non-EOF code unit. */
> +    inline void consumeKnownCodeUnit(int32_t unit);
> +
> +    template<typename T> inline void consumeKnownCodeUnit(T) = delete;

can you add a comment why we have this deleted definition?
iiuc it's to avoid implicit type conversion, but not immediately clear,
and also need a reason why we do this.

@@ +1329,5 @@
> +    void infallibleConsumeRestOfSingleLineComment();
> +
> +  protected:
> +    // These functions are present in all SpecializedTokenStreamCharsBase
> +    // specializations, regardless of the identity of CharT.

I don't get this comment.
Attachment #8993216 - Flags: review?(arai.unmht) → review+
Comment on attachment 8993217 [details] [diff] [review]
Add a getNonAsciiCodePointDontNormalize for use in situations that demand such

Review of attachment 8993217 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1330,5 @@
>                  break;
>          } else {
> +            // |restoreNextRawCharAddress| undoes all gets, and this function
> +            // doesn't update line/column info.
> +            char32_t cp;

are you going to replace other uint32_t with char32_t as well?
Attachment #8993217 - Flags: review?(arai.unmht) → review+
Attachment #8993218 - Flags: review?(arai.unmht) → review+
Attachment #8993220 - Flags: review?(arai.unmht) → review+
Attachment #8993221 - Flags: review?(arai.unmht) → review+
Attachment #8993222 - Flags: review?(arai.unmht) → review+
(In reply to Tooru Fujisawa [:arai] from comment #17)
> > +    template<typename T> inline void consumeKnownCodeUnit(T) = delete;
> 
> can you add a comment why we have this deleted definition?
> iiuc it's to avoid implicit type conversion, but not immediately clear,
> and also need a reason why we do this.

Especially when we only compile UTF-16 code, without this, char16_t could be inadvertently passed to consumeKnownCodeUnit and you'd have a small ticking time bomb for UTF-8 compilation.  This mostly forces the same rigidity on both UTF-8 and UTF-16 instantiations of these templates.  I gestured in this direction in a modified version of this comment in the final patch.

    // Forbid accidental calls to consumeKnownCodeUnit *not* with the single
    // unit-or-EOF type.  CharT should use SourceUnits::consumeKnownCodeUnit;
    // CodeUnitValue() results should go through toCharT(), or better yet just
    // use the original CharT.

> > +    // These functions are present in all SpecializedTokenStreamCharsBase
> > +    // specializations, regardless of the identity of CharT.
> 
> I don't get this comment.

So the idea is *some* functions in STSCB are only relevant for specific values of CharT.  Like the infallibleConsumeRestOfSingleLineComment function is only present for char16_t and cannot be defined at all for UTF-8, so it lives here -- and only UTF-16-specific code, i.e. subclass code specializations solely for UTF-16 (or of course this specialization itself), can call it.  Those functions can live here.

But other operations are sensible for any CharT.  There will be separate definitions of them in the different STSCB, but their interfaces will be identical (ignoring CharT differences).  Those functions will get listed below here, so you can see what parts of STSCB are common interface.

I changed this comment a bit as well to make this clearer:

    // These APIs are in both SpecializedTokenStreamCharsBase specializations
    // and so are usable in subclasses no matter what CharT is.

and the specific-CharT comments are now:

    // These APIs are only usable by UTF-16-specific code.
    // These APIs are only usable by UTF-8-specific code.

(In reply to Tooru Fujisawa [:arai] from comment #18)
> > +            char32_t cp;
> 
> are you going to replace other uint32_t with char32_t as well?

I don't plan to in this bug, and I don't have plans to in any followup bugs/bugs-to-be-filed that I can point directly at.  But yes, we should be making this conversion, when the outparam *only* ever contains a true code point.  (Note that functions that can produce EOF probably should not return a char32_t in this manner, even if EOF as a negative number, when stored in char32_t, will have value distinct from the [0, 0x10FFFF] range.)
Attachment #8993223 - Flags: review?(arai.unmht) → review+
Attachment #8993224 - Flags: review?(arai.unmht) → review+
Attachment #8993225 - Flags: review?(arai.unmht) → review+
Comment on attachment 8993226 [details] [diff] [review]
Move error-message window-start computation into SourceUnits::findWindowStart so that specialized versions can be provided for UTF-8 and UTF-16

Review of attachment 8993226 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +631,5 @@
> +    // This is JS's understanding of UTF-16 that allows lone surrogates, so
> +    // we have to exclude lone surrogates from [windowStart, offset) ourselves.
> +
> +    const char16_t* const earliestPossibleStart =
> +        codeUnitPtrAt(std::min<size_t>(startOfLineContainingOffset, startOffset_));

we shouldn't use `startOfLineContainingOffset` if `startOfLineContainingOffset < startOffset_`, otherwise we'll read out of range before the beginning of `base_`, inside the loop below.
so `earliestPossibleStart` should simply be `base_`.
Attachment #8993226 - Flags: review?(arai.unmht) → review+
Comment on attachment 8993227 [details] [diff] [review]
Implement TokenStreamCharsBase::addLineOfContext that does all the work of creating a syntax-error window and works for either CharT

Review of attachment 8993227 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +753,5 @@
> +
> +    // Don't add a useless "line" of context when the window ends up empty
> +    // because of an invalid encoding at the start of a line.
> +    if (windowLength == 0)
> +        return true;

we should initialize lineLength and tokenOffset if returning true.
Attachment #8993227 - Flags: review?(arai.unmht) → review+
Attachment #8993228 - Flags: review?(arai.unmht) → review+
Attachment #8993229 - Flags: review?(arai.unmht) → review+
Keywords: leave-open
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b604df36960
Fix the TokenStreamChars::getNonAsciiCodePoint doc comment to be clearer about |lead| being non-ASCII but still potentially (for UTF-16) a full code point.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/9ed1d4084742
Add a SpecializedTokenStreamCharsBase class to the hierarchy as a clear collection point for functions entirely specific to UTF-8 or UTF-16 tokenizing only.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/44b64b5a44fc
Move consumeRestOfSingleLineComment into TokenStreamChars (and the infallible UTF-16 implementation into SpecializedTokenStreamCharsBase) so that it can be easily specialized for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a4e6ae59b59
Add a getNonAsciiCodePointDontNormalize for use in situations that demand such.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/80b3a14e84c2
Remove ungetLineTerminator, used only to unget Unicode separators, and replace it with a SourceUnits::ungetLineOrParagraphSeparator.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ed3f8f103c0
Reduce code indentation in the loop to accumulate regular expression source text by handling the non-ASCII case in an early block-with-continue.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/4bd4c0e74bc6
Make matchCodeUnit only accept ASCII, and split matchLineTerminator (for '\r' and '\n') out of it.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d07cfa666bf
Move fillCharBufferWithTemplateStringContents into TokenStreamCharsBase so that a UTF-8 specialization can eventually be defined for it.  r=arai
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96e6f8645b7c
Implement a bare-bones TokenStreamChars specialization for UTF-8, into which subsequent UTF-8-centric functions can be added.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ac4c8ad6b9a
Move TokenStreamSpecific::computeLineOfContext to GeneralTokenStreamChars::internalComputeLineOfContext, and beef up its doc comment a little.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff1852065b5c
Rename SourceUnits::findEOLMax to SourceUnits::findWindowEnd, and make it include only valid UTF-16.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/30c6932eea32
Move error-message window-start computation into SourceUnits::findWindowStart so that specialized versions can be provided for UTF-8 and UTF-16.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/670d911efa81
Implement TokenStreamCharsBase::addLineOfContext that does all the work of creating a syntax-error window and works for either CharT.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/7da01c72860d
Replace SourceUnits<CharT>::isRawEOLChar (where CharT had rather uncertain meaning) with standalone IsLineTerminator of clearer meaning.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/735ee4ba2523
Move TokenStreamChars::getFullAsciiCodePoint to GeneralTokenStreamChars::getFullAsciiCodePoint, because its algorithm is identical for UTF-8 and UTF-16.  r=arai
(In reply to Tooru Fujisawa [:arai] from comment #20)
> we shouldn't use `startOfLineContainingOffset` if
> `startOfLineContainingOffset < startOffset_`, otherwise we'll read out of
> range before the beginning of `base_`, inside the loop below.
> so `earliestPossibleStart` should simply be `base_`.

Adjusted as discussed on IRC by getting rid of the line-start parameter and always searching backward through a radius of units, capping at source-start/end in the separate patches.

(In reply to Tooru Fujisawa [:arai] from comment #21)
> we should initialize lineLength and tokenOffset if returning true.

These fields are only valid when lineOfContext is non-null:

    // If |lineOfContext| is non-null, its length.
    size_t lineLength;

    // If |lineOfContext| is non-null, the offset within it of the token that
    // triggered the error.
    size_t tokenOffset;

Added an assert of nullness before returning true to clarify.

I also ran into one other issue pushing: if you look closely at the old code, the existing line-of-context being returned, despite comments multiple places, *did* have the LineTerminator included in it.  My changes happened to fix that, while adhering to what the old comments *said*.  I added a jsapi-test that checks the last character in JSErrorReport::linebuf() for expected, non-LineTerminator values in various cases, to address this.  (I also had to change one devtools test to deal as well, because it was expecting the LineTerminator.)

More UTF-8 specialization of these various functions coming up -- but in a separate bug.  This one's had all non-UTF-8 work, just prep, and this is a good place to cut it off.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.