Closed Bug 1478170 Opened 6 years ago Closed 6 years ago

Instantiate TokenStreamChars for UTF-8 source text

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(14 files)

1.88 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.77 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.20 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.53 KB, patch
arai
: review+
Details | Diff | Splinter Review
16.32 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.76 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.26 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.09 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.37 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.49 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.34 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.17 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.99 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.01 KB, patch
arai
: review+
Details | Diff | Splinter Review
TokenStreamCharsBase was previously finished, gotta keep moving through the hierarchy.
This isn't really related to this bug's work, but it should happen and better before I specialize these functions for UTF-8, IMO.
Attachment #8994688 - Flags: review?(arai.unmht)
I had an epiphany: if we encounter an encoding error in this function, we can just stop consuming code points.  When we return from this and continue looking at code units, the next thing looking will report an error failing to decode a code point.  So *this* function doesn't have to!  A nice mini-simplification.
Attachment #8994690 - Flags: review?(arai.unmht)
Attachment #8994691 - Flags: review?(arai.unmht)
This compiles, but roughly unsurprisingly as yet, it is untested.  🤠  The basic ideas of it are the interesting things, tho, IMO -- not mini-buglets that potentially exist in it (tho I do not believe any do).
Attachment #8994693 - Flags: review?(arai.unmht)
Attachment #8994708 - Flags: review?(arai.unmht)
Attachment #8994709 - Flags: review?(arai.unmht)
Attachment #8994710 - Flags: review?(arai.unmht)
This is pretty hack, but it is at least forward progress.
Attachment #8994719 - Flags: review?(arai.unmht)
Attachment #8994720 - Flags: review?(arai.unmht)
Attachment #8994688 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994690 [details] [diff] [review]
Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment

Review of attachment 8994690 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1293,5 @@
> +     * terminates it).
> +     *
> +     * If an encoding error is encountered -- possible only for UTF-8 because
> +     * JavaScript's conception of UTF-16 encompasses any sequence of 16-bit
> +     * code units -- code points up to it will be consumed.  (The caller is

can you clarify whether the part of erroneous code units are consumed or not?
("up to" seems to be ambiguous)
Attachment #8994690 - Flags: review?(arai.unmht) → review+
Attachment #8994691 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994692 [details] [diff] [review]
Specialize TokenStreamCharsBase::fillCharBufferWithTemplateStringContents for UTF-8

Review of attachment 8994692 [details] [diff] [review]:
-----------------------------------------------------------------

the method name in the commit message doesn't match to the change.
Attachment #8994692 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994693 [details] [diff] [review]
Implement getNonAsciiCodePointDontNormalize for UTF-8

Review of attachment 8994693 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +661,5 @@
> +    char leadByteStr[5];
> +    byteToTerminatedString(leadValue, leadByteStr);
> +
> +    const char expectedStr[] = { toHexChar(required - 1), '\0' };
> +    const char actualStr[] = { toHexChar(remaining - 1), '\0' };

can you add some comment that those strings are expected to be decimal numbers,
and we reuse toHexChar because it's same for 0-4 range.

@@ +697,5 @@
> +    }
> +
> +    MOZ_ASSERT(codePointCharsArray + 2 <= codePointStr);
> +    *--codePointStr = 'x';
> +    *--codePointStr = '0';

can you add some helper function with descriptive name for above code?
or maybe add comment what it's doing.
the resulting content of codePointCharsArray and codePointStr are complicated
(like, the string starts from the middle of codePointCharsArray),
and a comment or descriptive function name will help figuring out what it's doing.

::: js/src/frontend/TokenStream.h
@@ +1633,5 @@
> +
> +        mozilla::Utf8Unit operator*() const {
> +            // This function is expected to be called only during an expression
> +            // of the form |*iter++| that *post-increments* this iterator, then
> +            // dereferences the old iterator.  Because operator++(int) will

can you clarify they share single SourceUnits instance?

@@ +1636,5 @@
> +            // of the form |*iter++| that *post-increments* this iterator, then
> +            // dereferences the old iterator.  Because operator++(int) will
> +            // consume the next code unit in |sourceUnits_|, dereferencing to
> +            // compute the just-consumed code unit must look at the *previous*
> +            // code unit here.

looks like a pitfall.
can we add some debug-only fields to detect mis-use?
like, storing the expected `current()` pointer or something when copying, and validating it when dereferencing?

also, it would be better preventing any other operations on the copied iterator, which modifies the shared SourceUnits.

::: js/src/util/Unicode.h
@@ +582,5 @@
>      return codePoint >= TrailSurrogateMin && codePoint <= TrailSurrogateMax;
>  }
>  
> +inline bool
> +IsSurrogate(uint32_t codePoint)

might be better adding some comment that explains why it receives `codePoint`.
like, this function is used for detecting invalid text, instead of used while parsing UTF16 surrogate pair or some similar usage.
Attachment #8994693 - Flags: review?(arai.unmht) → review+
Attachment #8994707 - Flags: review?(arai.unmht) → review+
Attachment #8994708 - Flags: review?(arai.unmht) → review+
Attachment #8994709 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994710 [details] [diff] [review]
Specialize SourceUnits::findWindowStart for UTF-8

Review of attachment 8994710 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +943,5 @@
> +
> +    while (true) {
> +        MOZ_ASSERT(earliestPossibleStart <= p);
> +        MOZ_ASSERT(HalfWindowSize() <= WindowRadius);
> +        if (p <= earliestPossibleStart || HalfWindowSize() >= WindowRadius)

according to the above assertion, `p < earliestPossibleStart` shouldn't happen.

@@ +955,5 @@
> +        // If there aren't three code units available, some comparison here
> +        // will fail before we'd underflow.
> +        if (MOZ_UNLIKELY((prev == 0xA8 || prev == 0xA9) &&
> +                         p[-2].toUint8() == 0x80 &&
> +                         p[-3].toUint8() == 0xE2))

can you add some comment that says they're LINE_SEPARATOR and PARA_SEPARATOR ?
Attachment #8994710 - Flags: review?(arai.unmht) → review+
Attachment #8994711 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994713 [details] [diff] [review]
Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8

Review of attachment 8994713 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +901,5 @@
> +        this->notShortestForm(badCodePoint, unitsObserved);
> +    };
> +
> +    // This consumes the full, valid code point or ungets |lead| and calls the
> +    // appropriate error functor on failure.

where do we unget |lead| ?
Attachment #8994713 - Flags: review?(arai.unmht) → review+
Attachment #8994718 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994719 [details] [diff] [review]
Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8

Review of attachment 8994719 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1450,5 @@
>      CharBuffer& getCharBuffer() { return charBuffer; }
>  };
>  
> +inline mozilla::Span<const char16_t>
> +ToCharSpan(mozilla::Span<const char16_t> codeUnits)

is this going to be used later?
if we're not going to use, it would be better removing this definition.
Attachment #8994719 - Flags: review?(arai.unmht) → review+
Comment on attachment 8994720 [details] [diff] [review]
Instantiate TokenStreamChars for UTF-8

Review of attachment 8994720 [details] [diff] [review]:
-----------------------------------------------------------------

\o/
Attachment #8994720 - Flags: review?(arai.unmht) → review+
(In reply to Tooru Fujisawa [:arai] from comment #17)
> can you add some helper function with descriptive name for above code?
> or maybe add comment what it's doing.

Added a comment.  Separate helper function right now is tricky, because until this code is instantiated, a helper function would actually be unused -- and then you court compiler warnings.  (Granted at the end of this bug that *does* happen, but it's messy in advance of that, and I have the actually-instantiate patch at the end of the patch series for a reason.)

> can you clarify they share single SourceUnits instance?

Better yet, there's no point in even having the pointer in |SourceUnitsEnd| -- so I just made operator- return |sourceUnits_.remaining()| stored in |SourceUnitsIterator|.

> can we add some debug-only fields to detect mis-use?
> like, storing the expected `current()` pointer or something when copying,
> and validating it when dereferencing?
> 
> also, it would be better preventing any other operations on the copied
> iterator, which modifies the shared SourceUnits.

I added a |Maybe<const mozilla::Utf8Unit*>| field to the iterator, then assert it's |isNothing()| in all functions except operator* and operator++, and in those two assert the expected value in the first and then |reset()| it, in the second |emplace()| |current()| (which incidentally will assert |isNothing()| and verify it's clear).  I think that covers all the bases you mention.

> might be better adding some comment that explains why it receives
> `codePoint`.
> like, this function is used for detecting invalid text, instead of used
> while parsing UTF16 surrogate pair or some similar usage.

Added a comment, but note that most other functions in this file take |uint32_t| and not |char16_t| already, so it's not really breaking new ground.

(In reply to Tooru Fujisawa [:arai] from comment #18)
> > +        MOZ_ASSERT(earliestPossibleStart <= p);
> > +        MOZ_ASSERT(HalfWindowSize() <= WindowRadius);
> > +        if (p <= earliestPossibleStart || HalfWindowSize() >= WindowRadius)
> 
> according to the above assertion, `p < earliestPossibleStart` shouldn't
> happen.

True.  I wrote it that way for extra paranoia at runtime, of the sort that meant bug 1476409 was not actually dangerous -- we had a similar |>=| check there where, assuming proper incrementing, |==| would have been sufficient.

(In reply to Tooru Fujisawa [:arai] from comment #19)
> > +    // This consumes the full, valid code point or ungets |lead| and calls the
> > +    // appropriate error functor on failure.
> 
> where do we unget |lead| ?

See all the various |*aIter -= N| in |DecodeOneUtf8CodePointInline| in mfbt/Utf8.h, that end up invoking |SourceUnitsIterator::operator-=|, that invokes |sourceUnits.unskipCodeUnits(N)|.
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/66952e88e68c
Make SourceUnits::findWindow{Start,End} const because they can/should be.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ccbc10cd6a1
Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment, in light of the fact that the UTF-8 version can just stop consuming code points when an encoding error is encountered, leaving error-reporting to whatever actually tries to consume the first code point of the "next" line.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/292bd4af4056
Implement SourceUnits::peekCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f427db8f6f8
Specialize TokenStreamCharsBase::fillCharBufferFromSourceNormalizingAsciiLineBreaks for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/45b34e5af77e
Implement getNonAsciiCodePointDontNormalize for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0847a6bfa7a
Specialize SourceUnits::consumeKnownCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/46c400e4a018
Implement consumeRestOfSingleLine for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/f00dc821eb42
Specialize SourceUnits::findWindowEnd for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/26f77e15d7d2
Specialize SourceUnits::findWindowStart for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/cafc89ca6a87
Stop specializing TokenStreamCharsBase::addLineOfContext now that the CharT-specialized things it relies on are fully specialized.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/107211d728e6
Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/38216fdba3dd
Move TokenStreamChars::getCodePoint to a generalized TokenStreamSpecific::getCodePoint that works for any CharT.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/dad6f1d06271
Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ad2890523ff
Instantiate TokenStreamChars for UTF-8.  r=arai
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: