Closed Bug 1478170 Opened 6 years ago Closed 6 years ago

Instantiate TokenStreamChars for UTF-8 source text

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla63

Tracking Flags:

Tracking

Status

firefox63

---

fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(14 files)

Make SourceUnits::findWindow{Start,End} const because they can/should be 6 years ago Jeff Walden [:Waldo] 1.88 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment 6 years ago Jeff Walden [:Waldo] 5.77 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement SourceUnits::peekCodePoint for UTF-8 6 years ago Jeff Walden [:Waldo] 2.20 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Specialize TokenStreamCharsBase::fillCharBufferWithTemplateStringContents for UTF-8 6 years ago Jeff Walden [:Waldo] 2.53 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement getNonAsciiCodePointDontNormalize for UTF-8 6 years ago Jeff Walden [:Waldo] 16.32 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Specialize SourceUnits::consumeKnownCodePoint for UTF-8 6 years ago Jeff Walden [:Waldo] 1.76 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement consumeRestOfSingleLine for UTF-8 6 years ago Jeff Walden [:Waldo] 2.26 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Specialize SourceUnits::findWindowEnd for UTF-8 6 years ago Jeff Walden [:Waldo] 2.09 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Specialize SourceUnits::findWindowStart for UTF-8 6 years ago Jeff Walden [:Waldo] 4.37 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Stop specializing TokenStreamCharsBase::addLineOfContext now that the CharT-specialized things it relies on are fully specialized 6 years ago Jeff Walden [:Waldo] 1.49 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8 6 years ago Jeff Walden [:Waldo] 4.34 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Move TokenStreamChars::getCodePoint to a generalized TokenStreamSpecific::getCodePoint that works for any CharT 6 years ago Jeff Walden [:Waldo] 4.17 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8 6 years ago Jeff Walden [:Waldo] 3.99 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Instantiate TokenStreamChars for UTF-8 6 years ago Jeff Walden [:Waldo] 1.01 KB, patch	arai : review+	Details \| Diff \| Splinter Review

Jeff Walden [:Waldo]

Assignee

Description

•

6 years ago

TokenStreamCharsBase was previously finished, gotta keep moving through the hierarchy.

Jeff Walden [:Waldo]

Assignee

Comment 1

•

6 years ago

Attached patch Make SourceUnits::findWindow{Start,End} const because they can/should be — Details — Splinter Review

This isn't really related to this bug's work, but it should happen and better before I specialize these functions for UTF-8, IMO.

Attachment #8994688 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 2

•

6 years ago

Attached patch Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment — Details — Splinter Review

I had an epiphany: if we encounter an encoding error in this function, we can just stop consuming code points.  When we return from this and continue looking at code units, the next thing looking will report an error failing to decode a code point.  So *this* function doesn't have to!  A nice mini-simplification.

Attachment #8994690 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 3

•

6 years ago

Attached patch Implement SourceUnits::peekCodePoint for UTF-8 — Details — Splinter Review

Attachment #8994691 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 4

•

6 years ago

Attached patch Specialize TokenStreamCharsBase::fillCharBufferWithTemplateStringContents for UTF-8 — Details — Splinter Review

Attachment #8994692 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 5

•

6 years ago

Attached patch Implement getNonAsciiCodePointDontNormalize for UTF-8 — Details — Splinter Review

This compiles, but roughly unsurprisingly as yet, it is untested.  🤠  The basic ideas of it are the interesting things, tho, IMO -- not mini-buglets that potentially exist in it (tho I do not believe any do).

Attachment #8994693 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 6

•

6 years ago

Attached patch Specialize SourceUnits::consumeKnownCodePoint for UTF-8 — Details — Splinter Review

Attachment #8994707 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 7

•

6 years ago

Attached patch Implement consumeRestOfSingleLine for UTF-8 — Details — Splinter Review

Attachment #8994708 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 8

•

6 years ago

Attached patch Specialize SourceUnits::findWindowEnd for UTF-8 — Details — Splinter Review

Attachment #8994709 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 9

•

6 years ago

Attached patch Specialize SourceUnits::findWindowStart for UTF-8 — Details — Splinter Review

Attachment #8994710 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 10

•

6 years ago

Attached patch Stop specializing TokenStreamCharsBase::addLineOfContext now that the CharT-specialized things it relies on are fully specialized — Details — Splinter Review

Attachment #8994711 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 11

•

6 years ago

Attached patch Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8 — Details — Splinter Review

Attachment #8994713 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 12

•

6 years ago

Attached patch Move TokenStreamChars::getCodePoint to a generalized TokenStreamSpecific::getCodePoint that works for any CharT — Details — Splinter Review

Attachment #8994718 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 13

•

6 years ago

Attached patch Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8 — Details — Splinter Review

This is pretty hack, but it is at least forward progress.

Attachment #8994719 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 14

•

6 years ago

Attached patch Instantiate TokenStreamChars for UTF-8 — Details — Splinter Review

Attachment #8994720 - Flags: review?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994688 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 15

•

6 years ago

Comment on attachment 8994690 [details] [diff] [review]
Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment

Review of attachment 8994690 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1293,5 @@
> +     * terminates it).
> +     *
> +     * If an encoding error is encountered -- possible only for UTF-8 because
> +     * JavaScript's conception of UTF-16 encompasses any sequence of 16-bit
> +     * code units -- code points up to it will be consumed.  (The caller is

can you clarify whether the part of erroneous code units are consumed or not?
("up to" seems to be ambiguous)

Attachment #8994690 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994691 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 16

•

6 years ago

Comment on attachment 8994692 [details] [diff] [review]
Specialize TokenStreamCharsBase::fillCharBufferWithTemplateStringContents for UTF-8

Review of attachment 8994692 [details] [diff] [review]:
-----------------------------------------------------------------

the method name in the commit message doesn't match to the change.

Attachment #8994692 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 17

•

6 years ago

Comment on attachment 8994693 [details] [diff] [review]
Implement getNonAsciiCodePointDontNormalize for UTF-8

Review of attachment 8994693 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +661,5 @@
> +    char leadByteStr[5];
> +    byteToTerminatedString(leadValue, leadByteStr);
> +
> +    const char expectedStr[] = { toHexChar(required - 1), '\0' };
> +    const char actualStr[] = { toHexChar(remaining - 1), '\0' };

can you add some comment that those strings are expected to be decimal numbers,
and we reuse toHexChar because it's same for 0-4 range.

@@ +697,5 @@
> +    }
> +
> +    MOZ_ASSERT(codePointCharsArray + 2 <= codePointStr);
> +    *--codePointStr = 'x';
> +    *--codePointStr = '0';

can you add some helper function with descriptive name for above code?
or maybe add comment what it's doing.
the resulting content of codePointCharsArray and codePointStr are complicated
(like, the string starts from the middle of codePointCharsArray),
and a comment or descriptive function name will help figuring out what it's doing.

::: js/src/frontend/TokenStream.h
@@ +1633,5 @@
> +
> +        mozilla::Utf8Unit operator*() const {
> +            // This function is expected to be called only during an expression
> +            // of the form |*iter++| that *post-increments* this iterator, then
> +            // dereferences the old iterator.  Because operator++(int) will

can you clarify they share single SourceUnits instance?

@@ +1636,5 @@
> +            // of the form |*iter++| that *post-increments* this iterator, then
> +            // dereferences the old iterator.  Because operator++(int) will
> +            // consume the next code unit in |sourceUnits_|, dereferencing to
> +            // compute the just-consumed code unit must look at the *previous*
> +            // code unit here.

looks like a pitfall.
can we add some debug-only fields to detect mis-use?
like, storing the expected `current()` pointer or something when copying, and validating it when dereferencing?

also, it would be better preventing any other operations on the copied iterator, which modifies the shared SourceUnits.

::: js/src/util/Unicode.h
@@ +582,5 @@
>      return codePoint >= TrailSurrogateMin && codePoint <= TrailSurrogateMax;
>  }
>  
> +inline bool
> +IsSurrogate(uint32_t codePoint)

might be better adding some comment that explains why it receives `codePoint`.
like, this function is used for detecting invalid text, instead of used while parsing UTF16 surrogate pair or some similar usage.

Attachment #8994693 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994707 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994708 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994709 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 18

•

6 years ago

Comment on attachment 8994710 [details] [diff] [review]
Specialize SourceUnits::findWindowStart for UTF-8

Review of attachment 8994710 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +943,5 @@
> +
> +    while (true) {
> +        MOZ_ASSERT(earliestPossibleStart <= p);
> +        MOZ_ASSERT(HalfWindowSize() <= WindowRadius);
> +        if (p <= earliestPossibleStart || HalfWindowSize() >= WindowRadius)

according to the above assertion, `p < earliestPossibleStart` shouldn't happen.

@@ +955,5 @@
> +        // If there aren't three code units available, some comparison here
> +        // will fail before we'd underflow.
> +        if (MOZ_UNLIKELY((prev == 0xA8 || prev == 0xA9) &&
> +                         p[-2].toUint8() == 0x80 &&
> +                         p[-3].toUint8() == 0xE2))

can you add some comment that says they're LINE_SEPARATOR and PARA_SEPARATOR ?

Attachment #8994710 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994711 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 19

•

6 years ago

Comment on attachment 8994713 [details] [diff] [review]
Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8

Review of attachment 8994713 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +901,5 @@
> +        this->notShortestForm(badCodePoint, unitsObserved);
> +    };
> +
> +    // This consumes the full, valid code point or ungets |lead| and calls the
> +    // appropriate error functor on failure.

where do we unget |lead| ?

Attachment #8994713 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8994718 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 20

•

6 years ago

Comment on attachment 8994719 [details] [diff] [review]
Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8

Review of attachment 8994719 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1450,5 @@
>      CharBuffer& getCharBuffer() { return charBuffer; }
>  };
>  
> +inline mozilla::Span<const char16_t>
> +ToCharSpan(mozilla::Span<const char16_t> codeUnits)

is this going to be used later?
if we're not going to use, it would be better removing this definition.

Attachment #8994719 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 21

•

6 years ago

Comment on attachment 8994720 [details] [diff] [review]
Instantiate TokenStreamChars for UTF-8

Review of attachment 8994720 [details] [diff] [review]:
-----------------------------------------------------------------

\o/

Attachment #8994720 - Flags: review?(arai.unmht) → review+

Jeff Walden [:Waldo]

Assignee

Comment 22

•

6 years ago

(In reply to Tooru Fujisawa [:arai] from comment #17)
> can you add some helper function with descriptive name for above code?
> or maybe add comment what it's doing.

Added a comment.  Separate helper function right now is tricky, because until this code is instantiated, a helper function would actually be unused -- and then you court compiler warnings.  (Granted at the end of this bug that *does* happen, but it's messy in advance of that, and I have the actually-instantiate patch at the end of the patch series for a reason.)

> can you clarify they share single SourceUnits instance?

Better yet, there's no point in even having the pointer in |SourceUnitsEnd| -- so I just made operator- return |sourceUnits_.remaining()| stored in |SourceUnitsIterator|.

> can we add some debug-only fields to detect mis-use?
> like, storing the expected `current()` pointer or something when copying,
> and validating it when dereferencing?
> 
> also, it would be better preventing any other operations on the copied
> iterator, which modifies the shared SourceUnits.

I added a |Maybe<const mozilla::Utf8Unit*>| field to the iterator, then assert it's |isNothing()| in all functions except operator* and operator++, and in those two assert the expected value in the first and then |reset()| it, in the second |emplace()| |current()| (which incidentally will assert |isNothing()| and verify it's clear).  I think that covers all the bases you mention.

> might be better adding some comment that explains why it receives
> `codePoint`.
> like, this function is used for detecting invalid text, instead of used
> while parsing UTF16 surrogate pair or some similar usage.

Added a comment, but note that most other functions in this file take |uint32_t| and not |char16_t| already, so it's not really breaking new ground.

(In reply to Tooru Fujisawa [:arai] from comment #18)
> > +        MOZ_ASSERT(earliestPossibleStart <= p);
> > +        MOZ_ASSERT(HalfWindowSize() <= WindowRadius);
> > +        if (p <= earliestPossibleStart || HalfWindowSize() >= WindowRadius)
> 
> according to the above assertion, `p < earliestPossibleStart` shouldn't
> happen.

True.  I wrote it that way for extra paranoia at runtime, of the sort that meant bug 1476409 was not actually dangerous -- we had a similar |>=| check there where, assuming proper incrementing, |==| would have been sufficient.

(In reply to Tooru Fujisawa [:arai] from comment #19)
> > +    // This consumes the full, valid code point or ungets |lead| and calls the
> > +    // appropriate error functor on failure.
> 
> where do we unget |lead| ?

See all the various |*aIter -= N| in |DecodeOneUtf8CodePointInline| in mfbt/Utf8.h, that end up invoking |SourceUnitsIterator::operator-=|, that invokes |sourceUnits.unskipCodeUnits(N)|.

Pulsebot

Comment 23

•

6 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/66952e88e68c
Make SourceUnits::findWindow{Start,End} const because they can/should be.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ccbc10cd6a1
Replace SpecializedTokenStreamCharsBase::consumeRestOfSingleLineComment with an infallible SourceUnits::consumeRestOfSingleLineComment, in light of the fact that the UTF-8 version can just stop consuming code points when an encoding error is encountered, leaving error-reporting to whatever actually tries to consume the first code point of the "next" line.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/292bd4af4056
Implement SourceUnits::peekCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f427db8f6f8
Specialize TokenStreamCharsBase::fillCharBufferFromSourceNormalizingAsciiLineBreaks for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/45b34e5af77e
Implement getNonAsciiCodePointDontNormalize for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0847a6bfa7a
Specialize SourceUnits::consumeKnownCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/46c400e4a018
Implement consumeRestOfSingleLine for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/f00dc821eb42
Specialize SourceUnits::findWindowEnd for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/26f77e15d7d2
Specialize SourceUnits::findWindowStart for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/cafc89ca6a87
Stop specializing TokenStreamCharsBase::addLineOfContext now that the CharT-specialized things it relies on are fully specialized.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/107211d728e6
Implement TokenStreamChars::getNonAsciiCodePoint for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/38216fdba3dd
Move TokenStreamChars::getCodePoint to a generalized TokenStreamSpecific::getCodePoint that works for any CharT.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/dad6f1d06271
Make atomizeSourceChars take a Span<CharT>, and define a specialization for UTF-8.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ad2890523ff
Instantiate TokenStreamChars for UTF-8.  r=arai

Natalia Csoregi [:nataliaCs]

Comment 24

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/66952e88e68c
https://hg.mozilla.org/mozilla-central/rev/0ccbc10cd6a1
https://hg.mozilla.org/mozilla-central/rev/292bd4af4056
https://hg.mozilla.org/mozilla-central/rev/7f427db8f6f8
https://hg.mozilla.org/mozilla-central/rev/45b34e5af77e
https://hg.mozilla.org/mozilla-central/rev/d0847a6bfa7a
https://hg.mozilla.org/mozilla-central/rev/46c400e4a018
https://hg.mozilla.org/mozilla-central/rev/f00dc821eb42
https://hg.mozilla.org/mozilla-central/rev/26f77e15d7d2
https://hg.mozilla.org/mozilla-central/rev/cafc89ca6a87
https://hg.mozilla.org/mozilla-central/rev/107211d728e6
https://hg.mozilla.org/mozilla-central/rev/38216fdba3dd
https://hg.mozilla.org/mozilla-central/rev/dad6f1d06271
https://hg.mozilla.org/mozilla-central/rev/0ad2890523ff

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox63: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla63

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Updated

•

6 years ago

Depends on: 1480306

You need to log in before you can comment on or make changes to this bug.