Closed Bug 1467336 Opened 2 years ago Closed 2 years ago

Move away from having tokenizer code use "char" in function names to using "code unit" or "code point" as appropriate

Categories

(Core :: JavaScript Engine, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(24 files, 1 obsolete file)

22.08 KB, patch
arai
: review+
Details | Diff | Splinter Review
22.87 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.29 KB, patch
arai
: review+
Details | Diff | Splinter Review
14.04 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.20 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
11.46 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.36 KB, patch
arai
: review+
Details | Diff | Splinter Review
10.48 KB, patch
arai
: review+
Details | Diff | Splinter Review
13.80 KB, patch
arai
: review+
Details | Diff | Splinter Review
18.86 KB, patch
arai
: review+
Details | Diff | Splinter Review
7.09 KB, patch
arai
: review+
Details | Diff | Splinter Review
9.11 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.25 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.54 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.53 KB, patch
arai
: review+
Details | Diff | Splinter Review
8.29 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.38 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.96 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.24 KB, patch
arai
: review+
Details | Diff | Splinter Review
27.31 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.36 KB, patch
arai
: review+
Details | Diff | Splinter Review
7.24 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.84 KB, patch
arai
: review+
Details | Diff | Splinter Review
12.29 KB, patch
arai
: review+
Details | Diff | Splinter Review
This has sort of been ongoing for awhile as a broader goal, but lots of function names still use "char" in them with inconsistent meanings applied.

Ultimately it would be nice to use, instead of int32_t to represent a code-unit-or-EOF, some sort of class abstracting that better.  But I don't want to try touching those aspects of APIs until there's better unit/point handling in place.
Nicely, the "unit" nomenclature fairly clearly conveys that "ignore EOL" is part of the idea of it, IMO.  And when a series of patches I'm doing now is done, we'll start with getCodeUnit, proceed to test isAscii, then handle ASCII in one place and non-ASCII (including multi-unit) in the other, with relatively clean separation of the two.
Attachment #8984001 - Flags: review?(arai.unmht)
Attachment #8984002 - Flags: review?(arai.unmht)
The getChar function is used a bunch of places for a bunch of things, and many places IMO should be rewritten to somewhat different manners that first try a getCodeUnit approach, then only proceed to handle non-ASCII in a slower path.  But this particular case *does* want precisely the semantics of getChar.  So, rename the function, then add the bad-old-name back as effectively an alias.
Attachment #8984006 - Flags: review?(arai.unmht)
< 0x80 or <= 0x7F or something is pretty tiresome and repetitive and harder to read than a named function.
Attachment #8984008 - Flags: review?(nfroyd)
This replaces one use of getChar with replacements that bifurcate unit-treatment into ASCII and non-ASCII cases.  It's a simpler case with less cases to consider, so I thought it made a plausible place to simply demonstrate how the new approach works.

Previously you would do

  int32_t c;
  if (!getChar(&c)) // includes LineTerminator normalizing, in UTF-8 will require multi-unit handling
    return false;
  if (c == EOF) {
    ...handle usually as bad...
  }

  ...cases for ASCII special characters, often special handling for LineTerminator...
  ...handle non-special stuff...

So, here's how things now will generally look, if you're getting a code point in some place where particular ASCII code points have special semantics:

  int32_t unit = getCodeUnit(); // EOF, or the code unit -- no LineTerminator norming
  if (unit == EOF) {
    ...handle usually as bad...
  }

  if (isAsciiCodePoint(unit)) {
    ...special handling of units that are special...

    // Otherwise we don't have a special unit, so we have to handle the possibility
    // of LineTerminator sequences -- or not, if the special handling above means no
    // LineTerminator encountered..
    int32_t codePoint;
    if (!getFullAsciiCodePoint(unit, &codePoint)) // requires a not-'\r'/'\n' test
      return false;

    ...do whatever to handle an ASCII code point, including '\n'...
  } else {
    int32_t codePoint;
    if (!getNonAsciiCodePoint(unit, &codePoint))
      return false;

    ...do whatever to handle a possibly non-ASCII code point...
  }

The isAsciiCodePoint function is a renamed wrapper around MOZ_LIKELY(IsAscii(unit)).  Hint should pay off for ASCII sorts of text, including JS libraries in common use, and stuff outside that can hit the slow case.  Special characters can either be specifically picked off inside the isAscii path, or they can be picked off in advance of it.  There are perhaps arguable tradeoffs to each approach.

There is a somewhat alternative approach that could be taken, that privileges *all* one-unit code points (even in UTF-16).  I originally started writing it up this way -- instead of an IsAscii(unit) test, you'd just do !unicode::IsLeadSurrogate -- but I ran into two mini-issues.

First, good function names for the various APIs are harder to find, IMO.  You could do isSingleUnit/getFullSingleUnitCodePoint/getMultiUnitCodePoint maybe.  It's a bit of a mouthful.  ASCII is pretty clean to name.

Second, and I would guess maybe more important, formulation using ASCII means that you only have two line break characters to consider in the fast path, always.  If you instead do single-unit, then all the UTF-16 variants have to handle line/paragraph separators in all cases.  That's fairly stupid given these code points are rare, and it's slower than just excluding non-ASCII.

But, this decision could be revisited in the future if we wanted, without horrid amounts of extra effort.
Attachment #8984017 - Flags: review?(arai.unmht)
Possibly should have happened before the previous patch, but doesn't really make a difference the order.
Attachment #8984019 - Flags: review?(arai.unmht)
Attachment #8984001 - Flags: review?(arai.unmht) → review+
Attachment #8984002 - Flags: review?(arai.unmht) → review+
Attachment #8984006 - Flags: review?(arai.unmht) → review+
Attachment #8984007 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984017 [details] [diff] [review]
Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points

Review of attachment 8984017 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +538,5 @@
> +{
> +#ifdef DEBUG
> +    {
> +        MOZ_ASSERT(!IsAscii(lead),
> +                   "ASCII code unit/point must  be handled separately");

s/must  be/must be/

@@ +542,5 @@
> +                   "ASCII code unit/point must  be handled separately");
> +
> +        sourceUnits.ungetCodeUnit();
> +
> +        char16_t current = sourceUnits.getCodeUnit();

can we add dedicated DEBUG-only SourceUnits method which returns the "current" code unit, which clearly explains it's actually no-op for sourceUnits itself in certain situation?
(or maybe add some comment here)

calling those methods only in DEBUG looks like it behaves differently between DEBUG and non-DEBUG,
even if it actually doesn't.

::: js/src/frontend/TokenStream.h
@@ +1071,5 @@
> +    /**
> +     * Determine whether a code unit is ASCII (and thus can be handled quickly
> +     * when tokenizing either UTF-8 or UTF-16 source code).
> +     */
> +    static MOZ_MUST_USE MOZ_ALWAYS_INLINE bool isAsciiCodePoint(CharT unit) {

function name, parameter, and comment have both "code point" and "code unit", which is a bit confusing what to pass to this function.
maybe be better saying that this checks if the given code unit is the leading code unit of a sequence of code units which consists an ascii code point, in comment (and maybe function name)

or perhaps I'm misunderstanding what "code point" means in term of CR-LF sequence normalization, inside Tokenizer?
to my understanding from those comments around newly added functions, CR-LF is a special single code point here, isn't it?

@@ +1072,5 @@
> +     * Determine whether a code unit is ASCII (and thus can be handled quickly
> +     * when tokenizing either UTF-8 or UTF-16 source code).
> +     */
> +    static MOZ_MUST_USE MOZ_ALWAYS_INLINE bool isAsciiCodePoint(CharT unit) {
> +        return MOZ_LIKELY(mozilla::IsAscii(unit));

this might need some comment why MOZ_LIKELY is here instead of callsite, while this function name itself doesn't explain the probability for true/false.
of course, it's clear that it's likely, if this is used while common regular JS files, tho

@@ +1298,5 @@
> +     * If a LineTerminator sequence was consumed, also update line/column
> +     * information.
> +     */
> +    MOZ_MUST_USE bool getFullAsciiCodePoint(char16_t lead, int32_t* codePoint) {
> +        MOZ_ASSERT(mozilla::IsAscii(lead));

this assertion should use isAsciiCodePoint, given that the consumer of this function should use it.
(which contains MOZ_LIKELY, instead of only bare isAscii)

same for the assertion in getNonAsciiCodePoint

@@ +1303,5 @@
> +
> +        if (MOZ_UNLIKELY(lead == '\r')) {
> +            if (MOZ_LIKELY(sourceUnits.hasRawChars()))
> +                sourceUnits.matchCodeUnit('\n');
> +        } else if (MOZ_LIKELY(lead != '\n')) {

can we check not-LF case first? so that the flow is clearer.
if there's any reason not to do so (maybe optimization reason?) it would be nice to mention it here.

@@ +1318,5 @@
> +     * leading code unit.  Normalizes LineTerminators to '\n'.  Store the
> +     * normalized code point in |*cp| and return true on success, otherwise
> +     * return false and leave |*cp| undefined on failure.
> +     *
> +     * This may have the effect of changing the current |sourceUnits| offset.

this comment also applies to getFullAsciiCodePoint, right? (when consuming LF after CR)
Attachment #8984017 - Flags: review?(arai.unmht) → review+
Attachment #8984018 - Flags: review?(arai.unmht) → review+
Attachment #8984019 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984008 [details] [diff] [review]
Implement mozilla::IsAscii to detect pure ASCII characters

Review of attachment 8984008 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/tests/TestTextUtils.cpp
@@ +17,5 @@
>  
>  static void
> +TestIsAscii()
> +{
> +  // char

Do you think it's worth tossing in `signed char` tests, just to exercise things more on systems where `char` is unsigned by default?  (Do any of our systems actually care about this?  I think PPC is the opposite of x86 on this front, but I can't remember which is which.)
Attachment #8984008 - Flags: review?(nfroyd) → review+
Web searches suggest Android is the oddball with char having the same representation as unsigned char, everything else having char being signed char.  I certainly wouldn't object to splitting the existing char tests for everything into explicitly signed and unsigned char tests, but at least for the simplicity of these things now, it seems like a bunch of work for not much gain (and given Android counters the other platforms, a gain that only appears if *one* platform is somehow internally buggy, seems like).  Doesn't seem valuable enough to just *do*, so I'm leaving this alone.
(In reply to Tooru Fujisawa [:arai] from comment #9)
> can we add dedicated DEBUG-only SourceUnits method which returns the
> "current" code unit, which clearly explains it's actually no-op for
> sourceUnits itself in certain situation?

Good idea -- added a previousCodeUnit function.

> function name, parameter, and comment have both "code point" and "code
> unit", which is a bit confusing what to pass to this function.

The idea is that we're testing whether this code *unit*, which in the general case is but a portion of a code point, also constitutes an entire ASCII code point: anything [0, 128).  Both '\r' and '\n' are code units that are ASCII code points.  This function doesn't care about LineTerminatorSequence: it acts purely at the encoding level.

> maybe be better saying that this checks if the given code unit is the
> leading code unit of a sequence of code units which consists an ascii code
> point, in comment (and maybe function name)

There isn't really any "sequence" meaning to this at all.  Sequence-handling is a secondary step after this function's return value is examined.

> or perhaps I'm misunderstanding what "code point" means in term of CR-LF
> sequence normalization, inside Tokenizer?
> to my understanding from those comments around newly added functions, CR-LF
> is a special single code point here, isn't it?

As a general rule in JS, LineTerminatorSequence is any of (examined greedily) '\r', '\n', "\r\n" (i.e. '\r' demands it not be followed by '\n'), '\u2028', and '\u2029'.  Outside of string and template literals, these are all normalized to the single ASCII code point '\n'.  (Inside string and template literals, '\u2028' and '\u2029' are used literally.)

getFullAsciiCodePoint, because it knows |lead| is ASCII, for anything but '\r' and '\n', returns it exactly.  '\n' is also returned exactly, after line-info is updated.  For '\r', a subsequent '\n' is also consumed -- then either "\r\n" or '\r' is normalized to '\n'.

In the end, I went with

/**
 * Determine whether a code unit constitutes a complete ASCII code point.
 * (The code point's exact identity might not end up being used, however,
 * if the code point is part of a LineTerminator sequence.)
 */

as the comment.  (Tho looking at this, I probably should change that to LineTerminatorSequence.)

> this might need some comment why MOZ_LIKELY is here instead of callsite,

Mm, fair point -- MOZ_LIKELY and such should go at call sites.  Moved that out of here and into them.

> this assertion should use isAsciiCodePoint, given that the consumer of this
> function should use it.
> 
> same for the assertion in getNonAsciiCodePoint

Good point, changed.

> @@ +1303,5 @@
> > +
> > +        if (MOZ_UNLIKELY(lead == '\r')) {
> > +            if (MOZ_LIKELY(sourceUnits.hasRawChars()))
> > +                sourceUnits.matchCodeUnit('\n');
> > +        } else if (MOZ_LIKELY(lead != '\n')) {
> 
> can we check not-LF case first? so that the flow is clearer.
> if there's any reason not to do so (maybe optimization reason?) it would be
> nice to mention it here.

Don't think so?  The not-'\n' case has to treat '\r' as a line break, so you have to pile up two conditions before you can just return the code unit directly.  This maps straightforwardly onto the ASCII subset of LineTerminatorSequence, so I'm not sure a comment does much, and I haven't added one.

> this comment also applies to getFullAsciiCodePoint, right? (when consuming
> LF after CR)

Yeah -- changed both comments to be more similar in the final version.
(In reply to Jeff Walden [:Waldo] from comment #12)
> In the end, I went with
> 
> /**
>  * Determine whether a code unit constitutes a complete ASCII code point.
>  * (The code point's exact identity might not end up being used, however,
>  * if the code point is part of a LineTerminator sequence.)
>  */
> 
> as the comment.  (Tho looking at this, I probably should change that to
> LineTerminatorSequence.)

Thank you for explanation and the comment :)


> > @@ +1303,5 @@
> > > +
> > > +        if (MOZ_UNLIKELY(lead == '\r')) {
> > > +            if (MOZ_LIKELY(sourceUnits.hasRawChars()))
> > > +                sourceUnits.matchCodeUnit('\n');
> > > +        } else if (MOZ_LIKELY(lead != '\n')) {
> > 
> > can we check not-LF case first? so that the flow is clearer.
> > if there's any reason not to do so (maybe optimization reason?) it would be
> > nice to mention it here.
> 
> Don't think so?  The not-'\n' case has to treat '\r' as a line break, so you
> have to pile up two conditions before you can just return the code unit
> directly.  This maps straightforwardly onto the ASCII subset of
> LineTerminatorSequence, so I'm not sure a comment does much, and I haven't
> added one.

okay, makes sense.
the current one should be better.
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b6821b80af87
Rename getCharIgnoreEOL to getCodeUnit to better indicate that's all it does.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/e90575cf96ce
Rename ungetCharIgnoreEOL to ungetCodeUnit.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d80671bf849
Add TokenStreamChars::getCodePoint and use it in one trivial place.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/b2b3d152d7ea
Rename TokenStreamSpecific::matchChar to TokenStreamCharsBase::matchCodeUnit to clarify its unit-matching nature.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/2dac13bda723
Implement mozilla::IsAscii to detect pure ASCII characters.  r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/564254cf34aa
Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a295d6c1688
Process the code units comprising regular expression literals in a manner that optimizes the pure-ASCII case.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/afcc2d65b6fb
Move regular expression literal tokenizing into its own function.  r=arai
More patches coming along these lines -- I am somewhat likely to run this bug to ground with many patches, not spin off a new one each time I change something.
Keywords: leave-open
They aren't all used as code units yet, but an alpha-rename hopefully simplifies future changes to make that so.
Attachment #8984543 - Flags: review?(arai.unmht)
The one place that doesn't process in this manner, should be suitably flagged in the changes.  :-)

I will confess, however, that the escape-handling goo I am largely relying on tests to verify correctness, combined with trying to just not disturb it too much.  (And I seriously question whether strings and template tokens should be parsed by the same function, given what a mess shoehorning in template-parsing, or alternatively having to have all these silly conditions guarding everything, makes to both algorithms.)
Attachment #8984544 - Flags: review?(arai.unmht)
It's almost unbelievably silly that the current algorithm wasn't written to work the way this new algorithm clearly is.
Attachment #8984545 - Flags: review?(arai.unmht)
I paired the name up with atStart -- except that atStart doesn't actually seem to indicate what it seems like it does.  Notably, since this revision it appears atStart has stopped meaning pointing-at-the-original-character-of-source:

https://searchfox.org/mozilla-central/diff/29234315debd45187859a4c7ea8dc8630db0d9a3/js/src/frontend/TokenStream.h#681

Every user of atStart really wants the old meaning (borne out by a try-push), so I'll fix it in a followup patch.
Attachment #8984547 - Flags: review?(arai.unmht)
This also eliminates ungets-then-skips that cancel out when a valid escape is matched (as it usually will be, 'cause people don't write invalid code).
Attachment #8984709 - Flags: review?(arai.unmht)
I note in passing that this seems to use IsIdentifierPart even for the first code point.  I guess/hope IsIdentifierStart must be a proper subset, because otherwise this seems buggy.
Attachment #8984711 - Flags: review?(arai.unmht)
The remaining functions requiring modification along these lines is pretty small.  Maybe even zero, at a quick skim.  Will regroup and examine fully Monday.
Whoops -- no need to unget meant I could get rid of a |CharT cp[6];| and |cp[i++] = unit|, but obviously the |i++| needed to stick around, as a try-run pointed out.
Attachment #8984719 - Flags: review?(arai.unmht)
Attachment #8984709 - Attachment is obsolete: true
Attachment #8984709 - Flags: review?(arai.unmht)
Attachment #8984543 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984544 [details] [diff] [review]
Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point

Review of attachment 8984544 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +2332,5 @@
> +        MOZ_ASSERT(!this->sourceUnits.hasRawChars() ||
> +                   this->sourceUnits.peekCodeUnit() == '\r' ||
> +                   this->sourceUnits.peekCodeUnit() == '\n',
> +                   "must be parked at EOF or EOL to call this function");
> +        const char delimiters[] = { untilChar, untilChar, '\0' };

pre-existing tho, can you add some comment about those 2-untilChar's, which explains
they're used as beginning quotation mark and ending quotation mark in the error message?
(it's a bit surprising/confusing that we're passing 2 untilChar's, given that now `errnum` is also variable)

@@ +2395,5 @@
> +                if (!getNonAsciiCodePoint(unit, &codePoint))
> +                    return false;
> +
> +                // LineContinuation represents no code points.
> +                if (codePoint != '\n') {

might be better explicitly saying they're LINE_SEPARATOR and PARA_SEPARATOR (of course LineContinuation implies them tho),
so that it's clear that the '\n' matches to `!isAsciiCodePoint` condition above.

@@ +2509,5 @@
> +                    if (!appendCodePointToTokenbuf(code))
> +                        return false;
> +
> +                    continue;
> +                } // (c2 == '{') // delimited Unicode escape

this comment is confusing, in term of which part this is for, whether the above block (corresponds to namespace's one) or the following block (implicit else condition).

IMO this is not necessary.
Attachment #8984544 - Flags: review?(arai.unmht) → review+
Attachment #8984545 - Flags: review?(arai.unmht) → review+
Attachment #8984547 - Flags: review?(arai.unmht) → review+
Attachment #8984548 - Flags: review?(arai.unmht) → review+
Attachment #8984549 - Flags: review?(arai.unmht) → review+
Attachment #8984706 - Flags: review?(arai.unmht) → review+
Attachment #8984707 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984708 [details] [diff] [review]
Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars

Review of attachment 8984708 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1086,2 @@
>  {
> +    MOZ_ASSERT(sourceUnits.previousCodeUnit() == '{');

nice that this can now be an assert :)
Attachment #8984708 - Flags: review?(arai.unmht) → review+
Attachment #8984719 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984710 [details] [diff] [review]
Make TokenStreamSpecific::identifierName consume in terms of code units

Review of attachment 8984710 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1480,5 @@
> +
> +            if (!unicode::IsIdentifierPart(uint32_t(codePoint))) {
> +                ungetCodePointIgnoreEOL(codePoint);
> +                if (codePoint == unicode::LINE_SEPARATOR || codePoint == unicode::PARA_SEPARATOR)
> +                    anyCharsAccess().undoInternalUpdateLineInfoForEOL();

those 3 lines appear multiple times.
can you add a method for them?
Attachment #8984710 - Flags: review?(arai.unmht) → review+
Attachment #8984711 - Flags: review?(arai.unmht) → review+
Comment on attachment 8984712 [details] [diff] [review]
Make TokenStreamSpecific::getTokenInternal itself deal only in code units

Review of attachment 8984712 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1858,5 @@
>  
> +        if (MOZ_UNLIKELY(!isAsciiCodePoint(unit))) {
> +            // Non-ASCII code points can only be identifiers or whitespace.
> +            // It would be nice to compute these *after* discarding whitespace,
> +            // but ɪɴ ᴀ ᴡᴏʀʟᴅ where |unicode::IsSpaceOrBOM2| requires consuming

hmm... it's not a good idea to use these code points in the code comment, for search-ability and accessibility (code search doesn't match for ASCII query "IN A WORLD", and screen-reader cannot pronounce them properly).
Attachment #8984712 - Flags: review?(arai.unmht) → review+
Attachment #8984713 - Flags: review?(arai.unmht) → review+
(In reply to Tooru Fujisawa [:arai] from comment #33)
> > +                } // (c2 == '{') // delimited Unicode escape
> 
> this comment is confusing, in term of which part this is for, whether the
> above block (corresponds to namespace's one) or the following block
> (implicit else condition).
> 
> IMO this is not necessary.

Mm.  This isn't the biggest block of interminable code, but I found it very difficult as I was working on this to remember precisely what this block of code was handling, as I was scrolling around vertically within this code.  Basically the same sort of issue motivating the "// defined(XP_WIN)"-style comments after #endifs and such.  I changed the comment to

  } // end of delimited Unicode escape handling

which hopefully is more readable as prose.
(In reply to Tooru Fujisawa [:arai] from comment #36)
> > +            // but ɪɴ ᴀ ᴡᴏʀʟᴅ where |unicode::IsSpaceOrBOM2| requires consuming
> 
> hmm... it's not a good idea to use these code points in the code comment,
> for search-ability and accessibility (code search doesn't match for ASCII
> query "IN A WORLD", and screen-reader cannot pronounce them properly).

Boooooooooooooooooooooooooooooooooo you're wrong but also right.  :-(  :-)

(But that's totes a screen reader bug IMO.)
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/103a82585566
Rename the |c| variables in TokenStreamSpecific::getStringOrTemplateToken to |unit|.  They aren't all used as code units yet, but an alpha-rename hopefully simplifies future changes to make that so.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/82729f89bc61
Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d10297c4384
Rename TokenStreamSpecific::peekChars to SourceUnits::peekCodeUnits and optimize its code in now-obvious ways.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/03e60a811919
Invert SourceUnits::hasRawChars's meaning and rename it to SourceUnits::atEnd.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/673d93f7c030
Change SourceUnits::atStart to indicate being at the start of the source units of code, not at the start of the *containing* source.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/2b5d5815e7ef
Move skipChars to SourceUnits::skipCodeUnits, and make it efficient in the obvious way.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/18c217828e1b
Convert TokenStreamSpecific::getDirective act on code units, and remove TokenStreamSpecific::peekChar.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/b04275fa34e6
Make TokenStreamSpecific::decimalNumber work by consuming code units.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/96bab59f48e0
Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/96438f0ed68a
Refactor Unicode-escape identifier matching to work in terms of code units.  This also eliminates ungets-then-skips that cancel out when a valid escape is matched.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/57a1edba27d7
Make TokenStreamSpecific::identifierName consume in terms of code units.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/dba0ac1d0a0f
Make TokenStreamSpecific::putIdentInTokenbuf read identifier characters a code unit at a time.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/36f3b5eb7294
Make TokenStreamSpecific::getTokenInternal itself deal only in code units.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/8949f1ff42ac
Move a couple small functions to be defined inline in TokenStream class bodies, not out-of-line in TokenStream.cpp, for readability.  r=arai
Not yet ready to conclude this work is all done, so continuing to leave open for now.
Depends on: 1471465
With this patch I will declare this bug's challenge complete.  Other bugs will follow for various mop-up needed to increasingly instantiate TokenStream base classes for UTF-8 as well as UTF-16.
Attachment #8988436 - Flags: review?(arai.unmht)
Attachment #8988435 - Flags: review?(arai.unmht) → review+
Attachment #8988436 - Flags: review?(arai.unmht) → review+
Priority: -- → P1
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d9c420a912b
Remove GeneralTokenStreamChars::{,un}getChar now that they're unused.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/4396a31f09b5
Change a bunch of 'character' nomenclature in token stream code to 'code unit', completing the transition from tokenizing by UTF-16 'character' to tokenizinng by UTF-8/16 code unit.  (There are straggling places where algorithms will need to be specialized for UTF-8, or functions will need to move within the TokenStream* hierarchy to permit such; but what is in the tree now universally acts on code units first, full code points second.)  r=arai
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/5d9c420a912b
https://hg.mozilla.org/mozilla-central/rev/4396a31f09b5
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Depends on: 1476409
You need to log in before you can comment on or make changes to this bug.