Closed Bug 1433909 Opened 2 years ago Closed 2 years ago

Use (un)getCharIgnoreEOL more often when line-position updates are not needed

Categories

(Core :: JavaScript Engine, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox60 --- wontfix
firefox61 --- fixed

People

(Reporter: anba, Assigned: anba)

References

Details

Attachments

(3 files, 6 obsolete files)

From bug 1421400 comment #1:
---
We have TokenStream::reportError(...) [1] and TokenStream::error(...) [2], but both functions are identical. Callers unget the last character before calling reportError(...) [4], probably to adjust the error position, but reportError(...) uses the current token position for error reporting, so ungetting the last character is useless if performed to adjust the error position. Is there possibly another reason why the last character is unget?
---

I think this is what happened here:
bug 638034 added calls to ungetChar() for proper error reporting, but when bug 639420 replaced some getChar/ungetChar() calls with getCharIgnoreEOL()/ungetCharIgnoreEOL(), we didn't remove the now no longer needed calls to ungetCharIgnoreEOL().
Assignee: nobody → andrebargull
Blocks: 1421400
Status: NEW → ASSIGNED
bug 638034 added calls to ungetChar() before error reporting functions to ensure the correct line number is included in the error message (i.e. when a line separator is returned by getChar(), the TokenStream's line position information gets updated; calling ungetChar() ensures the line position reflects the correct error position). But then bug 639420 introduced the getCharIgnoreEOL()/ungetCharIgnoreEOL() optimizations. And because getCharIgnoreEOL() doesn't update the line position information, it's no longer necessary to call ungetCharIgnoreEOL() before reporting an error. This allows us to remove ungetCharIgnoreEOL() when it's immediately followed by reportError()/error().
Attachment #8946809 - Flags: review?(jwalden+bmo)
For the most part some mini-optimizations to avoid unnecessary line number processing in TokenStream methods.  


TokenStreamSpecific::peekChars()
This function used to call getChar(), which normalizes all line separators to \n. Nowadays getCharIgnoreEOL() is called here and that one doesn't perform any line separator normalization, so the \n check only handles 1 of 4 possible line separators. The \n check itself was added in bug 344711 to fix an issue when getChar() updated the line position information. Now that getCharIgnoreEOL() is used, this issue no longer applies and therefore we don't need to check for \n.


TokenStreamSpecific::getDirective()
- Replace ungetChar() with ungetCharIgnoreEOL() because both functions behave identically when called with a non-line separator, only that ungetCharIgnoreEOL() is faster.
- Replace peekChar() with matchChar()+ungetCharIgnoreEOL() to avoid the extra line number processing in peekChar().


TokenStreamSpecific::getTokenInternal()
- Replace ungetChar() with ungetCharIgnoreEOL() where possible.


- Single-line comment processing
 - Replace peekChar()+consumeKnownChar() with getCharIgnoreEOL()+ungetCharIgnoreEOL() to avoid any extra line number processing.
 - Similarly replace getChar()+ungetChar() with getCharIgnoreEOL()+ungetCharIgnoreEOL()+TokenBuf::isRawEOLChar().
 - This improves the following µ-benchmark from 880ms to 690ms for me:
    ---
    var s = "//\n".repeat(2**24);
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---
 - And this one from 1140ms to 830ms:
    ---
    var s = "//abcdefghij\n".repeat(2**23);
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---
  - And this one from 820ms to 750ms:
    ---
    var s = "<!--\n".repeat(2**24);
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---
 
- RegExp flags processing
 - Replace more peekChar()+getChar() combos with getCharIgnoreEOL()+ungetCharIgnoreEOL().
 - Avoid modifying the token start position to adjust the error position, instead use errorAt().
 - This improves the following µ-benchmark from 1300ms to 1200ms for me:
    ---
    var s = "/a/gimuy;".repeat(2**20);
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---

- HTML comments processing
 - Replace peekChar() with matchChar(). No reason to use peekChar() if we end up with ignoring all characters until the next line separator.
 - This improves the following µ-benchmark from 965ms to 675ms for me (also affected by the single-line comment updates from above):
    ---
    var s = "-->\n".repeat(2**24);
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---


TokenStreamSpecific::getStringOrTemplateToken()
- Replace peekChar()+consumeKnownChar() with getCharIgnoreEOL().
- Avoid peeking the first character in \uXYZX escape sequence twice, instead use the same approach as in peekUnicodeEscape() and only peek the next three characters when we already know the first one.
- Calling |userbuf.peekRawChar()| can skip over the TokenBuf limit, instead use hasRawChars()+matchRawChar() (just like we do everywhere else for this case).
- Replace a getCharIgnoreEOL()+ungetCharIgnoreEOL() sequence with matchChar().
- These changes improve the following µ-benchmark from 730ms to 615ms for me:
    ---
    var s = "'" + "\\u0000".repeat(2**23) + "'";
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---
- And this one improves from 465ms to 375ms for me:
    ---
    var s = "'" + "\\u{0}".repeat(2**23) + "'";
    var t0 = dateNow(); Function(s); var t1 = dateNow();
    print(t1 - t0);
    ---


TokenStreamSpecific::skipChars()
- Calling getCharIgnoreEOL() doesn't normalize line separators, so use TokenBuf::isRawEOLChar() in the assertion.
Attachment #8946818 - Flags: review?(jwalden+bmo)
Directly encode '`' as |String| instead of using the indirection through |Templat|. Also fix some indentation issues to ensure the commas line up properly.
Attachment #8946819 - Flags: review?(jwalden+bmo)
- Rename undoGetChar() to undoInternalUpdateLineInfoForEOL() so it's clearer that this method is the counterpart to internalUpdateLineInfoForEOL() and not the counterpart for getChar().
- And then move it next to internalUpdateLineInfoForEOL().
- Replace "MultiUnitCodepoint" with "SupplementaryCodepoint" in two functions to match the use of "Supplementary" in vm/Unicode.h.
- Also use the "match" prefix for "matchSupplementaryCodepoint()" (previous name: "isMultiUnitCodepoint()") to give a hint that this method consumes characters and for consistency with other "match"-prefixed methods.
Attachment #8946822 - Flags: review?(jwalden+bmo)
Attachment #8946819 - Flags: review?(jwalden+bmo) → review+
The timing of this vis-a-vis a few patches I've had locally for a bit is unfortunate.  I've coincidentally stumbled over some of this (and our failure to report precise error locations for certain intra-token syntax errors) in the context of intra-code point errors in UTF-8 source text, and I'm making a few changes in this rough area in bug 1434429.  :-\

So, some of this conflicts with that -- the multi-unit codepoint changes, getting rid of some of the ungetChar* stuff (I would like us to remove reportError entirely and replace all of it with error() calls, actually -- maybe that goal conflicts with this?  haven't thought it through to say, but maybe), the "match" issue, etc.  I've stamped what I can for now, but some bits of this will need to wait for that other bug to go.
(In reply to Jeff Walden [:Waldo] (I'm baaaaaaack...) from comment #5)
> The timing of this vis-a-vis a few patches I've had locally for a bit is
> unfortunate.  I've coincidentally stumbled over some of this (and our
> failure to report precise error locations for certain intra-token syntax
> errors) in the context of intra-code point errors in UTF-8 source text, and
> I'm making a few changes in this rough area in bug 1434429.  :-\

We even managed to upload the patches at the same time! \o/


> So, some of this conflicts with that -- the multi-unit codepoint changes,
> getting rid of some of the ungetChar* stuff (I would like us to remove
> reportError entirely and replace all of it with error() calls, actually --
> maybe that goal conflicts with this?  haven't thought it through to say, but
> maybe), the "match" issue, etc.  I've stamped what I can for now, but some
> bits of this will need to wait for that other bug to go.

Yes, part 1 will conflict with reporting the error at the current position instead of the start position of the current token. And part 4 partially conflicts with bug 1434429, too. For example both bugs rename isMultiUnitCodepoint() to use the "match"-prefix, but part 4 also replaces "MultiUnitCodepoint" with "SupplementaryCodepoint".
Depends on: 1434429
Priority: -- → P1
Bug 1436150's patch (and getting it reviewed) stands in the way of landing the last two pieces in bug 1434429 that stand in the way of moving forward here.  Patch there has been posted for all of a day, so it hardly seems to be at the point where it'd be time to push on that.  At least, not unless something here is particularly pressing for some non-obvious reason.
Comment on attachment 8946809 [details] [diff] [review]
bug1433909-part1-unget-before-error.patch

Clearing to get out of the queue, on the vague recollection of this conflicting with the bug I mentioned in an earlier comment here -- reflag/update as appropriate accounting for that bug, please.
Attachment #8946809 - Flags: review?(jwalden+bmo)
Attachment #8946818 - Flags: review?(jwalden+bmo)
Attachment #8946822 - Flags: review?(jwalden+bmo)
(In reply to Jeff Walden [:Waldo] from comment #8)
> Clearing to get out of the queue, on the vague recollection of this
> conflicting with the bug I mentioned in an earlier comment here --
> reflag/update as appropriate accounting for that bug, please.

Yes, the patches conflict with bug 1434429. So I'm currently waiting that all patches for bug 1434429 land and then I'll update the patches here and re-request review.
Morphing this bug given that comment #5 indicates that exact source position instead of token position errors should happen in the future.
Summary: Remove calls to TokenStreamSpecific::ungetCharIgnoreEOL if immediately followed by reportError/error → Use (un)getCharIgnoreEOL more often when line-position updates are not needed
This was part 2 in the original patch series, see comment #2 for the performance results.
Attachment #8946809 - Attachment is obsolete: true
Attachment #8946818 - Attachment is obsolete: true
Attachment #8962324 - Flags: review?(jwalden+bmo)
Renamed part 3 -> part 2, carrying r+.
Attachment #8946819 - Attachment is obsolete: true
Attachment #8962325 - Flags: review+
Part 4 -> Part 3.

- Rename undoGetChar() to undoInternalUpdateLineInfoForEOL() so it's clearer that this method is the counterpart to internalUpdateLineInfoForEOL() and not the counterpart for getChar().
- And then move it next to internalUpdateLineInfoForEOL().
- Rename "matchMultiUnitCodePointSlow" to "matchSupplementaryCodePointSlow" to match the use of "Supplementary" in vm/Unicode.h.
Attachment #8946822 - Attachment is obsolete: true
Attachment #8962328 - Flags: review?(jwalden+bmo)
(In reply to André Bargull [:anba] from comment #13)
> - Rename "matchMultiUnitCodePointSlow" to "matchSupplementaryCodePointSlow"
> to match the use of "Supplementary" in vm/Unicode.h.

The "multi-unit" name is because this is intended to be specialized for UTF-8 parsing, where it would not have "supplementary" intent.  Perhaps that's a fine enough name in solely-UTF-16 code, in TokenStreamChars or TokenStreamCharsBase.  But for the interface level stuff that will be shared by multiple classes through one template interface, "multi-unit" is accurate and "supplementary" is not.  Maybe the 16-bit *definition* of the interface-level function could call something with "supplementary" in the name, tho, if that helps.
Ah, I almost feared that's the reason the name contains "MultiUnit". I'll just remove those bits from part 3 and, if necessary, we can revisit the naming after UTF-8 parsing lands.
Comment on attachment 8962324 [details] [diff] [review]
bug1433909-part1-lexer-perf.patch

Review of attachment 8962324 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +618,5 @@
>      for (i = 0; i < n; i++) {
>          int32_t c = getCharIgnoreEOL();
>          if (c == EOF)
>              break;
>  

FWIW I've thought it would be nice to combine peekChars with CharsMatch -- the former is used mostly with the latter, or with lookahead for hexdigit stuff that might be better another way -- and have this be passed a string/length and compare/consume chars *directly* against that.  Conceptually a simple change, but the brass tacks of it looked too dilatory for me to do at the time, last time I looked at it.

@@ +1924,1 @@
>              if (c == '@' || c == '#') {

A matchChars interface that took a lambda might be a nice thing, just musing.

@@ +2021,5 @@
> +
> +                if ((reflags & flag) || flag == NoFlags) {
> +                    MOZ_ASSERT(userbuf.offset() > 0);
> +                    char buf[2] = { char(c), '\0' };
> +                    errorAt(userbuf.offset() - 1, JSMSG_BAD_REGEXP_FLAG, buf);

Mm.  I should remember this, but what's the encoding of error messages?  We're putting a Latin-1 character here (mangling one in, really).  I don't remember how kosher that is.  Please double-check it is.

@@ +2348,5 @@
>      if (!atom)
>          return false;
>  
>      if (!parsingTemplate) {
>          (*tp)->type = TokenKind::String;

MOZ_ASSERT(!templateHead) for good measure.
Attachment #8962324 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] from comment #16)
> @@ +2021,5 @@
> > +
> > +                if ((reflags & flag) || flag == NoFlags) {
> > +                    MOZ_ASSERT(userbuf.offset() > 0);
> > +                    char buf[2] = { char(c), '\0' };
> > +                    errorAt(userbuf.offset() - 1, JSMSG_BAD_REGEXP_FLAG, buf);
> 
> Mm.  I should remember this, but what's the encoding of error messages? 
> We're putting a Latin-1 character here (mangling one in, really).  I don't
> remember how kosher that is.  Please double-check it is.

|c| is always an ASCII character here (because the else-branch after |else if (JS7_ISLET(c))| has a break).
Comment on attachment 8962328 [details] [diff] [review]
bug1433909-part3-rename-methods.patch

Review of attachment 8962328 [details] [diff] [review]:
-----------------------------------------------------------------

Stick to just the undoGetChar name change/move, don't rename the matchMulti* function, and this is good.

::: js/src/frontend/TokenStream.cpp
@@ +486,3 @@
>  template<typename CharT, class AnyCharsAccess>
>  MOZ_MUST_USE MOZ_ALWAYS_INLINE bool
>  TokenStreamSpecific<CharT, AnyCharsAccess>::updateLineInfoForEOL()

Mm, this is maybe a bit dumb to have the little TokenStreamSpecific helper for this, particularly if it hides the symmetry of these two functions.  Not sure if "internal" is entirely desirable either, at that point.  But eh, for later.
Attachment #8962328 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] from comment #18)
> Stick to just the undoGetChar name change/move, don't rename the matchMulti*
> function, and this is good.

Hah, I forgot to upload the updated patch. The try-runs already contain the correct patch which only changes |undoGetChar|: https://hg.mozilla.org/try/rev/3c1810a536b6334efb301ce68bff8d46b001f72f
Update patch to add extra assertion per review comments, carrying r+.
Attachment #8962324 - Attachment is obsolete: true
Attachment #8963607 - Flags: review+
Update patch to only rename TokenStream::undoGetChar, carrying r+.
Attachment #8962328 - Attachment is obsolete: true
Attachment #8963609 - Flags: review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8af22a4ed75
Part 1: Use (un)getCharIgnoreEOL in more cases in TokenStream. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/8bee54a25340
Part 2: Fix indentation and directly use String for template strings in firstCharKinds. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/5c1eb1402aec
Part 3: Rename TokenStream::undoGetChar for clarity. r=Waldo
Keywords: checkin-needed
You need to log in before you can comment on or make changes to this bug.