Closed Bug 1461481 Opened 2 years ago Closed 2 years ago

Move various tokenstream functions, that will require specialized definition for single-byte tokenizing, into classes that will permit this

Categories

(Core :: JavaScript Engine, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox62 --- fix-optional

People

(Reporter: Waldo, Assigned: Waldo)

Details

Attachments

(2 files)

Some functions like "unget a char" have to know about the single/double-byte encoding distinction, ergo will require specializing template algorithms.  Because C++ doesn't allow function template partial specialization, the tokenstream class hierarchy mandates that these functions be in TokenStreamChars, where the *class* is specialized by CharT and AnyCharsAccess as sole template parameter can vary.

This bug will move various of those functions demanding specialization into the right locations.  (Some functions will additionally have to move upward in the class hierarchy, so that TokenStreamChars can depend on CharT-sensitive behavior that doesn't require different definitions of the functions.)  This is one of the last bits of change I can do without specifically writing out UTF-8-centric versions of these functions (which requires resolving the snarl in bug 1426909).

I only have a couple of these patches written right now, so I'll post them and probably keep piling on this bug with more and more functions' moves.
This function will require 1/2-byte different definitions (because the UTF-8 case has to check encoding validity), but it doesn't care about AnyCharsAccess, so we can specialize it directly within TokenStreamCharsBase.  This move also cuts on template-instantiation heaviness, because we only have two specializations (1/2) this way versus six ({1,2}{TokenStream,Parser<{Full,Syntax}ParseHandler>}) if we built out the current thing.
Attachment #8975641 - Flags: review?(arai.unmht)
Attachment #8975638 - Flags: review?(arai.unmht) → review+
Comment on attachment 8975641 [details] [diff] [review]
Move filling up a CharBuffer with the characters in a sequence of template literal characters into TokenStreamCharsBase

Review of attachment 8975641 [details] [diff] [review]:
-----------------------------------------------------------------

I have one question.

(In reply to Jeff Walden [:Waldo] from comment #2)
> This function will require 1/2-byte different definitions (because the UTF-8
> case has to check encoding validity)

Can you remind me the plan about where we're going to validate UTF-8 encoding?
I thought it's already validated outside of JS engine?
or it's just that current situation due to chat16_t-only frontend, and we're going to validate inside TokenStream for UTF-8 case?
Attachment #8975641 - Flags: review?(arai.unmht) → review+
(In reply to Tooru Fujisawa [:arai] from comment #3)
> Can you remind me the plan about where we're going to validate UTF-8
> encoding?
> I thought it's already validated outside of JS engine?
> or it's just that current situation due to chat16_t-only frontend, and we're
> going to validate inside TokenStream for UTF-8 case?

The plan is slightly nascent, but: Gecko is only going to hand us validated UTF-8, so we can sort of ignore it for error purposes.

JSAPI clients generally, however, probably will get validation as we go.  For anything with lots of ASCII, you get the validation for nearly free in getTokenInternal's < 128 check.  Otherwise, and in contexts where arbitrary text can appear (string contents, template literal contents as they're accumulated into CharBuffer or so, identifiers, comment contents, &c.), we'll have to do a little extra work *when we encounter overlarge code units only*.

The extra safety of extra checking seems desirable to me at least short term (even for Gecko use where only-valid UTF-8 should be seen).  Longer term, and if/when we get validated-UTF-8 types to rely on, maybe we can undo some of that.  But IMO not (safely) any sooner.

But this is a somewhat active point of contention in bug 1426909 which I have been avoiding touching as long as I can partly for this reason, so there is some amount of this subject to change.
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/70d1589b5df5
Move getChar and updateLineInfoForEOL upward into TokenStreamChars and GeneralTokenStreamChars.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/43b6fedcb9af
Move filling up a CharBuffer with the characters in a sequence of template literal characters into TokenStreamCharsBase.  r=arai
Keywords: leave-open
Priority: -- → P2
The leave-open keyword is there and there is no activity for 6 months.
:Waldo, maybe it's time to close this bug?
Flags: needinfo?(jwalden+bmo)
I'm not sure why I left this open, and we are well past the point where anything has to be moved around for 1/2-byte-specific tokenizing behavior, tho probably some things beyond what I'd done here did require moving around.  (With local patches in place I can run all of jstests and jit-tests tokenizing directly from UTF-8.  And the "-u <path>" argument to the JS shell works just fine to execute direct from UTF-8 right now.)  So yeah, this can be closed.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jwalden+bmo)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.