Closed Bug 1351107 Opened 7 years ago Closed 5 years ago

Make TokenStream parse both single-byte UTF- and two-byte source text

Categories

(Core :: JavaScript Engine, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
Performance Impact high
Tracking Status
firefox60 --- wontfix
firefox61 - wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix

People

(Reporter: Waldo, Assigned: Waldo)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(32 files, 7 obsolete files)

1.83 KB, patch
arai
: review+
Details | Diff | Splinter Review
34.07 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.40 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.59 KB, patch
arai
: review+
Details | Diff | Splinter Review
8.26 KB, patch
arai
: review+
Details | Diff | Splinter Review
26.61 KB, patch
arai
: review+
Details | Diff | Splinter Review
26.21 KB, patch
arai
: review+
Details | Diff | Splinter Review
28.54 KB, patch
arai
: review+
Details | Diff | Splinter Review
32.63 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.43 KB, patch
arai
: review+
Details | Diff | Splinter Review
27.57 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.13 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.27 KB, patch
arai
: review+
Details | Diff | Splinter Review
2.46 KB, patch
arai
: review+
Details | Diff | Splinter Review
4.31 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.57 KB, patch
arai
: review+
Details | Diff | Splinter Review
213.93 KB, patch
arai
: review+
Details | Diff | Splinter Review
3.62 KB, patch
arai
: review+
Details | Diff | Splinter Review
5.55 KB, patch
arai
: review+
Details | Diff | Splinter Review
30.67 KB, patch
arai
: review+
Details | Diff | Splinter Review
21.52 KB, patch
arai
: review+
Details | Diff | Splinter Review
9.28 KB, patch
arai
: review+
Details | Diff | Splinter Review
10.09 KB, patch
arai
: review+
Details | Diff | Splinter Review
8.47 KB, patch
arai
: review+
Details | Diff | Splinter Review
109.98 KB, patch
arai
: review+
Details | Diff | Splinter Review
11.70 KB, patch
arai
: review+
Details | Diff | Splinter Review
1.73 KB, patch
arai
: review+
Details | Diff | Splinter Review
132.93 KB, patch
arai
: review+
Details | Diff | Splinter Review
23.52 KB, patch
arai
: review+
Details | Diff | Splinter Review
6.76 KB, patch
arai
: review+
Details | Diff | Splinter Review
39.62 KB, patch
arai
: review+
Details | Diff | Splinter Review
23.21 KB, application/octet-stream
Details
Right now the entire frontend depends on TokenStream to be able to report errors.  But if TokenStream is duplicated for single-byte and two-byte parsing, that means effectively the entire frontend doubles.  Let's separate out reporting so that we don't have to duplicate so hard.
This is a random bit of cleanup I noticed working on actual patches for this bug.  It's not actually related.  :-)
Attachment #8851812 - Flags: review?(arai.unmht)
Actually, these patches are intermingled enough in sense I probably should retitle, even if extricating error reporting is the bulk of what I've finished at this point.
Summary: Move TokenStream's central compile-error/warning-reporting functionality into standalone functions → Make TokenStream parse both single-byte (whichever of Latin-1/UTF-8 is simpler, for a first pass) and two-byte source text
I did the splitting here mostly manually.  There may be lingering bits in TokenStream that are character-agnostic, but we can find/deal with them later if they exist.
Attachment #8851814 - Flags: review?(arai.unmht)
Just some preliminary cleaning.
Attachment #8851817 - Flags: review?(arai.unmht)
TokenStream really is going to have to be involved in this process at *some* point, if it's going to include -- at a minimum -- the line of context.  This at least is getting closer to a point where other code could pass in its computed version of this stuff.
Attachment #8851818 - Flags: review?(arai.unmht)
I get why this was in TokenStream previously, but it seems to me that it's better off put into asmjs-specific code, given what it's doing.  And ultimately, there's only going to be the compute-metadata aspect of this that'll be TokenStream-specific, and everything else will be unrelated to TokenStream.
Attachment #8851819 - Flags: review?(arai.unmht)
Ultimately I want reporting warnings and reporting errors to be 100% different codepaths, but separate signatures will do for now.
Attachment #8851821 - Flags: review?(arai.unmht)
There's certainly much more reporting code that could move into these files.  I only moved pretty much what I had to.  (I think CallWarningReporter was the only thing that I moved for consistency, and only because I saw its use in the relevant code, saw it as sensible to move, and saw it was easy *to* move.)
Attachment #8851823 - Flags: review?(arai.unmht)
As mentioned before, there might be more, but I'm moving as I see stuff.
Attachment #8851825 - Flags: review?(arai.unmht)
Blocks: 1347824
Attachment #8851818 - Attachment is obsolete: true
Attachment #8851818 - Flags: review?(arai.unmht)
Attachment #8851812 - Flags: review?(arai.unmht) → review+
Attachment #8851814 - Flags: review?(arai.unmht) → review+
Attachment #8851817 - Flags: review?(arai.unmht) → review+
Attachment #8851819 - Flags: review?(arai.unmht) → review+
Comment on attachment 8851821 [details] [diff] [review]
Split reportCompileErrorNumberVA into separate functions for unconditionally reporting warnings and unconditionally reporting errors

Review of attachment 8851821 [details] [diff] [review]:
-----------------------------------------------------------------

the change to compileWarningOrError should be reverted I think.

::: js/src/frontend/TokenStream.cpp
@@ -821,5 @@
> -
> -    if (warning && options().werrorOption) {
> -        flags &= ~JSREPORT_WARNING;
> -        warning = false;
> -    }

why is this code block removed?

@@ +845,5 @@
>  
>      if (!cx->helperThread())
>          err.throwError(cx);
>  
> +    return true;

this should return false for error, including werror
about comment #12.
Flags: needinfo?(jwalden+bmo)
Comment on attachment 8851822 [details] [diff] [review]
Move a few functions out of TokenStream, into TokenStreamBase, as they don't depend on the type of character TokenStream processes

Review of attachment 8851822 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +1053,3 @@
>          uint32_t startOffset_;          // offset of base_[0]
> +        const CharT* limit_;         // limit for quick bounds check
> +        const CharT* ptr;            // next char to get

alignment for the comments is off here.
(but I think it's better moving those comments into their own lines instead of trying to keep alignment?
Attachment #8851822 - Flags: review?(arai.unmht) → review+
I'll review the remaining parts tonight.
Comment on attachment 8851823 [details] [diff] [review]
Move the core elements of error/warning reporting out of TokenStream into a new ErrorReporting.{cpp,h} API

Review of attachment 8851823 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/ErrorReporting.cpp
@@ +83,5 @@
> +
> +    if (!cx->helperThread())
> +        err.throwError(cx);
> +
> +    return true;

same as the previous patch.
this should return false for error/werror
Attachment #8851825 - Flags: review?(arai.unmht) → review+
Comment on attachment 8851861 [details] [diff] [review]
Pass error metadata to the error-reporting function, rather than having that function do so itself

Review of attachment 8851861 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure why we need 2 struct ErrorMetadata definitions.

::: js/src/frontend/TokenStream.cpp
@@ +747,5 @@
> +            filenameIsFromCaller = true;
> +        }
> +    }
> +
> +    if (!filenameIsFromCaller) {

we could flip this condition and reduce the indent.

@@ +762,5 @@
> +            // context.  (This means any error in a multi-line token, e.g.
> +            // an unterminated multiline string literal, will have no
> +            // context.)
> +            if (err->lineNumber != lineno)
> +                break;

this can be "return true", and removing the do-while(false)

::: js/src/frontend/TokenStream.h
@@ +272,5 @@
> +/**
> + * Metadata for a compilation error (or warning) at a particular offset, or at
> + * no offset (i.e. with respect to a script overall).
> + */
> +struct ErrorMetadata

can we just reuse the ErrorMetadata struct in ErrorReporting.h ?
(In reply to Tooru Fujisawa [:arai] (almost away until April) from comment #17)
> I'm not sure why we need 2 struct ErrorMetadata definitions.

We don't.  The patches in order add ErrorMetadata to TokenStream.h because that was the first/original place it was needed, then the ErrorReporting.h version appears when the EM in TS.h is *moved* over there.  The patches aren't listed in exactly their order of application, because of the one patch being replaced by an updated version.

> we could flip this condition and reduce the indent.
>
> this can be "return true", and removing the do-while(false)

Yeah, these seem reasonable.  And actually, the whole of that is probably better in a standalone function.  Will update the patch.
Flags: needinfo?(jwalden+bmo)
Comment on attachment 8851861 [details] [diff] [review]
Pass error metadata to the error-reporting function, rather than having that function do so itself

Review of attachment 8851861 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #18)
> (In reply to Tooru Fujisawa [:arai] (almost away until April) from comment
> #17)
> > I'm not sure why we need 2 struct ErrorMetadata definitions.
> 
> We don't.  The patches in order add ErrorMetadata to TokenStream.h because
> that was the first/original place it was needed, then the ErrorReporting.h
> version appears when the EM in TS.h is *moved* over there.  The patches
> aren't listed in exactly their order of application, because of the one
> patch being replaced by an updated version.

thanks :)
in that case, r+ with the `!filenameIsFromCaller` condition flipped.
Attachment #8851861 - Flags: review?(arai.unmht) → review+
Whiteboard: [qf]
I futzed on this some more, and it seems to me things get much more readable if the line-of-context computation is in its own function.  (Even if that function's only called from TS::computeErrorMetadata.)  So I made that change, and it seems sufficiently large to me (and/or easy to introduce typos with) that it should have one last look.
Attachment #8855043 - Flags: review?(arai.unmht)
Attachment #8851861 - Attachment is obsolete: true
Bah, I removed the werror bits and then forgot to add code to otherwise deal correctly.  Reverted that bit.

Then I figured I probably should just have two entirely separate functions for reporting errors and warnings, so I made that change.  It is maybe not ideal to have the same algorithm duplicated twice.  But the number of lines is small.  And the confusion you had about the meaning of compileWarningOrError's return value -- if you pass in JSREPORT_WARNING, then the return value indicates success/failure, if you pass in JSREPORT_ERROR the return value means nothing and the overall result should always be a failure -- suggests it was a bad signature (even if it was intended to be called *only* by those two functions, so could be a bit weird).  Best to just break the two behaviors apart.

Unfortunately, this takes us back to where we were before, with a requirement that werror-respecting warning behavior must go through TokenStream.  Not sure how best to remove that.  :-\

This bit of changing had ramifications down the rest of the patches, so you're going to get to review some stuff kind of a second time.
Attachment #8855066 - Flags: review?(arai.unmht)
Attachment #8851821 - Attachment is obsolete: true
Attachment #8851821 - Flags: review?(arai.unmht)
Attachment #8851822 - Attachment is obsolete: true
Attachment #8851823 - Attachment is obsolete: true
Attachment #8851823 - Flags: review?(arai.unmht)
This could be a bit controversial, because the existing code doesn't make 100% obviously clear that this change is always correct in every case.  But 1) it lets us make forward progress here, 2) figuring out the exact precise location (supposing this patch *doesn't* point out the right location in some cases, which is not at all clear, and current comments suggest this may well be the right place anyway) is Hard without much broader changes, and 3) the existing error behavior is sometimes buggy anyway.

Parser::functionArguments comments are what I've relied on to presume this is the correct location.  Even if some edge case was missed, this should be an improvement on existing behavior in many cases.  For example, current behavior is:

js> function eval() { "use strict"; }
typein:1:30 SyntaxError: 'eval' can't be defined or assigned to in strict mode code:
typein:1:30 function eval() { "use strict"; }
typein:1:30 ..............................^

pointing at clearly the wrong location, but with this patch behavior becomes

js> function eval() { 'use strict'; }
typein:1:9 SyntaxError: 'eval' can't be defined or assigned to in strict mode code:
typein:1:9 function eval() { 'use strict'; }
typein:1:9 .........^
Attachment #8855075 - Flags: review?(arai.unmht)
This is one of the last dependencies of ParseHandler on TokenStream, a dependency removed entirely later in these patches.
Attachment #8855077 - Flags: review?(arai.unmht)
Sure, we give up an "offset" for those two error cases, but given we're talking about huge scripts in this case, precisely identifying an exact error location seems pretty dubiously useful.  Overflow of the numbers here isn't determined by depth of nesting, but by depth of nesting *plus* the number of previous scopes encountered.  Depending on everything that came beforehand, means you might as well just complain location-independently that the script's too big.
Attachment #8855081 - Flags: review?(arai.unmht)
Just a minor simplification.  And because this in principle isn't frontend-related, it seems sensible not to put it in a header in frontend/ but in a more general spot.
Attachment #8855083 - Flags: review?(arai.unmht)
This is big and ugly.  Not sure there's any way around that.

Note that I punted on some "harder" questions: what to do about existing partial specializations of Parser<ParseHandler> functions, where we have

  template<> void
  Parser<FullParseHandler>::foo() { }

right now and really want to have

  template<typename CharT> void
  Parser<FullParseHandler, CharT>::foo() { }

but can't because you can't partially specialize a function.  (I punted on these by specializing only the char16_t version, which is the only thing that exists in this patch.)  I have one or two ideas for this, but those can come about as the single-byte frontend is actually made to work.  This just gets a big huge patch out of my rebase queue.
Attachment #8855110 - Flags: review?(arai.unmht)
Attachment #8855043 - Flags: review?(arai.unmht) → review+
Attachment #8855066 - Flags: review?(arai.unmht) → review+
Whiteboard: [qf] → [qf:p1]
Attachment #8855067 - Flags: review?(arai.unmht) → review+
Comment on attachment 8855069 [details] [diff] [review]
Move the core elements of error/warning reporting out of TokenStream into a new ErrorReporting.{cpp,h} API

Review of attachment 8855069 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/BytecodeEmitter.cpp
@@ +3580,5 @@
>  
>      TokenStream& ts = tokenStream();
>      ErrorMetadata metadata;
>      if (ts.computeErrorMetadata(&metadata, pos.begin))
> +        ReportCompileError(cx, Move(metadata), nullptr, JSREPORT_ERROR, errorNumber, args);

looks like JSREPORT_ERROR arguments can be somehow removed.
(not required to be addressed in this patch tho)
Attachment #8855069 - Flags: review?(arai.unmht) → review+
Attachment #8855072 - Flags: review?(arai.unmht) → review+
Attachment #8855075 - Flags: review?(arai.unmht) → review+
Attachment #8855077 - Flags: review?(arai.unmht) → review+
Attachment #8855079 - Flags: review?(arai.unmht) → review+
Attachment #8855080 - Flags: review?(arai.unmht) → review+
Attachment #8855081 - Flags: review?(arai.unmht) → review+
Attachment #8855082 - Flags: review?(arai.unmht) → review+
Comment on attachment 8855083 [details] [diff] [review]
Split js::frontend::Nestable into a separate header in ds/

Review of attachment 8855083 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ds/Nestable.h
@@ +3,5 @@
> + * This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#ifndef frontend_Nestable_h

would be nice to change the "frontend" to "ds".
Attachment #8855083 - Flags: review?(arai.unmht) → review+
Attachment #8855110 - Flags: review?(arai.unmht) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/36652248ec27
Use UniquePtr-centric functions for creating objects in HelperThreads code, rather than using js_new and passing that raw pointer to the UniquePtr constructor.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/e83b7192f153
Split out all source-text-character-type-agnostic functionality in TokenStream into a TokenStreamBase class.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/1cbd82417e3b
Simplify some of the computation of offset/filename in TokenStream::reportCompileErrorNumberVA.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/9176bba9a53c
Pass error metadata to the error-reporting function, rather than having that function do so itself.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/89e10a5b1870
Move TokenStream::reportAsmJSError into asmjs code and out of TokenStream.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ce0f383a773
Split reportCompileErrorNumberVA into separate functions for unconditionally reporting warnings and unconditionally reporting errors.  r=arai
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7cab3c255ea1
Move a few functions out of TokenStream, into TokenStreamBase, as they don't depend on the type of character TokenStream processes.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d388a56bec6
Move the core elements of error/warning reporting out of TokenStream into a new ErrorReporting.{cpp,h} API.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/feeea0c7a759
Move a few handler-agnostic functions out of Parser into ParserBase.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/c22b404132cd
Add a ParseHandler function for the body in functions with expression bodies, that determines its position from the expression node.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/398bfe65f709
Require various ParseHandler function that create a Node implicitly using the current position, to instead use a caller-provided position.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/f40b66381b6f
Remove the ParserBase constructor's unused Parser<SyntaxParseHandler>* argument.  r=arai
Blocks: 1345703
The previous approach doesn't quite work, because the offset might be outside |userbuf|'s actual covered range.  So, let's just give up on a "good" fix for this, and make the user provide a TokenStream& to consult if needed, to get the TokenStream dependency out of SyntaxParseHandler.
Attachment #8860220 - Flags: review?(arai.unmht)
Attachment #8855075 - Attachment is obsolete: true
Attachment #8860220 - Flags: review?(arai.unmht) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f57dbdd29d41
Refactor the code that identifies the offset of an impermissible function name (because of strict-mode or extra-warnings restrictions) so that SyntaxParseHandler doesn't depend on a cached TokenStream&.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/1f88113d1a9d
Remove ParseHandler::[gs]etPosition, remove SyntaxParseHandler::tokenStream now that it's unused, and make FullParseHandler::tokenStream a TokenStreamBase to not drag in a character-type dependency.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0930544a0236
Make ParseContext contain a TokenStreamBase&, not a TokenStream&, so that ParseContext need not be character-type-parametrized.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/6b3e668eefd5
Remove all use of TokenStream from the two ParseHandlers.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/5649e25c4c42
Split js::frontend::Nestable into a separate header in ds/ to simplify the frontend header it's in now.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/0d069bcc20b0
Add a |typename CharT| template parameter to Parser, SyntaxParseHandler, and FullParseHandler, *without* otherwise changing functionality.  r=arai
Is there a bug for getting this wired into gecko?
Flags: needinfo?(shu)
(In reply to Kannan Vijayan [:djvj] from comment #41)
> Is there a bug for getting this wired into gecko?

Not yet -- will file once the engine parts are done.
Flags: needinfo?(shu)
(In reply to Tooru Fujisawa [:arai] from comment #33)
> > +        ReportCompileError(cx, Move(metadata), nullptr, JSREPORT_ERROR, errorNumber, args);
> 
> looks like JSREPORT_ERROR arguments can be somehow removed.

Not necessarily.  TokenStream::compileWarning in the werror case *doesn't* slot in JSREPORT_ERROR for flags when calling this function, so there's at least one case where an error doesn't have JSREPORT_ERROR passed in.  (Perhaps it could?  But I don't know, and figuring out the answer is too time-intensive to do in this bug.)

As for the other TS::compileWarning callers, while they do all pass in JSREPORT_WARNING, some also pass in JSREPORT_STRICT as well, so that argument can't be removed, either.

There's almost certainly a way to clean this gunk up and out of the interface.  But it's not dead-obviously-simple, and I'm running low on time, so I'm punting this particular bit of improvement indefinitely.
Right now this is just a Variant of one type, but it'll be a Variant of multiple times pretty soon.  In the meantime, all this variant matching gunk is independently landable.
Attachment #8862152 - Flags: review?(nfroyd)
Comment on attachment 8862152 [details] [diff] [review]
Adapt BytecodeEmitter to contemplate working with Parsers working on both single- and double-byte source text

Review of attachment 8862152 [details] [diff] [review]:
-----------------------------------------------------------------

And a JS person as well, tho mostly this is just a massive exercise in perfect forwarding and Variant matching that's JS-unrelated.
Attachment #8862152 - Flags: review?(arai.unmht)
Details on when this wouldn't be a base class are forthcoming.
Attachment #8862157 - Flags: review?(arai.unmht)
The current function uses functions, internally, that can only exist in a character-specific TokenStream.  But the extra-warning/strict-error cases that do this are strictly dead code.  So split this into a warning function and an error function, neither taking an offset, to get rid of that code.
Attachment #8862167 - Flags: review?(arai.unmht)
Both JSContext and Parser want to access this enum.  Putting the enum in either place leads to header-dependency cycles and overall blowup.  So just move it to a new header, slightly-generalized to hold other language extension stuff we might come up with over time.  (I'm not entirely sure what such would be, but it seems like there's probably enough to fill in a header a little.)  Dumb patch, but a necessary one.
Attachment #8862169 - Flags: review?(arai.unmht)
Specifically, RegExpObject::create signatures at one time in my patchwork on this bug wanted to #include TokenStream.h.  But other headers wanted to #include RegExpObject.h, and it was impossible to resolve the cycle.  This *may* not be the case any more, at the end of my patches, not sure.  But there's another reason to do this: jscompartment.h and GlobalObject.h are #included everywhere, so if we can trim builtin/RegExp.h from their #includes, that should help overall compile time.  We have to add a bunch more bootlegged #includes in various places, but that's better than just Feeling Lucky.
Attachment #8862173 - Flags: review?(arai.unmht)
Comment on attachment 8862147 [details] [diff] [review]
Use unicode::{LINE,PARA}_SEPARATOR instead of hand-rolling them for TokenStream code

Review of attachment 8862147 [details] [diff] [review]:
-----------------------------------------------------------------

I'll review remaining patches this weekend.
Attachment #8862147 - Flags: review?(arai.unmht) → review+
The explanation for why all this complexity is in a mega-comment in TokenStream.h.  I think it *might* be possible to eliminate some (but certainly not all) of this if we:

1. Eliminate frontend::TokenStream.  This seems doable, but it'll require plumbing something very different into irregexp::ParsePattern{,Syntax} to provide the functionality.  (I *think* that's the only user of that type.)  Too dilatory from the current single-byte parsing task to do now, with time constraints.
2. Fold all TokenStream bits directly into Parser and ParserBase, with some of TokenStream's new inheritance changes made in this patch.  This also seems a large task for not much gain, racing a clock.

But these things are chunks of work for not enough gain right now.
Attachment #8862291 - Flags: review?(arai.unmht)
To be precise, computing a line of context requires accessing characters in a CharT*, so these functions have to be in Parser with its CharT parameter, not in ParserBase lacking one by design.
Attachment #8862292 - Flags: review?(arai.unmht)
A bit hot and messy, like the token stream decoupling patch, but not much to do about it.
Attachment #8862318 - Flags: review?(arai.unmht)
Hmm, I guess I should have put this in one of the earlier patches before uploading.  Oh well, simple enough anyway.

No particular rush on reviewing things, weekend's totally fine by me, BTW.
Attachment #8862332 - Flags: review?(arai.unmht)
Compiling the various patches here with GCC, I discovered that you can't specialize a function outside of the namespace that it's in.  Brief demo:

namespace js {
namespace frontend {
template<typename CharT>
class TokenStreamCharsBase
{
  public:
    void foo();
};
}
}

using js::frontend;

template<> void
TokenStreamCharsBase<char16_t>::foo()
{}

Turns out the above is buggy -- you have to put the foo() definition inside a js::frontend namespace block.  (Or, in C++11, you have to qualify TokenStreamCharsBase -- but GCC 4.9 at least doesn't support qualified specializations, so that's not a solution.)

We already have this issue for Parser, and the specializations of its member functions -- but it doesn't manifest because the majority of Parser.cpp is inside namespace blocks.  Seems like we should do the same thing for TokenStream.cpp, if it's going to be picking up a bunch of templates to deal with multiple character types.

This patch appears in applicable sequence immediately before the "Decouple TokenStream and TokenStreamBase from being related" patch that introduces templates and specializations to TokenStream.h.
Attachment #8862709 - Flags: review?(arai.unmht)
Comment on attachment 8862152 [details] [diff] [review]
Adapt BytecodeEmitter to contemplate working with Parsers working on both single- and double-byte source text

Review of attachment 8862152 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8862152 - Flags: review?(nfroyd) → review+
Turns out EitherParser is going to be needed in BytecodeCompiler soon, and for SyntaxParseHandler as well as FullParseHandler, so I moved it to its own header and added yet moar templates to it.

"Templates."
Attachment #8863053 - Flags: review?(arai.unmht)
Attachment #8862152 - Attachment is obsolete: true
Attachment #8862152 - Flags: review?(arai.unmht)
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/de236ed88685
Use unicode::{LINE,PARA}_SEPARATOR instead of hand-rolling them for TokenStream code.  r=arai
Attachment #8862148 - Flags: review?(arai.unmht) → review+
Attachment #8862157 - Flags: review?(arai.unmht) → review+
Comment on attachment 8862167 [details] [diff] [review]
Split Parser::reportNoOffset into Parser::{error,warning}NoOffset

Review of attachment 8862167 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/wasm/AsmJS.cpp
@@ +8605,5 @@
>      return cx->helperThread() || !cx->isExceptionPending();
>  }
>  
>  static bool
> +SuccessfulValidation(AsmJSParser& parser, UniqueChars str)

Nice :D
Attachment #8862167 - Flags: review?(arai.unmht) → review+
Attachment #8862169 - Flags: review?(arai.unmht) → review+
Attachment #8862173 - Flags: review?(arai.unmht) → review+
Comment on attachment 8862291 [details] [diff] [review]
Decouple TokenStream and TokenStreamBase from being related only through inheritance, but rather through a generalized static system

Review of attachment 8862291 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1162,5 @@
>          }
>  
>          size_t length = tokenbuf.length();
>  
> +        *destination = anyCharsAccess().cx->template make_pod_array<char16_t>(length + 1);

TIL "template" keyword in expression :)

::: js/src/frontend/TokenStream.h
@@ +882,5 @@
> +
> +  public:
> +    TokenStreamChars(JSContext* cx, const char16_t* chars, size_t length, size_t startOffset);
> +
> +    MOZ_ALWAYS_INLINE bool isMultiUnitCodepoint(char16_t c, uint32_t* codepoint) {

nice abstraction :D
Attachment #8862291 - Flags: review?(arai.unmht) → review+
Attachment #8862292 - Flags: review?(arai.unmht) → review+
Attachment #8862296 - Flags: review?(arai.unmht) → review+
Comment on attachment 8862318 [details] [diff] [review]
Split TokenStream in Parser across Parser and ParserBase

Review of attachment 8862318 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/Parser.cpp
@@ +10259,5 @@
>  bool
>  Parser<ParseHandler, CharT>::warnOnceAboutExprClosure()
>  {
>  #ifndef RELEASE_OR_BETA
> +    if (this->context->helperThread())

I'm curious why "this->" is explicitly used here.
(actually I prefer explicit "this->")

::: js/src/frontend/Parser.h
@@ +765,5 @@
>      JSContext* const context;
>  
>      LifoAlloc& alloc;
>  
> +    TokenStreamAnyChars anyChars;

IMO, outside of TokenStream itself, it's not clear "anyChars" means TokenStreamAnyChars instance.
can we use a bit more descriptive name?  (not necessary to be done in this patch tho)

::: js/src/vm/RegExpObject.cpp
@@ +243,2 @@
>  {
> +    // XXX UTF-8?

would be nice to file a bug and point it.

::: js/src/wasm/AsmJS.cpp
@@ +3223,5 @@
>          MOZ_CRASH("unexpected literal type");
>      }
>      MOZ_MUST_USE bool writeCall(ParseNode* pn, Op op) {
>          return encoder().writeOp(op) &&
> +               fg_.addCallSiteLineNum(m().tokenStream().anyCharsAccess().srcCoords.lineNum(pn->pn_pos.begin));

overflows from 99 chars
Attachment #8862318 - Flags: review?(arai.unmht) → review+
Attachment #8862332 - Flags: review?(arai.unmht) → review+
Attachment #8862709 - Flags: review?(arai.unmht) → review+
Comment on attachment 8863053 [details] [diff] [review]
Adapt BytecodeEmitter to contemplate working with Parsers working on both single- and double-byte source text

Review of attachment 8863053 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/EitherParser.h
@@ +3,5 @@
> + * This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +/* JS bytecode generation. */

please remove this comment.
(I guess it's copy-paste artifact or something)
Attachment #8863053 - Flags: review?(arai.unmht) → review+
The total patch-stack has some issues to resolve before I land stuff again, will deal with tomo^H^Hday.

(In reply to Tooru Fujisawa [:arai] from comment #66)
> > +    if (this->context->helperThread())
> 
> I'm curious why "this->" is explicitly used here.
> (actually I prefer explicit "this->")

At one time I thought it was necessary so that |context| would be treated as a a dependent name, not as a name in the enclosing namespace.  But that's not so now, if it ever was, so I undid this.  (I too would prefer explicit |this->|, but not nearly enough to fight it, and not when the opposing practice is so pervasive.)

> IMO, outside of TokenStream itself, it's not clear "anyChars" means
> TokenStreamAnyChars instance.
> can we use a bit more descriptive name?  (not necessary to be done in this
> patch tho)

I'll certainly accept suggestions -- I picked "something" and ran with it, mostly.  (But yes on subsequent-patching for sure.)

> > +    // XXX UTF-8?
> 
> would be nice to file a bug and point it.

I added a static_assert of char16_t-ness so this crashes and burns when I slot in the single-byte version.

(In reply to Tooru Fujisawa [:arai] from comment #67)
> please remove this comment.
> (I guess it's copy-paste artifact or something)

EitherParser.h is a |hg cp|, so a copy-artifact but not a paste-artifact.
Flags: needinfo?(jwalden+bmo)
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5acf7c7c8dc1
Use unicode::{LINE,PARA}_SEPARATOR instead of hand-rolling them for TokenStream code.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ea208159e7a
Move RegExpFlags, RegExpShared, and RegExpCompartment into vm/RegExpShared.h so that users requiring only those types don't have to import everything RegExpObject requires.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/09d95ca08acc
Adapt BytecodeEmitter to contemplate working with Parsers working on both single- and double-byte source text.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3e0fdc7f834
Rename TokenStreamBase to TokenStreamAnyChars, anticipating this class eventually not being a base class.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8e1028ccbea
Make the DeprecatedLanguageExtensions enum an enum class in a new header, to address a thorny cyclic dependency issue in subsequent patches.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/118f15cdd0fb
Make jscompartment.h and GlobalObject.h not #include builtin/RegExp.h, to address yet more thorny cyclic dependency issues in subsequent patches.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/c47075b2b6b9
Enclose most of TokenStream.cpp in js::frontend namespace blocks so that template specializations don't need their own custom namespace blocks.  r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/e204408d381b
Make Reflect.parse code work with a TokenStreamAnyChars, not a TokenStream.  r=arai
Depends on: 1363116
Depends on: 1366459
Assignee: jwalden+bmo → shu
I think shu is taking this over with waldo on PTO
Flags: needinfo?(shu)
Whiteboard: [qf:p1] → [qf:p2]
Please note that bug 1363116 changed EitherParser::tokenStream() to return a TokenStream& instead of a TokenStreamAnyChars&. See details there as to why.

One way to deal with that would be to make report*ErrorNumberVA virtual on TokenStreamAnyChars, and then BytecodeEmitter::report{ExtraWarning,StrictModeError} could use tokenStream() instead of parser.tokenStream() (and EitherParser::tokenStream() could be restored to return a TokenStreamAnyChars&).
Keywords: perf
Whiteboard: [qf:p2] → [qf:p1]
Whiteboard: [qf:p1] → [qf:i60][qf:p1]
Whiteboard: [qf:i60][qf:p1] → [qf:f60][qf:p1]
Depends on: 1424394
Depends on: 1424420
Depends on: 1424946
Depends on: 1424951
Depends on: 1426909
Depends on: 1428863
Depends on: 1434429
Whiteboard: [qf:f60][qf:p1] → [qf:f61][qf:p1]
Priority: -- → P1
Assignee: shu → jwalden+bmo
Flags: needinfo?(shu)
[Tracking Requested - why for this release]: Reason [qf:61]
Target Milestone: mozilla55 → ---
We're already in the soft freeze for 61 and this looks like a ton of patches that aren't ready for landing yet. Based on that, I'm calling this tracking- and wontfix for 61.
Depends on: 1459382
Depends on: 1459384
Whiteboard: [qf:f61][qf:p1] → [qf:f64][qf:p1]
Waldo, do you have a status update on progress for this?  Do we have a reasonable estimate on landing time, or are there still unknowns to work through?
See comment 77.
Flags: needinfo?(jwalden+bmo)
Whiteboard: [qf:f64][qf:p1] → [qf:p1:f64]
Depends on: 1467334
Bug 1467336 is mostly done, and with that, TokenStream goo mostly is reoriented to interpret by code units, not by code points with the occasional special-treatment of surrogates in specific rare places.  Next step is to fill in the specifically-UTF-8 specializations needed to tokenize UTF-8, which is gated on bug 1426909.  I expect to resolve the necessary things in that to move forward there this week.  Threading through those bits will involve some effort, but I think a relatively small amount.

After that I'll probably need to focus a little on getting UTF-8 source text through the JSAPI, so modifying or templating SourceBufferHolder and such.  I don't anticipate that taking particularly long, but I have not run through all the JSAPI steps from JSAPI entrypoints to bytecode emission.

Probably the next step is getting a command-line flag on the JS shell to interpret input as UTF-8 for basic testing, maybe even flip over treeherder tests to use that flag (and let browser-based testing cover the UTF-16 case).

Final step is making DOM script-processing use those new entrypoints and not inflate.  I don't know how much work is involved in that, because I haven't looked at any of the DOM code for this.  But it's probably mostly simple.

So the remaining unknowns, I think, are the DOM bits, some amount more on hooking up JSAPI entrypoints, and getting the UTF-8 bits into place based on what bug 1426909 puts in place.  Minus the DOM bits, my sense is things should be mostly in place by early next month.
Flags: needinfo?(jwalden+bmo)
Actually, there is one bit of uncertain complexity.

In UTF-16 and in our implementation of tokenizing and parsing, we more or less can just consider column numbers as an index by code unit.  There are only a few places where multi-unit code points can happen -- strings, identifiers, and comment contents are a possibly exhaustive list, off the top of my head -- and I'm pretty sure *if* multi-unit code points happen, we will get the column numbers wrong.

This problem is severely exacerbated with UTF-8, because while non-ASCII code points are generally uncommon, they are going to appear much more often than non-BMP code points do in UTF-16.

Our tokens all store offsets as code unit indexes, so those at least will be correct even in UTF-8.  But anything user-visible that purports to be a column number, may need an uncertain amount of frobbing of some sort to be correct.  Things like the error-reporting mechanism that grabs a "window" of contextual information, will need updates to not break on code point boundaries.

I am not sure how much work will be involved in making all of this work, nor precisely what all would need to be touched.
Depends on: 1471463
Depends on: 1471464
Depends on: 1471465
Depends on: 1472031
Depends on: 1472066
Depends on: 1476866
Depends on: 1478045
Depends on: 1478170
Depends on: 1478587
Depends on: 1478892
Depends on: 1485800
Depends on: 1485805
Whiteboard: [qf:p1:f64] → [qf:p1:f65]

The bug and dependency trees here are a bit of a mess. By this bug's summary (whether pre- or post-revision made here to eliminate the Latin-1 possibility), this bug is complete: TokenStream is perfectly capable of handling UTF-8 source text. Going further to expose and use this is not really this bug's ambit, notwithstanding that any number of bugs ostensibly blocking this actually do that task (and so really should themselves be blocked by this bug).

It seems like a waste of time to push bugs here and there and make an accurate dependency tree, so I'm just going to close this. But for the sake of people wondering what remains to be done here to actually be able to parse/compile UTF-8 source text:

  • Bug 1504947 fixes column numbers to be counts of code points, not units as now. For UTF-16 we could fake it and mostly be okay; for UTF-8, the error quickly becomes glaring. This is the only bug remaining to be fixed, IMO, before we can start doing direct compilation of UTF-8 somewhere in the browser. The fix for this is "one line" to flip an #ifdef -- but if you flip it now, we consistently leak windows in several devtools tests runs. Not good.

If you just want to play around in advance of that fix, and our using this stuff in any sort of "production", the JS shell will compile from UTF-8 if you pass in files with '-u' instead of with the traditional '-f', and things generally work pretty well. (At most we have a handful of jstests/jit-tests that will fall over for the preceding problem.)

Once that's fixed,

  • Bug 1506902 will make our self-hosted code be directly parsed from UTF-8.
  • The JS::Compile* functions that claim to take UTF-8 should be augmented by versions that explicitly don't inflate.
  • We need to add prefs to control whether to use the "normal" or "do not inflate" APIs.
  • We'll need to start hacking on the various ScriptLoader.cpp to convert to UTF-8 and not UTF-16. (We may need to specially handle the UTF-8-to-UTF-8 case to avoid copying, in the normal case of no UTF-8 encoding errors.)
  • We need to change places that call "normal" APIs, to call "do not inflate" APIs if so requested by pref. (And we may want some things to move in lockstep, e.g. possibly <script> elements and workers should be controlled by one pref.)
  • Ultimately, the "normal" APIs will all be unused, we can remove them, and the "do not inflate" APIs can be renamed to the "normal" names.

Also we probably ought clean up the JSAPI compilation APIs to all take JS::SourceText<Unit>& only -- no accepting raw pointers and lengths -- but that's driveby cosmetic API simplification and not functionality improvement.

Anyway: let's handle the rest of this in other bugs. TokenStream, Parser, and BytecodeCompiler all handle UTF-8 just fine if you let them, and the only lapse right now (columns) is largely distinct from that achievement.

Status: REOPENED → RESOLVED
Closed: 7 years ago5 years ago
Resolution: --- → FIXED
Summary: Make TokenStream parse both single-byte (whichever of Latin-1/UTF-8 is simpler, for a first pass) and two-byte source text → Make TokenStream parse both single-byte UTF- and two-byte source text
Performance Impact: --- → P1
Whiteboard: [qf:p1:f65]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: