Closed Bug 1493441 Opened Last year Closed Last year

Make ScriptSource able to hold UTF-8 in addition to UTF-16

Categories

(Core :: JavaScript Engine, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(5 files, 3 obsolete files)

63.44 KB, patch
Details | Diff | Splinter Review
2.20 KB, patch
tcampbell
: review+
Details | Diff | Splinter Review
2.26 KB, patch
tcampbell
: review+
Details | Diff | Splinter Review
2.59 KB, patch
tcampbell
: review+
Details | Diff | Splinter Review
67.28 KB, patch
tcampbell
: review+
Details | Diff | Splinter Review
If we're going to parse UTF-8 and save any memory, we can't have lazily compiled functions on UTF-8 source text without making the held/compressed source text be stored as UTF-8 (and not inflated to UTF-16).

In theory I finished up the patching to make this *possible* -- but not actually to create any such ScriptSources backed by UTF-8 text -- on the flights yesterday.  Still need to test it, for one.  And for another, it currently uses "char" instead of "Utf8Unit" for code units, and the intent of that type was to use it for things like this, so I want to try swapping that into place for the final patch.
Mostly the process of fixing this is just making ScriptSource::Compressed and ScriptSource::Uncompressed templates and then templating a bunch of code and then doing type-tests before instantiating PinnedChars with the appropriate type.  Lots of mess, but not actually many messy changes in it.  Still have various cleanup to do as mentioned in comment 0.
Assignee: nobody → jwalden+bmo
Status: NEW → ASSIGNED
In a tree a few days old, all jstests pass.  jit-test results are:

[6386|  22|   5|   0] 100% ==========================================>| 976.2s
FAILURES:
    auto-regress/bug1468629.js
    basic/bug1057571.js
    latin1/assorted.js
    parser/bug-1431353.js
    xdr/async.js
    xdr/bug1186973.js
    xdr/classes.js
    xdr/debug-lazy.js
    xdr/decode-off-thread.js
    xdr/incremental-encoder.js
    xdr/function-flags.js
    xdr/scope.js
    xdr/tagged-template-literals.js
    xdr/lazy.js
    xdr/relazify.js
    xdr/tagged-template-literals-2.js
    xdr/trivial.js
TIMEOUTS:
    basic/bug832203.js
    basic/inflate-oom.js
    gc/bug-1143706.js
    gc/bug-1462337.js
    gc/oomInGetJumpLabelForBranch.js

so there clearly remain *some* issues to this.
Turns out XDRState::codeChars for Latin1Char is only half-implemented, not for both encoding and decoding.

Still going to look/work more on this to see if I can get it to use Utf8Unit, tho, before posting for review.
Attachment #9011238 - Attachment is obsolete: true
Ideally we need do as few Utf8Unit<->char/unsigned char conversions in as few places as possible.  So this makes sense to do.
Attachment #9011974 - Flags: review?(tcampbell)
I busted up in my patching because this wasn't happening.  There is not a caller that would not need to align first, or that couldn't benefit from a harmless don't-need-to-align: it should be in codeChars.
Attachment #9011975 - Flags: review?(tcampbell)
Given that I added the Utf8Unit overload, it's possible my subsequent patches don't need this, come to think of it.  But it is rather Dumb that we had this and it only worked for one way, when it's easy to support both ways.
Attachment #9011976 - Flags: review?(tcampbell)
Mostly this is just a lot of adding templates to stuff.  Our source caching mechanisms are already keyed by a thing that can incorporate source text type.  The source cache mechanism under the hood supports both one-byte and two-byte data incoming (and under the hood just treats it all as bags'o'bytes).  All accesses are type-directed from a PinnedChars, and with a runtime tag-test on the relevant variant, it's easy to have one path for PinnedChars<Utf8Unit> and one path for PinnedChars<char16_t>.

This passes all jstests and jit-tests locally.  I'll kick off a try-push for whenever this is reviewed, which doesn't need to be very soon, because the rest of this week I'm out at Strange Loop.  Should be around next week, at least after EDT mornings due to SCOTUS oral arguments I'm attending.  Opening week OT2018, baby!  🤘⚖🤓
Attachment #9011978 - Flags: review?(tcampbell)
Comment on attachment 9011974 [details] [diff] [review]
Implement XDRState::codeChars(Utf8Unit*, size_t)

Review of attachment 9011974 [details] [diff] [review]:
-----------------------------------------------------------------

r=me if we are sure std::transform meets our minimum compiler targets.

::: js/src/vm/Xdr.cpp
@@ +79,5 @@
> +        if (!ptr) {
> +            return fail(JS::TranscodeResult_Throw);
> +        }
> +
> +        std::transform(units, units + count,

It looks like std::transform is marked C++17 and I thought we were C++14 in Gecko? Change my mind.

@@ +87,5 @@
> +        if (!ptr) {
> +            return fail(JS::TranscodeResult_Failure_BadDecode);
> +        }
> +
> +        std::transform(ptr, ptr + count,

This was a zero-copy for Latin1, but that seems a silly ideal for this XDR stuff anyways.
Attachment #9011974 - Flags: review?(tcampbell) → review+
Comment on attachment 9011975 [details] [diff] [review]
Make XDRState::codeChars(char16_t*, size_t) perform buffer-alignment before acting so that its callers don't have to do so manually

Review of attachment 9011975 [details] [diff] [review]:
-----------------------------------------------------------------

My understanding here is that codeChars(char16_t) is only called for the ENCODE case (and we use peek for the other case). Therefore the align needs to be outside codeChars? Feel free to ping me about this if I'm missing something.
Attachment #9011975 - Flags: review?(tcampbell) → review-
Comment on attachment 9011976 [details] [diff] [review]
Change XDRState::codeChars(const Latin1Char*, size_t) to XDRState::codeChars(Latin1Char*, size_t) for consistency with the other overloads of the same name for char16_t and Utf8Unit

Review of attachment 9011976 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Xdr.cpp
@@ +48,4 @@
>  
>  template<XDRMode mode>
>  XDRResult
> +XDRState<mode>::codeChars(Latin1Char* chars, size_t nchars)

Agreed. It was bizarre to have a |mode| template parameter and const buffer.
Attachment #9011976 - Flags: review?(tcampbell) → review+
(In reply to Ted Campbell [:tcampbell] (PTO until Oct 9) from comment #8)
> r=me if we are sure std::transform meets our minimum compiler targets.

I imagine you're reading https://en.cppreference.com/w/cpp/algorithm/transform which has somewhat confusing notation -- std::transform is new as constexpr since C++20 (I think), but it's been in C++ for a good long time.

Plus I found std::transform in the C++11 spec and was going off that.  I expect we can rely on it.  :-)  Although I should probably try for safety.
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/60633a71c04d
Implement XDRState::codeChars(Utf8Unit*, size_t).  r=tcampbell
https://hg.mozilla.org/integration/mozilla-inbound/rev/2f853725d168
Change XDRState::codeChars(const Latin1Char*, size_t) to XDRState::codeChars(Latin1Char*, size_t) for consistency with the other overloads of the same name for char16_t and Utf8Unit.  r=tcampbell
Keywords: leave-open
Mm, I think you're right about that.  Nonetheless, I think always aligning -- and making the exceptional cases be extra-careful about it -- is still good to do.  This version of the patch should do it correctly, I think.
Attachment #9014975 - Flags: review?(tcampbell)
Attachment #9011975 - Attachment is obsolete: true
Updated for after bug 1459067's changes that made a mess of the code in these areas.  :-)
Attachment #9014984 - Flags: review?(tcampbell)
Attachment #9011978 - Attachment is obsolete: true
Attachment #9011978 - Flags: review?(tcampbell)
Priority: -- → P2
Blocks: 1498320
Attachment #9014975 - Flags: review?(tcampbell) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/acd510f0152a
Make XDRState::codeChars(char16_t*, size_t) perform buffer-alignment before acting so that its callers don't have to do so manually.  r=tcampbell
Comment on attachment 9014984 [details] [diff] [review]
Allow ScriptSource to store UTF-8 script data in addition to UTF-16 script data (but don't create any UTF-8-backed ScriptSources yet)

Review of attachment 9014984 [details] [diff] [review]:
-----------------------------------------------------------------

Once I understood the patterns for mozilla::Variant, this makes far more sense to me. Seems to be some good cleanups in the XDR code too which is nice.

Rubberstamp for the RR-related stuff, but it seems reasonable and we checked with :bhackett.

::: js/src/vm/JSFunction.cpp
@@ +1816,5 @@
> +
> +                // XXX There are no UTF-8 ScriptSources now, so just crash so this
> +                //     gets filled in later.
> +                MOZ_CRASH("UTF-8 lazy function compilation not implemented yet");
> +            } else {

MOZ_ASSERT(lazy->scriptSource()->hasSourceType<uint16_t>()); or similar.

::: js/src/vm/JSScript.cpp
@@ +1834,4 @@
>          cursor += numChars;
>      }
>  
> +    // XXX Can we remove the null-termination?  It's unclear if anyone uses

I've opened Bug 1499192 to remove the null-termination.

::: js/src/vm/JSScript.h
@@ +460,5 @@
> +
> +    static UniqueChars toCacheable(EntryChars<mozilla::Utf8Unit> str) {
> +        // The cache only stores strings of |char| or |char16_t|, and right now
> +        // it seems best not to gunk up the cache with |Utf8Unit| too.  So
> +        // cache |Utf8Unit| strings by interpreting them as |char| strings.

*Please* make sure the SharedImmutableStringCache gets some comments indicating that we may have UTF-8 strings in it. I almost wonder if we should have a follow-up bug to de-type the ImmutableStringCache and have it just store black-box raw bytes.

::: toolkit/recordreplay/ipc/JSControl.cpp
@@ +691,5 @@
> +      if (!aContentType) {
> +        return false;
> +      }
> +
> +      aContent.set(info.mContent8.length() > 0

Can you assert that only one length>0 at a time?
Attachment #9014984 - Flags: review?(tcampbell) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2ec2641a20ae
Allow ScriptSource to store UTF-8 script data in addition to UTF-16 script data (but don't create any UTF-8-backed ScriptSources yet).  r=tcampbell
And with that, this is done.  More work in other bugs is necessary to *create* and *consume* UTF-8 ScriptSources, but this should finish up ScriptSource changes (barring the unexpected tweak or two, as always).
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/2ec2641a20ae
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
You need to log in before you can comment on or make changes to this bug.