Implement String#codePointAt and String.fromCodePoint

RESOLVED FIXED in mozilla29

Status

()

Core
JavaScript Engine
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: evilpie, Assigned: evilpie)

Tracking

(Blocks: 1 bug, {dev-doc-complete, feature})

unspecified
mozilla29
dev-doc-complete, feature
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Assignee)

Description

4 years ago
Created attachment 807822 [details] [diff] [review]
Implementation of fromCodePoint and codePointAt

Currently working on some tests for this. There is at least one unresolved issue, about whether codePointAt outside the range should return NaN or undefined.

Comment 1

4 years ago
I guess this is a duplicate of bug 888093 (or vice versa).

Tests I wrote today:
* https://github.com/mathiasbynens/String.fromCodePoint/blob/master/tests/tests.js
* https://github.com/mathiasbynens/String.prototype.codePointAt/blob/master/tests/tests.js

Feel free to re-use these! Oh, and let me know if I missed anything important.

Looking at the example code in Norbert’s original proposal (http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#String), I think the intention is to return `undefined`. Keep in mind that my tests are based on that assumption. See https://bugs.ecmascript.org/show_bug.cgi?id=1153.
(Assignee)

Updated

4 years ago
Duplicate of this bug: 888093
Keywords: dev-doc-needed

Comment 3

4 years ago
Comment on attachment 807822 [details] [diff] [review]
Implementation of fromCodePoint and codePointAt

Review of attachment 807822 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/String.js
@@ +128,5 @@
> +            ThrowError(JSMSG_NOT_A_CODEPOINT, ToString(nextCP));
> +
> +        // Step 5f.
> +        // Inlined UTF-16 Encoding
> +        if (nextCP < 65535) {

Shouldn’t this be 65536 (0x010000, i.e. 0xFFFF + 1)?
(Assignee)

Comment 4

4 years ago
You are right, I missed the <=.
(Assignee)

Updated

4 years ago
Assignee: general → evilpies
(Assignee)

Comment 5

4 years ago
Mathias the tests you landed for v8 looked quite good, I think we could use them.
(Assignee)

Comment 6

4 years ago
Created attachment 812453 [details] [diff] [review]
v1
Attachment #807822 - Attachment is obsolete: true
Attachment #812453 - Flags: review?(jwalden+bmo)

Comment 7

4 years ago
(In reply to Tom Schuster [:evilpie] from comment #5)
> Mathias the tests you landed for v8 looked quite good, I think we could use
> them.

Happy to hear. This totally made my day :) Yay!

By the way, I like that in the new patch you’re using HexIntegerLiterals for code point values — much more readable that way IMHO.
Comment on attachment 812453 [details] [diff] [review]
v1

Review of attachment 812453 [details] [diff] [review]:
-----------------------------------------------------------------

Stealing review.

This looks good overall, but I have a few requests I'd like to see addressed.

As mentioned in-line below, I'm not too happy with codePointAt calling charCodeAt and thus duplicating all the conversion and checking overhead. You know better than I do how much work would be involved in exposing a performant String_getChar intrinsic to be used instead. If it's not too much work, I'd like for that to be done here. If it is, please file a follow-up bug about it.

The test files should get license header. Since the tests use MIT in Mathias' repo, that's what we'll have to use, too.

::: js/src/builtin/String.js
@@ +20,5 @@
> +    if (position < 0 || position >= size)
> +        return undefined;
> +
> +    // Steps 8-9.
> +    var first = callFunction(std_String_charCodeAt, S, position);

It's really unfortunate that we effectively repeat steps 1-7 by calling charCodeAt. How about introducing a String_getChar intrinsic? That should be easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion. OTOOH, having an inlinable String_getChar would probably allow us to self-host more String methods.

@@ +21,5 @@
> +        return undefined;
> +
> +    // Steps 8-9.
> +    var first = callFunction(std_String_charCodeAt, S, position);
> +    if (first < 0xD800 || first > 0xDBFF || position + 1 == size)

Nit: ===

@@ +30,5 @@
> +    if (second < 0xDC00 || second > 0xDFFF)
> +        return first;
> +
> +    // Step 12.
> +    return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;

Nit: over-bracing for the multiplication

@@ +107,5 @@
> +    // Step 2.
> +    var length = arguments.length;
> +
> +    // Step 3.
> +    var elements = new List();

While List is the right spec type, we can use NewDenseArray(length) for more efficiency, here.

@@ +127,5 @@
> +
> +        // Step 5f.
> +        // Inlined UTF-16 Encoding
> +        if (nextCP <= 0xFFFF) {
> +            elements.push(nextCP);

You can use UnsafePutElements(elements, nextIndex, nextCP) here after making the change to NewDenseArray above.

@@ +131,5 @@
> +            elements.push(nextCP);
> +            continue;
> +        }
> +
> +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);

coercing to int with |0 should be faster than Math.floor. Or at least tax the jit less heavily, I guess.

@@ +132,5 @@
> +            continue;
> +        }
> +
> +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);

UnsafePutElements can be used with more than one index/value pair, so these two calls can be combined.
Attachment #812453 - Flags: review?(jwalden+bmo) → feedback+
(Assignee)

Comment 9

4 years ago
(In reply to Till Schneidereit [:till] from comment #8)
> Comment on attachment 812453 [details] [diff] [review]
> v1
> 
> Review of attachment 812453 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Stealing review.
> 
> This looks good overall, but I have a few requests I'd like to see addressed.
> 
> As mentioned in-line below, I'm not too happy with codePointAt calling
> charCodeAt and thus duplicating all the conversion and checking overhead.
> You know better than I do how much work would be involved in exposing a
> performant String_getChar intrinsic to be used instead. If it's not too much
> work, I'd like for that to be done here. If it is, please file a follow-up
> bug about it.
> 
> The test files should get license header. Since the tests use MIT in
> Mathias' repo, that's what we'll have to use, too.
> 
> ::: js/src/builtin/String.js
> @@ +20,5 @@
> > +    if (position < 0 || position >= size)
> > +        return undefined;
> > +
> > +    // Steps 8-9.
> > +    var first = callFunction(std_String_charCodeAt, S, position);
> 
> It's really unfortunate that we effectively repeat steps 1-7 by calling
> charCodeAt. How about introducing a String_getChar intrinsic? That should be
> easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> OTOOH, having an inlinable String_getChar would probably allow us to
> self-host more String methods.
> 
We already do a pretty good job at inlining charCodeAt. But I guess with an unsafe macro we could avoid some length check, unless that one is already folded with our length check :)
> @@ +21,5 @@
> > +        return undefined;
> > +
> > +    // Steps 8-9.
> > +    var first = callFunction(std_String_charCodeAt, S, position);
> > +    if (first < 0xD800 || first > 0xDBFF || position + 1 == size)
> 
> Nit: ===
> 
> @@ +30,5 @@
> > +    if (second < 0xDC00 || second > 0xDFFF)
> > +        return first;
> > +
> > +    // Step 12.
> > +    return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;
> 
> Nit: over-bracing for the multiplication
> 
> @@ +107,5 @@
> > +    // Step 2.
> > +    var length = arguments.length;
> > +
> > +    // Step 3.
> > +    var elements = new List();
> 
> While List is the right spec type, we can use NewDenseArray(length) for more
> efficiency, here.
> 
> @@ +127,5 @@
> > +
> > +        // Step 5f.
> > +        // Inlined UTF-16 Encoding
> > +        if (nextCP <= 0xFFFF) {
> > +            elements.push(nextCP);
> 
> You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> the change to NewDenseArray above.
> 
Is that safe when nextIndex == elements.length?
> @@ +131,5 @@
> > +            elements.push(nextCP);
> > +            continue;
> > +        }
> > +
> > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> 
> coercing to int with |0 should be faster than Math.floor. Or at least tax
> the jit less heavily, I guess.
> 
v8 uses the FLOOR macro here, but I had no luck finding out how that is implemented.
> @@ +132,5 @@
> > +            continue;
> > +        }
> > +
> > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
> 
> UnsafePutElements can be used with more than one index/value pair, so these
> two calls can be combined.

Some general observation in terms of speed: We might want to use something that allows us to construct the resulting string without having to call fromCharCode. Maybe something like an Uint16Array where we can use that as the backing storage for the new string.
(In reply to Tom Schuster [:evilpie] from comment #9)
> > It's really unfortunate that we effectively repeat steps 1-7 by calling
> > charCodeAt. How about introducing a String_getChar intrinsic? That should be
> > easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> > OTOOH, having an inlinable String_getChar would probably allow us to
> > self-host more String methods.
> > 
> We already do a pretty good job at inlining charCodeAt. But I guess with an
> unsafe macro we could avoid some length check, unless that one is already
> folded with our length check :)

Yeah, you know more about this than I do, so if you think this doesn't cost much, that works for me. :)

> > 
> > You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> > the change to NewDenseArray above.
> > 
> Is that safe when nextIndex == elements.length?

It's not. But IIUC, you know the length beforehand, no? You'd call NewDenseArray with the length and never run into this situation. That's how, e.g., Array#map works.

> > @@ +131,5 @@
> > > +            elements.push(nextCP);
> > > +            continue;
> > > +        }
> > > +
> > > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > 
> > coercing to int with |0 should be faster than Math.floor. Or at least tax
> > the jit less heavily, I guess.
> > 
> v8 uses the FLOOR macro here, but I had no luck finding out how that is
> implemented.

`|0` definitely works. Technically, you could also replace the division with `>> 10`, but that might be too hard to understand later. If you do that, add a comment explaining what's going on.

> > @@ +132,5 @@
> > > +            continue;
> > > +        }
> > > +
> > > +        elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > > +        elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
> > 
> > UnsafePutElements can be used with more than one index/value pair, so these
> > two calls can be combined.
> 
> Some general observation in terms of speed: We might want to use something
> that allows us to construct the resulting string without having to call
> fromCharCode. Maybe something like an Uint16Array where we can use that as
> the backing storage for the new string.

That would be very nice to have, indeed. You know more about how our strings work: do you think it'd make sense to create an intrinsic along the lines of StringFromTypedArray?
Typed array element storage isn't memory-layout-interconvertible with string memory layout.  Intrinsics to create a string of a given length, with chars fillable by another intrinsic, is totally crazy and unreasonable, but not unreasonable.  *handwaving*  All but certainly should be followup-land, of course.

Comment 12

4 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=860574
(Assignee)

Comment 13

4 years ago
Till would you mind if we did this, with the List for the moment? We can optimize this somehow in the future.
(In reply to Tom Schuster [:evilpie] from comment #13)
> Till would you mind if we did this, with the List for the moment? We can
> optimize this somehow in the future.

Not at all. In fact, we should have done that much earlier - sorry for holding that up.

If you r? me on a patch with the other feedback applied, I'll be quick with the review.
(Assignee)

Comment 15

4 years ago
Created attachment 8351211 [details] [diff] [review]
v2
Attachment #812453 - Attachment is obsolete: true
Attachment #8351211 - Flags: review?(till)
Comment on attachment 8351211 [details] [diff] [review]
v2

Review of attachment 8351211 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #8351211 - Flags: review?(till) → review+
(Assignee)

Comment 17

4 years ago
Thanks!
http://hg.mozilla.org/integration/mozilla-inbound/rev/2411714cd058
https://hg.mozilla.org/mozilla-central/rev/2411714cd058
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Release notes: https://developer.mozilla.org/en-US/Firefox/Releases/29#JavaScript

New pages:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint

I moved the polyfills Mathias added to the fromCharCode and charCodeAt wiki pages to these two new pages now.

Lists updated on:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Methods
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/prototype#Methods

As always, reviewing these docs or adding more examples/descriptions is more than welcome on the MDN wiki. Thanks!
Keywords: dev-doc-needed → dev-doc-complete
Keywords: feature

Comment 20

3 years ago
Is there any need for manual QA here (given the existing automation coverage)?
Flags: needinfo?(evilpies)
Flags: in-testsuite+
(Assignee)

Comment 21

3 years ago
This feature has enough tests that manual QA seems unnecessary.
Flags: needinfo?(evilpies)
You need to log in before you can comment on or make changes to this bug.