Closed
Bug 918879
Opened 11 years ago
Closed 11 years ago
Implement String#codePointAt and String.fromCodePoint
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla29
People
(Reporter: evilpies, Assigned: evilpies)
References
(Blocks 1 open bug)
Details
(Keywords: dev-doc-complete, feature)
Attachments
(1 file, 2 obsolete files)
14.99 KB,
patch
|
till
:
review+
|
Details | Diff | Splinter Review |
Currently working on some tests for this. There is at least one unresolved issue, about whether codePointAt outside the range should return NaN or undefined.
Comment 1•11 years ago
|
||
I guess this is a duplicate of bug 888093 (or vice versa).
Tests I wrote today:
* https://github.com/mathiasbynens/String.fromCodePoint/blob/master/tests/tests.js
* https://github.com/mathiasbynens/String.prototype.codePointAt/blob/master/tests/tests.js
Feel free to re-use these! Oh, and let me know if I missed anything important.
Looking at the example code in Norbert’s original proposal (http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#String), I think the intention is to return `undefined`. Keep in mind that my tests are based on that assumption. See https://bugs.ecmascript.org/show_bug.cgi?id=1153.
Updated•11 years ago
|
Keywords: dev-doc-needed
Comment 3•11 years ago
|
||
Comment on attachment 807822 [details] [diff] [review]
Implementation of fromCodePoint and codePointAt
Review of attachment 807822 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/builtin/String.js
@@ +128,5 @@
> + ThrowError(JSMSG_NOT_A_CODEPOINT, ToString(nextCP));
> +
> + // Step 5f.
> + // Inlined UTF-16 Encoding
> + if (nextCP < 65535) {
Shouldn’t this be 65536 (0x010000, i.e. 0xFFFF + 1)?
Mathias the tests you landed for v8 looked quite good, I think we could use them.
Attachment #807822 -
Attachment is obsolete: true
Attachment #812453 -
Flags: review?(jwalden+bmo)
Comment 7•11 years ago
|
||
(In reply to Tom Schuster [:evilpie] from comment #5)
> Mathias the tests you landed for v8 looked quite good, I think we could use
> them.
Happy to hear. This totally made my day :) Yay!
By the way, I like that in the new patch you’re using HexIntegerLiterals for code point values — much more readable that way IMHO.
Comment 8•11 years ago
|
||
Comment on attachment 812453 [details] [diff] [review]
v1
Review of attachment 812453 [details] [diff] [review]:
-----------------------------------------------------------------
Stealing review.
This looks good overall, but I have a few requests I'd like to see addressed.
As mentioned in-line below, I'm not too happy with codePointAt calling charCodeAt and thus duplicating all the conversion and checking overhead. You know better than I do how much work would be involved in exposing a performant String_getChar intrinsic to be used instead. If it's not too much work, I'd like for that to be done here. If it is, please file a follow-up bug about it.
The test files should get license header. Since the tests use MIT in Mathias' repo, that's what we'll have to use, too.
::: js/src/builtin/String.js
@@ +20,5 @@
> + if (position < 0 || position >= size)
> + return undefined;
> +
> + // Steps 8-9.
> + var first = callFunction(std_String_charCodeAt, S, position);
It's really unfortunate that we effectively repeat steps 1-7 by calling charCodeAt. How about introducing a String_getChar intrinsic? That should be easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion. OTOOH, having an inlinable String_getChar would probably allow us to self-host more String methods.
@@ +21,5 @@
> + return undefined;
> +
> + // Steps 8-9.
> + var first = callFunction(std_String_charCodeAt, S, position);
> + if (first < 0xD800 || first > 0xDBFF || position + 1 == size)
Nit: ===
@@ +30,5 @@
> + if (second < 0xDC00 || second > 0xDFFF)
> + return first;
> +
> + // Step 12.
> + return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;
Nit: over-bracing for the multiplication
@@ +107,5 @@
> + // Step 2.
> + var length = arguments.length;
> +
> + // Step 3.
> + var elements = new List();
While List is the right spec type, we can use NewDenseArray(length) for more efficiency, here.
@@ +127,5 @@
> +
> + // Step 5f.
> + // Inlined UTF-16 Encoding
> + if (nextCP <= 0xFFFF) {
> + elements.push(nextCP);
You can use UnsafePutElements(elements, nextIndex, nextCP) here after making the change to NewDenseArray above.
@@ +131,5 @@
> + elements.push(nextCP);
> + continue;
> + }
> +
> + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
coercing to int with |0 should be faster than Math.floor. Or at least tax the jit less heavily, I guess.
@@ +132,5 @@
> + continue;
> + }
> +
> + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> + elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
UnsafePutElements can be used with more than one index/value pair, so these two calls can be combined.
Attachment #812453 -
Flags: review?(jwalden+bmo) → feedback+
(In reply to Till Schneidereit [:till] from comment #8)
> Comment on attachment 812453 [details] [diff] [review]
> v1
>
> Review of attachment 812453 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Stealing review.
>
> This looks good overall, but I have a few requests I'd like to see addressed.
>
> As mentioned in-line below, I'm not too happy with codePointAt calling
> charCodeAt and thus duplicating all the conversion and checking overhead.
> You know better than I do how much work would be involved in exposing a
> performant String_getChar intrinsic to be used instead. If it's not too much
> work, I'd like for that to be done here. If it is, please file a follow-up
> bug about it.
>
> The test files should get license header. Since the tests use MIT in
> Mathias' repo, that's what we'll have to use, too.
>
> ::: js/src/builtin/String.js
> @@ +20,5 @@
> > + if (position < 0 || position >= size)
> > + return undefined;
> > +
> > + // Steps 8-9.
> > + var first = callFunction(std_String_charCodeAt, S, position);
>
> It's really unfortunate that we effectively repeat steps 1-7 by calling
> charCodeAt. How about introducing a String_getChar intrinsic? That should be
> easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> OTOOH, having an inlinable String_getChar would probably allow us to
> self-host more String methods.
>
We already do a pretty good job at inlining charCodeAt. But I guess with an unsafe macro we could avoid some length check, unless that one is already folded with our length check :)
> @@ +21,5 @@
> > + return undefined;
> > +
> > + // Steps 8-9.
> > + var first = callFunction(std_String_charCodeAt, S, position);
> > + if (first < 0xD800 || first > 0xDBFF || position + 1 == size)
>
> Nit: ===
>
> @@ +30,5 @@
> > + if (second < 0xDC00 || second > 0xDFFF)
> > + return first;
> > +
> > + // Step 12.
> > + return ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;
>
> Nit: over-bracing for the multiplication
>
> @@ +107,5 @@
> > + // Step 2.
> > + var length = arguments.length;
> > +
> > + // Step 3.
> > + var elements = new List();
>
> While List is the right spec type, we can use NewDenseArray(length) for more
> efficiency, here.
>
> @@ +127,5 @@
> > +
> > + // Step 5f.
> > + // Inlined UTF-16 Encoding
> > + if (nextCP <= 0xFFFF) {
> > + elements.push(nextCP);
>
> You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> the change to NewDenseArray above.
>
Is that safe when nextIndex == elements.length?
> @@ +131,5 @@
> > + elements.push(nextCP);
> > + continue;
> > + }
> > +
> > + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
>
> coercing to int with |0 should be faster than Math.floor. Or at least tax
> the jit less heavily, I guess.
>
v8 uses the FLOOR macro here, but I had no luck finding out how that is implemented.
> @@ +132,5 @@
> > + continue;
> > + }
> > +
> > + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > + elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
>
> UnsafePutElements can be used with more than one index/value pair, so these
> two calls can be combined.
Some general observation in terms of speed: We might want to use something that allows us to construct the resulting string without having to call fromCharCode. Maybe something like an Uint16Array where we can use that as the backing storage for the new string.
Comment 10•11 years ago
|
||
(In reply to Tom Schuster [:evilpie] from comment #9)
> > It's really unfortunate that we effectively repeat steps 1-7 by calling
> > charCodeAt. How about introducing a String_getChar intrinsic? That should be
> > easy enough to do. OTOH, I don't know whether we inline charCodeAt in Ion.
> > OTOOH, having an inlinable String_getChar would probably allow us to
> > self-host more String methods.
> >
> We already do a pretty good job at inlining charCodeAt. But I guess with an
> unsafe macro we could avoid some length check, unless that one is already
> folded with our length check :)
Yeah, you know more about this than I do, so if you think this doesn't cost much, that works for me. :)
> >
> > You can use UnsafePutElements(elements, nextIndex, nextCP) here after making
> > the change to NewDenseArray above.
> >
> Is that safe when nextIndex == elements.length?
It's not. But IIUC, you know the length beforehand, no? You'd call NewDenseArray with the length and never run into this situation. That's how, e.g., Array#map works.
> > @@ +131,5 @@
> > > + elements.push(nextCP);
> > > + continue;
> > > + }
> > > +
> > > + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> >
> > coercing to int with |0 should be faster than Math.floor. Or at least tax
> > the jit less heavily, I guess.
> >
> v8 uses the FLOOR macro here, but I had no luck finding out how that is
> implemented.
`|0` definitely works. Technically, you could also replace the division with `>> 10`, but that might be too hard to understand later. If you do that, add a comment explaining what's going on.
> > @@ +132,5 @@
> > > + continue;
> > > + }
> > > +
> > > + elements.push(std_Math_floor((nextCP - 0x10000) / 0x400) + 0xD800);
> > > + elements.push(((nextCP - 0x10000) % 0x400) + 0xDC00);
> >
> > UnsafePutElements can be used with more than one index/value pair, so these
> > two calls can be combined.
>
> Some general observation in terms of speed: We might want to use something
> that allows us to construct the resulting string without having to call
> fromCharCode. Maybe something like an Uint16Array where we can use that as
> the backing storage for the new string.
That would be very nice to have, indeed. You know more about how our strings work: do you think it'd make sense to create an intrinsic along the lines of StringFromTypedArray?
Comment 11•11 years ago
|
||
Typed array element storage isn't memory-layout-interconvertible with string memory layout. Intrinsics to create a string of a given length, with chars fillable by another intrinsic, is totally crazy and unreasonable, but not unreasonable. *handwaving* All but certainly should be followup-land, of course.
Comment 12•11 years ago
|
||
Assignee | ||
Comment 13•11 years ago
|
||
Till would you mind if we did this, with the List for the moment? We can optimize this somehow in the future.
Comment 14•11 years ago
|
||
(In reply to Tom Schuster [:evilpie] from comment #13)
> Till would you mind if we did this, with the List for the moment? We can
> optimize this somehow in the future.
Not at all. In fact, we should have done that much earlier - sorry for holding that up.
If you r? me on a patch with the other feedback applied, I'll be quick with the review.
Assignee | ||
Comment 15•11 years ago
|
||
Attachment #812453 -
Attachment is obsolete: true
Attachment #8351211 -
Flags: review?(till)
Comment 16•11 years ago
|
||
Comment on attachment 8351211 [details] [diff] [review]
v2
Review of attachment 8351211 [details] [diff] [review]:
-----------------------------------------------------------------
Nice!
Attachment #8351211 -
Flags: review?(till) → review+
Assignee | ||
Comment 17•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Comment 19•11 years ago
|
||
MDN |
Release notes: https://developer.mozilla.org/en-US/Firefox/Releases/29#JavaScript
New pages:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint
I moved the polyfills Mathias added to the fromCharCode and charCodeAt wiki pages to these two new pages now.
Lists updated on:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Methods
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/prototype#Methods
As always, reviewing these docs or adding more examples/descriptions is more than welcome on the MDN wiki. Thanks!
Keywords: dev-doc-needed → dev-doc-complete
Comment 20•11 years ago
|
||
Is there any need for manual QA here (given the existing automation coverage)?
Flags: needinfo?(evilpies)
Flags: in-testsuite+
Assignee | ||
Comment 21•11 years ago
|
||
This feature has enough tests that manual QA seems unnecessary.
Flags: needinfo?(evilpies)
You need to log in
before you can comment on or make changes to this bug.
Description
•