Add API to create JSString from UTF-8 string efficiently.

RESOLVED FIXED in Firefox 51

Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: arai, Assigned: arai)

Tracking

Trunk
mozilla51
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 affected, firefox51 fixed)

Details

Attachments

(3 attachments, 6 obsolete attachments)

Assignee

Description

3 years ago
We're going to change the error reporting API and internal to use UTF-8 in bug 1283710.
So we'd need API to create JSString from UTF-8 string (const char*).

JS_NewStringCopyFromUTF8, and maybe JS_NewStringCopyFromUTF8Lossy or something if needed.

We already have an API to convert JSString to UTF-8 (JS_EncodeStringToUTF8), so only API for UTF-8 to JSString is needed.

I have a draft patch, that was originally for bug 1283710.
will post it shortly.
Assignee

Comment 1

3 years ago
looks like similar API (JS_NewStringCopyUTF8Z and JS_NewStringCopyUTF8N) are going to be added in bug 987069.

hopefully I can improve the efficiency :)
See Also: → 987069
Assignee

Updated

3 years ago
Blocks: 1295017
No longer blocks: 1283710
Assignee

Updated

3 years ago
Depends on: 1294940
Assignee

Comment 2

3 years ago
Added UTF8CharsToNewLatin1CharsZ and LossyUTF8CharsToNewLatin1CharsZ,
that converts UTF-8 encoded string to Latin1.

Those function suppose the input UTF-8 encoded string *can* be encoded in Latin1, and doesn't handle overflow.
That is because those are supposed to be used after PossibleMinumumEncodingForUTF8 function, that is added in Part 2, that returns possible minimum encoding for the given UTF-8 string.

To make those APIs, added CharT typedef for {,Const}{Latin1,UTF8,TwoBytes}Chars{,Z} classes, to use it from InflateUTF8StringHelper, that is now templatized for return type, that could now be TwoByteCharsZ or Latin1CharsZ.
Assignee

Comment 3

3 years ago
As mentioned above, PossibleMinumumEncodingForUTF8 function returns possible minimum encoding for the given UTF-8 string, from one of ASCII, Latin1, UTF16.
  ASCII  for char <= 0x7f
  Latin1 for char <= 0xff
  UTF16  for other case
Assignee

Comment 4

3 years ago
JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z functions use those above functions.

Those funtions first check the possible minimum encoding,
and then use NewStringCopyN for ASCII, UTF8CharsToNewLatin1CharsZ+NewString for Latin1, UTF8CharsToNewTwoByteCharsZ+NewString for UTF-16.

This is done to reduce the number of copy or allocation.
for ASCII case, we can just copy the string.
for other cases, we need to convert UTF-8 encoded string to Latin1 or UTF-16 first, and then allocate JSString.
by adding and using PossibleMinumumEncodingForUTF8 and UTF8CharsToNewLatin1CharsZ, we don't need temporary space for any case.
Assignee

Comment 5

3 years ago
I'll ask review after bug 1294940.
Assignee

Comment 6

3 years ago
rebased
Attachment #8781384 - Attachment is obsolete: true
Attachment #8786525 - Flags: review?(jwalden+bmo)
Assignee

Comment 7

3 years ago
Attachment #8781385 - Attachment is obsolete: true
Attachment #8786526 - Flags: review?(jwalden+bmo)
Assignee

Comment 8

3 years ago
Attachment #8781387 - Attachment is obsolete: true
Attachment #8786527 - Flags: review?(jwalden+bmo)
Comment on attachment 8786526 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.

Review of attachment 8786526 [details] [diff] [review]:
-----------------------------------------------------------------

With these few comments, enough changes that I should just punt for a closer look at something much closer to the final deal.

::: js/public/CharacterEncoding.h
@@ +288,5 @@
>  DeflateStringToUTF8Buffer(JSFlatString* src, mozilla::RangedPtr<char> dst,
>                            size_t* dstlenp = nullptr, size_t* numcharsp = nullptr);
>  
>  /*
> + * Returns the possible minimum encoding for the given UTF8 string.

The enum shouldn't include "minimum" in it -- that's a function of the API function, not the enum.  Suggest just "Encoding" for a name, and this as comment:

"""
JSAPI character encodings.
"""

@@ +296,5 @@
> +    Latin1,
> +    UTF16
> +};
> +JS_PUBLIC_API(PossibleMinumumEncoding)
> +PossibleMinumumEncodingForUTF8(UTF8Chars utf8);

I'd say

"""
Returns the most restricted encoding possible for the given string: if all codepoints are <128 then ASCII, otherwise if all codepoints are <256 Latin-1, else UTF16.
"""

and name the function MostRestrictedEncoding.  "ForUTF8" is implicit in the function argument, doesn't need to be spelled out.

::: js/src/vm/CharacterEncoding.cpp
@@ +262,5 @@
>  // LossyConvertUTF8toUTF16() in dom/wifi/WifiUtils.cpp
>  template <InflateUTF8Action Action, typename CharT>
>  static bool
>  InflateUTF8StringToBuffer(JSContext* cx, const UTF8Chars src, CharT* dst, size_t* dstlenp,
> +                          bool* isAsciip, bool* isLatin1)

Instead of multiple bool*, why not |Encoding* mostRestrictedEncoding|?  (The 'p' at end doesn't seem necessary to me, not least because we're not consistent about doing it.)  Doesn't admit nonsense like |*isAsciip == true && *isLatin1 == false|, and users get told exactly what encoding is best, without having to reconstruct it from bools.  That also means you could not add a new InflateUTF8Action -- just replace all |*isAsciip| assignments with assigments to mRE.

@@ +332,5 @@
>  
>              // Check the continuation bytes.
>              for (uint32_t m = 1; m < n; m++)
>                  if ((src[i + m] & 0xC0) != 0x80)
>                      INVALID(ReportInvalidCharacter, i, m);

Put braces around the for-loop while you're here.

@@ +394,5 @@
>          ReportOutOfMemory(cx);
>          return CharsT();
>      }
>  
>      if (isAscii) {

In a most-restricted-encoding outparam world

@@ +444,5 @@
> +                                                                       utf8,
> +                                                                       /* dst = */ nullptr,
> +                                                                       /* dstlen = */ nullptr,
> +                                                                       &isAscii,
> +                                                                       &isLatin1)));

With just one outparam this becomes easy -- just return that outparam, none of this if-stuff.
Attachment #8786526 - Flags: review?(jwalden+bmo) → review-
Comment on attachment 8786525 [details] [diff] [review]
Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ.

Review of attachment 8786525 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/CharacterEncoding.h
@@ +290,5 @@
>  
> +/*
> + * Inflate bytes in UTF-8 encoding to Latin1 string.
> + * The caller must check if the given UTF-8 string can be encoded with Latin1
> + * before calling this.

Wait, that's not how this works.  As stated, this makes Latin-1 encoding of the string a *precondition*, and if you don't do it, it's undefined what happens.  This sentence really should be something like, "If the string contains non-Latin-1 characters, reports an error."  It's really perfectly fine to call this with a non-Latin-1 string, as long as you understand the consequences.  :-)

@@ +301,5 @@
> +
> +/*
> + * The same as UTF8CharsToNewLatin1CharsZ(), except that any malformed UTF-8
> + * characters will be replaced by "?".  No exception will be thrown for
> + * malformed UTF-8 input.

I wouldn't document by reference with alterations here, just say things separately -- there's too little complexity here to not just spit it out, because document-by-reference increases the burden on anyone reading the comment.  So I'd have these two doc-comments:

"""
Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|.  Fail and report an error if the string contains non-Latin-1 codepoints.  Returns Latin1CharsZ() on failure.
"""

"""
Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|.  Non-Latin-1 codepoints are replaced by '?'.  Returns Latin1CharsZ() on failure.
"""

::: js/src/vm/CharacterEncoding.cpp
@@ +6,5 @@
>  
>  #include "js/CharacterEncoding.h"
>  
>  #include "mozilla/Range.h"
> +#include "mozilla/TypeTraits.h"

These days you can use <type_traits> -- someone's converting things over, might as well not add new uses when we're not interplaying with existing code.

@@ +254,5 @@
>      Copy
>  };
>  
>  static const uint32_t REPLACE_UTF8 = 0xFFFD;
> +static const uint32_t REPLACE_UTF8_LATIN1 = '?';

Make these char16_t and Latin1Char typed.

@@ +266,5 @@
>  {
>      if (Action != AssertNoInvalids)
>          *isAsciip = true;
>  
> +    // Count how many characters need to be in the inflated string.

s/characters/code units/ so there's no confusion about "characters" and what that might mean.

@@ +280,5 @@
>  
>          } else {
>              // Non-ASCII code unit.  Determine its length in bytes (n).
>              if (Action != AssertNoInvalids)
>                  *isAsciip = false;

Along with the Encoding* outparam thing, I might add lambdas like so:

  auto RequireLatin1 = [&mostRestrictedEncoding]{
    mostRestrictedEncoding = std::max(Latin1, mostRestrictedEncoding);
  };
  auto RequireUTF16 = [&mostRestrictedEncoding]{
    mostRestrictedEncoding = UTF16;
  };

and then call those in the right places.  The Latin-1 bit is the tricky one, because it has to *conditionally* upgrade, and not overwrite once UTF16 is encountered.

Although, hm.  This does bring to mind that *if* we have a CheckEncoding enum as I suggested removing in the last review, *if* we have that we can just return immediately once a non-Latin-1 character is encountered.  That probably motivates having CheckEncoding after all.

@@ +294,5 @@
>                  } else if (Action == AssertNoInvalids) {                \
>                      MOZ_CRASH("invalid UTF-8 string: " # report);       \
>                  } else {                                                \
> +                    if (Action == Copy) {                               \
> +                        if (mozilla::IsSame<CharT, Latin1Char>::value)  \

std::is_same<decltype(dst[0]), Latin1Char>::value

so that we don't have to wonder, however briefly, if CharT is |*dst|'s type or not.

@@ +371,3 @@
>  InflateUTF8StringHelper(JSContext* cx, const UTF8Chars src, size_t* outlen)
>  {
> +    typedef typename CharsT::CharT CharT;

using CharT = CharsT::CharT;

maybe with a typename in there.  Basically at this point, never typedef, always using.  :-)
Attachment #8786525 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8786526 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.

Review of attachment 8786526 [details] [diff] [review]:
-----------------------------------------------------------------

Okay, previous r- more or less stands, just a couple slightly different takes on some things.

::: js/public/CharacterEncoding.h
@@ +296,5 @@
> +    Latin1,
> +    UTF16
> +};
> +JS_PUBLIC_API(PossibleMinumumEncoding)
> +PossibleMinumumEncodingForUTF8(UTF8Chars utf8);

FindSmallestEncoding is a better name for this, on second thought.  And I guess then SmallestEncoding is a better name than PossibleMinimumEncoding, with the doc-comment "The smallest character encoding capable of fully representing a particular string."

::: js/src/vm/CharacterEncoding.cpp
@@ +251,5 @@
>      CountAndReportInvalids,
>      CountAndIgnoreInvalids,
>      AssertNoInvalids,
> +    Copy,
> +    CheckEncoding

FindSmallestEncoding seems better than "check", which implies the input might not be valid

@@ +338,5 @@
>              // Determine the code unit's length in CharT and act accordingly.
>              v = JS::Utf8ToOneUcs4Char((uint8_t*)&src[i], n);
> +            if (Action == CheckEncoding) {
> +                if (v > 0xff)
> +                    *isLatin1 = false;

We could/should just assert here that |dst| is null and immediately return SmallestEncoding::UTF16.
Comment on attachment 8786527 [details] [diff] [review]
Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z.

Review of attachment 8786527 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/String.cpp
@@ +1317,5 @@
> +        Latin1Char* latin1 = UTF8CharsToNewLatin1CharsZ(cx, utf8, &length).get();
> +        if (!latin1)
> +            return nullptr;
> +
> +        Rooted<JSFlatString*> result(cx, NewString<allowGC>(cx, latin1, length));

Can just use JSFlatString* for this type (and for the other |result| below) -- the Rooted doesn't do anything.
Attachment #8786527 - Flags: review?(jwalden+bmo) → review+
Assignee

Comment 13

3 years ago
addressed review comments.
RequireLatin1/RequireUTF16 are in Part 2.
Attachment #8786525 - Attachment is obsolete: true
Attachment #8787434 - Flags: review+
Assignee

Comment 14

3 years ago
For now, used "FindEncoding" instead of "FindSmallestEncoding" for enum value, as it conflicts with FindSmallestEncoding function, especially inside it.
Do you have any other idea?
(we might be able to use enum class, but it will make template parameter long...)
Attachment #8786526 - Attachment is obsolete: true
Attachment #8787437 - Flags: review?(jwalden+bmo)
Comment on attachment 8787437 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.

Review of attachment 8787437 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/CharacterEncoding.cpp
@@ +268,3 @@
>  {
> +    auto RequireLatin1 = [&smallestEncoding]{
> +        *smallestEncoding = std::max(JS::SmallestEncoding::Latin1, *smallestEncoding);

#include <algorithm>

(I think?) at top of the file somewhere, if it's not there.

@@ +443,5 @@
> +JS::SmallestEncoding
> +JS::FindSmallestEncoding(UTF8Chars utf8)
> +{
> +    JS::SmallestEncoding encoding;
> +    JS_ALWAYS_TRUE((InflateUTF8StringToBuffer<FindEncoding, char16_t>(

Use MOZ_ALWAYS_TRUE for everything new you're adding here, we need to delete the other.
Attachment #8787437 - Flags: review?(jwalden+bmo) → review+
Assignee

Comment 17

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/b4c7d481bf10a39a99cb41b89bc677434d80dc0c
Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden

https://hg.mozilla.org/integration/mozilla-inbound/rev/1bbdfed5b149b231a27692542b6cee5b9f5138a8
Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden

https://hg.mozilla.org/integration/mozilla-inbound/rev/3abe6a1579f9bc79816ab781869b5232a1b3e483
Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden
Assignee

Comment 20

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/bae24022813be38a0433b39469d4cd38ff288ce2
Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden

https://hg.mozilla.org/integration/mozilla-inbound/rev/18b15a11adf0a6f5c297c48364993a231fecbf91
Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden

https://hg.mozilla.org/integration/mozilla-inbound/rev/9f509473b88714d6dbd7666b6896551d669a6fda
Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden

Comment 21

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/bae24022813b
https://hg.mozilla.org/mozilla-central/rev/18b15a11adf0
https://hg.mozilla.org/mozilla-central/rev/9f509473b887
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
You need to log in before you can comment on or make changes to this bug.