Open Bug 1578650 Opened 5 years ago Updated 2 years ago

Use SIMD-accelerated conversions from UTF-8 and WTF-8

Categories

(Core :: JavaScript Engine, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: hsivonen, Unassigned)

References

(Depends on 2 open bugs)

Details

Bug 1578339 and bug 1561567 added SIMD acceleration for ASCIIness and Latin1ness checking, conversions between Latin1 and UTF-16, and conversions to UTF-8.

This bug is about making conversions from UTF-8 also use SIMD-accelerated code. However, it's a bit difficult to see what the actual required set of operations is, because InflateUTF8StringHelper templatizes UTF-8 and WTF-8 sources, UTF-16 and Latin1 destinations, as well as different error handling approaches together.

Also, it's unclear to me what the exactness requirements for memory allocation are. Experience from XPCOM strings indicates that it's better not to run the UTF-8 math twice (once to compute the required buffer length and a second time to do the conversion), but it's better to allocate for the worst case (the ASCII case) and then after the conversion check if the result would fit in a smaller jemalloc bucket and to memcpy it to a smaller bucket if necessary.

Does that memory allocation approach work here?

Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?

For a) UTF-8 and b) WTF-8, which of the following error handling modes are actually needed?

  1. The input is known to be valid. If there are errors anyway, the conversion may execute Undefined Behavior.
  2. Invalid byte sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
  3. If the input is invalid, return an error.
Flags: needinfo?(jwalden)
Flags: needinfo?(arai.unmht)

It appears that the use case for WTF-8 is BinAst.

Flags: needinfo?(dteller)

FWIW, I already have pre-packaged FFI entry points for (all these require worst-case allocation for destination):

  • Convert valid UTF-8 to UTF-16. Invalid UTF-8 is UB.
  • Convert potentially-invalid UTF-8 to UTF-16. Invalid sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
  • Convert Latin1-range UTF-8 to Latin1. If the input isn't valid UTF-8 in the Latin1 range, the output is garbage in a memory-safe way.

Deflecting WTF-8-related questions to arai :)

Flags: needinfo?(dteller)

In term of pure conversion, all WTF-8 cases are unused.

Here's the list of actual pure conversion that uses InflateUTF8ToUTF16:

  • Convert UTF-8 to UTF-16, use 0xFFFD on invalid character
    • JS::LossyUTF8CharsToNewTwoByteCharsZ
  • Convert UTF-8 to UTF-16, throw on invalid character
    • JS::UTF8CharsToNewTwoByteCharsZ
    • js::StringBuffer::append
  • Convert UTF-8 to Latin1, throw on invalid character
    • JS::UTF8CharsToNewLatin1CharsZ

There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:

  • Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
    • js::AtomizeUTF8Chars
  • Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
    • js::AtomizeWTF8Chars

And remaining consumers of InflateUTF8ToUTF16 are the following:

  • Check smallest encoding of UTF-8, use 0xFFFD on invalid character
    • JS::FindSmallestEncoding
  • Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter
    • js::AtomHasher::match
  • Just validate UTF-8 chars, crash on invalid character (debug only)
    • JS::ConstUTF8CharsZ::validate

And here's the list of unused APIs that needs to be removed:

  • JS::WTF8CharsToNewTwoByteCharsZ
  • JS::UTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
  • JS::LossyUTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
  • JS::LossyUTF8CharsToNewLatin1CharsZ
Flags: needinfo?(arai.unmht)
Depends on: 1579248
Depends on: 1579249

...I'm pretty confident I have nothing particularly useful to add after comment 4. :-)

Flags: needinfo?(jwalden)

Does that memory allocation approach work here?

Afaik, it works.

Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?

In case we want Latin1 if possible, we first check the smallest encoding by FindSmallestEncoding or similar internally,
So, there's no need to do the check at the same time.

comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.

Flags: needinfo?(arai.unmht)

(In reply to Tooru Fujisawa [:arai] from comment #4)

In term of pure conversion, all WTF-8 cases are unused.

Nice.

Here's the list of actual pure conversion that uses InflateUTF8ToUTF16:

  • Convert UTF-8 to UTF-16, use 0xFFFD on invalid character
    • JS::LossyUTF8CharsToNewTwoByteCharsZ

This could use mozilla::ConvertUtf8toUtf16() (from bug 1490601).

  • Convert UTF-8 to UTF-16, throw on invalid character
    • JS::UTF8CharsToNewTwoByteCharsZ
    • js::StringBuffer::append

encoding_c_mem lacks a nicely packaged function for this at the moment. (The SpiderMonkey-unavailable heap-allocated mozilla::Decoder has this.)

It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)

  • Convert UTF-8 to Latin1, throw on invalid character
    • JS::UTF8CharsToNewLatin1CharsZ

This appears to be called only immediately after FindSmallestEncoding has validated the precondition, so mozilla::LossyConvertUtf8toLatin1 (from bug 1490601) would work.

There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:

  • Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
    • js::AtomizeUTF8Chars
  • Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
    • js::AtomizeWTF8Chars

Probably the best not to modify these.

And remaining consumers of InflateUTF8ToUTF16 are the following:

  • Check smallest encoding of UTF-8, use 0xFFFD on invalid character
    • JS::FindSmallestEncoding

I have a SIMD-accelerated implemetation of this in bug 1578339.

  • Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter
    • js::AtomHasher::match

Probably the best not to change this one.

  • Just validate UTF-8 chars, crash on invalid character (debug only)
    • JS::ConstUTF8CharsZ::validate

No need to change debug-only things.

And here's the list of unused APIs that needs to be removed:

  • JS::WTF8CharsToNewTwoByteCharsZ
  • JS::UTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
  • JS::LossyUTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
  • JS::LossyUTF8CharsToNewLatin1CharsZ

Removing these would indeed add clarity.

Thank you!

Depends on: 1575150

I suggest using the API surface suggested in bug 1575150 to deal with potential extra buffer capacity after conversion.

(In reply to Tooru Fujisawa [:arai] from comment #7)

comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.

this affects only js::AtomizeUTF8Chars and js::AtomizeWTF8Chars, so we can ignore them here, per comment #8.

It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)

will check the list of callers of those APIs to see if we really need throwing

will check the list of callers of those APIs to see if we really need throwing

Probably not worthwhile anymore. I already have a version of encoding_rs and encoding_c_mem in the pipeline that add a conversion function that signals on error.

I see

Flags: needinfo?(arai.unmht)
Depends on: 1579813
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.