1578650 - Use SIMD-accelerated conversions from UTF-8 and WTF-8

Reporter

Description

•

5 years ago

•

Bug 1578339 and bug 1561567 added SIMD acceleration for ASCIIness and Latin1ness checking, conversions between Latin1 and UTF-16, and conversions to UTF-8.

This bug is about making conversions from UTF-8 also use SIMD-accelerated code. However, it's a bit difficult to see what the actual required set of operations is, because InflateUTF8StringHelper templatizes UTF-8 and WTF-8 sources, UTF-16 and Latin1 destinations, as well as different error handling approaches together.

Also, it's unclear to me what the exactness requirements for memory allocation are. Experience from XPCOM strings indicates that it's better not to run the UTF-8 math twice (once to compute the required buffer length and a second time to do the conversion), but it's better to allocate for the worst case (the ASCII case) and then after the conversion check if the result would fit in a smaller jemalloc bucket and to memcpy it to a smaller bucket if necessary.

Does that memory allocation approach work here?

Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?

For a) UTF-8 and b) WTF-8, which of the following error handling modes are actually needed?

The input is known to be valid. If there are errors anyway, the conversion may execute Undefined Behavior.
Invalid byte sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
If the input is invalid, return an error.

Henri Sivonen (:hsivonen)

Reporter

Updated

•

5 years ago

Flags: needinfo?(jwalden)

Flags: needinfo?(arai.unmht)

Henri Sivonen (:hsivonen)

Reporter

Comment 1

•

5 years ago

It appears that the use case for WTF-8 is BinAst.

Flags: needinfo?(dteller)

Henri Sivonen (:hsivonen)

Reporter

Comment 2

•

5 years ago

FWIW, I already have pre-packaged FFI entry points for (all these require worst-case allocation for destination):

Convert valid UTF-8 to UTF-16. Invalid UTF-8 is UB.
Convert potentially-invalid UTF-8 to UTF-16. Invalid sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
Convert Latin1-range UTF-8 to Latin1. If the input isn't valid UTF-8 in the Latin1 range, the output is garbage in a memory-safe way.

David Teller [:Yoric] - still alive but not very active

Comment 3

•

5 years ago

Deflecting WTF-8-related questions to arai :)

Flags: needinfo?(dteller)

Tooru Fujisawa [:arai]

Comment 4

•

5 years ago

In term of pure conversion, all WTF-8 cases are unused.

Here's the list of actual pure conversion that uses InflateUTF8ToUTF16:

Convert UTF-8 to UTF-16, use 0xFFFD on invalid character
- JS::LossyUTF8CharsToNewTwoByteCharsZ
Convert UTF-8 to UTF-16, throw on invalid character
- JS::UTF8CharsToNewTwoByteCharsZ
- js::StringBuffer::append
Convert UTF-8 to Latin1, throw on invalid character
- JS::UTF8CharsToNewLatin1CharsZ

There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:

Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
- js::AtomizeUTF8Chars
Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
- js::AtomizeWTF8Chars

And remaining consumers of InflateUTF8ToUTF16 are the following:

Check smallest encoding of UTF-8, use 0xFFFD on invalid character
- JS::FindSmallestEncoding
Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter
- js::AtomHasher::match
Just validate UTF-8 chars, crash on invalid character (debug only)
- JS::ConstUTF8CharsZ::validate

And here's the list of unused APIs that needs to be removed:

JS::WTF8CharsToNewTwoByteCharsZ
JS::UTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
JS::LossyUTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ
JS::LossyUTF8CharsToNewLatin1CharsZ

Flags: needinfo?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

5 years ago

Depends on: 1579248

Tooru Fujisawa [:arai]

Updated

•

5 years ago

Depends on: 1579249

Jeff Walden [:Waldo]

Comment 5

•

5 years ago

...I'm pretty confident I have nothing particularly useful to add after comment 4. :-)

Flags: needinfo?(jwalden)

Tooru Fujisawa [:arai]

Comment 6

•

5 years ago

•

Edited

Does that memory allocation approach work here?

Afaik, it works.

Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?

In case we want Latin1 if possible, we first check the smallest encoding by FindSmallestEncoding or similar internally,
So, there's no need to do the check at the same time.

Tooru Fujisawa [:arai]

Comment 7

•

5 years ago

comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.

Flags: needinfo?(arai.unmht)

Henri Sivonen (:hsivonen)

Reporter

Comment 8

•

5 years ago

(In reply to Tooru Fujisawa [:arai] from comment #4)

In term of pure conversion, all WTF-8 cases are unused.

Nice.

Here's the list of actual pure conversion that uses InflateUTF8ToUTF16:

Convert UTF-8 to UTF-16, use 0xFFFD on invalid character

JS::LossyUTF8CharsToNewTwoByteCharsZ

This could use mozilla::ConvertUtf8toUtf16() (from bug 1490601).

Convert UTF-8 to UTF-16, throw on invalid character

JS::UTF8CharsToNewTwoByteCharsZ

js::StringBuffer::append

encoding_c_mem lacks a nicely packaged function for this at the moment. (The SpiderMonkey-unavailable heap-allocated mozilla::Decoder has this.)

It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)

Convert UTF-8 to Latin1, throw on invalid character

JS::UTF8CharsToNewLatin1CharsZ

This appears to be called only immediately after FindSmallestEncoding has validated the precondition, so mozilla::LossyConvertUtf8toLatin1 (from bug 1490601) would work.

There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:

Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character

js::AtomizeUTF8Chars

Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character

js::AtomizeWTF8Chars

Probably the best not to modify these.

And remaining consumers of InflateUTF8ToUTF16 are the following:

Check smallest encoding of UTF-8, use 0xFFFD on invalid character

JS::FindSmallestEncoding

I have a SIMD-accelerated implemetation of this in bug 1578339.

Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter

js::AtomHasher::match

Probably the best not to change this one.

Just validate UTF-8 chars, crash on invalid character (debug only)

JS::ConstUTF8CharsZ::validate

No need to change debug-only things.

And here's the list of unused APIs that needs to be removed:

JS::WTF8CharsToNewTwoByteCharsZ

JS::UTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ

JS::LossyUTF8CharsToNewTwoByteCharsZ with ConstUTF8CharsZ

JS::LossyUTF8CharsToNewLatin1CharsZ

Removing these would indeed add clarity.

Thank you!

Henri Sivonen (:hsivonen)

Reporter

Updated

•

5 years ago

Depends on: 1575150

Henri Sivonen (:hsivonen)

Reporter

Comment 9

•

5 years ago

I suggest using the API surface suggested in bug 1575150 to deal with potential extra buffer capacity after conversion.

Tooru Fujisawa [:arai]

Comment 10

•

5 years ago

(In reply to Tooru Fujisawa [:arai] from comment #7)

comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.

this affects only js::AtomizeUTF8Chars and js::AtomizeWTF8Chars, so we can ignore them here, per comment #8.

It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)

will check the list of callers of those APIs to see if we really need throwing

Henri Sivonen (:hsivonen)

Reporter

Comment 11

•

5 years ago

will check the list of callers of those APIs to see if we really need throwing

Probably not worthwhile anymore. I already have a version of encoding_rs and encoding_c_mem in the pipeline that add a conversion function that signals on error.

Tooru Fujisawa [:arai]

Comment 12

•

5 years ago

I see

Flags: needinfo?(arai.unmht)

Henri Sivonen (:hsivonen)

Reporter

Updated

•

5 years ago

Depends on: 1579813

Jason Orendorff [:jorendorff]

Updated

•

5 years ago

Priority: -- → P3

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Bugzilla

Quick Search

Use SIMD-accelerated conversions from UTF-8 and WTF-8

Categories

(Core :: JavaScript Engine, enhancement, P3)

Tracking

()

People

(Reporter: hsivonen, Unassigned)

References

(Depends on 2 open bugs)

Details

Crash Data

Security

(public)

User Story

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Updated

Updated

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Comment 9

Comment 10

Comment 11

Comment 12

Updated

Updated

Updated