Use SIMD-accelerated conversions from UTF-8 and WTF-8
Categories
(Core :: JavaScript Engine, enhancement, P3)
Tracking
()
People
(Reporter: hsivonen, Unassigned)
References
(Depends on 2 open bugs)
Details
Bug 1578339 and bug 1561567 added SIMD acceleration for ASCIIness and Latin1ness checking, conversions between Latin1 and UTF-16, and conversions to UTF-8.
This bug is about making conversions from UTF-8 also use SIMD-accelerated code. However, it's a bit difficult to see what the actual required set of operations is, because InflateUTF8StringHelper
templatizes UTF-8 and WTF-8 sources, UTF-16 and Latin1 destinations, as well as different error handling approaches together.
Also, it's unclear to me what the exactness requirements for memory allocation are. Experience from XPCOM strings indicates that it's better not to run the UTF-8 math twice (once to compute the required buffer length and a second time to do the conversion), but it's better to allocate for the worst case (the ASCII case) and then after the conversion check if the result would fit in a smaller jemalloc bucket and to memcpy
it to a smaller bucket if necessary.
Does that memory allocation approach work here?
Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?
For a) UTF-8 and b) WTF-8, which of the following error handling modes are actually needed?
- The input is known to be valid. If there are errors anyway, the conversion may execute Undefined Behavior.
- Invalid byte sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
- If the input is invalid, return an error.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
It appears that the use case for WTF-8 is BinAst.
Reporter | ||
Comment 2•5 years ago
|
||
FWIW, I already have pre-packaged FFI entry points for (all these require worst-case allocation for destination):
- Convert valid UTF-8 to UTF-16. Invalid UTF-8 is UB.
- Convert potentially-invalid UTF-8 to UTF-16. Invalid sequences are replaced with the REPLACEMENT CHARACTER in a WHATWG-compliant way.
- Convert Latin1-range UTF-8 to Latin1. If the input isn't valid UTF-8 in the Latin1 range, the output is garbage in a memory-safe way.
Deflecting WTF-8-related questions to arai :)
Comment 4•5 years ago
|
||
In term of pure conversion, all WTF-8 cases are unused.
Here's the list of actual pure conversion that uses InflateUTF8ToUTF16
:
- Convert UTF-8 to UTF-16, use 0xFFFD on invalid character
JS::LossyUTF8CharsToNewTwoByteCharsZ
- Convert UTF-8 to UTF-16, throw on invalid character
JS::UTF8CharsToNewTwoByteCharsZ
js::StringBuffer::append
- Convert UTF-8 to Latin1, throw on invalid character
JS::UTF8CharsToNewLatin1CharsZ
There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:
- Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
js::AtomizeUTF8Chars
- Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
js::AtomizeWTF8Chars
And remaining consumers of InflateUTF8ToUTF16
are the following:
- Check smallest encoding of UTF-8, use 0xFFFD on invalid character
JS::FindSmallestEncoding
- Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter
js::AtomHasher::match
- Just validate UTF-8 chars, crash on invalid character (debug only)
JS::ConstUTF8CharsZ::validate
And here's the list of unused APIs that needs to be removed:
JS::WTF8CharsToNewTwoByteCharsZ
JS::UTF8CharsToNewTwoByteCharsZ
withConstUTF8CharsZ
JS::LossyUTF8CharsToNewTwoByteCharsZ
withConstUTF8CharsZ
JS::LossyUTF8CharsToNewLatin1CharsZ
Comment 5•5 years ago
|
||
...I'm pretty confident I have nothing particularly useful to add after comment 4. :-)
Comment 6•5 years ago
•
|
||
Does that memory allocation approach work here?
Afaik, it works.
Do we ever need to convert to Latin1 without having first checked that the input is valid UTF-8 in the Latin1 range?
In case we want Latin1 if possible, we first check the smallest encoding by FindSmallestEncoding or similar internally,
So, there's no need to do the check at the same time.
Comment 7•5 years ago
|
||
comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.
Reporter | ||
Comment 8•5 years ago
|
||
(In reply to Tooru Fujisawa [:arai] from comment #4)
In term of pure conversion, all WTF-8 cases are unused.
Nice.
Here's the list of actual pure conversion that uses
InflateUTF8ToUTF16
:
- Convert UTF-8 to UTF-16, use 0xFFFD on invalid character
JS::LossyUTF8CharsToNewTwoByteCharsZ
This could use mozilla::ConvertUtf8toUtf16()
(from bug 1490601).
- Convert UTF-8 to UTF-16, throw on invalid character
JS::UTF8CharsToNewTwoByteCharsZ
js::StringBuffer::append
encoding_c_mem
lacks a nicely packaged function for this at the moment. (The SpiderMonkey-unavailable heap-allocated mozilla::Decoder
has this.)
It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)
- Convert UTF-8 to Latin1, throw on invalid character
JS::UTF8CharsToNewLatin1CharsZ
This appears to be called only immediately after FindSmallestEncoding
has validated the precondition, so mozilla::LossyConvertUtf8toLatin1
(from bug 1490601) would work.
There are 2 more conversions below, but they calculate hash at the same time, and I'm not sure this applies:
- Convert UTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
js::AtomizeUTF8Chars
- Convert WTF-8 to either UTF-16 or Latin1 depending on smallest encoding, calculate hash, throw on invalid character
js::AtomizeWTF8Chars
Probably the best not to modify these.
And remaining consumers of
InflateUTF8ToUTF16
are the following:
- Check smallest encoding of UTF-8, use 0xFFFD on invalid character
JS::FindSmallestEncoding
I have a SIMD-accelerated implemetation of this in bug 1578339.
- Compare {UTF-8 or WTF-8} with either {Latin1 or UTF-16}, crash on invalid chnaracter
js::AtomHasher::match
Probably the best not to change this one.
- Just validate UTF-8 chars, crash on invalid character (debug only)
JS::ConstUTF8CharsZ::validate
No need to change debug-only things.
And here's the list of unused APIs that needs to be removed:
JS::WTF8CharsToNewTwoByteCharsZ
JS::UTF8CharsToNewTwoByteCharsZ
withConstUTF8CharsZ
JS::LossyUTF8CharsToNewTwoByteCharsZ
withConstUTF8CharsZ
JS::LossyUTF8CharsToNewLatin1CharsZ
Removing these would indeed add clarity.
Thank you!
Reporter | ||
Comment 9•5 years ago
|
||
I suggest using the API surface suggested in bug 1575150 to deal with potential extra buffer capacity after conversion.
Comment 10•5 years ago
|
||
(In reply to Tooru Fujisawa [:arai] from comment #7)
comment #6 might be somewhat inaccurate
about eliminating the step to calculate string length, in some case we check the smallest encoding at the same time as calculating length.
I'll check the details/affected case shortly.
this affects only js::AtomizeUTF8Chars
and js::AtomizeWTF8Chars
, so we can ignore them here, per comment #8.
It's a bit unclear to me why these users want throwing instead of REPLACEMENT CHARACTERs. Throwing seems disruptive enough that the callers probably already have a high degree of confidence that they have valid UTF-8 and throwing only serves as a kind of release assertion, so I wonder if it's truly important to throw (as opposed to generate REPLACEMENT CHARACTERs). (With the exception of CTypes where the behavior appears to be part of an external API.)
will check the list of callers of those APIs to see if we really need throwing
Reporter | ||
Comment 11•5 years ago
|
||
will check the list of callers of those APIs to see if we really need throwing
Probably not worthwhile anymore. I already have a version of encoding_rs and encoding_c_mem in the pipeline that add a conversion function that signals on error.
Updated•5 years ago
|
Updated•2 years ago
|
Description
•