Closed Bug 1426909 — Opened 6 years ago, Closed 4 years ago

Add a character type for UTF-8 code units and a UTF-8 string type for UTF-8 strings

Categories

(Core :: MFBT, enhancement, P2)

Version: 58 Branch

Tracking

Status: RESOLVED FIXED

People

(Reporter: Waldo, Assigned: Waldo)

References

(Depends on 1 open bug)

Attachments

(11 files, 9 obsolete files)

JS parsing of single-byte source text is coming in bug 1351107.  (All parsing right now is of char16_t UTF-16 -- a UTF-8 script downloaded from the web gets inflated to UTF-16 before the JS engine sees it.)  The JS parser and token stream infrastructure are all templated on CharT as character type.  Currently that's *only* char16_t, with a different type needed for UTF-8 text.

The easy choice for that type is |unsigned char|, letting you use UTF-8-encoded C strings trivially.  But this lets people accidentally pass non-UTF-8 text to such APIs with no compile-time feedback (and runtime feedback only if lucky, or with adversarial input).  This is a Bad Idea.

C++next is considering a fresh character type for UTF-8 text:

http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0482r0.html

We can't wait on that, so I'm rolling something of our own.  I'm tentatively having an enum with specified underlying type to represent characters, functions to convert to and from the enum, then classes to encapsulate a pointer/size of that enum type.  (Not sure if containing a pointer to that enum type or just an unsigned char* yet -- may depend on C++ object/memory model intricacy.)  Messy, but better than having to dig ourselves out of wrongly-encoded madness in the future.
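
A minimal sketch of that tentative shape, with placeholder names -- this is illustration only, not the eventual patch:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical enum-based code unit type with a specified underlying type.
    enum class Utf8CodeUnit : uint8_t {};

    // Conversions to and from the enum are explicit; nothing converts implicitly.
    constexpr Utf8CodeUnit ToUtf8CodeUnit(char aChar)
    {
      return static_cast<Utf8CodeUnit>(static_cast<uint8_t>(aChar));
    }

    constexpr char FromUtf8CodeUnit(Utf8CodeUnit aUnit)
    {
      return static_cast<char>(static_cast<uint8_t>(aUnit));
    }

    // A class encapsulating a pointer/size over such units.  (Whether it stores
    // a Utf8CodeUnit* or an unsigned char* is the open question noted above.)
    class Utf8Chars
    {
      const Utf8CodeUnit* mUnits;
      size_t mLength;

    public:
      Utf8Chars(const Utf8CodeUnit* aUnits, size_t aLength)
        : mUnits(aUnits), mLength(aLength)
      {}

      const Utf8CodeUnit* units() const { return mUnits; }
      size_t length() const { return mLength; }
    };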

(For those wondering: because char16_t and char16_t* are easy types to infer from CharT = char16_t, but something similarly easy will *not* be possible with a scheme even vaguely above the hazy lines established above, I'll probably have to also expose some sort of traits class to provide various useful type-transformations, like CharT -> array of CharT and such, and data transformations, like sequence of CharT -> sequence of char16_t.  Unideal, but nothing except char8_t will be.)

I'm still designing this now.  I doubt I'll have patches til I start adapting the JS parser to use this, to prove the implementation of the concept is usable.  Probably not til next year, likely a couple weeks in given the current shutdown happening.
So.  I'm happy with the first two patches.  Utf8Unit is good, and I'm as confident as any reasonably informed and fastidious person *could* be about its spec-correctness.  It should be a good basis for a type for UTF-8 code units that isn't just confusable with raw char.

The rest of the patches...eh.  There's nothing wrong with them or the interfaces they expose, except in not restricting UnicodeString to *only* be valid UTF-8/16 (but that's afield from my immediate needs).  (I should also note -- the validation parts, to ensure string contents on construction are valid UTF-8/16, are not quite the correct fit for JavaScript's needs, because of some ECMAScript silliness.  Maybe.  Just flagging that, not addressing it, for now.)

The big problem is with ownership.  None of this stuff owns the data/memory/code units observed.  You can have a CodeUnitIterator outlive its CodeUnitIterable, and that won't be a problem assertions-wise, but it might be problematic from a "don't do this" sense of the API we might want to have.  It's quite possible to use all this stuff safely, but it is not guaranteed at all.  Not without something like Rust borrow checking, which I'm sure C++ will pick up any day now.  In a sense this just feels like it has C++17's <string_view> problems, and that certainly can't make us happy.

It *has* occurred to me that you could put reference counts into all the different classes, then have each class have a destructor that asserts that its internal reference count is zero, so that no iterator outlives an iterable, no iterable outlives a string.  Maybe that's good enough, combined with testing in debug builds?  Really not sure.
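
A rough sketch of that debug-only counting idea, with illustrative names and assuming MOZ_ASSERT from mozilla/Assertions.h:

    #include <cstddef>
    #include "mozilla/Assertions.h"

    class CodeUnitIterable
    {
    #ifdef DEBUG
      // Number of outstanding iterators borrowing from this iterable.
      mutable size_t mRefCount = 0;
    #endif

    public:
      ~CodeUnitIterable()
      {
    #ifdef DEBUG
        // No iterator may outlive the iterable it was created from.
        MOZ_ASSERT(mRefCount == 0);
    #endif
      }

      void noteIteratorCreated() const
      {
    #ifdef DEBUG
        mRefCount++;
    #endif
      }

      void noteIteratorDestroyed() const
      {
    #ifdef DEBUG
        MOZ_ASSERT(mRefCount > 0);
        mRefCount--;
    #endif
      }
    };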

So I'm not sure where this wants to go.  I'm definitely going to want/need the first two patches.  But the rest?  Dunno.

I *do* need a pointer/length structure that can take in |const char*| or |const char16_t*| and length, and expose only |const Utf8Unit*| and |const char16_t*| from it.  (And same with |const| removed from everything, for mutable viewing.)  UnicodeString does that, very carefully.  But maybe some much-smaller approach -- that somehow doesn't have ownership problems? -- is better.  What are your thoughts on what to do?
Flags: needinfo?(nfroyd)
Off the top of my head, I think it's pretty hard to write the C++ code you probably want to write without exposing ownership problems.  My sense is that we should try to not let the perfect be the enemy of the good here: we should have a reasonable (or even a completely unreasonable "why do you have this many?") number of assertions, we should have a reasonable API that is as difficult as possible to misuse, and call that good.

As to the rest of it, my sense of it is twofold:

1) It all looks reasonable to somebody who has glanced through it and has not been neck-deep in this for several weeks; and
2) We should consult Henri, who has Plans For XPCOM Strings to be partially backed via encoding_rs, and I suspect would like some of the concepts here to be backed by encoding_rs as well.  Henri is also more qualified than I am to offer design advice on Unicode-related topics.

Passing ni? to Henri.
Flags: needinfo?(nfroyd) → needinfo?(hsivonen)
First of all, apologies for my failure to more clearly advertise what I've been and am working on to avoid duplication.

As I understand from our previous discussions, the UTF-8-mode JS parser will want to iterate by code unit instead of by code point most of the time for efficiency, so forcing iteration by code point is not the purpose of mozilla::UnicodeString. In that light, it seems to me that the proposed mozilla::UnicodeString<T> is very close to mozilla::Span<T>. I'm rather skeptical about the utility of introducing mozilla::UnicodeString<T> instead of using mozilla::Span<T> like mozilla::Encoding/mozilla::Decoder/mozilla::Encoder do for borrowed text.

The promise of mozilla::Utf8Unit isn't quite clear to me. The comment for Utf8String explicitly says that the type doesn't guarantee UTF-8 validity. If the type doesn't signal validity, what do we gain compared to mozilla::Span<uint8_t>?

(Both mozilla::UnicodeString and mozilla::Span have the problem that they don't have the Rust borrow checker to protect against use-after-free if the borrowed memory goes away too early.)

For checking UTF-8 validity, please call into encoding_rs. Its UTF-8 validation code is an explicitly SIMD-accelerated fork of the Rust standard-library UTF-8 validation code. (I'm hoping to upstream it once I'm satisfied with its ARMv7+NEON story; currently no NEON on ARMv7.) The code is already in Gecko. See IsUTF8() in nsReadableUtils.h for a caller.

(In fact, this UTF-8 validation function was what prompted me, in Austin, to bring up the issue of SpiderMonkey stand-alone builds gaining an equivalent of libgkrust so that SpiderMonkey gets to use this code that I've already made an effort to make correct and fast.)

In the case of script data coming from UTF-8-aware (in the future) Gecko script loader, though, it shouldn't need to be up to SpiderMonkey to validate UTF-8, since you can trust mozilla::Decoder to give you valid UTF-8 and to do so without copying when possible if the input and output are in nsCString. (See https://searchfox.org/mozilla-central/source/layout/style/StreamLoader.cpp#57 for CSS-side precedent.)

For checking UTF-16 validity, please call into encoding_rs::mem (in-flight in bug 1431356). There's https://docs.rs/encoding_rs/0.7.2/encoding_rs/mem/fn.utf16_valid_up_to.html and https://docs.rs/encoding_rs/0.7.2/encoding_rs/mem/fn.ensure_utf16_validity.html which is built on the former. There are no C bindings yet, but the C bindings are trivial to write by copy-and-paste. utf16_valid_up_to() is SIMD-accelerated on x86, x86_64 and aarch64 (again NEON on ARMv7 is on the TODO list).

(Additionally, I'm in the process of changing XPCOM string conversions between UTF-8, UTF-16 and Latin1 to be backed by encoding_rs::mem. Bug 1402247.)
Flags: needinfo?(hsivonen)
Okay, I've added the sanity-checking assertion idea -- plus documentation of the lifetime requirements -- to these patches.  Happily, not a huge amount of stuff has changed (the most significant thing: because UnicodeString now has a user-provided destructor, it isn't a literal type and so can't be used in constexpr code, so the literal-string helpers can't be constexpr either).

I also have one patch that's ready to land, which very trivially uses this in one spot (outside the included tests).  I expect to use more of it in JS-specific patches that I hope won't require too many more changes to this foundational work.  I'll include it in this bug even though it's a purely JS engine change.
Attachment #8946835 - Flags: review?(nfroyd)
Attachment #8943755 - Attachment is obsolete: true
Attachment #8943756 - Attachment is obsolete: true
Attachment #8943757 - Attachment is obsolete: true
Since I last posted this, a couple things have changed/come up.

First and most notably, JS source text is *not* UTF-16 as you might understand it.  JS source text is *any* array of 16-bit code units, whether or not those code units encode *only* valid UTF-16.  If unpaired surrogates are encoded, they just are supposed to get interpreted -- ECMAScript specifically requires this :-\ -- as the corresponding "code point".  So basically all of the JS engine would want to be passing MaybeInvalidContents for created strings.

Second, I see comment 11 (which I really probably should have responded to before beginning to post these patches, c'est la vie).  Some aspects of validating encoding don't seem to be relevant, if JS validation of encoding is not really going to happen.  Still, I wrote this patch, and I wrote the validation overlay atop it, so it doesn't really *hurt* to consider it.  I can always remove it from these patches if truly desired.

So, this patch-post is maybe somewhat optional, but I'm hesitant to excise it given it's written now.  If desired, I can excise it in a separate patch atop these, I guess.
Attachment #8946842 - Flags: review?(nfroyd)
Attachment #8943758 - Attachment is obsolete: true
Attachment #8943759 - Attachment is obsolete: true
Attachment #8943760 - Attachment is obsolete: true
Attachment #8943761 - Attachment is obsolete: true
Attachment #8943762 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) (Away from Bugzilla until 2018-02-02) from comment #11)
> As I understand from our previous discussions, the UTF-8-mode JS parser will
> want to iterate by code unit instead code point most of the time for
> efficiency, so forcing iteration by code point is not the purpose of
> mozilla::UnicodeString. In that light, it seems to me that the proposed
> mozilla::UnicodeString<T> is very close to mozilla::Span<T>. I'm rather
> skeptical about the utility of introducing mozilla::UnicodeString<T> instead
> of using mozilla::Span<T> like
> mozilla::Encoding/mozilla::Decoder/mozilla::Encoder do for borrowed text.
> 
> The promise of mozilla::Utf8Unit isn't quite clear to me. The comment for
> Utf8String explicitly says that the type doesn't guarantee UTF-8 validity.
> If the type doesn't signal validity, what do we gain compared to
> mozilla::Span<uint8_t>?

C++'s strict aliasing rules are very particular about how memory can be mis-accessed using different types.  There is no guarantee that accessing |char| data, using a |uint8_t| data type, will behave correctly without inducing undefined behavior.  It's necessary to use another character type for such aliasing, and it is not guaranteed that |uint8_t| actually is a character type.

It may well be that most compilers implement it that way -- but they are not guaranteed to do so, and I have seen clang discussions of changing this stuff so that they can optimize things harder.  This is not a footgun we should introduce into our code, to have to fix in the future.

Additionally, using uint8_t does not allow for a coherent story for abstracting away the (potentially unsafe) interconversion story between code units and |char|, as these patches do.  We do not want to use |char| for the actual code unit type, because using |char| or |char*| or |const char*| data with UTF-8 stuff hazards people mistakenly passing non-UTF-8 data in.  And we want a type that makes the UTF-8-ness of the input data very clear.  UnicodeString<Utf8Unit> does this.  Hypothetical Span<char> does not, and Span<unsigned char> does not (the latter being problematic because given |u8"foo"|, you cannot actually get an |unsigned char*| for it without a cast, and we don't want umpteen places that perform this cast -- nor should every user have to perform some casting operation themselves -- the cast needs to be baked into the substrate they're working on).
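
For concreteness, a minimal sketch of the sort of non-interconvertible unit type I mean -- not the actual patch, and assuming a representation that wraps a char (the patch comments discuss the exact representation choice):

    class Utf8Unit
    {
      char mValue;

    public:
      explicit constexpr Utf8Unit(char aUnit) : mValue(aUnit) {}

      // Getting the raw char back out is always spelled explicitly.
      constexpr char toChar() const { return mValue; }
    };

    // void take(const Utf8Unit* aUnits, size_t aCount);
    //
    // take("not necessarily UTF-8", 21);  // does not compile: char data never
    //                                     // implicitly becomes Utf8Unit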

> (Both mozilla::UnicodeString and mozilla::Span have the problem that they
> don't have the Rust borrow checker to protect against use-after-free if the
> borrowed memory goes away too early.)

That problem can't be addressed in C++, of course.  But these latest patches do contain debug assertions that should detect such misuse when the UnicodeString is destroyed in debug builds.

> For checking UTF-8 validity, please call into encoding_rs. Its UTF-8
> validation code is an explicitly SIMD-accelerated fork of the Rust
> standard-library UTF-8 validation code. (I'm hoping to upstream it once I'm
> satisfied with its ARMv7+NEON story; currently no NEON on ARMv7.) The code
> is already in Gecko. See IsUTF8() in nsReadableUtils.h for a caller.

Ultimately, because of JS not actually requiring valid input, I don't think it'll be necessary for this stuff to have any validation overlay on it.  (And quite frankly, I'm not even sure what validation would be -- should UTF-8 ban code unit sequences that would encode surrogates?  I dunno.  Fiddling I did in Python suggests it is not necessarily obvious what "valid" "UTF-8" is.  But again -- doesn't matter for now, and if people are really upset I can remove the validation stuff in these patches.)

However -- this will have the potential of accepting invalid UTF-8 provided by JSAPI callers other than Gecko, so SpiderMonkey will have to continue, alas, to intertwine some bad-UTF-8 handling in tokenizing.  I have posted patches in bug 1434429 that should enable multi-code-unit matching to be specialized for UTF-8 in one place with ease and clarity.

> In the case of script data coming from UTF-8-aware (in the future) Gecko
> script loader, though, it shouldn't need to be up to SpiderMonkey to
> validate UTF-8, since you can trust mozilla::Decoder to give you valid UTF-8
> and to do so without copying when possible if the input and output are in
> nsCString. (See
> https://searchfox.org/mozilla-central/source/layout/style/StreamLoader.
> cpp#57 for CSS-side precedent.)

Yeah, this is the bz thing I noted in this bug, I think.  But JSAPI entrypoints are for more than just the Gecko user, so TokenStream cannot specifically depend on it.
Oh, for what it's worth -- attachment 8946879 is IMO a poster child for not using uint8_t or Span<uint8_t> here.  There's an existing AtomizeChars function, a template function, that's instantiated with either char16_t or |JS::Latin1Char == unsigned char|.  In the UTF-16 case, the encoded goo is interpreted the usual UTF-16 way (with the addition of mispaired surrogates coding the corresponding "code point").  In the Latin-1 case, the encoded goo is interpreted byte-by-byte as the corresponding code point.  If I were *not* careful about how that code is written, and I happened to miss that hazard, it might have been that atomizeChars would be defined type-generically as

    static JSAtom* atomizeChars(JSContext* cx, Span<const CharT> chars) {
        return AtomizeChars(cx, chars.data(), chars.size());
    }

and this would compile to a call, in the hypothetical UTF-8 case, to AtomizeChars(JSContext*, JS::Latin1Char*, size_t) -- very much *not* what is desired.  Avoiding this sort of lapse is exactly why I do not want the UTF-8 unit type to be interconvertible with characters or uint8_t.
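
To make the compile-time effect concrete, here's a stripped-down sketch with stand-in types (not the real SpiderMonkey declarations):

    #include <cstddef>

    using Latin1Char = unsigned char;   // like JS::Latin1Char
    class Utf8Unit                      // distinct UTF-8 code unit type
    {
      char mValue;
    public:
      explicit Utf8Unit(char aUnit) : mValue(aUnit) {}
    };

    void AtomizeChars(const char16_t* aChars, size_t aLength) {}    // UTF-16 case
    void AtomizeChars(const Latin1Char* aChars, size_t aLength) {}  // Latin-1 case

    template<typename CharT>
    void atomizeChars(const CharT* aChars, size_t aLength)
    {
      AtomizeChars(aChars, aLength);
    }

    void example(const char16_t* a16, const Utf8Unit* a8, size_t aLength)
    {
      atomizeChars(a16, aLength);    // OK: the UTF-16 overload
      // atomizeChars(a8, aLength);  // error: Utf8Unit* converts to neither
      //                             // overload, so UTF-8 handling must be
      //                             // written deliberately instead of being
      //                             // silently treated as Latin-1
    }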
(In reply to Jeff Walden [:Waldo] from comment #21)
> (In reply to Henri Sivonen (:hsivonen) (Away from Bugzilla until 2018-02-02)
> from comment #11)
> > As I understand from our previous discussions, the UTF-8-mode JS parser will
> > want to iterate by code unit instead code point most of the time for
> > efficiency, so forcing iteration by code point is not the purpose of
> > mozilla::UnicodeString. In that light, it seems to me that the proposed
> > mozilla::UnicodeString<T> is very close to mozilla::Span<T>. I'm rather
> > skeptical about the utility of introducing mozilla::UnicodeString<T> instead
> > of using mozilla::Span<T> like
> > mozilla::Encoding/mozilla::Decoder/mozilla::Encoder do for borrowed text.
> > 
> > The promise of mozilla::Utf8Unit isn't quite clear to me. The comment for
> > Utf8String explicitly says that the type doesn't guarantee UTF-8 validity.
> > If the type doesn't signal validity, what do we gain compared to
> > mozilla::Span<uint8_t>?
> 
> C++'s strict aliasing rules are very particular about how memory can be
> mis-accessed using different types.  There is no guarantee that accessing
> |char| data, using a |uint8_t| data type, will behave correctly without
> inducing undefined behavior.  It's necessary to use another character type
> for such aliasing, and it is not guaranteed that |uint8_t| actually is a
> character type.

Now I'm confused. I thought the rules were:
 0) The compiler has to assume that two pointers to "compatible types" may alias if neither is qualified with "restrict".
 1) Except for char*, the compiler is allowed to assume that two pointers to incompatible types do not alias even if they aren't qualified with "restrict".
 2) The compiler is allowed to assume that pointers qualified with "restrict" do not alias other pointers.
 3) The compiler has to assume that a char* can alias any non-"restrict"-qualified other pointer regardless of the type the other pointer points to.

The exception for char* is motivated by char* not only meaning character but also being the raw memory type that memmove, etc., consume.

Do I misunderstand?

AFAICT, the char* special case means that access via char* is pessimized and having a type that points to 8-bit units but isn't assumed to potentially alias random stuff would be a useful characteristic for a type used as the 8-bit analog of char16_t and not as "any raw memory" type.

I don't know if uint8_t* is defined to be free of the special aliasing characteristics of char* or if uint8_t* inherits the same special case, but either way, I don't understand why the aliasing special nature of char* is desirable here.

> Additionally, using uint8_t does not allow for a coherent story for
> abstracting away the (potentially unsafe) interconversion story between code
> units and |char|, as these patches do.  We do not want to use |char| for the
> actual code unit type, because using |char| or |char*| or |const char*| data
> with UTF-8 stuff hazards people mistakenly passing non-UTF-8 data in.  And
> we want a type that makes the UTF-8-ness of the input data very clear. 
> UnicodeString<Utf8Unit> does this.  Hypothetical Span<char> does not, and
> Span<unsigned char> does not

What's the benefit of UnicodeString<Utf8Unit> compared to Span<Utf8Unit>?

> (And quite frankly, I'm not even sure what validation would be -- should
> UTF-8 ban code unit sequences that would encode surrogates?  I dunno. 

It should. We shouldn't be calling something "UTF-8 validation" if it doesn't actually perform UTF-8 validation, and valid UTF-8 is well-defined. (Only shortest forms for encoding valid Unicode scalar values [i.e. code points excluding surrogates] are allowed.)

(If we have a true need to round-trip unpaired surrogates via an UTF-8-like representation, we should use WTF-8 [https://simonsapin.github.io/wtf-8/] precisely and not come up with something else ad hoc. But I very much doubt that WTF-8 is needed here.)

> Fiddling I did in Python suggests it is not necessarily obvious what "valid"
> "UTF-8" is.

It's spec-wise clear, but Python2's UTF-8 decoder is non-conforming. Of the prominent UTF-8 decoders that I tested last year, only Python2's was non-conforming, so Python2's behavior does not represent a universal bug. See https://hsivonen.fi/broken-utf-8/#implementations

> But again -- doesn't matter for now, and if people are really
> upset I can remove the validation stuff in these patches.)

I continue to be of the opinion that C++ code should call into encoding_rs for UTF-8 validation instead of rolling its own.

> However -- this will have the potential of accepting invalid UTF-8 provided
> by JSAPI callers other than Gecko, so SpiderMonkey will have to continue,
> alas, to intertwine some bad-UTF-8 handling in tokenizing.  I have posted
> patches in bug 1434429 that should enable multi-code-unit matching to be
> specialized for UTF-8 in one place with ease and clarity.

Wouldn't it be simpler to introduce two entry points: One that takes valid UTF-8 and defines invalid input as UB and another that validates input and calls the former if the input is valid and returns an error if not? Gecko could use the former and others could use the latter.
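
Roughly (hypothetical names; the real JSAPI entry points would presumably look different):

    #include <cstddef>

    struct JSContext;                                     // stand-in
    bool IsValidUtf8(const char* aUtf8, size_t aLength);  // hypothetical, encoding_rs-backed
    bool ReportInvalidUtf8(JSContext* aCx);               // hypothetical

    // Precondition: aUtf8 points to aLength bytes of *valid* UTF-8; invalid
    // input is undefined behavior.  Gecko, which already gets validated output
    // from the decoder, would call this directly.
    bool CompileUtf8DontValidate(JSContext* aCx, const char* aUtf8, size_t aLength);

    // Checked wrapper for other embedders: validate, then forward.
    inline bool
    CompileUtf8(JSContext* aCx, const char* aUtf8, size_t aLength)
    {
      if (!IsValidUtf8(aUtf8, aLength)) {
        return ReportInvalidUtf8(aCx);
      }
      return CompileUtf8DontValidate(aCx, aUtf8, aLength);
    }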
 
(In reply to Jeff Walden [:Waldo] from comment #23)
> Oh, for what it's worth -- attachment bug 8946879 is IMO a poster child for
> not using uint8_t or Span<uint8_t> here.  There's an existing AtomizeChars
> function, a template function, that's instantiated with either char16_t or
> |JS::Latin1Char == unsigned char|.  In the UTF-16 case, the encoded goo is
> interpreted the usual UTF-16 way (with the addition of mispaired surrogates
> coding the corresponding "code point").  In the Latin-1 case, the encoded
> goo is interpreted byte-by-byte as the corresponding code point.  If I were
> *not* careful about how that code is written, and I happened to miss that
> hazard, it might have been that atomizeChars would be defined
> type-generically as
> 
>     static JSAtom* atomizeChars(JSContext* cx, Span<const CharT> chars) {
>         return AtomizeChars(cx, chars.data(), chars.size());
>     }
> 
> and this would compile to a call, in the hypothetical UTF-8 case, to
> AtomizeChars(JSContext*, JS::Latin1Char*, size_t) -- very much *not* what is
> desired.  Avoiding this sort of lapse is exactly why I do not want the UTF-8
> unit type to be interconvertible with characters or uint8_t.

Currently, nsCString is used for UTF-8 strings, Latin1 strings and arbitrary binary data, so on the Gecko side this problem is already all over the place. It's good if this problem can be avoided within SpiderMonkey, though.
Comment on attachment 8946842 [details] [diff] [review]
Introduce Utf16.h as a header containing general UTF-16-specific functionality

Review of attachment 8946842 [details] [diff] [review]:
-----------------------------------------------------------------

I am inclined to use encoding_rs for this sort of thing, really.  I think we *do* want validating and non-validating construction of UnicodeString, and we should definitely make the JS engine's uses explicitly non-validating, but we would be better off if we had a single source of truth for all the validation code.

::: mfbt/Utf16.h
@@ +43,5 @@
> + * A valid UTF-16 string contains no mispaired or unpaired UTF-16 surrogates.
> + * The string *may* contain U+0000 NULL code points.
> + */
> +extern MFBT_API bool
> +IsValidUtf16(const char16_t* aCodeUnits, size_t aCount);

Do I understand correctly that this function, as written, is correct, but the JS engine isn't going to care about ever validating UTF-16-ness, because JS?  So we basically have no callers of this function in the patches in this bug as written?  I understand that a validating UnicodeString<char16_t> construction would call this, but AFAICT, such constructions would be rare to nonexistent currently, because the only construction(s) of UnicodeString<char16_t> would come from the JS engine, and those would be non-validating for the reasons above.
Attachment #8946842 - Flags: review?(nfroyd)
Comment on attachment 8946843 [details] [diff] [review]
Add UnicodeString constructors that assert validity of their pointed-to text, for earlier/earliest sanity-checking

Review of attachment 8946843 [details] [diff] [review]:
-----------------------------------------------------------------

I'm sure this would be waaaay too much of a pain in practice, but is it at all feasible to encode validity in the type of the string?  People are going to see `UnicodeString` and, I'd bet, assume that it's valid, particularly for the UTF-8 case.  And at least in the JS engine, that's not going to be the case, right?  And of course it's nicer to be passed things that are guaranteed by the type system, rather than fungible API guarantees.

OTOH, I suppose Rust has declared that it's on the producer to always construct valid UTF-8 strings.  So maybe this is just a possibility you have to live with.

::: mfbt/UnicodeString.h
@@ +63,5 @@
> + *
> + *   UnicodeString<const Utf8Unit> reallyLong
> + *     { reallyLongData, reallyLongDataLength, ContentsTooLarge };
> + *
> + *   UnicodeStrong<char16_t> stringInSuperHotFunction

Nit: `UnicodeString`, yes?

@@ +220,5 @@
> +   * Use this if the string does contain valid data, but the string could be too
> +   * long to assert it in every case.  String validity will only be asserted if
> +   * the string actually *is* "short" at runtime.
> +   */
> +  ContentsTooLarge,

OnlyValidateForSmallStrings?

@@ +227,5 @@
> +   * Use this if the string does contain valid data, but a particular
> +   * construction operation occurs so frequently that asserting validity only in
> +   * debug builds, even *sometimes*, only for *small* strings, is too expensive.
> +   */
> +  ConstructedTooOften,

WDYT about calling this SurroundingCodeIsHot, ValidationWouldBeTooExpensive, or something like that?  Something that clued people in to the context mattering would be better, and might even help the lack of checking get removed if the code happened to be moved around.

::: mfbt/tests/TestUnicodeString.cpp
@@ +176,5 @@
> +
> +  static char validLongBytes[] = "0123456789012345678901234567890123456789";
> +  UnicodeString<Utf8Unit> validLong
> +    { validLongBytes, ArrayLength(validLongBytes) - 1, ContentsTooLarge };
> +  MOZ_RELEASE_ASSERT(IsValidUtf8(validLong.chars(), validLong.size()));

Is it reasonable to write a test that asserts on !IsValidUtf8 when passing ContentsTooLarge to ensure that we really did let things through the validator?

@@ +181,5 @@
> +
> +  static char tooOftenBytes[] = "ohai";
> +  UnicodeString<Utf8Unit> tooOften
> +    { tooOftenBytes, ArrayLength(tooOftenBytes) - 1, ConstructedTooOften };
> +  MOZ_RELEASE_ASSERT(IsValidUtf8(tooOften.chars(), tooOften.size()));

Same question here, but for ConstructedTooOften.

@@ +201,5 @@
> +
> +  static char16_t validShortChars[] = u"
Attachment #8946843 - Flags: review?(nfroyd)
Comment on attachment 8946844 [details] [diff] [review]
Implement CodeUnitIterable as an iterable class that exposes the code units of a UnicodeString, with raw Element* as iterator type

Review of attachment 8946844 [details] [diff] [review]:
-----------------------------------------------------------------

You have a weird line-break in your commit message.

This is an amazingly large amount of code for something that amounts to simple pointer iteration.  Is there any reason that codeUnits() couldn't simply return Span<> or something similar as a reasonable stand-in for "bounds-checked block of memory"?

::: mfbt/UnicodeString.h
@@ +80,5 @@
> + * particularly, it implements the C++ iteration protocol (exposing begin/end
> + * functions for mutating or immutable access in forward or backward directions)
> + * and may be used with ranged-for loops:
> + *
> + *   static const char hexDigitsUnits = u8"0123456789ABCDEF";

Nit: `static const char hexDigitsUnits[]`, surely.

@@ +208,5 @@
> +    aOther.assertZeroReferences();
> +    mUnits = aOther.mUnits;
> +    mSize = aOther.mSize;
> +    MOZ_ASSERT(mRefCount == 0);
> +    MOZ_ASSERT(aOther.mRefCount == 0);

Aren't these redundant with the assertZeroReferences() calls above?
Attachment #8946844 - Flags: review?(nfroyd)
Comment on attachment 8946842 [details] [diff] [review]
Introduce Utf16.h as a header containing general UTF-16-specific functionality

Review of attachment 8946842 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Utf16.h
@@ +43,5 @@
> + * A valid UTF-16 string contains no mispaired or unpaired UTF-16 surrogates.
> + * The string *may* contain U+0000 NULL code points.
> + */
> +extern MFBT_API bool
> +IsValidUtf16(const char16_t* aCodeUnits, size_t aCount);

AFAICT, this code is correct, but I think we shouldn't land it, because it duplicates functionality provided by https://searchfox.org/mozilla-central/source/third_party/rust/encoding_rs/src/mem.rs#1780, which is SIMD-accelerated on x86, x86_64 and aarch64 and will be on ARMv7+NEON.

(Also, I think we shouldn't be adding APIs that take const foo* aPtr and size_t aLength. We should use Span<const foo> instead.)
(In reply to Henri Sivonen (:hsivonen) from comment #24)
> I thought the rules were:
>  0) The compiler has to assume that two pointers to "compatible types" may
> alias if neither is qualified with "restrict".

"restrict" doesn't exist in C++ and is kind of broken in C.  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3988.pdf discusses numerous issues/inadequacies with it.

Past that, there's no guarantee about how "compatible types" may alias -- unless you define "compatible" as same within cv-qualifications, or same ignoring cv-qualifications and signedness if you're willing to use implementation-dependent behavior.  In particular, same size and same alignment guarantee you nothing.  |char| and |uint8_t| relate only in terms of same size/alignment, so you cannot alias them in both directions with guaranteed behavior.  (You can access |uint8_t| as |char| because you can access anything using |char|, but not the other way -- meaning interpreting anything like |u8"foo"| as |uint8_t| is not possible.)

>  1) Except for char*, the compiler is allowed assume that two pointers to
> incompatible types do not alias even if they aren't qualified with
> "restrict".
>  2) The compiler is allowed to assume that pointers qualified with
> "restrict" do not alias other pointers.
>  3) The compiler has to assume that a char* can alias any
> non-"restrict"-qualified other pointer regardless of the type the other
> pointer points to.

Setting aside "restrict", and setting aside that the spec works in terms of a whitelist of what access is permitted -- you may *only* access the stored value of an object using these exact types -- this roughly checks out.

You *really* need to read the comments in these patches, with direct citation to spec sections, to evaluate this stuff.  Summaries of understandings of rules do not and cannot substitute for precise spec text.

> AFAICT, the char* special case means that access via char* is pessimized and
> having a type that points to 8-bit units but isn't assumed to potentially
> alias random stuff would be a useful characteristic for a type used as the
> 8-bit analog of char16_t and not as "any raw memory" type.

This would be nice.  This is a/the motivation for having a char8_t that can't be accessed except by cv-qualified char8_t (or also by char/unsigned char, because they can be used to access anything).  But that C++ spec proposal is in unusable limbo now and may never go anywhere.  (I also link to/refer to that idea in the patch comments.)

> I don't know if uint8_t* is defined to be free of the special aliasing
> characteristics of char* of if uint8_t* inherits the same special case

uint8_t is not defined such that it's always okay to use it to access unrelated types.  (This is explained in patch comments.)

> I don't understand why the aliasing special nature of char* is
> desirable here.

It isn't.  But if we wish to treat |u8""| or even |""| literals as Utf8Unit, our access type *must* be |char| up to cv-qualifications and signedness, or a struct or union containing such (possibly recursively in other such structs/unions).  Again, the patch comments elaborately discuss the exact representation choice here almost uniquely compelled by C++ language requirements.

> What's the benefit of UnicodeString<Utf8Unit> compared to Span<Utf8Unit>?

Utf8Unit must not interconvert with |char|, so that strings that aren't UTF-8 can't be used in places where only UTF-8 is permitted.  Because interconversion must be forbidden, |Span(Utf8Unit*, size_t)| would require users to individually convert -- when because of how hairy the aliasing concerns are, we *really* want the *only* cast to be hidden away somewhere.  And a Span<Utf8Unit>(char*, size_t) doesn't work because of the non-interconversion requirement.

It's also valuable to have the assertions that UnicodeString contains to effectively runtime borrow-check the input data.  Span has no assertions, and it's not practical to tack them (or UnicodeString's related movability restrictions) on in the short term.

> We shouldn't be calling something "UTF-8 validation" if it
> doesn't actually perform UTF-8 validation, and valid UTF-8 is well-defined.
> (Only shortest forms for encoding valid Unicode scalar values [i.e. code
> points excluding surrogates] are allowed.)

Mm, I wasn't familiar enough with the relevant bits of Unicode/UTF-8 terminology to be sure of this -- and frankly, after spending awhile diving through terms, I probably still have them all a bit mixed up in my head on the point.  Good to know that *something* is affirmatively saying what numbers can be encoded by the various length-N subcomponents of UTF-8.  (I was aware of shortest-form being mandatory -- just not that [0xD800, 0xE000) was actually forbidden.)

> (If we have a true need to round-trip unpaired surrogates via an UTF-8-like
> representation, we should use WTF-8

Yeah, I don't think it's needed here.  Although -- we might have to be careful about allowing encoded surrogates inside JS strings and comments, which do have *some* constraints on their contents, but I don't believe they exclude surrogates.

> Wouldn't it be simpler to introduce two entry points: One that takes valid
> UTF-8 and defines invalid input as UB and another than validates input and
> calls the former if the input is valid and returns an error if not? Gecko
> could use the former and others could use the latter.

"Simpler" is not clear to me.  The scheme you suggest does seem undoubtedly more fragile, tho, in that writing our algorithms to *only* work with presumptively valid input is sort of disabling all the safeties.  Without type support to *enforce* an input validity check, that's a big risk.  And if we were to do your Span thing, we certainly couldn't guarantee validity.

I dunno.  Maybe that risk is okay.  At least for the first pass, tho, extra caution seems best.
(In reply to Jeff Walden [:Waldo] from comment #29)
> In particular, same size and same
> alignment guarantee you nothing.  |char| and |uint8_t| only relate only in
> terms of same size/alignment, so you cannot alias them in both directions
> with guaranteed behavior.  (You can access |uint8_t| as |char| because you
> can access anything using |char|, but not the other way -- meaning
> interpreting anything like |u8"foo"| as |uint8_t| is not possible.)

We don't need to consider implementations where a char isn't 8 bits, since we'd have massive failures in such a case anyway.

In C++, std::uint8_t is an unsigned integer type, the same as the C standard library type uint8_t. C11 says that uint8_t is a typedef (as opposed to a type on its own). C11 says that the typedefs may denote extended integer types.

For uint8_t to be wrong, it would have to be a typedef for an extended integer type instead of being a typedef for a standard integer type.

1.5 under [conv.rank] implies that extended integer types of the same size as a standard integer type are permitted.

I guess I have to concede that the specs leave open the possibility of uint8_t being a typedef for an extended integer type.

As a practical matter, though, I'm inclined to think this issue is similar to non-two's complement signed integers and sizeof(bool) not being 1. We seem to have plenty of code that assumes that char* or unsigned char* can be reinterpreted as uint8_t*.

> You *really* need to read the comments in these patches, with direct
> citation to spec sections, to evaluate this stuff.

Sorry about not doing that earlier.

> > What's the benefit of UnicodeString<Utf8Unit> compared to Span<Utf8Unit>?
> 
> Utf8Unit must not interconvert with |char|, so that strings that aren't
> UTF-8 can't be used in places where only UTF-8 is permitted.  Because
> interconversion must be forbidden, |Span(Utf8Unit*, size_t)| would require
> users to individually convert -- when because of how hairy the aliasing
> concerns are, we *really* want the *only* cast to be hidden away somewhere. 
> And a Span<Utf8Unit>(char*, size_t) doesn't work because of the
> non-interconversion requirement.

Can't we hide the cast away in
Span<Utf8Unit> MakeUTF8Span(char*, size_t) // Like MakeSpan but different return type
and
Span<Utf8Unit> MakeUTF8StringSpan(char*) // Like MakeStringSpan but different return type
(plus const versions of these)?
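
A minimal sketch of those helpers, assuming the Utf8Unit type proposed in this bug (the point being that the reinterpret_cast lives in exactly one place):

    #include <cstddef>
    #include <cstring>

    #include "mozilla/Span.h"
    #include "mozilla/Utf8.h"   // the proposed header with Utf8Unit

    mozilla::Span<mozilla::Utf8Unit>
    MakeUTF8Span(char* aChars, size_t aLength)
    {
      // The single, centralized char* -> Utf8Unit* cast.
      return mozilla::MakeSpan(reinterpret_cast<mozilla::Utf8Unit*>(aChars), aLength);
    }

    mozilla::Span<mozilla::Utf8Unit>
    MakeUTF8StringSpan(char* aZeroTerminated)
    {
      return MakeUTF8Span(aZeroTerminated, std::strlen(aZeroTerminated));
    }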

> It's also valuable to have the assertions that UnicodeString contains to
> effectively runtime borrow-check the input data.  Span has no assertions,
> and it's not practical to tack them (or unicodeString's related movability
> restrictions) on in the short term.

Why isn't it practical?

I'm still unconvinced that we need UnicodeString in addition to Span when it seems that UnicodeString provides debug-only counting of outstanding iterators, which on the face of things would seem like a fine addition to Span, and doesn't provide a Unicode validity guarantee.

> > We shouldn't be calling something "UTF-8 validation" if it
> > doesn't actually perform UTF-8 validation, and valid UTF-8 is well-defined.
> > (Only shortest forms for encoding valid Unicode scalar values [i.e. code
> > points excluding surrogates] are allowed.)
> 
> Mm, I wasn't familiar enough with the relevant bits of Unicode/UTF-8
> terminology to be sure of this -- and frankly, after spending awhile diving
> through terms, I probably still have them all a bit mixed up in my head on
> the point.  Good to know that *something* is affirmatively saying what
> numbers can be encoded by the various length-N subcomponents of UTF-8.

Unicode 10.0 D92 covers this: "Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed." https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf

RFC 3629 has this is in BNF and says "An octet sequence is valid UTF-8 only if it matches the following syntax". https://tools.ietf.org/html/rfc3629#section-4

Applying the WHATWG UTF-8 decoder algorithm has the same effect. https://encoding.spec.whatwg.org/#utf-8-decoder

> > Wouldn't it be simpler to introduce two entry points: One that takes valid
> > UTF-8 and defines invalid input as UB and another than validates input and
> > calls the former if the input is valid and returns an error if not? Gecko
> > could use the former and others could use the latter.
> 
> "Simpler" is not clear to me.  The scheme you suggest does seem undoubtedly
> more fragile, tho, in that writing our algorithms to *only* work with
> presumptively valid input is sort of disabling all the safeties.  Without
> type support to *enforce* an input validity check, that's a big risk.  And
> if we were to do your Span thing, we certainly couldn't guarantee validity.
> 
> I dunno.  Maybe that risk is okay.  At least for the first pass, tho, extra
> caution seems best.

Even Rust, despite being big on safety, has (unsafe) std::str::from_utf8_unchecked() which doesn't perform the validity check in addition to the usual (safe) std::str::from_utf8(), which checks validity.

If on one line we obtain JavaScript source as UTF-8 from encoding_rs and on the next line we pass it to the JS engine, it seems wasteful and unnecessary to run UTF-8 validation again inside the JS engine and we should have a way to assert to the JS engine that we just got the data from a component that guarantees validity even though, nsACString being the way it is, we lack the type system means to communicate that.
(In reply to Henri Sivonen (:hsivonen) from comment #30)
> As a practical matter, though, I'm inclined to think this issue is similar
> to non-two's complement signed integers and sizeof(bool) not being 1. We
> seem to have plenty of code that assumes that char* or unsigned char* can be
> reinterpreted as uint8_t*.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110#c13 for an argument from mangling compatibility to trust that the compilers we care about won't make uint8_t an extended integer type.
(In reply to Henri Sivonen (:hsivonen) from comment #30)
> As a practical matter, though, I'm inclined to think this issue is similar
> to non-two's complement signed integers and sizeof(bool) not being 1.

Let me put it this way.  We argue that web developers should first write code against the specs, then adapt it for browser-specific bugs.  I do the same thing here -- write code against what the specs allow, then if necessary (I haven't seen it here) adapt for compiler-specific bugs.  Starting from what our supported compilers do -- now -- hazards breakage from inadvertent misuse (by not approaching the problem systematically) in the future.  Even if a technique is "likely" to keep working for reasons outside the language, it remains no guarantee.

> Can't we hide the cast away in
> Span<Utf8Unit> MakeUTF8Span(char*, size_t) // Like MakeSpan but different
> return type
> and
> Span<Utf8Unit> MakeUTF8StringSpan(char*) // Like MakeStringSpan but
> different return type
> (plus const versions of these)?

That can be done.  But I'm trying to take a longer-term view, laying a foundation to incrementally build up to always-valid UTF-8 strings.  (Hence the patch comments, among others, about code point iteration not *yet* being supported.)

I can rename to UnicodeData or similar if that helps with not implying validity (and in fact, thinking about it, that's probably a good idea regardless).

> Why isn't it practical [to use Span]?

There are too many mozilla::Span users to quickly audit them for borrow-correctness, a prerequisite to adding borrow-assertions.  Given borrow-correctness forbids some safe code, some users *will* have to change.  And the time required to properly audit (relying on in-the-field assertions seems highly unwise) is too great for the time scale I want to work on for UTF-8 script parsing.

If we want to add borrow-checking to Span, fine -- but it's stop-energy to put that in front of having a specialized data structure for working with Unicode code unit data.  The prerequisite conversion should not stand in the way of this work.

> If on one line we obtain JavaScript source as UTF-8 from encoding_rs and on
> the next line we pass it to the JS engine, it seems wasteful and unnecessary
> to run UTF-8 validation again inside the JS engine

Tokenizing ASCII units right now has zero cost (our JS tokenizing has an ASCII fast-path they'd fall into), so the marginal cost of as-you-go validation, for now, is unlikely to exceed the cost of preflight mass-validation.  And lots of UTF-8 script is fully ASCII.  Maybe prevalidating will matter if ever lots of non-ASCII scripts are out there *and* contain significant amounts of non-ASCII.  But preflight-validating mostly-ASCII scripts now is almost never likely to pay off.

> we should have a way
> to assert to the JS engine that we just got the data from a component that
> guarantees validity

Eventually, yes.
(In reply to Jeff Walden [:Waldo] from comment #32)
> (In reply to Henri Sivonen (:hsivonen) from comment #30)
> > As a practical matter, though, I'm inclined to think this issue is similar
> > to non-two's complement signed integers and sizeof(bool) not being 1.
> 
> Let me put it this way.  We argue that web developers should first write
> code against the specs, then adapt it for browser-specific bugs.  I do the
> same thing here -- write code against what the specs allow, then if
> necessary (I haven't seen it here) adapt for compiler-specific bugs. 
> Starting from what our supported compilers do -- now -- hazards breakage
> from inadvertent misuse (by not approaching the problem systematically) in
> the future.  Even if a technique is "likely" to keep working for reasons
> outside the language, it remains no guarantee.

The analogy to Web devs doesn't work, because C and C++ specs are fundamentally less grounded in practical reality than Web specs and Web specs avoid implementation-defined behavior (and undefined behavior). For example, by C11 (I didn't bother checking C++), it's not guaranteed that char16_t has UTF-16 semantics and char32_t has UTF-32 semantics (as processed by the standard library). That's even less reasonable than two's complement not being guaranteed for integers. (At the time the non-guarantee of two's complement was introduced, there was existence proof of one's complement hardware, but at the time char16_t and char32_t were introduced, there was no replacement for Unicode even on the horizon.) So with C and C++, *some* level of reliance on implementation-defined behavior of relevant implementations is always needed.

I agree that it's a bad idea to rely on undefined behavior that works now, but in this case, it's implementation-defined behavior whether it's undefined behavior. (It being implementation-defined behavior whether something is undefined behavior is probably the most C/C++ thing.)

Anyway, it's probably not productive to discuss this further.

> But I'm trying to take a longer-term view, laying a
> foundation to incrementally build up to always-valid UTF-8 strings.

If these types signified validity, they'd make sense to have in addition to Span. However, I think it's a problem to leave the validity guarantee as a later TODO. That's the sort of thing that's particularly hard to add later.

If these don't guarantee validity from the start, chances are that they'll just end up as another way to do Span. (Range being the first, but the addition of Span was justified by it having the same component representation as Rust's slices and, therefore, there being no arithmetic in going from Span to Rust slice, and by the pointer and length pattern being generally more common in our code than pointer and pointer past end.)

> There are too many mozilla::Span users to quickly audit them for
> borrow-correctness, a prerequisite to adding borrow-assertions. 

I disagree. I just had a quick look at all our uses of mozilla::Span, and (outside of tests for Span itself) I found one use of Span iterators:
https://searchfox.org/mozilla-central/source/parser/html/nsHtml5Portability.cpp#19

I may have missed another, but the number of uses of Span iterators in the codebase is on the order of at most three (most likely just the one).

> Given
> borrow-correctness forbids some safe code, some users *will* have to change.
> And the time required to properly audit (relying on in-the-field assertions
> seems highly unwise) is too great for the time scale I want to work on for
> UTF-8 script parsing.
> 
> If we want to add borrow-checking to Span, fine -- but it's stop-energy to
> put that in front of having a specialized data structure for working with
> Unicode code unit data.  The prerequisite conversion should not stand in the
> way of this work.

I'd agree that what I say would be stop energy if the level of effort to audit (and change) existing Span usage was truly a lot greater than writing UnicodeString itself, but AFAICT, changing Span would be a smaller thing than writing a second implementation thereof, i.e. UnicodeString, with borrow checking from scratch. Of course, by now writing the second implementation is sunk cost, but that's not a good reason for duplicate facilities.

Also, if there's a time scale concern, it seems a bit odd to put in the effort to write a new type with borrow checking. I'm not saying that borrow checking is unimportant, but as far as justifying new types under schedule pressure go, it seems to me that guaranteeing UTF validity should be a higher priority if guaranteeing UTF validity is what we want. (But based on what you say below, it seems that UTF validity indeed isn't an essential prerequisite for the UTF-8 JS parser work.)

> > If on one line we obtain JavaScript source as UTF-8 from encoding_rs and on
> > the next line we pass it to the JS engine, it seems wasteful and unnecessary
> > to run UTF-8 validation again inside the JS engine
> 
> Tokenizing ASCII units right now has zero cost (our JS tokenizing has an
> ASCII fast-path they'd fall into), so the marginal cost of as-you-go
> validation, for now, is unlikely to exceed the cost of preflight
> mass-validation.  And lots of UTF-8 script is fully ASCII.  Maybe
> prevalidating will matter if ever lots of non-ASCII scripts are out there
> *and* contain signficant amounts of non-ASCII.  But preflight-validating
> mostly-ASCII scripts now is almost never likely to pay off.

Ah, if you don't intend to call your UTF-8 bulk validation function, that's fine then.
Comment on attachment 8946846 [details] [diff] [review]
Implement LiteralUnicodeString for creating UnicodeString<const Element> around string literals

Review of attachment 8946846 [details] [diff] [review]:
-----------------------------------------------------------------

If we were going to do this (it's not clear from the discussion above), this seems non-controversial.
Attachment #8946846 - Flags: review?(nfroyd) → review+
Comment on attachment 8946859 [details] [diff] [review]
Make TokenStreamChars::atomizeChars take a UnicodeString, not a raw pointer/length

Review of attachment 8946859 [details] [diff] [review]:
-----------------------------------------------------------------

Yup.
Attachment #8946859 - Flags: review?(nfroyd) → review+
Comment on attachment 8946879 [details] [diff] [review]
Implement TokenStreamCharsBase<Utf8Unit>::atomizeChars

Review of attachment 8946879 [details] [diff] [review]:
-----------------------------------------------------------------

Sure.
Attachment #8946879 - Flags: review?(nfroyd) → review+
If you plan on adding iterators by scalar value, it's probably worthwhile to grab the code from my new implementations of UTF8CharEnumerator::NextChar() and UTF16CharEnumerator::NextChar() in
https://reviewboard.mozilla.org/r/229426/diff/1#index_header

The former is updated to match WHATWG-required / Unicode-recommended U+FFFD generation. The latter is just more compact than the old code.
Comment on attachment 8946835 [details] [diff] [review]
Introduce a new mfbt/Utf8.h header for UTF-8-related functionality, including a UTF-8 code unit type that is compatible with, but doesn't directly interconvert with, |char|

I'm going to cancel the outstanding reviews for the time being; they've been sitting in my queue for a while.  My current mental model of the state of this bug is that the ball is in Waldo's court to work things out with hsivonen's input or to negotiate a compromise position with hsivonen.  If that's *not* the case, please re-flag for review.

(I think moving the UTF8-ness testing to MFBT is somewhat uncontroversial, as it could be retrofitted with encoding_rs support.  I'm less clear on the UTF8Unit stuff.)
Attachment #8946835 - Flags: review?(nfroyd)
Comment on attachment 8946836 [details] [diff] [review]
Allow Utf8Unit to be compared against |char| values

Canceling as per comment 38.  I'm a) skeptical that you want to compare arbitrary Utf8Units with chars (this seems like a great footgun), and b) thinking you probably (?) want char16/32_t overloads for these as well.
Attachment #8946836 - Flags: review?(nfroyd)
Comment on attachment 8946837 [details] [diff] [review]
Add a string type for a sequence of UTF-8/16 code units, designed for easy use with C++ char data (including literals and strings)

Canceling as per comment 38.  I think this would be r+ modulo the discussion around Utf8Unit.
Attachment #8946837 - Flags: review?(nfroyd)
Comment on attachment 8946848 [details] [diff] [review]
Implement CodeUnitIterator as an iterator class for use in iterating code units, that implements proper in-bounds assertions and asserts that only iterators derived from the same string are compared

Canceling as per comment 38.  I think this is basically OK, though; one small nit is making CodeUnitIteratorBase::mString be a reference, rather than a pointer.
Attachment #8946848 - Flags: review?(nfroyd)
I've been plodding away at other things in the rough vein of following up on this -- moving functions around in template hierarchy so there's a coherent list of things demanding specialization once this is resolved, that sort of thing.  Mostly because of a sense of ennui about arguing too much here.  :-\  (It is good to argue, and it is worth arguing about these things.  It's also a little sapping.)

But I think I'm approaching the end of that process, so back we come to here.  I may have a fresh/updated series of patches to post here, somewhat soonish.

(In reply to Nathan Froyd [:froydnj] from comment #25)
> I am inclined to use encoding_rs for this sort of thing, really...we would
> be better off if we had a single source of truth for all the validation
> code.

Fair enough.  And with JS's quirks here, it just doesn't matter for "UTF-16".

I ripped this stuff out locally and renamed the class to UnicodeData -- didn't remove the UTF-8 centralization because having a function signature in mfbt will encourage centralization, tho.

(In reply to Nathan Froyd [:froydnj] from comment #26)
> is it at all feasible to encode validity in the type of the string?  People
> are going to see `UnicodeString` and, I'd bet, assume that it's valid,
> particularly for the UTF-8 case.

Fair.  Given JS doesn't have valid UTF-16 data, and maybe not even valid UTF-8 data depending how the JSAPI is defined, I reverted to UnicodeData so people won't assume.  We can readd validation goo when someone would use it.

(In reply to Nathan Froyd [:froydnj] from comment #27)
> Is there any reason that codeUnits() couldn't
> simply return Span<> or something similar as a reasonable stand-in for
> "bounds-checked block of memory"?

As noted in comment 32, Span<> lacks borrow checking, and I would like to incrementally improve this code -- which currently uses raw pointers all over -- to have runtime borrow-detection.  And, I am not sure that adding borrow-checking asserts to Span and making all users happy is a quick fix.

But a fresh point just occurred to me.  Even supposing all Span users were fixed, that wouldn't be enough.  Span is binary-compatible with Rust slices right now.  We depend on this, if code comments are any indication.  I see no way we could add borrow-asserts to Span, yet still be binary-compatible with Rust.

(In reply to Henri Sivonen (:hsivonen) from comment #33)
> I just had a quick look at all our uses of mozilla::Span, and (outside
> of tests for Span itself) I found one use of Span iterators:
> https://searchfox.org/mozilla-central/source/parser/html/nsHtml5Portability.cpp#19

I was using Searchfox and going by its reports of considerable mozilla::Span usage.  However, searching now for Span::{,reverse_,const_,const_reverse_}iterator uses, again I get nothing.  Not even the one hit you point out.  Nor does searching for Span::begin find your hit.  I don't know what's up here.  Conceivably Searchfox doesn't understand ranged-for loops' desugaring?  :-(

But in any event, changing Span means breaking binary compatibility with Rust, which seems like a dealbreaker.

> AFAICT, changing Span would be a smaller thing
> than writing a second implementation thereof, i.e. UnicodeString, with
> borrow checking from scratch.

I don't think these are quite duplicate facilities.  String handling code routinely and sensibly introduces string-specific types, and while this isn't precisely strings, it's very much in the same vein.

I'm very leery of over-using one thing for too many concepts.  Unicode data is IMO a different matter from random ranges of data.  Particularly when it needs to translate |char| to |Utf8Unit| or so, and also sometimes needs to offer |char|-based access to data that as a general rule we only want to expose as |Utf8Unit|.
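
(For concreteness, here's a rough sketch -- deliberately not the actual mfbt/Utf8.h definition, and with an invented name -- of the sort of code-unit wrapper being discussed: conversion to and from |char| exists, but only explicitly, so raw narrow strings can't silently flow into UTF-8-expecting APIs.)

    #include <cstdint>

    class Utf8UnitSketch {
      unsigned char mValue;

     public:
      explicit constexpr Utf8UnitSketch(char aUnit)
        : mValue(static_cast<unsigned char>(aUnit)) {}
      explicit constexpr Utf8UnitSketch(unsigned char aUnit) : mValue(aUnit) {}

      // |char|-based access to the data is offered, but only on explicit request.
      constexpr char toChar() const { return static_cast<char>(mValue); }
      constexpr uint8_t toUint8() const { return mValue; }

      constexpr bool operator==(const Utf8UnitSketch& aOther) const {
        return mValue == aOther.mValue;
      }
    };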
(In reply to Jeff Walden [:Waldo] from comment #42)
> But a fresh point just occurred to me.  Even supposing all Span users were
> fixed, that wouldn't be enough.  Span is binary-compatible with Rust slices
> right now.  We depend on this, if code comments are any indication.  I see
> no way we could add borrow-asserts to Span, yet still be binary-compatible
> with Rust.

This isn't a problem.

FFI is C, so we don't pass Span over the FFI, and I'm not aware of plans to extend the FFI to allow non-C-like objects to cross even if actually layout-compatible.

We pass the pointer and the length as distinct arguments over the FFI and put them back together on the other side of the FFI. The compatibility feature is that mozilla::Span's .Elements() and .Length() return values that you can pass to ::std::slice::from_raw_parts() without further checks, while the same is not true of gsl::span.
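
(A small illustration of that pattern, with a made-up extern function name; the Rust side would rebuild the slice with ::std::slice::from_raw_parts.)

    #include <cstddef>
    #include <cstdint>
    #include "mozilla/Span.h"

    // Hypothetical Rust function exported with a plain C signature.
    extern "C" void example_rust_consume(const uint8_t* aPtr, size_t aLen);

    void PassSpanOverFfi(mozilla::Span<const uint8_t> aSpan) {
      // No Span object crosses the boundary; only a C-compatible (ptr, len)
      // pair does, which from_raw_parts() accepts without further checks.
      example_rust_consume(aSpan.Elements(), aSpan.Length());
    }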

> (In reply to Henri Sivonen (:hsivonen) from comment #33)
> > I just had a quick look at all our uses of mozilla::Span, and (outside
> > of tests for Span itself) I found one use of Span iterators:
> > https://searchfox.org/mozilla-central/source/parser/html/nsHtml5Portability.cpp#19
> 
> I was using Searchfox and going by its reports of considerable mozilla::Span
> usage.  However, searching now for
> Span::{,reverse_,const_,const_reverse_}iterator uses, again I get nothing. 
> Not even the one hit you point out.  Nor does searching for Span::begin find
> your hit.  I don't know what's up here.  Conceivably Searchfox doesn't
> understand ranged-for loops' desugaring?  :-(

I think it doesn't understand desugaring for search. I looked at occurrences of Span itself and took a quick look at what they did.

> > AFAICT, changing Span would be a smaller thing
> > than writing a second implementation thereof, i.e. UnicodeString, with
> > borrow checking from scratch.
> 
> I don't think these are quite duplicate facilities.  String handling code
> routinely and sensibly introduces string-specific types, and while this
> isn't precisely strings, it's very much in the same vein.

I guess we have different views on this topic. For example, I think it's anti-useful that std::u16string_view exists separately from gsl::span<const char16_t>. (Not worth going against in the standard-lib context, but I'd rather we didn't have the same dichotomy unless one of them guarantees UTF validity.) OTOH, in Rust, there have been what I think are legitimate requests to give byte slices some ASCII-oriented operations.
(In reply to Henri Sivonen (:hsivonen) from comment #43)
> (In reply to Jeff Walden [:Waldo] from comment #42)
> > But a fresh point just occurred to me.  Even supposing all Span users were
> > fixed, that wouldn't be enough.  Span is binary-compatible with Rust slices
> > right now.  We depend on this, if code comments are any indication.  I see
> > no way we could add borrow-asserts to Span, yet still be binary-compatible
> > with Rust.
> 
> This isn't a problem.
> 
> FFI is C, so we don't pass Span over the FFI, and I'm not aware of plans to
> extend the FFI to allow non-C-like objects to cross even if actually
> layout-compatible.

Furthermore, mozilla::Span isn't actually layout-compatible with Rust slices already in the case where mozilla::Span is in the mode where its length goes into the C++ type system at compile time. (Not sure if we actually have any uses of that mode in the tree, though.)
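
(Illustrative only; the exact constructor forms may differ from the real mozilla::Span. With a static extent the length is part of the type rather than of the object, so the object layout isn't the (pointer, length) pair of a Rust slice header.)

    char buf[16];
    mozilla::Span<char, 16> fixedExtent(buf);    // length known at compile time
    mozilla::Span<char> dynamicExtent(buf, 16);  // length stored in the object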
So, following up on the summit: hsivonen agreed on adding Utf8Unit as a type, and I agreed to look at retrofitting runtime borrow-asserts onto Span.

For now I'd like to get the most minimal part of this landed -- Utf8Unit -- so that I can do specialization work that requires the type's existence.  Getting this minimal bit in place gives a lot of runway to do stuff, without immediately needing a container/iterator interface.
Attachment #8988642 - Flags: review?(nfroyd)
Attachment #8946835 - Attachment is obsolete: true
Blocks: 1472066
Comment on attachment 8988642 [details] [diff] [review]
Introduce a new mfbt/Utf8.h header for UTF-8-related functionality, including a UTF-8 code unit type that is compatible with, but doesn't directly interconvert with, |char|

Review of attachment 8988642 [details] [diff] [review]:
-----------------------------------------------------------------

I believe I was OK with this before; thanks for talking to hsivonen about this.

r=me with some more extensive tests added.

::: mfbt/Utf8.cpp
@@ +10,5 @@
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +MFBT_API bool
> +mozilla::IsValidUtf8(const void* aCodeUnits, size_t aCount)

Did the summit come to any consensus on using encoding_rs for this sort of thing?

::: mfbt/tests/TestUtf8.cpp
@@ +45,5 @@
> +  static const char endNonAsciiBytes[] = u8"Life is like a
Attachment #8988642 - Flags: review?(nfroyd) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d7fcfaa2c82d
Introduce a new mfbt/Utf8.h header for UTF-8-related functionality, including a UTF-8 code unit type that is compatible with, but doesn't directly interconvert with, |char|.  r=froydnj
Keywords: leave-open
Landed with tests added for 1/2/3/4-byte encodings, surrogates and boundary conditions, overlong code points, and too-large code points.

I think we decided encoding_rs is a reasonable thing where we have clear places to use it.  But since we're not prevalidating everything -- in practice there's already an is-ASCII check imposed on nearly every code unit anyway -- bulk validation isn't important right now.  Also there are some issues with using it until we have cross-language inlining in flight.  Once the dust settles on a bunch of this it shouldn't be too hard to retrofit things.
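
(For reference, a few concrete byte sequences of the kinds those tests exercise -- illustrative arrays, not copied from TestUtf8.cpp.)

    static const unsigned char overlongNul[] = { 0xC0, 0x80 };           // overlong encoding of U+0000
    static const unsigned char surrogate[] = { 0xED, 0xA0, 0x80 };       // would decode to U+D800
    static const unsigned char tooLarge[] = { 0xF4, 0x90, 0x80, 0x80 };  // would decode to U+110000
    static const unsigned char maxValid[] = { 0xF4, 0x8F, 0xBF, 0xBF };  // U+10FFFF, the boundary case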
PREEMPTIVE WARNING: This patch looks big, but the vast majority of it is tests.  :-)

Most JS will just be handled as ASCII when tokenizing a code unit at a time, but the non-ASCII case requires the ability to decode a code point given an already-consumed non-ASCII lead code unit.  For example, the JS tokenizer includes this function:

    /**
     * Given a just-consumed non-ASCII code unit |lead|, consume a full code
     * point or LineTerminatorSequence (normalizing it to '\n') and store it
     * in |*codePoint|.  Return true on success, otherwise return false and
     * leave |*codePoint| undefined on failure.
     *
     * If a LineTerminatorSequence was consumed, also update line/column info.
     *
     * This may change the current |sourceUnits| offset.
     */
    MOZ_MUST_USE bool getNonAsciiCodePoint(int32_t lead, int32_t* cp);

which requires exactly this capability.  Certain other functionality will as well: copying template literal raw UTF-8 contents into a UTF-16 buffer while normalizing '\r' and "\r\n" to '\n', a getNonAsciiCodePoint that *doesn't* normalize, and a few others.   (And some of these operate upon js/src/frontend/TokenStream.h's SourceUnits data, which means our iterators must be templates -- at least if we want iterators that will assert bounds-checking so that SourceUnits data can't be overrun.)

This patch reworks the existing IsValidUtf8 algorithm -- with readability cleanups using binary literals and digit separators -- into a template function with user-configurable behavior in case of error.  IsValidUtf8 doesn't require user-customizable behavior, but some of the functions mentioned above will.  In particular, my patchwork uses configurable error notification to provide these messages for UTF-8 errors:

> +// UTF-8 source text encoding errors
> +MSG_DEF(JSMSG_BAD_LEADING_UTF8_UNIT,   1, JSEXN_SYNTAXERR, "{0} byte doesn't begin a valid UTF-8 code point")
> +MSG_DEF(JSMSG_NOT_ENOUGH_CODE_UNITS,   4, JSEXN_SYNTAXERR, "{0} byte in UTF-8 must be followed by {1} bytes, but {2} byte{3} present")
> 
> where {1} is "1"/"2"/"3", {2} is "0"/"1"/"2", and {3} is "was" or "s were" as appropriate
> 
> +MSG_DEF(JSMSG_BAD_TRAILING_UTF8_UNIT,  1, JSEXN_SYNTAXERR, "bad trailing UTF-8 byte {0} doesn't match the pattern 0b10xxxxxx")
> +MSG_DEF(JSMSG_FORBIDDEN_UTF8_CODE_POINT,2,JSEXN_SYNTAXERR, "{0} isn't a valid code point because {1}")
> 
> where {1} is "it's a UTF-16 surrogate", "the maximum code point is U+10FFFF", or "it wasn't encoded in shortest possible form"

and all these errors will have this additional note, containing pertinent data, tacked onto them:

> +MSG_DEF(JSMSG_BAD_CODE_UNITS,          1, JSEXN_NOTE, "the code units comprising this invalid code point were: {0}")
> 
> where {0} is something like "0xHH 0xHH 0xHH 0xHH" (or shorter, as necessary)

As far as API fussing goes -- other ways this could have been, or could still be, designed:

Arguably this function could start with an un-advanced iterator and advance it only once the entire code point is validated, but this fits very poorly with JS, where we ubiquitously tokenize by |getCodeUnit()| that gets a lead unit *and* advances.  Changing to always peek is a pretty widespread task, and it would force the ASCII code path to manually advance everywhere, which IMO increases complexity and chance of error everywhere.

Also we could still imagine only modifying |*aIter| on failure, or only once at the end of the function.  But the SourceUnits mentioned above doesn't have generalized peek-N-ahead functionality now, and I'm loath to add such just for this.

I'll post a couple illustrative patches for how this functionality is used in JS, to demonstrate how I make use of its flexibility.

As far as making this use encoding_rs: whenever encoding_rs supports the flexibility this function provides, I'm happy to replace the implementation.  But looking now, it appears to me that encoding_rs doesn't really support decoding a single code point (mostly just bulk conversions), and it doesn't expose the nature of decoding errors in enough detail to support error messages like those above.  I'm happy to switch over when it's possible, but it doesn't look possible right now.
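
(To make the flexibility being described concrete, here's a self-contained sketch of decoding a code point after an already-consumed non-ASCII lead unit.  This is not the mfbt/Utf8.h implementation: the single error functor stands in for the several separate notification functors the real function takes, and all names here are illustrative.)

    #include <cstdint>

    template <typename OnError>
    bool DecodeAfterLeadUnitSketch(uint8_t aLead, const uint8_t** aIter,
                                   const uint8_t* aEnd, char32_t* aOutCodePoint,
                                   OnError aOnError) {
      // How many trailing bytes the lead unit promises, and the minimum code
      // point a sequence of that length may encode (shortest-form check).
      uint8_t trailing;
      char32_t minimum;
      if ((aLead & 0b1110'0000) == 0b1100'0000) { trailing = 1; minimum = 0x80; }
      else if ((aLead & 0b1111'0000) == 0b1110'0000) { trailing = 2; minimum = 0x800; }
      else if ((aLead & 0b1111'1000) == 0b1111'0000) { trailing = 3; minimum = 0x10000; }
      else { aOnError("bad lead unit"); return false; }

      if (aEnd - *aIter < trailing) { aOnError("not enough units"); return false; }

      char32_t cp = aLead & (0b0011'1111 >> trailing);
      for (uint8_t i = 0; i < trailing; i++) {
        uint8_t unit = (*aIter)[i];
        if ((unit & 0b1100'0000) != 0b1000'0000) { aOnError("bad trailing unit"); return false; }
        cp = (cp << 6) | (unit & 0b0011'1111);
      }

      if (cp < minimum) { aOnError("not shortest form"); return false; }
      if (0xD800 <= cp && cp <= 0xDFFF) { aOnError("surrogate"); return false; }
      if (cp > 0x10FFFF) { aOnError("above U+10FFFF"); return false; }

      *aIter += trailing;  // this sketch advances only on success
      *aOutCodePoint = cp;
      return true;
    }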
Attachment #8992386 - Flags: review?(nfroyd)
Comment on attachment 8992386 [details] [diff] [review]
Abstract out mozilla::DecodeOneUtf8CodePoint for decoding a non-ASCII UTF-8 code point

Review of attachment 8992386 [details] [diff] [review]:
-----------------------------------------------------------------

This patch looks quite reasonable.
Attachment #8992386 - Flags: review?(nfroyd) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8258ce540165
Abstract out mozilla::DecodeOneUtf8CodePoint for decoding a UTF-8 code point after having consumed a non-ASCII lead unit, with configurable error notification through optional user-provided functors.  r=froydnj
(In reply to Pulsebot from comment #52)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/8258ce540165
> Abstract out mozilla::DecodeOneUtf8CodePoint for decoding a UTF-8 code point
> after having consumed a non-ASCII lead unit, with configurable error
> notification through optional user-provided functors.  r=froydnj

Are there use cases for the customizable error notifications beyond being able to emit different error messages in the different cases? Is it truly important to emit different error messages for different kinds of UTF-8 errors? Would it work for SpiderMonkey if we moved the simpler UTF8CharEnumerator::NextChar() API to MFBT? https://hg.mozilla.org/mozilla-central/file/8546719c58dc/xpcom/string/nsUTF8Utils.h#l92

Note that UTF8CharEnumerator::NextChar() generates the number of errors required by the WHATWG Encoding Standard and recommended by the Unicode standard. I didn't review mfbt/Utf8.h properly for spec compliance, but the explicit check for non-shortest forms strongly hints at its behavior perhaps not being the behavior required by the WHATWG Encoding Standard and recommended by Unicode, since the WHATWG-required behavior arises best by pattern-matching the bytes to Table 3-7 from the Unicode standard and not checking the scalar value after computing it.
Flags: needinfo?(jwalden+bmo)
(In reply to Henri Sivonen (:hsivonen) from comment #54)
> Is it truly important to emit different error messages for different kinds
> of UTF-8 errors?

I note that we don't do this for HTML or CSS.

> Note that UTF8CharEnumerator::NextChar() generates the number of errors
> required by the WHATWG Encoding Standard and recommended by the Unicode
> standard.

Tested at https://hg.mozilla.org/mozilla-central/file/8546719c58dc/xpcom/tests/gtest/TestUTF.cpp#l176

The leave-open keyword is there and there is no activity for 6 months.
:froydnj, maybe it's time to close this bug?

Flags: needinfo?(nfroyd)

(In reply to Release mgmt bot [:sylvestre / :calixte] from comment #56)
> The leave-open keyword is there and there is no activity for 6 months.
> :froydnj, maybe it's time to close this bug?

No, there is still work to be done here.

Flags: needinfo?(nfroyd)

The leave-open keyword is there and there is no activity for 6 months.
:froydnj, maybe it's time to close this bug?

Flags: needinfo?(nfroyd)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #58)
> The leave-open keyword is there and there is no activity for 6 months.
> :froydnj, maybe it's time to close this bug?

Still a valid bug.

Flags: needinfo?(nfroyd)

The leave-open keyword is there and there is no activity for 6 months.
:Waldo, maybe it's time to close this bug?

Flags: needinfo?(jwalden)

At this point...yeah, this probably ought to be closed.

mozilla::Utf8Unit as already landed is a useful thing, distinguishing UTF-8-accepting overloads from those accepting just Latin-1 or ASCII or some other encoding.

But while the container and iterator stuff here is moderately nice and all...the extra ownership stuff turns out to be somewhat tedious, and for the purpose of JS tokenizing it's not truly needed.

Moreover, C++20 introduces a new char8_t type, and u8"foo" will be an array of char8_t. So UTF-8 literals will have a distinct UTF-8-specific type; for template parameter and overload purposes char8_t will be distinct from char and unsigned char (i.e. JS::Latin1Char); and we will get the typed clarity of mozilla::Utf8Unit that we want. And at that point bog-standard containers and iterators (e.g. const char8_t*) will be good enough. And mozilla::Utf8Unit can be replaced/removed.
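
(Illustrative only, with made-up stub functions, of the overload separation C++20 buys us:)

    #include <cstddef>

    size_t CountUnits(const char*) { return 0; }     // hypothetical stub overloads
    size_t CountUnits(const char8_t*) { return 0; }

    void Example() {
      CountUnits("plain narrow string");  // picks the char overload
      CountUnits(u8"UTF-8 literal");      // in C++20, u8"" is char8_t[], so this picks the char8_t overload
    }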

So this has had its day in the sun, what's landed is adequate, and we can close this.

(In reply to Henri Sivonen (:hsivonen) from comment #54)
> Are there use cases for the customizable error notifications beyond being
> able to emit different error messages in the different cases? Is it truly
> important to emit different error messages for different kinds of UTF-8
> errors?

Bug 1546442 comment 13 is a demonstration of a case where it was extremely helpful to have different error messages for this.  As an extra bit of polish, those different error messages include the particular offending byte values.  Having good errors here meant that when relevant people looked at that bug, they -- without even needing me, the domain-expert SpiderMonkey hacker! -- could immediately say garbage had been passed.  And because I went to special effort to include byte values in the error message, the 0xE5 in that case provided a poison value to search the relevant code for.

It isn't hard to imagine that sort of thing arising again in the future. It could arise again for us as embedder. Web cases are already validated (...or not, if an implementation bug like possibly bug 1546442 happens -- we're still not sure what's up in that bug). But we have non-web users, for example various C++ callers of mozIJSSubScriptLoader.loadSubScript, that pass in non-validated content. And it could arise for other embedders who don't have the web's pre-validated restrictions.

> Would it work for SpiderMonkey if we moved the simpler
> UTF8CharEnumerator::NextChar() API to MFBT?
> https://hg.mozilla.org/mozilla-central/file/8546719c58dc/xpcom/string/nsUTF8Utils.h#l92

Doesn't give us the same error messages, and as explained above these error messages are demonstrably valuable.

> Note that UTF8CharEnumerator::NextChar() generates the number of errors
> required by the WHATWG Encoding Standard and recommended by the Unicode
> standard. I didn't review mfbt/Utf8.h properly for spec compliance, but the
> explicit check for non-shortest forms strongly hints at its behavior perhaps
> not being the behavior required by the WHATWG Encoding Standard and
> recommended by Unicode, since the WHATWG-required behavior arises best by
> pattern-matching the bytes to Table 3-7 from the Unicode standard and not
> checking the scalar value after computing it.

I double-checked the WHATWG algorithm against the one we have, and it is identical to the Utf8.h algorithm. The particular byte-matching that is performed there, in baroque state-machine byte-at-a-time fashion, exactly replicates minimum-value (for code points that are 1, 2, 3, or 4 code units in length), maximum-value (relevant for code points that are 3 and 4 units in length), and no-surrogate (the lead byte of a surrogate is 0xED and the subsequent byte would have to be 0xA0 or greater, but the UTF-8 upper boundary excludes that case and only that case) restrictions enforced by that code. And as the spec notes, "other algorithms that achieve the same result are fine, even encouraged".
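
For reference, the well-formed sequences per Table 3-7 of the Unicode Standard (the ranges the WHATWG state machine is pattern-matching), which the scalar-value checks reproduce:

    Code point range      1st byte  2nd byte  3rd byte  4th byte
    U+0000..U+007F        00..7F
    U+0080..U+07FF        C2..DF    80..BF
    U+0800..U+0FFF        E0        A0..BF    80..BF              (A0 floor = shortest form)
    U+1000..U+CFFF        E1..EC    80..BF    80..BF
    U+D000..U+D7FF        ED        80..9F    80..BF              (9F ceiling = no surrogates)
    U+E000..U+FFFF        EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF    (90 floor = shortest form)
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF    (8F ceiling = max U+10FFFF)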

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(jwalden)
Resolution: --- → FIXED
Keywords: leave-open
Depends on: 1880008
Depends on: 1880204