Replace Utf8Unit with char8_t
Categories: (Core :: MFBT, task)
People: (Reporter: cpeterson, Unassigned)
References: (Depends on 1 open bug)
Details
Once we compile with -std=c++20 (bug 1768116), we can replace MFBT's Utf8Unit type (introduced in bug 1426909) with C++20's char8_t type.
Comment 1•7 months ago
Unlike Utf8Unit, casts from char* to char8_t* are UB. Or do we no longer care about aliasing?
Comment 2•7 months ago (Reporter)
(In reply to Masatoshi Kimura [:emk] from comment #1)
> Unlike Utf8Unit, casts from char* to char8_t* are UB. Or do we no longer care about aliasing?
I see we compile with -fno-strict-aliasing (bug 413253, 16 years ago), though there have been a couple of failed attempts since then to support -fstrict-aliasing (like bug 414641).
I didn't realize casts from char* to char8_t* are UB. That makes sense: a char might not be a UTF-8 code unit. The author of the char8_t proposal acknowledges that in a Stack Overflow comment https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713:
objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior
What is a safe approach for working with char* and char8_t* strings? Even Utf8Unit's definition has comments saying it depends on implementation-defined behavior: https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h
Comment 3•7 months ago
(In reply to Chris Peterson [:cpeterson] from comment #2)
> What is a safe approach for working with char* and char8_t* strings? Even Utf8Unit's definition has comments saying it depends on implementation-defined behavior: https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h
std::start_lifetime_as (but unavailable until C++23.)
std::cast_as_utf_unchecked (still work in progress.)
Comment 4•7 months ago
This SO answer explains how to implement start_lifetime_as on C++20 or earlier (uses std::memmove and hopes the compiler will optimize out the std::memmove call):
https://stackoverflow.com/questions/76445860/implementation-of-stdstart-lifetime-as#76794371