Open Bug 1880204 Opened 7 months ago Updated 7 months ago

Replace Utf8Unit with char8_t

Categories

(Core :: MFBT, task)

People

(Reporter: cpeterson, Unassigned)

References

(Depends on 1 open bug)

Details

Once we compile with -std=c++20 (bug 1768116), we can replace MFBT's Utf8Unit type (introduced in bug 1426909) with C++20's char8_t type.

Unlike Utf8Unit, casts from char* to char8_t* are UB. Or do we no longer care about aliasing?

(In reply to Masatoshi Kimura [:emk] from comment #1)

> Unlike Utf8Unit, casts from char* to char8_t* are UB. Or do we no longer care about aliasing?

I see we compile with -fno-strict-aliasing (bug 413253, 16 years ago), though there have been a couple of failed attempts since then to support -fstrict-aliasing (like bug 414641).

https://searchfox.org/mozilla-central/rev/bf8c7a7d47debb1c22b51a72184d5c703ae57588/python/mozbuild/mozbuild/frontend/context.py#676-679

I didn't realize casts from char* to char8_t* are UB. That makes sense: a char might not be a UTF-8 code unit. The author of the char8_t proposal acknowledges that in a Stack Overflow comment https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713:

objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:

reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior

What is a safe approach for working with char* and char8_t* strings? Even Utf8Unit's definition has comments saying it depends on implementation-defined behavior: https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h

(In reply to Chris Peterson [:cpeterson] from comment #2)

> What is a safe approach for working with char* and char8_t* strings? Even Utf8Unit's definition has comments saying it depends on implementation-defined behavior: https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h

std::start_lifetime_as (unavailable until C++23)
std::cast_as_utf_unchecked (still a work in progress)

Depends on: C++23
No longer depends on: C++20

This SO answer explains how to implement start_lifetime_as in C++20 or earlier. It calls std::memmove on the buffer and relies on the compiler optimizing the call away:
https://stackoverflow.com/questions/76445860/implementation-of-stdstart-lifetime-as#76794371
