(In reply to Masatoshi Kimura [:emk] from comment #1)
> Unlike `Utf8Unit`, casts from `char*` to `char8_t*` are UB. Or do we no longer care about aliasing?

I see we compile with -fno-strict-aliasing (bug 413253, 16 years ago), though there were a couple failed attempts since then to support -fstrict-aliasing (like bug 414641).

https://searchfox.org/mozilla-central/rev/bf8c7a7d47debb1c22b51a72184d5c703ae57588/python/mozbuild/mozbuild/frontend/context.py#676-679

I didn't realize casts from `char*` to `char8_t*` are UB. That makes sense: a `char` might not be a UTF-8 code unit. The author of the `char8_t` proposal acknowledges that in a Stack Overflow comment https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713:

> objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
>
> `reinterpret_cast<const char *>(u8"text"); // Ok.`
> `reinterpret_cast<const char8_t*>("text"); // Undefined behavior`

What is a safe approach for working with `char*` and `char8_t*` strings?
Bug 1880204 Comment 2 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
(In reply to Masatoshi Kimura [:emk] from comment #1)
> Unlike `Utf8Unit`, casts from `char*` to `char8_t*` are UB. Or do we no longer care about aliasing?

I see we compile with -fno-strict-aliasing (bug 413253, 16 years ago), though there were a couple failed attempts since then to support -fstrict-aliasing (like bug 414641).

https://searchfox.org/mozilla-central/rev/bf8c7a7d47debb1c22b51a72184d5c703ae57588/python/mozbuild/mozbuild/frontend/context.py#676-679

I didn't realize casts from `char*` to `char8_t*` are UB. That makes sense: a `char` might not be a UTF-8 code unit. The author of the `char8_t` proposal acknowledges that in a Stack Overflow comment https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713:

> objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
>
> `reinterpret_cast<const char *>(u8"text"); // Ok.`
> `reinterpret_cast<const char8_t*>("text"); // Undefined behavior`

What is a safe approach for working with `char*` and `char8_t*` strings?

Even `Utf8Unit`'s definition has comments saying it depends on implementation-defined behavior:

https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h