Bug 1880204 Comment 2 Edit History

(In reply to Masatoshi Kimura [:emk] from comment #1)
> Unlike `Utf8Unit`, casts from `char*` to `char8_t*` are UB. Or do we no longer care about aliasing?

I see we compile with `-fno-strict-aliasing` (bug 413253, 16 years ago), though there have been a couple of failed attempts since then to support `-fstrict-aliasing` (like bug 414641).

https://searchfox.org/mozilla-central/rev/bf8c7a7d47debb1c22b51a72184d5c703ae57588/python/mozbuild/mozbuild/frontend/context.py#676-679

I didn't realize casts from `char*` to `char8_t*` are UB (strictly speaking, the UB arises when the `char` data is accessed through the resulting `char8_t*`, not from the cast itself). That makes sense: a `char` might not be a UTF-8 code unit. The author of the `char8_t` proposal acknowledges that in a Stack Overflow answer (https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char/57453713#57453713):

> objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
> 
> `reinterpret_cast<const char   *>(u8"text"); // Ok.`
> `reinterpret_cast<const char8_t*>("text");   // Undefined behavior`

What is a safe approach for working with `char*` and `char8_t*` strings? Even `Utf8Unit`'s definition has comments saying it depends on implementation-defined behavior: https://searchfox.org/mozilla-central/search?q=defined&path=Utf8.h
