Closed Bug 796948 Opened 7 years ago Closed 7 years ago

Add char16_t to MFBT

Categories

(Core :: MFBT, defect)


Tracking

RESOLVED FIXED
mozilla20
Tracking Status
firefox19 --- wontfix
firefox20 --- fixed

People

(Reporter: jcranmer, Assigned: cpeterson)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 11 obsolete files)

5.65 KB, patch
ehsan
: review+
Details | Diff | Splinter Review
2.62 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
We should add char16_t to mozilla/StandardInteger.h where it doesn't already exist.

It's available with gcc >= 4.4 (with -std=c++11), and clang >= 2.9. I believe MSVC does a poor man's version of it starting in 10.

Where char16_t is not natively available, we should just use typedef uint16_t char16_t. This would be compliant with C11 but not C++11 (the char*_t types are distinct types in C++11, typedefs in C11), but given that we're replacing PRUnichar, which is pretty much just a uint16_t typedef at present, this shouldn't present any new problems.
(In reply to Joshua Cranmer [:jcranmer] from comment #0)
> We should add char16_t to mozilla/StandardInteger.h where it doesn't already
> exist.
> 
> It's available with gcc >= 4.4 (with -std=c++11)

Note we can't use -std=gnu++0x with gcc 4.4 because of a bug in the c++0x headers in gcc 4.4.
So the pseudocode should look thusly I think:

#if C
// gcc 4.7 and clang tip-of-trunk don't seem to support uchar.h in C11 mode.
typedef uint16_t char16_t;
#elif (gcc or clang) and not C++11
typedef uint16_t char16_t;
#elif MSVC < 10
// Docs indicate that MSVC 10 is the first to support char16_t
// MSVC 10 appears to only support char16_t if you include stdint.h
// but if we do this at the end of mozilla/StandardInteger.h, that isn't a problem
typedef uint16_t char16_t;
#endif
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> So the pseudocode should look thusly I think:
> 
> #if C
> // gcc 4.7 and clang tip-of-trunk don't seem to support uchar.h in C11 mode.
> typedef uint16_t char16_t;
> #elif (gcc or clang) and not C++11
> typedef uint16_t char16_t;
> #elif MSVC < 10
> // Docs indicate that MSVC 10 is the first to support char16_t
> // MSVC 10 appears to only support char16_t if you include stdint.h
> // but if we do this at the end of mozilla/StandardInteger.h, that isn't a
> problem
> typedef uint16_t char16_t;
> #endif

Seems sane!

Once we have this, is there anything preventing us from doing a mass PRUnichar->char16_t conversion?
(In reply to Ehsan Akhgari [:ehsan] from comment #3)
> Once we have this, is there anything preventing us from doing a mass
> PRUnichar->char16_t conversion?

I can't think of any reason, but this is NSPR we're talking about, so I'm going to go with "probably, but we won't know until we try."
(In reply to comment #4)
> (In reply to Ehsan Akhgari [:ehsan] from comment #3)
> > Once we have this, is there anything preventing us from doing a mass
> > PRUnichar->char16_t conversion?
> 
> I can't think of any reason, but this is NSPR we're talking about, so I'm going
> to go with "probably, but we won't know until we try."

Yeah for sure, but I don't actually expect that to be too hard given my work on the stdint bugs.  I can handle the conversion myself to save everyone else the pain.  Who wants to write a patch here?  :-)
This won't cause a problem if you have a method defined in C with a char16_t parameter referenced from a C++ file, will it? (Will the underlying types be the same?)
(In reply to comment #6)
> This won't cause a problem if you have a method defined in C with a char16_t
> parameter referenced from a C++ file, will it? (Will the underlying types be
> the same?)

Well, we're talking about making char16_t be a typedef of uint16_t everywhere, right?  So I don't see how that would be a problem.
(In reply to Ted Mielczarek [:ted] from comment #6)
> This won't cause a problem if you have a method defined in C with a char16_t
> parameter referenced from a C++ file, will it? (Will the underlying types be
> the same?)

C++11 makes the following guarantee:
Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types.

Reading C99's definition of stdint.h gives the following guarantee:
The typedef name uintN_t designates an unsigned integer type with width N. Thus,
uint24_t denotes an unsigned integer type with a width of exactly 24 bits.

[uintN_t] types are optional. However, if an implementation provides integer types with
widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a
two’s complement representation, it shall define the corresponding typedef names.
 ....
The typedef name uint_leastN_t designates an unsigned integer type with a width
of at least N, such that no unsigned integer type with lesser size has at least the specified
width. Thus, uint_least16_t denotes an unsigned integer type with a width of at
least 16 bits.

So on every system that has uint16_t, char16_t should be identical in ABI layout to uint16_t.
Well, ABI's not identical types.  And if char16_t is a distinct type, then uint16_t* and char16_t* are not the same type for calling, and they'll mangle differently.  Whether any of this actually matters, I dunno.  But there is cause for caution, I think.
(In reply to comment #6)
> This won't cause a problem if you have a method defined in C with a char16_t
> parameter referenced from a C++ file, will it? (Will the underlying types be
> the same?)

Ted, do you know of existing areas in our code which do this?
I think we need to use wchar_t on Windows, otherwise we can no longer pass char16_t* parameters to NSPR functions without casting.
(In reply to comment #11)
> I think we need to use wchar_t on Windows, otherwise we can no longer pass
> char16_t* parameters to NSPR functions without casting.

Yes.
Do we need macros like TCHAR's _TEXT("") for UTF-16 string constants like u"char16_t string" and L"wchar_t string"? On Linux and OS X, sizeof(L"a"[0]) == sizeof(wchar_t) == 4.
(In reply to comment #13)
> Do we need macros like TCHAR's _TEXT("") for UTF-16 string constants like
> u"char16_t string" and L"wchar_t string"? On Linux and OS X, sizeof(L"a"[0]) ==
> sizeof(wchar_t) == 4.

You mean besides NS_LITERAL_STRING?  I don't think that we do.
Sorry, I didn't see that NS_LL() already differentiated between char16_t and wchar_t.
You'll need to rename NS_LL, and #include "mozilla/Assertions.h" instead of #include <mozilla/Assertions.h>
Assignee: nobody → cpeterson
> and #include "mozilla/Assertions.h" instead of #include <mozilla/Assertions.h>
Why? #include <...> avoids an extra search of the current directory.
I have a preliminary patch (including some PRUnichar->char16_t fixes), but Windows is proving troublesome.

We can typedef __wchar_t as char16_t, but the Windows header files (yvals.h) typedef char16_t as uint16_t. Introducing a char16_t type that is incompatible with, and less functional than, the existing wchar_t-based PRUnichar and NS_LL() does not sound very useful. We can tweak some #defines to avoid Windows' typedef, but StandardInteger.h would need to be #included before any Windows header files.

I am toying with a less invasive change using |#define char16_t __wchar_t| to override Windows' typedef when StandardInteger.h is #included after Windows header files. This is pretty ugly, so I may just post a patch for clang/gcc and fix Windows later.
Attached patch part-1-typedef-char16_t.patch (obsolete) — Splinter Review
Part 1: Add char16_t typedef to StandardInteger.h.

This patch adds a char16_t typedef to StandardInteger.h. For our current compiler versions on Windows, OS X, Linux, and Android, char16_t is a distinct C++ type. For other compilers and C files, char16_t becomes a uint16_t.

The MOZ_HAVE_CHAR16_TYPE macro indicates whether the char16_t typedef is a distinct C++ type and not a uint16_t. This macro is necessary (in later patches) for code that may want to overload functions on both char16_t and uint16_t. I chose the name "MOZ_HAVE_CHAR16_TYPE" instead of "MOZ_HAVE_CHAR16_T" because this patch's char16_t for Windows is a __wchar_t, not a real char16_t. (__wchar_t is a native type of MSVC, whereas wchar_t may be a typedef for __wchar_t or unsigned short.)

MSVC's yvals.h typedefs char16_t as uint16_t. We want to override yvals.h so we can define char16_t to __wchar_t, a distinct 16-bit character type that also allows us to pass char16_t pointers to NSPR and Windows' "wide" APIs.

If yvals.h has not been #included yet, then #define _CHAR16T to prevent it from trying to typedef our char16_t macro. Unfortunately, yvals.h is indirectly included by many Windows header files. Changing Mozilla and third-party code to always #include StandardInteger.h before Windows header files is impractical. So if yvals.h has already been #included, then override its char16_t typedef with our own macro. We must also suppress xkeycheck.h's warning that our macro name conflicts with a C++ keyword.
Attachment #688147 - Flags: review?(ehsan)
Attachment #688147 - Flags: feedback?(Pidgeot18)
Part 2: Add -std=gnu++0x check to js configure.

js/src/configure.in is missing a gcc -std=gnu++0x check that is in Mozilla's top-level configure.in.
Part 3: Remove unnecessary configure checks for HAVE_CPP_CHAR16_T and HAVE_CPP_2BYTE_WCHAR_T.

StandardInteger.h replaces these configure-time checks with compile-time #ifdef checks.
Part 4: Remove unused USE_CPP_WCHAR_FUNCS and literal_string code.

Remove some wchar_t code that has been disabled (by #define FORCED_CPP_2BYTE_WCHAR_T) since 2001 (bug 105987). This patch is actually independent of the char16_t patches.
Attachment #688151 - Flags: review?(ehsan)
Attached file replace-all-PRUnichar.sh (obsolete) —
replace-all-PRUnichar.sh is a shell script I used to recursively search and replace "PRUnichar" with "char16_t" in mozilla-central (but skipping NSPR and NSS). Because the name "char16_t" is shorter than "PRUnichar", the script attempts to preserve the indentation of the old PRUnichar code.
Attached file unbitrot-patch-PRUnichar.sh (obsolete) —
unbitrot-patch-PRUnichar.sh is another shell script that, given one or more patch filename arguments, attempts to replace "PRUnichar" with "char16_t" to minimize patch merge conflicts.
Part 5a: Search and replace PRUnichar with char16_t.

* WARNING! Patch 5a will NOT compile without patch 5b! Patch 5a is simply the output from running the replace-all-PRUnichar.sh script.
Attached patch part-5b-fix-build-errors.patch (obsolete) — Splinter Review
Part 5b: Fix compile errors from replacing PRUnichar with char16_t.

I created utility functions AsChar16String() and AsUint16String() to make casting of char16_t and uint16_t pointers safer and less verbose. clang and gcc insist that casting these pointer types requires reinterpret_cast<>. Unfortunately, some files (such as libjpeg and webrtc) indirectly #include StandardInteger.h within an `extern "C"` block, so I explicitly marked these utility functions as `extern "C++"`.

This patch also adds an ugly #ifdef to js' CTypes.cpp because neither clang nor Android/STLport define std::numeric_limits<char16_t>. I could not add this (partial) implementation numeric_limits to StandardInteger.h because C++ templates do not like to be defined as `extern "C"`. (See above.)

This patch also redefines the jschar typedef as char16_t, but does not replace "jschar" with "char16_t".
Comment on attachment 688147 [details] [diff] [review]
part-1-typedef-char16_t.patch

Is char32_t needed?
Comment on attachment 688147 [details] [diff] [review]
part-1-typedef-char16_t.patch

Review of attachment 688147 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, with the information from comment 19 ending up in the commit message.  :-)
Attachment #688147 - Flags: review?(ehsan) → review+
Attachment #688151 - Flags: review?(ehsan) → review+
Can we please do the PRUnichar conversion stuff in another bug?  I'd like to experimentally use this in a small place in the code base first to make sure we're not breaking everyone who uses different compilers...
(In reply to Chris Peterson (:cpeterson) from comment #20)
> Created attachment 688148 [details] [diff] [review]
> part-2-js-configure-gnu++0x.patch
> 
> Part 2: Add -std=gnu++0x check to js configure.
> 
> js/src/configure.in is missing a gcc -std=gnu++0x check that is in Mozilla's
> top-level configure.in.

Also, this needs to go into its own bug in the Build Config component.  I seem to remember there was a reason why the JS people didn't have this...
(In reply to Chris Peterson (:cpeterson) from comment #19)
> This patch adds a char16_t typedef to StandardInteger.h. For our current
> compiler versions on Windows, OS X, Linux, and Android, char16_t is a
> distinct C++ type. For other compilers and C files, char16_t becomes a
> uint16_t.

I did some reading up on C11 this morning. C11 also has a char16_t type, but unlike C++11 making it a distinct type, C11 makes it a typedef of uint_least16_t, and it may or may not be UTF-16. That said, running some tests makes me inclined to believe that char16_t isn't supported in C mode on any of our compilers as of yet.
Comment on attachment 688150 [details] [diff] [review]
part-3-remove-configure-checks.patch

FWIW, this patch is going to fix bug 559278
Please don't add the new type to StandardInteger.h -- use a new header, Char16.h or something.  StandardInteger.h is supposed to exactly implement the <stdint.h> interface.  The only reason we have it is because we support compiling on older MSVC versions that don't support <stdint.h>.  When we drop support for them, we'll very likely remove StandardInteger.h and replace all #includes of it with the standard header.
Comment on attachment 688147 [details] [diff] [review]
part-1-typedef-char16_t.patch

Review of attachment 688147 [details] [diff] [review]:
-----------------------------------------------------------------

[Too bad there's no neutral feedback flag?]

::: mfbt/StandardInteger.h
@@ +47,5 @@
> +    * with our own macro. If yvals.h has not been #included yet, then #define
> +    * _CHAR16T to prevent it from trying to typedef our char16_t macro.
> +    */
> +#  define MOZ_HAVE_CHAR16_TYPE
> +#  define MOZ_LL(s) L##s

I saw briefly mentioned in m.d.platform that it might behoove us to rename MOZ_LL to a name that is more indicative of the u"" prefix than the L"" prefix.

@@ +62,5 @@
> +    * clang 2.7 and gcc 4.4 define __GXX_EXPERIMENTAL_CXX0X__ and char16_t.
> +    * gcc 4.3 defines __GXX_EXPERIMENTAL_CXX0X__ but not char16_t.
> +    */
> +#  if (__cplusplus >= 201103L || defined(__GXX_EXPERIMENTAL_CXX0X__)) && \
> +      ((__GNUC__ * 1000 + __GNUC_MINOR__) >= 4004 || defined(__clang__))

Since we don't try to compile with gcc 4.3 these days, you could remove the rhs of the and.

@@ +71,5 @@
> +
> +#if !defined(MOZ_HAVE_CHAR16_TYPE)
> +   typedef uint16_t char16_t;
> +   typedef uint32_t char32_t;
> +#  define MOZ_LL(s) s

This is highly suboptimal to me--if we don't have char16_t, MOZ_LL goes from a uint16_t*-ish thing to a char*-ish thing.

This should only happen if we are in C mode or non-C++11 mode. In the former, it's probably safe to not have a define. In the latter, we probably would like a define, but it would be impossible to compile-time produce a char16_t literal, which implies a char-to-utf16 translation function (i.e., what NS_LITERAL_STRING does now).

Since I think android needs to compile without c++11, this means we can't really use naked-u""-like literals yet, which means the MOZ_LL/MOZ_U macro would only really be driving XPCOM string headers for the near-to-medium term, which means not defining MOZ_LL in this scenario could be tenable.

@@ +80,5 @@
> +   MOZ_STATIC_ASSERT(sizeof(char16_t) == 2, "is char16_t 16 bits?");
> +   MOZ_STATIC_ASSERT(sizeof(char32_t) == 4, "is char32_t 32 bits?");
> +   MOZ_STATIC_ASSERT((char16_t)-1 > (char16_t)0, "is char16_t unsigned?");
> +   MOZ_STATIC_ASSERT((char32_t)-1 > (char32_t)0, "is char32_t unsigned?");
> +#endif

I don't know if this is MFBT style--you may want to check with Waldo.
Attachment #688147 - Flags: feedback?(Pidgeot18)
This will need some tweaking for mingw, but I can do that separately. I will look at that later, but here is what I spotted during a quick review:

+   /*
+    * MSVC typedefs char16_t as uint16_t, but we want a real wchar_t type. If
+    * yvals.h has already been #included, then override its char16_t typedef
+    * with our own macro. If yvals.h has not been #included yet, then #define
+    * _CHAR16T to prevent it from trying to typedef our char16_t macro.
+    */

It sounds like simply including yvals.h here would be cleaner. This will avoid having to deal with different situations and prevent us from depending on _CHAR16T, which is an internal thing.


Another thing to consider is simply not trying so badly to use the char16_t name. I agree it's nice to use standard types, but (at least on Windows platforms) it's clearly not the right fit. How about, instead of ending up with a standard-but-overridden type, we use a separate name?
(In reply to Chris Peterson (:cpeterson) from comment #26)
> Unfortunately, some files (such as libjpeg and webrtc) indirectly #include
> StandardInteger.h within an `extern "C"` block, so I explicitly marked these
> utility functions as `extern "C++"`.

Hmm.  This smells to me.  I'm not sure if doing this is spec-permissible or not, but we should fix these to include this not-weirdly.  How much trouble is that?

> This patch also adds an ugly #ifdef to js' CTypes.cpp because neither clang
> nor Android/STLport define std::numeric_limits<char16_t>. I could not add
> this (partial) implementation numeric_limits to StandardInteger.h because
> C++ templates do not like to be defined as `extern "C"`. (See above.)

You can nest extern blocks, so this shouldn't be a problem (although I guess if we fix the underlying issue here as noted above, maybe this doesn't matter).

> This patch also redefines the jschar typedef as char16_t, but does not
> replace "jschar" with "char16_t".

Seems reasonable.  You should file a followup in the JS engine to switch us over to char16_t -- ask :dmandelin for review.  (Probably leave the typedef around, tho, so embedders have a little time to switch over.)

(In reply to Jacek Caban from comment #35)
> Another thing to consider is simply not trying so badly to use char16_t
> name. I agree it's nice to use standard types, but (at least on Windows
> platforms) it's clearly not the right fit. How about, instead of ending up
> with a standard-but-overridden type, we use a separate name?

We're trying to move to standard stuff so that Mozilla code is more approachable to new contributors; a different name doesn't address that.  And as a practical matter we already have two different names for this rough thing -- if we thought different names were fine we wouldn't be doing this.
(In reply to Jacek Caban from comment #35)
> It sounds like simply including yvals.h here would be cleaner. This will
> avoid having to deal with different situations and prevent us from depending
> on _CHAR16T, which is an internal thing.
See comment #11.
(In reply to Joshua Cranmer [:jcranmer] from comment #34)
> Since I think android needs to compile without c++11, this means we can't
> really use naked-u""-like literals yet, which means the MOZ_LL/MOZ_U macro
> would only really be driving XPCOM string headers for the near-to-medium
> term, which means not defining MOZ_LL in this scenario could be tenable.

I just verified that we compile Firefox for Android with gcc-4.4.3 -std=gnu++0x and that u"" and U"" are char16_t and char32_t strings, as expected.
Status: NEW → ASSIGNED
Whiteboard: [leave open]
Landed r+'d patch 4 (Remove unused USE_CPP_WCHAR_FUNCS and literal_string code) because it has no dependencies on the char16_t patches:

https://hg.mozilla.org/integration/mozilla-inbound/rev/a0ae1934a4a0
(In reply to Masatoshi Kimura [:emk] from comment #37)
> (In reply to Jacek Caban from comment #35)
> > It sounds like simply including yvals.h here would be cleaner. This will
> > avoid having to deal with different situations and prevent us from depending
> > on _CHAR16T, which is an internal thing.
> See comment #11.

I think you misunderstood me. Here is what I meant for the Win32 case:

#if defined(_WIN32)
#  define MOZ_HAVE_CHAR16_TYPE
#  define MOZ_LL(s) L##s
#  include <yvals.h>
#  define char16_t __wchar_t
#  define char32_t uint32_t
#else ...

This way you know that yvals.h was included (and any later attempt to include it will be a no-op), so you don't need the _CHAR16T hack.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #36)
> (In reply to Jacek Caban from comment #35)
> > Another thing to consider is simply not trying so badly to use char16_t
> > name. I agree it's nice to use standard types, but (at least on Windows
> > platforms) it's clearly not the right fit. How about, instead of ending up
> > with a standard-but-overridden type, we use a separate name?
> 
> We're trying to move to standard stuff so that Mozilla code is more
> approachable to new contributors; a different name doesn't address that. 
> And as a practical matter we already have two different names for this rough
> thing -- if we thought different names were fine we wouldn't be doing this.

I know, but note that the proposed change is not really just introducing usage of a known type in Mozilla code. The proposed change adds a known type name with an unknown, hidden important difference. It means that the new contributor should know the difference (and probably won't). That's what I worry about.
(In reply to Masatoshi Kimura [:emk] from comment #27)
> Comment on attachment 688147 [details] [diff] [review]
> part-1-typedef-char16_t.patch
> 
> Is char32_t needed?

Gecko does not use char32_t, but the Windows header files will fail to compile if the _CHAR16T hack is #defined but we don't have definitions for both char16_t and char32_t. I can remove my (simulated uint32_t) char32_t definition, though modern versions of clang and gcc will still have their native char32_t type.
(In reply to Jacek Caban from comment #41)
> I know, but note that the proposed change is not really just introducing
> usage of a known type in Mozilla code. The proposed change adds a known type
> name with an unknown, hidden important difference. It means that the new
> contributor should know the difference (and probably won't). That's what I
> worry about.

This is going to show up in weird overload-resolution contexts and in template overload contexts, mostly.  I don't think those are going to be all that common, and I doubt truly new people are likely to stumble across such issues.  Could be wrong, but I think there's reason to be optimistic about this only biting very rarely.
Comment on attachment 688147 [details] [diff] [review]
part-1-typedef-char16_t.patch

>+#    define MOZ_LL(s) u##s
You also need to #define MOZ_L(s) MOZ_LL(s) otherwise people can't write things like MOZ_L(MOZILLA_VERSION) for instance.

>+#if !defined(MOZ_HAVE_CHAR16_TYPE)
>+   typedef uint16_t char16_t;
>+   typedef uint32_t char32_t;
>+#  define MOZ_LL(s) s
So for now we're still supporting compilers without a UTF16 string literal?
(In reply to neil@parkwaycc.co.uk from comment #44)
> >+#if !defined(MOZ_HAVE_CHAR16_TYPE)
> >+   typedef uint16_t char16_t;
> >+   typedef uint32_t char32_t;
> >+#  define MOZ_LL(s) s
> So for now we're still supporting compilers without a UTF16 string literal?

Good question. If we only need to support MSVC, clang >= 2.7, and gcc >= 4.4 then we can assume we have UTF16 string literals (and don't need these fallback typedefs either).
(In reply to Jacek Caban from comment #35)
> Another thing to consider is simply not trying so badly to use char16_t
> name. I agree it's nice to use standard types, but (at least on Windows
> platforms) it's clearly not the right fit. How about, instead of ending up
> with a standard-but-overridden type, we use a separate name?

I'm not against using a new name rather than overloading the char16_t name.

Gecko currently represents UTF-16 strings with multiple type names (PRUnichar, jschar, and Unichar in OSX code) and those type names might be wchar_t or uint16_t. Consolidating these type names into one name (such as "char16", "Char16", or even just "C16") would be an improvement. That we could define this new typedef with a distinct type (like wchar_t or char16_t instead of uint16_t) would be an extra benefit.
Comment on attachment 688147 [details] [diff] [review]
part-1-typedef-char16_t.patch

>+#if defined(_WIN32)
Should be _MSC_VER for MinGW builds.
(In reply to Masatoshi Kimura [:emk] from comment #47)
> Comment on attachment 688147 [details] [diff] [review]
> part-1-typedef-char16_t.patch
> 
> >+#if defined(_WIN32)
> Should be _MSC_VER for MinGW builds.

I checked _WIN32 instead of _MSC_VER because I assumed MinGW builds would still want their "char16_t" to be a wchar_t (instead of gcc's native char16_t type) because they will call Win32 APIs that want wchar_t pointers. If we used a new type name (e.g. "Char16") instead of overloading the name "char16_t", then this code might be clearer.
(In reply to Chris Peterson (:cpeterson) from comment #45)
> (In reply to neil@parkwaycc.co.uk from comment #44)
> > >+#if !defined(MOZ_HAVE_CHAR16_TYPE)
> > >+   typedef uint16_t char16_t;
> > >+   typedef uint32_t char32_t;
> > >+#  define MOZ_LL(s) s
> > So for now we're still supporting compilers without a UTF16 string literal?
> 
> Good question. If we only need to support MSVC, clang >= 2.7, and gcc >= 4.4
> then we can assume we have UTF16 string literals (and don't need these
> fallback typedefs either).

gcc >= 4.4 doesn't have u"" when C++11 is disabled and does have L"", but unless -fshort-wchar is used, it uses 4-byte characters, and you're removing -fshort-wchar. We do support building with gcc 4.4 with C++11 disabled.
While perfect is the enemy of good and all that, I don't actually think having a Char16 or whatever that everyone uses provides quite enough incremental benefit over the current balkanization to be worth it.  Also worth noting that as originally filed this bug is for the char16_t name, not something else.

I'm curious what it would take to always build with -std=gnu++0x, but I haven't had time to look into it.  It might be a good use of time to get everyone building in C++11 mode so that these distinctions can go away.

If memory serves (and it barely does, I almost forgot the why for this), the reason JS builds without C++11 is that we were hitting incompatibilities between libc headers and the C++ standard headers in some rare cases.  Now that SpiderMonkey is C++-only, these issues may no longer matter, and with a little fixup to stop using the problematic C headers we might be able to switch with little pain.
Thanks for the feedback. This new patch has a much smaller scope and only tries to address the original request: add a char16_t type to MFBT. Additional refactoring work, such as replacing PRUnichar or jschar or updating configure.in, can be addressed (and debated) in new bugs. <:)

This patch:
* Creates a Char16.h header file.
* Only defines char16_t when UTF-16 string literals are supported (via u"" or L"").
* Renames MOZ_LL() to something more mnemonic like MOZ_UTF16(). I don't think MOZ_U() would be a good name because its capitalization looks more like a UTF-32 U"" literal than a UTF-16 u"" literal.
* On Windows, ensures yvals.h is #included to avoid messing with _CHAR16T.

1. If compiling on Windows (with MSVC or MinGW!), then alias char16_t to wchar_t.
2. Else if compiling as C++11, then char16_t is a distinct builtin type.
3. Else if sizeof(wchar_t) == 2, then alias char16_t to wchar_t.
4. Else if compiling as C11, then char16_t is an alias for uint_least16_t.
5. Else no char16_t is declared.
Attachment #688147 - Attachment is obsolete: true
Attachment #688148 - Attachment is obsolete: true
Attachment #688150 - Attachment is obsolete: true
Attachment #688154 - Attachment is obsolete: true
Attachment #688155 - Attachment is obsolete: true
Attachment #688156 - Attachment is obsolete: true
Attachment #688158 - Attachment is obsolete: true
Attachment #689590 - Flags: review?(ehsan)
Attachment #689590 - Flags: feedback?(Pidgeot18)
Why MOZ_STATIC_ASSERTs were removed?
I bet our code assumes char16_t is unsigned and sizeof(char16_t) == 2.
(In reply to Masatoshi Kimura [:emk] from comment #53)
> Why MOZ_STATIC_ASSERTs were removed?
> I bet our code assumes char16_t is unsigned and sizeof(char16_t) == 2.

In comment 34, jcranmer said "I don't know if this is MFBT style". I thought he was referring to having the static asserts in a header file, but maybe I misunderstood.
Char16.h v3:

C11 must be checked before 16-bit wchar_t because we don't want to `typedef wchar_t char16_t` if char16_t is already a real type.

C++11's char16_t is a builtin type, but C11's char16_t is supposed to be a typedef in uchar.h. Unfortunately, uchar.h is not available until glibc 2.16 (2012-06-30) [1]. In the meantime, we can use our own typedef.

[1] http://sourceware.org/ml/libc-alpha/2012-06/msg00807.html
Attachment #689590 - Attachment is obsolete: true
Attachment #689590 - Flags: review?(ehsan)
Attachment #689590 - Flags: feedback?(Pidgeot18)
Attachment #689792 - Flags: review?(ehsan)
Attachment #689792 - Flags: feedback?(Pidgeot18)
Comment on attachment 689792 [details] [diff] [review]
part-1-create-Char16-header-v3.patch

Review of attachment 689792 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should put those static asserts back in.  If that violates the MFBT style, then we should change that style!  ;-)

Over to Waldo for the MFBT specific stuff.
Attachment #689792 - Flags: review?(jwalden+bmo)
Attachment #689792 - Flags: review?(ehsan)
Attachment #689792 - Flags: review+
(In reply to Chris Peterson (:cpeterson) from comment #54)
> (In reply to Masatoshi Kimura [:emk] from comment #53)
> > Why MOZ_STATIC_ASSERTs were removed?
> > I bet our code assumes char16_t is unsigned and sizeof(char16_t) == 2.
> 
> In comment 34, jcranmer said "I don't know if this is MFBT style". I thought
> he was referring to having the static asserts in a header file, but maybe I
> misunderstood.

Let me clarify my statement some more:

I recall that one of the MFBT headers has the MOZ_STATIC_ASSERTs commented out; ISTR that there is some bug due to multiple header inclusions in a compiler that causes things to break.
Comment on attachment 689792 [details] [diff] [review]
part-1-create-Char16-header-v3.patch

Review of attachment 689792 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Char16.h
@@ +7,5 @@
> +
> +#ifndef mozilla_Char16_h_
> +#define mozilla_Char16_h_
> +
> +#if defined(_WIN32) /* MSVC or MinGW */

I dislike the MSVC or mingw situation here. In my mind, this is primarily about providing a transition to char16_t for compilers that don't have it. If mingw gcc supports char16_t and u"", we should be using that. The rationale that "we can use MOZ_UTF16 to pass things as PRUnichar or wchar_t" strikes me as very much a wrong decision, since the standards are emphatic on distinguishing between char16_t and wchar_t. Indeed, I would probably rather see char16_t get defined to uint16_t on MSVC, except that means that MOZ_UTF16 would have to cast its string literal.

@@ +34,5 @@
> +    * available until glibc 2.16. In the meantime, __CHAR16_TYPE__ is a builtin
> +    * macro that expands to unsigned short on clang and gcc.
> +    */
> +#  define MOZ_UTF16(s) u##s
> +   typedef __CHAR16_TYPE__ char16_t;

Why can't we get something like Clang's __has_include into the C1y/C++1y standard? It's so bloody useful...
Attachment #689792 - Flags: feedback?(Pidgeot18) → feedback-
Comment on attachment 689792 [details] [diff] [review]
part-1-create-Char16-header-v3.patch

Review of attachment 689792 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Char16.h
@@ +22,5 @@
> +    */
> +#  define MOZ_UTF16(s) L##s
> +#  define _ALLOW_KEYWORD_MACROS
> +#  include <yvals.h>
> +#  define char16_t wchar_t

Can we #undef _ALLOW_KEYWORD_MACROS here, so this doesn't taint the world?

@@ +34,5 @@
> +    * available until glibc 2.16. In the meantime, __CHAR16_TYPE__ is a builtin
> +    * macro that expands to unsigned short on clang and gcc.
> +    */
> +#  define MOZ_UTF16(s) u##s
> +   typedef __CHAR16_TYPE__ char16_t;

Hm, so this is guarded by a totally standard #if, but then it uses gcc-isms.  Let's not do that, but rather make the this-only-works-with-gcc/clang bit explicit:

#elif __STDC_VERSION__ >= 201112L
#  if defined(__clang__) || defined(__GNUC__)
     ...current contents...
#  else
#    error "Please write code defining a char16_t for this compiler."
#  endif

@@ +39,5 @@
> +#elif __WCHAR_MAX__ == 0x7fff || __WCHAR_MAX__ == 0xffff
> +   /* Thanks to -fshort-wchar, wchar_t is a 16-bit type. */
> +#  define MOZ_UTF16(s) L##s
> +#  include <wchar.h>
> +   typedef wchar_t char16_t;

This also wants #if defined(__clang__) || defined(__GNUC__) / #else / #error "Please write code defining a char16_t for this compiler." / #endif.  Or say to compile with -fshort-wchar, or something.  I dunno.  Just let's make it a compile error with some vaguely informative message.

@@ +40,5 @@
> +   /* Thanks to -fshort-wchar, wchar_t is a 16-bit type. */
> +#  define MOZ_UTF16(s) L##s
> +#  include <wchar.h>
> +   typedef wchar_t char16_t;
> +#endif

#else
#  error "Couldn't figure out how to define a char16_t type."

@@ +41,5 @@
> +#  define MOZ_UTF16(s) L##s
> +#  include <wchar.h>
> +   typedef wchar_t char16_t;
> +#endif
> +

I dunno where this notion that having static asserts wouldn't be mfbt style came from, because as far as I'm concerned static asserts are one honking great idea -- let's do more of those!

+#ifdef __cplusplus
+#  include "mozilla/Assertions.h"
+   MOZ_STATIC_ASSERT(sizeof(char16_t) == 2, "is char16_t 16 bits?");
+   MOZ_STATIC_ASSERT(sizeof(char32_t) == 4, "is char32_t 32 bits?");
+   MOZ_STATIC_ASSERT((char16_t)-1 > (char16_t)0, "is char16_t unsigned?");
+   MOZ_STATIC_ASSERT((char32_t)-1 > (char32_t)0, "is char32_t unsigned?");
+#endif

Maybe it was about the indentation of the static asserts within the preprocessor checks?  I dunno.  In any case, I don't see a reason why these should only be tested if __cplusplus -- just do them all the time, and move the #include up to just after #define mozilla_Char16_h_.

Or is it that the char16_t implemented by all this isn't actually guaranteed to be unsigned (but is guaranteed to be 16-bits)?  Le sigh.  I guess let's just check sizes and not bother with the unsigned asserts.  If we can't rely on them on some platforms, we really just can't rely on them anywhere.
Attachment #689792 - Flags: review?(jwalden+bmo)
Attachment #689792 - Flags: review+
Attachment #689792 - Flags: feedback?(Pidgeot18)
Attachment #689792 - Flags: feedback-
(In reply to Joshua Cranmer [:jcranmer] from comment #58)
> ::: mfbt/Char16.h
> @@ +7,5 @@
> > +
> > +#ifndef mozilla_Char16_h_
> > +#define mozilla_Char16_h_
> > +
> > +#if defined(_WIN32) /* MSVC or MinGW */
> 
> I dislike the MSVC or mingw situation here. In my mind, this is primarily
> about providing a transition to char16_t for compilers that don't have it.
> If mingw gcc supports char16_t and u"", we should be using that. The
> rationale that "we can use MOZ_UTF16 to pass things as PRUnichar or wchar_t"
> strikes me as very much a wrong decision, since the standards are emphatic
> on distinguishing between char16_t and wchar_t. Indeed, I would probably
> rather see char16_t get defined to uint16_t on MSVC, except that means that
> MOZ_UTF16 would have to cast its string literal.

Not only that but also everywhere using Win32 APIs. Also, everywhere using NSPR functions because PRUnichar is wchar_t on Windows.

So the question again. Should we proceed even if we know "char16_t" is not the real char16_t on Windows?
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #59)
> Or is it that the char16_t implemented by all this isn't actually guaranteed
> to be unsigned (but is guaranteed to be 16-bits)?  Le sigh.  I guess let's
> just check sizes and not bother with the unsigned asserts.  If we can't rely
> on them on some platforms, we really just can't rely on them anywhere.

Then
> +#elif __WCHAR_MAX__ == 0x7fff || __WCHAR_MAX__ == 0xffff
should be
> +#elif __WCHAR_MAX__ == 0xffff
because __WCHAR_MAX__ == 0x7fff implies that wchar_t is a signed type.
Midairers gonna midair.  :-P

(In reply to Joshua Cranmer [:jcranmer] from comment #57)
> I recall that one of the MFBT headers has the MOZ_STATIC_ASSERTs commented
> out; ISTR that there is some bug due to multiple header inclusions in a
> compiler that causes things to break.

I can't find this in the current code.  There was a multiple-inclusion issue, if multiple files in the same compilation had static-asserts on the same line, but that should have disappeared now that we require gcc 4.4+ and have __COUNTER__ available everywhere.  If there's a problem, and I don't think there is, we can deal after the fact.

(In reply to Joshua Cranmer [:jcranmer] from comment #58)
> If mingw gcc supports char16_t and u"", we should be using that. The
> rationale that "we can use MOZ_UTF16 to pass things as PRUnichar or wchar_t"
> strikes me as very much a wrong decision, since the standards are emphatic
> on distinguishing between char16_t and wchar_t. Indeed, I would probably
> rather see char16_t get defined to uint16_t on MSVC, except that means that
> MOZ_UTF16 would have to cast its string literal.

(In reply to Masatoshi Kimura [:emk] from comment #60)
> Not only that but also everywhere using Win32 APIs. Also, everywhere using
> NSPR functions because PRUnichar is wchar_t on Windows.
> 
> So the question again. Should we proceed even if we know "char16_t" is not
> the real char16_t on Windows?

I'm sympathetic to these arguments, although not quite so strongly to insist on it.  We could have a method that converted char16_t* to wchar_t* for Windows' sake, and call it everywhere that mattered, certainly.  I don't have good intuition as to whether that would uglify things too much, or not.

Anyway.  If other people feel strongly about the Windows char16_t situation, they should speak up.  It sounds to me like people generally think this is a reasonable way to do things, tho, so right now I'm thinking we should just go with this.  Also it seems to me there's no reason this subsequent change couldn't be done as a followup, if people felt strongly and we wanted to find out just how much work would be involved, and how ugly it'd look.

(In reply to Masatoshi Kimura [:emk] from comment #61)
> Then
> > +#elif __WCHAR_MAX__ == 0x7fff || __WCHAR_MAX__ == 0xffff
> should be
> > +#elif __WCHAR_MAX__ == 0xffff
> because __WCHAR_MAX__ == 0x7fff implies that wchar_t is a signed type.

I think the idea was we were going to be fine enough with having char16_t possibly be signed, here.  Unsigned has nice things like better overflow/modulo semantics, but it's not clear to me it's worth insisting upon.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #62)
> I'm sympathetic to these arguments, although not quite so strongly to insist
> on it.  We could have a method that converted char16_t* to wchar_t* for
> Windows' sake, and call it everywhere that mattered, certainly.  I don't
> have good intuition as to whether that would uglify things too much, or not.

I'm going to withhold ultimate judgement until I see what happens when MSVC includes char16_t and u"" natively. Given that the recent CTP released raw string literals but not unicode string literals, I'm guessing that char16_t/wchar_t compatibility issues may actually be holding this feature back from implementation in MSVC....
Attached file Char16.h WIP v5 (obsolete) —
WIP v5.

* Add static asserts. The __cplusplus check was a holdover from when this code was in StandardInteger.h, which was #included in some .c files that did not like Assertions.h's variadic macros ("error: anonymous variadic macros were introduced in C99"). What is the minimum version of C and C++ we can assume? We are compiling .c files as C90. Perhaps this limitation should be fixed in Assertions.h itself.

* Check clang's __has_include(<uchar.h>). The #if logic is a little convoluted because the C preprocessor does not short-circuit #if expressions, so we can't guard undefined function-like macro calls with defined(__has_include) for non-clang compilers.

* MSVC: Waiting for reviewer consensus on char16_t as wchar_t or unsigned short. I included example implementations of both. The unsigned short version is pretty ugly, and the MOZ_UTF16() macro can't be overloaded for both string literals and character literals in C code. Plus, the cast means that the C preprocessor can't concatenate multiple MOZ_UTF16() literals. This is not a big limitation because I can't find any current uses of NS_LL() literal concatenation.

* gcc-4.4 -std=gnu99 supports u"" string literals, even without C11.

* Remove checks for 16-bit wchar_t because I don't think we compile with -fshort-wchar on any platforms (after the char16_t check was added to configure.in).
Attachment #689792 - Attachment is obsolete: true
(In reply to Chris Peterson (:cpeterson) from comment #64)
> What is the minimum version of C and C++ we can assume? We are
> compiling .c files as C90. Perhaps this limitation should be
> fixed in Assertions.h itself.

I tried to make us compile C with -std=gnu99 -fgnu89-inline awhile ago, in bug 719659.  It broke at the time because of stupid tinderboxen compiling with gcc < 4.2, which didn't support -fgnu89-inline.  I'm not sure those tinderboxen exist any more, so we could just flip on C99 and be fine here.  (With the knowledge, of course, that not all C99 is permissible because MSVC.)

For C++, we could use -Wno-variadic-macros, or just use that for C/C++ both.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #65)
> (In reply to Chris Peterson (:cpeterson) from comment #64)
> > What is the minimum version of C and C++ we can assume? We are
> > compiling .c files as C90. Perhaps this limitation should be
> > fixed in Assertions.h itself.
> 
> I tried to make us compile C with -std=gnu99 -fgnu89-inline awhile ago, in
> bug 719659.  It broke at the time because of stupid tinderboxen compiling
> with gcc < 4.2, which didn't support -fgnu89-inline.  I'm not sure those
> tinderboxen exist any more, so we could just flip on C99 and be fine here. 
> (With the knowledge, of course, that not all C99 is permissible because
> MSVC.)

The l10n tinderboxes might still have the problem, but if you enclose the flags in COMPILE_ENVIRONMENT, that might work.
Comment on attachment 690436 [details]
Char16.h WIP v5

> * Remove checks for 16-bit wchar_t because I don't think we compile with
> -fshort-wchar on any platforms (after the char16_t check was added to
> configure.in).
Doesn't MinGW use wchar_t anymore? Jacek, why isn't a cast needed when using Win32 APIs or NSPR functions from MinGW? Or are you adding local patches to compile?
Attachment #690436 - Attachment is patch: true
I feel it's worth coming back to the goals we want for char16_t:

1. decltype(MOZ_UTF16("")[0]) should be the same type as char16_t.
2. sizeof(char16_t) == 2.
[These two are absolute requirements]

3. MOZ_UTF16 should be constexpr so that it can be used in static initializers without requiring any runtime costs.
4. MOZ_UTF16 should be a const char16_t[N] type for some integer N. This lets you declare:
template <size_t N> void foo(const char16_t (&str)[N]);
foo(MOZ_UTF16("A literal"));
5. As this gets used more in Mozilla code, we should be able to compile without -fshort-wchar.

At this point, I would also say the following:
6. Arguing over proper handling of C is putting the cart before the horse. Our public APIs (JSAPI and the XUL sdk) are C++ only; NSPR and NSS aren't going to be using char16_t anytime soon. If we can't decide what to do in C mode, then making it an error to use this header from a C file might be a better alternative at this point.
7. Making PRUnichar == char16_t is a non-goal. Under current conditions, they're already different in C++ (as char16_t is not uint16_t); we've gotten away with them being different on non-Windows systems for a few years already (as NS_LL is a wchar_t*, while PRUnichar is still uint16_t). I'd like to think that we could get away with them being different on Windows too, but I'm not an expert in this regard.

Point 4 appears to force char16_t to be wchar_t on MSVC. I've done some tests to see if I can fake this with reinterpret_cast, but that doesn't work (the size parameter is lost).

I don't know what mingw should do, but they're not tier 1, so I'd quite frankly rather have them get selected based on compiler support and let them propose a fix if that turns out to break their code.
Comment on attachment 690436 [details]
Char16.h WIP v5

>MOZ_STATIC_ASSERT(sizeof(char16_t) == sizeof(uint_least16_t),
>                  "Is char16_t type 16 bits?");
>
>#if defined(MOZ_UTF16)
>  MOZ_STATIC_ASSERT(sizeof(MOZ_UTF16('A')) == sizeof(uint_least16_t),
>                    "Is char16_t literal 16 bits?");
>  MOZ_STATIC_ASSERT(sizeof(MOZ_UTF16("")[0]) == sizeof(uint_least16_t),
>                    "Is char16_t string element 16 bits?");
>#endif
The asserts should test sizeof(char16_t)/sizeof(MOZ_UTF16(...)) against the exact value 2. I don't think our code works with other size of char16_t.
Attachment #690436 - Attachment is patch: false
(In reply to Masatoshi Kimura [:emk] from comment #67)
> Comment on attachment 690436 [details]
> Char16.h WIP v5
> 
> > * Remove checks for 16-bit wchar_t because I don't think we compile with
> > -fshort-wchar on any platforms (after the char16_t check was added to
> > configure.in).
> Doesn't MinGW use wchar_t anymore? Jacek, why isn't a cast needed when using
> Win32 APIs or NSPR functions from MinGW?

MinGW uses wchar_t just like MSVC. Currently PRUnichar == wchar_t on mingw, so all those calls use the same builtin type.

> Or are you adding local patches to compile?

No, current m-c compiles just fine. Here is more info:
https://developer.mozilla.org/en-US/docs/Cross_Compile_Mozilla_for_Mingw32

> I don't know what mingw should do, but they're not tier 1, so I'd quite frankly rather have
> them get selected based on compiler support and let them propose a fix if that turns out
> to break their code.

Sure, I may take care of that.

Generally, mingw should be as close as possible to MSVC variant, although in this case wchar_t is already a builtin type on mingw and there is no __wchar_t, so that's a minor problem for mingw.
FWIW I also don't think that mingw should be a concern here.  While I usually hate to break non-tier1 platforms, I think we've over-discussed the mingw side here, and we should just fix up mingw based on whatever solution we agree on for tier1 platforms, potentially even after landing the patch here.
Attached patch Char16.h v6 (obsolete) — Splinter Review
I think this patch is ready to land, unless I overlooked some feedback. Changes to address most recent concerns:

* On Windows, use wchar_t instead of uint16_t, so MOZ_UTF16() can be a constant expression. (Also, mingw doesn't support __wchar_t.)

* However, #ifdef based on MSVC, not _WIN32, until mingw support is tested.

* When compiling without support for UTF-16 literals (such as C99 and C++98), leave the MOZ_UTF16() macro undefined rather than failing with #error. Code can still manipulate UTF-16 strings and characters without needing to encode UTF-16 literals at compile-time.

* Assert sizeof(char16_t) == 2, not sizeof(uint_least16_t).

* Add more comments.
Attachment #690436 - Attachment is obsolete: true
Attachment #691241 - Flags: review?(jwalden+bmo)
Rereading jcranmer's comment 68, I better understand his point that C support is probably not useful in Mozilla code. It greatly complicates this header file and can cause quiet problems when the definition of char16_t is inconsistent in different source files. And the .c files in our codebase are unlikely to be processing UTF-16 strings anyway.
Can we perhaps do:

#ifndef __cplusplus
#error Wrong language, go away!
#endif

?  :-)
Attached patch Char16.h v7Splinter Review
* Add #error about only supporting C++
* Remove macro heroics attempting to support char16_t in C

(Sorry this has been a long journey back to a small patch. <:)
Attachment #691241 - Attachment is obsolete: true
Attachment #691241 - Flags: review?(jwalden+bmo)
Attachment #691848 - Flags: review?(jwalden+bmo)
Comment on attachment 691848 [details] [diff] [review]
Char16.h v7

Review of attachment 691848 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, let's run with this and get feedback on it.

Incidentally, with jsd and the JSAPI converted to C++, there's almost no C code in Mozilla, owned by Mozilla, so C compat is less relevant now than ever.  (And that C code we do own, we really should just make C++ at this point, and throw around extern "C" if needed to adjust.)
Attachment #691848 - Flags: review?(jwalden+bmo) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/793c5c49bd9b
OS: Windows 7 → All
Hardware: x86_64 → All
Target Milestone: --- → mozilla20
Depends on: 826144
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Depends on: 1277106