Closed Bug 239303 Opened 20 years ago Closed 20 years ago

provide basic string conversion routines for the Gecko SDK

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.7final

People

(Reporter: darin.moz, Assigned: darin.moz)

Details

Attachments

(1 file, 1 obsolete file)

provide basic string conversion routines to embedding string API.

bsmedberg and i discussed this issue, and we came up with the following API:

//--------------------------------------------------------------------------

/**
 * Available single-byte charsets.
 */
enum {
  /* Conversion between ASCII and UTF-16 assumes that all single-byte
   * characters can be inflated to UTF-16 by inserting zero bytes.  Reverse
   * conversion is done by truncating every other byte.  No attempt is made
   * to protect against data loss in this case. */
  NS_CHARSET_ASCII = 0,

  /* Conversion between UTF-8 and UTF-16 is non-lossy. */
  NS_CHARSET_UTF8 = 1,

  /* Conversion from UTF-16 to the native filesystem charset may result in a
   * loss of information.  No attempt is made to protect against data loss in
   * this case.  The native filesystem charset applies to strings passed to
   * the "Native" method variants on nsIFile and nsILocalFile. */
  NS_CHARSET_NATIVE_FILESYSTEM = 2
};

/**
 * NS_CStringToUTF16
 *
 * This function converts the characters in a nsACString to an array of UTF-16
 * characters, in the platform endianness.  The result is stored in a nsAString
 * object.
 *
 * @param aSource       abstract string reference containing source string
 * @param aCharset      character set of the source string
 * @param aDest         abstract string reference to hold the result
 */
NS_STRINGAPI(nsresult)
NS_CStringToUTF16(const nsACString &aSource, PRUint32 aCharset,
                  nsAString &aDest);

/**
 * NS_UTF16ToCString
 *
 * This function converts the UTF-16 characters in a nsAString to a single-byte
 * encoding.  The result is stored in a nsACString object.  In some cases this
 * conversion may be lossy.  In such cases, the conversion may succeed with a
 * return code indicating loss of information.  The exact behavior is not
 * specified at this time.
 *
 * @param aSource       abstract string reference containing source string
 * @param aCharset      character set of the resulting string
 * @param aDest         abstract string reference to hold the result
 */
NS_STRINGAPI(nsresult)
NS_UTF16ToCString(const nsAString &aSource, PRUint32 aCharset,
                  nsACString &aDest);

//--------------------------------------------------------------------------

This API provides embedders and external component developers with an easy way
to convert between the primary string encodings used by the Gecko SDK.  The
current state of affairs makes the Gecko SDK difficult to use and/or promotes
the use of non-frozen uconv interfaces or methods from nsReadableUtils.h and
nsNativeCharsetUtils.h.

I'd like to see these methods added to the Gecko SDK for Mozilla 1.7 final.
Attached patch v1 patch (obsolete) — Splinter Review
Attachment #145207 - Flags: superreview?(dbaron)
Attachment #145207 - Flags: review?(bsmedberg)
dougt: can you also please review these additions to XPCOM and the glue.  thx!
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.7final
> enum {

why not give it a name and use the name of that enum in the function declaration
instead of PRUint32. that way, the compiler can do some typechecks and it makes
it easier to find out what the excepted parameter value is.

also, the new functions in glue/standalone/nsXPCOMGlue.cpp don't return a value
despite being declared nsresult. but should they be nsresult? can they fail?
NS_ConvertUTF16toUTF8 fails by returning an empty string (iirc), should these
behave the same?
I agree with comment 3, and I'd also prefer NS_ENCODING_* instead of NS_CHARSET_*
> > enum {
> 
> why not give it a name and use the name of that enum in the function...

yeah, i thought about doing that, but i couldn't make up my mind what to call
it.  nsCharset seems too generic.  suggestions?


> also, the new functions in glue/standalone/nsXPCOMGlue.cpp don't return a value

ack.. good catch.  why didn't the compiler complain?  :-/


> can they fail?

yes, see the comments in the header file.


> NS_ConvertUTF16toUTF8 fails by returning an empty string (iirc), should these
> behave the same?

not sure yet.  the comments in the header file leave the error handling vaguely
defined.  i'm not happy with that, but i want to think about it some more.  i
don't want to be too specific about how it behaves since i might not be able to
implement that without major changes to the conversion routines.  i might regret
not nailing down such behavior for 1.7 though :-/
(In reply to comment #4)
> I agree with comment 3, and I'd also prefer NS_ENCODING_* instead of NS_CHARSET_*

Yeah, NS_ENCODING_* sounds better to me too.

Are we in general agreement with the premise for adding these functions?  And,
does anyone have any other suggestions about what the function signatures should be?

Is nsStringEncoding the right name for the enum type?
> why not give it a name and use the name of that enum in the function declaration
> instead of PRUint32. that way, the compiler can do some typechecks and it makes
> it easier to find out what the excepted parameter value is.

Technically, this is not portable across C compilers, enum types can be smaller
than int. I think we should keep PRUint32.
I definitely agree that we should have these, and the names sound fine. 
Although I wonder if the function names should say "SingleByte" rather than
"CString", since "CString" could be interpreted in multiple ways and
"SingleByte" is conceptually closer to "UTF16".
"SingleByte" works for me.

NS_SingleByteToUTF16
NS_UTF16ToSingleByte
Actually, what about "MultiByte" instead of "SingleByte"?  I say that because it
would then be more consistent with WIN32's WideCharToMultiByte and the wchar_t
functions, like wcstombs.  thoughts?
The term "multi-byte" for "characters are variable length" has never made much
sense to me.  Maybe CString is best.
CString is good because it matches nsACString, and the encoding is clearly
variable as the API suggests.  OK, I'll stick with CString.
Attached patch v1.3 patchSplinter Review
ok, here's the revised patch.  i added some test code.
Attachment #145207 - Attachment is obsolete: true
Attachment #145207 - Flags: superreview?(dbaron)
Attachment #145207 - Flags: review?(bsmedberg)
Attachment #145217 - Flags: superreview?(dbaron)
Attachment #145217 - Flags: review?(bsmedberg)
Comment on attachment 145217 [details] [diff] [review]
v1.3 patch

>+NS_STRINGAPI(nsresult)
>+NS_CStringToUTF16(const nsACString &aSrc, PRUint32 aSrcEncoding, nsAString &aDest)
>+{
>+  // XXX handle errors

>+    case NS_ENCODING_UTF8:
>+      CopyUTF8toUTF16(aSrc, aDest);
>+      break;

Do we want to do an IsEmpty check to see if the conversion was successful?

Otherwise, looks good to me.
Attachment #145217 - Flags: review?(bsmedberg) → review+
i thought about doing IsEmpty checks... i'd need to verify that the source is
not empty also before returning an error.  i guess i could do that though.
dbaron and i talked about this.  the converters currently behave very
inconsistently.  it is impossible to discern OOM errors from conversion errors
in some cases, and in other cases conversion errors result in garbage in the
output.  

my intent with the nsresult return value is to allow us to one-day do better
error checking.
Attachment #145217 - Flags: superreview?(dbaron) → superreview+
Attachment #145217 - Flags: approval1.7?
Comment on attachment 145217 [details] [diff] [review]
v1.3 patch

a=chofmann for 1.7
Attachment #145217 - Flags: approval1.7? → approval1.7+
dougt mentioned over AIM that this patch "is fine"

fixed-on-trunk for 1.7 final
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
The patch for this bug broke the AIX Tinderbox. I have filed Bug 239441 for the
issue.
Re: comment 7:  What is not portable across C compilers?  Using an enum type for
a formal parameter should be fine, and using an enum tag to define a type name
is fine too.  Why not do that and let the compiler pack, or not, as it sees fit?
 Is there some need to interface with PRUint32 on only one side of a calling
convention?

Enum is int in C, packable if the domain allows and the compiler wants to, IIRC.
Not sure whether C++ changes this.

Sorry for the late drive-by.

/be
We can add the enum type back in before marking these APIs as frozen.  See bug
239123.
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: