Closed Bug 73292 Opened 23 years ago Closed 23 years ago

Add const nsACString& constructor to NS_Convert(ASCII|UTF8)toUCS2

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jag+mozbugs, Assigned: jag+mozilla)

References

Details

Attachments

(1 file, 4 obsolete files)

For parity with NS_ConvertUCS2toUTF8, which has a constructor for nsAString&,
and because this just looks silly:

NS_ConvertASCIItoUCS2 uStr(aStr.get());
Adding dependency (blocks bug 73009), removing silly cc.
Blocks: 73009
Targeting 1.1 since this isn't crucial, but it is a step in the right direction
for consistency and not too hard so I'd like to get it in before then if
possible.  If I make a tracking bug for handling encoding/conversion issues, I
may move this to block that tracker instead.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.1
Attached patch [patch] First crude attempt. (obsolete) — Splinter Review
re-targeting milestones, starting from a clean slate
Target Milestone: mozilla1.1 → ---
I'm assuming a good bit of this is roughly copied, so I'm not reviewing too
closely...

+      NS_ERROR("not a UTF-8 string");

I'd prefer NS_ASSERTION to NS_ERROR.

+  nsReadingIterator<char> p, end;

I prefer |nsACString::const_iterator p, end;|.

+      explicit
+      NS_ConvertUTF8toUCS2( const nsACString& aCString )
+        {
+          Init( aCString );
+        }

It would be nice to also have:

      explicit
      NS_ConvertUTF8toUCS2( const nsAFlatCString& aCString )
        {
          Init( aCString.get() );
        }

to improve performance for things like nsCString and nsDependentString.

Also, you could have the same thing (also inline) for NS_ConvertASCIItoUCS2.

With that, r=dbaron.
Taking this.

dbaron, could you review the new patch?
Assignee: scc → jaggernaut
Status: ASSIGNED → NEW
NS_ConvertASCIItoUCS2(const nsACString&)

  * should use nsACString::const_iterator rather than 
    nsReadingIterator<char>

  * should call aCString.Length() and expand the buffer at the beginning
    to avoid multiple allocations

NS_ConvertUTF8toUCS2(const nsACString&)

  * It would be nice to unify this with Init -- perhaps Init could take
    an nsACString& and you could use nsDependentString elsewhere?
    Making the const char* constructor avoid 3 passes would be a trick,
    though...

  * Both loops over the string would probably be better written as
    double for-loops - outer over fragments and inner over characters.

  * You should probably allow embedded nulls, right?  Or not?
> NS_ConvertASCIItoUCS2(const nsACString&)
> 
>   * should use nsACString::const_iterator rather than 
>     nsReadingIterator<char>

done

>   * should call aCString.Length() and expand the buffer at the beginning
>     to avoid multiple allocations

done

> NS_ConvertUTF8toUCS2(const nsACString&)
> 
>   * It would be nice to unify this with Init -- perhaps Init could take
>     an nsACString& and you could use nsDependentString elsewhere?
>     Making the const char* constructor avoid 3 passes would be a trick,
>     though...

>   * Both loops over the string would probably be better written as
>     double for-loops - outer over fragments and inner over characters.

Let me think about these two.

>   * You should probably allow embedded nulls, right?  Or not?

Erh... scc?
Status: NEW → ASSIGNED
Summary: Add const nsAReadableCString& constructor to NS_Convert(ASCII|UTF8)toUCS2 → Add const nsACString& constructor to NS_Convert(ASCII|UTF8)toUCS2
Attached patch third attempt (obsolete) — Splinter Review
Attachment #31191 - Attachment is obsolete: true
Attachment #41917 - Attachment is obsolete: true
I'm all ears on how to avoid the initial pass (in nsDependentCString) for the
|const char*| constructor of NS_ConvertUTF8toUCS2.
>>nsString2.cpp:
>>
>>the iterators (ns{Reading,Writing}Iterator) ought to have a fragment_type so
>>that you don't have to use nsReadableFragment<char> explicitly.
>
>Next pass? New bug?
>
>>  +        nsReadableFragment<char> frag(start.fragment());
>>  
>>Maybe
>>
>>  const nsReadableFragment<char>& frag = start.fragment();
>>
>>would be better?
>
>I could use that.
>
>>How about an NS_ASSERTION(count == mLength, "calculator calculated
>>incorrect length"); at the end of nsConvertUTF8toUCS2::Init?
>
>Done.

Btw, I think this would only occur when we encounter a UTF8 string that isn't,
in which case we already error. If we're indeed fed utf8 straight from the
netwerk, we might want to deal with this more nicely.

>>I think we need to look at the error-handling of NS_ConvertUTF8toUCS2
>>carefully since we can expect it to be called on input from the network
>>(I would think, anyway, although perhaps not).
>
>I should do that.
>
>>+                else
>>+                  {
>>+                    ucs4 -= 0x00010000;
>>+                    *mBuffer++ = 0xD800 | (0x000003FF & (ucs4 >> 10));
>>+                    *mBuffer++ = 0xDC00 | (0x000003FF & ucs4);
>>+                  }
>>
>>Is this UCS2 or UTF-16 or what?
>
>I'll come back to you on this one :-)

Hmmm... This is UTF-16 [0], and it will mess us up (even in the old code) in
that the UTF8 calculator will only have assigned one character for this, and the
above writes out two. Eek!

[0] ftp://ftp.isi.edu/in-notes/rfc2781.txt
That code sure looks wrong to me. I've cc'd ftang and yokoyama, from whom I
stole the conversion code. Guys, we don't really do UTF-16, right?
Why bother maintaining mLength in ConvertUTF8toUCS2?  Or should it just be
|#ifdef DEBUG|?
The extra pass with nsDependentString calculating the length is
inefficient -- at the very least, don't bother with the ~PRUint32(0),
just use the one-parameter constructor, since they mean the same thing
and the latter is *much* clearer.  I guess, for simplicity, don't worry
about the extra pass for now, though -- it's also the best way to allow
embedded nulls in all the other cases, I think.  Same for the double
length calculation.


+                if ( ucs4 != 0xfeff ) // ignore BOM
+                    *mBuffer++ = ucs4;

Don't you need:

                 else
                   --mLength;

However, I think it would be better to scrap mLength and keep an
mStart instead of mLength, initialized in the ctor to mBuffer, and
have Length() be |{ return mBuffer - mStart; }|.

One other optimization that would be nice is to, after the check against
minUcs4, check |ucs4 < 0xD800| to short-circuit the rest of the checks
for the vast majority of characters (with a comment saying that it's
short-ciruiting what is below in case someone changes what is below).

Should the |SetCapacity| have a |+1| or not?

Other than that, r=dbaron.  scc should super-review.
I'll rewrite Length() as you proposed.

And you're right, nsStr's code takes care of adding space for the 0 terminator,
so I don't need to add 1 myself when calling SetCapacity.
At least, if I'm reading this code correctly:

http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsStr.cpp#680
Comment on attachment 53457 [details] [diff] [review]
Addressing comments. Also removed init and null terminate code from conversion constructors, nsAutoString's constructor does that already.

r=dbaron
Attachment #53457 - Flags: review+
scc, sr= please?
Attachment #50925 - Attachment is obsolete: true
Attachment #48525 - Attachment is obsolete: true
sr=scc
Comment on attachment 53457 [details] [diff] [review]
Addressing comments. Also removed init and null terminate code from conversion constructors, nsAutoString's constructor does that already.

and sr=scc
Attachment #53457 - Flags: superreview+
Ran into a little problem in NS_ConvertUTF8toUCS2::Init:

if |count| is |0|, SetCapacity will free the allocated buffer and make mUStr
point at a shared \0 string, which we can't write our null terminator to. Added
an |mCapacity| guard.

Checked in, marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: