Closed Bug 295047 Opened 19 years ago Closed 19 years ago

Want unicode stream readers/writers

Categories

(Core :: Internationalization, enhancement, P1)

RESOLVED FIXED

People

(Reporter: Biesinger, Assigned: Biesinger)

Details

(Keywords: arch)

Attachments

(3 files, 9 obsolete files)

- 1.37 KB, application/x-javascript
- 63.62 KB, patch
- 64.01 KB, patch (darin.moz: review+; darin.moz: superreview+; chofmann: approval1.8b3+)
It would be nice to have an easy way to read from and write to streams, with
charset conversion done automatically. Something like:

interface UnicodeStreamWriter {
  void init(in nsIOutputStream stream, in string charset);
  void write(in AString string);
}

similarly for reading (maybe we can just make converterinputstream scriptable)

this should simplify reading/writing strings a lot.
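A minimal sketch of the shape being proposed (the class name, the in-memory byte sink, and the UTF-8-only restriction are all hypothetical here; the real proposal wraps nsIOutputStream and supports arbitrary charsets):

```javascript
// Hypothetical sketch: a writer that takes a byte sink and a character
// encoding, converting strings to bytes on write. UTF-8 only here, via
// the standard TextEncoder.
class UnicodeStreamWriter {
  init(byteSink, charset) {
    if (charset.toLowerCase() !== "utf-8")
      throw new Error("this sketch supports only UTF-8");
    this.sink = byteSink;
    this.encoder = new TextEncoder();
  }
  write(str) {
    this.sink.push(...this.encoder.encode(str));
  }
}

const bytes = [];
const w = new UnicodeStreamWriter();
w.init(bytes, "UTF-8");
w.write("héllo");           // "é" encodes to two bytes in UTF-8
console.log(bytes.length);  // 6
```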
Attached patch patch v0 (writing) (obsolete) — Splinter Review
not quite complete yet, but shows the idea. any comments, especially on the
interface?
(assumes a unix-like system)
Attached patch patch v0.1 (reading + writing) (obsolete) — Splinter Review
- also does reading, by making converterinputstream scriptable (this fixes bug
260281)
- return Flush's return value from close if it failed
- call mOutStream->Flush from Flush
Attachment #184209 - Attachment is obsolete: true
Attachment #184210 - Attachment is obsolete: true
> - call mOutStream->Flush from Flush

NOTE: flushing a buffered output stream does not call flush on the inner output
stream.  that's probably a good thing because flushing a file output stream
means calling PR_Sync, which can be very very expensive.  you might want to
consider implementing this flush function similarly.
ok, yeah - people who want to flush the underlying stream can just keep a
reference to it.
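The semantics described above can be sketched as follows (hypothetical classes; the point is that flush() drains the buffer into the inner stream without syncing the inner stream itself):

```javascript
// Flushing the buffered layer drains its buffer to the inner stream
// but never calls the inner stream's sync() (PR_Sync is expensive).
class BufferedWriter {
  constructor(inner) { this.inner = inner; this.buf = []; }
  write(b) { this.buf.push(b); }
  flush() {
    this.inner.writes.push(...this.buf);  // drain only; no inner.sync()
    this.buf.length = 0;
  }
}

const inner = { writes: [], syncs: 0, sync() { this.syncs++; } };
const w = new BufferedWriter(inner);
w.write(1); w.write(2);
w.flush();
console.log(inner.writes.length, inner.syncs); // 2 0
```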
Attached patch patch v0.5 (obsolete) — Splinter Review
- allow setting a replacement character (default to throwing when writing an
unsupported char)
- implement writeChar
- don't flush the underlying stream in flush

todo:
- decide what to do when only some bytes were written
- should nsIUnicharOutputStream.idl be in xpcom/io, where
nsIUnicharInputStream.idl is?
Attachment #184212 - Attachment is obsolete: true
http://developer-test.mozilla.org/docs/Writing_textual_data
http://developer-test.mozilla.org/docs/Reading_textual_data

These pages have some documentation on these interfaces, and on what can be
done in previous Mozilla versions.
If we can put the interfaces in xpcom/io, I think that would be great (even if
the actual implementation is in uconv).
bz: nsIConverter{Input,Output}Stream too? (note: nsIConverterInputStream.h
currently lives in intl/uconv/public)
Possibly, yes.  That would allow XPCOM code to work with those streams,
basically, which may be desirable.
Attached patch patch v1.0 (obsolete) — Splinter Review
- return a boolean from the write functions that indicates whether the write
was successful
- move the IDLs to xpcom

- add nsIUnicharLineInputStream to read linewise from the converter input
stream
  - templatize nsReadLine.h in order to do that
Attachment #184628 - Attachment is obsolete: true
Attachment #184730 - Flags: superreview?(bzbarsky)
Attachment #184730 - Flags: review?(jshin1987)
Comment on attachment 184730 [details] [diff] [review]
patch v1.0

r=jshin
thank you for the work!

>Index: xpcom/io/nsIConverterInputStream.idl

>+interface nsIInputStream;
>+
>+/**
>+ * A unichar input stream that wraps an input stream.
>+ * This allows reading unicode strings from a stream, automatically converting
>+ * the bytes from a selected charset.

Let's standardize on 'character encoding' in the documentation. (I changed it
accordingly in the developer wiki pages for the interface without talking to
you, which I should have done.) Our i18n library uses 'charset' extensively,
but it was written when multiple terms were used in a rather confusing manner.
For variable names, 'aCharEncoding' is a bit long, so I won't insist on using
it (especially considering that we have nsICharsetConverterManager and many
other names like that...).


>Index: xpcom/io/nsIConverterOutputStream.idl

>+     * @param aReplacementChar
>+     *        The replacement character to use when an unsupported character
>+     *        is found.
>+     *        The character must be encodable in the selected character set.
>+     */
>+    void setReplacementChar(in PRUnichar aReplacementChar);

Perhaps it's a good idea to document what happens when a character that is not
representable is passed to this method (fall back to the default behavior?)


>+interface nsIUnicharLineInputStream : nsISupports
>+{
>+  /**
>+   * Read a single line from the stream, where a line is a 
>+   * possibly zero length sequence of characters terminated by a
>+   * CR, LF, CRLF, LFCR, or eof.

This is supposed to be used 'on top of' nsIUnicharInputStream, isn't it?
Btw, Unicode has other line separators. We have to think about them in another
bug, IMO.
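As a sketch of that terminator set (a hypothetical helper, not the patch's implementation), each of CR, LF, CRLF, and LFCR ends exactly one line:

```javascript
// Hypothetical line splitter honoring the documented terminators:
// CR, LF, CRLF, and LFCR each end one line; eof ends a final line.
function splitLines(text) {
  const lines = [];
  let start = 0, i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === "\r" || ch === "\n") {
      lines.push(text.slice(start, i));
      // a paired opposite character belongs to the same terminator
      const next = text[i + 1];
      if ((ch === "\r" && next === "\n") ||
          (ch === "\n" && next === "\r")) i++;
      start = ++i;
    } else {
      i++;
    }
  }
  if (start < text.length) lines.push(text.slice(start));
  return lines;
}

console.log(splitLines("a\r\nb\rc\n\rd")); // [ 'a', 'b', 'c', 'd' ]
```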


>+UTF8InputStream::ReadString(PRUint32 aCount, nsAString& aString,
>+                            PRUint32* aReadCount)
>+{
>+  NS_ASSERTION(mUnicharDataLength >= mUnicharDataOffset, "unsigned madness");
>+  PRUint32 rv = mUnicharDataLength - mUnicharDataOffset;
>+  nsresult errorCode;

Perhaps, it's me, but it looks a bit strange to see 'rv' used for something
other than the return value. How about using 'readCount' in place of 'rv' and
using 'rv' for 'errorCode' (as is usually done)? 


>+nsConverterInputStream::ReadString(PRUint32 aCount, nsAString& aString,
>+                                   PRUint32* aReadCount)
>+{

>+  PRUint32 rv = mUnicharDataLength - mUnicharDataOffset;

same here.



>Index: netwerk/base/public/nsReadLine.h

>+template<typename CharType>
>+class nsLineBuffer {
>+  public:
>+  CharType buf[kLineBufferSize+1];
>+  CharType* start;
>+  CharType* current;
>+  CharType* end;
>   PRBool empty;

|struct nsLineBuffer| doesn't work with template? It doesn't matter, but I'm
just wondering.
Attachment #184730 - Flags: review?(jshin1987) → review+
>Perhaps, it's me, but it looks a bit strange to see 'rv' used for something
>other than the return value. How about using 'readCount' in place of 'rv' and
>using 'rv' for 'errorCode' (as is usually done)? 

yes, I agree. I want to note that I copied the existing code here. but maybe
it's better to change the existing code too, I will do that.

> |struct nsLineBuffer| doesn't work with template?

it does on GCC... I don't trust its portability.

I'll make the other changes. Your wiki changes look fine. What's the difference
between UTF-32 and UCS-4 though? I thought they were identical.
>This is supposed  to be used 'on top of' nsIUnicharInputStream, isn't it? 

yeah. nsConverterInputStream can be qi'd to it, although I'm not sure that
that's the best way to do it. better might be a way to wrap it around any
nsIUnicharInputStream.

>Btw, Unicode has other line separators. We have to think about them in another
>bug, IMO.

it does? fun...
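Unicode does define additional separators: U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, and U+0085 NEL. A quick illustration in plain JS, outside any Mozilla API:

```javascript
// Splitting on the full set of Unicode line terminators, not just CR/LF.
const text = "one\u2028two\u2029three\u0085four";
const lines = text.split(/\r\n|\n\r|\r|\n|\u2028|\u2029|\u0085/);
console.log(lines); // [ 'one', 'two', 'three', 'four' ]
```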
Attachment #184730 - Flags: superreview?(bzbarsky)
Attachment #184730 - Flags: review?(darin)
Attached patch patch v1.1 (obsolete) — Splinter Review
Attachment #184730 - Attachment is obsolete: true
Attachment #184877 - Flags: superreview?(bzbarsky)
Attachment #184877 - Flags: review?(darin)
Comment on attachment 184877 [details] [diff] [review]
patch v1.1

>Index: xpcom/io/nsIConverterInputStream.idl

>+ * The Initial Developer of the Original Code is
>+ * Netscape Communications Corporation.

shouldn't this be you?


>+     * @param aCharset
>+     *        The character encoding to use for converting the bytes of the
>+     *        stream.

can this parameter be null, and if so what does null mean?  (UTF-8?)


>Index: xpcom/io/nsIConverterOutputStream.idl

>+interface nsIConverterOutputStream : nsIUnicharOutputStream {

I'd prefer opening bracket on a new line for consistency with other
IDL files.


>+    /**
>+     * Initialize this stream. Must be called before any other method on this
>+     * interface, or you will crash. The arguments passed to this method must
>+     * not be null, or you will crash.
>+     */
>+    void init(in nsIOutputStream aOutStream, in string aCharset);

maybe null charset could mean UTF-8, no?


>+    /**
>+     * Set the replacement character to use when encountering characters that
>+     * can not be represented in the selected character encoding.
>+     * The default is to throw an exception upon trying to write unsupported
>+     * characters.

if i call this method, then how do i revert to the default behavior?  is it
the case that i cannot do that?  also, why doesn't nsIConverterInputStream
expose a similar method?  maybe i should be able to get the replacement
char?  i.e., perhaps this should be a PRUnichar attribute instead?


>Index: xpcom/io/nsIUnicharInputStream.idl

>+%{C++
>+/**
>+ * The signature of the writer function passed to ReadSegments. This
>+ * is the "consumer" of data that gets read from the stream's buffer.
>+ *
>+ * @param aInStream stream being read
>+ * @param aClosure opaque parameter passed to ReadSegments
>+ * @param aFromSegment pointer to memory owned by the input stream
>+ * @param aToOffset amount already read (since ReadSegments was called)
>+ * @param aCount length of fromSegment
>+ * @param aWriteCount number of bytes read
>+ *
>+ * Implementers should return the following:
>+ *
>+ * @return NS_OK and (*aWriteCount > 0) if consumed some data
>+ * @return <any-error> if not interested in consuming any data
>+ *
>+ * Errors are never passed to the caller of ReadSegments.
>+ *
>+ * NOTE: returning NS_OK and (*aWriteCount = 0) has undefined behavior.
>+ */
>+typedef NS_CALLBACK(nsWriteUnicharSegmentFun)(nsIUnicharInputStream *aInStream,
>+                                              void *aClosure,
>+                                              const PRUnichar *aFromSegment,
>+                                              PRUint32 aToOffset,
>+                                              PRUint32 aCount,
>+                                              PRUint32 *aWriteCount);
>+%}
>+native nsWriteUnicharSegmentFun(nsWriteUnicharSegmentFun);
>+
>+/**
>+ * Abstract unicode character input stream
>+ * @see nsIInputStream
>+ */
>+[scriptable, uuid(d5e3bd80-6723-4b92-b0c9-22f6162fd94f)]
>+interface nsIUnicharInputStream : nsISupports {
>+  /**
>+   * Reads into a caller-provided character array.
>+   *
>+   * @return The number of characters that were successfully read
>+   *
>+   * @note To read more than 2^32 characters, call this method multiple times.
>+   */
>+  [noscript] unsigned long read([array, size_is(aCount)] in PRUnichar aBuf,
>+                                in unsigned long aCount);

return value can be less than aCount even if there is more data
in the source stream.  you should also mention the special case
of returning 0 to indicate EOF.  (at least i think this interface
should behave like that to be consistent with nsIInputStream.)


>+  /**
>+   * @return The number of characters actually read.
>+   */
>+  [noscript] unsigned long readSegments(in nsWriteUnicharSegmentFun aWriter,

needs better documentation.  see nsIInputStream.idl


>+  /**
>+   * Read into a string object.
>+   * @param aCount The number of characters that should be read
>+   * @return The number of characters that were read.
>+   */
>+  unsigned long readString(in unsigned long aCount, out AString aString);

you could just write:

   AString readString(in unsigned long aCount);

AString knows its length, so there is no real reason to return the length.


should this API also expose an Available() method that returns the number
of unicode characters available?  maybe that is difficult to compute, so
answer is probably no.


>+extern NS_COM nsresult
>+  NS_NewStringUnicharInputStream(nsIUnicharInputStream** aInstancePtrResult,
>+                                 const nsAString* aString,
>+                                 PRBool aTakeOwnership);

we should make it easy to create one of these from JS.


>Index: xpcom/io/nsIUnicharOutputStream.idl

>+interface nsIUnicharOutputStream : nsISupports {

nit: bracket on new line


>+    /**
>+     * Write a single character to the stream. When writing many characters,
>+     * prefer the string-taking write method.
>+     *
>+     * @retval true The character was written successfully
>+     * @retval false Not all bytes of the character could be written.
>+     */
>+    boolean writeChar(in PRUnichar c);
>+
>+    /**
>+     * Write a string to the stream.
>+     *
>+     * @retval true The string was written successfully
>+     * @retval false Not all bytes of the string could be written.
>+     */
>+    boolean write(in AString str);

hmm... i think this interface should expose a write method like the
read method on nsIUnicharInputStream.  it should be possible to write
an array of PRUnichar.	write defined above could be renamed to
writeString for consistency.  maybe nsIUnicharInputStream should have
a readChar method?  what about writeSegments?  also, are there cases
where nsIUnicharInputStream::readSegments might not be supported?


>+    void flush();
>+    void close();

document these.


I haven't really reviewed the source code yet.	I think this is really
great work.  Thanks for doing it.  Let's finalize the interfaces :-)
Attachment #184877 - Flags: review?(darin) → review-
(In reply to comment #17)
> (From update of attachment 184877 [details] [diff] [review] [edit])
> >Index: xpcom/io/nsIConverterInputStream.idl
> 
> >+ * The Initial Developer of the Original Code is
> >+ * Netscape Communications Corporation.
> 
> shouldn't this be you?

Actually no, I think - that file is basically nsIConverterInputStream.h, idl-ized.

> >+     * @param aCharset
> >+     *        The character encoding to use for converting the bytes of the
> >+     *        stream.
> 
> can this parameter be null, and if so what does null mean?  (UTF-8?)

hm, maybe assuming UTF-8 would be best. I considered making null mean the system
charset... but I sorta fear that it would be used even when that's wrong.

> if i call this method, then how do i revert to the default behavior?

Yeah, you can't currently. Do you think that it should be possible? Maybe 0x0000
as character should mean default behaviour.

>  is it
> the case that i cannot do that?  also, why doesn't nsIConverterInputStream
> expose a similar method?

Hm, so on that interface, this seems to be a constructor argument; setting a
replacement character isn't possible (not possible with nsIUnicodeDecoder). an
attribute |boolean recoverFromErrors;| may be a good idea.

>  maybe i should be able to get the replacement
> char?  i.e., perhaps this should be a PRUnichar attribute instead?

yeah, I think that's a good idea.

> you could just write:
> 
>    AString readString(in unsigned long aCount);
> 
> AString knows its length, so there is no real reason to return the length.

So the reason I wrote it that way is so that users can read an entire stream like:
  while (is.readString(1234, str)) {
  }
instead of having to check whether the string is empty. should I change that?
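A runnable analogue of that loop (the StringReader here is hypothetical, standing in for the stream; XPCOM out-params become a wrapper object in JS):

```javascript
// A reader that returns how many characters it produced, with 0 at EOF,
// so callers can loop without checking whether the string is empty.
class StringReader {
  constructor(s) { this.s = s; this.pos = 0; }
  readString(count, out) {
    out.value = this.s.slice(this.pos, this.pos + count);
    this.pos += out.value.length;
    return out.value.length;   // 0 signals EOF
  }
}

const r = new StringReader("abcdefgh");
const out = {};
let all = "";
while (r.readString(3, out)) all += out.value;
console.log(all); // "abcdefgh"
```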

> we should make it easy to create one of these from JS.

Hm, yeah, probably...

> hmm... i think this interface should expose a write method like the
> read method on nsIUnicharInputStream.  it should be possible to write
> an array of PRUnichar.	write defined above could be renamed to
> writeString for consistency.

hm, yeah, guess that's a good idea. I'll remove writeChar then - if one wants to
write a single character, one can just write an array with one element.

>  maybe nsIUnicharInputStream should have
> a readChar method?  what about writeSegments? 

> also, are there cases
> where nsIUnicharInputStream::readSegments might not be supported?

it looks like all impls of nsIUnicharInputStream support it in all cases...

> >+    void flush();
> >+    void close();
> 
> document these.

yeah, will do.
as for writeSegments: the only implementation is not buffered (unlike
nsConverterInputStream), so I couldn't implement it. should I specify it
anyway? I'm not sure whether it's useful.
Attached patch patch v1.2 (obsolete) — Splinter Review
Attachment #184877 - Attachment is obsolete: true
Attachment #185306 - Flags: review?(darin)
Attachment #184877 - Flags: superreview?(bzbarsky)
having readString return the number of bytes read is fine then.  perhaps
writeString should likewise return the number of bytes written?  e.g., suppose
the underlying output stream is non-blocking and cannot accept all of the data
at once?  in a non-blocking configuration it should be okay to not write all
data at once.  in a blocking configuration, it should write the entire string
before returning.

perhaps the replacement char should just be a constructor parameter then?  i do
like that idea.  what is the use case for changing it after having converted
some data?
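The blocking-mode contract described above, looping until the sink has accepted everything, can be sketched like this (the sink accepting at most two bytes per call is hypothetical):

```javascript
// Blocking-style write: loop until the sink has accepted every byte,
// tolerating sinks that take only part of the data per call.
function writeAll(sink, data) {
  let written = 0;
  while (written < data.length) {
    const n = sink.write(data.subarray(written));
    if (n === 0) throw new Error("sink closed");
    written += n;
  }
  return written;
}

const accepted = [];
const choppySink = {
  write(chunk) {             // accepts at most 2 bytes per call
    const take = Math.min(2, chunk.length);
    accepted.push(...chunk.subarray(0, take));
    return take;
  }
};
console.log(writeAll(choppySink, new Uint8Array([1, 2, 3, 4, 5]))); // 5
```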
(In reply to comment #21)
> perhaps
> writeString should likewise return the number of bytes written?  e.g., suppose
> the underlying output stream is non-blocking and cannot accept all of the data
> at once?  in a non-blocking configuration it should be okay to not write all
> data at once.  in a blocking configuration, it should write the entire string
> before returning.

Well, I considered that. but what is the caller to do with the information? It
does not know what the total length of the string (in bytes) is.

also, the output stream is not buffered...

> perhaps the replacement char should just be a constructor parameter then?  i do
> like that idea.  what is the use case for changing it after having converted
> some data?

hm, so, originally I didn't have a way to set the replacement character; bz
thought there should be a way and suggested a setReplacementChar function (and
have default behaviour be to throw for unsupported chars). that's where things
came from :) maybe making it all an init argument would be better...

Hmm... yeah, characters != bytes.  How about if WriteString returned the number
of characters successfully written.  ReadString could similarly return the
number of characters successfully read.

I like having the replacement character be a constructor argument for both input
and output converter streams.
(In reply to comment #23)
> Hmm... yeah, characters != bytes.  How about if WriteString returned the number
> of characters successfully written.

Yeah, that seems like a good idea, but how do I convert the number of bytes
(returned from nsIOutputStream.write) to a number of characters? Hm.. I could
write each character individually, but that's probably bad for performance.
(what if a character gets written partially, if it is a multi-byte sequence?)

So.. if you have any suggestions for these questions, I can do that.

> I like having the replacement character be a constructor argument for both input
> and output converter streams.

OK, I will change the patch accordingly.
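One possible answer to the bytes-to-characters question, sketched with the standard TextEncoder (the helper is hypothetical; it counts code points, so mapping to UTF-16 PRUnichar units would need extra care around surrogate pairs):

```javascript
// Map "bytes accepted by the sink" back to "characters fully written"
// by walking character boundaries and their cumulative byte offsets.
function charsForBytes(str, bytesAccepted) {
  const enc = new TextEncoder();
  let bytes = 0, chars = 0;
  for (const ch of str) {               // iterates code points
    const len = enc.encode(ch).length;  // bytes this character needs
    if (bytes + len > bytesAccepted) break;
    bytes += len;
    chars += 1;
  }
  return chars;                         // partial characters don't count
}

console.log(charsForBytes("héllo", 3)); // "h"(1) + "é"(2) = 3 bytes → 2
```

Encoding one character at a time is slow, as noted above; a real implementation would more likely record boundary offsets during a single bulk conversion.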
Attached patch patch v1.3 (obsolete) — Splinter Review
now with replacement character as init argument
Attachment #185306 - Attachment is obsolete: true
Attachment #185718 - Flags: review?(darin)
Comment on attachment 185718 [details] [diff] [review]
patch v1.3

so, when reading we replace unknown characters with U+FFFD, but when writing we
can replace with an arbitrary character?  are the intl apis such that we cannot
easily make these apis more symmetric?
hm actually... it looks like that replacing is not done by nsIUnicodeDecoder,
instead it's done at
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsConverterInputStream.cpp#216

so that should be trivially fixable; I'll make a new patch tomorrowish.
Cool.  I think consistency will be worth the effort.  Much appreciated :)
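For reference, the decode-side behavior being made symmetric here is illustrated by the modern TextDecoder API (which postdates this bug): malformed bytes become U+FFFD, or throw when fatal mode is requested.

```javascript
// Malformed bytes decode to U+FFFD in the lenient (default) mode,
// mirroring the replacement behavior discussed above.
const bad = new Uint8Array([0x61, 0xff, 0x62]); // 0xFF is never valid UTF-8
const lenient = new TextDecoder("utf-8").decode(bad);
console.log(lenient === "a\uFFFDb"); // true

// fatal mode corresponds to the throw-on-error behavior.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = true;
}
console.log(threw); // true
```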
Attached patch patch v1.4 (obsolete) — Splinter Review
Attachment #185718 - Attachment is obsolete: true
Attachment #185775 - Flags: review?(darin)
Comment on attachment 185775 [details] [diff] [review]
patch v1.4

>Index: xpcom/io/nsIConverterInputStream.idl

>+     * @param aReplacementChar
>+     *        The character to replace unknown byte sequences in the stream
>+     *        with. The standard replacement character is U+FFFD.
>+     *        A value of 0x0000 will cause an exception to be thrown if unknown
>+     *        byte sequences are encountered in the stream.

maybe document the exception?


>Index: xpcom/io/nsIConverterOutputStream.idl

>+    void init(in nsIOutputStream aOutStream, in string aCharset,
>+              in PRUnichar aReplacementCharacter);

I think it would make sense to have a buffer size parameter here as well.
Even if the current implementation does not buffer, it could make sense
to support that in another implementation.

Why does reading get a configurable buffer when writing does not?  etc.


>Index: xpcom/io/nsIUnicharInputStream.idl

>+
>+  void close();
>+};

document me :)


>Index: xpcom/io/nsUnicharInputStream.cpp

>+  if (mPos >= mLen) {
>+    *aReadCount = 0;
>+    return NS_OK;
>+  }
>+  NS_ASSERTION(mLen >= mPos, "unsigned madness");

wait.. how can this assertion ever get triggered?


>Index: intl/uconv/src/nsConverterInputStream.cpp

> nsConverterInputStream::Close()
> {
>+    nsresult rv;
>+    if (mInput)
>+      rv = mInput->Close();
>+    else
>+      rv = NS_OK;

I'm a big fan of:

      nsresult rv = mInput ? mInput->Close() : NS_OK;

in these cases :)


>Index: intl/uconv/src/nsConverterOutputStream.cpp

>+nsConverterOutputStream::nsConverterOutputStream()
>+{
>+}

can this just live in the header file?


>+    nsAutoArrayPtr<char> buf(new char[maxLen]);

would it be worth it to use nsCAutoString instead?  that way
small writes will use the stack instead of the heap.

      nsCAutoString buf;
      buf.SetLength(maxLen);

etc.


>+    PRInt32 inLen = aString.Length();
>+    nsAString::const_iterator i;
>+    aString.BeginReading(i);
>+    return Write(inLen, i.start(), aSuccess);

I think you want .get() instead of .start()

It turns out that there is no difference, but it would seem
better to ask for the address of the character that the iterator
is pointing at instead of some other address.


>Index: netwerk/base/public/nsReadLine.h

>+template<typename CharType>

We use CharT in the string code.  Maybe do the same here?


>Index: layout/style/nsCSSLoader.cpp

>+                               8192, 0xFFFD);

Maybe there should be a #define for this special value someplace.


Looking really good...
Attachment #185775 - Flags: review?(darin) → review-
> maybe document the exception?

hm... the intl code is inconsistent in the exception it throws... it usually
seems to be NS_ERROR_UNEXPECTED, but it's documented to be
NS_ERROR_ILLEGAL_INPUT and one converter does throw that...

> wait.. how can this assertion ever get triggered?

it can be triggered if mLen > mPos, no? (I copied this from Read...)

>I think you want .get() instead of .start()

Ah... I didn't notice that there was a .get().

Attached patch patch v1.5Splinter Review
Attachment #185775 - Attachment is obsolete: true
Attachment #186754 - Flags: review?(darin)
Status: NEW → ASSIGNED
Priority: -- → P1
(In reply to comment #31)
> > maybe document the exception?
> 
> hm... the intl code is inconsistent in the exception it throws... it usually
> seems to be NS_ERROR_UNEXPECTED, but it's documented to be
> NS_ERROR_ILLEGAL_INPUT and one converter does throw that...
> 

OK, that's unfortunate.  File a bug on that inconsistency?


> > wait.. how can this assertion ever get triggered?
> 
> it can be triggered if mLen > mPos, no? (I copied this from Read...)

>+  if (mPos >= mLen) {
>+    *aReadCount = 0;
>+    return NS_OK;
>+  }
>+  NS_ASSERTION(mLen >= mPos, "unsigned madness");

If mLen > mPos, then the assertion is upheld.  If mLen <= mPos, then
we never reach this assertion.  So, the assertion seems unnecessary
or wrongly placed.
(In reply to comment #33)
> OK, that's unfortunate.  File a bug on that inconsistency?

OK, I filed Bug 298533.

> If mLen > mPos, then the assertion is upheld.  If mLen <= mPos, then
> we never reach this assertion.  So, the assertion seems unnecessary
> or wrongly placed.

Oh.. you're right... I got that backwards. Hm.. warren added that assertion in
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/xpcom/io&command=DIFF_FRAMESET&file=nsUnicharInputStream.cpp&rev2=3.5&rev1=3.4

seems like it is not necessary; I'll remove it
Attached patch patch v1.6Splinter Review
removed the two assertions, and fixed (linux) debug builds
Attachment #187076 - Flags: review?(darin)
Attachment #187076 - Flags: review?(darin) → review+
Comment on attachment 187076 [details] [diff] [review]
patch v1.6

ok, hm... I'd ask for sr=bz, but he's currently unavailable, and I'd like to
get this checked in for alpha2... anyone mind if I checkin with r=jshin
sr=darin?
Attachment #187076 - Flags: superreview?(darin)
Comment on attachment 187076 [details] [diff] [review]
patch v1.6

go for it.  we'll still have opportunities to revise these APIs (if bz or
someone else comes up with desirable changes) after DPA2.
Attachment #187076 - Flags: superreview?(darin) → superreview+
Comment on attachment 187076 [details] [diff] [review]
patch v1.6

this patch adds new APIs for use by extensions and mozilla code to make it
easier to read/write non-ASCII characters. since the functions are new, the
risk is very low.
Attachment #187076 - Flags: approval1.8b3?
Comment on attachment 187076 [details] [diff] [review]
patch v1.6

a=chofmann
Attachment #187076 - Flags: approval1.8b3? → approval1.8b3+
OK, checked in for 1.8b3!
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
It looks like this hurt Tp on btek by 5-10ms.