Closed Bug 295047 Opened 20 years ago Closed 20 years ago

Want unicode stream readers/writers

Categories

(Core :: Internationalization, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: Biesinger, Assigned: Biesinger)

References

Details

(Keywords: arch)

Attachments

(3 files, 9 obsolete files)

1.37 KB, application/x-javascript
Details
63.62 KB, patch
Details | Diff | Splinter Review
64.01 KB, patch
darin.moz
: review+
darin.moz
: superreview+
chofmann
: approval1.8b3+
Details | Diff | Splinter Review
It would be nice to have an easy way to read from/write to streams, having charset conversion automatically done. something like: interface UnicodeStreamWriter { void init(in nsIOutputStream stream, in string charset); void write(in AString string); } similarly for reading (maybe we can just make converterinputstream scriptable) this should simplify reading/writing strings a lot.
Attached patch patch v0 (writing) (obsolete) — Splinter Review
not quite complete yet, but shows the idea. any comments, especially on the interface?
(assumes a unix-like system)
Attached patch patch v0.1 (reading + writing) (obsolete) — Splinter Review
- also does reading, by making converterinputstream scriptable (this fixes bug 260281) - return Flush's return value from close if it failed - call mOutStream->Flush from Flush
Attachment #184209 - Attachment is obsolete: true
Attachment #184210 - Attachment is obsolete: true
> - call mOutStream->Flush from Flush NOTE: flushing a buffered output stream does not call flush on the inner output stream. that's probably a good thing becaues flushing a file output stream means calling PR_Sync, which can be very very expensive. you might want to consider implementing this flush function similarly.
ok, yeah - people who want to flush the underlying stream can just keep a reference to it.
Attached patch patch v0.5 (obsolete) — Splinter Review
- allow setting a replacement character (default to throwing when writing an unsupported char) - implement writeChar - don't flush the underlying stream in flush todo: - decide what to do when only some bytes were written - should nsIUnicharOutputStream.idl be in xpcom/io, where nsIUnicharInputStream.idl is?
Attachment #184212 - Attachment is obsolete: true
http://developer-test.mozilla.org/docs/Writing_textual_data http://developer-test.mozilla.org/docs/Reading_textual_data have some documentation on these interfaces, and what can be done in previous mozilla versions.
If we can put the interfaces in xpcom/io, I think that would be great (even if the actual implementation is in uconv).
bz: nsIConverter{Input,Output}Stream too? (note: nsIConverterInputStream.h currently lives in intl/uconv/public)
Possibly, yes. That would allow XPCOM code to work with those streams, basically, which may be desirable.
Attached patch patch v1.0 (obsolete) — Splinter Review
- return a boolean from the write functions that indicates whether the write was successful - move the IDLs to xpcom - add nsIUnicharLineInputStream to read linewise from the converter input stream - templatize nsReadLine.h in order to do that
Attachment #184628 - Attachment is obsolete: true
Attachment #184730 - Flags: superreview?(bzbarsky)
Attachment #184730 - Flags: review?(jshin1987)
Comment on attachment 184730 [details] [diff] [review] patch v1.0 r=jshin thank you for the work ! >Index: xpcom/io/nsIConverterInputStream.idl >+interface nsIInputStream; >+ >+/** >+ * A unichar input stream that wraps an input stream. >+ * This allows reading unicode strings from a stream, automatically converting >+ * the bytes from a selected charset. Let's standardize on 'character encoding' in the documentation (I changed accordingly in the developer wiki pages for the interface without talking to you which I should've done.). Our i18n library uses charset extensively, but it was written when multiple terms were used in a rather confusing manner. For variable names, 'aCharEncoding' is a bit long. I'll not insist on using it (especially considering that we have nsICharsetConverterManager and many other names like that...). >Index: xpcom/io/nsIConverterOutputStream.idl >+ * @param aReplacementChar >+ * The replacement character to use when an unsupported character >+ * is found. >+ * The character must be encodable in the selected character set. >+ */ >+ void setReplacementChar(in PRUnichar aReplacementChar); Perhaps, it's a good idea to document what happens when a character not representable is passed to this method (fall back to the default behavior?) >+interface nsIUnicharLineInputStream : nsISupports >+{ >+ /** >+ * Read a single line from the stream, where a line is a >+ * possibly zero length sequence of characters terminated by a >+ * CR, LF, CRLF, LFCR, or eof. This is supposed to be used 'on top of' nsIUnicharInputStream, isn't it? Btw, Unicode has other line separators. We have to think about them in another bug, IMO. >+UTF8InputStream::ReadString(PRUint32 aCount, nsAString& aString, >+ PRUint32* aReadCount) >+{ >+ NS_ASSERTION(mUnicharDataLength >= mUnicharDataOffset, "unsigned madness"); >+ PRUint32 rv = mUnicharDataLength - mUnicharDataOffset; >+ nsresult errorCode; Perhaps, it's me, but it looks a bit strange to see 'rv' used for something other than the return value. How about using 'readCount' in place of 'rv' and using 'rv' for 'errorCode' (as is usually done)? >+nsConverterInputStream::ReadString(PRUint32 aCount, nsAString& aString, >+ PRUint32* aReadCount) >+{ >+ PRUint32 rv = mUnicharDataLength - mUnicharDataOffset; same here. >Index: netwerk/base/public/nsReadLine.h >+template<typename CharType> >+class nsLineBuffer { >+ public: >+ CharType buf[kLineBufferSize+1]; >+ CharType* start; >+ CharType* current; >+ CharType* end; > PRBool empty; |struct nsLineBuffer| doesn't work with template? It doesn't matter, but I'm just wondering.
Attachment #184730 - Flags: review?(jshin1987) → review+
>Perhaps, it's me, but it looks a bit strange to see 'rv' used for something >other than the return value. How about using 'readCount' in place of 'rv' and >using 'rv' for 'errorCode' (as is usually done)? yes, I agree. I want to note that I copied the existing code here. but maybe it's better to change the existing code too, I will do that. > |struct nsLineBuffer| doesn't work with template? it does on GCC... I don't trust its portability. I'll make the other changes. Your wiki changes look fine. What's the difference between UTF-32 and UCS-4 though? I thought they were identical.
>This is supposed to be used 'on top of' nsIUnicharInputStream, isn't it? yeah. nsConverterInputStream can be qi'd to it, although I'm not sure that that's the best way to do it. better might be a way to wrap it around any nsIUnicharInputStream. >Btw, Unicode has other line separators. We have to think about them in another >bug, IMO. it does? fun...
Attachment #184730 - Flags: superreview?(bzbarsky)
Attachment #184730 - Flags: review?(darin)
Attached patch patch v1.1 (obsolete) — Splinter Review
Attachment #184730 - Attachment is obsolete: true
Attachment #184877 - Flags: superreview?(bzbarsky)
Attachment #184877 - Flags: review?(darin)
Comment on attachment 184877 [details] [diff] [review] patch v1.1 >Index: xpcom/io/nsIConverterInputStream.idl >+ * The Initial Developer of the Original Code is >+ * Netscape Communications Corporation. shouldn't this be you? >+ * @param aCharset >+ * The character encoding to use for converting the bytes of the >+ * stream. can this parameter be null, and if so what does null mean? (UTF-8?) >Index: xpcom/io/nsIConverterOutputStream.idl >+interface nsIConverterOutputStream : nsIUnicharOutputStream { I'd prefer opening bracket on a new line for consistency with other IDL files. >+ /** >+ * Initialize this stream. Must be called before any other method on this >+ * interface, or you will crash. The arguments passed to this method must >+ * not be null, or you will crash. >+ */ >+ void init(in nsIOutputStream aOutStream, in string aCharset); maybe null charset could mean UTF-8, no? >+ /** >+ * Set the replacement character to use when encountering characters that >+ * can not be represented in the selected character encoding. >+ * The default is to throw an exception upon trying to write unsupported >+ * characters. if i call this method, then how do i revert to the default behavior? is it the case that i cannot do that? also, why doesn't nsIConverterInputStream expose a similar method? maybe i should be able to get the replacement char? i.e., perhaps this should be a PRUnichar attribute instead? >Index: xpcom/io/nsIUnicharInputStream.idl >+%{C++ >+/** >+ * The signature of the writer function passed to ReadSegments. This >+ * is the "consumer" of data that gets read from the stream's buffer. >+ * >+ * @param aInStream stream being read >+ * @param aClosure opaque parameter passed to ReadSegments >+ * @param aFromSegment pointer to memory owned by the input stream >+ * @param aToOffset amount already read (since ReadSegments was called) >+ * @param aCount length of fromSegment >+ * @param aWriteCount number of bytes read >+ * >+ * Implementers should return the following: >+ * >+ * @return NS_OK and (*aWriteCount > 0) if consumed some data >+ * @return <any-error> if not interested in consuming any data >+ * >+ * Errors are never passed to the caller of ReadSegments. >+ * >+ * NOTE: returning NS_OK and (*aWriteCount = 0) has undefined behavior. >+ */ >+typedef NS_CALLBACK(nsWriteUnicharSegmentFun)(nsIUnicharInputStream *aInStream, >+ void *aClosure, >+ const PRUnichar *aFromSegment, >+ PRUint32 aToOffset, >+ PRUint32 aCount, >+ PRUint32 *aWriteCount); >+%} >+native nsWriteUnicharSegmentFun(nsWriteUnicharSegmentFun); >+ >+/** >+ * Abstract unicode character input stream >+ * @see nsIInputStream >+ */ >+[scriptable, uuid(d5e3bd80-6723-4b92-b0c9-22f6162fd94f)] >+interface nsIUnicharInputStream : nsISupports { >+ /** >+ * Reads into a caller-provided character array. >+ * >+ * @return The number of characters that were successfully read >+ * >+ * @note To read more than 2^32 characters, call this method multiple times. >+ */ >+ [noscript] unsigned long read([array, size_is(aCount)] in PRUnichar aBuf, >+ in unsigned long aCount); return value can be less than aCount even if there is more data in the source stream. you should also mention the special case of returning 0 to indicate EOF. (at least i think this interface should behave like that to be consistent with nsIInputStream.) >+ /** >+ * @return The number of characters actually read. >+ */ >+ [noscript] unsigned long readSegments(in nsWriteUnicharSegmentFun aWriter, needs better documentation. see nsIInputStream.idl >+ /** >+ * Read into a string object. >+ * @param aCount The number of characters that should be read >+ * @return The number of characters that were read. >+ */ >+ unsigned long readString(in unsigned long aCount, out AString aString); you could just write: AString readString(in unsigned long aCount); AString knows its length, so there is no real reason to return the length. should this API also expose an Available() method that returns the number of unicode characters available? maybe that is difficult to compute, so answer is probably no. >+extern NS_COM nsresult >+ NS_NewStringUnicharInputStream(nsIUnicharInputStream** aInstancePtrResult, >+ const nsAString* aString, >+ PRBool aTakeOwnership); we should make it easy to create one of these from JS. >Index: xpcom/io/nsIUnicharOutputStream.idl >+interface nsIUnicharOutputStream : nsISupports { nit: bracket on new line >+ /** >+ * Write a single character to the stream. When writing many characters, >+ * prefer the string-taking write method. >+ * >+ * @retval true The character was written successfully >+ * @retval false Not all bytes of the character could be written. >+ */ >+ boolean writeChar(in PRUnichar c); >+ >+ /** >+ * Write a string to the stream. >+ * >+ * @retval true The string was written successfully >+ * @retval false Not all bytes of the string could be written. >+ */ >+ boolean write(in AString str); hmm... i think this interface should expose a write method like the read method on nsIUnicharInputStream. it should be possible to write an array of PRUnichar. write defined above could be renamed to writeString for consistency. maybe nsIUnicharInputStream should have a readChar method? what about writeSegments? also, are there cases where nsIUnicharInputStream::readSegments might not be supported? >+ void flush(); >+ void close(); document these. I haven't really reviewed the source code yet. I think this is really great work. Thanks for doing it. Let's finalize the interfaces :-)
Attachment #184877 - Flags: review?(darin) → review-
(In reply to comment #17) > (From update of attachment 184877 [details] [diff] [review] [edit]) > >Index: xpcom/io/nsIConverterInputStream.idl > > >+ * The Initial Developer of the Original Code is > >+ * Netscape Communications Corporation. > > shouldn't this be you? Actually no, I think - that file is basically nsIConverterInputStream.h, idl-ized. > >+ * @param aCharset > >+ * The character encoding to use for converting the bytes of the > >+ * stream. > > can this parameter be null, and if so what does null mean? (UTF-8?) hm, maybe assuming UTF-8 would be best. I considered making null mean the system charset... but I sorta fear that it would be used even when that's wrong. > if i call this method, then how do i revert to the default behavior? Yeah, you can't currently. Do you think that it should be possible? Maybe 0x0000 as character should mean default behaviour. > is it > the case that i cannot do that? also, why doesn't nsIConverterInputStream > expose a similar method? Hm, so on that interface, this seems to be a constructor argument; setting a replacement character isn't possible (not possible with nsIUnicodeDecoder). an attribute |boolean recoverFromErrors;| may be a good idea. > maybe i should be able to get the replacement > char? i.e., perhaps this should be a PRUnichar attribute instead? yeah, I think that's a good idea. > you could just write: > > AString readString(in unsigned long aCount); > > AString knows its length, so there is no real reason to return the length. So the reason I wrote it that way is so that users can read an entire stream like: while (is.readString(1234, str)) { } instead of having to check whether the string is empty. should I change that? > we should make it easy to create one of these from JS. Hm, yeah, probably... > hmm... i think this interface should expose a write method like the > read method on nsIUnicharInputStream. it should be possible to write > an array of PRUnichar. write defined above could be renamed to > writeString for consistency. hm, yeah, guess that's a good idea. I'll remove writeChar then - if one wants to write a single character, one can just write an array with one element. > maybe nsIUnicharInputStream should have > a readChar method? what about writeSegments? > also, are there cases > where nsIUnicharInputStream::readSegments might not be supported? it looks like all impls of nsIUnicharInputStream support it in all cases... > >+ void flush(); > >+ void close(); > > document these. yeah, will do.
as for writeSegments: The only implementation is not buffered... (contrary to nsConverterInputStream). so, I couldn't implement it. should I specify it anyway? I'm not sure whether it's useful.
Attached patch patch v1.2 (obsolete) — Splinter Review
Attachment #184877 - Attachment is obsolete: true
Attachment #185306 - Flags: review?(darin)
Attachment #184877 - Flags: superreview?(bzbarsky)
having readString return the number of bytes read is fine then. perhaps writeString should likewise return the number of bytes written? e.g., suppose the underlying output stream is non-blocking and cannot accept all of the data at once? in a non-blocking configuration it should be okay to not write all data at once. in a blocking configuration, it should write the entire string before returning. perhaps the replacement char should just be a constructor parameter then? i do like that idea. what is the use case for changing it after having converted some data?
(In reply to comment #21) > perhaps > writeString should likewise return the number of bytes written? e.g., suppose > the underlying output stream is non-blocking and cannot accept all of the data > at once? in a non-blocking configuration it should be okay to not write all > data at once. in a blocking configuration, it should write the entire string > before returning. Well, I considered that. but what is the caller to do with the information? It does not know what the total length of the string (in bytes) is. also, the output stream is not buffered... > perhaps the replacement char should just be a constructor parameter then? i do > like that idea. what is the use case for changing it after having converted > some data? hm, so, originally I didn't have a way to set the replacement character; bz thought there should be a way and suggested a setReplacementChar function (and have default behaviour be to throw for unsupported chars). that's where things came from :) maybe making it all an init argument would be better...
Hmm... yeah, characters != bytes. How about if WriteString returned the number of characters successfully written. ReadString could similarly return the number of characters successfully read. I like having the replacement character be a constructor argument for both input and output converter streams.
(In reply to comment #23) > Hmm... yeah, characters != bytes. How about if WriteString returned the number > of characters successfully written. Yeah, that seems like a good idea, but how do I convert the number of bytes (returned from nsIOutputStream.write) to a number of characters? Hm.. I could write each character individually, but that's probably bad for performance. (what if a character gets written partially, if it is a multi-byte sequence?) So.. if you have any suggestions for these questions, I can do that. > I like having the replacement character be a constructor argument for both input > and output converter streams. OK, I will change the patch accordingly.
Attached patch patch v1.3 (obsolete) — Splinter Review
now with replacement character as init argument
Attachment #185306 - Attachment is obsolete: true
Attachment #185718 - Flags: review?(darin)
Comment on attachment 185718 [details] [diff] [review] patch v1.3 so, when reading we replace unknown characters with U+FFFD, but when writing we can replace with an arbitrary character? are the intl apis such that we cannot easily make these apis more symmetric?
hm actually... it looks like that replacing is not done by nsIUnicodeDecoder, instead it's done at http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsConverterInputStream.cpp#216 so that should be trivially fixable; I'll make a new patch tomorrowish.
Cool. I think consistency will be worth the effort. Much appreciated :)
Attached patch patch v1.4 (obsolete) — Splinter Review
Attachment #185718 - Attachment is obsolete: true
Attachment #185775 - Flags: review?(darin)
Comment on attachment 185775 [details] [diff] [review] patch v1.4 >Index: xpcom/io/nsIConverterInputStream.idl >+ * @param aReplacementChar >+ * The character to replace unknown byte sequences in the stream >+ * with. The standard replacement character is U+FFFD. >+ * A value of 0x0000 will cause an exception to be thrown if unknown >+ * byte sequences are encountered in the stream. maybe document the exception? >Index: xpcom/io/nsIConverterOutputStream.idl >+ void init(in nsIOutputStream aOutStream, in string aCharset, >+ in PRUnichar aReplacementCharacter); I think it would make sense to have a buffer size parameter here as well. Even if the current implementation does not buffer, it could make sense to support that in another implementation. Why does reading get a configurable buffer when writing does not? etc. >Index: xpcom/io/nsIUnicharInputStream.idl >+ >+ void close(); >+}; document me :) >Index: xpcom/io/nsUnicharInputStream.cpp >+ if (mPos >= mLen) { >+ *aReadCount = 0; >+ return NS_OK; >+ } >+ NS_ASSERTION(mLen >= mPos, "unsigned madness"); wait.. how can this assertion ever get triggered? >Index: intl/uconv/src/nsConverterInputStream.cpp > nsConverterInputStream::Close() > { >+ nsresult rv; >+ if (mInput) >+ rv = mInput->Close(); >+ else >+ rv = NS_OK; I'm a big fan of: nsresult rv = mInput ? mInput->Close() : NS_OK; in these cases :) >Index: intl/uconv/src/nsConverterOutputStream.cpp >+nsConverterOutputStream::nsConverterOutputStream() >+{ >+} can this just live in the header file? >+ nsAutoArrayPtr<char> buf(new char[maxLen]); would it be worth it to use nsCAutoString instead? that way small writes will use the stack instead of the heap. nsCAutoString buf; buf.SetLength(maxLen); etc. >+ PRInt32 inLen = aString.Length(); >+ nsAString::const_iterator i; >+ aString.BeginReading(i); >+ return Write(inLen, i.start(), aSuccess); I think you want .get() instead of .start() It turns out that there is no difference, but it would seem better to ask for the address of the character that the iterator is pointing at instead of some other address. >Index: netwerk/base/public/nsReadLine.h >+template<typename CharType> We use CharT in the string code. Maybe do the same here? >Index: layout/style/nsCSSLoader.cpp >+ 8192, 0xFFFD); Maybe there should be a #define for this special value someplace. Looking really good...
Attachment #185775 - Flags: review?(darin) → review-
> maybe document the exception? hm... the intl code is inconsistent in the exception it throws... it usually seems to be NS_ERROR_UNEXPECTED, but it's documented to be NS_ERROR_ILLEGAL_INPUT and one converter does throw that... > wait.. how can this assertion ever get triggered? it can be triggered if mLen > mPos, no? (I copied this from Read...) >I think you want .get() instead of .start() Ah... I didn't notice that there was a .get().
Attached patch patch v1.5Splinter Review
Attachment #185775 - Attachment is obsolete: true
Attachment #186754 - Flags: review?(darin)
Status: NEW → ASSIGNED
Priority: -- → P1
(In reply to comment #31) > > maybe document the exception? > > hm... the intl code is inconsistent in the exception it throws... it usually > seems to be NS_ERROR_UNEXPECTED, but it's documented to be > NS_ERROR_ILLEGAL_INPUT and one converter does throw that... > OK, that's unfortunate. File a bug on that inconsistency? > > wait.. how can this assertion ever get triggered? > > it can be triggered if mLen > mPos, no? (I copied this from Read...) >+ if (mPos >= mLen) { >+ *aReadCount = 0; >+ return NS_OK; >+ } >+ NS_ASSERTION(mLen >= mPos, "unsigned madness"); If mLen > mPos, then the assertion is upheld. If mLen <= mPos, then we never reach this assertion. So, the assertion seems unnecessary or wrongly placed.
(In reply to comment #33) > OK, that's unfortunate. File a bug on that inconsistency? OK, I filed Bug 298533. > If mLen > mPos, then the assertion is upheld. If mLen <= mPos, then > we never reach this assertion. So, the assertion seems unnecessary > or wrongly placed. Oh.. you're right... I got that backwards. Hm.. warren added that assertion in http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/xpcom/io&command=DIFF_FRAMESET&file=nsUnicharInputStream.cpp&rev2=3.5&rev1=3.4 seems like is it not necessary, I'll remove it
Attached patch patch v1.6Splinter Review
removed the two assertions, and fixed (linux) debug builds
Attachment #187076 - Flags: review?(darin)
Attachment #187076 - Flags: review?(darin) → review+
Comment on attachment 187076 [details] [diff] [review] patch v1.6 ok, hm... I'd ask for sr=bz, but he's currently unavailable, and I'd like to get this checked in for alpha2... anyone mind if I checkin with r=jshin sr=darin?
Attachment #187076 - Flags: superreview?(darin)
Comment on attachment 187076 [details] [diff] [review] patch v1.6 go for it. we'll still have opportunities to revise these APIs (if bz or someone else comes up with desirable changes) after DPA2.
Attachment #187076 - Flags: superreview?(darin) → superreview+
Comment on attachment 187076 [details] [diff] [review] patch v1.6 this patch adds new APIs for use by extensions and mozilla code to make it easier to read/write non-ASCII characters. since the functions are new, the risk is very low.
Attachment #187076 - Flags: approval1.8b3?
Comment on attachment 187076 [details] [diff] [review] patch v1.6 a=chofmann
Attachment #187076 - Flags: approval1.8b3? → approval1.8b3+
OK, checked in for 1.8b3!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
It looks like this hurt Tp on btek by 5-10ms.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: