Closed Bug 197166 Opened 21 years ago Closed 20 years ago

[nsILineInputStream] shouldn't be allocating/using AString buffers

Categories

(Core Graveyard :: File Handling, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dwitte, Assigned: Bienvenu)

References

Details

(Keywords: fixed-aviary1.0, fixed1.7.5)

Attachments

(3 files, 3 obsolete files)

darin mentioned this a while back, when we were reviewing one of the cookie
patches. nsILineInputStream::ReadLine gives back an nsAString, which is silly,
because it's just reading an ACString from the file and zero-padding it to give
a unicode string. So the callers are forced to use a temp AString buffer, and
downconvert back to ASCII in most cases.

nsILineInputStream only has a few real consumers, so should be fairly simple to
fix it and update its callers.
Which way are we going to jump here?  Option 1 is to make it work with ASCII
only.  Option 2 is to make it do conversions to Unicode from arbitrary charsets
instead of what it does now.

At the time when I wrote it, it didn't really matter for the use I had for it,
and I did not have a very good grasp of intl issues.  If I were doing it today,
I think I'd be tempted by option 2....
>Which way are we going to jump here?  Option 1 is to make it work with ASCII
>only.  Option 2 is to make it do conversions to Unicode from arbitrary charsets
>instead of what it does now.

When darin mentioned it, I understood he thought option 1 was fine.

>At the time when I wrote it, it didn't really matter for the use I had for it,
>and I did not have a very good grasp of intl issues.  If I were doing it today,
>I think I'd be tempted by option 2....

Of course, I have little grasp of intl issues at this point ;)

I would presume all file formats across plaftorms are ASCII, in which case
option 1 avoids the temp buffer copies that lineinputstream does now. if this is
wrong, well...
> I would presume all file formats across plaftorms are ASCII

That's the pretty questionable assumption.... If we make this assumption, we may
be OK with ASCII or UTF8 or Shift_JIS files, but God forbid someone uses UTF-16....

What are the current users of this code other than mailcap stuff?
Hmm, ok, I'd better wait for someone to decide which option to take then.

Currently the only other user apart from mailcap, is
uriloader/exthandler/os2/nsOSHelperAppService.cpp, and soon-to-be cookie code...
Oh, one final note. the impl for nsILineInputStream currently uses
nsFileInputStream::Read, which returns a char* buffer. So that would also have
to be fixed to deal with arbitrary charsets? Which is a much bigger undertaking,
methinks...
Well.. that char* buffer is a buffer of _bytes_, which is all nsIInputStream
promises.  But the basic premise of a "line" input stream is that you are
providiing characters, since the concept of a "line" makes no sense if you're
working with bytes.  So the line input stream has to convert bytes to characters
somehow; the question is exactly how.  ;)
nsBookmarksService.cpp also uses this code... see my patch in bug 191783, which
"decided" that nsILineInputStream was doing UTF8 conversion.
erg, thanks for the link bsmedberg. I didn't realize boris already had a "plan"
for it ;)

so i guess the AString outparam is kinda necessary if we want to support
arbitrary charsets. So it comes back to the question of 1) putting in a quick
fix that changes it to deal only with ACStrings, or 2) go and extend the iface
to allow the consumer to specify the charset.

If we go with 2), it'd be nice to have two ::ReadLine functions; one that takes
AString and another ACString, to avoid silly up/downconverting in the latter case.

bsmedberg, bz, darin, alecf: what's the vote for now? if #2, who's willing to
take it on?
a) All our current consumers assume ASCII/UTF8 input. So why don't we codify
that? (which is what my patch in bug 191783 does). 

b) If we add a CString readLine method, can it be AUTF8String? i.e.
boolean readUTF8Line(out AUTF8String aLine);
I don't think that this would hurt any consumers that are assuming ASCII.
Option a) sounds good to me - let's fix it up for ASCII & UTF8 only, since
that's all the consumers care about at the moment.

And if we go that route, we can change the iface to AUTF8String, kill the
conversions, and be done with it. Adding a second AString iface can be a future
exercise, if anyone actually wants it.

comments?
OS: Windows 2000 → All
Yeah, I'm ok with just treating things as UTF-8 and returning nsAUTF8String for
now...  But document that _very_ clearly.  ;)
i think this kind of low-level interface does not need to restrict the charset.
 it does not need to know anything about charsets.  it can claim to simply read
a hunk of data up to but not including the first \r, \n, or \r\n.  it should be
just like fgets or getline functions found in standard C/C++.  as soon as you
require a charset, like UTF-8, then you suddenly have to discover what the
charset of the underlying stream is.  there isn't any good way to do that.  and
once you do find out the charset, you will have to convert it from whatever
charset it is to UTF-8.  most likely it will not be UTF-8, since we are mostly
talking about streams originating from the filesystem, and filesystems mostly
don't use UTF-8 (today... tomorrow may be different of course).  so, now we will
have to add generic code to nsLineInputStream that converts foo-charset to
UTF-8.  well, i can tell you that that generally requires an intermediate
conversion to UCS-2 or UTF-16.  doing so is just wrong at this level.  don't go
there.  keep it simple please!!

  [scriptable, uuid(... new IID required ...)]
  interface nsILineInputStream : nsISupports
  {
    boolean readLine(out ACString aLine);
  };

ACString can pass binary data.  this is one of the key distinctions between
ACString and the plain old |string| IDL type.  Javascript/XPConnect will not
have any trouble dealing with aLine containing binary crap provided the JS code
is prepared to deal with such.  XPConnect will simply zero-pad expand the data,
which is a non-destructive conversion, and is equivalent to what we are
currently providing with an AString out param.  JS can do something fancy with
the zero-pad expanded data if it needs to.
Yeah... The only problem there is that a UCS-2 file containing \r\n to denote
EOL will look like two lines to the proposed readLine (since the bytes are 0x10
0x0 0x13 0x0).  I think that's something we can live with.
the interface could support configurable line endings, but i think that might
just be going off the deep end since we don't have any consumers requiring it 8^)
after this is fixed, we might want to start using it in bug 202474, where we're
writing our own line-parsing code.
after this is fixed, we might want to start using it in bug 202474, where we're
writing our own line-parsing code.
*** Bug 209431 has been marked as a duplicate of this bug. ***
I agree with darin... let's not make things too complicated here...

Now if you insist, you could add a parameter "boolean two_byte_line_terminator"
to readLine, which would look for \r\0\n\0.
after talking to Darin and Dan, taking - I need this anyway...
Assignee: dwitte → bienvenu
Attached patch work in progress (obsolete) — Splinter Review
this seems to work. Most of it is pretty straightforward. I did have to fix
nsFileStreams Seek method to clear the mLineBuffer, or else if you read some
stuff, then seek, then read, you'll get some of the previous buffer. Some of
the code I haven't compiled because its Linux or OS/2 or FireFox-specific.

I need to clean up nsMsgBodyHandler::StripHtml, which as I have it now,
manipulates the nsCString's buffer directly.

this patch makes local folder searching on windows quite a bit faster. My debug
build is 25% faster than my release build, so the speedup is likely more than
25%, all from caching the input stream and not using nsFileSpec's readLine.
So which of the options discussed in comment 1 through comment 13 did we go
with, then?
Darin's comment 12:
 [scriptable, uuid(... new IID required ...)]
  interface nsILineInputStream : nsISupports
  {
    boolean readLine(out ACString aLine);
  };
OK.  Could we please very clearly document in this interface exactly what it
means by a "line" so that people will know, in that case?  Because what it means
is NOT "a line of text" but rather "a collection of bytes satisfying certain
criteria".  Those criteria need to be explicitly listed.
Is this accurate/sufficient? It looks like this is what NS_ReadLine does...

   * Read a single line from the stream, where a line is a 
   * possibly zero length sequence of bytes terminated by 
   * CR, LF, CRLF, LFCR, or eof.
   * The line terminator is not returned.
   * Return false for end of file, true otherwise

Or should I just put it as Darin so succinctly did:

read a hunk of bytes up to but not including the first \r, \n, or \r\n

David, that looks great.  Thanks!
Oh, and as far as the unix and OS/2 changes go, they will need to be a lot more
extensive... there are 4-5 callers of readLine() in each of those files...  And
we'll need to decide what to do with the bytes once we get then (converting to
unicode from native is what I think we should do).
Attached patch non-mail diffs. (obsolete) — Splinter Review
oops, thx, yes, I certainly missed a few there. This patch gets more of the
readLine calls, all of them, I hope.

Re how to convert, maybe I'm being naive, but the old code just did a 0
expansion of the 8 bit data, via Assign/AppendWithConversion, so if we keep
doing that, we'll be working the same say the old code did. The callers can do
what they want, of course. Many of them were pretty insistent on using 16 bit
chars, even though the data is really 8 bit, so I didn't convert everything
over to nsCStrings because the changes would ripple too much.
Attachment #146247 - Attachment is obsolete: true
I think you still missed at least two callers of readLine in
nsOSHelperAppServiceUnix (and presumably the same on OS/2).  Are you going to
get a chance to build this on Unix before checking in, by the way?
it will be compiled on unix before being checked in, and I hope to get someone
to compile it on OS/2 as well...I did miss one unix call, but I think I got all
the OS/2 ones...I'll apply it against firefox as well.
I believe this should compile on linux, os/2, and with firefox...
Attachment #146289 - Attachment is obsolete: true
Comment on attachment 146304 [details] [diff] [review]
fix readLine case I missed on linux, use of StringBeginsWith for linux and os/2

Darin, should anyone else look at this? Besides folks already cc'ed?
Attachment #146304 - Flags: superreview?(mscott)
Attachment #146304 - Flags: review?(darin)
 nsFileInputStream::Seek(PRInt32 aWhence, PRInt64 aOffset)
 {
+    PR_FREEIF(mLineBuffer); // this invalidates the line buffer
+    mLineBuffer = nsnull;      

the setting to nsnull is not needed - PR_FREEIF will do that for you; see
http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/include/prmem.h#152
oh, of course - I'll fix that and the place I copied the code from.
Comment on attachment 146304 [details] [diff] [review]
fix readLine case I missed on linux, use of StringBeginsWith for linux and os/2

Note that this patch changes the behavior of nsDogbertProfileMigrator.cpp (it's
going to write out a lot more data now than it did before).  The change is
probably correct (the old code looks totally bogus to me), but Ben should
review it.

For the rest, I'm just reading the OS/2 and Unix helper app service changes.

>Index: uriloader/exthandler/os2/nsOSHelperAppService.cpp

>@@ -559,6 +559,7 @@
>                             descriptionStart, descriptionEnd;
>   
>   do {
>+  buf.AssignWithConversion(cBuf);

(note that using the -p option to cvs diff would make this a lot easier to
review).

Indent is off.	Also, I would much prefer |CopyASCIItoUTF16(cBuf, buf);| so
that it's clear that in fact absolutely no conversion is happening.  Same thing
at all other points where AssignWithConversion is used.

With that, the OS/2 and Unix changes helper app service changes are fine by me.
Attachment #146304 - Flags: superreview?(mscott)
Attachment #146304 - Flags: review?(darin)
Attachment #146374 - Flags: superreview?(mscott)
Attachment #146374 - Flags: review?(darin)
Comment on attachment 146374 [details] [diff] [review]
use CopyAsciiToUTF16, add a missing NS_LITERAL_CSTRING, remove unneeded null assignment

can nsReadCLine.h simply be cvs removed now?  (i.e., there doesn't seem to be
any reason to delete its contents... outright removal makes more sense, no?)


+  if (!utf8Buffer.Equals("#2c")) {
     NS_ERROR("Unexpected version header in signon file");

maybe use NS_LITERAL_CSTRING here.


r=darin
Attachment #146374 - Flags: review?(darin) → review+
yeah, cvs removal of nsReadCline.h was what I had in mind after getting this all
in. And I'll add the NS_LITERAL_CSTRING
Attachment #146374 - Flags: superreview?(mscott) → superreview+
Attachment #146556 - Flags: superreview?(mscott)
Attachment #146556 - Flags: superreview?(mscott) → superreview+
couple more directories
fixed.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
This checkin broke MIME type lookups on Linux.  See bug 242495
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: