Closed Bug 116399 Opened 23 years ago Closed 12 years ago

reduce roundtrips between 8 and 16 bit characters in mime text plain converter when possible

Categories
(MailNews Core :: MIME, defect)
Hardware: x86, Windows 2000
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Bienvenu, Assigned: Bienvenu)

Details

MimeInlineTextPlain_parse_line starts off with an 8 bit string line, converts it
to 16 bit chars, and calls ScanTxt, which, even when it does nothing, returns a
newly allocated unicode string, which we then convert back to an 8 bit string
and write out. Most of the time, none of these extra allocations are needed, and
our goal should be to avoid them when they aren't. I will need Naoki's help to
verify that these conversions aren't needed, and I will post diffs when I get
them.
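For illustration, here is a minimal sketch of the allocation pattern being
described. This is standard C++ with illustrative stand-in names, not the
actual tree code:

#include <string>

// Hypothetical stand-in for ScanTxt: it returns a *new* string even when it
// changed nothing, which is exactly the waste complained about above.
std::wstring ScanTxtSketch(const std::wstring& in) { return in; }

void ParseLineToday(const std::string& line8, std::string& out8) {
    std::wstring line16(line8.begin(), line8.end()); // conversion + allocation #1
    std::wstring scanned = ScanTxtSketch(line16);    // allocation #2, even for a no-op
    out8.assign(scanned.begin(), scanned.end());     // conversion + allocation #3
                                                     // (lossy narrowing in this sketch)
}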
One good thing about taking unicode is that those ASCII characters might
otherwise conflict with bytes of a non-unicode multibyte encoding. So,
!nsMsgI18Nstateful_charset(mailCharset) is there to prevent that from happening.
But the current input for ScanTxt was generated by AssignWithConversion and is
not really valid unicode. It would not need the
!nsMsgI18Nstateful_charset(mailCharset) check if the input were real unicode.
So, I think removing AssignWithConversion would not break non-ASCII cases (it
still needs the stateful charset check, though). In fact, it would be nicer if
the parsing could be postponed until after the body is converted to UTF-8;
that's much cleaner because you can check ":", ".", ";", ">" without worrying
about the multibyte problem.
For test data, please ask ji@netscape.com; she can provide i18n test cases.

Naoki and I discussed this, and here's what we came up with:

I'm going to make all the mozTXTToHTML routines take 8 bit characters and return
8 bit characters if ScanTxt finds something that needs new HTML. For the
stateful character set case, we'll convert the line to utf8 before passing the
line to mozTXTToHTML, and convert the utf8 result, if any, back to ascii.
Because there's no direct conversion to utf8 from the stateful charset case, it
will be a bit more work for that case, but we might make up the difference by
avoiding extra work in ScanTxt. Naoki will look at optimizing this case later.
Thanks to Naoki for the help.
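A hedged sketch of that plan for the stateful-charset path; ConvertToUTF8,
ConvertFromUTF8 and ScanTxt8 are hypothetical placeholders for the real
conversion and scanning calls, with stub bodies so the flow is self-contained:

#include <string>

std::string ConvertToUTF8(const char* cs, const std::string& s)   { return s; } // stub
std::string ConvertFromUTF8(const char* cs, const std::string& s) { return s; } // stub
std::string ScanTxt8(const std::string& s, bool* changed) { *changed = false; return s; } // stub

std::string HandleStatefulLine(const char* mailCharset, const std::string& line8) {
    bool changed = false;
    std::string utf8 = ConvertToUTF8(mailCharset, line8); // extra work, stateful case only
    std::string scanned = ScanTxt8(utf8, &changed);
    // Only pay for the conversion back when ScanTxt actually changed something.
    return changed ? ConvertFromUTF8(mailCharset, scanned) : line8;
}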
Status: NEW → ASSIGNED
Filed bug 116442 for the stateful charset case.
I put a breakpoint at MimeInlineTextPlain_parse_line() and tried an ISO-2022-JP
message, and found that the input line is already in UTF-8.
Now I understand the following lines. The input line is already in UTF-8 except
for SaveAs.
440 if (obj->options->format_out != nsMimeOutput::nsMimeMessageSaveAs ||
441 !mailCharset || !nsMsgI18Nstateful_charset(mailCharset))

This is good; the planned change for mozTXTToHTML works fine with input as UTF-8.


I've found another possible elimination of memory allocation in the mime text
plain flowed converter: Line_convert_whitespace() wouldn't usually need to do
anything if it didn't remove <CR>s (it leaves <LF>s untouched), which means it
wouldn't need to allocate more memory, except when tabs are present. Does
anyone know if removing <CR>s is really required? The plain text converter (not
the flowed one) doesn't remove <CR>s. I've tried commenting out that code in my
tree and can't see any difference in the way messages are displayed.
Daniel Bratell is the true master of that code.
(I was referring to the logic of the flowed converter.)

I now see:

bienvenu wrote:
> I'm going to make all the mozTXTToHTML routines take 8 bit characters and return
> 8 bit characters,

uh, oh.

I know the conversion is a bit messy, but IMO, that's because moztxttohtmlconv
is in the new unicode world, while libmime lives in the old 8 bit char world.

1. the mozTXTToHTML converter is a general one, not limited to Mailnews. Please
don't mess up its interface, just because libmime happens to use 8 bit wide
strings. At most, *add* some methods that take 8 bit strings (but don't change
the 16 bit ones).

2. What is your logic here? Can you describe exactly what you plan to change? Do
you want to pass the string byte-for-byte, then check in moztxttohtmlconv (via
some quick but reliable checks) whether we need to do any conversion at all, and
if not, just return with a non-fatal error code? Do you really think that's
worth it, considering that it will increase processing time for cases where we
do have to make a conversion?

3. Note that a conversion is needed for each occurrence of ">", "<" or "&" (HTML
escaping), not even counting the fancy recognizer stuff. In the plaintext
converter, this happens quite frequently (quotes).

4. I hope you aren't even thinking of converting the *whole* moztxttohtmlconv
class to 8 bit. (Your description is too vague to know.) The class inherently
assumes unicode and will break horribly in non-obvious ways if you run it over
something non-unicode. Using 8 bit strings will probably slow it down, because
it jumps around a lot on the strings.
OK, sounds like the thing to do is to leave mozTXTToHTML alone, let all the
client code that wants to convert 16 bit characters to HTML keep using it, and
write some code for libmime that works with 8 bit characters and doesn't do all
the extra allocations, conversions, string copying, etc. When libmime works on
raw unicode, it can go back to using mozTXTToHTML. That way, you don't have to
worry about me breaking any non-mailnews clients of mozTXTToHTML.
You want to get rid of mozTXTToHTML altogether? Are you kidding? With what do
you want to replace the code?
Seriously, it sounds like the wrong move to duplicate 1000+ lines of generic
code just to save two 8<->16 bit conversions per line.
No, not get rid of it at all - just leave it where it is and let the other
client code that currently uses it keep using it. And I'd be getting rid of a
lot more conversions and allocations and copies than that, and I can reduce the
code quite a bit as well.
How about this (concretizing nhotta's suggestion):

> In fact, it would be nicer if the parsing could be postponed until after the
> body is converted to UTF-8; that's much cleaner because you can check
> ":", ".", ";", ">" without worrying about the multibyte problem.

If these chars appear in the cleartext, are their byte values guaranteed to
appear in any encoding? And how likely are those byte values to appear in the
encoded form when the chars are not in the cleartext? Maybe we can just ignore
the multibyte encoding here and treat it as ASCII. If we find these chars, we do
the conversion to unicode and are safe (at most slower than necessary). We have
a problem if ">" etc. in some encoding is not represented by its ASCII code.

Assuming we can use this method:
In libmime, you could check whether mozTXTToHTMLConv will do anything at all by
checking whether any of ">.:;@&<" (in that order) occurs in the line (IIRC,
there's a string class method doing that). That should cover all cases. It
*will* increase processing time for lines which do contain any of these chars
(by the time the Find takes), but if it's implemented well, Find might report
success for quote lines already after the first char comparison. The rationale
is that the Find is probably faster than two conversions. You'll have to check
that.

As for implementation, you could define a string constant or MACRO in a C++
section of the mozTXTToHTMLConv idl, called MOZTXTTOHTMLCONV_SKIP_HACK or
similar, containing the "special chars" mentioned above, and just use the
constant/macro in libmime. That way, you'd keep knowledge of the implementation
details local to the class. It'll break in the theoretical case when the
recognizer gets fancy enough so that special chars are not viable anymore, but
at least the hack would be well-documented and all instances easy to find.
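A minimal sketch of both ideas together, assuming the standard find_first_of
stands in for whatever Find method the string classes offer; the macro name is
the one proposed above, everything else is hypothetical:

#include <string>

// The "special chars" set, ">" first so quote lines succeed on the first compare.
#define MOZTXTTOHTMLCONV_SKIP_HACK ">.:;@&<"

bool LineMightNeedConversion(const std::string& line) {
    return line.find_first_of(MOZTXTTOHTMLCONV_SKIP_HACK) != std::string::npos;
}
// Caller sketch: when this returns false, write the line out unchanged and
// skip the 8->16->8 roundtrip entirely.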


Are there any other suggestions? If not, why not just leave that code as-is? You
spotted a *possible* optimization, but I don't see how you could actually
optimize it any further without creating other, IMO worse, problems like
maintenance problems or lost features (don't you want reliable URL
recognition?). Note that several people have already optimized this particular
code snippet, and you are looking at the result. The chance that you'll find
further ground for improvement is low.
> just leave it where it is and let the other client
> code that currently uses it keep using it.

I meant, get rid of it in Mailnews.

> And I'd be getting rid of a lot more conversions and allocations and
> copies than that, and I can reduce the code quite a bit as well.

You are welcome to improve the code quality of the mozTXTToHTMLConv class.
However, I honestly don't know how you intend to keep the same quality of result
while operating on 8 bit strings *and* reducing code size.


(I assume that "multibyte string" here means that a char can be represented by a
variable amount of bytes.)
E.g. if you operate on raw multibyte string, how do you know that the ":" you
just found is not part of the encoding?
If you operate with multibyte-aware functions (which help you go
forward/backward one char or skip to a certain char), you will have a lot of
overhead, because searching (which is what the task of the converter is) in
multibyte strings is much slower than searching in fixed-width strings, because
all of the string (before the position) has to be decoded to skip to a certain
char position.
(These were some of the reasons why mozTXTToHTMLConv is "inherently 16 bit
unicode".)
Ben, I'm confused by your comments. I believe I've made it clear that the plan
was to convert unicode to utf8 before scanning the text. We couldn't do that if
utf8 didn't maintain 7 bit ascii transparency, i.e. the guarantee that the bytes
of encoded multibyte characters never appear as 7 bit ascii characters. So your
comments about multi-byte aware functions don't make sense.

Also, I considered doing a Find for ">.:;@&<", but rejected it immediately,
since '.' is used to end a sentence and is quite common.
I've already implemented all this, and it works on all of your test cases. The
plan is simply to add a parameter which says whether ScanTxt actually made a
change or not. The helper routine, also called ScanTxt, maintains this flag and
returns it to the caller, so the caller knows whether it should pay attention to
the out parameter. If not, it can just write out the input string, which saves
at least three allocations, two conversions, and three string copies. If ScanTxt
hasn't discovered any changes, it doesn't bother building up the output string,
which also saves allocations. Maintaining this flag does add some code, but it
saves a lot of memory allocations.
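A minimal sketch of that "modified" flag mechanism, with a single stand-in
recognizer ('<' -> "&lt;") in place of the real ones; all names are
hypothetical, not the actual patch:

#include <string>

bool ScanTxtWithFlag(const std::string& in, std::string& out) {
    bool modified = false;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '<') {               // stand-in for the real recognizers
            if (!modified) {
                out.assign(in, 0, i);     // lazily copy the unchanged prefix
                modified = true;
            }
            out += "&lt;";
        } else if (modified) {
            out += in[i];
        }
    }
    return modified; // false: caller writes `in` straight through, no new string
}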
> the plan was to convert unicode to utf8 before scanning the text.

I guess I got confused, yes. Did you actually do that? If so, why? I fail to see
how that would help.

> So your comments about multi-byte aware functions don't make sense.

Dunno if you want to do that, but it still holds that mozTXTToHTMLConv
operating on utf8 is *most likely* slower than operating on 16 bit unicode.

> I've already implemented all this, and it works on all of your test cases.

Can you attach the patch, please?

> The plan is simply to add a parameter which says whether ScanTxt actually
> made a change or not.

Sounds good. I thought that you also wanted to avoid the conversion before the
ScanTXT call.
>I guess I got confused, yes. Did you actually do that? If so, why? I fail to see
>how that would help.
As I explained before, converting real unicode (as opposed to the fake, 8-bit
ascii unicode mime passes in to mozTXTToHTML) to utf-8 means we can treat the
string as ascii as far as the 7 bit ascii characters like ".?><" are concerned,
so we don't have to use multi-byte aware character iteration functions.
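The property being relied on here is easy to check: in UTF-8, every byte of a
multibyte sequence has its high bit set (0xC2-0xF4 lead bytes, 0x80-0xBF
continuation bytes), so a byte that compares equal to an ASCII delimiter can
only be a real ASCII character. A small self-contained check, for illustration:

#include <cassert>
#include <string>

int main() {
    std::string ae = "\xC3\xA6"; // U+00E6 (ae ligature) encoded in UTF-8
    for (unsigned char b : ae)
        assert(b >= 0x80);       // no byte can collide with ".?><" etc.
    return 0;
}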
>Dunno if you want to do that, but it still holds that mozTXTToHTMLConv
>operating on utf8 is *most likely* slower than operating on 16 bit unicode.
Not sure what you mean here. Since the vast majority of the callers start with
ascii strings (and not unicode), the vast majority of the time mozTXTToHTMLConv
would be operating on ascii and not utf-8. I assume you're not claiming that
operating on 16 bit characters is faster than operating on 8 bit characters. It
is true that converting non-ascii unicode strings to utf-8 and scanning the
utf-8 strings would be slower than just scanning the unicode, but that case is
the exception, not the rule. As near as I can tell, the only time we have actual
unicode data to examine is when we do the url detection at send time. In all the
other cases (reading a message), we have 8 bit ascii characters. Reading
messages is much more common than sending messages, I would argue, and should be
the case we optimize for. In general, you want to optimize for the common case.

>Sounds good. I thought that you also wanted to avoid the conversion before the
>ScanTXT call.
I am avoiding the conversion before the ScanTxt call, because ScanTxt now takes
8 bit strings, which is what libmime starts with. In the normal case, where
there are no glyphs, smileys, structured phrases, URLs, etc., no extra memory
gets allocated, converted, or copied.

I'll attach a patch after I clean up some more of the smiley handling code.

> converting real unicode (as opposed to the fake, 8-bit
> ascii unicode mime passes in to mozTXTToHTML) to utf-8

?
mozTXTToHTMLConv currently takes and operates on 16 bit unicode, not the "fake,
8-bit ascii unicode" libmime uses.

> utf-8 means we can treat the
> string as ascii as far as the 7 bit ascii characters like ".?><" are
> concerned, so we don't have to use multi-byte aware character iteration
> functions.

I am not so sure about that. The class assumes that it iterates over characters,
not bytes. I don't remember a concrete case anymore, but it is possible that
this class fails in (i18n) edge cases if you let it interpret utf8 as ascii.
That's why I oppose a change of the string encoding of the internal converter
functions.

> clean up some more of the smiley handling code.

Good. Much needed :).
If we can't change the string encoding of the internal converter functions, then
we need to have two versions of every routine, since every routine currently
assumes 16 bit unicode data. If we need to have two versions of every routine,
then the ascii version might as well live in mailnews/mime, since it would be
the only client of the ascii versions, and there's no reason to bloat the core
networking code with code that only mime uses.

But I believe we can use utf8, since the converter functions only look for ascii
characters, and let all other characters pass through. I haven't seen any code
that cares if there are two non-special characters (e.g., not "<>.|*CRLF")
instead of one in any particular place, which is the only thing that would
matter between utf8 and unicode, since that's the only difference between
iterating over bytes vs iterating over characters. In other words, if there's
code that treats *AB* differently from *A* because AB is two characters instead
of one, then you're right. But I don't see any such code, nor can I imagine the
need for any such code.

Ben, please note again nhotta's comment that in the case of ISO-2022-JP, the
input to MimeInlineTextPlain_parse_line was already in utf-8, which we then
converted to fake unicode by adding a 0 in front of every byte, and then
analyzed in ScanTxt as if it were real unicode. A little thought should convince
you that this is equivalent to not converting it to fake unicode and just
looking at every byte.
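To make that equivalence concrete, a sketch of the "fake unicode" widening
being described (illustrative standard C++, not the tree code):

#include <string>

// Zero-extend each UTF-8 *byte* to one 16-bit unit - the "0 in front of every
// byte" step. Comparing fake[i] against '>' etc. is then exactly the same test
// as comparing the raw byte utf8[i] against '>'.
std::u16string FakeUnicode(const std::string& utf8) {
    std::u16string fake;
    for (unsigned char b : utf8)
        fake.push_back(static_cast<char16_t>(b));
    return fake;
}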
bienvenu, first let me say that I respect you. I don't like to disagree with
you, which is why I am answering so late.

> I haven't seen any code that cares if there are two non-special characters
> (e.g., not "<>.|*CRLF") instead of one in any particular place

Such code might not exist yet, but it is possible that it will appear later.
For example, if we convert æ to &aelig; or similar.

> In other words, if there's code that treats *AB* differently from *A*
> because AB is two characters instead of one, then you're right. But I
> don't see any such code, nor can I imagine the need for any such code.

My main concern is that this class deals a lot with indices into the string.
Such indices are, by convention, indices to chars, not bytes. This makes sense,
because the class operates on chars, not bytes. Now, if you change that meaning,
you considerably complicate the class. Maybe there is no difference in the code
today, but I would have to check that each time I add a new feature.

What happens if, in the future, the difference does in fact matter? I will have
to either work around it with a lot of slow and ugly code or revert your change,
which again takes a lot of time.

Let's take another example: bug 116242. Here, the main task is to report back
two indices into the string that the (method-)client provided. We don't know
what the client (e.g. the spellchecker) does with the result, so we cannot be
sure whether the difference of one or two chars is important. I'd have to either
- bother the client with our implementation details and tell it about
  the problem and to take care of it (very bad), or
- write code that translates not only the 16 bit string to utf8 and back,
  but also the indices. I don't know if there are functions to convert
  the indices from bytes to chars. Worst case, I'd have to convert each part
  of the string (before the first index, between the indices, and after the
  second index) independently.
Talk about ugly...
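For illustration, a sketch of the index translation being dreaded here
(assuming well-formed UTF-8): mapping a byte offset back to a character offset
requires scanning everything before it, which is exactly the overhead being
objected to.

#include <cstddef>
#include <string>

size_t Utf8ByteToCharIndex(const std::string& s, size_t byteIdx) {
    size_t chars = 0;
    for (size_t i = 0; i < byteIdx && i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) // skip continuation bytes
            ++chars; // count character-start bytes only
    return chars;
}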



I like your suggestion of the "modified" flag.
You are welcome to carry that flag around inside of mozTXTToHTMLConv (but please
in as few places as possible, of course - changing around 5 function signatures
would probably suffice).
You can add new IDL method(s) specifically for libmime, where you report back
that flag. If you want, you can use any string encoding in those method
signatures as well, as long as you don't change the encoding in the internal
functions, so that the class can still work as it does today and provide the
same 16 bit interfaces.

I believe that this takes care of most of your original complaint that we do
conversions when we don't change anything.

> If we can't change the string encoding of the internal converter functions,
> then we need to have two versions of every routine

The converter was originally written for use by libmime, so I don't like the
fact that you threaten me with not using mozTXTToHTMLConv anymore.

But back to the rationale: why would you need to do any of that? Those 1-2 saved
conversions* IMO aren't worth enough to justify
- possible correctness bugs,
- substantially more complicated and slower code in other places, or
- code redundancy (I don't think that this is really a serious option).

(*Do we actually save the conversions at all times, or do we still have to
convert between ascii and utf8? Isn't that costly, too? I didn't completely
understand your comment #2.)

You are creating, at least potentially, significant problems by changing
mozTXTToHTMLConv to UTF8 or duplicating that code in libmime. I really think
that we should just forget about these 1-2 conversions until we rewrite libmime
completely in 16 bit unicode.

> in the case of ISO-2022-JP, the
> input to MimeInlineTextPlain_parse_line was already in utf-8, which we then
> converted to fake unicode by adding a 0 in front of every byte, and then
> analyzed in ScanTxt as if it were real unicode.

I was not aware of that. Or are you speaking about the else block currently
starting at line 448, with the comment "// If nsMimeMessageSaveAs"? If so, this
is just for the SaveAs feature - it is not used during normal mail processing.
Anyway, that we have gotten away with it so far doesn't mean that it's correct.

Summary: Changing the string representation inside of mozTXTToHTMLConv to UTF8
is one of the worst things I can imagine being done to the class. As the main
author of mozTXTToHTMLConv, mimetpla.cpp and the relevant parts of mimetpfl.cpp,
i.e. all the code in question, I ask you to please leave the internal string
representation of mozTXTToHTMLConv in 16 bit unicode and nevertheless keep using
that class for converting mails.

Sincerely,
Ben
Ben, thanks for your comments. Let me try again to explain how
mozTXTToHTMLConv.cpp is already acting on utf8 today in the case of, for
example, ISO-2022-JP. I'm specifically NOT talking about the SaveAs branch of
the code, but about the other branch that we execute whenever we display a
message. The input to MimeInlineTextPlain_parse_line was already in utf-8; we
assigned it to a unicode string, as I said, by putting 0x0 in front of each
byte, and passed it to ScanTxt, which treats each utf8 byte (not character,
byte) as a single unicode character. Granted, that may be broken sometime in the
future by some sort of hypothetical example. However, the example you give would
explicitly NOT be broken, because again, as I said before, you would need to
want to distinguish between two NON-SPECIAL characters vs. one NON-SPECIAL
character; the &aelig; case would involve two special characters.

Bug 116242 would be a nice feature to have. However, notice that it doesn't make
a copy of the URL or the input string, because that would be highly undesirable
behaviour.

Mail messages are not going to be sent around in unicode anytime soon, so
rewriting libmime to be native unicode will not be interesting anytime soon.

I know you think all these copies, conversions, and allocations are no big deal,
but consider the case of a 1000-line plain text message. You're going to do at
least 3000 extra allocations, 2000 conversions between 8 bits and unicode, and
1000 string copies (each one byte at a time). That kind of thing adds up. That's
why I care about this and don't think the current situation is desirable.
Last time I debugged, the input string for MimeInlineTextPlain_parse_line() was
already in UTF-8. This means the charset conversion from the mail charset to
UTF-8 is already applied before reaching this function.
The conversion has two parts: mail charset to UCS2, then UCS2 to UTF-8. It might
be possible to call ScanTxt while we have the UCS2 string.
Here is a link to the conversion function:
http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimemoz2.cpp#690
After line 720, the string is in UCS2.
I will take a look to see whether my assumption is really true (I mean, whether
the charset conversion takes place before MimeInlineTextPlain_parse_line).
Currently I have a build problem, so it may take some time.

bienvenu wrote:
> However, the example you give would explicitly NOT be broken, because again,
> as I said before, you would need to want to distinguish between two
> NON-SPECIAL characters vs. one NON-SPECIAL character; the &aelig; case
> would involve two special characters.

Yes, that's the problem!? æ and practically* all other such entities are really
only one char.
If they can occupy 1, 2 or 3 "positions" in our string, we will have to carry
around the byte-length of the replaced char, so that we know where to insert the
replacement, where to restart the recognition, etc.
Also, the comparison will be a string comparison and no longer an int
(|PRUnichar|) comparison, and the æ etc. we compare against will be strings in
memory, not ints.

*The few exceptions, like +/-, are/should be coded in the smiley recognition
part and need special treatment anyway.

> Bug 116242 would be a nice feature to have. However, notice that it doesn't
> make a copy of the URL or the input string, because that would be highly
> undesirable behaviour.

I don't understand. Currently, it doesn't do anything at all, because it's not
yet implemented (but we probably need it for the open-source spellchecker to
work well).

> Mail messages are not going to be sent around in unicode anytime soon, so
> rewriting libmime to be native unicode will not be interesting anytime soon.

Why? We'd convert the encoding/charset at one central place at the beginning to
unicode and never again worry about such stuff. libmime, the converter, and
layout could all use the 16 bit unicode representation.
As nhotta mentioned, we seem to convert to unicode already anyway. We use other
encodings because they are a hard-to-remove leftover.

> but consider the case of a 1000-line plain text message

Yes, but the vast majority of plaintext messages isn't even close to that size.

> at least 3000 extra allocations, 2000 conversions between 8 bits and unicode,
> and 1000 string copies (each one byte at a time)

Do you have some numbers for how long that takes on a normal machine? I'd guess
it's in the << 100 ms area. ("<<" = "well below")

I know you care about performance, esp. this milestone ;-P, but this
optimisation comes at a cost - too high a cost, IMO.
nhotta wrote:
> It might be possible to call ScanTxt while we have the UCS2 string.

Interesting. <http://www.mozilla.org/mailnews/arch/libmime-description.html>
says that the class containing that conversion function is a superclass of our
plaintext converters. So, it might be possible to override that function and
postpone the conversion, "or something like that" (TM) :-). Note, however, that
quoting makes the matter more complicated than just running a single ScanTXT
pass over everything.

But we can't hack ScanTXT right into mimemoz2.cpp, because that code is also
used by TextHTML et al. (not sure if you suggested that or not; just in case).
I used today's pull and verified that the charset conversion is done before the
parse-line function. In fact, the parsing starts right after the charset
conversion.
http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetext.cpp#347
Line 398 does the charset conversion, then line 420 calls the parser.
In the long term, we want to send 16 bit data to layout, but a more local change
might be possible by delaying the UCS2 -> UTF-8 conversion: change the charset
conversion function to return UCS2, feed UCS2 to the parser, then convert UCS2
to UTF-8. But I am not sure all the parse functions are 16 bit unicode ready
(there are two for plain text and one for HTML). Also, the same function seems
to be used for saving the message, where the input string is in the mail charset
as a C string, so it still needs char* input instead of PRUnichar* for that
case.
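A hedged sketch of that "delay the conversion" idea; every name is a
hypothetical stand-in with a stub body, just to show the data flow (decode the
mail charset once, scan in 16 bit, emit UTF-8 only at the end):

#include <string>

std::u16string DecodeMailCharset(const std::string& raw) { return std::u16string(raw.begin(), raw.end()); } // stub
std::u16string ScanTxt16(const std::u16string& s)        { return s; }                                      // stub
std::string    EncodeUTF8(const std::u16string& s)       { return std::string(s.begin(), s.end()); }        // stub (lossy)

std::string ParseLineDelayed(const std::string& raw) {
    std::u16string u16 = DecodeMailCharset(raw); // one decode, up front
    std::u16string scanned = ScanTxt16(u16);     // the parser runs on 16 bit data
    return EncodeUTF8(scanned);                  // UTF-8 produced only at the end
}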
 
UCS2 is a myth. What the term is actually being used to refer to is UTF-16,
where each character is encoded using either one or two 16-bit words.
Assumptions about 1-to-1 correspondence between characters and encoding
positions will break when Mozilla has to deal with characters outside the BMP.
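An illustration of that point: a character outside the BMP, e.g. U+1D11E
(musical symbol G clef), takes two 16-bit code units (a surrogate pair), so
"one position per character" no longer holds.

#include <cassert>

int main() {
    // UTF-16 encoding of U+1D11E: high surrogate 0xD834, low surrogate 0xDD1E.
    char16_t gclef[] = { 0xD834, 0xDD1E };
    assert(sizeof(gclef) / sizeof(gclef[0]) == 2); // one character, two positions
    return 0;
}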
> Assumptions about 1-to-1
> correspondence between characters and encoding positions will break when Mozilla
> has to deal with characters outside the BMP.

At which point we will probably switch to UCS4, the 32-bit-wide encoding. Other
projects have done this already.
bienvenu, what's the state of this bug? How do you intend to proceed?
In case you decide not to switch to 8 bit, can I help merge your other changes
into the current converter class?
Can you attach your changes regardless? I might have to work on the class soon,
and I want to avoid conflicts for you.
I hope to have more time to work on this, but other things have higher priority
now - I've pretty much been successfully blocked from fixing this,
unfortunately, and arguing about it just isn't a good use of my time at the
moment.
Product: MailNews → Core
QA Contact: esther → mime
Product: Core → MailNews Core
mozTXTToHTML isn't used anymore, so I think this is wontfix.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX