Closed
Bug 116399
Opened 23 years ago
Closed 12 years ago
reduce roundtrips between 8 and 16 bit characters in mime text plain converter when possible
Categories
(MailNews Core :: MIME, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: Bienvenu, Assigned: Bienvenu)
Details
MimeInlineTextPlain_parse_line starts off with an 8 bit string line, converts it to 16 bit chars, and calls ScanTxt, which, even when it does nothing, returns a newly allocated unicode string, which we then convert back to an 8 bit string and write out. Most of the time, none of these extra allocations are needed, and it should be our goal to avoid them. I will need Naoki's help to verify that these conversions aren't needed, and I will post diffs when I get them.
Comment 1•23 years ago
One good thing about taking unicode is that those ASCII characters might otherwise conflict with bytes of non-unicode multibyte characters; the !nsMsgI18Nstateful_charset(mailCharset) check is there to prevent that. But the current input for ScanTxt was generated by AssignWithConversion and is not really valid unicode. The !nsMsgI18Nstateful_charset(mailCharset) check would not be needed if the input were real unicode. So I think removing AssignWithConversion would not break non-ASCII cases (the stateful charset check is still needed, though). In fact, it would be nicer if the parsing could be postponed until after the body is converted to UTF-8; that's much cleaner because you can check ":", ".", ";", ">" without worrying about the multibyte problem. For test data, please ask ji@netscape.com; she can provide i18n test cases.
Assignee
Comment 2•23 years ago
Naoki and I discussed this, and here's what we came up with: I'm going to make all the mozTXTToHTML routines take 8 bit characters and return 8 bit characters if ScanTxt finds something that needs new html. For the stateful character set case, we'll convert the line to utf8 before passing it to mozTXTToHTML, and convert the utf8 result, if any, back to ascii. Because there's no direct conversion to utf8 for the stateful charset case, it will be a bit more work there, but we might make up the difference by avoiding extra work in ScanTxt. Naoki will look at optimizing this case later. Thanks to Naoki for the help.
Status: NEW → ASSIGNED
Comment 3•23 years ago
Filed bug 116442 for the stateful charset case.
Comment 4•23 years ago
I put a breakpoint at MimeInlineTextPlain_parse_line() and tried an ISO-2022-JP message, and found that the input line is already in UTF-8. Now I understand the following lines; the input line is already in UTF-8 except for SaveAs:

440   if (obj->options->format_out != nsMimeOutput::nsMimeMessageSaveAs ||
441       !mailCharset || !nsMsgI18Nstateful_charset(mailCharset))

This is good; the planned change for mozTXTToHTML works fine with input as UTF-8.
Assignee
Comment 5•23 years ago
I've found another possible elimination of memory allocation in the mime text plain flowed converter: Line_convert_whitespace() usually wouldn't need to do anything if it didn't remove <cr>'s (it leaves LF's untouched), which means it wouldn't need to allocate more memory except when tabs are present. Does anyone know if removing <cr>'s is really required? The plain text converter (not the flowed one) doesn't remove <cr>'s. I've tried commenting out that code in my tree and can't see any difference in the way messages are displayed.
Comment 6•23 years ago
Daniel Bratell is the true master of that code.
Comment 7•23 years ago
(I was referring to the logic of the flowed converter.)
I now see:
bienvenu wrote:
> I'm going to make all the mozTXTToHTML routines take 8 bit characters and return
> 8 bit characters,
uh, oh.
I know the conversion is a bit messy, but IMO, that's because moztxttohtmlconv
is in the new unicode world, while libmime lives in the old 8 bit char world.
1. the mozTXTToHTML converter is a general one, not limited to Mailnews. Please
don't mess up its interface, just because libmime happens to use 8 bit wide
strings. At most, *add* some methods that take 8 bit strings (but don't change
the 16 bit ones).
2. What is your logic here? Can you describe exactly what you plan to change? Do
you want to pass the string byte-for-byte, then check in moztxttohtmlconv (via
some quick but reliable checks) whether we need to do any conversion at all, and
if not just return with a non-fatal error code? Do you really think that's worth
it, considering that it will increase processing time for cases where we do have
to make a conversion?
3. Note that a conversion is needed for each occurrence of ">", "<" or "&" (HTML
escaping), not even counting the fancy recognizer stuff. In the plaintext
converter, this happens quite frequently (quoted lines).
4. I hope you don't even think of converting the *whole* moztxttohtmlconv class
to 8 bit. (Your description is too vague to tell.) The class inherently assumes
unicode and will break horribly in non-obvious ways if you run it over
something non-unicode. Using 8 bit strings would probably also slow it down,
because it jumps around a lot on the strings.
Assignee
Comment 8•23 years ago
OK, it sounds like the thing to do is to leave MozTXTToHTML alone, let all the client code that wants to convert 16 bit characters to HTML keep using it, and write some code for libmime that works with 8 bit characters and doesn't do all the extra allocations, conversions, string copying, etc. When libmime works on raw unicode, it can go back to using MozTXTToHTML. That way, you don't have to worry about me breaking any non-mailnews clients of MozTXTToHTML.
Comment 9•23 years ago
You want to get rid of mozTXTToHTML altogether? Are you kidding? With what do you want to replace the code?
Comment 10•23 years ago
Seriously, it sounds like the wrong move to duplicate 1000+ lines of generic code just to save 2 8<->16 bit conversions per line.
Assignee
Comment 11•23 years ago
No, not get rid of it at all: just leave it where it is and let the other client code that currently uses it keep using it. And I'd be getting rid of a lot more conversions, allocations and copies than that, and I can reduce the code quite a bit as well.
Comment 12•23 years ago
How about this (concretizing nhotta's suggestion):
> In fact, it would be nicer if the parsing could be postponed after the body
> converted to UTF-8, that's much cleaner because you can check
> ":", ".", ";", ">" without worrying about the multibyte problem.
If these chars appear in cleartext, are they sure to appear in any encoding? How
likely are these chars to appear in the encoding when they are not in the
cleartext? Maybe we can just ignore the multibyte encoding here and treat it as
ASCII. If we find these chars, we do the conversion to unicode and are safe (at
worst slower than necessary). We have a problem if ">" etc. in some encoding is
not represented by its ASCII code.
Assuming we can use this method:
In libmime, you could check whether mozTXTToHTMLConv will do anything at all by
checking whether any of ">.:;@&<" (in that order) occurs in the line (IIRC,
there's a string class method doing that). That should cover all cases. It *will*
increase processing time for lines which do contain any of these chars (by the
time the Find takes), but if it's implemented well, Find might report success
for quote lines already after the first char comparison. The rationale is that
the Find is probably faster than 2 conversions. You'll have to check that.
As for implementation, you could define a string constant or macro in a C++
section of the mozTXTToHTMLConv idl, called MOZTXTTOHTMLCONV_SKIP_HACK or
similar, containing the "special chars" mentioned above, and just use the
constant/macro in libmime. That way, you'd keep knowledge of the implementation
details local to the class. It'll break in the theoretical case where the
recognizer gets fancy enough that special chars are no longer viable, but at
least the hack would be well-documented and all instances easy to find.
Are there any other suggestions? If not, why not just leave that code as-is? You
spotted a *possible* optimization, but I don't see how you could actually
optimize it any further without creating other, IMO worse, problems like
maintenance burdens or lost features (don't you want reliable URL
recognition?). Note that several people have already optimized this particular
code snippet and you are looking at the result. The chance that you'll find
further room for improvement is low.
Comment 13•23 years ago
> just leave it where it is and let other client code that currently uses it
> keep using it.

I meant, get rid of it in Mailnews.

> And I'd be getting rid of a lot more conversions and allocations and
> copies than that, and I can reduce the code quite a bit as well.

You are welcome to improve the code quality of the mozTXTToHTMLConv class. However, I honestly don't know how you want to keep the same quality of result while operating on 8 bit strings *and* reducing code size. (I assume that "multibyte string" here means that a char can be represented by a variable number of bytes.) E.g. if you operate on a raw multibyte string, how do you know that the ":" you just found is not part of the encoding? If you operate with multibyte-aware functions (which help you go forward/backward one char or skip to a certain char), you will have a lot of overhead, because searching (which is the task of the converter) in multibyte strings is much slower than searching in fixed-width strings: all of the string before the position has to be decoded to skip to a certain char position. (These were some of the reasons why mozTXTToHTMLConv is "inherently 16 bit unicode".)
Assignee
Comment 14•23 years ago
Ben, I'm confused by your comments. I believe I've made it clear that the plan was to convert unicode to utf8 before scanning the text. We couldn't do that if utf8 didn't maintain 7 bit ascii transparency, which guarantees that encoded characters will never appear as 7 bit ascii characters. So your comments about multi-byte aware functions don't apply.

Also, I considered doing a Find for ">.:;@&<", but rejected it immediately since '.' ends a sentence and is quite commonly used.

I've already implemented all this, and it works on all of your test cases. The plan is simply to add a parameter that says whether ScanTxt actually made a change or not. The helper routine, also called ScanTxt, maintains this flag and returns it to the caller, so the caller knows whether to pay attention to the out parameter. If not, it can just write out the input string, which saves at least three allocations, two conversions, and three string copies. If ScanTxt hasn't discovered any changes, it doesn't bother building up the output string, which also saves allocations. Maintaining this flag does add some code, but it saves a lot of memory allocations.
Comment 15•23 years ago
> the plan was to convert unicode to utf8 before scanning the text.

I guess I got confused, yes. Did you actually do that? If so, why? I fail to see how that would help.

> So your comments about multi-byte aware functions don't make sense.

Dunno if you want to do that, but it still holds that mozTXTToHTMLConv operating on utf8 is *most likely* slower than operating on 16 bit unicode.

> I've already implemented all this, and it works on all of your test cases.

Can you attach the patch, please?

> The plan is simply to add a parameter which says whether ScanTxt actually
> made a change or not.

Sounds good. I thought that you also wanted to avoid the conversion before the ScanTXT call.
Assignee
Comment 16•23 years ago
> I guess I got confused, yes. Did you actually do that? If so, why? I fail to
> see how that would help.

As I explained before, converting real unicode (as opposed to the fake, 8-bit ascii unicode mime passes in to mozTXTToHTML) to utf-8 means we can treat the string as ascii as far as the 7 bit ascii characters like ".?><" are concerned, so we don't have to use multi-byte aware character iteration functions.

> Dunno, if you want to do that, but it still holds that mozTXTToHTMLConv
> operating on utf8 is *most likely* slower than if operating on 16 bit unicode.

Not sure what you mean here. Since the vast majority of the callers start with ascii strings (and not unicode), the vast majority of the time mozTXTToHTMLConv would be operating on ascii and not utf-8. I assume you're not claiming that operating on 16 bit characters is faster than operating on 8 bit characters. It is true that converting non-ascii unicode strings to utf-8 and scanning the utf-8 strings would be slower than just scanning the unicode, but that case is the exception, not the rule. As near as I can tell, the only time we have actual unicode data to examine is when we do the url detection at send time. In all the other cases (reading a message), we have 8 bit ascii characters. Reading messages is much more common than sending messages, I would argue, and should be the case we optimize for. In general, you want to optimize for the common case.

> Sounds good. I thought that you also wanted to avoid the conversion before
> the ScanTXT call.

I am avoiding the conversion before the ScanTxt call, because it now takes 8 bit strings, which is what libmime starts with. In the normal case, where there are no glyphs, smilies, structured phrases, urls, etc., no extra memory gets allocated, converted, or copied. I'll attach a patch after I clean up some more of the smiley handling code.
Comment 17•23 years ago
> converting real unicode (as opposed to the fake, 8-bit
> ascii unicode mime passes in to mozTXTToHTML) to utf-8

? mozTXTToHTMLConv currently takes and operates on 16 bit unicode, not the "fake, 8-bit ascii unicode" libmime uses.

> utf-8 means we can treat
> the string as ascii as far as the 7 bit ascii characters like ".?><" are
> concerned, so we don't have to use multi-byte aware character iteration
> functions.

I am not so sure about that. The class assumes that it iterates over characters, not bytes. I don't remember a concrete case anymore, but it is possible that this class fails in (i18n) edge cases if you let it interpret utf8 as ascii. That's why I oppose a change of the string encoding of the internal converter functions.

> clean up some more of the smiley handling code.

Good. Much needed :).
Assignee
Comment 18•23 years ago
If we can't change the string encoding of the internal converter functions, then we need two versions of every routine, since every routine currently assumes 16 bit unicode data. If we need two versions of every routine, then the ascii version might as well live in mailnews/mime, since it would be the only client of the ascii versions, and there's no reason to bloat the core networking code with code that only mime uses.

But I believe we can use utf8, since the converter functions only look for ascii characters and let all other characters pass through. I haven't seen any code that cares whether there are two non-special characters (e.g., not "<>.|*CRLF") instead of one in any particular place, which is the only thing that would matter between utf8 and unicode, since that's the only difference between iterating over bytes and iterating over characters. In other words, if there's code that treats *AB* differently from *A* because AB is two characters instead of one, then you're right. But I don't see any such code, nor can I imagine the need for any.
Assignee
Comment 19•23 years ago
Ben, please note again nhotta's comment that in the case of ISO-2022-JP, the input to MimeInlineTextPlain_parse_line was already in utf-8, which we then converted to fake unicode by adding a 0 in front of every byte, and then analyzed in ScanTxt as if it were real unicode. A little thought should convince you that this is equivalent to not converting it to fake unicode and looking at every byte.
Comment 20•23 years ago
bienvenu, first let me say that I respect you. I don't like to disagree with you, which is why I answer so late.

> I haven't seen any code that cares if there are two non-special characters
> (e.g., not "<>.|*CRLF") instead of one in any particular place

Such code might not exist yet, but it is possible that it will appear later. For example, if we convert &aelig; to æ or similar.

> In other words, if there's code that treats *AB* differently from *A*
> because AB is two characters instead of one, then you're right. But I
> don't see any such code, nor can I imagine the need for any such code.

My main concern is that this class deals a lot with indices into the string. Such indices are, by convention, indices to chars, not bytes. This makes sense because the class operates on chars, not bytes. Now, if you change that meaning, you considerably complicate the class. Maybe there is no difference in the code today, but I will have to check that each time I add a new feature. What happens if in the future the difference does matter? I will have to either work around it with a lot of slow and ugly code or revert your change, which again takes a lot of time.

Let's take another example: bug 116242. Here, the main task is to report back 2 indices into the string that the (method-)client provided. We don't know what the client (e.g. the spellchecker) does with the result, so we cannot be sure whether the difference of 1 or 2 chars is important. I'd have to either
- bother the client with our implementation details and tell it about the problem and to take care of it (very bad) or
- write code that translates not only the 16 bit string to utf8 and back, but also the indices. I don't know if there are functions to convert the indices from bytes to chars. Worst case, I'd have to convert each part of the string (before the first index, between the indices and after the second index) independently. Talk about ugly...

I like your suggestion of the "modified" flag. You are welcome to carry that flag around inside of mozTXTToHTMLConv (but please in as few places as possible, of course; changing around 5 function signatures would probably suffice). You can add new IDL method(s) specifically for libmime, where you report back that flag. If you want, you can use any string encoding in those method signatures as well, as long as you don't change the encoding in the internal functions, so that the class can still work as today and provide the same 16bit interfaces. I believe that this takes care of most of your original complaint that we do conversions when we don't change anything.

> If we can't change the string encoding of the internal converter functions,
> then we need to have two versions of every routine

The converter was originally written for use by libmime, so I don't like the fact that you threaten me with not using mozTXTToHTMLConv anymore. But back to rationale: why would you need to do any of that? Those 1-2 saved conversions* IMO aren't so bad that they would justify either
- possible correctness bugs
- substantially more complicated and slower code at other places
- code redundancy (I don't think that this is really a considerable option)

(*Do we actually save the conversions at all times, or do we still have to convert between ascii and utf8? Isn't that costly, too? I didn't understand your comment #2 completely.)

You are creating, at least potentially, significant problems by changing mozTXTToHTMLConv to UTF8 or duplicating that code in libmime. I really think that we should just forget about these 1-2 conversions until we rewrite libmime completely in 16bit unicode.

> in the case of ISO-2022-JP, the
> input to MimeInlineTextPlain_parse_line was already in utf-8, which we
> then converted to fake unicode by adding a 0 in front of every byte, and
> then analyzed in ScanTxt as if it was real unicode.

I was not aware of that. Or are you speaking about the else block currently starting at line 448, with the comment "// If nsMimeMessageSaveAs"? If so, this is just for the SaveAs feature; it is not used during normal mail processing. Anyway, that we got away with it so far doesn't mean that it's correct.

Summary: Changing the string representation inside of mozTXTToHTMLConv to UTF8 is one of the worst things I can imagine being done to the class. As the main author of mozTXTToHTMLConv, mimetpla.cpp and the relevant parts of mimetpfl.cpp, i.e. all the code in question, I ask you to please leave the internal string representation in mozTXTToHTMLConv in 16 bit unicode and nevertheless keep using that class for converting mails.

Sincerely, Ben
Assignee
Comment 21•23 years ago
Ben, thanks for your comments. Let me try again to explain how mozTXTToHTMLConv.cpp is already acting on utf8 today in the case of, for example, ISO-2022-JP. I'm specifically NOT talking about the SaveAs branch of the code, but the other branch, the one we execute whenever we display a message. The input to MimeInlineTextPlain_parse_line was already in utf-8; we assigned it to a unicode string, as I said, by putting 0x0 in front of each byte, and passed it to ScanTxt, which treats each utf8 byte (not character, byte) as a single unicode character.

Granted, that may be broken sometime in the future by some sort of hypothetical example. However, the example you give would explicitly NOT be broken, because again, as I said before, you would need to want to distinguish between two NON-SPECIAL characters and one NON-SPECIAL character; the &aelig; case would involve two special characters.

Bug 116242 would be a nice feature to have. However, notice that it doesn't make a copy of the url or the input string, because that's a highly undesirable behaviour. Mail messages are not going to be sent around in unicode anytime soon, so rewriting libmime to be native unicode will not be interesting anytime soon.

I know you think all these copies and conversions and allocations are no big deal, but consider the case of a 1000 line plain text message. You're going to do at least 3000 extra allocations, 2000 conversions between 8 bits and unicode, and 1000 string copies (each one byte at a time). That kind of thing adds up. That's why I care about this and don't think the current situation is desirable.
Comment 22•23 years ago
Last time I debugged, the input string for MimeInlineTextPlain_parse_line() was already in UTF-8. This means the charset conversion from the mail charset to UTF-8 is already applied before reaching this function. The conversion has two parts: mail charset to UCS2, then UCS2 to UTF-8. It might be possible to call ScanTxt while we have the UCS2 string. Here is a link to the conversion function: http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimemoz2.cpp#690 (after line 720, the string is in UCS2). I will take a look to see whether my assumption is really true (I mean, that the charset conversion takes place before MimeInlineTextPlain_parse_line). Currently I have a build problem, so it may take some time.
Comment 23•23 years ago
bienvenu wrote:

> However, the example you give would explicitly NOT be broken, because again,
> as I said before, you would need to want to distinguish between two
> NON-SPECIAL characters, vs one NON-SPECIAL character. the &aelig; case
> would involve two special characters.

Yes, that's the problem!? &aelig; and practically* all other such entities are really only one char. If they occupy 1, 2 or 3 "positions" in our string, we will have to carry around the byte-length of the replaced char, so we know where to insert the replacement, where to restart the recognition etc. Also, the comparison will be a string comparison and no longer an int (|PRUnichar|) comparison, and the æ etc. we compare against will be strings in memory, not ints.

*The few exceptions, like +/-, are/should be coded in the smiley recognition part and need special treatment anyway.

> Bug 116242 would be a nice feature to have. However, notice that it doesn't
> make a copy of the url or the input string, because that's a highly
> undesirable behaviour.

I don't understand. Currently, it doesn't do anything at all, because it's not yet implemented (but we probably need it for the open-source spellchecker to work well).

> Mail messages are not going to be sent around in unicode anytime soon, so
> rewriting libmime to be native unicode will not be interesting anytime soon.

Why? We'd convert the encoding/charset at one central place in the beginning to unicode and never again worry about such stuff. libmime, the converter and layout could all use the 16bit unicode representation. As nhotta mentioned, we seem to convert to unicode already anyway. We use other encodings because they are a hard-to-remove leftover.

> but consider the case of 1000 line plain text message

Yes, but the vast majority of plaintext messages isn't even close to that size.

> at least 3000 extra allocations, 2000 conversions between 8 bits and unicode,
> and 1000 string copies (each one byte at a time)

Do you have some numbers for how long that takes on a normal machine? I'd guess it's in the << 100 ms area ("<<" = "well below"). I know you care about performance, esp. this milestone ;-P, but this optimisation comes at a cost - a too high one IMO.

nhotta wrote:

> It might be possible to call ScanTxt while we have the UCS2 string.

Interesting. <http://www.mozilla.org/mailnews/arch/libmime-description.html> says that the class containing that conversion function is a superclass of our plaintext converters. So it might be possible to override that function and postpone the conversion "or something like that" (TM) :-). Note, however, that quoting makes the matter more complicated than just running a single ScanTXT instance over everything. But we can't hack ScanTXT right into mimemoz2.cpp, because that code is also used by TextHTML et al. (not sure if you suggested that or not, just in case).
Comment 24•23 years ago
I used today's pull and verified that the charset conversion is done before the parse line function. In fact, the parsing starts right after the charset conversion: http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetext.cpp#347 Line 398 does the charset conversion, then line 420 calls the parser. In the long term, we want to send 16 bit to layout, but a more local change might be possible by delaying the UCS2 -> UTF-8 conversion: change the charset convert function to return UCS2, feed UCS2 to the parser, then convert UCS2 to UTF-8. But I am not sure all the parse functions are 16 bit unicode ready (there are two for plain text and one for html). Also, the function seems to be used for saving the message, where the input string is a C string in the mail charset, so that case still needs char* input instead of PRUnichar*.
Comment 25•23 years ago
UCS2 is a myth. What the term is being used to refer to is UTF-16, where each character is encoded using either 1 or 2 16-bit words. Assumptions about 1-to-1 correspondence between characters and encoding positions will break when Mozilla has to deal with characters outside the BMP.
Comment 26•23 years ago
> Assumptions about 1-to-1
> correspondence between characters and encoding positions will break when Mozilla
> has to deal with characters outside the BMP.
At which point we will probably switch to UCS4, the 32 bit wide encoding. Other
projects did this already.
Comment 27•23 years ago
bienvenu, what's the state of this bug? How do you intend to proceed? In case you decided not to switch to 8bit, can I help merge your other changes into the current converter class? Can you attach your changes regardless? I might have to work on the class soon, and I want to avoid conflicts for you.
Assignee
Comment 28•23 years ago
I hope to have more time to work on this, but other things have a higher priority now - I've pretty much been successfully blocked from fixing this, unfortunately, and arguing about it just isn't a good use of my time at the moment.
Updated•20 years ago
Product: MailNews → Core
Updated•16 years ago
QA Contact: esther → mime
Updated•16 years ago
Product: Core → MailNews Core
Assignee
Comment 29•12 years ago
mozTXTToHTML isn't used anymore, so I think this is wontfix.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX