Closed Bug 228543 Opened 21 years ago Closed 21 years ago

IDN: Japanese domain name in mail body is displayed as garbage

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: kazhik, Assigned: jshin1987)

Details

(Keywords: intl)

Attachments

(3 files, 2 obsolete files)

Japanese domain name in mail body is displayed as garbage. Build: 2003121209-trunk/Linux Original report in Bugzilla-jp: http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3527
Attached image Screenshot
Keywords: intl
OS: Linux → All
Hardware: PC → All
How to reproduce: 1. Copy the following URLs to the message body: http://ñ»ãÒ&#20250;Þä«¢«¹«­&#12540;.jp/ ªÎ ¡¸ñ»¡¹ http://ªÀª¤ª®ªó.jp/ ªÎ ¡¸ªÀ¡¹ http://«È«Ã«Ñ«ó«Õ«©&#12540;«à«º.jp/ ªÎ ¡¸«à¡¹ http://á¹&#26628;.jp/ ªÎ ¡¸&#26628;¡¹ 2. For testing purpose, put the first URL in Subject header 3. Set the character coding to ISO-2022-JP 4. Send it to yourself 5. Repeat 3 and 4 with the character coding set to UTF-8, Shift_JIS and EUC-JP In all cases, some parts(always the same part) of URLs in the message body are replaced by replacement characters (as shown in the screenshot). However, the Japanese URL in Subject is rendered fine. Given these, there appears to be a bug in the code that 'anchors' URLs with <a> tag (especially, the code that identifies the boundary between a URL and the following text).
Sorry I forgot to set the character encoding to UTF-8 before posting the previous comment. Characters (not covered by EUC-KR) were turned to NCRs because the encoding was set to EUC-KR. Here are they again in UTF-8. http://株式会社アスキー.jp/ の 「株」 http://だいぎん.jp/ の 「だ」 http://トッパンフォームズ.jp/ の 「ム」 http://昭栄.jp/ の 「栄」
The hexdump result is enlightening. Note that in all four cases, it's 0xa0 that breaks URIs apart. '株', 'だ', 'ム' and '栄' in UTF-8 have all contain '0xa0'. I know one place in the Mozilla source where 0xa0 is replaced by '0x20', but I'm not sure that's the cause of this bug. http://株式会社アスキー.jp/ 00000000 68 74 74 70 3a 2f 2f e6 a0 aa e5 bc 8f e4 bc 9a |http://.........| 00000010 e7 a4 be e3 82 a2 e3 82 b9 e3 82 ad e3 83 bc 2e |................| 00000020 6a 70 2f |jp/ http://だいぎん.jp/ 00000020 68 74 74 70 3a 2f 2f e3 81 a0 e3 81 | http://.....| 00000030 84 e3 81 8e e3 82 93 2e 6a 70 2f |........jp/ http://トッパンフォームズ.jp/ 00000030 68 74 74 70 | http| 00000040 3a 2f 2f e3 83 88 e3 83 83 e3 83 91 e3 83 b3 e3 |://.............| 00000050 83 95 e3 82 a9 e3 83 bc e3 83 a0 e3 82 ba 2e 6a |...............j| 00000060 70 2f |p/ http://昭栄.jp/ 00000060 68 74 74 70 3a 2f 2f e6 98 ad e6 a0 84 | http://......| 00000070 2e 6a 70 2f |.jp/
To my great surprise, mozTXTToHTMLConv:ScanTxt is called with a zero-paded UTF-8 string somewhere in mailnews/mime/src/mimet(pla|pfl).cpp. Therefore, FindURL() regards U+00A0 (zero-paded 0xa0 in UTF-8) as the end of a URL. What's amazing is not that UTF-8 strings are zero-paded (which is a common bug in the Mozilla souorce tree) but that it didn't break anything more conspicuous than this.
This bug could have easily been avoided if things had been done right in the first place. Instead, (see bug 50413 comment #10 and bug 50413 comment #15) UTF-8 string is stored in PRUnichar[]/ns*String as if PRUnichar[]/ns*String were zero-paded char[]/nsC*String. mimetpla.cpp and mimetpfl.cpp are so ugly with this hack that I don't know where to start. http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#378 http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#399 (both lineSource and lineResult are nsString but what's stored there is zero-paded C strings and they managed to make it __sorta__ work....)
Assignee: smontagu → jshin
Attached patch a patch (obsolete) — Splinter Review
It works except that the line wrapping in 'flowed text/plain' is wrong. There may be a 1-off error somewhere in my patch, but I haven't yet found it.
There's a typo in attachment 137747 [details] [diff] [review] (lineResult => lineResult2) that broke 'flowed text/plain' rendering. With that typo fixed, it now works for 'flowed text/plain'. However, there's an unexpected problem with 'Save As' (html format). I thought 'charset' information is always available because it's used in the current code (without my patch) to determine whether 'charset' is stateful (e.g. ISO-2022-JP). It's not available to |MimeInlineTextPlainFlowed_parse_line| [1] in mimetpfl.cpp and the corresponding function in mimetpla.cpp. I went upward through the caller chain and it's indeed not set anywhere (as far as I can tell). When the patch for bug 50413 went in, it must have been available... [1] http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#359
Status: NEW → ASSIGNED
Attached patch update (obsolete) — Splinter Review
This patch works well in all test cases. 'Save As' (html) now works. I checked the value of 'initializeCharset' and set charset if it's not set yet. The check could be moved up to mimebuf.cpp where 'per line function' is invoked looping through the buffer, but I didn't because there may be other side effects I'm not aware of (although there isn't likely to be one). In the patch, I also did some clean-up and simplification.
Attachment #137747 - Attachment is obsolete: true
In comment #9, the alternative location I had in mind where |initializeCharset| can be checked is here. http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimebuf.cpp#186 What I'm not sure is whether or not the code always goes through that part. To be on the safe side, I'm inclined to go ahead with the current patch, but I can be persuaded to move it there.
Jungshik, I appreciate you looking into all this. I have a big issue with the way mozTXTToHTMLConv:ScanTxt works - even in the plain ascii case, we do a bunch of unneeded allocations, conversions, copying, etc. It turns out this is a performance bottleneck on certain messages with a lot of text that looks links, or just messages with a large number of lines. I had a patch that got rid of a lot of this for the plain text messages but it's long since been lost. I guess your patch doesn't really change the amount of allocations and copyings around ScanTxt. However, my goal would be to make ScanTxt and surrounding code not do any extra allocations and copies in the default case, which is Ascii input lines with no links. Right now, it will convert the ascii input line to unicode; ScanTxt will make a Unicode copy; the calling code will convert the identical unicode copy back to ascii. What are your thoughts on that?
David, thanks for the comment. As you found, I didn't change anything in ScanTXT. I guess we somehow have to resurrect :-) or rewrite your patch in a separate bug because that's tangential to this patch. The same is true of lines without links, isn't it? Calling sites can't do much about it, can they? As for the calling sites in mailnews (i.e code _around_ ScanTXT), I think converting to and from Unicode is not expensive compared with what's done in the current code (for text labelled as ASCII and ISO-8859-1, the direct CopyASCIItoUTF16 and CopyUTF16oASCII are used in helper functions). Note that the conversion is only done for 'SaveAs HTML', which, IMHO, is not speed-critical. For message display, pretty efficient CopyUTF8toUTF16 and CopyUTF16toUTF8 are used in place of AssignWithConversion. And I removed several 'Recycle()', PR_FREE() and 'ToNew...String' :-) If you're still concerned about 'SaveAs HTML' case, there's a little room for improvement. For instance, we can check if a line is ASCII-only (regardless of the charset label) and just use CopyASCIItoUTF16 and CopyUTF16toASCII [1] directly without calling Unicode conversion helper functions, which would save a couple of function calls in pure ASCII case (assuming IsASCII is cheap). Alternatively, we can do that check in Unicode conversion helper functions if we don't want to add extra lines at the calling site. [1] Note that these helpers are rather efficient (compared with AssignWithConversion()).
thx, Jungshik, for your comments. I'm mainly concerned about message display. I will have to resurrect my patch at some point.
When making attachment 137780 [details] [diff] [review], one file (mimetpla.cpp) was not saved (although I made changes in the editor's buffer) so that there's a minor difference between mimetpla.cpp and mimetpfl.cpp in the logic (not significant to make any run-time difference in actual cases). Then, I had a second thought about an edge case (which I doubt will happen in practice : 'charset' not being defined) and made a little change to both mimetpla/tpfl.cpp. This is a very minor change, but I wanted to upload it before asking for review. BTW, David, my patch may have sped up things a bit for message display (for non-ASCII cases). By using CopyUTF8toUTF16/CopyUTF16toUTF8, two bytes in UTF-8 (for characters below U+0800) and three bytes in UTF-8 (for characters above U+0800 in BMP) are represented in a single PRUnichar when passed to ScanTXT() instead of 'two' or 'three' zero-padded pseudo-PRUnichar's. This would be especially helpful for Russian, Greek, and CJK cases where the number of characters for ScanTXT to scan is cut down by almost two or three folds.
Attachment #137780 - Attachment is obsolete: true
Comment on attachment 137872 [details] [diff] [review] update (basically the same) asking for r/sr (with a bit of variation :-))
Attachment #137872 - Flags: superreview?(bienvenu)
Attachment #137872 - Flags: review?(mscott)
Comment on attachment 137872 [details] [diff] [review] update (basically the same) mscott will be out of the office until Jan 2, so you might try sspitzer@mozilla.org
Attachment #137872 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 137872 [details] [diff] [review] update (basically the same) Thanks, David, for sr and note. Asking r for sspitzer, instead.
Attachment #137872 - Flags: review?(mscott) → review?(sspitzer)
Compiling on Windows with MSVC++ 6 led me to find a lot of 'type mismatches'. In attachement 137872, I replaced 'char *' with 'const char *' in output_fn() function prototype because I don't want to cast away 'const' from ns*String.get(). g++ on Linux was rather silent, but VC++ 6 found every and each of type mismatches introduced by that forcing me to hunt down and add 'const' to all of them. This patch is the result. It's long (partly because I used '-u8' option when making a diff), but straightforward. Below are a couple of examples. -static int MimeEncrypted_parse_buffer (char *, PRInt32, MimeObject *); +static int MimeEncrypted_parse_buffer (const char *, PRInt32, MimeObject *); static int MimeEncrypted_parse_line (char *, PRInt32, MimeObject *); -static int MimeEncrypted_parse_decoded_buffer (char *, PRInt32, MimeObject *); +static int MimeEncrypted_parse_decoded_buffer (const char *, PRInt32, MimeObjec t *); -nsresult mime_decompose_file_output_fn ( char *buf, PRInt32 size, vo id *stream_closure ); +nsresult mime_decompose_file_output_fn ( const char *buf, PRInt32 si ze, void *stream_closure ); While doing so, I removed a stupid (according to the comment in the current code) utility function replacing it with PL_strchr(). David, can you take a look to see if there's any problem with casting away 'const' (I didn't find any so far). It's not hurry (this is for 1.7 alpha) so that you can take your time.
Comment on attachment 138067 [details] [diff] [review] extra patch to 'soothe' MSVC++ 6 this looks fine - but I don't see any casting away of const, just the opposite, replacing char * with const char *, which is safe, as long as things compile.
Attachment #138067 - Flags: review+
Comment on attachment 138067 [details] [diff] [review] extra patch to 'soothe' MSVC++ 6 Thanks for r. asking for sr. As for NS_CONST_CAST, I meant I would have to use it in attachment 137872 [details] [diff] [review] without this patch. And, there's one place where I can get rid of NS_CONST_CAST as shwon below. (I forgot to remove this cast) --- mailnews/mime/src/mimethsa.cpp 19 Aug 2002 18:07:13 -0000 1.3 +++ mailnews/mime/src/mimethsa.cpp 29 Dec 2003 16:40:36 -0000 @@ -110,17 +110,17 @@ printf(" B2\n"); "\n<meta http-equiv=\"Context-Type\" content=\"text/html; charset="); charsetline += charset; charsetline += "\">\n"; int status = MimeObject_write(obj, - NS_CONST_CAST(char*, charsetline.get()), + charsetline.get(), charsetline.Length(), PR_TRUE);
Attachment #138067 - Flags: superreview?(mscott)
Attachment #138067 - Flags: superreview?(mscott) → superreview+
Comment on attachment 137872 [details] [diff] [review] update (basically the same) mscott, can you review this one as well? thanks.
Attachment #137872 - Flags: review?(sspitzer) → review?(mscott)
Attachment #137872 - Flags: review?(mscott) → review+
thanks all. fix checked into the trunk.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Verified: 2004010408-trunk/Linux.
Status: RESOLVED → VERIFIED
I had created a similar patch in a smaller scale for bug 209526. This was also to ensure that the text passed through to ScanTXT was actually unicode and not padded utf-8. I'll cancel the mimetp part of the patch now since this has been done.
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: