Closed
Bug 228543
Opened 21 years ago
Closed 21 years ago
IDN: Japanese domain name in mail body is displayed as garbage
Categories
(MailNews Core :: Internationalization, defect)
MailNews Core
Internationalization
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: kazhik, Assigned: jshin1987)
Details
(Keywords: intl)
Attachments
(3 files, 2 obsolete files)
21.30 KB,
image/jpeg
|
Details | |
34.01 KB,
patch
|
mscott
:
review+
Bienvenu
:
superreview+
|
Details | Diff | Splinter Review |
37.46 KB,
patch
|
Bienvenu
:
review+
mscott
:
superreview+
|
Details | Diff | Splinter Review |
Japanese domain name in mail body is displayed as garbage.
Build: 2003121209-trunk/Linux
Original report in Bugzilla-jp:
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3527
Reporter | ||
Comment 1•21 years ago
|
||
Assignee | ||
Updated•21 years ago
|
Assignee | ||
Comment 2•21 years ago
|
||
How to reproduce:
1. Copy the following URLs to the message body:
http://ñ»ãÒ会Þä«¢«¹«ー.jp/ ªÎ ¡¸ñ»¡¹
http://ªÀª¤ª®ªó.jp/ ªÎ ¡¸ªÀ¡¹
http://«È«Ã«Ñ«ó«Õ«©ー«à«º.jp/ ªÎ ¡¸«à¡¹
http://á¹栄.jp/ ªÎ ¡¸栄¡¹
2. For testing purpose, put the first URL in Subject header
3. Set the character coding to ISO-2022-JP
4. Send it to yourself
5. Repeat 3 and 4 with the character coding set to UTF-8, Shift_JIS and EUC-JP
In all cases, some parts(always the same part) of URLs in the message body are
replaced by replacement characters (as shown in the screenshot). However, the
Japanese URL in Subject is rendered fine.
Given these, there appears to be a bug in the code that 'anchors' URLs with <a>
tag (especially, the code that identifies the boundary between a URL and the
following text).
Assignee | ||
Comment 3•21 years ago
|
||
Sorry I forgot to set the character encoding to UTF-8 before posting the
previous comment. Characters (not covered by EUC-KR) were turned to NCRs because
the encoding was set to EUC-KR. Here are they again in UTF-8.
http://株式会社アスキー.jp/ の 「株」
http://だいぎん.jp/ の 「だ」
http://トッパンフォームズ.jp/ の 「ム」
http://昭栄.jp/ の 「栄」
Assignee | ||
Comment 4•21 years ago
|
||
The hexdump result is enlightening. Note that in all four cases, it's 0xa0 that
breaks URIs apart. '株', 'だ', 'ム' and '栄' in UTF-8 have all contain '0xa0'. I
know one place in the Mozilla source where 0xa0 is replaced by '0x20', but I'm
not sure that's the cause of this bug.
http://株式会社アスキー.jp/
00000000 68 74 74 70 3a 2f 2f e6 a0 aa e5 bc 8f e4 bc 9a |http://.........|
00000010 e7 a4 be e3 82 a2 e3 82 b9 e3 82 ad e3 83 bc 2e |................|
00000020 6a 70 2f |jp/
http://だいぎん.jp/
00000020 68 74 74 70 3a 2f 2f e3 81 a0 e3 81 | http://.....|
00000030 84 e3 81 8e e3 82 93 2e 6a 70 2f |........jp/
http://トッパンフォームズ.jp/
00000030 68 74 74 70 | http|
00000040 3a 2f 2f e3 83 88 e3 83 83 e3 83 91 e3 83 b3 e3 |://.............|
00000050 83 95 e3 82 a9 e3 83 bc e3 83 a0 e3 82 ba 2e 6a |...............j|
00000060 70 2f |p/
http://昭栄.jp/
00000060 68 74 74 70 3a 2f 2f e6 98 ad e6 a0 84 | http://......|
00000070 2e 6a 70 2f |.jp/
Assignee | ||
Comment 5•21 years ago
|
||
To my great surprise, mozTXTToHTMLConv:ScanTxt is called with a zero-paded UTF-8
string somewhere in mailnews/mime/src/mimet(pla|pfl).cpp. Therefore, FindURL()
regards U+00A0 (zero-paded 0xa0 in UTF-8) as the end of a URL.
What's amazing is not that UTF-8 strings are zero-paded (which is a common bug
in the Mozilla souorce tree) but that it didn't break anything more conspicuous
than this.
Assignee | ||
Comment 6•21 years ago
|
||
This bug could have easily been avoided if things had been done right in the
first place. Instead, (see bug 50413 comment #10 and bug 50413 comment #15)
UTF-8 string is stored in PRUnichar[]/ns*String as if PRUnichar[]/ns*String were
zero-paded char[]/nsC*String. mimetpla.cpp and mimetpfl.cpp are so ugly with
this hack that I don't know where to start.
http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#378
http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#399
(both lineSource and lineResult are nsString but what's stored there is
zero-paded C strings and they managed to make it __sorta__ work....)
Assignee: smontagu → jshin
Assignee | ||
Comment 7•21 years ago
|
||
It works except that the line wrapping in 'flowed text/plain' is wrong. There
may be a 1-off error somewhere in my patch, but I haven't yet found it.
Assignee | ||
Comment 8•21 years ago
|
||
There's a typo in attachment 137747 [details] [diff] [review] (lineResult => lineResult2) that broke
'flowed text/plain' rendering. With that typo fixed, it now works for 'flowed
text/plain'. However, there's an unexpected problem with 'Save As' (html
format). I thought 'charset' information is always available because it's used
in the current code (without my patch) to determine whether 'charset' is
stateful (e.g. ISO-2022-JP). It's not available to
|MimeInlineTextPlainFlowed_parse_line| [1] in mimetpfl.cpp and the corresponding
function in mimetpla.cpp. I went upward through the caller chain and it's indeed
not set anywhere (as far as I can tell). When the patch for bug 50413 went in,
it must have been available...
[1] http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimetpfl.cpp#359
Status: NEW → ASSIGNED
Assignee | ||
Comment 9•21 years ago
|
||
This patch works well in all test cases. 'Save As' (html) now works. I checked
the value of 'initializeCharset' and set charset if it's not set yet. The check
could be moved up to mimebuf.cpp where 'per line function' is invoked looping
through the buffer, but I didn't because there may be other side effects I'm
not aware of (although there isn't likely to be one).
In the patch, I also did some clean-up and simplification.
Attachment #137747 -
Attachment is obsolete: true
Assignee | ||
Comment 10•21 years ago
|
||
In comment #9, the alternative location I had in mind where |initializeCharset|
can be checked is here.
http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/mimebuf.cpp#186
What I'm not sure is whether or not the code always goes through that part. To
be on the safe side, I'm inclined to go ahead with the current patch, but I can
be persuaded to move it there.
Comment 11•21 years ago
|
||
Jungshik, I appreciate you looking into all this. I have a big issue with the
way mozTXTToHTMLConv:ScanTxt works - even in the plain ascii case, we do a bunch
of unneeded allocations, conversions, copying, etc. It turns out this is a
performance bottleneck on certain messages with a lot of text that looks links,
or just messages with a large number of lines. I had a patch that got rid of a
lot of this for the plain text messages but it's long since been lost. I guess
your patch doesn't really change the amount of allocations and copyings around
ScanTxt. However, my goal would be to make ScanTxt and surrounding code not do
any extra allocations and copies in the default case, which is Ascii input lines
with no links. Right now, it will convert the ascii input line to unicode;
ScanTxt will make a Unicode copy; the calling code will convert the identical
unicode copy back to ascii. What are your thoughts on that?
Assignee | ||
Comment 12•21 years ago
|
||
David, thanks for the comment.
As you found, I didn't change anything in ScanTXT. I guess we somehow have to
resurrect :-) or rewrite your patch in a separate bug because that's tangential
to this patch. The same is true of lines without links, isn't it? Calling sites
can't do much about it, can they?
As for the calling sites in mailnews (i.e code _around_ ScanTXT), I think
converting to and from Unicode is not expensive compared with what's done in the
current code (for text labelled as ASCII and ISO-8859-1, the direct
CopyASCIItoUTF16 and CopyUTF16oASCII are used in helper functions). Note that
the conversion is only done for 'SaveAs HTML', which, IMHO, is not
speed-critical. For message display, pretty efficient CopyUTF8toUTF16 and
CopyUTF16toUTF8 are used in place of AssignWithConversion. And I removed several
'Recycle()', PR_FREE() and 'ToNew...String' :-)
If you're still concerned about 'SaveAs HTML' case, there's a little room for
improvement. For instance, we can check if a line is ASCII-only (regardless of
the charset label) and just use CopyASCIItoUTF16 and CopyUTF16toASCII [1]
directly without calling Unicode conversion helper functions, which would save a
couple of function calls in pure ASCII case (assuming IsASCII is cheap).
Alternatively, we can do that check in Unicode conversion helper functions if we
don't want to add extra lines at the calling site.
[1] Note that these helpers are rather efficient (compared with
AssignWithConversion()).
Comment 13•21 years ago
|
||
thx, Jungshik, for your comments. I'm mainly concerned about message display. I
will have to resurrect my patch at some point.
Assignee | ||
Comment 14•21 years ago
|
||
When making attachment 137780 [details] [diff] [review], one file (mimetpla.cpp) was not saved (although
I made changes in the editor's buffer) so that there's a minor difference
between mimetpla.cpp and mimetpfl.cpp in the logic (not significant to make any
run-time difference in actual cases). Then, I had a second thought about an
edge case (which I doubt will happen in practice : 'charset' not being defined)
and made a little change to both mimetpla/tpfl.cpp. This is a very minor
change, but I wanted to upload it before asking for review.
BTW, David, my patch may have sped up things a bit for message display (for
non-ASCII cases). By using CopyUTF8toUTF16/CopyUTF16toUTF8, two bytes in UTF-8
(for characters below U+0800) and three bytes in UTF-8 (for characters above
U+0800 in BMP) are represented in a single PRUnichar when passed to ScanTXT()
instead of 'two' or 'three' zero-padded pseudo-PRUnichar's. This would be
especially helpful for Russian, Greek, and CJK cases where the number of
characters for ScanTXT to scan is cut down by almost two or three folds.
Assignee | ||
Updated•21 years ago
|
Attachment #137780 -
Attachment is obsolete: true
Assignee | ||
Comment 15•21 years ago
|
||
Comment on attachment 137872 [details] [diff] [review]
update (basically the same)
asking for r/sr (with a bit of variation :-))
Attachment #137872 -
Flags: superreview?(bienvenu)
Attachment #137872 -
Flags: review?(mscott)
Comment 16•21 years ago
|
||
Comment on attachment 137872 [details] [diff] [review]
update (basically the same)
mscott will be out of the office until Jan 2, so you might try
sspitzer@mozilla.org
Attachment #137872 -
Flags: superreview?(bienvenu) → superreview+
Assignee | ||
Comment 17•21 years ago
|
||
Comment on attachment 137872 [details] [diff] [review]
update (basically the same)
Thanks, David, for sr and note.
Asking r for sspitzer, instead.
Attachment #137872 -
Flags: review?(mscott) → review?(sspitzer)
Assignee | ||
Comment 18•21 years ago
|
||
Compiling on Windows with MSVC++ 6 led me to find a lot of 'type mismatches'.
In attachement 137872, I replaced 'char *' with 'const char *' in output_fn()
function prototype because I don't want to cast away 'const' from
ns*String.get(). g++ on Linux was rather silent, but VC++ 6 found every and
each of type mismatches introduced by that forcing me to hunt down and add
'const' to all of them. This patch is the result. It's long (partly because I
used '-u8' option when making a diff), but straightforward. Below are a couple
of examples.
-static int MimeEncrypted_parse_buffer (char *, PRInt32, MimeObject *);
+static int MimeEncrypted_parse_buffer (const char *, PRInt32, MimeObject *);
static int MimeEncrypted_parse_line (char *, PRInt32, MimeObject *);
-static int MimeEncrypted_parse_decoded_buffer (char *, PRInt32, MimeObject *);
+static int MimeEncrypted_parse_decoded_buffer (const char *, PRInt32,
MimeObjec
t *);
-nsresult mime_decompose_file_output_fn ( char *buf, PRInt32 size,
vo
id *stream_closure );
+nsresult mime_decompose_file_output_fn ( const char *buf, PRInt32
si
ze, void *stream_closure );
While doing so, I removed a stupid (according to the comment in the current
code) utility function replacing it with PL_strchr().
David, can you take a look to see if there's any problem with casting away
'const' (I didn't find any so far). It's not hurry (this is for 1.7 alpha) so
that you can take your time.
Comment 19•21 years ago
|
||
Comment on attachment 138067 [details] [diff] [review]
extra patch to 'soothe' MSVC++ 6
this looks fine - but I don't see any casting away of const, just the opposite,
replacing char * with const char *, which is safe, as long as things compile.
Attachment #138067 -
Flags: review+
Assignee | ||
Comment 20•21 years ago
|
||
Comment on attachment 138067 [details] [diff] [review]
extra patch to 'soothe' MSVC++ 6
Thanks for r. asking for sr.
As for NS_CONST_CAST, I meant I would have to use it in attachment 137872 [details] [diff] [review]
without this patch. And, there's one place where I can get rid of NS_CONST_CAST
as shwon below. (I forgot to remove this cast)
--- mailnews/mime/src/mimethsa.cpp 19 Aug 2002 18:07:13 -0000 1.3
+++ mailnews/mime/src/mimethsa.cpp 29 Dec 2003 16:40:36 -0000
@@ -110,17 +110,17 @@ printf(" B2\n");
"\n<meta http-equiv=\"Context-Type\" content=\"text/html; charset=");
charsetline += charset;
charsetline += "\">\n";
int status = MimeObject_write(obj,
- NS_CONST_CAST(char*, charsetline.get()),
+ charsetline.get(),
charsetline.Length(),
PR_TRUE);
Attachment #138067 -
Flags: superreview?(mscott)
Updated•21 years ago
|
Attachment #138067 -
Flags: superreview?(mscott) → superreview+
Assignee | ||
Comment 21•21 years ago
|
||
Comment on attachment 137872 [details] [diff] [review]
update (basically the same)
mscott, can you review this one as well? thanks.
Attachment #137872 -
Flags: review?(sspitzer) → review?(mscott)
Updated•21 years ago
|
Attachment #137872 -
Flags: review?(mscott) → review+
Assignee | ||
Comment 22•21 years ago
|
||
thanks all. fix checked into the trunk.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Comment 24•21 years ago
|
||
I had created a similar patch in a smaller scale for bug 209526. This was also
to ensure that the text passed through to ScanTXT was actually unicode and not
padded utf-8. I'll cancel the mimetp part of the patch now since this has been
done.
Updated•20 years ago
|
Product: MailNews → Core
Updated•17 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•