Closed
Bug 18410
Opened 25 years ago
Closed 25 years ago
[DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Categories
(MailNews Core :: Composition, defect, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
M13
People
(Reporter: momoi, Assigned: rhp)
References
Details
(Whiteboard: [PDT-]workaround patch proposed for review)
Attachments
(2 files)
1017 bytes,
patch
|
Details | Diff | Splinter Review | |
872 bytes,
text/plain
|
Details |
** Observed with 11/9/99 Win32 build (1999110911) **

In Mozilla, we should be able to send an HTML message in which the mail charset is Japanese but the message contains Latin 1 characters. All we need to do is turn these Latin 1 8-bit characters into HTML entities. Unfortunately, we form such a message body very badly, and the resulting message cannot be displayed under 4.7 or Mozilla.

Here's how to reproduce such a message:
1. Bring up the HTML Mail Compose window.
2. Input the following text using the JPN IME. (View the text under the Japanese (Auto-Detect) encoding.)
   これは日本語のテキストです。
3. Now switch the keyboard to EN (this works under Win98) and type:
   This is an accented word: bete.
4. Set the Mozilla messenger encoding to Japanese (ISO-2022-JP) and send out the mail.
5. Receive this mail and observe that the display is badly mangled because the message was composed incorrectly.
Reporter | ||
Comment 1•25 years ago
|
||
I could not input the Latin 1 accented word correctly under 4.7 and so here is the sentence you can use for Latin 1: This is an accented word: béte.
Reporter | ||
Updated•25 years ago
|
QA Contact: lchiang → momoi
Reporter | ||
Comment 2•25 years ago
|
||
I need to correct the test string to use for Japanese. It now seems that the problem is due to some Japanese characters containing what amounts to "@" as part of their raw bytes. Here's a string which reproduces this problem:
日本語と西欧語のアクセント。
The source of the mangled message looks like this:
----
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit

<html><head></head>
<body>
<div type="_moz"><A HREF="mailto:$BF|K\8l$H@">$BF|K\8l$H@</A>>2$8l$N%"%/%;%s%H!#(B</div><div type="_moz">This is an accented word: bête.</div></body>
</html>
-------
Note how the "@" formed the basis for creating a mailtourl structure.
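The observation above can be checked outside of Mozilla. The following is a minimal sketch, assuming Python's built-in `iso2022_jp` codec as a stand-in for the mail charset encoder (it is not the converter Mozilla uses). It shows that the test string contains no "@" as text, yet its ISO-2022-JP byte form does contain the byte 0x40.

```python
# Sketch only: Python's iso2022_jp codec stands in for the mail charset encoder.
text = "日本語と西欧語のアクセント。"

# Encode the Unicode text to the 7-bit ISO-2022-JP byte form used in the mail.
encoded = text.encode("iso2022_jp")

# Find every position where the raw byte equals 0x40, the ASCII code of "@".
at_offsets = [i for i, b in enumerate(encoded) if b == 0x40]

print(encoded)
print("'@' in original text:", "@" in text)
print("offsets of byte 0x40 in encoded form:", at_offsets)
```

Any scanner that treats the encoded bytes as ASCII text will see that 0x40 as an "@" sign.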
Updated•25 years ago
|
Assignee: ducarroz → nhotta
Comment 3•25 years ago
|
||
This is likely to be a problem in the entity converter. Reassigning to nhotta.
Updated•25 years ago
|
Status: NEW → ASSIGNED
Target Milestone: M12
Reporter | ||
Updated•25 years ago
|
Summary: Mixed JPN and Latin 1 text body is badly composed in HTML mail when sent → JPN text body with certain byte combinations is badly composed in HTML mail
Reporter | ||
Comment 4•25 years ago
|
||
This turns out to have nothing to do with entities. I first observed this problem with data which included HTML entities, but the problem also occurs when you have Japanese data only. The problem has to do with certain byte sequences which are not handled well as the HTML text is formed. In the above example, 日本語と西欧語のアクセント。 the problem seems to be caused by the character 西, whose JIS codepoint is 0x40 0x3E "@ >". I suspect the problem is 0x3E, which somehow causes the HTML parser to generate a "mailtourl". Here's a much simplified example. Include the following 2 characters (actually the 2nd character is not directly relevant) in your HTML mail body: 西国 and you'll see the same problem of bogus mailtourl formation.
Reporter | ||
Comment 5•25 years ago
|
||
I guess it is rather the "@" which causes this problem. Try the following: 生きること in HTML body text. This one gets pulled into the bogus mailtourl formation without any corruption. It is still wrong nonetheless. 生 is 0x40 0x38.
Reporter | ||
Updated•25 years ago
|
Summary: JPN text body with certain byte combinations is badly composed in HTML mail → JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Reporter | ||
Updated•25 years ago
|
Severity: major → critical
Summary: JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Reporter | ||
Comment 6•25 years ago
|
||
We should fix this problem because the characters which cause this problem are too common and too many. Here's my current guess: Looking at its ISO-2022-JP values: First byte: 0x90 Second byte: 0x9F - 0xFC The problem is corruption of input data and it probably will be triggered by 94 characters in a very common range. I don't believe we should let M11 out without fixing it.
Reporter | ||
Comment 7•25 years ago
|
||
Correction: The above hex values are in the Shift_JIS encoding. In the JPN mail encoding, ISO-2022-JP, the byte ranges are as follows: 1st byte: 0x40 2nd byte: 0x21 - 0x94 The problem in a nutshell is that the 1st byte value 0x40 triggers bogus HTML mailto URL formation.
Reporter | ||
Comment 8•25 years ago
|
||
The 2nd byte JIS value should be corrected to: 2nd byte: 0x21 - 0x7E
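The corrected range (1st byte 0x40, 2nd byte 0x21 - 0x7E) can be enumerated directly. This is a minimal sketch, assuming Python's built-in `iso2022_jp` and `shift_jis` codecs; the helper name `chars_with_lead_byte` is made up for illustration. It lists the characters whose ISO-2022-JP lead byte is 0x40 and prints the matching Shift_JIS byte range for comparison with comment 6.

```python
# Sketch only: enumerate the JIS X 0208 characters whose first byte is 0x40.
ESC_JIS = b"\x1b$B"    # escape sequence that switches to JIS X 0208
ESC_ASCII = b"\x1b(B"  # escape sequence that switches back to ASCII


def chars_with_lead_byte(lead: int) -> list[str]:
    """Decode every two-byte sequence (lead, 0x21..0x7E) as ISO-2022-JP."""
    chars = []
    for trail in range(0x21, 0x7F):
        raw = ESC_JIS + bytes([lead, trail]) + ESC_ASCII
        chars.append(raw.decode("iso2022_jp"))
    return chars


row = chars_with_lead_byte(0x40)
print(len(row), "characters start with 0x40 in ISO-2022-JP")
print("".join(row))

# The same characters in Shift_JIS, for comparison with the earlier comment.
sjis = [ch.encode("shift_jis") for ch in row]
print("Shift_JIS range:", sjis[0].hex(), "-", sjis[-1].hex())
```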
Comment 10•25 years ago
|
||
Investigating, no solution so far. The problem could be in messenger, the editor, or the charset converter. If it is the editor's problem, it should be reproducible with the HTML editor's save. If it is the converter's problem, it should happen in plain text when the Japanese text is combined with HTML tags.
Comment 11•25 years ago
|
||
The text is converted incorrectly by the ISO-2022-JP encoder when the input is Japanese with HTML tags. Editor save works because it uses the converter to convert only the text without HTML tags. Adding cata to cc. It would help isolate the problem to check whether this is a regression from M10. Here is reproducible data (where JAPANESE is to be replaced by the data from momoi's comment 11/09/99 22:42):
<html><head><title>ikiru</title></head><body>
<div type="_moz">JAPANESE</div></body>
</html>
Reporter | ||
Comment 12•25 years ago
|
||
The problem also exists with 10/6/99 Win32 M10 release build. (Note: There was a crashing bug for IME input (without first inputting ASCII) at M10 and extensive input test was not practical at that time.)
Updated•25 years ago
|
Assignee: nhotta → cata
Status: ASSIGNED → NEW
Comment 13•25 years ago
|
||
I talked to cata, he is going to take a look.
Reporter | ||
Comment 14•25 years ago
|
||
Having said that, I think we should try to fix this for M11. The characters which trigger this problem are fairly common and many in number. Also, once 0x40 (@) is encountered, the data after it gets pulled into mailtourl formation until a break is encountered. Since Japanese does not have a space break, this means that pretty much all the data thereafter is corrupted and not even recognizable as Japanese. It normally displays as corrupted data under 4.x, but under 5.0 the body simply displays as blank. This is a bad problem and needs to be fixed ASAP regardless of when the bug was introduced.
Comment 15•25 years ago
|
||
Here is the expected result of the example I posted before. In the actual result, however, the '<' before "/div>" is missing.
<html><head></head>
<body>
<div type="_moz">$B$-$k$3$H(B</div></body>
</html>
Comment 16•25 years ago
|
||
There is a correction to my last example (I pasted data from the broken output). Here is the correct expected result for the Japanese part (generated by 4.6): $B@8$-$k$3$H(B The Japanese part of the data in Unicode is 751F 304D 308B 3053 3068.
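The expected byte string and the Unicode values quoted above can be cross-checked. This is a minimal sketch, assuming Python's built-in `iso2022_jp` codec and assuming that the escape character (0x1B), which does not show up in the pasted text, precedes `$B` and `(B`.

```python
# Sketch only: ESC (0x1B) is restored in front of "$B" and "(B" by assumption.
raw = b"\x1b$B@8$-$k$3$H\x1b(B"

# Decode the ISO-2022-JP bytes back to Unicode text.
text = raw.decode("iso2022_jp")

# Format each character as a four-digit hexadecimal code point.
code_points = [f"{ord(ch):04X}" for ch in text]

print(text)
print(" ".join(code_points))
```

The first character is carried by the two bytes "@8" (0x40 0x38), which is where the stray "@" comes from.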
Comment 17•25 years ago
|
||
Tested with the provided data and the conversion is ok. So, the problem is somewhere else. Reassigning back.
Updated•25 years ago
|
Assignee: nhotta → rhp
Comment 18•25 years ago
|
||
With my local build updated this morning, I can no longer see the corrupted data from the converter. The problem happens after we convert from Unicode. Before we send, ScanHTMLForURLs() is called, and that does not work for ISO-2022-JP (or any charset whose bytes overlap with special characters like '<', '&' or '@', etc.). That operation needs to be done on the Unicode data before we convert to the mail charset. I don't think there is an easy fix for this since the function is not written for Unicode. We should make it PRUnichar*-based or use UTF-8. I propose to disable this feature for M11 and make the change in M12. Reassigning to rhp.
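The ordering problem described above can be illustrated with a toy scanner. This is a minimal sketch, not the ScanHTMLForURLs() implementation: `linkify` and its regular expression are hypothetical, Python's `iso2022_jp` codec stands in for the mail charset converter, and the address `user@example.com` is a placeholder.

```python
import re

# Hypothetical, deliberately naive address pattern: a run of non-space
# characters, an "@", then any trailing run that stops at whitespace,
# angle brackets, or a double quote.
ADDRESS = re.compile(r'[^\s<>"@]+@[^\s<>"]*')


def linkify(text: str) -> str:
    """Wrap anything that looks like an address in a mailto anchor."""
    return ADDRESS.sub(
        lambda m: f'<A HREF="mailto:{m.group(0)}">{m.group(0)}</A>', text
    )


body = "日本語と西欧語のアクセント。"

# Wrong order: convert to the mail charset first, then scan the raw bytes
# as if they were single-byte text. The 0x40 byte inside a kanji is
# mistaken for "@".
scanned_after = linkify(body.encode("iso2022_jp").decode("latin-1"))

# Right order: scan the Unicode text, then convert to the mail charset.
scanned_before = linkify(body).encode("iso2022_jp")

# A genuine address is still picked up when scanning Unicode text.
with_address = linkify(body + " user@example.com").encode("iso2022_jp")

print("mailto:" in scanned_after)     # bogus link created
print(b"mailto:" in scanned_before)   # no link, text intact
print(b"mailto:" in with_address)     # real address linked
```

Scanning before conversion keeps both properties the thread asks for: Japanese text is left alone, and real addresses still become links.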
Comment 19•25 years ago
|
||
Comment 20•25 years ago
|
||
I attached a patch which enables URL scanning only for ISO-8859-1 and us-ascii. I tested it: ISO-8859-1 does generate a mailto URL, and ISO-2022-JP bypasses the scan, so the problem described in this bug no longer happens. We should change the scanner to be Unicode-based for M12. Rich, could you take a look at the diff and check it in to the branch if it looks fine?
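The shape of that workaround can be sketched as a simple charset gate. This is not the attached patch: the function name `should_scan_for_urls` is hypothetical, and the case-insensitive comparison is an assumption made for the sketch.

```python
# Sketch only: charsets for which byte-oriented URL scanning is treated
# as safe, per the workaround described above.
SAFE_CHARSETS = {"iso-8859-1", "us-ascii"}


def should_scan_for_urls(mail_charset: str) -> bool:
    """Return True only when the mail charset is on the safe list.

    Assumption: charset names are compared case-insensitively.
    """
    return mail_charset.strip().lower() in SAFE_CHARSETS


for charset in ("ISO-8859-1", "us-ascii", "ISO-2022-JP"):
    print(charset, should_scan_for_urls(charset))
```

The trade-off is that mail in any other charset gets no automatic mailto links until the scanner works on Unicode data.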
Assignee | ||
Updated•25 years ago
|
Status: NEW → ASSIGNED
Target Milestone: M11 → M13
Assignee | ||
Comment 21•25 years ago
|
||
I checked in the fix Naoki provided for the M11 branch and I'm moving this until later for a better fix. - rhp
Whiteboard: workaround patch proposed for review → [PDT-]workaround patch proposed for review
Comment 22•25 years ago
|
||
Putting on pdt- radar. bobj in room and approved.
Assignee | ||
Updated•25 years ago
|
Summary: [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Assignee | ||
Comment 23•25 years ago
|
||
Caps make it stand out in my list :-)
Assignee | ||
Comment 24•25 years ago
|
||
Currently, Ben Bucksch (mozilla@bucksch.org) is rewriting the class that is responsible for this autodetection, and he will be using Unicode-safe techniques. When his rewrite is in place, I may have to tweak a call or two, but we should be ok. I18N Gurus: I have a question....if I have a "char *" that I know is ISO-2022-JP, how do I create the correct nsString? - rhp
Comment 25•25 years ago
|
||
Unicode conversion is needed in order to create nsString from ISO-2022-JP (e.g. use ConvertToUnicode). But the conversion should be avoided by doing the autodetection before converting to ISO-2022-JP. Can the detection be done earlier (i.e. right after getting the data from the editor and before calling nsISaveAsCharset)?
Assignee | ||
Comment 26•25 years ago
|
||
Can someone post a message that would cause this problem? Thanks! - rhp
Assignee | ||
Comment 27•25 years ago
|
||
I was digging into this a little further and it looks like we may have fixed this after landing the new URL detection code by BenB. I need to take out the fix that Naoki came up with and then this can be retested. - rhp PS: ignore my last post about the test message.
Assignee | ||
Updated•25 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 28•25 years ago
|
||
Ok, no guarantees here, but I think this one might be fixed. The latest changes to the URL detection code do comparisons using nsStrings instead of char *, and the limited testing I could do seems to show this fixed (maybe :-) I just checked in the fix for this, so you won't be able to test it until a new build with these changes is done from the tip. Please let me know if we still have problems. - rhp
Comment 29•25 years ago
|
||
I looked at the code and I think it is missing the conversion. The input of ScanHTML is Unicode, but it is generated without charset conversion. The data from the editor is Unicode, so ScanHTML can be applied before we convert to the mail charset. I will investigate this more. IQA, could you post reproducible data?
Reporter | ||
Comment 30•25 years ago
|
||
Reporter | ||
Comment 31•25 years ago
|
||
I just uploaded a test case msg which includes both the problem-inducing characters and a mailtourl candidate. In M11, we avoided the problem by turning off mailtourl creation for Japanese mail. The current solution should be able to handle the above mail and at the same time create a mailtourl. ** Checked with 1/24/00 Win32 build ** The above build does create a mailtourl correctly and at the same time does not mangle the problem characters (those containing 0x40 in iso-2022-jp). Marking the fix verified.
Status: RESOLVED → VERIFIED
Updated•20 years ago
|
Product: MailNews → Core
Updated•16 years ago
|
Product: Core → MailNews Core