Closed Bug 18410 Opened 25 years ago Closed 25 years ago

[DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail

Categories

(MailNews Core :: Composition, defect, P3)

x86
Other
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: momoi, Assigned: rhp)

References

Details

(Whiteboard: [PDT-]workaround patch proposed for review)

Attachments

(2 files)

** Observed with 11/9/99 Win32 build (1999110911) **

In Mozilla, we should be able to send an HTML message in which the mail charset is Japanese but
the message contains Latin 1 characters. All that we need to do is to turn these Latin 1 8-bit characters
into HTML entities. Unfortunately, we form such a message body very badly and the resulting message
cannot be displayed under 4.7 or Mozilla.

Here's what you can do to reproduce such a message.

1. Bring up HTML Mail Compose window
2. Input the following text using JPN IME. (View the text under Japanese (Auto-Detect) encoding.)

   これは日本語のテキストです。

3. Now switch the keyboard to EN -- this works under Win98.

   This is an accented word: bete.

4. Set the Mozilla messenger encoding to Japanese (ISO-2022-JP)
   and send out the mail.
5. Receive this mail and observe that the display is badly mangled since the composition has been badly done.
I could not input the Latin 1 accented word correctly under 4.7 and
so here is the sentence you can use for Latin 1:

This is an accented word: béte.
QA Contact: lchiang → momoi
I need to make a correction to the test string to use for Japanese.
It now seems that the problem is due to some Japanese characters containing what amounts to
"@" as part of their raw bytes.

Here's the string which works to reproduce this problem:

日本語と西欧語のアクセント。

The source of the mangled message looks like this:

----
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit

<html><head></head>
<body>
<div type="_moz"><A HREF="mailto:$BF|K\8l$H@">$BF|K\8l$H@</A>>2$8l$N%"%/%;%s%H!#(B</div><div type="_moz">This
is an accented word: b&ecirc;te.</div></body>
</html>
-------

Note how the "@" formed a basis for creating a mailtourl structure.
Assignee: ducarroz → nhotta
This is likely to be a problem with the entity converter. Reassign to nhotta.
Status: NEW → ASSIGNED
Target Milestone: M12
Summary: Mixed JPN and Latin 1 text body is badly composed in HTML mail when sent → JPN text body with certain byte combinations is badly composed in HTML mail
This turns out to have nothing to do with entities. I first observed
this problem with data which included HTML entities but the problem occurs
when you have Japanese data only. The problem has to do with certain
byte sequence which cannot be handled well as HTML text is formed.

In the above example,

日本語と西欧語のアクセント。

the problem seems to be caused by the character 西 whose JIS codepoint
is 0x40 0x3E "@ >". I suspect the problem is 0x3E which somehow
causes the HTML parser to generate "mailtourl". Here's a much simplified
example. Include the following 2 characters (actually the 2nd character
is not directly relevant) in your HTML mail body.

西国

and you'll see the same problem of bogus mailtourl formation.
I guess it is rather "@" which causes this problem.
Try the following:

生きること

in HTML body text. This one gets pulled into the bogus mailtourl
formation without any corruption. It is still wrong nonetheless.

生 is 0x40 0x38.
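
A quick way to check which characters carry a 0x40 byte is to encode them and look at the payload between the escape sequences. The sketch below uses Python's built-in `iso2022_jp` codec rather than Mozilla's converter; the helper name `jis_bytes` is made up for illustration.

```python
def jis_bytes(ch: str) -> bytes:
    """Return the two JIS X 0208 bytes that ISO-2022-JP uses for one character."""
    encoded = ch.encode("iso2022_jp")
    # The codec wraps the payload in ESC $ B (switch to JIS X 0208)
    # and ESC ( B (switch back to ASCII); strip both.
    assert encoded.startswith(b"\x1b$B") and encoded.endswith(b"\x1b(B")
    return encoded[3:-3]

for ch in "西国生":
    pair = jis_bytes(ch)
    flag = "contains 0x40 ('@')" if 0x40 in pair else "no 0x40"
    print(ch, pair.hex(" "), flag)
```

Any character flagged here is one that a byte-oriented scanner could mistake for part of an e-mail address.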
Summary: JPN text body with certain byte combinations is badly composed in HTML mail → JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Severity: major → critical
Summary: JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
We should fix this problem because the characters which
cause this problem are too common and too many.
Here's my current guess:

Looking at its ISO-2022-JP values:

First byte: 0x90
Second byte: 0x9F - 0xFC

The problem is corruption of input data and it probably will
be triggered by 94 characters in a very common range.

I don't believe we should let M11 out without fixing it.
Correction:

The above Hex values are in Shift_JIS encoding.
In the JPN mail encoding, ISO-2022-JP, the byte ranges are as follows:

1st byte: 0x40
2nd byte: 0x21 - 0x94

The problem in a nutshell is that the 1st byte value 0x40 triggers
bogus HTML mailto URL formation.
The 2nd byte JIS value should be corrected to:

2nd byte: 0x21 - 0x7E
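
The corrected ranges can be cross-checked by enumerating every two-byte JIS sequence with lead byte 0x40 and re-encoding the result as Shift_JIS. This is only a sketch using Python's standard codecs, on the assumption that they agree with Mozilla's converters for this row.

```python
ESC_JIS = b"\x1b$B"    # switch to JIS X 0208
ESC_ASCII = b"\x1b(B"  # switch back to ASCII

affected = []
for second in range(0x21, 0x7F):  # 0x21..0x7E inclusive
    raw = ESC_JIS + bytes([0x40, second]) + ESC_ASCII
    affected.append(raw.decode("iso2022_jp"))

# Re-encode the same characters as Shift_JIS to compare byte ranges.
sjis = [ch.encode("shift_jis") for ch in affected]
print("characters:", len(affected))
print("Shift_JIS 1st byte:", hex(min(b[0] for b in sjis)), "-", hex(max(b[0] for b in sjis)))
print("Shift_JIS 2nd byte:", hex(min(b[1] for b in sjis)), "-", hex(max(b[1] for b in sjis)))
```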
Target Milestone: M12 → M11
Putting on M11 radar to get fixed on the branch.
Investigating, no solution so far. The problem could be in messenger, editor or
charset converter.
If it is the editor's problem, it should be reproducible in the HTML editor's save.
If it is the converter's problem, it should happen in plain text by putting the
Japanese text with the HTML tags.
The text is converted incorrectly by the ISO-2022-JP encoder when the input is
Japanese with HTML tags. Editor save works because it uses the converter to convert
only the text without HTML tags. Adding cata to cc.
It would help to isolate the problem by checking whether this is a regression from M10.

Here is reproducible data (where JAPANESE is to be replaced by the data from
momoi's comment 11/09/99 22:42)

<html><head><title>ikiru</title></head><body>
<div type="_moz">JAPANESE</div></body>
</html>
The problem also exists with 10/6/99 Win32 M10 release build.
(Note: There was a crashing bug for IME input (without first inputting
ASCII) at M10 and extensive input test was not practical at that
time.)
Assignee: nhotta → cata
Status: ASSIGNED → NEW
I talked to cata, he is going to take a look.
Having said that, I think we should try to fix this for M11.
The characters which trigger this problem are fairly common
and many in number. Also once 0x40 (@) is encountered, data after that
gets pulled into mailtourl formation until a break is encountered.
Since Japanese does not have a space break, this means pretty
much corrupted data thereafter and not even recognizable as
Japanese. It normally displays as corrupted data under 4.x but
under 5.0, the body simply displays as blank.

This is a bad problem and needs to be fixed ASAP regardless of when the bug
was introduced.
Here is the expected result of the example I posted before. In the actual result, however,
the '<' before "/div>" is missing.

<html><head></head>
<body>
<div type="_moz">$B$-$k$3$H(B</div></body>
</html>
There is a correction to my last example (I pasted data from the broken data).
Here is a correct expected result of Japanese part (generated by 4.6).
$B@8$-$k$3$H(B
And Japanese part of the data in unicode is 751F 304D 308B 3053 3068.
Assignee: cata → nhotta
Tested with the provided data and the conversion is ok. So, the problem is
somewhere else. Reassigning back.
Assignee: nhotta → rhp
With my local build updated this morning, I can no longer see the corrupted data
from the converter. The problem happens after we convert from unicode.
Before we send, ScanHTMLForURLs() is called and that does not work for
ISO-2022-JP (or any charset whose bytes overlap with special characters like '<',
'&' or '@', etc.). That operation needs to be done using unicode data before we
convert to the mail charset.
I don't think there is an easy fix for this since the function is not written
for unicode. We should make it PRUnichar*-based or use UTF-8.
I propose to disable this feature for M11 and do the change in M12.
Reassign to rhp.
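
The order-of-operations problem can be illustrated with a small Python sketch. The pattern and the functions `linkify_bytes` and `linkify_text` are hypothetical stand-ins, not the actual ScanHTMLForURLs() logic; they only show why scanning after charset conversion misfires while scanning the Unicode text first does not.

```python
import re

# Deliberately naive address pattern for illustration only:
# some non-space, non-markup characters, an "@", then more of the same.
_PATTERN = r'[^\s<>"@]+@[^\s<>"@]*'
ADDR_BYTES = re.compile(_PATTERN.encode("ascii"))
ADDR_TEXT = re.compile(_PATTERN)

def linkify_bytes(body: bytes) -> bytes:
    """Scan already-encoded mail bytes (the failing order of operations)."""
    def wrap(m):
        addr = m.group(0)
        return b'<A HREF="mailto:' + addr + b'">' + addr + b'</A>'
    return ADDR_BYTES.sub(wrap, body)

def linkify_text(body: str) -> str:
    """Scan Unicode text before it is converted to the mail charset."""
    def wrap(m):
        addr = m.group(0)
        return f'<A HREF="mailto:{addr}">{addr}</A>'
    return ADDR_TEXT.sub(wrap, body)

text = "西国"
scanned_late = linkify_bytes(text.encode("iso2022_jp"))   # convert, then scan
scanned_early = linkify_text(text).encode("iso2022_jp")   # scan, then convert
print(scanned_late)
print(scanned_early)
```

In the late scan, the 0x40 byte inside 西 is taken for an "@" and a mailto anchor is wrapped around raw JIS bytes. In the early scan the text contains no "@" at all, so it passes through untouched.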
I put a patch which enables URL scanning only for ISO-8859-1 and us-ascii.
I tested, ISO-8859-1 does generate mailto url and ISO-2022-JP bypassed this so
the problem described in this bug no longer happens. We should change the scanner to
be unicode-based for M12.
Rich, could you take a look at the diff and check in to the branch if it looks
fine?
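
The shape of the workaround, roughly, in Python. This is not the attached patch; the set `SCAN_SAFE_CHARSETS` and the function `should_scan_for_urls` are hypothetical names used only to show the gating idea.

```python
# Charsets for which a byte-oriented scan is treated as safe in this sketch,
# mirroring the two named in the workaround.
SCAN_SAFE_CHARSETS = {"iso-8859-1", "us-ascii"}

def should_scan_for_urls(mail_charset: str) -> bool:
    """Return True only when the mail charset is on the safe list."""
    return mail_charset.strip().lower() in SCAN_SAFE_CHARSETS

for charset in ("ISO-8859-1", "us-ascii", "ISO-2022-JP"):
    print(charset, should_scan_for_urls(charset))
```

The cost of this approach is that Japanese mail gets no mailto links at all until the scanner itself works on Unicode.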
Whiteboard: workaround patch proposed for review
Status: NEW → ASSIGNED
Target Milestone: M11 → M13
I checked in the fix Naoki provided for the M11 branch and I'm moving this
until later for a better fix.

- rhp
Whiteboard: workaround patch proposed for review → [PDT-]workaround patch proposed for review
Putting on pdt- radar. bobj in room and approved.
Summary: [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail
Caps make it stand out in my list :-)
Depends on: 19251
Currently, Ben Bucksch (mozilla@bucksch.org) is rewriting the class that is
responsible for this autodetection and he will be using Unicode-safe
techniques. When his rewrite is in place, I may have to tweak a call or two,
but we should be ok.

I18N Gurus:
I have a question....if I have a "char *" that I know is ISO-2022-JP, how do I
create the correct nsString?

- rhp
Unicode conversion is needed in order to create nsString from ISO-2022-JP (e.g.
use ConvertToUnicode).
But the conversion should be avoided by doing the autodetection before
converting to ISO-2022-JP.
Can the detection be done earlier (i.e. right after getting the data from the
editor and before calling nsISaveAsCharset)?
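
For comparison, here is what the charset-to-Unicode step looks like outside Mozilla. This Python sketch uses the standard `iso2022_jp` codec as a stand-in for ConvertToUnicode; the byte string is the expected Japanese part quoted earlier in this bug, with the escape bytes written out explicitly.

```python
# ISO-2022-JP bytes, as a C "char *" buffer would hold them.
raw = b"\x1b$B@8$-$k$3$H\x1b(B"

# Decoding is the explicit charset-to-Unicode conversion step.
text = raw.decode("iso2022_jp")
codepoints = " ".join(f"{ord(c):04X}" for c in text)
print(text)
print(codepoints)

# Converting back to the mail charset reproduces the original bytes.
roundtrip = text.encode("iso2022_jp")
print(roundtrip == raw)
```

Any "@" detection done on `text` sees only real Unicode characters, which is why doing the scan before the final conversion avoids the extra decode entirely.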
Blocks: 20203
Blocks: 20870
Can someone post a message that would cause this problem?

Thanks!

- rhp
I was digging into this a little further and it looks like we may have fixed
this after landing the new URL detection code by BenB. I need to take out the
fix that Naoki came up with and then this can be retested.

- rhp

PS: ignore my last post about the test message.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Ok, no guarantees here, but I think this one might be fixed. The latest changes
for the URL detection code deals with comparisons using nsStrings instead of
char * based and the limited testing I could do seems to have this fixed
(maybe :-)

I just checked in the fix for this so you won't be able to test it until a new
build with these changes is done from the tip. Please let me know if we still
have problems.

- rhp
I looked at the code and I think it is missing the conversion. The input of
ScanHTML is unicode but it is generated without charset conversion.
The data from the editor is unicode so ScanHTML can be applied before we convert
to mail charset. I will investigate this more.
IQA, could you attach reproducible data?
I just uploaded a test case msg which includes both the problem
inducing characters and a mailtourl candidate.

In M11, we avoided the problem by turning off mailtourl creation for
Japanese mail. The current solution should be able to handle 
the above mail and at the same time create a mailtourl.

** Checked with 1/24/00 Win32 build **

The above build does create a mailtourl correctly and at the
same time does not mangle the problem characters (those containing
0x40 in iso-2022-jp).

Marking the fix verified.
Status: RESOLVED → VERIFIED
No longer blocks: 20203
No longer blocks: 20870
Product: MailNews → Core
Product: Core → MailNews Core