Closed Bug 18410 Opened 26 years ago Closed 26 years ago

[DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail

Categories

(MailNews Core :: Composition, defect, P3)

Product:

Component:

Platform:

x86

Other

Type:

defect

Priority:

P3

Severity:

critical

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

M13

People

(Reporter: momoi, Assigned: rhp)

References

Details

(Whiteboard: [PDT-]workaround patch proposed for review)

Attachments

(2 files)

patch to enable URL scanning only for Latin1 and us-ascii. 26 years ago, nhottanscp, 1017 bytes, patch
A test file containing problem characters along with mailtourl candidate. 25 years ago, Katsuhiko Momoi, 872 bytes, text/plain

Katsuhiko Momoi

Reporter

Description

•

26 years ago

** Observed with 11/9/99 Win32 build (1999110911) **

In Mozilla, we should be able to send an HTML message in which the mail charset is Japanese but the message contains Latin 1 characters. All we need to do is turn these Latin 1 8-bit characters into HTML entities. Unfortunately, we form such a message body very badly, and the resulting message cannot be displayed under 4.7 or Mozilla.

Here's how to reproduce such a message:

1. Bring up the HTML Mail Compose window.
2. Input the following text using a JPN IME. (View the text under the Japanese (Auto-Detect) encoding.)
これは日本語のテキストです。
3. Now switch the keyboard to EN (this works under Win98) and input:
This is an accented word: bete.
4. Set the Mozilla messenger encoding to Japanese (ISO-2022-JP) and send out the mail.
5. Receive this mail and observe that the display is badly mangled because the message was composed badly.

Katsuhiko Momoi

Reporter

Comment 1

•

26 years ago

I could not input the Latin 1 accented word correctly under 4.7 and so here is the sentence you can use for Latin 1: This is an accented word: béte.

Katsuhiko Momoi

Reporter

Updated

•

26 years ago

QA Contact: lchiang → momoi

Katsuhiko Momoi

Reporter

Comment 2

•

26 years ago

I need to make a correction to the test string to use for Japanese. It now seems that the problem is due to some Japanese characters containing what amounts to "@" as part of their raw bytes. Here's the string which works to reproduce this problem:

日本語と西欧語のアクセント。

The source of the mangled message looks like this:

----
Content-Type: text/html; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit

<html><head></head>
<body>
<div type="_moz"><A HREF="mailto:$BF|K\8l$H@">$BF|K\8l$H@</A>>2$8l$N%"%/%;%s%H!#(B</div><div type="_moz">This is an accented word: bête.</div></body>
</html>
-------

Note how the "@" formed the basis for creating a mailtourl structure.

Updated

•

26 years ago

Assignee: ducarroz → nhotta

Comment 3

•

26 years ago

This is likely to be a problem with the entity converter. Reassigning to nhotta.

Updated

•

26 years ago

Status: NEW → ASSIGNED

Target Milestone: M12

Katsuhiko Momoi

Reporter

Updated

•

26 years ago

Summary: Mixed JPN and Latin 1 text body is badly composed in HTML mail when sent → JPN text body with certain byte combinations is badly composed in HTML mail

Katsuhiko Momoi

Reporter

Comment 4

•

26 years ago

This turns out to have nothing to do with entities. I first observed this problem with data which included HTML entities, but the problem also occurs when you have Japanese data only. The problem has to do with certain byte sequences which cannot be handled well as the HTML text is formed. In the above example,

日本語と西欧語のアクセント。

the problem seems to be caused by the character 西, whose JIS codepoint is 0x40 0x3E ("@>"). I suspect the problem is 0x3E, which somehow causes the HTML parser to generate a "mailtourl". Here's a much simplified example. Include the following 2 characters (actually the 2nd character is not directly relevant) in your HTML mail body:

西国

and you'll see the same problem of bogus mailtourl formation.

Katsuhiko Momoi

Reporter

Comment 5

•

26 years ago

I guess it is rather "@" which causes this problem. Try the following in HTML body text:

生きること

This one gets pulled into the bogus mailtourl formation without any corruption. It is still wrong nonetheless. 生 is 0x40 0x38.

Katsuhiko Momoi

Reporter

Updated

•

26 years ago

Summary: JPN text body with certain byte combinations is badly composed in HTML mail → JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail

Katsuhiko Momoi

Reporter

Updated

•

26 years ago

Severity: major → critical

Summary: JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail

Katsuhiko Momoi

Reporter

Comment 6

•

26 years ago

We should fix this problem because the characters which cause this problem are too common and too many. Here's my current guess: Looking at its ISO-2022-JP values: First byte: 0x90 Second byte: 0x9F - 0xFC The problem is corruption of input data and it probably will be triggered by 94 characters in a very common range. I don't believe we should let M11 out without fixing it.

Katsuhiko Momoi

Reporter

Comment 7

•

26 years ago

Correction: The above hex values are in the Shift_JIS encoding. In the JPN mail encoding, ISO-2022-JP, the byte ranges are as follows:

1st byte: 0x40
2nd byte: 0x21 - 0x94

The problem in a nutshell is that the 1st byte value 0x40 triggers bogus HTML mailto URL formation.

Katsuhiko Momoi

Reporter

Comment 8

•

26 years ago

The 2nd byte JIS value should be corrected to: 2nd byte: 0x21 - 0x7E
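The byte ranges in the comments above can be checked with a short sketch. It uses only Python's built-in `iso2022_jp` codec, not any Mozilla code; the helper names `jis_bytes` and `chars_with_lead_byte` are made up for this illustration.

```python
# Illustration only: Python's standard codecs, no Mozilla code involved.
ESC_JIS = b"\x1b$B"    # escape sequence that switches to JIS X 0208
ESC_ASCII = b"\x1b(B"  # escape sequence that switches back to ASCII


def jis_bytes(ch: str) -> bytes:
    """Return the two JIS X 0208 bytes of one kanji, without the escapes."""
    encoded = ch.encode("iso2022_jp")
    assert encoded.startswith(ESC_JIS) and encoded.endswith(ESC_ASCII)
    return encoded[len(ESC_JIS):-len(ESC_ASCII)]


def chars_with_lead_byte(lead: int) -> list[str]:
    """Decode every second byte 0x21-0x7E under the given first byte."""
    chars = []
    for second in range(0x21, 0x7F):
        raw = ESC_JIS + bytes([lead, second]) + ESC_ASCII
        chars.append(raw.decode("iso2022_jp"))
    return chars


row = chars_with_lead_byte(0x40)
print(jis_bytes("生"), jis_bytes("西"))  # both start with b"@" (0x40)
print(len(row), "characters have 0x40 as their first byte")
```

Every character in `row` puts a literal "@" byte into the 7-bit message body, which is why a byte-oriented scanner can mistake ordinary Japanese text for part of a mail address.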

Updated

•

26 years ago

Target Milestone: M12 → M11

Comment 9

•

26 years ago

Putting on M11 radar to get fixed on the branch.

Comment 10

•

26 years ago

Investigating, no solution so far. The problem could be in messenger, editor, or the charset converter. If it is the editor's problem, it should be reproducible in the HTML editor's save. If it is the converter's problem, it should happen in plain text by putting the Japanese text together with the HTML tags.

Comment 11

•

26 years ago

The text was converted incorrectly by the ISO-2022-JP encoder when the input was Japanese with HTML tags. Editor save works because it uses the converter to convert only the text without HTML tags. Adding cata to cc. It would help to isolate the problem by checking whether this is a regression from M10. Here is reproducible data (where JAPANESE is to be replaced by the data from momoi's comment of 11/09/99 22:42):

<html><head><title>ikiru</title></head><body>
<div type="_moz">JAPANESE</div></body>
</html>

Katsuhiko Momoi

Reporter

Comment 12

•

26 years ago

The problem also exists with 10/6/99 Win32 M10 release build. (Note: There was a crashing bug for IME input (without first inputting ASCII) at M10 and extensive input test was not practical at that time.)

Updated

•

26 years ago

Assignee: nhotta → cata

Status: ASSIGNED → NEW

Comment 13

•

26 years ago

I talked to cata, he is going to take a look.

Katsuhiko Momoi

Reporter

Comment 14

•

26 years ago

Having said that, I think we should try to fix this for M11. The characters which trigger this problem are fairly common and many in number. Also, once 0x40 (@) is encountered, the data after it gets pulled into mailtourl formation until a break is encountered. Since Japanese does not have a space break, this means pretty much all the data thereafter is corrupted and not even recognizable as Japanese. It normally displays as corrupted data under 4.x, but under 5.0 the body simply displays as blank. This is a bad problem and needs to be fixed ASAP regardless of when the bug was introduced.
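The failure mode described here can be imitated with a small byte-level sketch. The function `naive_linkify` below is a hypothetical stand-in written for this illustration, not the real ScanHTMLForURLs() code; it assumes only that the scanner treats any run of non-space, non-angle-bracket bytes containing "@" as a mail address.

```python
import re

# Hypothetical byte-oriented scanner, NOT the real ScanHTMLForURLs():
# a run of bytes up to and around '@', delimited by whitespace, '<' or '>'.
TOKEN_WITH_AT = re.compile(rb"[^\s<>]+@[^\s<>]*")


def naive_linkify(body: bytes) -> bytes:
    """Wrap every '@'-containing byte run in a mailto anchor."""
    def wrap(match: "re.Match[bytes]") -> bytes:
        token = match.group(0)
        return b'<A HREF="mailto:' + token + b'">' + token + b"</A>"
    return TOKEN_WITH_AT.sub(wrap, body)


# The test string from comment 2, already converted to the mail charset.
body = "日本語と西欧語のアクセント。".encode("iso2022_jp")
mangled = naive_linkify(body)
print(mangled)
```

Because 西 encodes as the bytes "@>", the sketch closes the anchor right after "@" and leaves the stray ">" behind it, much like the mangled source quoted in comment 2.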

Comment 15

•

26 years ago

Here is the expected result for the example I posted before. In the actual result, however, the '<' before "/div>" is missing.

<html><head></head>
<body>
<div type="_moz">$B$-$k$3$H(B</div></body>
</html>

Comment 16

•

26 years ago

There is a correction to my last example (I pasted data from the broken data). Here is the correct expected result for the Japanese part (generated by 4.6):

$B@8$-$k$3$H(B

The Japanese part of the data in Unicode is 751F 304D 308B 3053 3068.

Updated

•

26 years ago

Assignee: cata → nhotta

Comment 17

•

26 years ago

Tested with the provided data and the conversion is ok. So, the problem is somewhere else. Reassigning back.

Updated

•

26 years ago

Assignee: nhotta → rhp

Comment 18

•

26 years ago

With my local build updated this morning, I can no longer see the corrupted data from the converter. The problem happens after we convert from Unicode. Before we send, ScanHTMLForURLs() is called, and that does not work for ISO-2022-JP (or any charset whose bytes overlap with special characters like '<', '&', or '@'). That operation needs to be done on the Unicode data before we convert to the mail charset. I don't think there is an easy fix for this since the function is not written for Unicode. We should make it PRUnichar*-based or use UTF-8. I propose to disable this feature for M11 and make the change in M12. Reassigning to rhp.
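The ordering proposed here (scan the Unicode text first, convert to the mail charset afterwards) can be sketched as follows. The `ADDRESS` pattern and the function `linkify_then_encode` are simplified, hypothetical names for this illustration and are not taken from the Mozilla source.

```python
import re

# Hypothetical, deliberately simple ASCII-only address pattern.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def linkify_then_encode(text: str, charset: str) -> bytes:
    """Link addresses in the Unicode text, then convert to the mail charset."""
    linked = ADDRESS.sub(
        lambda m: f'<A HREF="mailto:{m.group(0)}">{m.group(0)}</A>', text
    )
    return linked.encode(charset)


out = linkify_then_encode("生きること user@example.com", "iso2022_jp")
print(out)
```

At the Unicode level 生 is a single character with no "@" in it, so only the real address is linked; the "@" byte appears in the Japanese text only after the scan has finished.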

Comment 19

•

26 years ago

Attached patch: patch to enable URL scanning only for Latin1 and us-ascii.

Comment 20

•

26 years ago

I attached a patch which enables URL scanning only for ISO-8859-1 and us-ascii. I tested it: ISO-8859-1 does generate a mailto URL and ISO-2022-JP bypasses the scan, so the problem described in this bug no longer happens. We should change the scanner to be Unicode-based for M12. Rich, could you take a look at the diff and check it in to the branch if it looks fine?
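The shape of this workaround can be sketched as a charset allowlist in front of the scanner. The attached patch is not reproduced in this report, so the names `SCANNABLE_CHARSETS` and `should_scan_for_urls` are assumptions made for illustration only.

```python
# Hypothetical gate: only charsets known to be safe for a byte-level
# scan are let through; every other mail charset skips URL scanning.
SCANNABLE_CHARSETS = {"iso-8859-1", "us-ascii"}


def should_scan_for_urls(charset: str) -> bool:
    """Return True only for charsets on the allowlist (case-insensitive)."""
    return charset.strip().lower() in SCANNABLE_CHARSETS


for name in ("ISO-8859-1", "us-ascii", "ISO-2022-JP"):
    print(name, should_scan_for_urls(name))
```

The trade-off is the one noted later in the thread: Japanese mail is no longer mangled, but genuine addresses in it are not linked either.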

Updated

•

26 years ago

Whiteboard: workaround patch proposed for review

Assignee

Updated

•

26 years ago

Status: NEW → ASSIGNED

Target Milestone: M11 → M13

Assignee

Comment 21

•

26 years ago

I checked in the fix Naoki provided for the M11 branch and I'm moving this until later for a better fix. - rhp

Updated

•

26 years ago

Whiteboard: workaround patch proposed for review → [PDT-]workaround patch proposed for review

Comment 22

•

26 years ago

Putting on pdt- radar. bobj in room and approved.

Assignee

Updated

•

26 years ago

Summary: [Dogfood] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail → [DOGFOOD] JPN text body with certain byte combinations is turned into bogus mailtourl in HTML mail

Assignee

Comment 23

•

26 years ago

Caps make it stand out in my list :-)

Ben Bucksch (:BenB)

Updated

•

26 years ago

Depends on: 19251

Assignee

Comment 24

•

26 years ago

Currently, Ben Bucksch (mozilla@bucksch.org) is rewriting the class that is responsible for this autodetection, and he will be using Unicode-safe techniques. When his rewrite is in place, I may have to tweak a call or two, but we should be ok. I18N Gurus: I have a question... if I have a "char *" that I know is ISO-2022-JP, how do I create the correct nsString? - rhp

Comment 25

•

26 years ago

Unicode conversion is needed in order to create nsString from ISO-2022-JP (e.g. use ConvertToUnicode). But the conversion should be avoided by doing the autodetection before converting to ISO-2022-JP. Can the detection be done earlier (i.e. right after getting the data from the editor and before calling nsISaveAsCharset)?
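Why a real charset conversion is needed, rather than widening each byte to a 16-bit unit, can be shown with a small sketch. It uses Python's codecs as an analogy for the conversion step and does not call any Mozilla API; the Latin-1 decode stands in for byte-for-byte widening.

```python
# ISO-2022-JP bytes for 生きること, as quoted in comment 16 (with escapes).
raw = b"\x1b$B@8$-$k$3$H\x1b(B"

# Byte-for-byte widening: each byte becomes one character, '@' survives.
widened = raw.decode("latin-1")

# Proper charset conversion: the bytes are interpreted as JIS X 0208 pairs.
decoded = raw.decode("iso2022_jp")

print("@" in widened, "@" in decoded, decoded)
```

A scanner run on the widened string would still see "@", so it would repeat the original bug in a wider data type; only the decoded text is safe to scan.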

Updated

•

26 years ago

Blocks: 20203

Updated

•

26 years ago

Blocks: 20870

Assignee

Comment 26

•

26 years ago

Can someone post a message that would cause this problem? Thanks! - rhp

Assignee

Comment 27

•

26 years ago

I was digging into this a little further and it looks like we may have fixed this after landing the new URL detection code by BenB. I need to take out the fix that Naoki came up with and then this can be retested. - rhp PS: ignore my last post about the test message.

Assignee

Updated

•

26 years ago

Status: ASSIGNED → RESOLVED

Closed: 26 years ago

Resolution: --- → FIXED

Assignee

Comment 28

•

26 years ago

Ok, no guarantees here, but I think this one might be fixed. The latest changes to the URL detection code do comparisons using nsStrings instead of char *, and the limited testing I could do seems to show this fixed (maybe :-) I just checked in the fix for this, so you won't be able to test it until a new build with these changes is done from the tip. Please let me know if we still have problems. - rhp

Comment 29

•

26 years ago

I looked at the code and I think it is missing the conversion. The input of ScanHTML is Unicode, but it is generated without charset conversion. The data from the editor is Unicode, so ScanHTML can be applied before we convert to the mail charset. I will investigate this more. IQA, could you post reproducible data?

Katsuhiko Momoi

Reporter

Comment 30

•

25 years ago

Attached file: A test file containing problem characters along with mailtourl candidate

Katsuhiko Momoi

Reporter

Comment 31

•

25 years ago

I just uploaded a test case message which includes both the problem-inducing characters and a mailtourl candidate. In M11, we avoided the problem by turning off mailtourl creation for Japanese mail. The current solution should be able to handle the above mail and at the same time create a mailtourl.

** Checked with 1/24/00 Win32 build **

The above build does create a mailtourl correctly and at the same time does not mangle the problem characters (those containing 0x40 in ISO-2022-JP). Marking the fix verified.

Status: RESOLVED → VERIFIED

Updated

•

25 years ago

No longer blocks: 20203

Updated

•

25 years ago

No longer blocks: 20870

Myk Melez [:myk] [@mykmelez]

Updated

•

21 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

17 years ago

Product: Core → MailNews Core
