Closed Bug 209526 Opened 21 years ago Closed 21 years ago

[mozTXTToHTMLConv] creates incorrect links before full width space

Categories

(SeaMonkey :: MailNews: Message Display, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bmo, Assigned: bmo)

Details

Attachments

(2 files, 7 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030615
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030615

When viewing an email message in Japanese (or any other language that has
non-ascii whitespace like full width spaces) which has some links in it where
the link is followed by a non-ascii space, the link will span over the space. 

e.g.
...japanese text.... http://link.com/foo/[FULL-WIDTH-SPACE]...japanese text...

This will be turned into a link like:
...japanese text.... <a href="http://link.com/foo/[FULL-WIDTH-SPACE]...japanese
text...">http://link.com/foo/[FULL-WIDTH-SPACE]...japanese text...</a>

It should be:
...japanese text.... <a
href="http://link.com/foo/">http://link.com/foo/</a>[FULL-WIDTH-SPACE]...japanese
text...

This problem is caused by the text to html converter only checking for ascii
spaces. When dealing with Japanese text it also needs to consider the full width
space as a delimiter.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Attached file Test text to show linkify behaviour (obsolete) —
utf-8 text
Attached patch proposed patch (obsolete) — Splinter Review
This patch removes the special casing of email links only to make it invalid to
have anything except for ascii in a URL. This means that international IRI
won't be linkified, but the rest of the 99.9% of the URLs on the web will be. 

Is any valid URLs with other than ASCII characters which are not IRI?
To use the test text, paste it into an email and send it to yourself. Then view
the email. The linkify code will be run over it. You will notice that all of the
links in the english text will be correctly linkified. The links in the Japanese
text are not. The proposed patch fixes this.

Changing the affected platform to All/All since the code change is not Windows
specific it must affect all platforms.
OS: Windows 2000 → All
Hardware: PC → All
Attachment #132441 - Flags: review?
Attached file expanded tests (view/send as UTF-8) (obsolete) —
expanded the test cases with tests from bug 201369 and from inside the code.
Attachment #132440 - Attachment is obsolete: true
Attached patch updated patch (obsolete) — Splinter Review
Fixed the patch so that I don't remove the comments which were added in bug
201369 but just update them to be relevant to the patched code.
Attachment #132441 - Attachment is obsolete: true
Attachment #132441 - Flags: review?
Sorry for spam. I'm having a great conversation with myself here.

Adding Ben Bucksch to the CC list since he is initial developer of the code and
reviewed patches on similar bug. Ben, would you please review the patch I have
submitted.

I realize that IRI links will not be linkified with this change, but as I
understand they aren't in use yet, and nor are they linkified correctly by the
current code anyhow.
There is an existing bug about this, IIRC, but I don't find it. Confirming anyway.

Thanks for the patch. This:

+// Also recognize the Japanese ideographic space 0x3000 as a space.
 static inline PRBool IsSpace(const PRUnichar aChar)
 {
-  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0);
+  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0 || aChar == 0x3000);
 }

is fine with me, r=BenB. I am not sure about the rest.
Assignee: sspitzer → brofield
Summary: linkify creates incorrect links when link is followed by full width space → [mozTXToHTMLConv] creates incorrect links before full width space
Actually without the other bits it doesn't fix the cases where the link is 
embedded in the Japanese (Chinese/Arabic/etc) text without spaces separating 
it. At least in Japanese emails I see this a lot.

If you can provide any testcases for where the entire current patch is a 
problem, then I can create a new patch to account for it. Otherwise I would 
like to see the whole patch as is checked in.
> Actually without the other bits it doesn't fix the cases where the link is 
> embedded in the Japanese (Chinese/Arabic/etc) text without spaces separating 
> it.

This problem is different from the one stated in the initial
description/summary, it is a known problem and the/your fix is much more
dangerous. It will definitly have side effects for all users, not just japanese
ones. It should be considered in a separate bug.

> If you can provide any testcases

I don't know, what about the testcase in bug 220487?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attachment #132452 - Attachment description: expanded tests → expanded tests (view/send as UTF-8)
Expanded testcases for these problems
Attachment #132452 - Attachment is obsolete: true
Attached patch patch to the netwerk component (obsolete) — Splinter Review
Patch for the netwerk/streamconv/converters/mozTXTtoHTMLConv.cpp side of the
problem.
Attachment #132453 - Attachment is obsolete: true
Attached patch patch to the mime component (obsolete) — Splinter Review
Patch to the mime component where the mozTXTtoHTMLConv is called
While investigating this bug more I found that although PRUnichar string is
passed to ScanTXT from the mime component, this string is really only promoted
UTF-8 (UTF-8 cast to PRUnichar). This means that unless we get real unicode to
ScanTXT we can't check for any characters that are multi-byte in UTF-8 (e.g.
ideographic space 0x3000). Therefore, I have included a patch to the mime
component which converts the UTF-8 string to Unicode (UTF-16) before sending it
through to ScanTXT. 

I've also created a new patch for the mozTXTtoHTMLConv component which fixes the
(space) and (punct) testcases as listed in the expanded textcase document. This
is still work in progress. Ben, would you provide me some comments please.

1. Do you forsee any problems with converting the string to actual unicode in
mime rather than using promoted UTF-8? 

2. I tried to extract the logic of the punctuation checking in FindURLStart and
FindURLEnd as best as I could. I'm not sure about the results though. It is a
little confusing. 
> this string is really only promoted UTF-8 (UTF-8 cast to PRUnichar)

whoops.

So, the previous patch didn't work (because 0x3000 wouldn't be seen as that)?

> 1. Do you forsee any problems with converting the string to actual unicode in
> mime rather than using promoted UTF-8? 

POssible, but I have no idea which ones. I'd like to get review from an i18n
expert, preferably nhotta, but I guess they're gone by now :-(.

> I've also created a new patch for the mozTXTtoHTMLConv component which
> fixes the (space) and (punct) testcases as listed in the expanded textcase
> document. This is still work in progress. Ben, would you provide me
> some comments please.

I am not sure about this one.

It makes the one code part shorter on the one hand, but on the other hand adds a
layer of indirection and makes it much harder to read (esp. with all the start
and end and sentence punctaion rolled into one), if you care about what exactly
counts as punctation when.
To me, these "full width" punctations look like bloat - should we add the
punctation and language rules of every language in the world?

The recognizer was intended to help the majority of users most of the time, not
all of them all of the time.

> 2. I tried to extract the logic of the punctuation checking in FindURLStart
> and FindURLEnd as best as I could.

For "freetext" mode:
- Some characaters end the URL at the start,
- some end it at the end,
- some are stripped, if found at the end, but don't end the URL,
  if they are in the middle.
- the URL scheme (e.g. "http:") has special requirements.

The different modes (e.g. enclosed in <> or not) have different rules, too.
Sorry about the time for reply. Work got busy and am now on vacation. 

Re: the original patch. It won't work without patching the mime component to 
pass real unicode through to your bits. 

How about if we just deal with the UTF-8 promotion in the mime component (so 
that the linkify component gets real unicode), and the wide space in this bug. 
(i.e. the original scope of this bug). We can then debate other changes in 
another bug. 

If we limited the scope back to the space and unicode, then the mime component 
patch is unchanged, and the network component would just be the 2 line change 
of:

+// Also recognize the Japanese ideographic space 0x3000 as a space.
 static inline PRBool IsSpace(const PRUnichar aChar)
 {
-  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0);
+  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0 || aChar == 0x3000);

What do you think?
Yes, makes sense to me.
This is the simplified patch for the network component. It only updates the
mozTXTToHTMLConv.cpp file to recognize the ideographic japanese space.
Attachment #134305 - Attachment is obsolete: true
Comment on attachment 136710 [details] [diff] [review]
simplied network patch

r=BenB
Attachment #136710 - Flags: review+
This is the updated patch for the mozilla/mailnews/src/mimetp(la|fl).cpp files.
The change ensures that the text passed through to the txt to html converter is
true unicode. I couldn't find the options in Mozilla to legitimately exercise
the second path of mimetpfl.cpp:
       // If nsMimeMessageSaveAs, the string is in mail charset (and stateful,
e.g. ISO-2022-JP).

But I manually placed the IP there during debugging and stepped through it.
Attachment #134306 - Attachment is obsolete: true
The mime part of this patch is now no longer required as bug 228543 has fixed
this. Ben, would you please check in the network patch for me (I don't have CVS
access).
Comment on attachment 136765 [details] [diff] [review]
updated patch for mailnews/mime/src

obsoleted by bug 228543
Attachment #136765 - Attachment is obsolete: true
Comment on attachment 136710 [details] [diff] [review]
simplied network patch

I think it still needs sr, even though it's trivial.
Attachment #136710 - Flags: superreview?(darin)
Comment on attachment 136710 [details] [diff] [review]
simplied network patch

Changing SR request for this 1 line simple patch to mscott since you SR'd the
Japanese IDN patch which had similarities to this.
Attachment #136710 - Flags: superreview?(darin) → superreview?(mscott)
Attachment #136710 - Flags: superreview?(mscott) → superreview+
Fix checked in.

Thanks a lot, Brodie, for the report, patch, cooperation and patience.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey
Summary: [mozTXToHTMLConv] creates incorrect links before full width space → [mozTXTToHTMLConv] creates incorrect links before full width space
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: