Closed Bug 209526 Opened 21 years ago Closed 21 years ago

[mozTXTToHTMLConv] creates incorrect links before full width space

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bmo, Assigned: bmo)

Details

Attachments

(2 files, 7 obsolete files)

Test text to show linkify behaviour 21 years ago Brodie 2.64 KB, text/plain		Details
proposed patch 21 years ago Brodie 1.89 KB, patch		Details \| Diff \| Splinter Review
expanded tests (view/send as UTF-8) 21 years ago Brodie 3.08 KB, text/plain		Details
updated patch 21 years ago Brodie 2.57 KB, patch		Details \| Diff \| Splinter Review
expanded test cases (view/send as UTF-8) 21 years ago Brodie 16.20 KB, text/plain		Details
patch to the netwerk component 21 years ago Brodie 6.62 KB, patch		Details \| Diff \| Splinter Review
patch to the mime component 21 years ago Brodie 3.22 KB, patch		Details \| Diff \| Splinter Review
simplied network patch 21 years ago Brodie 1.11 KB, patch	BenB : review+ mscott : superreview+	Details \| Diff \| Splinter Review
updated patch for mailnews/mime/src 21 years ago Brodie 7.02 KB, patch		Details \| Diff \| Splinter Review

Brodie

Assignee

Description

•

21 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030615
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030615

When viewing an email message in Japanese (or any other language that has
non-ascii whitespace like full width spaces) which has some links in it where
the link is followed by a non-ascii space, the link will span over the space. 

e.g.
...japanese text.... http://link.com/foo/[FULL-WIDTH-SPACE]...japanese text...

This will be turned into a link like:
...japanese text.... <a href="http://link.com/foo/[FULL-WIDTH-SPACE]...japanese
text...">http://link.com/foo/[FULL-WIDTH-SPACE]...japanese text...</a>

It should be:
...japanese text.... <a
href="http://link.com/foo/">http://link.com/foo/</a>[FULL-WIDTH-SPACE]...japanese
text...

This problem is caused by the text to html converter only checking for ascii
spaces. When dealing with Japanese text it also needs to consider the full width
space as a delimiter.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

Brodie

Assignee

Comment 1

•

21 years ago

Attached file Test text to show linkify behaviour (obsolete) — Details

utf-8 text

Brodie

Assignee

Comment 2

•

21 years ago

Attached patch proposed patch (obsolete) — Details — Splinter Review

This patch removes the special casing of email links only to make it invalid to
have anything except for ascii in a URL. This means that international IRI
won't be linkified, but the rest of the 99.9% of the URLs on the web will be. 

Is any valid URLs with other than ASCII characters which are not IRI?

Brodie

Assignee

Comment 3

•

21 years ago

To use the test text, paste it into an email and send it to yourself. Then view
the email. The linkify code will be run over it. You will notice that all of the
links in the english text will be correctly linkified. The links in the Japanese
text are not. The proposed patch fixes this.

Changing the affected platform to All/All since the code change is not Windows
specific it must affect all platforms.

OS: Windows 2000 → All

Hardware: PC → All

Brodie

Assignee

Updated

•

21 years ago

Attachment #132441 - Flags: review?

Brodie

Assignee

Comment 4

•

21 years ago

Attached file expanded tests (view/send as UTF-8) (obsolete) — Details

expanded the test cases with tests from bug 201369 and from inside the code.

Attachment #132440 - Attachment is obsolete: true

Brodie

Assignee

Comment 5

•

21 years ago

Attached patch updated patch (obsolete) — Details — Splinter Review

Fixed the patch so that I don't remove the comments which were added in bug
201369 but just update them to be relevant to the patched code.

Attachment #132441 - Attachment is obsolete: true

Brodie

Assignee

Updated

•

21 years ago

Attachment #132441 - Flags: review?

Brodie

Assignee

Comment 6

•

21 years ago

Sorry for spam. I'm having a great conversation with myself here.

Adding Ben Bucksch to the CC list since he is initial developer of the code and
reviewed patches on similar bug. Ben, would you please review the patch I have
submitted.

I realize that IRI links will not be linkified with this change, but as I
understand they aren't in use yet, and nor are they linkified correctly by the
current code anyhow.

Ben Bucksch (:BenB)

Comment 7

•

21 years ago

There is an existing bug about this, IIRC, but I don't find it. Confirming anyway.

Thanks for the patch. This:

+// Also recognize the Japanese ideographic space 0x3000 as a space.
 static inline PRBool IsSpace(const PRUnichar aChar)
 {
-  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0);
+  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0 || aChar == 0x3000);
 }

is fine with me, r=BenB. I am not sure about the rest.

Assignee: sspitzer → brofield

Summary: linkify creates incorrect links when link is followed by full width space → [mozTXToHTMLConv] creates incorrect links before full width space

Brodie

Assignee

Comment 8

•

21 years ago

Actually without the other bits it doesn't fix the cases where the link is 
embedded in the Japanese (Chinese/Arabic/etc) text without spaces separating 
it. At least in Japanese emails I see this a lot.

If you can provide any testcases for where the entire current patch is a 
problem, then I can create a new patch to account for it. Otherwise I would 
like to see the whole patch as is checked in.

Ben Bucksch (:BenB)

Comment 9

•

21 years ago

> Actually without the other bits it doesn't fix the cases where the link is 
> embedded in the Japanese (Chinese/Arabic/etc) text without spaces separating 
> it.

This problem is different from the one stated in the initial
description/summary, it is a known problem and the/your fix is much more
dangerous. It will definitly have side effects for all users, not just japanese
ones. It should be considered in a separate bug.

> If you can provide any testcases

I don't know, what about the testcase in bug 220487?

Status: UNCONFIRMED → NEW

Ever confirmed: true

Ben Bucksch (:BenB)

Updated

•

21 years ago

Attachment #132452 - Attachment description: expanded tests → expanded tests (view/send as UTF-8)

Brodie

Assignee

Comment 10

•

21 years ago

Attached file expanded test cases (view/send as UTF-8) — Details

Expanded testcases for these problems

Brodie

Assignee

Updated

•

21 years ago

Attachment #132452 - Attachment is obsolete: true

Brodie

Assignee

Comment 11

•

21 years ago

Attached patch patch to the netwerk component (obsolete) — Details — Splinter Review

Patch for the netwerk/streamconv/converters/mozTXTtoHTMLConv.cpp side of the
problem.

Attachment #132453 - Attachment is obsolete: true

Brodie

Assignee

Comment 12

•

21 years ago

Attached patch patch to the mime component (obsolete) — Details — Splinter Review

Patch to the mime component where the mozTXTtoHTMLConv is called

Brodie

Assignee

Comment 13

•

21 years ago

While investigating this bug more I found that although PRUnichar string is
passed to ScanTXT from the mime component, this string is really only promoted
UTF-8 (UTF-8 cast to PRUnichar). This means that unless we get real unicode to
ScanTXT we can't check for any characters that are multi-byte in UTF-8 (e.g.
ideographic space 0x3000). Therefore, I have included a patch to the mime
component which converts the UTF-8 string to Unicode (UTF-16) before sending it
through to ScanTXT. 

I've also created a new patch for the mozTXTtoHTMLConv component which fixes the
(space) and (punct) testcases as listed in the expanded textcase document. This
is still work in progress. Ben, would you provide me some comments please.

1. Do you forsee any problems with converting the string to actual unicode in
mime rather than using promoted UTF-8? 

2. I tried to extract the logic of the punctuation checking in FindURLStart and
FindURLEnd as best as I could. I'm not sure about the results though. It is a
little confusing.

Ben Bucksch (:BenB)

Comment 14

•

21 years ago

> this string is really only promoted UTF-8 (UTF-8 cast to PRUnichar)

whoops.

So, the previous patch didn't work (because 0x3000 wouldn't be seen as that)?

> 1. Do you forsee any problems with converting the string to actual unicode in
> mime rather than using promoted UTF-8? 

POssible, but I have no idea which ones. I'd like to get review from an i18n
expert, preferably nhotta, but I guess they're gone by now :-(.

> I've also created a new patch for the mozTXTtoHTMLConv component which
> fixes the (space) and (punct) testcases as listed in the expanded textcase
> document. This is still work in progress. Ben, would you provide me
> some comments please.

I am not sure about this one.

It makes the one code part shorter on the one hand, but on the other hand adds a
layer of indirection and makes it much harder to read (esp. with all the start
and end and sentence punctaion rolled into one), if you care about what exactly
counts as punctation when.
To me, these "full width" punctations look like bloat - should we add the
punctation and language rules of every language in the world?

The recognizer was intended to help the majority of users most of the time, not
all of them all of the time.

> 2. I tried to extract the logic of the punctuation checking in FindURLStart
> and FindURLEnd as best as I could.

For "freetext" mode:
- Some characaters end the URL at the start,
- some end it at the end,
- some are stripped, if found at the end, but don't end the URL,
  if they are in the middle.
- the URL scheme (e.g. "http:") has special requirements.

The different modes (e.g. enclosed in <> or not) have different rules, too.

Brodie

Assignee

Comment 15

•

21 years ago

Sorry about the time for reply. Work got busy and am now on vacation. 

Re: the original patch. It won't work without patching the mime component to 
pass real unicode through to your bits. 

How about if we just deal with the UTF-8 promotion in the mime component (so 
that the linkify component gets real unicode), and the wide space in this bug. 
(i.e. the original scope of this bug). We can then debate other changes in 
another bug. 

If we limited the scope back to the space and unicode, then the mime component 
patch is unchanged, and the network component would just be the 2 line change 
of:

+// Also recognize the Japanese ideographic space 0x3000 as a space.
 static inline PRBool IsSpace(const PRUnichar aChar)
 {
-  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0);
+  return (nsCRT::IsAsciiSpace(aChar) || aChar == 0xA0 || aChar == 0x3000);

What do you think?

Ben Bucksch (:BenB)

Comment 16

•

21 years ago

Yes, makes sense to me.

Brodie

Assignee

Comment 17

•

21 years ago

Attached patch simplied network patch — Details — Splinter Review

This is the simplified patch for the network component. It only updates the
mozTXTToHTMLConv.cpp file to recognize the ideographic japanese space.

Brodie

Assignee

Updated

•

21 years ago

Attachment #134305 - Attachment is obsolete: true

Ben Bucksch (:BenB)

Comment 18

•

21 years ago

Comment on attachment 136710 [details] [diff] [review]
simplied network patch

r=BenB

Attachment #136710 - Flags: review+

Brodie

Assignee

Comment 19

•

21 years ago

Attached patch updated patch for mailnews/mime/src (obsolete) — Details — Splinter Review

This is the updated patch for the mozilla/mailnews/src/mimetp(la|fl).cpp files.
The change ensures that the text passed through to the txt to html converter is
true unicode. I couldn't find the options in Mozilla to legitimately exercise
the second path of mimetpfl.cpp:
       // If nsMimeMessageSaveAs, the string is in mail charset (and stateful,
e.g. ISO-2022-JP).

But I manually placed the IP there during debugging and stepped through it.

Brodie

Assignee

Updated

•

21 years ago

Attachment #134306 - Attachment is obsolete: true

Brodie

Assignee

Comment 20

•

21 years ago

The mime part of this patch is now no longer required as bug 228543 has fixed
this. Ben, would you please check in the network patch for me (I don't have CVS
access).

Brodie

Assignee

Comment 21

•

21 years ago

Comment on attachment 136765 [details] [diff] [review]
updated patch for mailnews/mime/src

obsoleted by bug 228543

Attachment #136765 - Attachment is obsolete: true

Ben Bucksch (:BenB)

Comment 22

•

21 years ago

Comment on attachment 136710 [details] [diff] [review]
simplied network patch

I think it still needs sr, even though it's trivial.

Attachment #136710 - Flags: superreview?(darin)

Brodie

Assignee

Comment 23

•

21 years ago

Comment on attachment 136710 [details] [diff] [review]
simplied network patch

Changing SR request for this 1 line simple patch to mscott since you SR'd the
Japanese IDN patch which had similarities to this.

Attachment #136710 - Flags: superreview?(darin) → superreview?(mscott)

Scott MacGregor

Updated

•

21 years ago

Attachment #136710 - Flags: superreview?(mscott) → superreview+

Ben Bucksch (:BenB)

Comment 24

•

21 years ago

Fix checked in.

Thanks a lot, Brodie, for the report, patch, cooperation and patience.

Status: NEW → RESOLVED

Closed: 21 years ago

Resolution: --- → FIXED

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: Browser → Seamonkey

Ben Bucksch (:BenB)

Updated

•

15 years ago

Summary: [mozTXToHTMLConv] creates incorrect links before full width space → [mozTXTToHTMLConv] creates incorrect links before full width space

You need to log in before you can comment on or make changes to this bug.