954848 - angle bracket-delimited URL linkified with closing bracket included in link and semi-colon appended to it

Reporter

Description

•

11 years ago

*** Original post on bio 1413 by Myk Melez [:myk] <myk AT mozilla.org> at 2012-04-30 20:07:00 UTC *** When I delimit a URL with angle brackets in a message, the closing bracket is included in the link Instantbird creates for that URL, and Instantbird appends a semi-colon to the link. For example, if I enter the message: Go to <http://example.com/>. Instantbird turns it into something like: Go to <a href="http://example.com/>">http://example.com/></a>;. It should turn it into: Go to <a href="http://example.com/">http://example.com/</a>.

Patrick Cloke [:clokep]

Comment 1

•

11 years ago

*** Original post on bio 1413 at 2012-05-02 10:35:34 UTC *** I can confirm this behavior.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Patrick Cloke [:clokep]

Comment 2

•

11 years ago

*** Original post on bio 1413 at 2012-05-11 21:21:41 UTC *** So we first scanTXT and you end up with: Go to <http://example.com/>. It turns out that you can keep &gt as part of a URL (it's valid), but the semi-colon is not...so it ends up being broken up and placed outside. This obviously messes up this situation. It does, however, say that ""<", ">" and "&" must be escaped,". If they must be escaped...it should realize that? (For the record, if you don't escape them...it finds nothing.) [1] http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/public/mozITXTToHTMLConv.idl#79

Instantbot

Reporter

Comment 3

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:55:54 UTC *** > So we first scanTXT and you end up with: > Go to <http://example.com/>. Ah, yes. And that's entirely correct from the perspecptive of ScanTXT(text, kEntities), because that's the correct representation of the input in HTML/XML, if you don't want <a> tags. Now, from the perspective of ScanHTML(text, kURLs), it must first unescape >/< and then treat it as plaintext to check whether this could be a URL. In FindURLStart() [2] should go into the RFC2396E mode due to the "<", and then in FindURLEnd() search for the ">". It should then make the following HTML: Expected: "Go to <a href="http://example.com/"><http://example.com/></a>." (I hope that Bugzilla's URL recognizer doesn't interfere here.) Note that the <> are intentionally included in the link text, but not in the link target. For details, see [1]. So, what seems to be going wrong is that FindURLStart() doesn't find the "<" and thus doesn't go in RFC2396E mode. [1] <http://www.bucksch.org/1/projects/mozilla/16507/> [1] <http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#222>

Instantbot

Reporter

Comment 4

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:57:49 UTC *** If this is true, then this is a bug in the recognizer. Either way, I would recommend to call ScanTXT() with all flags (maybe minus smileys), if at all possible. ScanHTML() should skip <a> tags, avoiding the problem.

Instantbot

Reporter

Comment 5

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:58:17 UTC *** From clokep: http://lxr.instantbird.org/instantbird/source/instantbird/content/conversation.xml#327 http://lxr.instantbird.org/instantbird/source/chat/content/convbrowser.xml#388

Instantbot

Reporter

Comment 6

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:59:08 UTC *** msg = cs.scanHTML(msg.replace(/&/g, "&"), csFlags) .replace(/&/g, "&"); What is this replace? Are you sure that one isn't messing things up?

Florian Quèze [:florian]

Comment 7

•

11 years ago

*** Original post on bio 1413 at 2012-05-11 23:02:42 UTC *** (In reply to comment #6) > msg = cs.scanHTML(msg.replace(/&/g, "&"), csFlags) > .replace(/&/g, "&"); > > What is this replace? Are you sure that one isn't messing things up? I suspect I was tricked by https://bugzilla.mozilla.org/show_bug.cgi?id=728093 several years ago and added the replace to workaround something that was a bug of the error console rather than the scanHTML result :-/. And it seems that this pair of replace is actually what causes this bug. Thanks for looking at this, and sorry this was so stupid!

Florian Quèze [:florian]

Comment 8

•

11 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

*** Original post on bio 1413 as attmnt 1463 at 2012-05-11 23:04:00 UTC was without comment, so any subsequent comment numbers will be shifted ***

Attachment #8353216 - Flags: review?(clokep)

Instantbot

Reporter

Comment 9

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 23:41:10 UTC *** Florian, thanks for the patch! No problem, just bear with me when I am stupid the text time :). I'm just glad there's no bug in the component. Cheers!

Patrick Cloke [:clokep]

Comment 10

•

11 years ago

Comment on attachment 8353216 [details] [diff] [review] Patch *** Original change on bio 1413 attmnt 1463 at 2012-05-12 04:05:25 UTC *** Thanks for fixing this, I did always find those replace calls strange. Do we know why they were added? (I.e. are we regressing something else?)

Attachment #8353216 - Flags: review?(clokep) → review+

Patrick Cloke [:clokep]

Comment 11

•

11 years ago

*** Original post on bio 1413 at 2012-05-12 04:06:50 UTC *** And BenB: thanks for taking the time to look at this. :)

Florian Quèze [:florian]

Comment 12

•

11 years ago

*** Original post on bio 1413 at 2012-05-12 10:25:40 UTC *** (In reply to comment #10) > Do we know why they were added? (I.e. are we regressing something else?) I assumed (comment 7) that the reason was that I originally tested the scanHTML call in the error console. If you evaluate "foo&bar" in the error console, it will print "foo&bar". This is covered by https://bugzilla.mozilla.org/show_bug.cgi?id=728093 There was already a strange .replace call on the result of scanHTML 4 years ago: https://hg.instantbird.org/instantbird/annotate/756bab83a3e8/instantbird/base/content/instantbird/conversation.xml#l163 The replace in its current form appeared with https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 which is unfortunately not linked to a bug number :-/. After further testing, it seems what this code was working around is this behavior: var result = Components.classes["@mozilla.org/txttohtmlconv;1"].getService(Components.interfaces.mozITXTToHTMLConv).scanHTML(input, 2); input: "toto'titi" result: "toto&apos;titi" input: "totoàtiti" result: "toto&agrave;titi" Originally we were avoiding this for the apos and quot entities. After https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 we were doing it for all HTML entities. http://hg.mozilla.org/mozilla-central/rev/71042c8b4bf4 has changed the behavior for " so now instead of being converted to &quot; it's converted to " Ben, do you know if the behavior these replace calls was working around is the expected behavior for scanHTML?

Instantbot

Reporter

Comment 13

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-12 14:00:12 UTC *** > ScanHTML() > input: "toto'titi" > result: "toto&apos;titi" > input: "totoàtiti" > result: "toto&agrave;titi" That is just dumb and not expected for ScanHTML(), no. (It would be expected for ScanTXT(), though.) The problem is that ScanHTML() doesn't use a full HTML parser. Its unescape function doesn't know about the other entities. http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#1303 http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#143 Feel free to fix UnescapeStr(). Alternatively, you could unescape these entities (everything apart from <>&", which must be escaped) in your code and replace them with Unicode characters.

Florian Quèze [:florian]

Comment 14

•

11 years ago

Comment on attachment 8353216 [details] [diff] [review] Patch *** Original change on bio 1413 attmnt 1463 at 2012-05-14 10:03:27 UTC *** Marking this patch obsolete, as it would fix this bug by regressing something else (explained in comment 12).

Attachment #8353216 - Attachment is obsolete: true

Instantbot

Reporter

Comment 15

•

11 years ago

*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-14 10:25:41 UTC *** Florian, the patch is nevertheless necessary, as it introduces double-encoding, and it goes wrong in other ways. It's necessary to remove entities, not to add them.

Patrick Cloke [:clokep]

Comment 17

•

9 years ago

This is in shared code now around http://mxr.mozilla.org/comm-central/source/chat/content/convbrowser.xml#499

Component: Conversation → General

Product: Instantbird → Chat Core

Benjamin Bach

Comment 18

•

7 years ago

This happens frequently now because of Matrix messages like this: > 11:18:21 AM * johndoe [10:43:01] sent a long message: johndoe_2018-03-09_09:43:00.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/adasd123k1j3k213k>; In the above, the latter `>` character is part of the URL.

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Bugzilla

angle bracket-delimited URL linkified with closing bracket included in link and semi-colon appended to it

Categories

(Chat Core :: General, defect)

Tracking

(Not tracked)

People

(Reporter: bugzilla, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 obsolete file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 17

Comment 18

Updated

Attachment

General

Description

File Name

Content Type