Open Bug 954848 Opened 11 years ago Updated 2 years ago

angle bracket-delimited URL linkified with closing bracket included in link and semi-colon appended to it

Categories

(Chat Core :: General, defect)

All
Other
defect

Tracking

(Not tracked)

People

(Reporter: bugzilla, Unassigned)

References

Details

Attachments

(1 obsolete file)

*** Original post on bio 1413 by Myk Melez [:myk] <myk AT mozilla.org> at 2012-04-30 20:07:00 UTC *** When I delimit a URL with angle brackets in a message, the closing bracket is included in the link Instantbird creates for that URL, and Instantbird appends a semi-colon to the link. For example, if I enter the message: Go to <http://example.com/>. Instantbird turns it into something like: Go to <a href="http://example.com/>">http://example.com/></a>;. It should turn it into: Go to <a href="http://example.com/">http://example.com/</a>.
*** Original post on bio 1413 at 2012-05-02 10:35:34 UTC *** I can confirm this behavior.
Status: UNCONFIRMED → NEW
Ever confirmed: true
*** Original post on bio 1413 at 2012-05-11 21:21:41 UTC *** So we first scanTXT and you end up with: Go to &lt;http://example.com/&gt;. It turns out that you can keep &gt as part of a URL (it's valid), but the semi-colon is not...so it ends up being broken up and placed outside. This obviously messes up this situation. It does, however, say that ""<", ">" and "&" must be escaped,". If they must be escaped...it should realize that? (For the record, if you don't escape them...it finds nothing.) [1] http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/public/mozITXTToHTMLConv.idl#79
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:55:54 UTC *** > So we first scanTXT and you end up with: > Go to &lt;http://example.com/&gt;. Ah, yes. And that's entirely correct from the perspecptive of ScanTXT(text, kEntities), because that's the correct representation of the input in HTML/XML, if you don't want <a> tags. Now, from the perspective of ScanHTML(text, kURLs), it must first unescape &gt;/&lt; and then treat it as plaintext to check whether this could be a URL. In FindURLStart() [2] should go into the RFC2396E mode due to the "<", and then in FindURLEnd() search for the ">". It should then make the following HTML: Expected: "Go to <a href="http://example.com/">&lt;http://example.com/&gt;</a>." (I hope that Bugzilla's URL recognizer doesn't interfere here.) Note that the <> are intentionally included in the link text, but not in the link target. For details, see [1]. So, what seems to be going wrong is that FindURLStart() doesn't find the "<" and thus doesn't go in RFC2396E mode. [1] <http://www.bucksch.org/1/projects/mozilla/16507/> [1] <http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#222>
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:57:49 UTC *** If this is true, then this is a bug in the recognizer. Either way, I would recommend to call ScanTXT() with all flags (maybe minus smileys), if at all possible. ScanHTML() should skip <a> tags, avoiding the problem.
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:58:17 UTC *** From clokep: http://lxr.instantbird.org/instantbird/source/instantbird/content/conversation.xml#327 http://lxr.instantbird.org/instantbird/source/chat/content/convbrowser.xml#388
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:59:08 UTC *** msg = cs.scanHTML(msg.replace(/&/g, "&amp;"), csFlags) .replace(/&amp;/g, "&"); What is this replace? Are you sure that one isn't messing things up?
*** Original post on bio 1413 at 2012-05-11 23:02:42 UTC *** (In reply to comment #6) > msg = cs.scanHTML(msg.replace(/&/g, "&amp;"), csFlags) > .replace(/&amp;/g, "&"); > > What is this replace? Are you sure that one isn't messing things up? I suspect I was tricked by https://bugzilla.mozilla.org/show_bug.cgi?id=728093 several years ago and added the replace to workaround something that was a bug of the error console rather than the scanHTML result :-/. And it seems that this pair of replace is actually what causes this bug. Thanks for looking at this, and sorry this was so stupid!
Attached patch Patch (obsolete) — Splinter Review
*** Original post on bio 1413 as attmnt 1463 at 2012-05-11 23:04:00 UTC was without comment, so any subsequent comment numbers will be shifted ***
Attachment #8353216 - Flags: review?(clokep)
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 23:41:10 UTC *** Florian, thanks for the patch! No problem, just bear with me when I am stupid the text time :). I'm just glad there's no bug in the component. Cheers!
Comment on attachment 8353216 [details] [diff] [review] Patch *** Original change on bio 1413 attmnt 1463 at 2012-05-12 04:05:25 UTC *** Thanks for fixing this, I did always find those replace calls strange. Do we know why they were added? (I.e. are we regressing something else?)
Attachment #8353216 - Flags: review?(clokep) → review+
*** Original post on bio 1413 at 2012-05-12 04:06:50 UTC *** And BenB: thanks for taking the time to look at this. :)
*** Original post on bio 1413 at 2012-05-12 10:25:40 UTC *** (In reply to comment #10) > Do we know why they were added? (I.e. are we regressing something else?) I assumed (comment 7) that the reason was that I originally tested the scanHTML call in the error console. If you evaluate "foo&amp;bar" in the error console, it will print "foo&bar". This is covered by https://bugzilla.mozilla.org/show_bug.cgi?id=728093 There was already a strange .replace call on the result of scanHTML 4 years ago: https://hg.instantbird.org/instantbird/annotate/756bab83a3e8/instantbird/base/content/instantbird/conversation.xml#l163 The replace in its current form appeared with https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 which is unfortunately not linked to a bug number :-/. After further testing, it seems what this code was working around is this behavior: var result = Components.classes["@mozilla.org/txttohtmlconv;1"].getService(Components.interfaces.mozITXTToHTMLConv).scanHTML(input, 2); input: "toto&apos;titi" result: "toto&amp;apos;titi" input: "toto&agrave;titi" result: "toto&amp;agrave;titi" Originally we were avoiding this for the apos and quot entities. After https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 we were doing it for all HTML entities. http://hg.mozilla.org/mozilla-central/rev/71042c8b4bf4 has changed the behavior for &quot; so now instead of being converted to &amp;quot; it's converted to " Ben, do you know if the behavior these replace calls was working around is the expected behavior for scanHTML?
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-12 14:00:12 UTC *** > ScanHTML() > input: "toto&apos;titi" > result: "toto&amp;apos;titi" > input: "toto&agrave;titi" > result: "toto&amp;agrave;titi" That is just dumb and not expected for ScanHTML(), no. (It would be expected for ScanTXT(), though.) The problem is that ScanHTML() doesn't use a full HTML parser. Its unescape function doesn't know about the other entities. http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#1303 http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#143 Feel free to fix UnescapeStr(). Alternatively, you could unescape these entities (everything apart from <>&", which must be escaped) in your code and replace them with Unicode characters.
Comment on attachment 8353216 [details] [diff] [review] Patch *** Original change on bio 1413 attmnt 1463 at 2012-05-14 10:03:27 UTC *** Marking this patch obsolete, as it would fix this bug by regressing something else (explained in comment 12).
Attachment #8353216 - Attachment is obsolete: true
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-14 10:25:41 UTC *** Florian, the patch is nevertheless necessary, as it introduces double-encoding, and it goes wrong in other ways. It's necessary to remove entities, not to add them.
Component: Conversation → General
Product: Instantbird → Chat Core
This happens frequently now because of Matrix messages like this: > 11:18:21 AM * johndoe [10:43:01] sent a long message: johndoe_2018-03-09_09:43:00.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/adasd123k1j3k213k>; In the above, the latter `>` character is part of the URL.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: