Open
Bug 954848
Opened 11 years ago
Updated 2 years ago
angle bracket-delimited URL linkified with closing bracket included in link and semi-colon appended to it
Categories
(Chat Core :: General, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: bugzilla, Unassigned)
References
Details
Attachments
(1 obsolete file)
*** Original post on bio 1413 by Myk Melez [:myk] <myk AT mozilla.org> at 2012-04-30 20:07:00 UTC ***
When I delimit a URL with angle brackets in a message, the closing bracket is included in the link Instantbird creates for that URL, and Instantbird appends a semi-colon to the link.
For example, if I enter the message:
Go to <http://example.com/>.
Instantbird turns it into something like:
Go to <a href="http://example.com/>">http://example.com/></a>;.
It should turn it into:
Go to <a href="http://example.com/">http://example.com/</a>.
Comment 1•11 years ago
|
||
*** Original post on bio 1413 at 2012-05-02 10:35:34 UTC ***
I can confirm this behavior.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 2•11 years ago
|
||
*** Original post on bio 1413 at 2012-05-11 21:21:41 UTC ***
So we first scanTXT and you end up with:
Go to <http://example.com/>.
It turns out that you can keep > as part of a URL (it's valid), but the semi-colon is not...so it ends up being broken up and placed outside. This obviously messes up this situation.
It does, however, say that ""<", ">" and "&" must be escaped,".
If they must be escaped...it should realize that?
(For the record, if you don't escape them...it finds nothing.)
[1] http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/public/mozITXTToHTMLConv.idl#79
Reporter | ||
Comment 3•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:55:54 UTC ***
> So we first scanTXT and you end up with:
> Go to <http://example.com/>.
Ah, yes. And that's entirely correct from the perspecptive of ScanTXT(text, kEntities), because that's the correct representation of the input in HTML/XML, if you don't want <a> tags.
Now, from the perspective of ScanHTML(text, kURLs), it must first unescape >/< and then treat it as plaintext to check whether this could be a URL. In FindURLStart() [2] should go into the RFC2396E mode due to the "<", and then in FindURLEnd() search for the ">". It should then make the following HTML:
Expected:
"Go to <a href="http://example.com/"><http://example.com/></a>."
(I hope that Bugzilla's URL recognizer doesn't interfere here.) Note that the <> are intentionally included in the link text, but not in the link target. For details, see [1].
So, what seems to be going wrong is that FindURLStart() doesn't find the "<" and thus doesn't go in RFC2396E mode.
[1] <http://www.bucksch.org/1/projects/mozilla/16507/>
[1] <http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#222>
Reporter | ||
Comment 4•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:57:49 UTC ***
If this is true, then this is a bug in the recognizer.
Either way, I would recommend to call ScanTXT() with all flags (maybe minus smileys), if at all possible. ScanHTML() should skip <a> tags, avoiding the problem.
Reporter | ||
Comment 5•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:58:17 UTC ***
From clokep:
http://lxr.instantbird.org/instantbird/source/instantbird/content/conversation.xml#327
http://lxr.instantbird.org/instantbird/source/chat/content/convbrowser.xml#388
Reporter | ||
Comment 6•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 21:59:08 UTC ***
msg = cs.scanHTML(msg.replace(/&/g, "&"), csFlags)
.replace(/&/g, "&");
What is this replace? Are you sure that one isn't messing things up?
Comment 7•11 years ago
|
||
*** Original post on bio 1413 at 2012-05-11 23:02:42 UTC ***
(In reply to comment #6)
> msg = cs.scanHTML(msg.replace(/&/g, "&"), csFlags)
> .replace(/&/g, "&");
>
> What is this replace? Are you sure that one isn't messing things up?
I suspect I was tricked by https://bugzilla.mozilla.org/show_bug.cgi?id=728093 several years ago and added the replace to workaround something that was a bug of the error console rather than the scanHTML result :-/. And it seems that this pair of replace is actually what causes this bug.
Thanks for looking at this, and sorry this was so stupid!
Comment 8•11 years ago
|
||
*** Original post on bio 1413 as attmnt 1463 at 2012-05-11 23:04:00 UTC was without comment, so any subsequent comment numbers will be shifted ***
Attachment #8353216 -
Flags: review?(clokep)
Reporter | ||
Comment 9•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-11 23:41:10 UTC ***
Florian, thanks for the patch! No problem, just bear with me when I am stupid the text time :). I'm just glad there's no bug in the component. Cheers!
Comment 10•11 years ago
|
||
Comment on attachment 8353216 [details] [diff] [review]
Patch
*** Original change on bio 1413 attmnt 1463 at 2012-05-12 04:05:25 UTC ***
Thanks for fixing this, I did always find those replace calls strange. Do we know why they were added? (I.e. are we regressing something else?)
Attachment #8353216 -
Flags: review?(clokep) → review+
Comment 11•11 years ago
|
||
*** Original post on bio 1413 at 2012-05-12 04:06:50 UTC ***
And BenB: thanks for taking the time to look at this. :)
Comment 12•11 years ago
|
||
*** Original post on bio 1413 at 2012-05-12 10:25:40 UTC ***
(In reply to comment #10)
> Do we know why they were added? (I.e. are we regressing something else?)
I assumed (comment 7) that the reason was that I originally tested the scanHTML call in the error console. If you evaluate "foo&bar" in the error console, it will print "foo&bar". This is covered by https://bugzilla.mozilla.org/show_bug.cgi?id=728093
There was already a strange .replace call on the result of scanHTML 4 years ago: https://hg.instantbird.org/instantbird/annotate/756bab83a3e8/instantbird/base/content/instantbird/conversation.xml#l163
The replace in its current form appeared with https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 which is unfortunately not linked to a bug number :-/.
After further testing, it seems what this code was working around is this behavior:
var result = Components.classes["@mozilla.org/txttohtmlconv;1"].getService(Components.interfaces.mozITXTToHTMLConv).scanHTML(input, 2);
input: "toto'titi"
result: "toto&apos;titi"
input: "totoàtiti"
result: "toto&agrave;titi"
Originally we were avoiding this for the apos and quot entities. After https://hg.instantbird.org/instantbird/rev/cf3ba5a040b7 we were doing it for all HTML entities.
http://hg.mozilla.org/mozilla-central/rev/71042c8b4bf4 has changed the behavior for " so now instead of being converted to &quot; it's converted to "
Ben, do you know if the behavior these replace calls was working around is the expected behavior for scanHTML?
Reporter | ||
Comment 13•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-12 14:00:12 UTC ***
> ScanHTML()
> input: "toto'titi"
> result: "toto&apos;titi"
> input: "totoàtiti"
> result: "toto&agrave;titi"
That is just dumb and not expected for ScanHTML(), no.
(It would be expected for ScanTXT(), though.)
The problem is that ScanHTML() doesn't use a full HTML parser. Its unescape function doesn't know about the other entities.
http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#1303
http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#143
Feel free to fix UnescapeStr(). Alternatively, you could unescape these entities (everything apart from <>&", which must be escaped) in your code and replace them with Unicode characters.
Comment 14•11 years ago
|
||
Comment on attachment 8353216 [details] [diff] [review]
Patch
*** Original change on bio 1413 attmnt 1463 at 2012-05-14 10:03:27 UTC ***
Marking this patch obsolete, as it would fix this bug by regressing something else (explained in comment 12).
Attachment #8353216 -
Attachment is obsolete: true
Reporter | ||
Comment 15•11 years ago
|
||
*** Original post on bio 1413 by Ben Bucksch <linux.news AT bucksch.org> at 2012-05-14 10:25:41 UTC ***
Florian, the patch is nevertheless necessary, as it introduces double-encoding, and it goes wrong in other ways. It's necessary to remove entities, not to add them.
Comment 17•9 years ago
|
||
This is in shared code now around http://mxr.mozilla.org/comm-central/source/chat/content/convbrowser.xml#499
Component: Conversation → General
Product: Instantbird → Chat Core
Comment 18•7 years ago
|
||
This happens frequently now because of Matrix messages like this:
> 11:18:21 AM * johndoe [10:43:01] sent a long message: johndoe_2018-03-09_09:43:00.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/adasd123k1j3k213k>;
In the above, the latter `>` character is part of the URL.
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•