177820 - URLs with a trailing non-word character don't get the character hyperlinked in bug comments

Dave Miller [:justdave]

Reporter

Description

•

22 years ago

Discovered on bug 176594: URLs with trailing "+" characters don't get the +'s hyperlinked.

Bradley Baetz (:bbaetz)

Comment 1

•

22 years ago

This is a feature :) The rule is:

    $text =~ s~\b(${protocol_re}:  # The protocol:
               [^\s<>\"]+          # Any non-whitespace
               [\w\/])             # so that we end in \w or /

I believe that the idea behind the regexp was to not linkify the . in http://foo.com/. (or a trailing "/'/,/etc.). I guess that we could add \+ to the list, though, and I don't _think_ that that would break anything.

This didn't change in my rewrite, so it's not a regression from that.
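For readers who want to try the rule from comment 1, here is a minimal Python sketch of it. Bugzilla itself is written in Perl, so this is a rendering for experimentation, not the project's code; `PROTOCOL_RE` is an assumed, shortened stand-in for `${protocol_re}`, and `linked_part` is a hypothetical helper name.

```python
import re

# Assumption: a shortened stand-in for Bugzilla's ${protocol_re}.
PROTOCOL_RE = r'(?:https?|ftp)'

# Python rendering of the quoted rule: the protocol and a colon, then one or
# more characters that are not whitespace, <, > or ", ending in \w or /.
URL_RE = re.compile(r'\b(' + PROTOCOL_RE + r':[^\s<>"]+[\w/])')

def linked_part(text):
    """Return the substring the rule would hyperlink, or None if no match."""
    match = URL_RE.search(text)
    return match.group(1) if match else None

# The sentence-ending dot is left out of the link, as intended ...
print(linked_part("See http://foo.com/."))
# ... but so is a trailing "+", which is what this bug is about.
print(linked_part("http://example.com/search?q=a+b+"))
```

The final `[\w/]` is what drops the last character: the greedy middle part backtracks until the match ends in a word character or a slash.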

Dave Miller [:justdave]

Reporter

Comment 2

•

22 years ago

+ is a legal replacement for a space in a query string part of a URL, so it should be included.
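A short Python sketch of why dropping a trailing + can change what a link means. The URLs are made-up examples, and `query_values` is a hypothetical helper built on the standard `urllib.parse` module.

```python
from urllib.parse import parse_qs, urlsplit

def query_values(url):
    """Decode the query string of a URL into a dict of value lists."""
    return parse_qs(urlsplit(url).query)

# Hypothetical URLs: the second is what remains if the trailing "+" is
# left out of the hyperlink.
full = query_values("http://example.com/search?q=C+")
cut = query_values("http://example.com/search?q=C")

print(full)  # the "+" decodes to a trailing space in the value
print(cut)   # the truncated link carries a different value
```

Whether the server treats the two values differently is up to the server, but the link text alone no longer matches what the commenter typed.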

Bradley Baetz (:bbaetz)

Comment 3

•

22 years ago

Well, that's true. But . is a valid character in a URL, as is , (a comma). Anyway, I'd be happy with a patch which added + to the linkify stuff.

GavinS

Comment 4

•

21 years ago

A trailing colon, ':', is also not linked - should it be? I think it would probably break more things than it fixed, but I'm mentioning it anyway. See bug#86553 comment#5 and bug#86553 comment#6

Olav Vitters

Updated

•

19 years ago

QA Contact: mattyt-bugzilla → default-qa

Frédéric Buclin

Updated

•

18 years ago

Assignee: myk → create-and-change

Max Kanat-Alexander

Updated

•

18 years ago

Severity: normal → minor

Frédéric Buclin

Comment 6

•

10 years ago

I'm morphing this bug summary to include all non-word characters in general, not just about "+". Several users are reporting their own bugs to get their favourite trailing characters included in URLs, and we really don't need one bug per character.

Summary: URLs with trailing "+" characters don't get the +'s hyperlinked in bug comments → URLs with a trailing non-word character don't get the character hyperlinked in bug comments

Frédéric Buclin

Updated

•

10 years ago

Comment 10

•

10 years ago

If anything would break if all characters except spaces, less-than characters (i.e., opening angle brackets), and greater-than characters (i.e., closing angle brackets) were included in URL links, then maybe the safer solution is to leave URLs with unusual characters completely unlinked, which would signal a user to copy the whole URL themself.

It's a bug, not a feature, regardless of the original intention. Most people would click on a link (which we encourage by linking) without noticing that part of the URL was omitted from the link, resulting in a page being different or not found, and that can confuse the conversation intended to be held in light of the link.

As a test on my Linux laptop, I saved a text file called "test.pls." (with trailing period but not and it opened in gedit, thus not as a *.pls file, which would have opened in Rhythmbox instead. Since Linux allowed a trailing period and treated the extension as different because of it, I assume if I exposed the laptop to the Internet as a Web host the file's URL would have a trailing period.

I've seen URLs with a variety of unusual characters and we shouldn't be judging their owners, but leave that judgment to hosts and domain name servers.

Nick Levinson

Comment 11

•

10 years ago

I accidentally saved this page before finishing editing my last comment and apparently I can't edit my comment. The phrase "(with trailing period but not" was supposed to be "(with trailing period but not quotation marks)".

Markus Keller

Comment 12

•

10 years ago

The best solution is to avoid any magic and just adhere to <https://tools.ietf.org/html/rfc3986#appendix-C>. Double-quotes, angle brackets, whitespace, and other characters that are invalid in URLs should delimit URLs. Valid URL characters should be considered part of the URL no matter where they appear.
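A rough Python sketch of the approach proposed in comment 12: treat every character that RFC 3986 allows in a URI as part of the URL and stop at anything else. The scheme list and the names `URI_CHARS`, `STRICT_RE` and `find_urls` are assumptions made for illustration; the character class only checks which characters appear, not whether they appear in a valid position.

```python
import re

# Characters RFC 3986 allows somewhere in a URI: unreserved, gen-delims,
# sub-delims, and "%" for percent-encoding.
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# Assumption: a shortened list of schemes.
SCHEME = r"(?:https?|ftp)"

STRICT_RE = re.compile(r"\b" + SCHEME + r":[" + URI_CHARS + r"]+")

def find_urls(text):
    """Return every run of URI characters that starts with a known scheme."""
    return STRICT_RE.findall(text)

# Angle brackets and double quotes delimit the URL; "+" and "." are kept.
print(find_urls('see <http://example.com/a+> and "http://example.com/b." now'))
# A sentence-ending dot is kept as well, since "." is a valid URI character.
print(find_urls("Go to http://example.com/."))
```

The second call shows the trade-off: without delimiters around the URL, the sentence-ending dot becomes part of the link.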

Frédéric Buclin

Comment 13

•

10 years ago

(In reply to Markus Keller from comment #12)
> Valid URL characters should be considered part of the URL no matter where they appear.

I disagree. It's very common to include a URL in a sentence and have the URL followed by a dot. In that case, you don't want the dot to be part of the URL. So at least a trailing dot shouldn't be included in the URL.
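One way to read the position in comment 13, sketched in Python under stated assumptions: accept every non-delimiter character, then treat a single trailing dot as sentence punctuation. `CANDIDATE_RE`, `split_trailing_dot` and `links_in` are hypothetical names, and this is not what Bugzilla does today.

```python
import re

# Assumption: candidates start with a known scheme and run up to whitespace,
# an angle bracket, or a double quote.
CANDIDATE_RE = re.compile(r'\b(?:https?|ftp):[^\s<>"]+')

def split_trailing_dot(candidate):
    """Split a candidate into (link target, trailing punctuation)."""
    if candidate.endswith("."):
        return candidate[:-1], "."
    return candidate, ""

def links_in(text):
    """Return the link targets found in text, minus one trailing dot each."""
    return [split_trailing_dot(m.group(0))[0] for m in CANDIDATE_RE.finditer(text)]

print(links_in("Docs are at http://example.com/page. See also http://example.com/q?a=b+"))
```

Under this reading a trailing + stays in the link while one sentence-ending dot does not; a URL that really ends in a dot would still be cut short.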

Nick Levinson

Comment 14

•

10 years ago

For the trailing dot, that's not a good solution either, because a URL can itself end in a dot, because a filename extension can end in a dot and because by a little-used official convention a top-level domain (TLD) may be followed by a dot (at least if nothing else in the URL follows that dot). Therefore, if the exact beginning and end of a URL cannot both be unambiguously determined, the putative URL should not be linked at all. And, even if the two ends are certain, if linking it would break LibreOffice in any way, it should not be linked, so the user will know to select what they want to copy for a link.

A line-breaking hyphen within a URL is another problem. As the above-linked RFC appendix (comment 12) says, such a hyphen might or might not belong in the unbroken URL itself. It's annoying when I get an email that has a URL broken over two lines but only the pre-break part is linked. Because the hyphen is an ambiguous character, no part of a line-broken URL should be linked.
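The "do not link when in doubt" policy from comment 14 could look roughly like this in Python. The set `AMBIGUOUS_TRAILERS` is an assumption chosen for illustration (the thread does not fix such a list), and `link_or_leave` is a hypothetical name.

```python
import re

# Assumption: candidates start with a known scheme and run up to whitespace,
# an angle bracket, or a double quote.
CANDIDATE_RE = re.compile(r'\b(?:https?|ftp):[^\s<>"]+')

# Assumption: trailing characters that could be either part of the URL or
# punctuation (a hyphen at the end may be a line-break hyphen).
AMBIGUOUS_TRAILERS = ".,:;!?'-"

def link_or_leave(text):
    """Return (candidate, should_link) pairs; ambiguous ends are not linked."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = match.group(0)
        results.append((candidate, candidate[-1] not in AMBIGUOUS_TRAILERS))
    return results

print(link_or_leave("See http://example.com/test.pls. and http://example.com/ok"))
print(link_or_leave("http://example.com/long-\npath"))
```

The cost of this policy is visible in the first call: an ordinary URL at the end of a sentence is left unlinked as well.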

Frédéric Buclin

Comment 15

•

10 years ago

LibreOffice has nothing to do with Bugzilla, so this part of the discussion is irrelevant. And the cases where a dot is at the very end of a URL are far too rare to justify a rule that would break all the valid cases where a dot marks the end of the sentence, not the end of the URL.

Nick Levinson

Comment 16

•

10 years ago

On LibreOffice, I meant Bugzilla. (Sorry. LibreOffice has a similar problem, which is why I discovered Bugzilla's problem.) This thread refers to breaking Bugzilla in two contexts, and that's what I was referring to.

If a TLD has a trailing dot followed by whitespace, a line break, a paragraph end, or some such, then the link can omit the dot without affecting where the link goes, but that may not be the case with a URL that does not end in a TLD.

Trailing-dotted URLs probably are rare, but how rare I can't tell, since I don't know how to do a Google search for that kind of URL to generate a reasonable statistic. It may be more frequent in other nations; here are URLs to illustrate the point, apparently not happening with google.com (I assume Bugzilla will display the entire URLs I've typed here, regardless of whether wholly or partly linked or not):

https://www.google.fr/?gws_rd=ssl#q=Mars.
https://www.google.co.jp/?gws_rd=ssl#q=Wednesday.
https://www.google.com.hk/?gws_rd=ssl#q=Saturn.

With or without the trailing dot, they produce the same search results in Google, but that's determinable only after the linking is correctly or incorrectly done.

I prefer exactitude, and my guess is that Bugzilla attracts more demanding participants. Not linking in ambiguous cases would probably be less misleading than mislinking would be.

Erik Quaeghebeur

Comment 17

•

8 years ago

(In reply to Markus Keller from comment #12)
> The best solution is to avoid any magic and just adhere to
> <https://tools.ietf.org/html/rfc3986#appendix-C>.
>
> Double-quotes, angle brackets, whitespace, and other characters that are
> invalid in URLs should delimit URLs. Valid URL characters should be
> considered part of the URL no matter where they appear.

I think this should be the basis for the regexp that is used. (I agree trailing dots are a thorny issue that perhaps deserves its own bug entirely, but that shouldn't hold back correction/improvement of the current regexp, which should be simple enough given the guidance from the RFCs.)

(In reply to Bradley Baetz (:bbaetz) from comment #1)
> The rule is:
>
> $text =~ s~\b(${protocol_re}: # The protocol:
> [^\s<>\"]+ # Any non-whitespace
> [\w\/]) # so that we end in \w or /

Currently, this regexp is way too accepting: it neither breaks at unsafe characters such as [ and ] nor declines to link URLs that contain them. See https://bugs.kde.org/show_bug.cgi?id=363427#c1 for an example of the issues this causes. I'll try it here as well: [http://www.example.org/][whatever].

Interesting in this context: https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid

(OK, disregard the ‘should be simple enough’ above.)
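The bracket problem from comment 17 can be reproduced with a Python rendering of the quoted rule. This is a sketch for experimentation: the scheme list is an assumption, and `NO_BRACKETS_RE` is a hypothetical variant, not a proposed patch.

```python
import re

# Python rendering of the rule quoted from comment 1 (assumed scheme list).
CURRENT_RE = re.compile(r'\b((?:https?|ftp):[^\s<>"]+[\w/])')

# Hypothetical variant that also treats [ and ] as delimiters.
NO_BRACKETS_RE = re.compile(r'\b((?:https?|ftp):[^\s<>"\[\]]+[\w/])')

def first_link(pattern, text):
    """Return the first substring the pattern would hyperlink, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

text = "[http://www.example.org/][whatever]."
print(first_link(CURRENT_RE, text))      # swallows "][whatever"
print(first_link(NO_BRACKETS_RE, text))  # stops at the closing bracket

# Caveat: brackets are legal around IPv6 literals in the host part, so the
# simple variant mangles such URLs.
print(first_link(NO_BRACKETS_RE, "http://[::1]/"))
```

The last call is why simply adding brackets to the excluded set is not enough: where a character appears in the URL matters, which a single character class cannot express.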

Dylan Hardison [:dylan] (he/him)

Comment 18

•

8 years ago

Hey Erik! I see you're from the KDE bugzilla and I've looked a little bit at the comment there. The current regexp is defined by this code: https://github.com/bugzilla/bugzilla/blob/master/Bugzilla/Template.pm#L47-L51 I'd be happy to see this bug to resolution if you have definite suggestions for how it should be made more strict.

Erik Quaeghebeur

Comment 19

•

8 years ago

(In reply to Dylan Hardison [:dylan] from comment #18)
> The current regexp is defined by this code:
>
> https://github.com/bugzilla/bugzilla/blob/master/Bugzilla/Template.pm#L47-L51
>
> I'd be happy to see this bug to resolution if you have definite suggestions
> for how it should be made more strict.

To resolve this bug, one thing will be to decide what to do with periods and commas at the end of URI-candidates. This is a thorny issue: practical considerations would suggest letting the last such character *before white space* be a delimiter, but a strict interpretation of the specs would lead one to treat it as belonging to the URI-candidate.

Less thorny, but more tedious, will be what to do for the rest. If you haven't already done so, I suggest reading <https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid>, as it is enlightening. My favorite comment there as a basis for deciding what to do is <http://stackoverflow.com/a/36667242/671672>.

Given the Bugzilla context, I'd think the following conceptual steps need to be taken in the URI-candidate detection:

1. First identify strings for further investigation, i.e., those delimited by ($safe_protocols): and [\s<>\"] (brackets used in regex meaning, as in the code you pointed to); everything in between is the remainder.
2. Parse the remainder according to the specific rules for that scheme (I guess you could limit yourself to http(s) and mailto for starters as this is going to be the most tedious part) to see whether it is invalid or some tail part should be excluded (for example in [http://www.example.org/][whatever] the tail ][whatever] should be removed).
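The two steps from comment 19 can be sketched in Python as follows. This is a partial illustration under stated assumptions: it handles http(s) only (mailto is left out), it does not validate the authority part, and it keeps a trailing period, since the comment leaves that question open. All names (`SAFE_PROTOCOLS`, `STEP1_RE`, `TAIL_OK`, `trim_http_remainder`, `detect`) are hypothetical.

```python
import re

# Assumption: a shortened stand-in for $safe_protocols, http(s) only.
SAFE_PROTOCOLS = r"(?:https?)"

# Step 1: scheme, colon, then the remainder up to whitespace, <, > or ".
STEP1_RE = re.compile(r'\b(' + SAFE_PROTOCOLS + r'):([^\s<>"]+)')

# Characters RFC 3986 allows in the path, query and fragment of a URI.
TAIL_OK = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?#%]*")

def trim_http_remainder(remainder):
    """Step 2 for http(s): keep the authority plus the longest valid tail."""
    if not remainder.startswith("//"):
        return None  # not a hierarchical http(s) URL
    # The authority runs up to the next "/", "?" or "#" (not validated here).
    authority = re.match(r"//[^/?#]*", remainder)
    tail = TAIL_OK.match(remainder, authority.end())
    return remainder[:tail.end()]

def detect(text):
    """Return the URLs that survive both steps."""
    urls = []
    for match in STEP1_RE.finditer(text):
        scheme, remainder = match.group(1), match.group(2)
        kept = trim_http_remainder(remainder)
        if kept is not None:
            urls.append(scheme + ":" + kept)
    return urls

print(detect("[http://www.example.org/][whatever]."))
print(detect("http://example.com/search?q=a+b+ next"))
```

With this split the tail ][whatever] is removed while a trailing + is kept; a real implementation would also need per-scheme rules and a decision on trailing periods and commas.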

BMO Automation

Updated

•

7 months ago

Attachment #9386030 - Attachment is obsolete: true