URLs with a trailing non-word character don't get the character hyperlinked in bug comments

NEW
Unassigned

Bugzilla
Creating/Changing Bugs
--
minor
16 years ago
a year ago

People

(Reporter: justdave, Unassigned)

Discovered on bug 176594:

URLs with trailing "+" characters don't get the +'s hyperlinked.
This is a feature :)

The rule is:

    $text =~ s~\b(${protocol_re}:  # The protocol:
                  [^\s<>\"]+       # Any non-whitespace
                  [\w\/])          # so that we end in \w or /

I believe that the idea behind the regexp was to not linkify the . in
http://foo.com/. (or a trailing ", ', comma, etc.). I guess that we could add \+ to the
list, though, and I don't _think_ that that would break anything.

This didn't change in my rewrite, so it's not a regression from that.
+ is a legal replacement for a space in a query string part of a URL, so it
should be included.
Well, that's true. But . is a valid character in a URL, as is ,

Anyway, I'd be happy with a patch which added + to the linkify stuff.
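
To illustrate the change discussed above, here is a minimal sketch that renders the quoted Perl rule in Python and compares it with a variant that also accepts a trailing "+". The names PROTOCOL_RE, CURRENT_RULE, PLUS_RULE, and linked_part are hypothetical, and PROTOCOL_RE is only an assumed stand-in for Bugzilla's ${protocol_re}; this is not the actual patch.

```python
import re

# Assumption: a small stand-in for Bugzilla's ${protocol_re}.
PROTOCOL_RE = r"(?:https?|ftp)"

# Python rendering of the rule quoted above: the match must end in \w or /.
CURRENT_RULE = re.compile(r'\b(' + PROTOCOL_RE + r':[^\s<>"]+[\w/])')

# Hypothetical variant: "+" is also allowed as the final character.
PLUS_RULE = re.compile(r'\b(' + PROTOCOL_RE + r':[^\s<>"]+[\w/+])')


def linked_part(rule, text):
    """Return the substring the rule would turn into a link, or None."""
    match = rule.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "see http://example.org/search?q=a+b+ for details"
    print(linked_part(CURRENT_RULE, sample))
    print(linked_part(PLUS_RULE, sample))
```

With the current rule the final "+" of the sample is left out of the link; with the variant it is included, while the "." in http://foo.com/. is still excluded by both.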

Comment 4

14 years ago
A trailing colon, ':', is also not linked - should it be? I think it would
probably break more things than it fixed, but I'm mentioning it anyway.

See bug#86553 comment#5 and bug#86553 comment#6

Updated

12 years ago
QA Contact: mattyt-bugzilla → default-qa

Updated

12 years ago
Assignee: myk → create-and-change

Updated

11 years ago
Severity: normal → minor

Updated

7 years ago
Duplicate of this bug: 657310

Comment 6

3 years ago
I'm morphing this bug summary to include all non-word characters in general, not just about "+". Several users are reporting their own bugs to get their favourite trailing characters included in URLs, and we really don't need one bug per character.
Summary: URLs with trailing "+" characters don't get the +'s hyperlinked in bug comments → URLs with a trailing non-word character don't get the character hyperlinked in bug comments

Updated

3 years ago
Duplicate of this bug: 1133819

Updated

3 years ago
Duplicate of this bug: 1163344

Updated

3 years ago
Duplicate of this bug: 670998

Updated

3 years ago
See Also: → bug 438798

Comment 10

3 years ago
If anything would break if all characters except spaces, less-than characters (i.e., opening angle brackets), and greater-than characters (i.e., closing angle brackets) were included in URL links, then maybe the safer solution is to leave URLs with unusual characters completely unlinked, which would signal a user to copy the whole URL themself.

It's a bug, not a feature, regardless of the original intention. Most people would click on a link (which we encourage by linking) without noticing that part of the URL was omitted from the link, resulting in a page being different or not found, and that can confuse the conversation intended to be held in light of the link. As a test on my Linux laptop, I saved a text file called "test.pls." (with trailing period but not and it opened in gedit, thus not as a *.pls file, which would have opened in Rhythmbox instead. Since Linux allowed a trailing period and treated the extension as different because of it, I assume if I exposed the laptop to the Internet as a Web host the file's URL would have a trailing period. I've seen URLs with a variety of unusual characters and we shouldn't be judging their owners, but leave that judgment to hosts and domain name servers.

Comment 11

3 years ago
I accidentally saved this page before finishing editing my last comment and apparently I can't edit my comment. The phrase "(with trailing period but not" was supposed to be "(with trailing period but not quotation marks)".

Comment 12

3 years ago
The best solution is to avoid any magic and just adhere to <https://tools.ietf.org/html/rfc3986#appendix-C>.

Double-quotes, angle brackets, whitespace, and other characters that are invalid in URLs should delimit URLs. Valid URL characters should be considered part of the URL no matter where they appear.
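
A rough sketch of what this proposal could look like, assuming Python and a hypothetical scheme list: every character that RFC 3986 allows in a URI (unreserved, reserved, and "%") is taken as part of the URL, and anything else delimits it. The names URI_CHARS, SCHEME_RE, STRICT_URL, and find_urls are made up for illustration.

```python
import re

# Characters RFC 3986 allows in a URI: unreserved, reserved (gen-delims and
# sub-delims), and "%" for percent-encoding.
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# Assumption: only these schemes are of interest.
SCHEME_RE = r"(?:https?|ftp)"

# Every run of valid URI characters after the scheme belongs to the URL;
# anything else (whitespace, <, >, ", ...) ends it.
STRICT_URL = re.compile(r"\b" + SCHEME_RE + r":[" + URI_CHARS + r"]+")


def find_urls(text):
    """Return every URL found in text under the strict delimiting rule."""
    return STRICT_URL.findall(text)


if __name__ == "__main__":
    print(find_urls('See <http://example.org/a+b+> and http://foo.com/.'))
```

Under this rule a trailing "+" is linked, but so is a sentence-ending "." unless the URL is wrapped in angle brackets or quotes.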

Comment 13

3 years ago
(In reply to Markus Keller from comment #12)
> Valid URL characters should be considered part of the URL no matter where they appear.

I disagree. It's very common to include a URL in a sentence and have the URL followed by a dot. In that case, you don't want the dot to be part of the URL. So at least a trailing dot shouldn't be included in the URL.

Comment 14

3 years ago
For the trailing dot, that's not a good solution either, because a URL can itself end in a dot, because a filename extension can end in a dot and because by a little-used official convention a top-level domain (TLD) may be followed by a dot (at least if nothing else in the URL follows that dot). Therefore, if the exact beginning and end of a URL cannot both be unambiguously determined, the putative URL should not be linked at all. And, even if the two ends are certain, if linking it would break LibreOffice in any way, it should not be linked, so the user will know to select what they want to copy for a link.

A line-breaking hyphen within a URL is another problem. As the above-linked RFC appendix (comment 12) says, such a hyphen might or might not belong in the unbroken URL itself. It's annoying when I get an email that has a URL broken over two lines but only the pre-break part is linked. Because the hyphen is an ambiguous character, no part of a line-broken URL should be linked.

Comment 15

3 years ago
LibreOffice has nothing to do with Bugzilla, so this part of the discussion is irrelevant. And a dot at the very end of a URL is far too rare to justify a rule that would break all the valid cases where the dot marks the end of the sentence, not the end of the URL.

Comment 16

3 years ago
On LibreOffice, I meant Bugzilla. (Sorry. LibreOffice has a similar problem, which is why I discovered Bugzilla's problem.) This thread refers to breaking Bugzilla in two contexts, and that's what I was referring to.

If a TLD has a trailing dot followed by whitespace, a line break, a paragraph end, or some such, then the link can omit the dot without affecting where the link goes, but that may not be the case with a URL that does not end in a TLD.

Trailing-dotted URLs probably are rare, but how rare I can't tell, since I don't know how to do a Google search for that kind of URL to generate a reasonable statistic. It may be more frequent in other nations; here are URLs to illustrate the point, apparently not happening with google.com (I assume Bugzilla will display the entire URLs I've typed here, regardless of whether wholly or partly linked or not):
https://www.google.fr/?gws_rd=ssl#q=Mars.
https://www.google.co.jp/?gws_rd=ssl#q=Wednesday.
https://www.google.com.hk/?gws_rd=ssl#q=Saturn.
With or without the trailing dot, they produce the same search results in Google, but that's determinable only after the linking is correctly or incorrectly done.

I prefer exactitude, and my guess is that Bugzilla attracts more demanding participants. Not linking in ambivalent cases would probably be less misleading than mislinking would be.

Comment 17

a year ago
(In reply to Markus Keller from comment #12)
> The best solution is to avoid any magic and just adhere to
> <https://tools.ietf.org/html/rfc3986#appendix-C>.
> 
> Double-quotes, angle brackets, whitespace, and other characters that are
> invalid in URLs should delimit URLs. Valid URL characters should be
> considered part of the URL no matter where they appear.

I think this should be the basis for the regexp that is used. (I agree trailing dots are a thorny issue that perhaps deserves its entire own bug, but that shouldn't hold back correction/improvement of the current regexp, which should be simple enough given the guidance from the RFCs.)

(In reply to Bradley Baetz (:bbaetz) from comment #1)
> 
> The rule is:
> 
>     $text =~ s~\b(${protocol_re}:  # The protocol:
>                   [^\s<>\"]+       # Any non-whitespace
>                   [\w\/])          # so that we end in \w or /

Currently, this regexp is way too accepting: it neither breaks at unsafe characters such as [ and ] nor declines to link URLs that contain them. See https://bugs.kde.org/show_bug.cgi?id=363427#c1 for an example of the issues this causes. I'll try it here as well: [http://www.example.org/][whatever].
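
The bracket case can be reproduced with a small sketch, assuming a Python rendering of the quoted rule and a hypothetical stand-in for ${protocol_re}. The names PROTOCOL_RE, RULE, and linked_part are made up for illustration.

```python
import re

# Assumption: a small stand-in for Bugzilla's ${protocol_re}.
PROTOCOL_RE = r"(?:https?|ftp)"

# Python rendering of the quoted rule: anything except whitespace, <, > and "
# is accepted, as long as the match ends in \w or /.
RULE = re.compile(r'\b(' + PROTOCOL_RE + r':[^\s<>"]+[\w/])')


def linked_part(text):
    """Return the substring the rule would turn into a link, or None."""
    match = RULE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(linked_part("[http://www.example.org/][whatever]."))
```

Because "]" and "[" are accepted in the middle of the match and "r" is a word character, the link extends to http://www.example.org/][whatever.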

Interesting in this context: https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid (Ok, disregard the ‘should be simple enough’ above).

Comment 18

a year ago
Hey Erik! I see you're from the KDE bugzilla and I've looked a little bit at the comment there.

The current regexp is defined by this code:

https://github.com/bugzilla/bugzilla/blob/master/Bugzilla/Template.pm#L47-L51

I'd be happy to see this bug to resolution if you have definite suggestions for how it should be made more strict.

Comment 19

a year ago
(In reply to Dylan Hardison [:dylan] from comment #18)
> The current regexp is defined by this code:
> 
> https://github.com/bugzilla/bugzilla/blob/master/Bugzilla/Template.pm#L47-L51
> 
> I'd be happy to see this bug to resolution if you have definite suggestions
> for how it should be made more strict.

To resolve this bug, one thing will be to decide what to do with periods and commas at the end of URI-candidates. This is a thorny issue: practical considerations would suggest treating the last such character *before white space* as a delimiter, but a strict interpretation of the specs would lead one to treat it as belonging to the URI-candidate.
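
A minimal sketch of the practical option, assuming Python and a hypothetical scheme list: a single trailing period or comma is dropped from the link only when white space follows it. Treating the end of the text like white space is an extra assumption, and the names SCHEME_RE, CANDIDATE, and link_targets are made up for illustration.

```python
import re

# Assumption: only these schemes are of interest.
SCHEME_RE = r"(?:https?|ftp)"

# A URI-candidate runs up to the next whitespace, <, > or ".
CANDIDATE = re.compile(r'\b' + SCHEME_RE + r':[^\s<>"]+')


def link_targets(text):
    """Return link targets, dropping one trailing . or , before white space."""
    targets = []
    for match in CANDIDATE.finditer(text):
        url = match.group(0)
        # Assumption: the end of the text counts as white space.
        at_space = match.end() == len(text) or text[match.end()].isspace()
        if at_space and url[-1] in ".,":
            url = url[:-1]
        targets.append(url)
    return targets


if __name__ == "__main__":
    print(link_targets("See http://foo.com/. Also <http://foo.com/.> here"))
```

In this sketch a URL wrapped in angle brackets keeps its trailing dot, which gives an author a way to state explicitly that the dot belongs to the URL.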

Less thorny, but more tedious, will be what to do for the rest. If you haven't already done so, I suggest reading <https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid>, as it is enlightening. My favorite comment there as a basis for deciding what to do is <http://stackoverflow.com/a/36667242/671672>.

Given the bugzilla context, I'd think the following conceptual steps need to be taken in the URI-candidate detection:

1. First identify strings for further investigation, i.e., those delimited by ($safe_protocols): and [\s<>\"] (brackets used in regex meaning, as in the code you pointed to); everything in between is the remainder.

2. Parse the remainder according to the specific rules for that scheme (I guess you could limit yourself to http(s) and mailto for starters as this is going to be the most tedious part) to see whether it is invalid or some tail part should be excluded (for example in [http://www.example.org/][whatever] the tail ][whatever] should be removed).
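
The two steps above could be sketched as follows, assuming Python. SAFE_PROTOCOLS is a hypothetical stand-in for Bugzilla's safe protocol list, and the http(s) check is a deliberate simplification: it keeps only characters that RFC 3986 allows in a path, query, or fragment and ignores bracketed IPv6 hosts. All names are made up for illustration.

```python
import re

# Assumption: a short stand-in for the configured safe protocols.
SAFE_PROTOCOLS = ("http", "https", "mailto")

# Step 1: a scheme, a colon, and a remainder delimited by whitespace, <, > or ".
CANDIDATE = re.compile(r'\b(' + "|".join(SAFE_PROTOCOLS) + r'):([^\s<>"]+)')

# Step 2 (http and https only): leading run of characters valid in a path,
# query, or fragment. Simplification: bracketed IPv6 hosts are not handled.
HTTP_OK = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?#%]+")


def trim_http(remainder):
    """Keep the leading part of the remainder that passes the http(s) check."""
    match = HTTP_OK.match(remainder)
    return match.group(0) if match else ""


def detect_links(text):
    """Return the link targets found by the two-step detection."""
    links = []
    for match in CANDIDATE.finditer(text):
        scheme, remainder = match.group(1), match.group(2)
        if scheme in ("http", "https"):
            remainder = trim_http(remainder)
        # Other schemes pass through unchanged in this sketch.
        if remainder:
            links.append(scheme + ":" + remainder)
    return links


if __name__ == "__main__":
    print(detect_links("[http://www.example.org/][whatever]."))
```

For the bracket example the tail ][whatever]. is cut off and only http://www.example.org/ is linked. Trailing periods and commas are left untouched here, since that question is separate.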