1551439 - Link in my RSS with %3A is transformed to %253A

Pascal BORSCHNECK

Reporter

Description

6 years ago

Attached image 46922985515_d37b24db4c_o.jpg

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0

Steps to reproduce:

I was told my blog RSS feed had a problem with a link in Thunderbird (with a %3A in the link)

Feed: https://www.arfy.fr/dotclear/index.php?feed/rss2

The blog's post with the problem: Pub, de l'humour de Burger King : The Not Big Mac’s
( https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s )

Inside TB, the link looks good: see attached screenshot

Actual results:

But in Thunderbird, if you click on the link to open it "outside", the "%" of "%3A" is itself encoded, so the browser receives %253A and my blog yields "page not found"!

Side by side links to compare: (bad one first)
[code]
https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%253A-The-Not-Big-Mac%E2%80%99s
https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s
[/code]

To note: the "%" at the end of the link are not "urlencoded": %E2%80%99

Expected results:

The "%" should not be "urlencoded"

Jorg K (CEST = GMT+2)

Comment 1

6 years ago

Well, the link is this:
https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s

At the end, you have "Mac's", well it's really a smart-quote I don't have on the keyboard, but there is no % at the end.

I think a % in a link should be percent-encoded, but the feed should have supplied a : just as it supplied a ' (look closely at your picture), rather than encoding one but not the other. Hover over the second link in your description to see that %3A is really a colon.

I think this bug is invalid, Alta88, do you agree?

Flags: needinfo?(alta88)

Pascal BORSCHNECK

Reporter

Comment 2

6 years ago

About the "Mac's" at the end, it is "urlencoded" as
%E2%80%99s
as you can see

The url
https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s
will work

The other one, with the first "%" encoded as %25, does not:
https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%253A-The-Not-Big-Mac%E2%80%99s

Jorg K (CEST = GMT+2)

Comment 3

6 years ago

Yes, the "Mac's" at the end is percent-encoded, but you said (comment #0) (quote):
To note: the "%" at the end of the link are not "urlencoded": %E2%80%99

I still think the feed provided an incorrect link. There are two special characters in the link: a colon ":" and a (smart) quote "'". In your picture, the quote is visible, but the colon is not, it shows as a %3A.

Pascal BORSCHNECK

Reporter

Comment 4

6 years ago

As for the feed providing an incorrect link: try it in Firefox. In mine it works:
https://www.arfy.fr/dotclear/index.php?feed/rss2

;)

Jorg K (CEST = GMT+2)

Comment 5

6 years ago

OK, but on that page you can see the : and the ’

So maybe there is a bug in TB, but the bug is NOT that %3A is changed to %253A, the bug is that : is changed to %3A and later changed again, when the ’ is not.

I've subscribed to the feed myself now and the article in question has:
Content-Base: https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s

Yet, in the display we get %3A and a ’. Weird.

alta88

Comment 6

6 years ago

There's nothing in the RSS 2.0 spec on the <link> format, and in many ways like this it is barely a usable spec at all. One interpretation could be that the link should be published as human readable, and if publishers don't do that, their users are broken. The link will load in the browser as published, and also if decoded into human readable form.

The external link opener does encodeURI() first, thus double encoding if so published. Maybe the feed parser could decodeURI() before storing; it is currently stored as published without decoding/alteration. But doing that could alter other pieces that use uri.
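A minimal sketch of the double encoding described above. The short URL is hypothetical (example.org is not from the feed); the only claim illustrated is that encodeURI() escapes a literal "%" as "%25", so a string that is already percent-encoded gets encoded a second time.

```javascript
// Hypothetical short link, already percent-encoded by the publisher.
const published = "https://example.org/post/King-%3A-Mac%E2%80%99s";

// encodeURI() treats "%" as an ordinary character and escapes it as "%25",
// so every existing escape sequence is encoded again.
const reEncoded = encodeURI(published);

console.log(reEncoded);
// → https://example.org/post/King-%253A-Mac%25E2%2580%2599s
```

Note that this direct call escapes every "%", including those of %E2%80%99, which is not what the report shows; the display step discussed later in the thread accounts for the difference.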

Flags: needinfo?(alta88)

Jorg K (CEST = GMT+2)

Comment 7

6 years ago

Edited

OK, but how do you explain that %3A in the content base header is shown as %3A in the UI, whereas %E2%80%99 is shown as ’ ?

When you click the link, %3A is re-encoded to %25...3A whereas the stuff at the back isn't. That doesn't make sense.
In UTF-8, e2 80 99 is a RIGHT SINGLE QUOTATION MARK and 3a is an (ASCII) colon. Why are they treated differently?

Maybe the simple colon shouldn't have been encoded?

EDIT: I edited the feed article in a text editor and re-imported the result. If I remove %3A and insert a colon instead, it all works. So again, should the %3A be there and where does it come from? The publisher?

alta88

Comment 8

6 years ago

(In reply to Jorg K (GMT+2) from comment #7)

> OK, but how do you explain that %3A in the content base header is shown as %3A in the UI, whereas %E2%80%99 is shown as ’ ?
>
> When you click the link, %3A is re-encoded to %25...3A whereas the stuff at the back isn't. That doesn't make sense.
> In UTF-8, e2 80 99 is a RIGHT SINGLE QUOTATION MARK and 3a is an (ASCII) colon. Why are they treated differently?
>
> Maybe the simple colon shouldn't have been encoded?

There is an important difference between decodeURI() and decodeURIComponent(). But that's for display; the stored url is what matters for opening the link.
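The difference can be seen on the tail of the reported link. This is only an illustrative sketch run on a fragment of the URL: decodeURI() leaves escapes of reserved characters such as ":" untouched, while decodeURIComponent() decodes every valid escape.

```javascript
// Tail of the link from the report, as published in the feed.
const encoded = "King-%3A-The-Not-Big-Mac%E2%80%99s";

// decodeURI() does not decode escapes of reserved characters
// (; / ? : @ & = + $ , #), so %3A survives; %E2%80%99 becomes ’.
const viaDecodeURI = decodeURI(encoded);

// decodeURIComponent() decodes both escapes.
const viaDecodeURIComponent = decodeURIComponent(encoded);

console.log(viaDecodeURI);          // King-%3A-The-Not-Big-Mac’s
console.log(viaDecodeURIComponent); // King-:-The-Not-Big-Mac’s
```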

> EDIT: I edited the feed article in a text editor and re-imported the result. If I remove %3A and insert a colon instead, it all works. So again, should the %3A be there and where does it come from? The publisher?

Of course it works. The publisher shouldn't encode the url. If you look at the source of the feed url page, you'll see what they publish, and that's what's stored, as mentioned. Once upon a time, in a different display paradigm (in a browser, rather than a parser/email message-like store system we have here) it may have been necessary to encode. The question is whether there should be any hacks to fix up things published in error to the spec. I always say no. Here it's debatable as there's no spec, and an encoded url is not invalid. It could be fixed either by the publisher, or by making the parser more hacky to handle it. I lean no, as to fix this one publisher's feed means everyone has to process the hack, on every computer with feeds, on every feed refresh.

Jorg K (CEST = GMT+2)

Comment 9

6 years ago

Edited

Yes, I looked at the page:

  <item>
    <title>Pub, de l'humour de Burger King : The Not Big Mac’s</title>
    <link>https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s</link>
    <guid isPermaLink="false">urn:md5:6563e1691dee713bf701296cf314147e</guid>
    <pubDate>Sun, 12 May 2019 08:19:00 +0200</pubDate>
    <dc:creator>Arfy</dc:creator>
        <category>Médias</category>
        <category>Humour</category><category>Pub</category>

etc. Yes, that is stored. So you're saying that on that page, the colon and the quote shouldn't be encoded.

I'd still like to understand why, for display, the %3A is not changed to a colon, but the %E2%80%99 is changed to a quote. There must be some parsing going on that detects one but not the other.

You're saying that the external link opener does encodeURI() first, thus double encoding if so published. But that's only 50% the case. The %3A is double encoded, but the %E2%80%99 isn't.

What am I missing?

EDIT: I tried encodeURI() in the console. It really encodes all % characters, so there must be something else going on.

alta88

Comment 10

6 years ago

(In reply to Jorg K (GMT+2) from comment #9)

> Yes, I looked at the page:
>
>   <item>
>     <title>Pub, de l'humour de Burger King : The Not Big Mac’s</title>
>     <link>https://www.arfy.fr/dotclear/index.php?post/2019/05/12/Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s</link>
>     <guid isPermaLink="false">urn:md5:6563e1691dee713bf701296cf314147e</guid>
>     <pubDate>Sun, 12 May 2019 08:19:00 +0200</pubDate>
>     <dc:creator>Arfy</dc:creator>
>         <category>Médias</category>
>         <category>Humour</category><category>Pub</category>
>
> etc. Yes, that is stored. So you're saying that on that page, the colon and the quote shouldn't be encoded.

yes.

> I'd still like to understand why, for display, the %3A is not changed to a colon, but the %E2%80%99 is changed to a quote. There must be some parsing going on that detects one but not the other.

try both decodeURI() and decodeURIComponent() on the url string.

> You're saying that the external link opener does encodeURI() first, thus double encoding if so published. But that's only 50% the case. The %3A is double encoded, but the %E2%80%99 isn't.
>
> What am I missing?

Because when the url is put up for display, into textContent, decodeURI() is used, which doesn't decode the colon. The opener then takes the url from textContent.
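The two steps can be reproduced outside Thunderbird. The sketch below is not the actual Thunderbird code; it only chains the two standard functions named in the thread, in the order described, on the link from the report. The variable names are illustrative.

```javascript
// Link exactly as published in the feed's <link> element (from the report).
const stored =
  "https://www.arfy.fr/dotclear/index.php?post/2019/05/12/" +
  "Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s";

// Step 1 (display): decodeURI() turns %E2%80%99 into ’ but keeps %3A,
// because ":" is a reserved character.
const displayed = decodeURI(stored);

// Step 2 (open externally): encodeURI() on the displayed text re-encodes
// ’ as %E2%80%99 and escapes the leftover "%" of %3A as %25.
const opened = encodeURI(displayed);

console.log(displayed.endsWith("King-%3A-The-Not-Big-Mac\u2019s"));   // true
console.log(opened.endsWith("King-%253A-The-Not-Big-Mac%E2%80%99s")); // true
```

The result matches the "bad" link in the description: only the escape that decodeURI() left alone ends up double encoded.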

> EDIT: I tried encodeURI() in the console. It really encodes all % characters, so there must be something else going on.

Jorg K (CEST = GMT+2)

Comment 11

6 years ago

Edited

OK, here:
https://searchfox.org/comm-central/rev/d86758c2328ae10f2d3f8b0422772e33e858f089/mail/base/content/msgHdrView.js#499
and, as you said, that doesn't convert %3A back to colon. So why not use decodeURIComponent()?

And while educating me, what is this about:
mail/base/content/foldersummary.js
143 decodeURIComponent(escape(msgHdr.getStringProperty("preview")));
mailnews/extensions/newsblog/content/newsblogOverlay.js
212 url = decodeURIComponent(escape(url));
347 let url = decodeURIComponent(escape(aMimeMsg.headers["content-base"]));

First escape, then decode. That's pretty much a no-op, no? However, it was introduced here:
https://searchfox.org/comm-central/diff/24bd97cd719c26e22e9cc956a13eb08998ad8990/mailnews/extensions/newsblog/content/newsblogOverlay.js#355
as a substitution of a conversion of raw UTF-8 to internal JS storage (UTF-16) since headers can contain raw UTF-8.

EDIT:
OK, I read https://hg.mozilla.org/comm-central/rev/55b04a77e7610a1907960a9268f5e816869cddc8. Hmm, that's a pretty hacky conversion and escape() and unescape() are actually deprecated.
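For illustration, here is what the decodeURIComponent(escape(...)) idiom does to a string that holds raw UTF-8 bytes, one byte per character. The input string and variable names are made up for this sketch, and the TextDecoder variant is only one possible non-deprecated equivalent, not something proposed in the thread.

```javascript
// A JS string holding the raw UTF-8 bytes of "Mac’s", one byte per code unit
// (E2 80 99 is the UTF-8 encoding of U+2019).
const rawUtf8 = "Mac\u00E2\u0080\u0099s";

// Legacy idiom: escape() turns each byte into %XX, then decodeURIComponent()
// interprets the %XX sequence as UTF-8.
const viaEscape = decodeURIComponent(escape(rawUtf8));

// Equivalent conversion without the deprecated escape().
const bytes = Uint8Array.from(rawUtf8, (ch) => ch.charCodeAt(0));
const viaTextDecoder = new TextDecoder("utf-8").decode(bytes);

// For pure ASCII input, including existing percent escapes, it is a no-op.
const asciiRoundTrip = decodeURIComponent(escape("King-%3A-Mac"));

console.log(viaEscape === "Mac\u2019s");        // true
console.log(viaTextDecoder === viaEscape);      // true
console.log(asciiRoundTrip === "King-%3A-Mac"); // true
```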

So the question in the first paragraph remains.

alta88

Comment 12

6 years ago

(In reply to Jorg K (GMT+2) from comment #11)

> OK, here:
> https://searchfox.org/comm-central/rev/d86758c2328ae10f2d3f8b0422772e33e858f089/mail/base/content/msgHdrView.js#499
> and, as you said, that doesn't convert %3A back to colon. So why not use decodeURIComponent()?

You could, in this case. Maybe it will break a url stored as utf16 string. Maybe it will break an IDN url (domain only), which is a different spec and not the same as breaking an internationalized url (that spec refers to the non domain path part). All these things evolved in bits and pieces, as did nsIURI, which now deals with punycode and IDN and even has a displaySpec property.

> And while educating me, what is this about:
> mail/base/content/foldersummary.js
> 143 decodeURIComponent(escape(msgHdr.getStringProperty("preview")));
> mailnews/extensions/newsblog/content/newsblogOverlay.js
> 212 url = decodeURIComponent(escape(url));
> 347 let url = decodeURIComponent(escape(aMimeMsg.headers["content-base"]));
>
> First escape, then decode. That's pretty much a no-op, no? However, it was introduced here:
> https://searchfox.org/comm-central/diff/24bd97cd719c26e22e9cc956a13eb08998ad8990/mailnews/extensions/newsblog/content/newsblogOverlay.js#355
> as a substitution of a conversion of raw UTF-8 to internal JS storage (UTF-16) since headers can contain raw UTF-8.
>
> EDIT:
> OK, I read https://hg.mozilla.org/comm-central/rev/55b04a77e7610a1907960a9268f5e816869cddc8. Hmm, that's a pretty hacky conversion and escape() and unescape() are actually deprecated.

Sure, it's a no op for ascii. That specific change was for cyrillic with encoding, I believe.

> So the question in the first paragraph remains.

As I've said:

1. Scrub all urls to human readable; handle garbage. Probably the best way is to make nsIURIs out of raw strings. But there was once an m-c bug about the numerous conversions back and forth between string to nsIURI in a flow.
2. Do not alter the url; pass on whatever the publisher issues and require human readable to work.
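As a rough illustration of the first option, the sketch below normalizes a raw string through a URL parser instead of calling encodeURI() on it. It uses the standard WHATWG URL class as a stand-in; that nsIURI would behave the same way is an assumption, not something confirmed in this bug.

```javascript
// The link as published (already encoded) and a human readable variant.
const base = "https://www.arfy.fr/dotclear/index.php?post/2019/05/12/";
const published = base + "Pub--de-l-humour-de-Burger-King-%3A-The-Not-Big-Mac%E2%80%99s";
const humanReadable = base + "Pub--de-l-humour-de-Burger-King-:-The-Not-Big-Mac\u2019s";

// Assumption: WHATWG URL stands in for nsIURI here.
// The parser keeps existing escapes as they are and only encodes
// characters that still need it (here the ’ in the query).
const fromPublished = new URL(published).href;
const fromHumanReadable = new URL(humanReadable).href;

console.log(fromPublished === published);                                       // true
console.log(fromHumanReadable.endsWith("King-:-The-Not-Big-Mac%E2%80%99s"));    // true
console.log(fromPublished.includes("%25") || fromHumanReadable.includes("%25")); // false
```

Neither input ends up with %253A, because the parser never re-escapes an existing "%".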

Benjamin Flanagin

Comment 13

6 years ago

Sounds like we have a plan of action to fix this issue. Though I have some questions.

Has a task or enhancement been created about this?
Any objections to creating one?

Component: Untriaged → Message Reader UI

Flags: needinfo?(jorgk)

Flags: needinfo?(alta88)

Jorg K (CEST = GMT+2)

Comment 14

6 years ago

Why create a task or enhancement? It can be fixed in this bug here.

Flags: needinfo?(jorgk)

alta88

Comment 15

6 years ago

The plan is #2. Absent any clear spec guidance, but with a clear emphasis on 'human readable' in the spec for strings, the implementer (us) can decide what's best for the application. Therefore, it is up to the publisher to not uri encode but ensure human readable content for this tag. This is a total edge case as well, and not worth extra processing/complexity/regression risk for the vast majority of users whose publishers don't do this.

Status: UNCONFIRMED → RESOLVED

Closed: 6 years ago

Flags: needinfo?(alta88)

Resolution: --- → WONTFIX

Michael Mueller

Comment 17

6 years ago

@alta88

I am not sure how much of an edge case this is. How do you know how many people besides me and OP experience the same problem? Should I write to every publisher asking them to change the encoding? A normal user does not know that this problem is a consequence of a publisher's wrong encoding, and therefore would not think of writing to the responsible IT department to ask them to encode differently.

Categories

(Thunderbird :: Message Reader UI, defect)

Tracking

(Not tracked)

People

(Reporter: borschneck, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)
