nsDocumentEncoder percent-encodes links, particularly %7e

RESOLVED FIXED in Firefox 56

Status

()

RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: v+mozbug, Assigned: emk)

Tracking

52 Branch
mozilla56
Points:
---

Firefox Tracking Flags

(firefox56 fixed)

Details

Attachments

(3 attachments)

(Reporter)

Description

2 years ago
User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Build ID: 20170518000419

Steps to reproduce:

Paste a URL containing a "~" character into "insert / link" feature (Ctrl-K, used to be Ctrl-L)


Actual results:

Looks fine, when pasted, looks fine when, still in compose mode, the link is "edited" (same test as pasted is viewed).  But when sent, the "~" character is % encoded to "%7E" inside the URL (but not in the text form in the message).


Expected results:

According to Wikipedia [1], and standards it references from 2005, the "~" character is an unreserved character, and "Characters from the unreserved set never need to be percent-encoded".

This has caused some URLs I have sent to others to get 404 errors from some sites that apparently do not expect unreserved characters to be percent-encoded.

Since unreserved characters do not need to be percent-encoded, why is Thunderbird doing so?

[1] https://en.wikipedia.org/wiki/Percent-encoding

Comment 1

2 years ago
Well, I created such a message with a link to http://www.jorgk.com/%7Exxx and clicking onto it, I get:
The requested URL /~xxx was not found on this server.

So my server knows what to do with the %7E.

In general, Thunderbird relies on Mozilla core software for these things, so let's get an answer from them.

Valentin, can you help?
Flags: needinfo?(valentin.gosu)

Comment 2

2 years ago
Reporter, which version are you using, since we changed the behaviour of encoded links in bug 770022.

Clicking on http://www.jorgk.com/%7Exxx in an e-mail does take me to the ~ URL, clicking on the link in comment #1 doesn't.
(Reporter)

Comment 3

2 years ago
version 52.1.1 (32-bit) is the version I got for the Thunderbird I tested this with today.

It is not at all clear why Thunderbird should change the URL text pasted in by the user in creating the link, but if it does, it should do minimal encoding, not affecting unreserved characters.  If the user puts in a bad link, and it gets an error when used, that is the user's problem. But if the user puts in a good link, and Thunderbird changes it so it doesn't work, that is a bug in Thunderbird.
Indeed, ~ is not a reserved character, so it should not be percent encoded.
The URL parser in Gecko does not seem to encode it, so I assume the problem is a Thunderbird issue.
At first look bug 770022 doesn't seem to be the problem, but I didn't investigate too much.
Flags: needinfo?(valentin.gosu)

Comment 5

2 years ago
This appears to be an M-C problem. TB inserts an anchor element with href="http://www.xx.com/~xxx" into the DOM.

However, when the DOM is later serialised to text for saving as a draft or sending out as an e-mail, the ~ is replaced.

You can see this with the add-on ThunderHTMLedit. Right after the link inserting, the DOM inspector shows the ~, however, the HTML editor (or even the cheap HTML editor via TB's "Insert > HTML") already shows the %7E. So the substitution happens in the DOM to text serialiser in M-C code. I'll look into it further to find the exact location where this happens.

Comment 6

2 years ago
I've dumped out the converted string resulting from the DOM encoding here:
https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/dom/base/nsDocumentEncoder.cpp#1203
at the end of nsDocumentEncoder::EncodeToStringWithMaxLength().

What I see is this when saving the message:
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <a moz-do-not-send="true" href="http://www.xx.com/%7Exxx">nm</a><br>
  </body>
</html>

which is of course the same thing that ThunderHTMLedit shows me.

So the document was encoded with the ~ replaced. I set breakpoints all over intl/uconv/nsTextToSubURI.cpp hoping that the ~ to %7E conversation would be done in there, but no luck.

Before I spent hours tracing through the code, it might be easier to ask:
Masatoshi-san, do you know where the ~ to %7E conversation would happen when a document is encoded? It doesn't appear to be in nsTextToSubURI.cpp.
Flags: needinfo?(VYV03354)
(Assignee)

Comment 7

2 years ago
Here (but please read below):
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsXHTMLContentSerializer.cpp?q=%2Bfunction%3A%22nsXHTMLContentSerializer%3A%3AEscapeURI%28nsIContent+%2A%2C+const+nsAString+%26%2C+nsAString+%26%29%22&redirect_type=single#196-197,214-215

"never need to be percent-encoded" does not mean "must not be percent-encoded". Unreserved characters are supposed to be encodable without changing the meaning of the URL. If it changes the meaning, it is a bug of servers. (That said, we can stop encoding them if it is web-incompatible.)

Rather, we can't percent-encode reserved characters without changing the meaning of URLs. For example, we can't change "/" to "%2F" or "?" to "%3F" in path components. And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters).
Flags: needinfo?(VYV03354)

Comment 8

2 years ago
With this information I traced through the code a bit. Each "part" of the URL delimited by any of "%#;/?:@&=+$,[]" is run through NS_Escape() (as Masatoshi-san already suggested by giving the exact code ranges) and that appears to do the "damage".

So I don't know why NS_Escape() escapes the ~, and as a humble TB developer I don't know whether that is right or wrong. At least under certain circumstances it appears undesirable. But then, if we change that, how many tests and websites will we break?

As I said in comment #1: The server I tried converted the %7E back to ~.
(Reporter)

Comment 9

2 years ago
Thanks for your investigations so far.

Re: comment 7: "never need to be percent-encoded" does not mean "must not be percent-encoded". 

While the above is true, in violates Postel's law: "an implementation should be conservative in its sending behavior, and liberal in its receiving behavior".

Re: comment 7: And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters).

It should be noted that A-Z and a-z and 0-9 are not percent-encoded, they are also unreserved. To adhere to Postel's law, and implement URL encoding, percent-encoding should only be applied to reserved characters.  So "any ASCII characters" are not all being encoded. If there is already a list reserved characters in EscapeURI, it would seem appropriate to only encode them, in only the spots that are necessary to do so, rather than encoding a bigger list of unreserved characters.  It is already skipping the A-Z (et alia), it should also skip ~ (and other unreserved, non-alphanumeric characters).

As Jorg K points out, some web servers convert the %7E back to ~: they are liberal in what they accept.  This is good, and robust, and adhere's to Postel's law.  However, apprarently not all web servers, in all situations, are as liberal in what they accept as the one Jorg K has.

I have reported the decoding problem to one such web server maintainer (with no response as yet), but likely there are others out there.

Comment 10

2 years ago
(In reply to Glenn Linderman from comment #9)
> It is already skipping the A-Z (et
> alia), it should also skip ~ (and other unreserved, non-alphanumeric
> characters).
Agreed. But you need to understand how TB works. TB is in some ways a giant Firefox add-on, built on Mozilla core technology. Many times we run into problems with that underlying Mozilla core technology but we're not in charge or responsible for it. So let's see how the Mozilla team wants to handle this.
(Assignee)

Comment 11

2 years ago
var a=document.createElement("a");
a.href="http://example.com/~xxx";
new XMLSerializer().serializeToString(a);

Firefox: "<a xmlns="http://www.w3.org/1999/xhtml" href="http://example.com/%7Exxx"></a>"
Others: "<a href="http://example.com/~xxx"></a>"

Only Firefox percent-encodes the "~" character in the URL. This is a web-visible Core issue.
Component: Untriaged → Networking
Product: Thunderbird → Core
(Assignee)

Updated

2 years ago
Summary: URL encoding links in email, particularly %7e → nsDocumentEncoder percent-encodes links, particularly %7e
(Assignee)

Comment 13

2 years ago
Created attachment 8875943 [details]
Test to serialize non-ASCII urls in non-utf-8 documents

Our serializer is totally broken. It should not use the document charset to encode the path part of URLs.
(Assignee)

Updated

2 years ago
Attachment #8875943 - Attachment mime type: text/html → text/html;charset=euc-jp
(Assignee)

Comment 14

2 years ago
Bug 74137 added percent-encoding to avoid a composer bug. But we should remove it because:
1. The composer bug was fixed in another bug (bug 81090).
2. Other browsers (at least Chrome, IE11, and Edge) do not percent-encode the URL.
Blocks: 74137
Component: Networking → DOM

Comment 15

2 years ago
This is not related to bug 1338900 which is a spin-off of bug 770022, is it?

Comment 16

2 years ago
Apparently attachment 34620 [details] [diff] [review] was landed for bug 74137:
 // Replace whitespace with "_" and allow only HTML CDATA
+//   characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".",
+//   and characters above ASCII 127
 function ConvertToCDATAString(string)
 {
+  return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,'');
 }
which has moved here:
https://dxr.mozilla.org/comm-central/rev/a19e0ffd4bd52f658d332e8d249e1bbec2d03e68/editor/ui/composer/content/editorUtilities.js#141
So that removes certain characters from the name when adding/editing a named anchor <a name="xxx">.
(Assignee)

Comment 17

2 years ago
(In reply to Jorg K (GMT+2) from comment #16)
> +//   characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".",
> +//   and characters above ASCII 127

The allowed characters don't contain "%", so percent-encoding does not help. SeaMonkey team should deal with the change.
(Assignee)

Comment 18

2 years ago
(In reply to Jorg K (GMT+2) from comment #15)
> This is not related to bug 1338900 which is a spin-off of bug 770022, is it?

No. It is impossible to escape both path component and query component correctly using only one charset. So EscapeURI will break href attributes regardless of mCharset.

Comment 19

2 years ago
(In reply to Masatoshi Kimura [:emk] from comment #17)
> The allowed characters don't contain "%", so percent-encoding does not help.
> SeaMonkey team should deal with the change.
Actually, "Insert > Named Anchor" is a function in TB ... and I'm not aware of any complaints ;-) It just overly restricts what you can use in the name.
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Assignee: nobody → VYV03354
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

Updated

2 years ago
See Also: → bug 1371559

Comment 22

2 years ago
(In reply to Masatoshi Kimura [:emk] from comment #17)
> (In reply to Jorg K (GMT+2) from comment #16)
> The allowed characters don't contain "%", so percent-encoding does not help.
> SeaMonkey team should deal with the change.
Filed bug 1371559 for that. Thanks for researching this and thanks for addressing this bug here.

Comment 23

2 years ago
mozreview-review
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

https://reviewboard.mozilla.org/r/147482/#review151920

While I think your patch is good, because it touches on encoding, I think Henri (hsivonen) is a better reviewer, do you mind redirecting the review to him, please?

Also, in the interest of ensuring cross-browser compatibility about this, can you please add some web-platform tests for this to https://searchfox.org/mozilla-central/source/testing/web-platform/tests/domparsing/XMLSerializer-serializeToString.html?  Thanks!
Attachment #8876028 - Flags: review?(ehsan)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

a year ago
Duplicate of this bug: 547667

Comment 27

a year ago
mozreview-review
Comment on attachment 8876363 [details]
Bug 1371010 - Test to make sure that serializeToString does not percent-encode href attributes.

https://reviewboard.mozilla.org/r/147752/#review152322
Attachment #8876363 - Flags: review?(james) → review+
This looks OK to me, but without a test that tests the query part, too, it's hard to be sure. emk, could you, please, add a test that tests a URL with a query string, too?
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

Absent a test with query string, I'm not sure if this fully fixes things, but since it's clear that this fixes at least the non-query bits, r+.
Attachment #8876028 - Flags: review?(hsivonen) → review+
Comment hidden (mozreview-request)
(Assignee)

Comment 31

a year ago
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

(In reply to Henri Sivonen (:hsivonen) from comment #29)
> Absent a test with query string, I'm not sure if this fully fixes things,
> but since it's clear that this fixes at least the non-query bits, r+.

I added one more wpt to test query parts.

Henri, Could you update the review flag via MozReview? Otherwise MozReview does not allow me to land the commit.
Attachment #8876028 - Flags: review+ → review?(hsivonen)

Comment 32

a year ago
mozreview-review
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

https://reviewboard.mozilla.org/r/147482/#review152396
Attachment #8876028 - Flags: review?(hsivonen) → review+

Comment 33

a year ago
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/f9f63d01b659
Stop percent-encoding href attributes when serializing documents. r=hsivonen
https://hg.mozilla.org/integration/autoland/rev/1be3848d9f52
Test to make sure that serializeToString does not percent-encode href attributes. r=jgraham

Comment 34

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/f9f63d01b659
https://hg.mozilla.org/mozilla-central/rev/1be3848d9f52
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox56: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
You need to log in before you can comment on or make changes to this bug.