Closed Bug 1371010 Opened 7 years ago Closed 7 years ago

nsDocumentEncoder percent-encodes links, particularly %7e

Categories

(Core :: DOM: Core & HTML, defect)

52 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: v+mozbug, Assigned: emk)

References

Details

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Build ID: 20170518000419

Steps to reproduce:

Paste a URL containing a "~" character into "insert / link" feature (Ctrl-K, used to be Ctrl-L)


Actual results:

Looks fine, when pasted, looks fine when, still in compose mode, the link is "edited" (same test as pasted is viewed).  But when sent, the "~" character is % encoded to "%7E" inside the URL (but not in the text form in the message).


Expected results:

According to Wikipedia [1], and standards it references from 2005, the "~" character is an unreserved character, and "Characters from the unreserved set never need to be percent-encoded".

This has caused some URLs I have sent to others to get 404 errors from some sites that apparently do not expect unreserved characters to be percent-encoded.

Since unreserved characters do not need to be percent-encoded, why is Thunderbird doing so?

[1] https://en.wikipedia.org/wiki/Percent-encoding
Well, I created such a message with a link to http://www.jorgk.com/%7Exxx and clicking onto it, I get:
The requested URL /~xxx was not found on this server.

So my server knows what to do with the %7E.

In general, Thunderbird relies on Mozilla core software for these things, so let's get an answer from them.

Valentin, can you help?
Flags: needinfo?(valentin.gosu)
Reporter, which version are you using, since we changed the behaviour of encoded links in bug 770022.

Clicking on http://www.jorgk.com/%7Exxx in an e-mail does take me to the ~ URL, clicking on the link in comment #1 doesn't.
version 52.1.1 (32-bit) is the version I got for the Thunderbird I tested this with today.

It is not at all clear why Thunderbird should change the URL text pasted in by the user in creating the link, but if it does, it should do minimal encoding, not affecting unreserved characters.  If the user puts in a bad link, and it gets an error when used, that is the user's problem. But if the user puts in a good link, and Thunderbird changes it so it doesn't work, that is a bug in Thunderbird.
Indeed, ~ is not a reserved character, so it should not be percent encoded.
The URL parser in Gecko does not seem to encode it, so I assume the problem is a Thunderbird issue.
At first look bug 770022 doesn't seem to be the problem, but I didn't investigate too much.
Flags: needinfo?(valentin.gosu)
This appears to be an M-C problem. TB inserts an anchor element with href="http://www.xx.com/~xxx" into the DOM.

However, when the DOM is later serialised to text for saving as a draft or sending out as an e-mail, the ~ is replaced.

You can see this with the add-on ThunderHTMLedit. Right after the link inserting, the DOM inspector shows the ~, however, the HTML editor (or even the cheap HTML editor via TB's "Insert > HTML") already shows the %7E. So the substitution happens in the DOM to text serialiser in M-C code. I'll look into it further to find the exact location where this happens.
I've dumped out the converted string resulting from the DOM encoding here:
https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/dom/base/nsDocumentEncoder.cpp#1203
at the end of nsDocumentEncoder::EncodeToStringWithMaxLength().

What I see is this when saving the message:
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <a moz-do-not-send="true" href="http://www.xx.com/%7Exxx">nm</a><br>
  </body>
</html>

which is of course the same thing that ThunderHTMLedit shows me.

So the document was encoded with the ~ replaced. I set breakpoints all over intl/uconv/nsTextToSubURI.cpp hoping that the ~ to %7E conversation would be done in there, but no luck.

Before I spent hours tracing through the code, it might be easier to ask:
Masatoshi-san, do you know where the ~ to %7E conversation would happen when a document is encoded? It doesn't appear to be in nsTextToSubURI.cpp.
Flags: needinfo?(VYV03354)
Here (but please read below):
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsXHTMLContentSerializer.cpp?q=%2Bfunction%3A%22nsXHTMLContentSerializer%3A%3AEscapeURI%28nsIContent+%2A%2C+const+nsAString+%26%2C+nsAString+%26%29%22&redirect_type=single#196-197,214-215

"never need to be percent-encoded" does not mean "must not be percent-encoded". Unreserved characters are supposed to be encodable without changing the meaning of the URL. If it changes the meaning, it is a bug of servers. (That said, we can stop encoding them if it is web-incompatible.)

Rather, we can't percent-encode reserved characters without changing the meaning of URLs. For example, we can't change "/" to "%2F" or "?" to "%3F" in path components. And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters).
Flags: needinfo?(VYV03354)
With this information I traced through the code a bit. Each "part" of the URL delimited by any of "%#;/?:@&=+$,[]" is run through NS_Escape() (as Masatoshi-san already suggested by giving the exact code ranges) and that appears to do the "damage".

So I don't know why NS_Escape() escapes the ~, and as a humble TB developer I don't know whether that is right or wrong. At least under certain circumstances it appears undesirable. But then, if we change that, how many tests and websites will we break?

As I said in comment #1: The server I tried converted the %7E back to ~.
Thanks for your investigations so far.

Re: comment 7: "never need to be percent-encoded" does not mean "must not be percent-encoded". 

While the above is true, in violates Postel's law: "an implementation should be conservative in its sending behavior, and liberal in its receiving behavior".

Re: comment 7: And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters).

It should be noted that A-Z and a-z and 0-9 are not percent-encoded, they are also unreserved. To adhere to Postel's law, and implement URL encoding, percent-encoding should only be applied to reserved characters.  So "any ASCII characters" are not all being encoded. If there is already a list reserved characters in EscapeURI, it would seem appropriate to only encode them, in only the spots that are necessary to do so, rather than encoding a bigger list of unreserved characters.  It is already skipping the A-Z (et alia), it should also skip ~ (and other unreserved, non-alphanumeric characters).

As Jorg K points out, some web servers convert the %7E back to ~: they are liberal in what they accept.  This is good, and robust, and adhere's to Postel's law.  However, apprarently not all web servers, in all situations, are as liberal in what they accept as the one Jorg K has.

I have reported the decoding problem to one such web server maintainer (with no response as yet), but likely there are others out there.
(In reply to Glenn Linderman from comment #9)
> It is already skipping the A-Z (et
> alia), it should also skip ~ (and other unreserved, non-alphanumeric
> characters).
Agreed. But you need to understand how TB works. TB is in some ways a giant Firefox add-on, built on Mozilla core technology. Many times we run into problems with that underlying Mozilla core technology but we're not in charge or responsible for it. So let's see how the Mozilla team wants to handle this.
var a=document.createElement("a");
a.href="http://example.com/~xxx";
new XMLSerializer().serializeToString(a);

Firefox: "<a xmlns="http://www.w3.org/1999/xhtml" href="http://example.com/%7Exxx"></a>"
Others: "<a href="http://example.com/~xxx"></a>"

Only Firefox percent-encodes the "~" character in the URL. This is a web-visible Core issue.
Component: Untriaged → Networking
Product: Thunderbird → Core
Summary: URL encoding links in email, particularly %7e → nsDocumentEncoder percent-encodes links, particularly %7e
Our serializer is totally broken. It should not use the document charset to encode the path part of URLs.
Attachment #8875943 - Attachment mime type: text/html → text/html;charset=euc-jp
Bug 74137 added percent-encoding to avoid a composer bug. But we should remove it because:
1. The composer bug was fixed in another bug (bug 81090).
2. Other browsers (at least Chrome, IE11, and Edge) do not percent-encode the URL.
Blocks: 74137
Component: Networking → DOM
This is not related to bug 1338900 which is a spin-off of bug 770022, is it?
Apparently attachment 34620 [details] [diff] [review] was landed for bug 74137:
 // Replace whitespace with "_" and allow only HTML CDATA
+//   characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".",
+//   and characters above ASCII 127
 function ConvertToCDATAString(string)
 {
+  return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,'');
 }
which has moved here:
https://dxr.mozilla.org/comm-central/rev/a19e0ffd4bd52f658d332e8d249e1bbec2d03e68/editor/ui/composer/content/editorUtilities.js#141
So that removes certain characters from the name when adding/editing a named anchor <a name="xxx">.
(In reply to Jorg K (GMT+2) from comment #16)
> +//   characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".",
> +//   and characters above ASCII 127

The allowed characters don't contain "%", so percent-encoding does not help. SeaMonkey team should deal with the change.
(In reply to Jorg K (GMT+2) from comment #15)
> This is not related to bug 1338900 which is a spin-off of bug 770022, is it?

No. It is impossible to escape both path component and query component correctly using only one charset. So EscapeURI will break href attributes regardless of mCharset.
(In reply to Masatoshi Kimura [:emk] from comment #17)
> The allowed characters don't contain "%", so percent-encoding does not help.
> SeaMonkey team should deal with the change.
Actually, "Insert > Named Anchor" is a function in TB ... and I'm not aware of any complaints ;-) It just overly restricts what you can use in the name.
Assignee: nobody → VYV03354
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
See Also: → 1371559
(In reply to Masatoshi Kimura [:emk] from comment #17)
> (In reply to Jorg K (GMT+2) from comment #16)
> The allowed characters don't contain "%", so percent-encoding does not help.
> SeaMonkey team should deal with the change.
Filed bug 1371559 for that. Thanks for researching this and thanks for addressing this bug here.
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

https://reviewboard.mozilla.org/r/147482/#review151920

While I think your patch is good, because it touches on encoding, I think Henri (hsivonen) is a better reviewer, do you mind redirecting the review to him, please?

Also, in the interest of ensuring cross-browser compatibility about this, can you please add some web-platform tests for this to https://searchfox.org/mozilla-central/source/testing/web-platform/tests/domparsing/XMLSerializer-serializeToString.html?  Thanks!
Attachment #8876028 - Flags: review?(ehsan)
Comment on attachment 8876363 [details]
Bug 1371010 - Test to make sure that serializeToString does not percent-encode href attributes.

https://reviewboard.mozilla.org/r/147752/#review152322
Attachment #8876363 - Flags: review?(james) → review+
This looks OK to me, but without a test that tests the query part, too, it's hard to be sure. emk, could you, please, add a test that tests a URL with a query string, too?
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

Absent a test with query string, I'm not sure if this fully fixes things, but since it's clear that this fixes at least the non-query bits, r+.
Attachment #8876028 - Flags: review?(hsivonen) → review+
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

(In reply to Henri Sivonen (:hsivonen) from comment #29)
> Absent a test with query string, I'm not sure if this fully fixes things,
> but since it's clear that this fixes at least the non-query bits, r+.

I added one more wpt to test query parts.

Henri, Could you update the review flag via MozReview? Otherwise MozReview does not allow me to land the commit.
Attachment #8876028 - Flags: review+ → review?(hsivonen)
Comment on attachment 8876028 [details]
Bug 1371010 - Stop percent-encoding href attributes when serializing documents.

https://reviewboard.mozilla.org/r/147482/#review152396
Attachment #8876028 - Flags: review?(hsivonen) → review+
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/f9f63d01b659
Stop percent-encoding href attributes when serializing documents. r=hsivonen
https://hg.mozilla.org/integration/autoland/rev/1be3848d9f52
Test to make sure that serializeToString does not percent-encode href attributes. r=jgraham
https://hg.mozilla.org/mozilla-central/rev/f9f63d01b659
https://hg.mozilla.org/mozilla-central/rev/1be3848d9f52
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: