1371010 - nsDocumentEncoder percent-encodes links, particularly %7e

Reporter

Description

•

8 years ago

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0 Build ID: 20170518000419 Steps to reproduce: Paste a URL containing a "~" character into "insert / link" feature (Ctrl-K, used to be Ctrl-L) Actual results: Looks fine, when pasted, looks fine when, still in compose mode, the link is "edited" (same test as pasted is viewed). But when sent, the "~" character is % encoded to "%7E" inside the URL (but not in the text form in the message). Expected results: According to Wikipedia [1], and standards it references from 2005, the "~" character is an unreserved character, and "Characters from the unreserved set never need to be percent-encoded". This has caused some URLs I have sent to others to get 404 errors from some sites that apparently do not expect unreserved characters to be percent-encoded. Since unreserved characters do not need to be percent-encoded, why is Thunderbird doing so? [1] https://en.wikipedia.org/wiki/Percent-encoding

Jorg K (CEST = GMT+2)

Comment 1

•

8 years ago

Well, I created such a message with a link to http://www.jorgk.com/%7Exxx and clicking onto it, I get: The requested URL /~xxx was not found on this server. So my server knows what to do with the %7E. In general, Thunderbird relies on Mozilla core software for these things, so let's get an answer from them. Valentin, can you help?

Flags: needinfo?(valentin.gosu)

Jorg K (CEST = GMT+2)

Comment 2

•

8 years ago

Reporter, which version are you using, since we changed the behaviour of encoded links in bug 770022. Clicking on http://www.jorgk.com/%7Exxx in an e-mail does take me to the ~ URL, clicking on the link in comment #1 doesn't.

Glenn Linderman

Reporter

Comment 3

•

8 years ago

version 52.1.1 (32-bit) is the version I got for the Thunderbird I tested this with today. It is not at all clear why Thunderbird should change the URL text pasted in by the user in creating the link, but if it does, it should do minimal encoding, not affecting unreserved characters. If the user puts in a bad link, and it gets an error when used, that is the user's problem. But if the user puts in a good link, and Thunderbird changes it so it doesn't work, that is a bug in Thunderbird.

Valentin Gosu [:valentin] (he/him)

Comment 4

•

8 years ago

Indeed, ~ is not a reserved character, so it should not be percent encoded. The URL parser in Gecko does not seem to encode it, so I assume the problem is a Thunderbird issue. At first look bug 770022 doesn't seem to be the problem, but I didn't investigate too much.

Flags: needinfo?(valentin.gosu)

Jorg K (CEST = GMT+2)

Comment 5

•

8 years ago

This appears to be an M-C problem. TB inserts an anchor element with href="http://www.xx.com/~xxx" into the DOM. However, when the DOM is later serialised to text for saving as a draft or sending out as an e-mail, the ~ is replaced. You can see this with the add-on ThunderHTMLedit. Right after the link inserting, the DOM inspector shows the ~, however, the HTML editor (or even the cheap HTML editor via TB's "Insert > HTML") already shows the %7E. So the substitution happens in the DOM to text serialiser in M-C code. I'll look into it further to find the exact location where this happens.

Jorg K (CEST = GMT+2)

Comment 6

•

8 years ago

I've dumped out the converted string resulting from the DOM encoding here: https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/dom/base/nsDocumentEncoder.cpp#1203 at the end of nsDocumentEncoder::EncodeToStringWithMaxLength(). What I see is this when saving the message: <html> <head> <meta http-equiv="content-type" content="text/html; charset=windows-1252"> </head> <body bgcolor="#FFFFFF" text="#000000"> <a moz-do-not-send="true" href="http://www.xx.com/%7Exxx">nm</a><br> </body> </html> which is of course the same thing that ThunderHTMLedit shows me. So the document was encoded with the ~ replaced. I set breakpoints all over intl/uconv/nsTextToSubURI.cpp hoping that the ~ to %7E conversation would be done in there, but no luck. Before I spent hours tracing through the code, it might be easier to ask: Masatoshi-san, do you know where the ~ to %7E conversation would happen when a document is encoded? It doesn't appear to be in nsTextToSubURI.cpp.

Flags: needinfo?(VYV03354)

Masatoshi Kimura [:emk]

Assignee

Comment 7

•

8 years ago

Here (but please read below): https://dxr.mozilla.org/mozilla-central/source/dom/base/nsXHTMLContentSerializer.cpp?q=%2Bfunction%3A%22nsXHTMLContentSerializer%3A%3AEscapeURI%28nsIContent+%2A%2C+const+nsAString+%26%2C+nsAString+%26%29%22&redirect_type=single#196-197,214-215 "never need to be percent-encoded" does not mean "must not be percent-encoded". Unreserved characters are supposed to be encodable without changing the meaning of the URL. If it changes the meaning, it is a bug of servers. (That said, we can stop encoding them if it is web-incompatible.) Rather, we can't percent-encode reserved characters without changing the meaning of URLs. For example, we can't change "/" to "%2F" or "?" to "%3F" in path components. And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters).

Flags: needinfo?(VYV03354)

Jorg K (CEST = GMT+2)

Comment 8

•

8 years ago

With this information I traced through the code a bit. Each "part" of the URL delimited by any of "%#;/?:@&=+$,[]" is run through NS_Escape() (as Masatoshi-san already suggested by giving the exact code ranges) and that appears to do the "damage". So I don't know why NS_Escape() escapes the ~, and as a humble TB developer I don't know whether that is right or wrong. At least under certain circumstances it appears undesirable. But then, if we change that, how many tests and websites will we break? As I said in comment #1: The server I tried converted the %7E back to ~.

Glenn Linderman

Reporter

Comment 9

•

8 years ago

Thanks for your investigations so far. Re: comment 7: "never need to be percent-encoded" does not mean "must not be percent-encoded". While the above is true, in violates Postel's law: "an implementation should be conservative in its sending behavior, and liberal in its receiving behavior". Re: comment 7: And nsXHTMLContentSerializer::EscapeURI is exactly doing this (percent-encode any ASCII characters except reserved characters). It should be noted that A-Z and a-z and 0-9 are not percent-encoded, they are also unreserved. To adhere to Postel's law, and implement URL encoding, percent-encoding should only be applied to reserved characters. So "any ASCII characters" are not all being encoded. If there is already a list reserved characters in EscapeURI, it would seem appropriate to only encode them, in only the spots that are necessary to do so, rather than encoding a bigger list of unreserved characters. It is already skipping the A-Z (et alia), it should also skip ~ (and other unreserved, non-alphanumeric characters). As Jorg K points out, some web servers convert the %7E back to ~: they are liberal in what they accept. This is good, and robust, and adhere's to Postel's law. However, apprarently not all web servers, in all situations, are as liberal in what they accept as the one Jorg K has. I have reported the decoding problem to one such web server maintainer (with no response as yet), but likely there are others out there.

Jorg K (CEST = GMT+2)

Comment 10

•

8 years ago

(In reply to Glenn Linderman from comment #9) > It is already skipping the A-Z (et > alia), it should also skip ~ (and other unreserved, non-alphanumeric > characters). Agreed. But you need to understand how TB works. TB is in some ways a giant Firefox add-on, built on Mozilla core technology. Many times we run into problems with that underlying Mozilla core technology but we're not in charge or responsible for it. So let's see how the Mozilla team wants to handle this.

Masatoshi Kimura [:emk]

Assignee

Comment 11

•

8 years ago

var a=document.createElement("a"); a.href="http://example.com/~xxx"; new XMLSerializer().serializeToString(a); Firefox: "<a xmlns="http://www.w3.org/1999/xhtml" href="http://example.com/%7Exxx"></a>" Others: "<a href="http://example.com/~xxx"></a>" Only Firefox percent-encodes the "~" character in the URL. This is a web-visible Core issue.

Component: Untriaged → Networking

Product: Thunderbird → Core

Masatoshi Kimura [:emk]

Assignee

Updated

•

8 years ago

Summary: URL encoding links in email, particularly %7e → nsDocumentEncoder percent-encodes links, particularly %7e

Masatoshi Kimura [:emk]

Assignee

Comment 12

•

8 years ago

Permalink for the record: https://dxr.mozilla.org/mozilla-central/rev/a49112c7a5765802096b3fc298069b9495436107/dom/base/nsXHTMLContentSerializer.cpp#196-197,214-215

Masatoshi Kimura [:emk]

Assignee

Comment 13

•

8 years ago

Attached file Test to serialize non-ASCII urls in non-utf-8 documents — Details

Our serializer is totally broken. It should not use the document charset to encode the path part of URLs.

Masatoshi Kimura [:emk]

Assignee

Updated

•

8 years ago

Attachment #8875943 - Attachment mime type: text/html → text/html;charset=euc-jp

Masatoshi Kimura [:emk]

Assignee

Comment 14

•

8 years ago

Bug 74137 added percent-encoding to avoid a composer bug. But we should remove it because: 1. The composer bug was fixed in another bug (bug 81090). 2. Other browsers (at least Chrome, IE11, and Edge) do not percent-encode the URL.

Blocks: 74137

Component: Networking → DOM

Jorg K (CEST = GMT+2)

Comment 15

•

8 years ago

This is not related to bug 1338900 which is a spin-off of bug 770022, is it?

Jorg K (CEST = GMT+2)

Comment 16

•

8 years ago

Apparently attachment 34620 [details] [diff] [review] was landed for bug 74137: // Replace whitespace with "_" and allow only HTML CDATA +// characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".", +// and characters above ASCII 127 function ConvertToCDATAString(string) { + return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,''); } which has moved here: https://dxr.mozilla.org/comm-central/rev/a19e0ffd4bd52f658d332e8d249e1bbec2d03e68/editor/ui/composer/content/editorUtilities.js#141 So that removes certain characters from the name when adding/editing a named anchor <a name="xxx">.

Masatoshi Kimura [:emk]

Assignee

Comment 17

•

8 years ago

(In reply to Jorg K (GMT+2) from comment #16) > +// characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", ".", > +// and characters above ASCII 127 The allowed characters don't contain "%", so percent-encoding does not help. SeaMonkey team should deal with the change.

Masatoshi Kimura [:emk]

Assignee

Comment 18

•

8 years ago

(In reply to Jorg K (GMT+2) from comment #15) > This is not related to bug 1338900 which is a spin-off of bug 770022, is it? No. It is impossible to escape both path component and query component correctly using only one charset. So EscapeURI will break href attributes regardless of mCharset.

Jorg K (CEST = GMT+2)

Comment 19

•

8 years ago

(In reply to Masatoshi Kimura [:emk] from comment #17) > The allowed characters don't contain "%", so percent-encoding does not help. > SeaMonkey team should deal with the change. Actually, "Insert > Named Anchor" is a function in TB ... and I'm not aware of any complaints ;-) It just overly restricts what you can use in the name.

Masatoshi Kimura [:emk]

Assignee

Comment 20

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb0cc9cfe18d847b0ecc4a6444d40672398343fa

Comment hidden (mozreview-request)

Masatoshi Kimura [:emk]

Assignee

Updated

•

8 years ago

Assignee: nobody → VYV03354

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Jorg K (CEST = GMT+2)

Updated

•

8 years ago

Comment 22

•

8 years ago

(In reply to Masatoshi Kimura [:emk] from comment #17) > (In reply to Jorg K (GMT+2) from comment #16) > The allowed characters don't contain "%", so percent-encoding does not help. > SeaMonkey team should deal with the change. Filed bug 1371559 for that. Thanks for researching this and thanks for addressing this bug here.

(no longer active)

Comment 23

•

8 years ago

mozreview-review

Comment on attachment 8876028 [details] Bug 1371010 - Stop percent-encoding href attributes when serializing documents. https://reviewboard.mozilla.org/r/147482/#review151920 While I think your patch is good, because it touches on encoding, I think Henri (hsivonen) is a better reviewer, do you mind redirecting the review to him, please? Also, in the interest of ensuring cross-browser compatibility about this, can you please add some web-platform tests for this to https://searchfox.org/mozilla-central/source/testing/web-platform/tests/domparsing/XMLSerializer-serializeToString.html? Thanks!

Attachment #8876028 - Flags: review?(ehsan)

Comment hidden (mozreview-request)

James Graham [:jgraham]

Comment 27

•

8 years ago

mozreview-review

Comment on attachment 8876363 [details] Bug 1371010 - Test to make sure that serializeToString does not percent-encode href attributes. https://reviewboard.mozilla.org/r/147752/#review152322

Attachment #8876363 - Flags: review?(james) → review+

Henri Sivonen (:hsivonen)

Comment 28

•

8 years ago

This looks OK to me, but without a test that tests the query part, too, it's hard to be sure. emk, could you, please, add a test that tests a URL with a query string, too?

Henri Sivonen (:hsivonen)

Comment 29

•

8 years ago

Comment on attachment 8876028 [details] Bug 1371010 - Stop percent-encoding href attributes when serializing documents. Absent a test with query string, I'm not sure if this fully fixes things, but since it's clear that this fixes at least the non-query bits, r+.

Attachment #8876028 - Flags: review?(hsivonen) → review+

Comment hidden (mozreview-request)

Masatoshi Kimura [:emk]

Assignee

Comment 31

•

8 years ago

Comment on attachment 8876028 [details] Bug 1371010 - Stop percent-encoding href attributes when serializing documents. (In reply to Henri Sivonen (:hsivonen) from comment #29) > Absent a test with query string, I'm not sure if this fully fixes things, > but since it's clear that this fixes at least the non-query bits, r+. I added one more wpt to test query parts. Henri, Could you update the review flag via MozReview? Otherwise MozReview does not allow me to land the commit.

Attachment #8876028 - Flags: review+ → review?(hsivonen)

Henri Sivonen (:hsivonen)

Comment 32

•

8 years ago

mozreview-review

Comment on attachment 8876028 [details] Bug 1371010 - Stop percent-encoding href attributes when serializing documents. https://reviewboard.mozilla.org/r/147482/#review152396

Attachment #8876028 - Flags: review?(hsivonen) → review+

Pulsebot

Comment 33

•

8 years ago

Pushed by VYV03354@nifty.ne.jp: https://hg.mozilla.org/integration/autoland/rev/f9f63d01b659 Stop percent-encoding href attributes when serializing documents. r=hsivonen https://hg.mozilla.org/integration/autoland/rev/1be3848d9f52 Test to make sure that serializeToString does not percent-encode href attributes. r=jgraham

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 34

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f9f63d01b659 https://hg.mozilla.org/mozilla-central/rev/1be3848d9f52

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox56: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla56

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML

Test to serialize non-ASCII urls in non-utf-8 documents 8 years ago Masatoshi Kimura [:emk] 222 bytes, text/html;charset=euc-jp		Details
Bug 1371010 - Stop percent-encoding href attributes when serializing documents. 8 years ago Masatoshi Kimura [:emk] 59 bytes, text/x-review-board-request	hsivonen : review+	Details
Bug 1371010 - Test to make sure that serializeToString does not percent-encode href attributes. 8 years ago Masatoshi Kimura [:emk] 59 bytes, text/x-review-board-request	jgraham : review+	Details