Open Bug 1329960 Opened 4 years ago Updated 3 years ago

wrong URI encoding for hash with some special characters

Categories

(Core :: Networking, defect, P3)

53 Branch
defect


People

(Reporter: tristan.fraipont, Unassigned)

Details

(Keywords: dev-doc-needed, site-compat, Whiteboard: [necko-backlog])

Attachments

(2 files)

Attached file hashBug.html
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:51.0) Gecko/20100101 Firefox/51.0
Build ID: 20170103031119

Steps to reproduce:

Set an anchor element's href to "#" followed by any special character, followed by an "out-of-range" encoded character, defined as follows: a percent symbol (%) followed by any value from 81 to 99, e.g. #'%89
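The same behaviour can be reproduced without a DOM through the WHATWG `URL` API, which in spec terms is backed by the same parser as `<a href>` (a minimal sketch; that a given engine follows the spec here is an assumption):

```javascript
// Minimal reproduction using the WHATWG URL API instead of an <a> element.
const url = new URL('http://example.com/');
url.hash = "#'%89"; // special character ' followed by the "out-of-range" %89
const result = url.hash;
// Per the spec, % is not in the fragment percent-encode set, so the
// %89 sequence should survive as-is and must not be double-encoded.
console.log(result);
```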


Actual results:

Only the first special character is encoded; the % of the out-of-range %89 is not.
This results in #%27%91.


Expected results:

Expected #'%91, or #%27%2591, or even #'%2591; the latter two would at least be valid URIs.
I suspect the encoder ignores %91 because it looks like a valid percent-encoded character, and that the decoder throws an error because it doesn't recognize this character.

It should also be noted that both encodeURI and encodeURIComponent encode this string correctly.
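For comparison, here is what the two global functions produce for the string from the report; both escape the bare % of the invalid sequence while leaving the apostrophe alone, since ' is in their unescaped set:

```javascript
// Both encodeURI and encodeURIComponent escape the bare '%' of the
// invalid sequence %89, and neither escapes the apostrophe.
const input = "'%89";
const viaEncodeURI = encodeURI(input);                   // "'%2589"
const viaEncodeURIComponent = encodeURIComponent(input); // "'%2589"
console.log(viaEncodeURI, viaEncodeURIComponent);
```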
Component: JavaScript: Standard Library → DOM
May be related to bug 1093611.
It follows the spec and all browsers agree. (Actually Firefox will encode ' to %27, which does not conform to the spec.)
Status: UNCONFIRMED → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
(In reply to Masatoshi Kimura [:emk] from comment #2)
> It follows the spec and all browsers agree. (Actually Firefox will encode '
> to %27 that does not conform to the spec.)

OK, so the bug is not what I "suspected", but it's still there.
FYI, it also happens with other "special characters" like < or >, and probably others.

If you open a new issue without my wrong and probably misleading assumptions, could you link to it here?
OK, reopening and changing the component. (The URL parser implementation lives under Networking.)
Status: RESOLVED → REOPENED
Component: DOM → Networking
Ever confirmed: true
Resolution: INVALID → ---
Valentin, anything that falls into your jurisdiction?
Assignee: nobody → valentin.gosu
Status: REOPENED → NEW
Whiteboard: [necko-active]
The relevant specs are

 - https://html.spec.whatwg.org/multipage/semantics.html#dom-hyperlink-hash
 - https://url.spec.whatwg.org/#fragment-state


If my understanding of the specs is correct, this code shouldn't log any error:

    function utf8encode(codePoint) {
      if (codePoint >>> 0 !== codePoint) throw codePoint + ' is not a code point';
      /* https://encoding.spec.whatwg.org/#utf-8-encoder */
      if (codePoint >= 0x0000 && codePoint <= 0x007F) return [codePoint];
      let count, offset;
      if (codePoint >= 0x0080 && codePoint <= 0x07FF) [count, offset] = [1, 0xC0];
      else if (codePoint >= 0x0800 && codePoint <= 0xFFFF) [count, offset] = [2, 0xE0];
      else if (codePoint >= 0x10000 && codePoint <= 0x10FFFF) [count, offset] = [3, 0xF0];
      else throw codePoint + ' is not a Unicode code point'; // > 0x10FFFF
      let bytes = [(codePoint >> (6*count)) + offset];
      while (count > 0) {
        let temp = codePoint >> (6 * (count-1));
        bytes.push(0x80 | (temp & 0x3F));
        --count;
      }
      return bytes;
    }
    function percentEncode(byte) {
      if ((byte & 255) !== byte) throw byte + ' is not a byte';
      return '%' + byte.toString(16).toUpperCase().padStart(2, '0');
    }
    function test(from, to) {
      let url = new URL('http://example.com');
      for (let i=from; i<to; ++i) {
        let c = String.fromCodePoint(i);
        url.hash = '#' + c;
        let result = url.hash;
        let expected;
        if (i === 0x00) {
          expected = '';
        } else if (i >= 0x20 && i <= 0x7E) {
          expected = c;
        } else {
          expected = utf8encode(i).map(byte => percentEncode(byte)).join('');
        }
        if (expected) {
          expected = '#' + expected;
        }
        if (result !== expected) {
          console.error('Assert fail: i='+i+' produced '+result+', expected '+expected);
        }
      }
    }
    test(0x00, 0x10FFFF);

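As a quick sanity check of the encoder logic above, the two helpers can be compared against a known UTF-8 encoding (they are restated here, slightly condensed, so the snippet runs on its own; U+00E0 'à' encodes to the bytes C3 A0):

```javascript
// Self-contained copies of the two helpers from the comment above,
// checked against the known UTF-8 encoding of U+00E0 ('à') -> C3 A0.
function utf8encode(codePoint) {
  if (codePoint >>> 0 !== codePoint) throw codePoint + ' is not a code point';
  if (codePoint <= 0x7F) return [codePoint];
  let count, offset;
  if (codePoint <= 0x07FF) [count, offset] = [1, 0xC0];
  else if (codePoint <= 0xFFFF) [count, offset] = [2, 0xE0];
  else [count, offset] = [3, 0xF0];
  let bytes = [(codePoint >> (6 * count)) + offset];
  while (count > 0) {
    bytes.push(0x80 | ((codePoint >> (6 * --count)) & 0x3F));
  }
  return bytes;
}
function percentEncode(byte) {
  return '%' + byte.toString(16).toUpperCase().padStart(2, '0');
}
const bytes = utf8encode(0x00E0);
const encoded = bytes.map(percentEncode).join('');
console.log(bytes, encoded); // bytes = [ 195, 160 ], encoded = '%C3%A0'
```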
Currently, there are these problems:

 1) U+0000 is not removed
 2) U+0020 ( ), U+0022 ("), U+0027 ('), U+003C (<), U+003E (>), U+0060 (`) are percent encoded
 3) Bad encodings starting from U+D800.

This patch fixes problem 2, which seems to be the one reported here.
That is, now this works: test(0x01, 0xD800);
This makes Firefox match Chrome in the U+0001...U+007E range (inclusive).
Attachment #8858154 - Flags: review?(valentin.gosu)
Attachment #8858154 - Attachment description: hash-percent-encoding.path → hash-percent-encoding.patch
Attachment #8858154 - Attachment filename: hash-percent-encoding.path → hash-percent-encoding.patch
Attachment #8858154 - Attachment is patch: true
Ah, there is also a 4th problem, which should be fixed in another bug:

    let url = new URL('http://example.com');
    url.hash = '#à|';

Then I think url.hash should be '#%C3%A0|', but it's '#%C3%A0%7C'.
This behavior (encoding '|' after a non-ASCII character) is hard-coded for some reason.
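The expected-per-spec result for this 4th problem can be checked against a spec-following `URL` implementation (that the host engine, e.g. recent Node.js, matches the spec on this point is an assumption of this sketch):

```javascript
// Checking the 4th problem against a WHATWG URL implementation.
const url = new URL('http://example.com/');
url.hash = '#à|';
// Per https://url.spec.whatwg.org/#fragment-state, à (U+00E0) is outside
// ASCII and gets percent-encoded to %C3%A0, while | (U+007C) is not in the
// fragment percent-encode set and should pass through unchanged.
console.log(url.hash);
```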
Presumably the 3rd problem is because Unicode does not assign characters to those code points (the surrogate range U+D800 to U+DFFF).
Assuming that range can be ignored (I don't know), then the tests are

    test(0x00, 0xD800);
    test(0xE000, 0x110000);

Currently that fails for U+0000 (NULL), U+0020 ( ), U+0022 ("), U+0027 ('), U+003C (<), U+003E (>), U+0060 (`).

After the patch, only for U+0000 (NULL).
Comment on attachment 8858154 [details] [diff] [review]
hash-percent-encoding.patch

Review of attachment 8858154 [details] [diff] [review]:
-----------------------------------------------------------------

So, I think this is ok, except for spaces and \0.
Spaces we always encode (see below), and \0 is a validation error [1], so indeed we might want to strip it.
This patch would break a bunch of tests, so I need to check some use cases before we take it.

[1] https://url.spec.whatwg.org/#fragment-state

::: xpcom/io/nsEscape.cpp
@@ +345,5 @@
>  //   0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
>  {
>       0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  // 0x
>       0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  // 1x
> +   512,1023,   0, 512,1023,   0,1023, 112,1023,1023,1023,1023,1023,1023, 953, 784,  // 2x   !"#$%&'()*+,-./

Previous necko bugs have established that we will always encode spaces in all parts of the URL. I think the relevant reasons are in bug 1184589.
Attachment #8858154 - Flags: review?(valentin.gosu) → review-
(In reply to Valentin Gosu [:valentin] from comment #9)
> Previous necko bugs have established that we will always encode spaces in
> all parts of the URL. I think the relevant reasons are in bug 1184589.

But this is a violation of the spec, isn't it? The C0 control percent-encode set does not include spaces, unlike the path percent-encode set or the userinfo percent-encode set.

If encoding spaces is so important, can't Anne change the spec?
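(Note for later readers: the spec did change afterwards; the current fragment percent-encode set includes U+0020 along with ", <, > and `, so a modern spec-following implementation does encode spaces in fragments. The snippet below assumes a recent engine such as current Node.js:)

```javascript
// Under the current URL Standard, a space in a fragment is percent-encoded.
const url = new URL('http://example.com/');
url.hash = '#a b';
console.log(url.hash); // a modern engine serializes this as '#a%20b'
```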
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Assignee: valentin.gosu → nobody
Priority: P1 → P3
Whiteboard: [necko-active] → [necko-backlog]