1357416 - URLs should be normalized (Unicode NFKC)

Assignee

Description

•

8 years ago

What did you do?
================

There are URLs that appear the same when displayed, but some forms are 404s. For example, the Greek word "Εφαρμογές" can be encoded as at least two URLs:

1. https://developer.mozilla.org/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%AD%CF%82 (NFC)
2. https://developer.mozilla.org/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%B5%CC%81%CF%82 (NFD)

The Bengali translation of Firefox (ফায়ারফক্স) can be represented in at least two ways:

3. https://developer.mozilla.org/bn-BD/docs/Mozilla/%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0%E0%A6%AB%E0%A6%95%E0%A7%8D%E0%A6%B8 (non-normalized)
4. https://developer.mozilla.org/bn-BD/docs/Mozilla/%E0%A6%AB%E0%A6%BE%E0%A6%AF%E0%A6%BC%E0%A6%BE%E0%A6%B0%E0%A6%AB%E0%A6%95%E0%A7%8D%E0%A6%B8 (NFC / NFD)

What happened?
==============

In Chrome and Firefox, URL 1 links to the Greek translation of the App Center page, and URL 2 is a 404. In Safari, URLs 1 and 2 both return the App Center page.

In Chrome and Firefox, URL 3 was a 404, and URL 4 linked to the Bengali Firefox page. Safari got a 404 for both URLs. The slug was changed to match URL 3, and now works for all browsers.

What should have happened?
==========================

* Fields that appear in URLs, such as slugs and tags, should be converted to NFC (Normalization Form C, or "Composed") before being stored in the database.
* Existing database data should be converted to NFC.
* Middleware should redirect incoming requests that are not already NFC encoded.
* (Maybe) Other fields that expect Unicode should be converted to NFC when saved to the database or used in forms.

Is there anything else we should know?
======================================

Unicode Normalization FAQ: http://www.unicode.org/faq/normalization.html

Normalization matters when a letter includes one or more combining marks. For example, the word "Résumé" has a lower case "e" with an acute mark.

NFD (Normalization Form D, or "Decomposed") uses multiple code points to represent characters with combining marks. In NFD, "é" is represented as two code points, \x65\u0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). If multiple combining marks are used, they are re-ordered to a normalized order.

NFC (Normalization Form C, or "Composed") starts with NFD, and then combines some code points again. In NFC, "é" is represented as one code point, \xe9 (LATIN SMALL LETTER E WITH ACUTE).
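The two forms can be compared with Python's standard `unicodedata` module. This is a minimal Python 3 sketch (the snippets later in this bug are Python 2); the helper name `quoted` is made up for illustration and is not part of Kuma.

```python
import unicodedata
from urllib.parse import quote


def quoted(text, form):
    """Normalize text to the given Unicode form, then percent-encode it as UTF-8."""
    return quote(unicodedata.normalize(form, text).encode("utf-8"))


# "é" is one code point in NFC and two in NFD
nfc_e = unicodedata.normalize("NFC", "e\u0301")   # U+00E9
nfd_e = unicodedata.normalize("NFD", "\u00e9")    # U+0065 U+0301

# The Greek slug from URLs 1 and 2
greek = "Εφαρμογές"
print(quoted(greek, "NFC"))  # slug of URL 1
print(quoted(greek, "NFD"))  # slug of URL 2

# U+09DF (BENGALI LETTER YYA) is excluded from composition, so NFC
# keeps it decomposed as U+09AF U+09BC. This is why URL 4 is both NFC and NFD.
bengali_yya = unicodedata.normalize("NFC", "\u09df")
```

Both printed slugs display identically in a browser's address bar, but they are different byte sequences and so match different database rows.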

Kadir Topal [:atopal]

Updated

•

8 years ago

Keywords: in-triage

John Whitlock [:jwhitlock]

Assignee

Updated

•

7 years ago

Updated

•

7 years ago

Blocks: 1528745

John Whitlock [:jwhitlock]

Assignee

Comment 1

•

7 years ago

Here's some code to detect the problem:

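# Python 2. Assumes the wiki Document model is already imported (e.g. in a Kuma Django shell).
import unicodedata
import urllib

for doc in Document.objects.filter_for_list().order_by('id'):
    # Undo any percent-encoding stored in the slug, then normalize to NFC
    unquote_slug = urllib.unquote(doc.slug)
    normslug = unicodedata.normalize('NFC', unquote_slug)
    # Re-encode the normalized slug for comparison with slugs stored in quoted form
    encode_normslug = normslug.encode('utf8')
    quote_slug = urllib.quote(encode_normslug)
    # Report slugs that match neither the normalized nor the re-quoted form
    if normslug != doc.slug and quote_slug != doc.slug:
        full = doc.get_full_url()
        print "* %s: %s" % (doc.id, full)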

Here are the 24 results:

Florian Scholz (Open Web Docs)

Updated

•

7 years ago

Priority: -- → P3

Whiteboard: [specification][type:bug] → [specification][type:bug][points=3]

John Whitlock [:jwhitlock]

Assignee

Comment 2

•

6 years ago

When "%25" is in the URL, the issue is that the slug was double-encoded. For document 3863, the stored slug was Web/HTML/HTML5_%E8%A1%A8%E5%96%AE. When this is URL-encoded, each % is converted to %25, so the URL is https://developer.mozilla.org/zh-TW/docs/Web/HTML/HTML5_%25E8%25A1%25A8%25E5%2596%25AE.

The solution is to manually unquote the slug:

from urllib import unquote

doc = Document.objects.get(id=3863)
# Collapse the double-encoding (%25 -> %), then decode the percent-escapes
doc.current_revision.slug = unquote(doc.current_revision.slug.replace('%25', '%'))
doc.current_revision.save()

This updates the URL to https://developer.mozilla.org/zh-TW/docs/Web/HTML/HTML5_%E8%A1%A8%E5%96%AE.
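The double-encoding can be reproduced outside of Kuma. This is a minimal Python 3 sketch using only `urllib.parse`; `unquote_fully` is a hypothetical helper, not something from the Kuma codebase.

```python
from urllib.parse import quote, unquote

# The slug as it was stored for document 3863: already percent-encoded
stored = "Web/HTML/HTML5_%E8%A1%A8%E5%96%AE"

# Building a URL encodes the slug again, turning every % into %25
in_url = quote(stored)
print(in_url)  # Web/HTML/HTML5_%25E8%25A1%25A8%25E5%2596%25AE


def unquote_fully(value, max_rounds=5):
    """Percent-decode repeatedly until the value stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


# Both the stored slug and the doubly-encoded URL form decode to the same text
fixed = unquote_fully(in_url)
```

Decoding until the value is stable would also mangle a slug that legitimately contains a literal percent sign, so it is only safe when no such slugs exist.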

This is an old translation, and the new version of the page is translated at:

https://developer.mozilla.org/zh-TW/docs/Learn/HTML/Forms

I updated the now-working page to be a redirect to that page.

Here's an auto-fixer:

import urllib

for doc in Document.objects.filter(slug__contains='%').order_by('id'):
    old_url = doc.get_full_url()
    old_slug = doc.slug
    # Collapse the double-encoding (%25 -> %), then decode the percent-escapes
    doc.current_revision.slug = urllib.unquote(old_slug.replace('%25', '%'))
    doc.current_revision.save()
    # Reload the document to pick up the new slug and URL
    new_doc = Document.objects.get(id=doc.id)
    new_url = new_doc.get_full_url()
    print "* %s: %s -> %s" % (doc.id, old_url, new_url)

This would potentially break a page that legitimately used a percent % in the slug, but I could find none. This suggests that % should be forbidden from slugs to avoid this issue in the future.

Converted pages:

There were additional documents that would have caused a collision, and were deleted:

John Whitlock [:jwhitlock]

Assignee

Comment 3

•

6 years ago

This leaves 16 documents that appear to be valid, but require normalization to NFC. Here's that code:

import unicodedata

for doc in Document.objects.filter_for_list().order_by('id'):
    normslug = unicodedata.normalize('NFC', doc.slug)
    # Only touch documents whose slug is not already in NFC
    if normslug != doc.slug:
        old_url = doc.get_full_url()
        doc.current_revision.slug = unicodedata.normalize('NFC', doc.current_revision.slug)
        doc.current_revision.save()
        # Reload the document to pick up the new slug and URL
        new_doc = Document.objects.get(id=doc.id)
        new_url = new_doc.get_full_url()
        print "* %s: %s -> %s" % (doc.id, old_url, new_url)

And here are the documents with now-working URLs:

Summary: URLs should be normalized (Unicode NFC) → URLs should be normalized (Unicode NFC, no percent % allowed)

John Whitlock [:jwhitlock]

Assignee

Comment 4

•

6 years ago

The current slugs with this issue have been manually modified.

A percent sign (%) is already forbidden in slugs. The remaining work is to normalize slugs with NFC.

Summary: URLs should be normalized (Unicode NFC, no percent % allowed) → URLs should be normalized (Unicode NFC)

John Whitlock [:jwhitlock]

Assignee

Comment 5

•

6 years ago

•

Edited

Some of the un-quoted slugs were incorrectly encoded. I attempted to manually fix these, but when I looked at the details, there were collisions, and moved pages, and often translations of deleted pages. I've manually fixed items, but I didn't take careful notes.

I'm working on detecting badly encoded strings in the database, and I'm thinking about solutions.

Assignee: nobody → jwhitlock

Status: NEW → ASSIGNED

John Whitlock [:jwhitlock]

Assignee

Comment 6

•

6 years ago

I ran this to find "badly encoded" URLs:

for doc in Document.objects.only('id', 'locale', 'slug', 'is_redirect'):
    # Count code points in U+0080-U+00BF, which can indicate mis-decoded bytes
    bad_chars = sum(1 for c in doc.slug if 0x80 <= ord(c) <= 0xbf)
    if bad_chars:
        print "%d (%d, %s): %s" % (doc.id, bad_chars, doc.is_redirect, doc.get_full_url())

Most were redirects. There was one remaining that seems to validly use these characters ("Instal·lació" does appear to be "Installation" in Catalan):

https://developer.mozilla.org/ca/docs/Learn/Getting_started_with_the_web/Instal%C2%B7laci%C3%B3_b%C3%A0sica_programari
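A short Python 3 sketch shows why this check flags both damaged slugs and valid ones like the Catalan slug. It assumes the damage came from UTF-8 bytes being decoded as Latin-1, which this bug does not confirm; `looks_mis_decoded` is a hypothetical heuristic.

```python
def suspicious_chars(slug):
    """Count code points in U+0080..U+00BF, the same range as the check above."""
    return sum(1 for c in slug if 0x80 <= ord(c) <= 0xBF)


def looks_mis_decoded(slug):
    """Hypothetical heuristic: True if the slug round-trips from Latin-1 to valid UTF-8."""
    try:
        repaired = slug.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != slug


valid = "Instal\u00b7laci\u00f3"                    # "Instal·lació"; the middle dot is U+00B7
mangled = valid.encode("utf-8").decode("latin-1")   # simulate the assumed mis-decoding

print(suspicious_chars(valid), looks_mis_decoded(valid))      # 1 False
print(suspicious_chars(mangled), looks_mis_decoded(mangled))  # 2 True
```

The range check alone cannot separate the two cases, because the middle dot (U+00B7) falls inside the flagged range. The round-trip test separates them only under the Latin-1 assumption.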

Next, I looked at how Django does this, and it uses NFKC to slugify Unicode.

After some more reading, I found RFC 3987, section 5.3.2.2:

Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-character-normalized rather than apply character
normalization when comparing two IRIs. The exceptions are conversion
from a non-digital form, and conversion from a non-UCS-based
character encoding to a UCS-based character encoding. In these cases,
NFC or a normalizing transcoder using NFC MUST be used for
interoperability. To avoid false negatives and problems with
transcoding, IRIs SHOULD be created by using NFC. Using NFKC may
avoid even more problems; for example, by choosing half-width Latin
letters instead of full-width ones, and full-width instead of
half-width Katakana.

So it looks like the standard is "must be normalized", not "normalized in NFC". I'm inclined to follow Django's lead and use NFKC.

One change that even I can understand is that this URL:

https://developer.mozilla.org/fr/docs/JavaScript/Reference/Instructions/for_each%E2%80%A6in

would stop using the horizontal ellipsis … and change to this URL:

https://developer.mozilla.org/fr/docs/JavaScript/Reference/Instructions/for_each...in

In this case, this was already done, and the first URL is a Kuma redirect to the second. There were three non-redirects that were not already NFKC normalized, which I normalized using the Move Page feature:

I'm going to change the task to converting to NFKC. An NFKC string is also an NFC string, but some NFC strings (like the above) are not NFKC strings. One of the design goals of NFKC is to avoid codepoints that look like other codepoints, and it is suggested as a better choice than NFC for identifiers.
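The difference between NFC and NFKC can be checked with Python's `unicodedata` module. This is a minimal Python 3 sketch; the fullwidth and half-width strings are illustrative values based on the RFC's examples, not slugs taken from MDN.

```python
import unicodedata

# The French slug above, with U+2026 HORIZONTAL ELLIPSIS
ellipsis_slug = "for_each\u2026in"

nfc = unicodedata.normalize("NFC", ellipsis_slug)
nfkc = unicodedata.normalize("NFKC", ellipsis_slug)

print(nfc == ellipsis_slug)  # True: the ellipsis is already valid NFC
print(nfkc)                  # for_each...in

# An NFKC string is also an NFC string: normalizing it to NFC changes nothing
nfkc_is_nfc = unicodedata.normalize("NFC", nfkc) == nfkc

# The RFC's examples: fullwidth Latin letters and half-width Katakana
fullwidth = "\uff26\uff49\uff52\uff45\uff46\uff4f\uff58"     # fullwidth "Firefox"
latin = unicodedata.normalize("NFKC", fullwidth)             # "Firefox"
katakana = unicodedata.normalize("NFKC", "\uff76")           # half-width KA -> U+30AB
```

NFKC is lossy by design: once the ellipsis becomes three periods, the original slug cannot be recovered. This is why the old URL has to survive as a redirect.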

Summary: URLs should be normalized (Unicode NFC) → URLs should be normalized (Unicode NFKC)

John Whitlock [:jwhitlock]

Assignee

Comment 7

•

6 years ago

https://github.com/mozilla/kuma/pull/5344 converts slugs to NFKC when editing pages. This takes care of badly-formed slugs going forward.

There are a few follow-on tasks that remain uncovered:

* Middleware should redirect incoming requests that are not already NFC encoded
* Other fields (such as page content) that expect Unicode should be converted to NFC when saving to the database or used in forms
* Tag names appear in URLs, and also need NFKC

I think these can be addressed by new bugs if and when they cause issues for users.
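The redirect follow-on could start from a helper like the one below. This is a Python 3 sketch under stated assumptions, not Kuma code. `normalized_path` is a hypothetical function, the choice of NFKC follows the slug change in this bug, and wiring it into real middleware (plus query strings and reserved characters such as %2F) is left out.

```python
import unicodedata
from urllib.parse import quote, unquote


def normalized_path(path, form="NFKC"):
    """Return the percent-encoded path in the given form, or None if no redirect is needed.

    Hypothetical helper: a middleware could issue a permanent redirect
    whenever this returns a value.
    """
    decoded = unquote(path)  # percent-decoding assumes UTF-8
    normalized = unicodedata.normalize(form, decoded)
    if normalized == decoded:
        return None
    return quote(normalized)


# URL 2 from the description (NFD) would be redirected to URL 1 (NFC)
nfd_path = "/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%B5%CC%81%CF%82"
nfc_path = "/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%AD%CF%82"

print(normalized_path(nfd_path))  # the NFC path
print(normalized_path(nfc_path))  # None
```

Returning `None` for paths that are already normalized keeps the common case cheap and avoids redirect loops.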

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: developer.mozilla.org → developer.mozilla.org Graveyard

Categories

(developer.mozilla.org Graveyard :: Wiki pages, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: jwhitlock, Assigned: jwhitlock)

References

Details

(Keywords: in-triage, Whiteboard: [specification][type:bug][points=3])

Crash Data

Security

(public)
