Closed Bug 1357416 Opened 8 years ago Closed 6 years ago

URLs should be normalized (Unicode NFKC)

Categories

(developer.mozilla.org Graveyard :: Wiki pages, enhancement, P3)

All
Other
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jwhitlock, Assigned: jwhitlock)

References

Details

(Keywords: in-triage, Whiteboard: [specification][type:bug][points=3])

What did you do?
================

There are URLs that appear the same when displayed, but some forms are 404s. For example, the Greek word "Εφαρμογές" can be encoded as at least two URLs:

1. https://developer.mozilla.org/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%AD%CF%82 (NFC)
2. https://developer.mozilla.org/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%B5%CC%81%CF%82 (NFD)

The Bengali translation of Firefox (ফায়ারফক্স) can be represented in at least two ways:

3. https://developer.mozilla.org/bn-BD/docs/Mozilla/%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0%E0%A6%AB%E0%A6%95%E0%A7%8D%E0%A6%B8 (non-normalized)
4. https://developer.mozilla.org/bn-BD/docs/Mozilla/%E0%A6%AB%E0%A6%BE%E0%A6%AF%E0%A6%BC%E0%A6%BE%E0%A6%B0%E0%A6%AB%E0%A6%95%E0%A7%8D%E0%A6%B8 (NFC / NFD)

What happened?
==============

In Chrome and Firefox, URL 1 links to the Greek translation of the App Center page, and URL 2 is a 404. In Safari, URLs 1 and 2 both return the App Center page.

In Chrome and Firefox, URL 1 was a 404, and URL 2 linked to the Bengali Firefox page. Safari got a 404 for both URLs. The slug was changed to match URL 1, and now works for all browsers.

What should have happened?
==========================

* Fields that appear in URLs, such as slugs and tags, should be converted to NFC (Normalization Form C, or "Composed") before being stored in the database.
* Existing database data should be converted to NFC.
* Middleware should redirect incoming requests that are not already NFC encoded.
* (Maybe) Other fields that expect Unicode should be converted to NFC when saved to the database or used in forms.

Is there anything else we should know?
======================================

Unicode Normalization FAQ: http://www.unicode.org/faq/normalization.html

Normalization matters when a letter includes one or more combining marks. For example, the word "Résumé" has a lower case "e" with an acute mark.

NFD (Normalization Form D, or "Decomposed") uses multiple code points to represent characters with combining marks. In NFD, "é" is represented as two code points, \x65\u0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). If multiple combining marks are used, they are re-ordered to a normalized order.

NFC (Normalization Form C, or "Composed") starts with NFD, and then recombines some code points. In NFC, "é" is represented as one code point, \xe9 (LATIN SMALL LETTER E WITH ACUTE).
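As a rough illustration of the two forms described above (a Python 3 sketch, not code from Kuma; the helper name `quoted_forms` is made up for this example), normalizing the Greek word and percent-encoding each form reproduces the path segments of URLs 1 and 2:

```python
import unicodedata
from urllib.parse import quote

# The Greek word from URLs 1 and 2, written with escapes so that the
# normalization of this source file cannot affect the result.
WORD = "\u0395\u03c6\u03b1\u03c1\u03bc\u03bf\u03b3\u03ad\u03c2"


def quoted_forms(text):
    """Return the (NFC, NFD) percent-encoded UTF-8 forms of text."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return quote(nfc), quote(nfd)


nfc_path, nfd_path = quoted_forms(WORD)
print(nfc_path)  # accented epsilon is one code point: ...%CE%AD%CF%82
print(nfd_path)  # epsilon plus combining mark: ...%CE%B5%CC%81%CF%82

# "é" is one code point in NFC and two in NFD.
print(len(unicodedata.normalize("NFC", "e\u0301")))  # 1
print(len(unicodedata.normalize("NFD", "\xe9")))     # 2
```

The two printed paths look identical once a browser renders them, which is why only one of them matches a slug that is compared byte for byte.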
Keywords: in-triage
See Also: → 1473645
Blocks: 1528745

Here's some code to detect the problem:

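# Python 2; run in a Django shell with Kuma's Document model imported.
import unicodedata
import urllib

for doc in Document.objects.filter_for_list().order_by('id'):
  # Undo any percent-encoding stored in the slug, then normalize to NFC.
  unquote_slug = urllib.unquote(doc.slug)
  normslug = unicodedata.normalize('NFC', unquote_slug)
  encode_normslug = normslug.encode('utf8')
  quote_slug = urllib.quote(encode_normslug)
  # Report slugs that match neither the normalized nor the re-quoted form.
  if normslug != doc.slug and quote_slug != doc.slug:
    full = doc.get_full_url()
    print "* %s: %s" % (doc.id, full)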

Here are the 24 results:

Priority: -- → P3
Whiteboard: [specification][type:bug] → [specification][type:bug][points=3]

When "%25" appears in the URL, the slug was double-encoded. For document 3863, the slug was Web/HTML/HTML5_%E8%A1%A8%E5%96%AE. When this is URL encoded, each % is converted to %25, so the URL is https://developer.mozilla.org/zh-TW/docs/Web/HTML/HTML5_%25E8%25A1%25A8%25E5%2596%25AE.

The solution is to manually unquote the slug:

from urllib import unquote
doc = Document.objects.get(id=3863)
doc.current_revision.slug = unquote(doc.current_revision.slug.replace('%25', '%'))
doc.current_revision.save()

This updates the URL to https://developer.mozilla.org/zh-TW/docs/Web/HTML/HTML5_%E8%A1%A8%E5%96%AE.
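The double encoding can be reproduced outside Kuma with the standard library alone. This is a Python 3 sketch (the snippets in this bug are Python 2), and the variable names are only illustrative:

```python
from urllib.parse import quote, unquote

# The slug as it was stored: it already contains percent-escapes.
stored_slug = "Web/HTML/HTML5_%E8%A1%A8%E5%96%AE"

# Building a URL percent-encodes the slug again, so each "%" becomes "%25".
double_encoded = quote(stored_slug)

# Unquoting the stored slug once recovers the intended characters.
fixed_slug = unquote(stored_slug)

# Encoding the fixed slug yields the single-encoded path.
single_encoded = quote(fixed_slug)

print(double_encoded)
print(single_encoded)
```

Here `single_encoded` comes out equal to the original `stored_slug`, which is the path in the corrected URL.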

This is an old translation, and the new version of the page is translated at:

https://developer.mozilla.org/zh-TW/docs/Learn/HTML/Forms

I updated the now-working page to be a redirect to that page.

Here's an auto-fixer:

import unicodedata
import urllib

for doc in Document.objects.filter(slug__contains='%').order_by('id'):
    old_url = doc.get_full_url()
    old_slug = doc.slug
    doc.current_revision.slug = urllib.unquote(old_slug.replace('%25', '%'))
    doc.current_revision.save()
    new_doc = Document.objects.get(id=doc.id)
    new_url = new_doc.get_full_url()
    print "* %s: %s -> %s" % (doc.id, old_url, new_url)

This would potentially break a page that legitimately used a percent % in the slug, but I could find none. This suggests that % should be forbidden from slugs to avoid this issue in the future.

Converted pages:

There were additional documents that would have caused a collision, and were deleted:

This leaves 16 documents that appear to be valid, but require normalization to NFC. Here's that code:

import unicodedata

for doc in Document.objects.filter_for_list().order_by('id'):
  normslug = unicodedata.normalize('NFC', doc.slug)
  # Only touch documents whose slug changes under NFC.
  if normslug != doc.slug:
    old_url = doc.get_full_url()
    doc.current_revision.slug = unicodedata.normalize('NFC', doc.current_revision.slug)
    doc.current_revision.save()
    # Reload to pick up the URL built from the saved slug.
    new_doc = Document.objects.get(id=doc.id)
    new_url = new_doc.get_full_url()
    print "* %s: %s -> %s" % (doc.id, old_url, new_url)

And here are the documents with now-working URLs:

Summary: URLs should be normalized (Unicode NFC) → URLs should be normalized (Unicode NFC, no percent % allowed)

The current slugs with this issue have been manually modified.

A percent sign (%) is already forbidden in slugs. The remaining work is to normalize slugs with NFC.

Summary: URLs should be normalized (Unicode NFC, no percent % allowed) → URLs should be normalized (Unicode NFC)

Some of the un-quoted slugs were incorrectly encoded. I attempted to fix these manually, but when I looked at the details, there were collisions, moved pages, and often translations of deleted pages. I've manually fixed items, but I didn't take careful notes.

I'm working on detecting badly encoded strings in the database, and I'm thinking about solutions.

Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED

I ran this to find "badly encoded" URLs:

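for doc in Document.objects.only('id', 'locale', 'slug', 'is_redirect'):
  # Count characters in U+0080..U+00BF. UTF-8 continuation bytes fall in
  # 0x80..0xBF, so these often mean UTF-8 was decoded as Latin-1.
  bad_chars = sum(1 for c in doc.slug if 0x80 <= ord(c) <= 0xbf)
  if bad_chars:
    print "%d (%d, %s): %s" % (doc.id, bad_chars, doc.is_redirect, doc.get_full_url())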

Most were redirects. There was one remaining that seems to validly use these characters ("Instal·lació" does appear to be "Installation" in Catalan):

https://developer.mozilla.org/ca/docs/Learn/Getting_started_with_the_web/Instal%C2%B7laci%C3%B3_b%C3%A0sica_programari
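To show why that range is a useful signal and why the Catalan slug is a false positive, here is a small Python 3 sketch. The function name `count_suspect_chars` and the Spanish sample word are assumptions for illustration, not data from the database:

```python
def count_suspect_chars(slug):
    """Count code points in U+0080..U+00BF, as in the query above."""
    return sum(1 for c in slug if 0x80 <= ord(c) <= 0xBF)


# UTF-8 bytes wrongly decoded as Latin-1 leave the continuation bytes
# (0x80..0xBF) behind as separate characters.
good = "Instalaci\u00f3n"
mangled = good.encode("utf-8").decode("latin-1")

# A legitimate use: U+00B7 MIDDLE DOT in the Catalan slug above.
catalan = "Instal\u00b7laci\u00f3"

print(count_suspect_chars(good))     # 0
print(count_suspect_chars(mangled))  # 1
print(count_suspect_chars(catalan))  # 1
```

The check cannot tell the mangled string from the Catalan one, so each hit still needs a manual look.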

Next, I looked at how Django does this, and it uses NFKC to slugify Unicode.

After some more reading, I found RFC 3987, section 5.3.2.2:

Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-character-normalized rather than apply character
normalization when comparing two IRIs. The exceptions are conversion
from a non-digital form, and conversion from a non-UCS-based
character encoding to a UCS-based character encoding. In these cases,
NFC or a normalizing transcoder using NFC MUST be used for
interoperability. To avoid false negatives and problems with
transcoding, IRIs SHOULD be created by using NFC. Using NFKC may
avoid even more problems; for example, by choosing half-width Latin
letters instead of full-width ones, and full-width instead of
half-width Katakana.

So it looks like the standard is "must be normalized", not "normalized in NFC". I'm inclined to follow Django's lead and use NFKC.

One change that even I can understand is that this URL:

https://developer.mozilla.org/fr/docs/JavaScript/Reference/Instructions/for_each%E2%80%A6in

would stop using the horizontal ellipsis and change to this URL:

https://developer.mozilla.org/fr/docs/JavaScript/Reference/Instructions/for_each...in

In this case, this was already done, and the first URL is a Kuma redirect to the second. There were three non-redirects that were not already NFKC normalized, which I normalized using the Move Page feature:

I'm going to change the task to converting to NFKC. An NFKC string is also an NFC string, but some NFC strings (like the above) are not NFKC strings. One of the design goals of NFKC is to avoid code points that look like other code points, and it is suggested as a better choice than NFC for identifiers.
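The difference between the two forms can be checked with Python's `unicodedata` module. This Python 3 sketch uses the ellipsis slug from above and the kinds of characters the RFC mentions; the helper names `nfc` and `nfkc` are just shorthand for this example:

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def nfkc(text):
    return unicodedata.normalize("NFKC", text)


ellipsis_slug = "for_each\u2026in"  # U+2026 HORIZONTAL ELLIPSIS

print(nfc(ellipsis_slug) == ellipsis_slug)  # True: NFC leaves it alone
print(nfkc(ellipsis_slug))                  # for_each...in

# Full-width Latin becomes ASCII, half-width Katakana becomes full-width.
print(nfkc("\uff21"))              # A
print(nfkc("\uff76") == "\u30ab")  # True

# Applying NFC to NFKC output changes nothing.
print(nfc(nfkc(ellipsis_slug)) == nfkc(ellipsis_slug))  # True
```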

Summary: URLs should be normalized (Unicode NFC) → URLs should be normalized (Unicode NFKC)

https://github.com/mozilla/kuma/pull/5344 converts slugs to NFKC when editing pages. This takes care of badly formed slugs going forward.

There are a few follow-on tasks that are not covered:

  • Middleware should redirect incoming requests that are not already NFC encoded
  • Other fields (such as page content) that expect Unicode should be converted to NFC when saving to the database or used in forms
  • Tag names appear in URLs, and also need NFKC

I think these can be addressed by new bugs if and when they cause issues for users.
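For whoever picks up the middleware item, the core check might look something like the following Python 3 sketch. It is not Kuma code: the function name `normalized_path` is hypothetical, and the default of NFKC is an assumption based on the retitled bug. A real middleware would wrap this in Django's request and redirect handling.

```python
import unicodedata
from urllib.parse import quote, unquote


def normalized_path(path, form="NFKC"):
    """Return the path to redirect to, or None if already normalized."""
    # Compare decoded text so that hex-digit case in the escapes
    # (%ce vs %CE) does not trigger a redirect by itself.
    decoded = unquote(path)
    normalized = unicodedata.normalize(form, decoded)
    if normalized == decoded:
        return None
    return quote(normalized, safe="/")


nfd_path = "/el/docs/%CE%95%CF%86%CE%B1%CF%81%CE%BC%CE%BF%CE%B3%CE%B5%CC%81%CF%82"
print(normalized_path(nfd_path))
```

One known gap in this sketch: an escaped slash (%2F) in a path that needs normalizing would come back as a literal "/", and query strings are not handled.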

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: developer.mozilla.org → developer.mozilla.org Graveyard