Closed Bug 316727 Opened 19 years ago Closed 8 years ago

IDN processing code should be ready to enforce script-mixing constraints

Categories

(Core :: Networking, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: usenet, Assigned: usenet)

References

(Blocks 1 open bug)

Details

(Keywords: sec-low, Whiteboard: [sg:low] spoof)

Attachments

(9 files, 12 obsolete files)

3.39 KB, text/plain
Details
7.37 KB, text/plain
Details
19.85 KB, patch
Details | Diff | Splinter Review
23.45 KB, text/html
Details
4.03 KB, text/plain
Details
549 bytes, text/plain
Details
10.15 KB, text/html
Details
17.69 KB, text/html
Details
17.97 KB, text/html
Details
Since Unicode Technical Report #36 and the draft of UTR #39 appear to be likely to recommend the application of script-mixing rules to IDNs. 

These rules are designed to prevent IDN homograph name-spoofing, and cannot be addressed by either the current IDN display blacklist or DNS output whitelist mechanisms.

The IDN-processing code may soon need to be able to enforce these checks, if the rules become official.
Blocks: 316730
If this is about to be a new recommendation in a public spec perhaps we don't need the confidential flag. It's public, and being new people will assume we may not yet implement it.
No longer blocks: 316730
Whiteboard: [sg:low] spoof
Fine by me, please can someone remove the security tag?
Blocks: 316730
Assignee: nobody → smontagu
Group: security
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
Blocks: 309435
Attached file A first cut at a character range table (obsolete) —
This is a first cut at defining a character range table, based on Unicode 4.1.0
Scripts.txt and Blocks.txt tables, with some hand amendments to:

* slice up the Basic Latin and Latin-1 Supplement to remove control chars and
punctuation, and to separate the ASCII letters and digits

* slice up the Greek and Coptic block, to allow clearer distintctions -- note
that Greek / Coptic disunification is an issue here, and this may need to be
tweaked further to be Unicode 3.2 compatible

In addition, the pre-processor program used has:
* removed blocks containing _nothing_but_ Common or Inherited characters
* removed blocks containing only presentation forms (should not be used for
naming, and should be normalized away)
* removed blocks containing other forms that will be normalized away
(halfwidth, fullwidth, circled, subscripts + superscripts)
* removed blocks exclusively containing musical or other symbols
* removed the Braille block: raw Braille is in effect a transliteration /
presentation form for other scripts, and the presence of raw Braille in a name
creates a spoofing risk for Braille users

However, the IPA block has been kept, even though IPA is usually only used for
transliteration, as some languages such as Zulu and Hungarian use a few of
these symbols. The IPA Spacing Modifiers have gone, since no such use is
indicated.

In all, this table has 82 blocks defined (some of which are adjacent, and could
be merged), which map to a total of 59 scripts.

Note: these blocks are selected for one-block -> one-script mappings. Any of
these blocks may, and often do, contain non-script characters and/or
unallocated code points; they are purely intended to stop script mixing, and
for no other purpose.

Note again: Greek / Coptic unification / disunification presents a particularly
knotty problem, which needs further thought.
Assignee: smontagu → usenet
Status: NEW → ASSIGNED
Just to explain the Greek / Coptic unification issue:

In earlier versions of Unicode, Greek and Coptic scripts were unified, so Coptic shared the corresponding Greek characters in the "Greek and Coptic" block, with some special characters of its own. Now, however, a disunified Coptic block has been produced with its own versions of the Coptic characters (with characters which are, of course, very confusable for the corresponding Greek characters, and vice versa). However, seven Coptic-unique characters (in both uppercase and lowercase forms) remain in the Greek/Coptic block. 

Question: it is presumably safe to mix the Greek characters and the seven Coptic characters in that block: is it ever justified to mix:
* The Coptic-unique chars in the Greek/Coptic block and the characters in the new Coptic block?
* The Greek Extended set and the new characters in the new Coptic block?
* etc...
Assignee: usenet → smontagu
Status: ASSIGNED → NEW
One final point: no attempt has been made in the generation of the list above to filter out dead (eg. Linear B) or artificial (eg. Shavian) scripts from the mix: any attempt to do this is fraught with potential discord (where do you stop cutting? whose toes are you treading on?).

Regarding special non-letter characters:

There are currently three outstanding IDN issues involving the use of characters other than letters or digits:
* allowing ETHIOPIAN WORDSPACE in Ethiopian
* allowing GERESH and GERSHAYIM in Hebrew
* allowing "l·l" to be used in Catalan

The first two do not affect this proposed test at all, since they already share the appropriate script blocks, so implementing this either of these would need no change to the code.

We can deal with the last in two ways possible:
* The most restrictive possible interpretation, allowing "l·l" as a special composite grapheme, treating it as if it was in the Latin block.
* Allowing MIDDLE DOT into the Latin block

I prefer the former, since for security, most restrictive is always best, but it will require special-case code to implement. The simplest and easiest way to implement it would be to do the following:

  canonical_name = string.replace(name_string, "l·l", "ll") 

[or even 

  canonical_name = string.replace(name_string, "l·l", "lll")

if we want to patch the buffer in place]

and then apply the tests as before, but to canonical_name. Clearly, this approach will work well provided the number of exceptions is limited, but will get unwieldy and inefficient if many such exceptions are created.
And another thought: if we want to give ETHIOPIC WORDSPACE the same semantics as the hyphen, ie. not allowed at the beginning or the end, (which is, I believe, the intent) we could also add the pre-processing rule

canonical_name = string.replace(name, ETHIOPIC_WORDSPACE, "-")

which would have the desired effect.

Inspection of the Unicode 3.2 tables shows that the Coptic block did not exist at that time, and hence Unicode 3.2 treats the two scripts as unified. 

Since IDN is currently (2005) defined as using Unicode 3.2, we can simply remove the dedicated Coptic block from our list, and treat Greek+Coptic as a single script for the purposes of this filter.
Oh, and one last thing: characters not in any block should be regarded as invalid, unless transformed by one of the preprocessing rules above. Remember, the result of this is intended to force Punycode display in the GUI in an attempt to prevent spoofing, rather than to block IDNs from working, so interpreting this list in a very conservative way is OK from the viewpoint of interoperability.
Attached file Character table, v1.1 (obsolete) —
This is a reworking of the character table.

It adds HYPHEN as a special character class, and provides an extra distinction to help handle the Coptic unification problem by labeling the Coptic characters currently sharing a block with the Greek characters as COPTIC_EXTRAS, which are presumably OK to mix either with the GREEK range or the COPTIC range, but not both.
Attachment #204985 - Attachment is obsolete: true
Question: should the COMBINING DIACRITICAL MARKS be allowed at all, or are only precomposed characters allowed? If the combining marks were to be allowed, what scripts should they be allowed to mix with? 
Attached file Script range table, v1.2 (obsolete) —
Now with defines created for label tags, and comments added to refer back to original and/or my created range names.
Attachment #205177 - Attachment is obsolete: true
Script range codes are as in attachment 205179 [details]: "Script range table, v1.2"

Reference: http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts
with Restriction Level 2, "Highly Restrictive", Unicode 4.1.0 Scripts.txt and Blocks.txt, amended manually for special treatment within ASCII ranges, and special treatment of the Coptic characters within the Greek and Coptic block.

This last is intended to allow for Coptic disunification, which is currently prevented by the existing Nameprep, but may later need to be implemented.

[Note: the ASCII_DIGITS and HYPHEN are treated as character classes of their own, since it may eventually be decided that one or both of these is incompatible with one of the scripts given below.]
[correction of the above] for "proper subset", read "subset".
Note: this code compiles, but has not yet been tested in any way.

There's some scope for making it faster, and with a smaller footprint, but it'll probably do in more-or-less its present form for testing purposes.
This is a second version of the above, with a bit of tidying and a bug fix for the case of names without any letters in.
Attachment #205311 - Attachment is obsolete: true
Attachment #205312 - Attachment is obsolete: true
Implementing combining marks raises interesting questions about the situations in which they are appropriate: for example, what about latin accents on hanji, and other nonsensical combinations? Should we simply refuse to display names containing combining marks at all?
Attachment #205321 - Attachment is obsolete: true
Note: the logical place to hook this in appears to be what is now called isOnlySafeChars(), defined at 

http://lxr.mozilla.org/mozilla1.8/source/netwerk/dns/src/nsIDNService.cpp#61

which seems to have the same purpose of testing whether a Unicoded domain name string is OK to display in the user interface. However, it gets called in three places, and these need checking.

isOnlySafeChars() is defined in terms of UTF16, so the data will need to be converted to UCS4 before the checking code is called.
Attached file version 5 of the script checking code (obsolete) —
This now lets the underscore through as a "word separator", for compatibility with an unfortunately large set of existing DNS labels.

Thinks: will this code need to recognize and special-case support IPv6 addresses with colons in, or will these always go through another code path? (one idea might be to treat colons as a special character class that can only be mixed with ASCII digits in a label -- but what about the square brackets that surround v6 addresses in URLs -- will they pass through this code path?).
Attachment #205344 - Attachment is obsolete: true
A possible approach: declare '[', ':', and ']' to be SCRIPT_IPV6_SPECIALS, and allow the set

SCRIPT_ASCII_DIGITS
SCRIPT_IPV6_SPECIALS

as a special-case combination. Remember that this code is only intended as a belt-and-braces double-check for possible script spoofing, and code elsewhere should already be parsing and checking for valid name and address string formats.

In newer RFCs, SRV RRs with names _starting with_ underscores are now used. Changed patch to be more lenient to reflect this usage. Kept the hyphen restriction, since this is still necessary to prevent Unix command-line option ambiguity.
Attachment #205390 - Attachment is obsolete: true
More notes:

There are three routines under consideration here, with confusingly similar functionality, that however need to be kept carefully distinct. Perhaps the first and last should be renamed, to make things clearer?

1) The predicate isOnlySafeChars() roughly means "there are no easily programmatcally detected opportunities for visual spoofing in rendering this Unicode string as a host name". Failing this should result in the text being presented as Punycode in the UI. Note that this should always be called on the result of a string that has been through, at the least, Nameprep preprocessing.

In practice, this is going to mean the following:
* no script mixing, unless the scripts are special compatible scripts
* no unreasonable combining-mark seqences (leading combining marks, combining marks applied to digits or word separators)
* no characters in the IDN-display special-case character blacklist

Moreover, since this test is called on the whole DNS name string at once, the checks above must be applied on a label-by-label basis. This assumes that, at the very least, all valid label-separator characters in the DNS name have been already converted into '.' characters, as per the IDN RFCs.  

2) The predicate isValidHostName() means "this string consists only of the characters known to be valid in the print-representations of FQDNs, IPv4 addresses or IPv6 addresses (assuming IDN Nameprep and Punycode processing have been applied first)". Failing this should result in the hostname/IP address string being regarded as invalid, and all lookups of this FQDN should fail.

3) The UseSTD3ASCII test in the punycode() routine means "this string consists only of the characters known to be valid in _a_single_label_ of an FQDN (assuming IDN Nameprep processing has been applied first), and does not start or end with a hyphen". Failing this should result in the hostname string being regarded as invalid, and thus the punycode() operation, and hence all other operations relying on punycode(), including DNS lookups, should fail.

_If_ both the tests above were to be merged into the code, and _if_ all the other domain name/IP address processing code were to be _demonstrably_watertight_, this test could probably be omitted. However, for now (and possibly for a long time), it's a useful belt-and-braces check. When the other patches below have been deployed and tested in the wild for some time, it's worth considering changing this into a fatal assertion of the "this must never happen" variety.
Attachment #205536 - Attachment is obsolete: true
Assignee: smontagu → usenet
Status: NEW → ASSIGNED
This is an unfinished, unpolished, untested patch for both this and bug 316444. But it's nearly at the "end of the beginning phase"; it compiles, all of its parts have each been individually tested, and it needs only a couple of lines of code to become functional.
Depends on: 320568
Attached patch slight tweak of the above (obsolete) — Splinter Review
I've just hosed my build environment, and need to rebuild... saving here just in case...
Whilst I'm waiting to get a properly bootable build machine back, I've just recieved some more feedback about character repertoires:

Briefly: 
* there _may_ be a need to support some of the CJK Compatibility stuff... (3300..33FF, F900..FAFF, FE30..FE4F, 2F800..2FA1F)
* there _may_ be a need to support some of the Hangul Conjoining Jamos (3130..318F) to support some obsolete Korean syllables found in some company names

However: since most Korean-language systems do not support the older syllables properly, companies are likely to use only the modern precomposed syllables: so,  after checking with a native speaker, I think it is OK to go on excluding them.

The CJK Compatibility Ideographs are another issue, and this will require some codepoint-crunching to check in detail.

Note: the most common repertoires in use for IDNs in the CJK languages are:

    http://www.iana.org/assignments/idn/cn-chinese.html
    http://www.iana.org/assignments/idn/tw-traditional-chinese.html
    http://www.iana.org/assignments/idn/jp-japanese.html
    http://www.iana.org/assignments/idn/kr-korean.html 
Ooops.. 

U+1100..U+11FF   Hangul Jamo 

U+3130..U+318F   Hangul Compatibility Jamo

Two different things...

I've now analyzed the data from the ICANN CJK code point assignments given above, cross-referenced against the Unicode 4.1.0 code block assignments, and I get the following:

Writing system: zh-cn (= 50510 codepoints)
Writing system: zh-tw (= 50697 codepoints)
Writing system: jp (= 13142 codepoints)
Writing system: ko (= 11209 codepoints)
Not in use: Hangul Jamo (Hangul)
Not in use: CJK Radicals Supplement (Han)
Not in use: Kangxi Radicals (Han)
Not in use: Bopomofo (Bopomofo)
Not in use: Hangul Compatibility Jamo (Hangul)
Not in use: Bopomofo Extended (Bopomofo)
Not in use: CJK Compatibility Ideographs (Han)
Not in use: CJK Unified Ideographs Extension B (Han)
Not in use: CJK Compatibility Ideographs Supplement (Han)

This is mildly reassuring.  More to come... 
Here are some more thoughts about safe/risky character ranges:

According to the Unicode 3.0 standard, 

"Characters unique to the U source are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block are either duplicates or unifiable variants of characters in the one of the other blocks and are included in the Unicode Standard for reasons of round-trip compatibility."

So, it looks like blocking the CJK Compatibility ranges will block _at_least_ 800 possible confusing duplicates, at the cost of losing perhaps 12 characters.
For security purposes, I think this is a no-brainer, particularly since none of the codepoints in these ranges are used by the IANA-registered character sets indpendently compiled by the main registries using the ch-zh, cn-tw, ko and jp writing systems. 

For this reason, I intend to deprecate the "CJK Compatibility" ranges for the Han script.  If CJK language experts agree that a small subset of these are safe for use, they can always be individually whitelisted later.

The CJK and KangXi radical ranges are similarly dubious: according to the Unicode 3.0 spec, most of them are also encoded in the CJK Unified Ideographs block. So, they should be deprecated too.

Similar reasoning applies to the Hangul Jamo and Compatibility Jamo ranges, which provide another way of specifying essentially the same character set encoded in the precomposed Hangul Syllable block. So they get deprecated, too,

Speaking of radicals, the Yi Radicals do not seem to have any meaning other than for lexicographic use, and they contain at least a couple of glaring spoofs for the Yi Syllables, which are (according to the spec) what is used for normal linguistic purposes. I'm not sure what to do here: without a strong reason to deprecate them, I think I'll keep them in, but just flag this as a potential problem for later consideration. Hopefully, Nameprep v2 will make this clearer.

Bopomofo, on the other hand, appears to be a different matter. As far as I understand it, this is a separate auxiliary script used in the zh-tw writing system only. UTR #36 specifically suggests that it is safe to use in its proposed "Highly Restrictive" script-mixing profile. So, on this evidence, Bopomofo should stay in, even though it is not currently used in any registry I know of.

Similarly, I can't see any evidence that the CJK Unified Ideographs Extensions A and B, contain any harmful characters; as far as I'm aware, they're just rare, not confusing.

So, in summary, out of the CJK ranges listed as usused in the previous comment:

DEPRECATE:

CJK Compatibility Ideographs (Han)
CJK Compatibility Ideographs Supplement (Han)
Hangul Jamo (Hangul)
CJK Radicals Supplement (Han)
Kangxi Radicals (Han)
Hangul Compatibility Jamo (Hangul)

KEEP AS VALID RANGES, EVEN THOUGH NOT CURRENTLY USED:

CJK Unified Ideographs Extension B (Han)
Bopomofo (Bopomofo)
Bopomofo Extended (Bopomofo)
New script-block classification tables, as per previous comment. Note that there is an explicit SCRIPT_DEPRECATED tag, to make this table more maintainable. 

I should probably also define SCRIPT_TABLE_SIZE in terms of the ratio of the lengths of script_ranges[] and the script_range structure, which will also make the code more maintainable.
Attachment #205179 - Attachment is obsolete: true
This now 
* properly deals with the punctuation characters in the Latin-1 Supplement, 
* adds MIDDLE DOT to the Latin script, in order to support the new .cat domain.
* properly filters out various highly risky compatibility character ranges, including the Hangul Jamo (note: all existing Korean IDNs already correctly use the Hangul syllables instead)
* Added a new character class SCRIPT_DEPRECATED for these last.

Also, it now actually compiles, and doesn't seem to crash anything. However, this is still not properly tested, so please don't use it for anything.
Attachment #205818 - Attachment is obsolete: true
Attachment #206046 - Attachment is obsolete: true
Attachment #206131 - Attachment is obsolete: true
Depends on: 321491
*** Bug 354592 has been marked as a duplicate of this bug. ***
Tag-soup HTML file for this code: contains domains in a wide variety of languages.

Unicode chars are escaped using entities. You will need to add "idntest" to your whitelist of IDN-enabled domains to make this work, and to create a local DNS server that is authoritative for the "idntest" domain.
Note: from a mixture of CLDR and Wikipedia data -- this is mainly for reference, not script-mixing, since some languages will map to multiple incompatible writing systems because they have to different orthographies.

What is interesting here are the characters which are _not_ members of any script system which are part of the orthography of some languages, eg U+200C, U+200D, U+2019, U+02BC...
These are the characters used in language orthographies which are not included in any script system. 

MIDDLE DOT is also used in Catalan, but is not listed here, because it is already handled in the current code.

ETHIOPIC WORDSPACE is another edge-case, as are the GERESH and GERSHAYIM in Hebrew.
I've now filed bug 355156 for the separate issue of display of IDNs which have labels with different BiDi directions: eg http://עברית.idntest/
This tag-soup HTML file contains links to domains with mixed-script labels which _should not_ be displayed if the code in this patch is working correctly.

To make this work, you will need to enable IDNs for the (testing-only) ".idntest" domain.
(script allocations as of Unicode 4.1.0, BMP only, may also contain broken surrogate sequences.)
Less dramatically wrong than the previous set. The edge case of completely invalid Unicode (broken surrogate pairs, non-characters, etc, etc.) needs further testing, but looks safe for now.
QA Contact: amyy → i18n
Neil, are you still working on this?
Component: Internationalization → Networking
Flags: needinfo?(usenet)
Not currently, but I can do so if there's a need for it.
Flags: needinfo?(usenet)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: