Closed
Bug 316727
Opened 19 years ago
Closed 8 years ago
IDN processing code should be ready to enforce script-mixing constraints
Categories
(Core :: Networking, defect)
Tracking
()
RESOLVED
INCOMPLETE
People
(Reporter: usenet, Assigned: usenet)
References
(Blocks 1 open bug)
Details
(Keywords: sec-low, Whiteboard: [sg:low] spoof)
Attachments
(9 files, 12 obsolete files)
Since Unicode Technical Report #36 and the draft of UTR #39 appear to be likely to recommend the application of script-mixing rules to IDNs. These rules are designed to prevent IDN homograph name-spoofing, and cannot be addressed by either the current IDN display blacklist or DNS output whitelist mechanisms. The IDN-processing code may soon need to be able to enforce these checks, if the rules become official.
Comment 1•19 years ago
|
||
If this is about to be a new recommendation in a public spec perhaps we don't need the confidential flag. It's public, and being new people will assume we may not yet implement it.
No longer blocks: 316730
Whiteboard: [sg:low] spoof
Assignee | ||
Comment 2•19 years ago
|
||
Fine by me, please can someone remove the security tag?
Assignee: nobody → smontagu
Group: security
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
Assignee | ||
Comment 3•19 years ago
|
||
This is a first cut at defining a character range table, based on Unicode 4.1.0 Scripts.txt and Blocks.txt tables, with some hand amendments to: * slice up the Basic Latin and Latin-1 Supplement to remove control chars and punctuation, and to separate the ASCII letters and digits * slice up the Greek and Coptic block, to allow clearer distintctions -- note that Greek / Coptic disunification is an issue here, and this may need to be tweaked further to be Unicode 3.2 compatible In addition, the pre-processor program used has: * removed blocks containing _nothing_but_ Common or Inherited characters * removed blocks containing only presentation forms (should not be used for naming, and should be normalized away) * removed blocks containing other forms that will be normalized away (halfwidth, fullwidth, circled, subscripts + superscripts) * removed blocks exclusively containing musical or other symbols * removed the Braille block: raw Braille is in effect a transliteration / presentation form for other scripts, and the presence of raw Braille in a name creates a spoofing risk for Braille users However, the IPA block has been kept, even though IPA is usually only used for transliteration, as some languages such as Zulu and Hungarian use a few of these symbols. The IPA Spacing Modifiers have gone, since no such use is indicated. In all, this table has 82 blocks defined (some of which are adjacent, and could be merged), which map to a total of 59 scripts. Note: these blocks are selected for one-block -> one-script mappings. Any of these blocks may, and often do, contain non-script characters and/or unallocated code points; they are purely intended to stop script mixing, and for no other purpose. Note again: Greek / Coptic unification / disunification presents a particularly knotty problem, which needs further thought.
Assignee: smontagu → usenet
Status: NEW → ASSIGNED
Assignee | ||
Comment 4•19 years ago
|
||
Just to explain the Greek / Coptic unification issue: In earlier versions of Unicode, Greek and Coptic scripts were unified, so Coptic shared the corresponding Greek characters in the "Greek and Coptic" block, with some special characters of its own. Now, however, a disunified Coptic block has been produced with its own versions of the Coptic characters (with characters which are, of course, very confusable for the corresponding Greek characters, and vice versa). However, seven Coptic-unique characters (in both uppercase and lowercase forms) remain in the Greek/Coptic block. Question: it is presumably safe to mix the Greek characters and the seven Coptic characters in that block: is it ever justified to mix: * The Coptic-unique chars in the Greek/Coptic block and the characters in the new Coptic block? * The Greek Extended set and the new characters in the new Coptic block? * etc...
Assignee | ||
Updated•19 years ago
|
Assignee: usenet → smontagu
Status: ASSIGNED → NEW
Assignee | ||
Comment 5•19 years ago
|
||
One final point: no attempt has been made in the generation of the list above to filter out dead (eg. Linear B) or artificial (eg. Shavian) scripts from the mix: any attempt to do this is fraught with potential discord (where do you stop cutting? whose toes are you treading on?).
Assignee | ||
Comment 6•19 years ago
|
||
Regarding special non-letter characters: There are currently three outstanding IDN issues involving the use of characters other than letters or digits: * allowing ETHIOPIAN WORDSPACE in Ethiopian * allowing GERESH and GERSHAYIM in Hebrew * allowing "l·l" to be used in Catalan The first two do not affect this proposed test at all, since they already share the appropriate script blocks, so implementing this either of these would need no change to the code. We can deal with the last in two ways possible: * The most restrictive possible interpretation, allowing "l·l" as a special composite grapheme, treating it as if it was in the Latin block. * Allowing MIDDLE DOT into the Latin block I prefer the former, since for security, most restrictive is always best, but it will require special-case code to implement. The simplest and easiest way to implement it would be to do the following: canonical_name = string.replace(name_string, "l·l", "ll") [or even canonical_name = string.replace(name_string, "l·l", "lll") if we want to patch the buffer in place] and then apply the tests as before, but to canonical_name. Clearly, this approach will work well provided the number of exceptions is limited, but will get unwieldy and inefficient if many such exceptions are created.
Assignee | ||
Comment 7•19 years ago
|
||
And another thought: if we want to give ETHIOPIC WORDSPACE the same semantics as the hyphen, ie. not allowed at the beginning or the end, (which is, I believe, the intent) we could also add the pre-processing rule canonical_name = string.replace(name, ETHIOPIC_WORDSPACE, "-") which would have the desired effect.
Assignee | ||
Comment 8•19 years ago
|
||
Inspection of the Unicode 3.2 tables shows that the Coptic block did not exist at that time, and hence Unicode 3.2 treats the two scripts as unified. Since IDN is currently (2005) defined as using Unicode 3.2, we can simply remove the dedicated Coptic block from our list, and treat Greek+Coptic as a single script for the purposes of this filter.
Assignee | ||
Comment 9•19 years ago
|
||
Oh, and one last thing: characters not in any block should be regarded as invalid, unless transformed by one of the preprocessing rules above. Remember, the result of this is intended to force Punycode display in the GUI in an attempt to prevent spoofing, rather than to block IDNs from working, so interpreting this list in a very conservative way is OK from the viewpoint of interoperability.
Assignee | ||
Comment 10•19 years ago
|
||
This is a reworking of the character table. It adds HYPHEN as a special character class, and provides an extra distinction to help handle the Coptic unification problem by labeling the Coptic characters currently sharing a block with the Greek characters as COPTIC_EXTRAS, which are presumably OK to mix either with the GREEK range or the COPTIC range, but not both.
Attachment #204985 -
Attachment is obsolete: true
Assignee | ||
Comment 11•19 years ago
|
||
Question: should the COMBINING DIACRITICAL MARKS be allowed at all, or are only precomposed characters allowed? If the combining marks were to be allowed, what scripts should they be allowed to mix with?
Assignee | ||
Comment 12•19 years ago
|
||
Now with defines created for label tags, and comments added to refer back to original and/or my created range names.
Attachment #205177 -
Attachment is obsolete: true
Assignee | ||
Comment 13•19 years ago
|
||
Script range codes are as in attachment 205179 [details]: "Script range table, v1.2" Reference: http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts with Restriction Level 2, "Highly Restrictive", Unicode 4.1.0 Scripts.txt and Blocks.txt, amended manually for special treatment within ASCII ranges, and special treatment of the Coptic characters within the Greek and Coptic block. This last is intended to allow for Coptic disunification, which is currently prevented by the existing Nameprep, but may later need to be implemented. [Note: the ASCII_DIGITS and HYPHEN are treated as character classes of their own, since it may eventually be decided that one or both of these is incompatible with one of the scripts given below.]
Assignee | ||
Comment 14•19 years ago
|
||
[correction of the above] for "proper subset", read "subset".
Assignee | ||
Comment 15•19 years ago
|
||
Note: this code compiles, but has not yet been tested in any way. There's some scope for making it faster, and with a smaller footprint, but it'll probably do in more-or-less its present form for testing purposes.
Assignee | ||
Comment 16•19 years ago
|
||
This is a second version of the above, with a bit of tidying and a bug fix for the case of names without any letters in.
Assignee | ||
Comment 17•19 years ago
|
||
Attachment #205311 -
Attachment is obsolete: true
Attachment #205312 -
Attachment is obsolete: true
Assignee | ||
Comment 18•19 years ago
|
||
Implementing combining marks raises interesting questions about the situations in which they are appropriate: for example, what about latin accents on hanji, and other nonsensical combinations? Should we simply refuse to display names containing combining marks at all?
Attachment #205321 -
Attachment is obsolete: true
Assignee | ||
Comment 19•19 years ago
|
||
Note: the logical place to hook this in appears to be what is now called isOnlySafeChars(), defined at http://lxr.mozilla.org/mozilla1.8/source/netwerk/dns/src/nsIDNService.cpp#61 which seems to have the same purpose of testing whether a Unicoded domain name string is OK to display in the user interface. However, it gets called in three places, and these need checking. isOnlySafeChars() is defined in terms of UTF16, so the data will need to be converted to UCS4 before the checking code is called.
Assignee | ||
Comment 20•19 years ago
|
||
This now lets the underscore through as a "word separator", for compatibility with an unfortunately large set of existing DNS labels. Thinks: will this code need to recognize and special-case support IPv6 addresses with colons in, or will these always go through another code path? (one idea might be to treat colons as a special character class that can only be mixed with ASCII digits in a label -- but what about the square brackets that surround v6 addresses in URLs -- will they pass through this code path?).
Attachment #205344 -
Attachment is obsolete: true
Assignee | ||
Comment 21•19 years ago
|
||
A possible approach: declare '[', ':', and ']' to be SCRIPT_IPV6_SPECIALS, and allow the set SCRIPT_ASCII_DIGITS SCRIPT_IPV6_SPECIALS as a special-case combination. Remember that this code is only intended as a belt-and-braces double-check for possible script spoofing, and code elsewhere should already be parsing and checking for valid name and address string formats.
Assignee | ||
Comment 22•19 years ago
|
||
In newer RFCs, SRV RRs with names _starting with_ underscores are now used. Changed patch to be more lenient to reflect this usage. Kept the hyphen restriction, since this is still necessary to prevent Unix command-line option ambiguity.
Attachment #205390 -
Attachment is obsolete: true
Assignee | ||
Comment 23•19 years ago
|
||
More notes: There are three routines under consideration here, with confusingly similar functionality, that however need to be kept carefully distinct. Perhaps the first and last should be renamed, to make things clearer? 1) The predicate isOnlySafeChars() roughly means "there are no easily programmatcally detected opportunities for visual spoofing in rendering this Unicode string as a host name". Failing this should result in the text being presented as Punycode in the UI. Note that this should always be called on the result of a string that has been through, at the least, Nameprep preprocessing. In practice, this is going to mean the following: * no script mixing, unless the scripts are special compatible scripts * no unreasonable combining-mark seqences (leading combining marks, combining marks applied to digits or word separators) * no characters in the IDN-display special-case character blacklist Moreover, since this test is called on the whole DNS name string at once, the checks above must be applied on a label-by-label basis. This assumes that, at the very least, all valid label-separator characters in the DNS name have been already converted into '.' characters, as per the IDN RFCs. 2) The predicate isValidHostName() means "this string consists only of the characters known to be valid in the print-representations of FQDNs, IPv4 addresses or IPv6 addresses (assuming IDN Nameprep and Punycode processing have been applied first)". Failing this should result in the hostname/IP address string being regarded as invalid, and all lookups of this FQDN should fail. 3) The UseSTD3ASCII test in the punycode() routine means "this string consists only of the characters known to be valid in _a_single_label_ of an FQDN (assuming IDN Nameprep processing has been applied first), and does not start or end with a hyphen". Failing this should result in the hostname string being regarded as invalid, and thus the punycode() operation, and hence all other operations relying on punycode(), including DNS lookups, should fail. _If_ both the tests above were to be merged into the code, and _if_ all the other domain name/IP address processing code were to be _demonstrably_watertight_, this test could probably be omitted. However, for now (and possibly for a long time), it's a useful belt-and-braces check. When the other patches below have been deployed and tested in the wild for some time, it's worth considering changing this into a fatal assertion of the "this must never happen" variety.
Assignee | ||
Comment 24•19 years ago
|
||
Attachment #205536 -
Attachment is obsolete: true
Assignee | ||
Updated•19 years ago
|
Assignee: smontagu → usenet
Assignee | ||
Updated•19 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 25•19 years ago
|
||
This is an unfinished, unpolished, untested patch for both this and bug 316444. But it's nearly at the "end of the beginning phase"; it compiles, all of its parts have each been individually tested, and it needs only a couple of lines of code to become functional.
Assignee | ||
Comment 26•19 years ago
|
||
I've just hosed my build environment, and need to rebuild... saving here just in case...
Assignee | ||
Comment 27•19 years ago
|
||
Whilst I'm waiting to get a properly bootable build machine back, I've just recieved some more feedback about character repertoires: Briefly: * there _may_ be a need to support some of the CJK Compatibility stuff... (3300..33FF, F900..FAFF, FE30..FE4F, 2F800..2FA1F) * there _may_ be a need to support some of the Hangul Conjoining Jamos (3130..318F) to support some obsolete Korean syllables found in some company names However: since most Korean-language systems do not support the older syllables properly, companies are likely to use only the modern precomposed syllables: so, after checking with a native speaker, I think it is OK to go on excluding them. The CJK Compatibility Ideographs are another issue, and this will require some codepoint-crunching to check in detail. Note: the most common repertoires in use for IDNs in the CJK languages are: http://www.iana.org/assignments/idn/cn-chinese.html http://www.iana.org/assignments/idn/tw-traditional-chinese.html http://www.iana.org/assignments/idn/jp-japanese.html http://www.iana.org/assignments/idn/kr-korean.html
Assignee | ||
Comment 28•19 years ago
|
||
Ooops.. U+1100..U+11FF Hangul Jamo U+3130..U+318F Hangul Compatibility Jamo Two different things...
Assignee | ||
Comment 29•19 years ago
|
||
I've now analyzed the data from the ICANN CJK code point assignments given above, cross-referenced against the Unicode 4.1.0 code block assignments, and I get the following: Writing system: zh-cn (= 50510 codepoints) Writing system: zh-tw (= 50697 codepoints) Writing system: jp (= 13142 codepoints) Writing system: ko (= 11209 codepoints) Not in use: Hangul Jamo (Hangul) Not in use: CJK Radicals Supplement (Han) Not in use: Kangxi Radicals (Han) Not in use: Bopomofo (Bopomofo) Not in use: Hangul Compatibility Jamo (Hangul) Not in use: Bopomofo Extended (Bopomofo) Not in use: CJK Compatibility Ideographs (Han) Not in use: CJK Unified Ideographs Extension B (Han) Not in use: CJK Compatibility Ideographs Supplement (Han) This is mildly reassuring. More to come...
Assignee | ||
Comment 30•19 years ago
|
||
Here are some more thoughts about safe/risky character ranges: According to the Unicode 3.0 standard, "Characters unique to the U source are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block are either duplicates or unifiable variants of characters in the one of the other blocks and are included in the Unicode Standard for reasons of round-trip compatibility." So, it looks like blocking the CJK Compatibility ranges will block _at_least_ 800 possible confusing duplicates, at the cost of losing perhaps 12 characters. For security purposes, I think this is a no-brainer, particularly since none of the codepoints in these ranges are used by the IANA-registered character sets indpendently compiled by the main registries using the ch-zh, cn-tw, ko and jp writing systems. For this reason, I intend to deprecate the "CJK Compatibility" ranges for the Han script. If CJK language experts agree that a small subset of these are safe for use, they can always be individually whitelisted later. The CJK and KangXi radical ranges are similarly dubious: according to the Unicode 3.0 spec, most of them are also encoded in the CJK Unified Ideographs block. So, they should be deprecated too. Similar reasoning applies to the Hangul Jamo and Compatibility Jamo ranges, which provide another way of specifying essentially the same character set encoded in the precomposed Hangul Syllable block. So they get deprecated, too, Speaking of radicals, the Yi Radicals do not seem to have any meaning other than for lexicographic use, and they contain at least a couple of glaring spoofs for the Yi Syllables, which are (according to the spec) what is used for normal linguistic purposes. I'm not sure what to do here: without a strong reason to deprecate them, I think I'll keep them in, but just flag this as a potential problem for later consideration. Hopefully, Nameprep v2 will make this clearer. Bopomofo, on the other hand, appears to be a different matter. As far as I understand it, this is a separate auxiliary script used in the zh-tw writing system only. UTR #36 specifically suggests that it is safe to use in its proposed "Highly Restrictive" script-mixing profile. So, on this evidence, Bopomofo should stay in, even though it is not currently used in any registry I know of. Similarly, I can't see any evidence that the CJK Unified Ideographs Extensions A and B, contain any harmful characters; as far as I'm aware, they're just rare, not confusing. So, in summary, out of the CJK ranges listed as usused in the previous comment: DEPRECATE: CJK Compatibility Ideographs (Han) CJK Compatibility Ideographs Supplement (Han) Hangul Jamo (Hangul) CJK Radicals Supplement (Han) Kangxi Radicals (Han) Hangul Compatibility Jamo (Hangul) KEEP AS VALID RANGES, EVEN THOUGH NOT CURRENTLY USED: CJK Unified Ideographs Extension B (Han) Bopomofo (Bopomofo) Bopomofo Extended (Bopomofo)
Assignee | ||
Comment 31•19 years ago
|
||
New script-block classification tables, as per previous comment. Note that there is an explicit SCRIPT_DEPRECATED tag, to make this table more maintainable. I should probably also define SCRIPT_TABLE_SIZE in terms of the ratio of the lengths of script_ranges[] and the script_range structure, which will also make the code more maintainable.
Attachment #205179 -
Attachment is obsolete: true
Assignee | ||
Comment 32•19 years ago
|
||
This now * properly deals with the punctuation characters in the Latin-1 Supplement, * adds MIDDLE DOT to the Latin script, in order to support the new .cat domain. * properly filters out various highly risky compatibility character ranges, including the Hangul Jamo (note: all existing Korean IDNs already correctly use the Hangul syllables instead) * Added a new character class SCRIPT_DEPRECATED for these last. Also, it now actually compiles, and doesn't seem to crash anything. However, this is still not properly tested, so please don't use it for anything.
Attachment #205818 -
Attachment is obsolete: true
Attachment #206046 -
Attachment is obsolete: true
Attachment #206131 -
Attachment is obsolete: true
Assignee | ||
Comment 33•18 years ago
|
||
*** Bug 354592 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 34•18 years ago
|
||
Tag-soup HTML file for this code: contains domains in a wide variety of languages. Unicode chars are escaped using entities. You will need to add "idntest" to your whitelist of IDN-enabled domains to make this work, and to create a local DNS server that is authoritative for the "idntest" domain.
Assignee | ||
Comment 35•18 years ago
|
||
Note: from a mixture of CLDR and Wikipedia data -- this is mainly for reference, not script-mixing, since some languages will map to multiple incompatible writing systems because they have to different orthographies. What is interesting here are the characters which are _not_ members of any script system which are part of the orthography of some languages, eg U+200C, U+200D, U+2019, U+02BC...
Assignee | ||
Comment 36•18 years ago
|
||
These are the characters used in language orthographies which are not included in any script system. MIDDLE DOT is also used in Catalan, but is not listed here, because it is already handled in the current code. ETHIOPIC WORDSPACE is another edge-case, as are the GERESH and GERSHAYIM in Hebrew.
Assignee | ||
Comment 37•18 years ago
|
||
I've now filed bug 355156 for the separate issue of display of IDNs which have labels with different BiDi directions: eg http://עברית.idntest/
Assignee | ||
Comment 38•18 years ago
|
||
This tag-soup HTML file contains links to domains with mixed-script labels which _should not_ be displayed if the code in this patch is working correctly. To make this work, you will need to enable IDNs for the (testing-only) ".idntest" domain.
Assignee | ||
Comment 39•18 years ago
|
||
(script allocations as of Unicode 4.1.0, BMP only, may also contain broken surrogate sequences.)
Assignee | ||
Comment 40•18 years ago
|
||
Less dramatically wrong than the previous set. The edge case of completely invalid Unicode (broken surrogate pairs, non-characters, etc, etc.) needs further testing, but looks safe for now.
Updated•15 years ago
|
QA Contact: amyy → i18n
Comment 41•10 years ago
|
||
Neil, are you still working on this?
Component: Internationalization → Networking
Flags: needinfo?(usenet)
Assignee | ||
Comment 42•10 years ago
|
||
Not currently, but I can do so if there's a need for it.
Flags: needinfo?(usenet)
Updated•8 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•