Closed
Bug 309435
Opened 19 years ago
Closed 9 years ago
Implement ICANN Guidelines for the Implementation of IDNs, v2.0
Categories
(Core :: Networking, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: usenet, Assigned: usenet)
References
(Blocks 1 open bug)
Details
(Keywords: sec-want, Whiteboard: [sg:want][necko-would-take])
Attachments
(1 file, 3 obsolete files)
3.90 KB, text/plain
ICANN has issued a new draft set of guidelines for restrictions of the character repertoire in IDNs: http://icann.org/general/idn-guidelines-20sep05.htm Whilst this is still a draft, it looks extremely likely to pass, as its recommendations reflect the most recent thinking among both software developers and the registry community. In particular, it references Unicode Technical Report #36, which is a well-thought-out response to the domain name homograph spoofing problem with substantial input from the technical community. Mozilla products should enforce these restrictions in order to mitigate domain spoofing threats, which will also have the effect of greatly reducing opportunities for phishing and other general naughtiness.
| Assignee | ||
Updated•19 years ago
|
Severity: normal → major
| Assignee | ||
Comment 1•19 years ago
|
||
A proposed format for IDN whitelists:
example:
# draft legal Unicode whitelist, ASCII-encoded
prefs("network.IDN.whitelist.1",
"30;9 41;19 61;19 AA B2;1 B5 B9;1 BC;2 C0;16 D8;1E F8;149 250;71 2C6;B 2E0;4");
prefs("network.IDN.whitelist.2",
"2EE 37A 386 388;2 38C 38E;13 3A3;2B 3D0;25 3F7;8A 48A;44 4D0;29 500;F 531;25");
prefs("network.IDN.whitelist.3",
"559 561;26 5D0;1A 5F0;2 621;19 640;A 660;9 66E;1 671;62 6D5 6E5;1 6EE;E");
prefs("network.IDN.whitelist.4",
"6FF 710 712;1D 74D;20 780;25 7B1 904;35 93D 950 958;9 966;9 97D 985;7 98F;1");
... more similar lines ...
prefs("network.IDN.whitelist.35",
"1D6FC;18 1D716;1E 1D736;18 1D750;1E 1D770;18 1D78A;1E 1D7AA;18 1D7C4;5");
prefs("network.IDN.whitelist.36",
"1D7CE;31 20000;A6D6 2F800;21D");
advantages:
* human-readable, ASCII-only
* uses character ranges
* slightly compressed, in base-and-offset form, no leading zeroes or U+
* clearly shows base of each range in expected hex format
* _really_ easy to parse
* does not need a dedicated parser in the prefs module, so we don't need to
block on modifying that (rather complex, state-machine driven) code
* should compress well with zlib, if packaged in a jar file
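To illustrate the "really easy to parse" claim, here is a minimal Python sketch of a reader for the proposed pref value format. It assumes each token is a hex base optionally followed by ";" and a hex offset, and that the offset is inclusive (so "30;9" covers U+0030 to U+0039, which matches the ASCII digits, and "41;19" covers A to Z). The function names are hypothetical and are not part of any Mozilla code.

```python
def parse_whitelist(value):
    """Parse space-separated 'BASE' or 'BASE;OFFSET' hex tokens
    into a list of inclusive (first, last) code point ranges."""
    ranges = []
    for token in value.split():
        base_hex, _, offset_hex = token.partition(";")
        base = int(base_hex, 16)
        # Assumption: a token without ';' denotes a single code point.
        offset = int(offset_hex, 16) if offset_hex else 0
        ranges.append((base, base + offset))
    return ranges


def is_whitelisted(code_point, ranges):
    """Return True if code_point falls inside any parsed range."""
    return any(first <= code_point <= last for first, last in ranges)


# A few tokens taken from the start of the draft list above.
ranges = parse_whitelist("30;9 41;19 61;19 AA B2;1 B5")
```

A linear scan is enough for a sketch; a real implementation would more likely sort the ranges once and use a binary search.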
| Assignee | ||
Comment 2•19 years ago
|
||
| Assignee | ||
Comment 3•19 years ago
|
||
| Assignee | ||
Comment 4•19 years ago
|
||
This uses various character property lists, as well as blocking what I consider to be silly characters (dingbats, font variants, and the like)
| Assignee | ||
Comment 5•19 years ago
|
||
Having written the rather hacky and sprawling program above, and realised the
limitations of that ad-hoc approach, I'm now working on a set of small Unix
tools which take several different data sources (the Unicode3.2 data file, the
idn-chars list for Unicode 4.0, the confusables table from Unicode TR 36) and
format them into a single data format containing lines of the form
character-range ; tags ; name ; comment
with a merge program which:
1 reads the pre-processed tag data files via stdin, and merges them
2 applies a positive (keep) priority or a negative (remove) priority to each tag
3 keeps only characters with len({tag_values}) > 0 and abs(max({tag_values})) >
abs(min({tag_values}))
4 outputs these as character ranges
More to come...
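A rough Python sketch of the merge step described above. The tag names and priorities are invented for illustration. The keep rule is one possible reading of step 3: a character survives only if its strongest positive (keep) priority outweighs its strongest negative (remove) priority. That reading is an assumption, since the formula as written would also drop a character carrying a single positive tag.

```python
def keep_character(tags, priorities):
    """Decide whether a character survives, given its tags.

    Assumption: untagged characters are dropped, and the strongest
    'keep' priority must exceed the strongest 'remove' priority.
    """
    values = [priorities[t] for t in tags if t in priorities]
    if not values:
        return False
    strongest_keep = max((v for v in values if v > 0), default=0)
    strongest_remove = max((-v for v in values if v < 0), default=0)
    return strongest_keep > strongest_remove


def to_ranges(code_points):
    """Collapse code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def merge(tagged, priorities):
    """tagged maps code point -> set of tag names (already merged)."""
    kept = [cp for cp, tags in tagged.items()
            if keep_character(tags, priorities)]
    return to_ranges(kept)


# Hypothetical tags and priorities, for illustration only.
PRIORITIES = {"letter": 2, "idn-ok": 5, "confusable": -3}
TAGGED = {
    0x61: {"letter"},
    0x62: {"letter", "idn-ok", "confusable"},
    0x63: {"letter", "confusable"},
    0x64: {"letter"},
}
RESULT = merge(TAGGED, PRIORITIES)
```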
Updated•19 years ago
|
Assignee: nobody → darin
Component: Security → Networking
Product: Firefox → Core
QA Contact: firefox → benc
Version: unspecified → Trunk
| Assignee | ||
Comment 6•19 years ago
|
||
See http://www.icann.org/general/idn-guidelines-14nov05.htm for the latest set of recommendations. Selected highlights from this draft proposal:

"All code points in a single label will be taken from the same script as determined by the Unicode Standard Annex #24: Script Names at http://www.unicode.org/reports/tr24. Exception to this is permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. In such cases, visually confusable characters from different scripts will not be allowed to co-exist in a single set of permissible codepoints unless a corresponding policy and character table is clearly defined."

"Permissible code points will not include:
(a) line symbol-drawing characters (as those in the Unicode Box Drawing block),
(b) symbols and icons that are neither alphanumeric nor ideographic language characters, such as typographic and pictographic dingbats,
(c) characters with well-established functions as protocol elements,
(d) punctuation marks used solely to indicate the structure of sentences.
(e) Punctuation marks that are used within words may only be permitted if they are not excluded by any of the preceding points, are essential to the language of the IDN registration, and are associated with explicit prescriptive rules about the context in which they may be used.
(f) Under corresponding conditions, a single specified character may be used as a separator within a label, either by allowing the hyphen-minus to appear together with non-Latin scripts, or by designating a functionally equivalent punctuation mark from within the script."
Summary: Antispoofing: Implement ICANN Guidelines for the Implementation of IDNs, v2.0 → Implement ICANN Guidelines for the Implementation of IDNs, v2.0
| Assignee | ||
Comment 9•19 years ago
|
||
Just to clarify the above: whilst I started the comments in this bug by talking about defining "safe" character whitelists, it's now clear that defining our own ad-hoc character filtering lists is neither politically nor technically possible at this time.

What we can do (and what I've filed the two sub-bugs on) is to:
* implement the script-mixing rules described in the ICANN recommendation
* tighten up on the implementation of some of the existing IDN/ICANN rules

A reasonable interpretation of the text of the rules seems to me to be to use the script assignments defined in Scripts.txt of the current Unicode version (currently 4.1.0), as amended by any exceptions officially permitted by ICANN, in conjunction with the limited number of permitted "safe" script-mixing sets defined in the "Highly Restrictive" restriction level defined in Unicode Technical Report 36.

However, instead of adopting the script-mixing detection algorithm in UTR 36, which ignores non-identifier characters from the Common or Inherited sets for the purpose of mixed-script detection, I propose (at least at first) to adopt a stricter rule that does not allow _any_ character to be used that does not have an explicitly defined script assignment, unless it is from a list of explicit exceptions endorsed by ICANN on either a per-script or all-scripts basis.

Now, adopting this definition of the script-mixing rules fortuitously just happens to have the side-effect of forbidding any character without a script assignment, unless it is explicitly allowed by a specific variation of those rules. And this is good, because this has the de-facto effect of eliminating most characters that would be eliminated by most reasonably-framed explicit character-filtering lists. ;-)

Since ICANN is quite likely to add extra compatible character / script combinations, any implementation will need to be table-driven so it can be updated rapidly if necessary.
What I suggest as an interface for controlling the script-mixing testing code is that we create config variables that define acceptable sets of characters in terms of a combination of character ranges and symbolic names for previously defined sets of characters. For the sake of illustration, the names of these config variables might be of the forms either
IDN.scriptset.define.xxx
or
IDN.scriptset.accept.xxx

Any non-empty DNS label that contains only characters that are all elements of one or more of the "accept" sets would be allowed. All other strings would be regarded as improper script-mixings.

An example of use might be something like:
IDN.scriptset.define.asciidigits "0030-0039"
IDN.scriptset.define.asciiletters "0061-007A"
IDN.scriptset.define.hyphen "002D"
IDN.scriptset.accept.ldh "asciidigits asciiletters hyphen"
IDN.scriptset.define.cyrillicletters "0400-0449 0500-050F"
IDN.scriptset.accept.cyrillicldh "asciidigits cyrillicletters hyphen"
..etc..
IDN.scriptset.accept.hhkdh "han hiragana katakana asciidigits hyphen"
IDN.scriptset.accept.hbdh "han bopomofo asciidigits hyphen"
and so on...

Now, given a way of defining "proper" mixable character sets in labels, an interesting policy question is then:
* should improper script-mixed labels simply be forced to display as Punycode, or
* should DNS names containing them simply be treated as invalid, and lookups on them made to fail?
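As a sketch of how the IDN.scriptset.define / IDN.scriptset.accept scheme could be evaluated, here is a small Python example. It uses only the sets spelled out in the example above (the han, hiragana, katakana and bopomofo sets are not defined there, so they are left out). It assumes "all elements of one or more of the accept sets" means that at least one single accept set must contain every character of the label. The dictionary layout and function names are hypothetical.

```python
def parse_ranges(spec):
    """Turn '0400-0449 0500-050F' into a set of code points."""
    chars = set()
    for token in spec.split():
        first, _, last = token.partition("-")
        low = int(first, 16)
        high = int(last, 16) if last else low
        chars.update(range(low, high + 1))
    return chars


# Values copied from the example prefs; the dict form is an assumption.
DEFINE = {
    "asciidigits": "0030-0039",
    "asciiletters": "0061-007A",
    "hyphen": "002D",
    "cyrillicletters": "0400-0449 0500-050F",
}
ACCEPT = {
    "ldh": "asciidigits asciiletters hyphen",
    "cyrillicldh": "asciidigits cyrillicletters hyphen",
}


def build_accept_sets(define, accept):
    """Expand each accept entry into a flat set of code points."""
    defined = {name: parse_ranges(spec) for name, spec in define.items()}
    result = {}
    for name, members in accept.items():
        allowed = set()
        for token in members.split():
            # A token is either a previously defined name or a raw range.
            allowed |= defined[token] if token in defined else parse_ranges(token)
        result[name] = allowed
    return result


def label_is_acceptable(label, accept_sets):
    """True if the label is non-empty and fits inside one accept set."""
    if not label:
        return False
    code_points = {ord(ch) for ch in label}
    return any(code_points <= allowed for allowed in accept_sets.values())


ACCEPT_SETS = build_accept_sets(DEFINE, ACCEPT)
```

With these sets, an all-ASCII label and an all-Cyrillic label both pass, while a label mixing Latin "p" with Cyrillic "а" (U+0430) fits neither accept set and is flagged as an improper script-mixing.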
| Assignee | ||
Comment 10•19 years ago
|
||
A thought: since the above is becoming so complex, would it make sense to go the whole hog and define a context-free grammar for IDNs, and compile this into a table to be interpreted by a pushdown acceptor for this grammar? This would also have the side-effect of making the rules extremely clear for multi-character grapheme special cases such as the Catalan l·l.
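For comparison, the Catalan special case on its own can be written as a hand-coded contextual check rather than a grammar. The Python sketch below is not the proposed pushdown acceptor; it only shows the single rule that a middle dot (U+00B7) is acceptable when it sits directly between two lowercase "l" characters. The function name is hypothetical.

```python
MIDDLE_DOT = "\u00b7"


def middle_dots_in_context(label):
    """Return True if every middle dot in label sits between two 'l's.

    Labels without any middle dot pass trivially.
    """
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # A dot at either end has no neighbour on one side.
        if i == 0 or i == len(label) - 1:
            return False
        if label[i - 1] != "l" or label[i + 1] != "l":
            return False
    return True
```

A table-driven grammar would generalise this to other multi-character cases without adding one hand-written function per rule, which is the appeal of the suggestion above.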
| Assignee | ||
Comment 11•19 years ago
|
||
This is a first cut at defining a character range table, based on Unicode 4.1.0 Scripts.txt and Blocks.txt tables, with some hand amendments to:
* slice up the Basic Latin and Latin-1 Supplement to remove control chars and punctuation, and to separate the ASCII letters and digits
* slice up the Greek and Coptic block, to allow clearer distinctions -- note that Greek / Coptic disunification is an issue here, and this may need to be tweaked further to be Unicode 3.2 compatible

In addition, the pre-processor program used has:
* removed blocks containing _nothing_but_ Common or Inherited characters
* removed blocks containing only presentation forms (should not be used for naming, and should be normalized away)
* removed blocks containing other forms that will be normalized away (halfwidth, fullwidth, circled, subscripts + superscripts)
* removed blocks exclusively containing musical or other symbols
* removed the Braille block: raw Braille is in effect a transliteration / presentation form for other scripts, and the presence of raw Braille in a name creates a spoofing risk for Braille users

However, the IPA block has been kept, even though IPA is usually only used for transliteration, as some languages such as Zulu and Hungarian use a few of these symbols. The IPA Spacing Modifiers have gone, since no such use is indicated.

In all, this table has 82 blocks defined (some of which are adjacent, and could be merged), which map to a total of 59 scripts.

Note: these blocks are selected for one-block -> one-script mappings. Any of these blocks may, and often do, contain non-script characters and/or unallocated code points; they are purely intended to stop script mixing, and for no other purpose.

Note again: Greek / Coptic unification / disunification presents a particularly knotty problem, which needs further thought.
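A minimal Python sketch of how a block-to-script range table like this could drive a script-mixing check. The three entries are illustrative placeholders built from standard Unicode block boundaries, not the contents of the attached table, and the strict handling of unlisted characters follows the "no script assignment means not allowed" rule proposed in comment 9. All names here are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only: (first, last, script), sorted by first.
SCRIPT_RANGES = [
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
]
STARTS = [first for first, _, _ in SCRIPT_RANGES]


def script_of(code_point):
    """Look up the script for a code point, or None if unlisted."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        first, last, script = SCRIPT_RANGES[i]
        if code_point <= last:
            return script
    return None


def is_single_script(label):
    """True if every character is listed and all share one script."""
    scripts = set()
    for ch in label:
        script = script_of(ord(ch))
        if script is None:
            # Strict rule: characters without a table entry are rejected.
            return False
        scripts.add(script)
    return len(scripts) == 1
```

Digits and the hyphen are deliberately missing from this excerpt, so a label such as "a1" is rejected here; a fuller table would need the per-script or all-scripts exceptions discussed earlier in the bug.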
Assignee: darin → usenet
Attachment #196934 -
Attachment is obsolete: true
Attachment #196935 -
Attachment is obsolete: true
Attachment #196968 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Updated•19 years ago
|
Flags: blocking1.9a1+
Whiteboard: [sg:want]
Flags: blocking1.9-
Comment 12•16 years ago
|
||
Requesting blocking-1.9.2? just because of the blocking-1.9a1 flag...
Status: ASSIGNED → NEW
Flags: blocking1.9.2?
QA Contact: benc → networking
Updated•15 years ago
|
Flags: blocking1.9.2? → blocking1.9.2-
Updated•9 years ago
|
Whiteboard: [sg:want] → [sg:want][necko-would-take]
Comment 13•9 years ago
|
||
IDNA2008 has superseded the ICANN guidelines from 2005, so this is now WONTFIX.