Closed Bug 316727 Opened 20 years ago Closed 10 years ago

IDN processing code should be ready to enforce script-mixing constraints

Tracking

()

Status:

RESOLVED INCOMPLETE

People

(Reporter: usenet, Assigned: usenet)

References

(Blocks 1 open bug)

Details

(Keywords: sec-low, Whiteboard: [sg:low] spoof)

Attachments

(9 files, 12 obsolete files)

A first cut at a character range table 20 years ago Neil Harris 3.90 KB, text/plain		Details
Character table, v1.1 20 years ago Neil Harris 2.29 KB, text/plain		Details
Script range table, v1.2 20 years ago Neil Harris 6.05 KB, text/plain		Details
Suggested possible script mixings, v1.0 20 years ago Neil Harris 3.39 KB, text/plain		Details
A first hack at some C code for enforcing the rules suggested above 20 years ago Neil Harris 9.80 KB, text/plain		Details
A second hack at some C code for enforcing the rules suggested above 20 years ago Neil Harris 9.98 KB, text/plain		Details
script-filtering code; a third try, now passes own self-tests 20 years ago Neil Harris 11.88 KB, text/plain		Details
Version 4 of script-filtering code, now handles combining marks, more test cases 20 years ago Neil Harris 13.66 KB, text/plain		Details
version 5 of the script checking code 20 years ago Neil Harris 13.75 KB, text/plain		Details
version 6: more lenient, does not screw up SRV requests with underscores in 20 years ago Neil Harris 2.61 KB, patch		Details \| Diff \| Splinter Review
script-checking code version 7: label-aware, more sanity checks 20 years ago Neil Harris 16.45 KB, text/plain		Details
Combination patch for both this and bug 316444 20 years ago Neil Harris 16.79 KB, patch		Details \| Diff \| Splinter Review
slight tweak of the above 20 years ago Neil Harris 16.93 KB, patch		Details \| Diff \| Splinter Review
Script range table, v1.3 20 years ago Neil Harris 7.37 KB, text/plain		Details
script-checking code, patch 24 Dec: fix some Latin-1 punctuation issues, MIDDLE DOT for Catalan 20 years ago Neil Harris 19.85 KB, patch		Details \| Diff \| Splinter Review
Test file for this code: contains links to test domains in a wide variety of languages. 19 years ago Neil Harris 23.45 KB, text/html		Details
Exemplar script sets for different languages 19 years ago Neil Harris 4.03 KB, text/plain		Details
Extra non-script-associated characters used as part of an orthography 19 years ago Neil Harris 549 bytes, text/plain		Details
Test file: contains links to test domains containing script mixtures 19 years ago Neil Harris 10.15 KB, text/html		Details
Test file: contains links to test domains containing Unicode characters not allocated to any script 19 years ago Neil Harris 17.69 KB, text/html		Details
Test file: contains links to test domains containing Unicode characters not allocated to any script 19 years ago Neil Harris 17.97 KB, text/html		Details

Neil Harris

Assignee

Description

•

20 years ago

Since Unicode Technical Report #36 and the draft of UTR #39 appear to be likely to recommend the application of script-mixing rules to IDNs. These rules are designed to prevent IDN homograph name-spoofing, and cannot be addressed by either the current IDN display blacklist or DNS output whitelist mechanisms. The IDN-processing code may soon need to be able to enforce these checks, if the rules become official.

Neil Harris

Assignee

Updated

•

20 years ago

Blocks: 316730

Daniel Veditz [:dveditz]

Comment 1

•

20 years ago

If this is about to be a new recommendation in a public spec perhaps we don't need the confidential flag. It's public, and being new people will assume we may not yet implement it.

No longer blocks: 316730

Whiteboard: [sg:low] spoof

Neil Harris

Assignee

Comment 2

•

20 years ago

Fine by me, please can someone remove the security tag?

Neil Harris

Assignee

Updated

•

20 years ago

Blocks: 316730

timeless

Updated

•

20 years ago

Assignee: nobody → smontagu

Group: security

Component: General → Internationalization

Product: Firefox → Core

QA Contact: general → amyy

Version: unspecified → Trunk

Neil Harris

Assignee

Updated

•

20 years ago

Blocks: 309435

Neil Harris

Assignee

Comment 3

•

20 years ago

Attached file A first cut at a character range table (obsolete) — Details

This is a first cut at defining a character range table, based on Unicode 4.1.0 Scripts.txt and Blocks.txt tables, with some hand amendments to: * slice up the Basic Latin and Latin-1 Supplement to remove control chars and punctuation, and to separate the ASCII letters and digits * slice up the Greek and Coptic block, to allow clearer distintctions -- note that Greek / Coptic disunification is an issue here, and this may need to be tweaked further to be Unicode 3.2 compatible In addition, the pre-processor program used has: * removed blocks containing _nothing_but_ Common or Inherited characters * removed blocks containing only presentation forms (should not be used for naming, and should be normalized away) * removed blocks containing other forms that will be normalized away (halfwidth, fullwidth, circled, subscripts + superscripts) * removed blocks exclusively containing musical or other symbols * removed the Braille block: raw Braille is in effect a transliteration / presentation form for other scripts, and the presence of raw Braille in a name creates a spoofing risk for Braille users However, the IPA block has been kept, even though IPA is usually only used for transliteration, as some languages such as Zulu and Hungarian use a few of these symbols. The IPA Spacing Modifiers have gone, since no such use is indicated. In all, this table has 82 blocks defined (some of which are adjacent, and could be merged), which map to a total of 59 scripts. Note: these blocks are selected for one-block -> one-script mappings. Any of these blocks may, and often do, contain non-script characters and/or unallocated code points; they are purely intended to stop script mixing, and for no other purpose. Note again: Greek / Coptic unification / disunification presents a particularly knotty problem, which needs further thought.

Assignee: smontagu → usenet

Status: NEW → ASSIGNED

Neil Harris

Assignee

Comment 4

•

20 years ago

Just to explain the Greek / Coptic unification issue: In earlier versions of Unicode, Greek and Coptic scripts were unified, so Coptic shared the corresponding Greek characters in the "Greek and Coptic" block, with some special characters of its own. Now, however, a disunified Coptic block has been produced with its own versions of the Coptic characters (with characters which are, of course, very confusable for the corresponding Greek characters, and vice versa). However, seven Coptic-unique characters (in both uppercase and lowercase forms) remain in the Greek/Coptic block. Question: it is presumably safe to mix the Greek characters and the seven Coptic characters in that block: is it ever justified to mix: * The Coptic-unique chars in the Greek/Coptic block and the characters in the new Coptic block? * The Greek Extended set and the new characters in the new Coptic block? * etc...

Neil Harris

Assignee

Updated

•

20 years ago

Assignee: usenet → smontagu

Status: ASSIGNED → NEW

Neil Harris

Assignee

Comment 5

•

20 years ago

One final point: no attempt has been made in the generation of the list above to filter out dead (eg. Linear B) or artificial (eg. Shavian) scripts from the mix: any attempt to do this is fraught with potential discord (where do you stop cutting? whose toes are you treading on?).

Neil Harris

Assignee

Comment 6

•

20 years ago

Regarding special non-letter characters: There are currently three outstanding IDN issues involving the use of characters other than letters or digits: * allowing ETHIOPIAN WORDSPACE in Ethiopian * allowing GERESH and GERSHAYIM in Hebrew * allowing "l·l" to be used in Catalan The first two do not affect this proposed test at all, since they already share the appropriate script blocks, so implementing this either of these would need no change to the code. We can deal with the last in two ways possible: * The most restrictive possible interpretation, allowing "l·l" as a special composite grapheme, treating it as if it was in the Latin block. * Allowing MIDDLE DOT into the Latin block I prefer the former, since for security, most restrictive is always best, but it will require special-case code to implement. The simplest and easiest way to implement it would be to do the following: canonical_name = string.replace(name_string, "l·l", "ll") [or even canonical_name = string.replace(name_string, "l·l", "lll") if we want to patch the buffer in place] and then apply the tests as before, but to canonical_name. Clearly, this approach will work well provided the number of exceptions is limited, but will get unwieldy and inefficient if many such exceptions are created.

Neil Harris

Assignee

Comment 7

•

20 years ago

And another thought: if we want to give ETHIOPIC WORDSPACE the same semantics as the hyphen, ie. not allowed at the beginning or the end, (which is, I believe, the intent) we could also add the pre-processing rule canonical_name = string.replace(name, ETHIOPIC_WORDSPACE, "-") which would have the desired effect.

Neil Harris

Assignee

Comment 8

•

20 years ago

Inspection of the Unicode 3.2 tables shows that the Coptic block did not exist at that time, and hence Unicode 3.2 treats the two scripts as unified. Since IDN is currently (2005) defined as using Unicode 3.2, we can simply remove the dedicated Coptic block from our list, and treat Greek+Coptic as a single script for the purposes of this filter.

Neil Harris

Assignee

Comment 9

•

20 years ago

Oh, and one last thing: characters not in any block should be regarded as invalid, unless transformed by one of the preprocessing rules above. Remember, the result of this is intended to force Punycode display in the GUI in an attempt to prevent spoofing, rather than to block IDNs from working, so interpreting this list in a very conservative way is OK from the viewpoint of interoperability.

Neil Harris

Assignee

Comment 10

•

20 years ago

Attached file Character table, v1.1 (obsolete) — Details

This is a reworking of the character table. It adds HYPHEN as a special character class, and provides an extra distinction to help handle the Coptic unification problem by labeling the Coptic characters currently sharing a block with the Greek characters as COPTIC_EXTRAS, which are presumably OK to mix either with the GREEK range or the COPTIC range, but not both.

Attachment #204985 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 11

•

20 years ago

Question: should the COMBINING DIACRITICAL MARKS be allowed at all, or are only precomposed characters allowed? If the combining marks were to be allowed, what scripts should they be allowed to mix with?

Neil Harris

Assignee

Comment 12

•

20 years ago

Attached file Script range table, v1.2 (obsolete) — Details

Now with defines created for label tags, and comments added to refer back to original and/or my created range names.

Attachment #205177 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 13

•

20 years ago

Attached file Suggested possible script mixings, v1.0 — Details

Script range codes are as in attachment 205179 [details]: "Script range table, v1.2" Reference: http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts with Restriction Level 2, "Highly Restrictive", Unicode 4.1.0 Scripts.txt and Blocks.txt, amended manually for special treatment within ASCII ranges, and special treatment of the Coptic characters within the Greek and Coptic block. This last is intended to allow for Coptic disunification, which is currently prevented by the existing Nameprep, but may later need to be implemented. [Note: the ASCII_DIGITS and HYPHEN are treated as character classes of their own, since it may eventually be decided that one or both of these is incompatible with one of the scripts given below.]

Neil Harris

Assignee

Comment 14

•

20 years ago

[correction of the above] for "proper subset", read "subset".

Neil Harris

Assignee

Comment 15

•

20 years ago

Attached file A first hack at some C code for enforcing the rules suggested above (obsolete) — Details

Note: this code compiles, but has not yet been tested in any way. There's some scope for making it faster, and with a smaller footprint, but it'll probably do in more-or-less its present form for testing purposes.

Neil Harris

Assignee

Comment 16

•

20 years ago

Attached file A second hack at some C code for enforcing the rules suggested above (obsolete) — Details

This is a second version of the above, with a bit of tidying and a bug fix for the case of names without any letters in.

Neil Harris

Assignee

Comment 17

•

20 years ago

Attached file script-filtering code; a third try, now passes own self-tests (obsolete) — Details

Attachment #205311 - Attachment is obsolete: true

Attachment #205312 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 18

•

20 years ago

Attached file Version 4 of script-filtering code, now handles combining marks, more test cases (obsolete) — Details

Implementing combining marks raises interesting questions about the situations in which they are appropriate: for example, what about latin accents on hanji, and other nonsensical combinations? Should we simply refuse to display names containing combining marks at all?

Attachment #205321 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 19

•

20 years ago

Note: the logical place to hook this in appears to be what is now called isOnlySafeChars(), defined at http://lxr.mozilla.org/mozilla1.8/source/netwerk/dns/src/nsIDNService.cpp#61 which seems to have the same purpose of testing whether a Unicoded domain name string is OK to display in the user interface. However, it gets called in three places, and these need checking. isOnlySafeChars() is defined in terms of UTF16, so the data will need to be converted to UCS4 before the checking code is called.

Neil Harris

Assignee

Comment 20

•

20 years ago

Attached file version 5 of the script checking code (obsolete) — Details

This now lets the underscore through as a "word separator", for compatibility with an unfortunately large set of existing DNS labels. Thinks: will this code need to recognize and special-case support IPv6 addresses with colons in, or will these always go through another code path? (one idea might be to treat colons as a special character class that can only be mixed with ASCII digits in a label -- but what about the square brackets that surround v6 addresses in URLs -- will they pass through this code path?).

Attachment #205344 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 21

•

20 years ago

A possible approach: declare '[', ':', and ']' to be SCRIPT_IPV6_SPECIALS, and allow the set SCRIPT_ASCII_DIGITS SCRIPT_IPV6_SPECIALS as a special-case combination. Remember that this code is only intended as a belt-and-braces double-check for possible script spoofing, and code elsewhere should already be parsing and checking for valid name and address string formats.

Neil Harris

Assignee

Comment 22

•

20 years ago

Attached patch version 6: more lenient, does not screw up SRV requests with underscores in (obsolete) — Details — Splinter Review

In newer RFCs, SRV RRs with names _starting with_ underscores are now used. Changed patch to be more lenient to reflect this usage. Kept the hyphen restriction, since this is still necessary to prevent Unix command-line option ambiguity.

Attachment #205390 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 23

•

20 years ago

More notes: There are three routines under consideration here, with confusingly similar functionality, that however need to be kept carefully distinct. Perhaps the first and last should be renamed, to make things clearer? 1) The predicate isOnlySafeChars() roughly means "there are no easily programmatcally detected opportunities for visual spoofing in rendering this Unicode string as a host name". Failing this should result in the text being presented as Punycode in the UI. Note that this should always be called on the result of a string that has been through, at the least, Nameprep preprocessing. In practice, this is going to mean the following: * no script mixing, unless the scripts are special compatible scripts * no unreasonable combining-mark seqences (leading combining marks, combining marks applied to digits or word separators) * no characters in the IDN-display special-case character blacklist Moreover, since this test is called on the whole DNS name string at once, the checks above must be applied on a label-by-label basis. This assumes that, at the very least, all valid label-separator characters in the DNS name have been already converted into '.' characters, as per the IDN RFCs. 2) The predicate isValidHostName() means "this string consists only of the characters known to be valid in the print-representations of FQDNs, IPv4 addresses or IPv6 addresses (assuming IDN Nameprep and Punycode processing have been applied first)". Failing this should result in the hostname/IP address string being regarded as invalid, and all lookups of this FQDN should fail. 3) The UseSTD3ASCII test in the punycode() routine means "this string consists only of the characters known to be valid in _a_single_label_ of an FQDN (assuming IDN Nameprep processing has been applied first), and does not start or end with a hyphen". Failing this should result in the hostname string being regarded as invalid, and thus the punycode() operation, and hence all other operations relying on punycode(), including DNS lookups, should fail. _If_ both the tests above were to be merged into the code, and _if_ all the other domain name/IP address processing code were to be _demonstrably_watertight_, this test could probably be omitted. However, for now (and possibly for a long time), it's a useful belt-and-braces check. When the other patches below have been deployed and tested in the wild for some time, it's worth considering changing this into a fatal assertion of the "this must never happen" variety.

Neil Harris

Assignee

Comment 24

•

20 years ago

Attached file script-checking code version 7: label-aware, more sanity checks (obsolete) — Details

Attachment #205536 - Attachment is obsolete: true

Neil Harris

Assignee

Updated

•

20 years ago

Assignee: smontagu → usenet

Neil Harris

Assignee

Updated

•

20 years ago

Status: NEW → ASSIGNED

Neil Harris

Assignee

Comment 25

•

20 years ago

Attached patch Combination patch for both this and bug 316444 (obsolete) — Details — Splinter Review

This is an unfinished, unpolished, untested patch for both this and bug 316444. But it's nearly at the "end of the beginning phase"; it compiles, all of its parts have each been individually tested, and it needs only a couple of lines of code to become functional.

Neil Harris

Assignee

Updated

•

20 years ago

Depends on: 320568

Neil Harris

Assignee

Comment 26

•

20 years ago

Attached patch slight tweak of the above (obsolete) — Details — Splinter Review

I've just hosed my build environment, and need to rebuild... saving here just in case...

Neil Harris

Assignee

Comment 27

•

20 years ago

Whilst I'm waiting to get a properly bootable build machine back, I've just recieved some more feedback about character repertoires: Briefly: * there _may_ be a need to support some of the CJK Compatibility stuff... (3300..33FF, F900..FAFF, FE30..FE4F, 2F800..2FA1F) * there _may_ be a need to support some of the Hangul Conjoining Jamos (3130..318F) to support some obsolete Korean syllables found in some company names However: since most Korean-language systems do not support the older syllables properly, companies are likely to use only the modern precomposed syllables: so, after checking with a native speaker, I think it is OK to go on excluding them. The CJK Compatibility Ideographs are another issue, and this will require some codepoint-crunching to check in detail. Note: the most common repertoires in use for IDNs in the CJK languages are: http://www.iana.org/assignments/idn/cn-chinese.html http://www.iana.org/assignments/idn/tw-traditional-chinese.html http://www.iana.org/assignments/idn/jp-japanese.html http://www.iana.org/assignments/idn/kr-korean.html

Neil Harris

Assignee

Comment 28

•

20 years ago

Ooops.. U+1100..U+11FF Hangul Jamo U+3130..U+318F Hangul Compatibility Jamo Two different things...

Neil Harris

Assignee

Comment 29

•

20 years ago

I've now analyzed the data from the ICANN CJK code point assignments given above, cross-referenced against the Unicode 4.1.0 code block assignments, and I get the following: Writing system: zh-cn (= 50510 codepoints) Writing system: zh-tw (= 50697 codepoints) Writing system: jp (= 13142 codepoints) Writing system: ko (= 11209 codepoints) Not in use: Hangul Jamo (Hangul) Not in use: CJK Radicals Supplement (Han) Not in use: Kangxi Radicals (Han) Not in use: Bopomofo (Bopomofo) Not in use: Hangul Compatibility Jamo (Hangul) Not in use: Bopomofo Extended (Bopomofo) Not in use: CJK Compatibility Ideographs (Han) Not in use: CJK Unified Ideographs Extension B (Han) Not in use: CJK Compatibility Ideographs Supplement (Han) This is mildly reassuring. More to come...

Neil Harris

Assignee

Comment 30

•

20 years ago

Here are some more thoughts about safe/risky character ranges: According to the Unicode 3.0 standard, "Characters unique to the U source are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block are either duplicates or unifiable variants of characters in the one of the other blocks and are included in the Unicode Standard for reasons of round-trip compatibility." So, it looks like blocking the CJK Compatibility ranges will block _at_least_ 800 possible confusing duplicates, at the cost of losing perhaps 12 characters. For security purposes, I think this is a no-brainer, particularly since none of the codepoints in these ranges are used by the IANA-registered character sets indpendently compiled by the main registries using the ch-zh, cn-tw, ko and jp writing systems. For this reason, I intend to deprecate the "CJK Compatibility" ranges for the Han script. If CJK language experts agree that a small subset of these are safe for use, they can always be individually whitelisted later. The CJK and KangXi radical ranges are similarly dubious: according to the Unicode 3.0 spec, most of them are also encoded in the CJK Unified Ideographs block. So, they should be deprecated too. Similar reasoning applies to the Hangul Jamo and Compatibility Jamo ranges, which provide another way of specifying essentially the same character set encoded in the precomposed Hangul Syllable block. So they get deprecated, too, Speaking of radicals, the Yi Radicals do not seem to have any meaning other than for lexicographic use, and they contain at least a couple of glaring spoofs for the Yi Syllables, which are (according to the spec) what is used for normal linguistic purposes. I'm not sure what to do here: without a strong reason to deprecate them, I think I'll keep them in, but just flag this as a potential problem for later consideration. Hopefully, Nameprep v2 will make this clearer. Bopomofo, on the other hand, appears to be a different matter. As far as I understand it, this is a separate auxiliary script used in the zh-tw writing system only. UTR #36 specifically suggests that it is safe to use in its proposed "Highly Restrictive" script-mixing profile. So, on this evidence, Bopomofo should stay in, even though it is not currently used in any registry I know of. Similarly, I can't see any evidence that the CJK Unified Ideographs Extensions A and B, contain any harmful characters; as far as I'm aware, they're just rare, not confusing. So, in summary, out of the CJK ranges listed as usused in the previous comment: DEPRECATE: CJK Compatibility Ideographs (Han) CJK Compatibility Ideographs Supplement (Han) Hangul Jamo (Hangul) CJK Radicals Supplement (Han) Kangxi Radicals (Han) Hangul Compatibility Jamo (Hangul) KEEP AS VALID RANGES, EVEN THOUGH NOT CURRENTLY USED: CJK Unified Ideographs Extension B (Han) Bopomofo (Bopomofo) Bopomofo Extended (Bopomofo)

Neil Harris

Assignee

Comment 31

•

20 years ago

Attached file Script range table, v1.3 — Details

New script-block classification tables, as per previous comment. Note that there is an explicit SCRIPT_DEPRECATED tag, to make this table more maintainable. I should probably also define SCRIPT_TABLE_SIZE in terms of the ratio of the lengths of script_ranges[] and the script_range structure, which will also make the code more maintainable.

Attachment #205179 - Attachment is obsolete: true

Neil Harris

Assignee

Comment 32

•

20 years ago

Attached patch script-checking code, patch 24 Dec: fix some Latin-1 punctuation issues, MIDDLE DOT for Catalan — Details — Splinter Review

This now * properly deals with the punctuation characters in the Latin-1 Supplement, * adds MIDDLE DOT to the Latin script, in order to support the new .cat domain. * properly filters out various highly risky compatibility character ranges, including the Hangul Jamo (note: all existing Korean IDNs already correctly use the Hangul syllables instead) * Added a new character class SCRIPT_DEPRECATED for these last. Also, it now actually compiles, and doesn't seem to crash anything. However, this is still not properly tested, so please don't use it for anything.

Attachment #205818 - Attachment is obsolete: true

Attachment #206046 - Attachment is obsolete: true

Attachment #206131 - Attachment is obsolete: true

Neil Harris

Assignee

Updated

•

20 years ago

Depends on: 321491

Neil Harris

Assignee

Comment 33

•

19 years ago

*** Bug 354592 has been marked as a duplicate of this bug. ***

Neil Harris

Assignee

Comment 34

•

19 years ago

Attached file Test file for this code: contains links to test domains in a wide variety of languages. — Details

Tag-soup HTML file for this code: contains domains in a wide variety of languages. Unicode chars are escaped using entities. You will need to add "idntest" to your whitelist of IDN-enabled domains to make this work, and to create a local DNS server that is authoritative for the "idntest" domain.

Neil Harris

Assignee

Comment 35

•

19 years ago

Attached file Exemplar script sets for different languages — Details

Note: from a mixture of CLDR and Wikipedia data -- this is mainly for reference, not script-mixing, since some languages will map to multiple incompatible writing systems because they have to different orthographies. What is interesting here are the characters which are _not_ members of any script system which are part of the orthography of some languages, eg U+200C, U+200D, U+2019, U+02BC...

Neil Harris

Assignee

Comment 36

•

19 years ago

Attached file Extra non-script-associated characters used as part of an orthography — Details

These are the characters used in language orthographies which are not included in any script system. MIDDLE DOT is also used in Catalan, but is not listed here, because it is already handled in the current code. ETHIOPIC WORDSPACE is another edge-case, as are the GERESH and GERSHAYIM in Hebrew.

Neil Harris

Assignee

Comment 37

•

19 years ago

I've now filed bug 355156 for the separate issue of display of IDNs which have labels with different BiDi directions: eg http://עברית.idntest/

Neil Harris

Assignee

Comment 38

•

19 years ago

Attached file Test file: contains links to test domains containing script mixtures — Details

This tag-soup HTML file contains links to domains with mixed-script labels which _should not_ be displayed if the code in this patch is working correctly. To make this work, you will need to enable IDNs for the (testing-only) ".idntest" domain.

Neil Harris

Assignee

Comment 39

•

19 years ago

Attached file Test file: contains links to test domains containing Unicode characters not allocated to any script — Details

(script allocations as of Unicode 4.1.0, BMP only, may also contain broken surrogate sequences.)

Neil Harris

Assignee

Comment 40

•

19 years ago

Attached file Test file: contains links to test domains containing Unicode characters not allocated to any script — Details

Less dramatically wrong than the previous set. The edge case of completely invalid Unicode (broken surrogate pairs, non-characters, etc, etc.) needs further testing, but looks safe for now.

Phil Ringnalda (:philor)

Updated

•

16 years ago

QA Contact: amyy → i18n

Curtis Koenig [:curtisk-use curtis.koenig+bzATgmail.com]]

Updated

•

13 years ago

Keywords: sec-low

Mats Palmgren (inactive)

Comment 41

•

11 years ago

Neil, are you still working on this?

Component: Internationalization → Networking

Flags: needinfo?(usenet)

Neil Harris

Assignee

Comment 42

•

11 years ago

Not currently, but I can do so if there's a need for it.

Flags: needinfo?(usenet)

Patrick McManus [:mcmanus]

Updated

•

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → INCOMPLETE

You need to log in before you can comment on or make changes to this bug.