Closed Bug 304316 Opened 19 years ago Closed 19 years ago

Expand IDN character blacklist

Categories

(Core :: Networking, defect)

Type: defect
Priority: Not set
Severity: major

RESOLVED DUPLICATE of bug 309311
mozilla1.8beta4

People

(Reporter: gerv, Assigned: gerv)

Details

(Whiteboard: [sg:dupe 309311])

We may need to expand our IDN character blacklist. Currently, it includes two
homographs of the "/" character. It also needs to include homographs of other
URL punctuation characters, such as ".", "?", "#", "&" and ":".

We have a character blacklist as well as TLD restrictions because registries
have no control of the levels above the directly-registered level. The blacklist
is to prevent labels at that level pretending to be at a different level; hence
the focus on URL punctuation characters.
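
As a rough illustration of the check described above, here is a minimal sketch in Python. The set contents and function names are hypothetical: the two entries are example "/" look-alikes chosen for illustration, not necessarily the two characters in the shipped blacklist.

```python
# Hypothetical blacklist: two illustrative "/" look-alikes.
# These are assumptions, not a copy of the shipped list.
BLACKLIST = frozenset({"\u2044", "\u2215"})


def blacklisted_chars(hostname):
    """Return the blacklisted characters found in a decoded hostname."""
    return [ch for ch in hostname if ch in BLACKLIST]


def is_safe_to_display(hostname):
    """True when the decoded hostname contains no blacklisted character."""
    return not blacklisted_chars(hostname)
```

A hostname such as "example.com" followed by U+2215 and "login.evil.test" would fail this check, because the look-alike makes a single label appear to be a host followed by a path.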

Gerv
I will contact Opera to see what characters they are blacklisting.

Gerv
Severity: normal → major
Flags: blocking1.8b4?
Target Milestone: --- → mozilla1.8beta4
Whiteboard: [sg:investigate]
Presumably it only needs to blacklist those characters that are NOT normalized
by the IDN normalization rules, which include NFKC Unicode normalization with
some additional rules for homographs of "." (U+3002, U+FF0E, U+FF61).  When
characters are normalized according to those rules, we always use the normalized
form in the status bar, in the URL bar (unless the user is typing and they
haven't pressed Enter yet), etc.
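
A rough sketch of that normalization in Python, using only the standard library. This is not full nameprep: it only maps the three "." homographs named above and then applies NFKC and lowercasing, and the function name is made up for illustration.

```python
import unicodedata

# The three "." homographs mentioned in the comment above.
DOT_HOMOGRAPHS = ("\u3002", "\uff0e", "\uff61")


def normalize_host(host):
    """Map dot homographs to ".", then apply NFKC and lowercase.

    A simplified stand-in for the IDN normalization rules, not a
    complete implementation of them.
    """
    for dot in DOT_HOMOGRAPHS:
        host = host.replace(dot, ".")
    return unicodedata.normalize("NFKC", host).lower()
```

Under this sketch, a character only needs to be considered for the blacklist if it survives normalize_host() unchanged.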
Flags: blocking1.8b4? → blocking1.8b4+
dbaron: indeed. If our decoding code transforms a given dodgy domain name into a
different one altogether, and then displays that and tries to visit it, it's as
if the person linked there originally, and there's no problem.

We just need to make sure there's no inconsistency or seeming inconsistency
between what's displayed and the place visited.

Gerv
I got the following message from Yngve Pettersen of Opera:

The list used in 7.5x is:
   0021-0023;
   0025-002C;
   002F; Forward slash
   003B-0040;
   005C;
   005E;
   007B-007D;
   007F;

"%" and "/" are the most important of these, since they can impact both  visual
and machine interpretation of a URI, especially if the servername  is converted,
and then later parsed again.

The others were added because they are not (or should not be) valid in a DNS
name. (Hmmm, on second thought, not sure U+002B "+" should be on the list)

In 8.0 beta I added two sets of characters to the strict list

The first list is

   2000-206f; General punctuation
   2215; fractional slash

The fractional slash is a homograph of "/" and should therefore be covered by
the same rules as "/" in the above list. I suspect there are more of these
(IIRC I have seen references that indicate that). Blocking this character
impacts U+33C6 which is folded into a sequence that includes U+2215 by
stringprep (the only one in the fold list).
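
The U+33C6 case can be checked with a short Python sketch. It uses NFKC from the standard library as an approximation of the stringprep folding step, and the helper name is hypothetical.

```python
import unicodedata

SLASH_HOMOGRAPH = "\u2215"  # the "/" look-alike from the list above


def introduces_slash_homograph(text):
    """True if NFKC folding yields U+2215 that was not in the input."""
    folded = unicodedata.normalize("NFKC", text)
    return SLASH_HOMOGRAPH in folded and SLASH_HOMOGRAPH not in text
```

The practical consequence is that a blacklist check has to run on the folded string as well; checking only the raw input would let U+33C6 through.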

The punctuation section was added based on a suggestion in the Unicode
discussion list that was forwarded to me, and IMO should be excluded, although
U+2010 and some others should probably be folded to "-" (U+002D).

The second set was also added based on the same list as general punctuation,
but I am not 100% sure these should be excluded; it may be overkill. I'll leave
the final decision to the experts.

   1680-16FF;
   2400-243f;
   2500-257f;
   FB00-FB4F;
   FE50-FE6F;
   FF00-FFEF;
   10000-100FF;
   10300-1032F;
   10400-1047F;

AFAICT (I am not a Unicode expert), at least the alphanumeric half/fullwidths
in FF00-FFEF are normalized to normal ASCII.
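
One way to test that claim is the Python sketch below, which again uses NFKC from the standard library as an approximation of the IDN folding step. The names are made up for illustration.

```python
import unicodedata


def folds_to_ascii(ch):
    """True if NFKC maps `ch` to a string made only of ASCII characters."""
    return unicodedata.normalize("NFKC", ch).isascii()


# Code points in FF00-FFEF that do NOT fold to ASCII under NFKC
# (for example the halfwidth katakana, and unassigned code points).
not_folded = [cp for cp in range(0xFF00, 0xFFF0) if not folds_to_ascii(chr(cp))]
```

The fullwidth letters and digits do fold to ASCII here, but so does the fullwidth solidus U+FF0F, which becomes a real "/"; the halfwidth katakana do not fold to ASCII at all.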

The referenced list (which includes a number of ranges already covered by
stringprep) was

--------------
* Box Drawing
* Block Elements
* Geometric Shapes
* Miscellaneous Symbols
* Dingbats
* Byzantine Musical Symbols
* Musical Symbols
* Mathematical Alphanumeric Symbols
* Letterlike Symbols
* Number Forms
* Arrows
* Mathematical Operators
* Miscellaneous Technical
* Combining Marks for Symbols
* Control Pictures
* Optical Character Recognition
* Enclosed Alphanumerics
* Miscellaneous Mathematical Symbols-A
* Supplemental Arrows-A
* Supplemental Arrows-B
* Miscellaneous Mathematical Symbols-B
* Supplemental Mathematical Operators
* Miscellaneous Symbols and Arrows
* High Surrogates
* Low Surrogates
* Private Use Area
* Alphabetic Presentation Forms
* Small Form Variants
* Halfwidth and Fullwidth Forms
* Variation Selectors
* Tags
* Specials
* Variation Selectors Supplement
* Supplementary Private Use Area-A
* Supplementary Private Use Area-B
* Linear B Syllabary
* Linear B Ideograms
* Shavian
* Deseret
* Ugaritic
* Old Italic
* Ogham
* Runic
* General Punctuation
-------------- 
That list probably isn't right for us to use as-is; we need to answer the
following questions:

- Do we already have code that will barf on the non-LDH ASCII characters in the
first list?

- Do we want to include "+"? (I believe Opera may have later removed that one
due to encountering problems)

- Do we want to include large swathes of punctuation, or only characters which
directly spoof URL punctuation?

- If we are going to have a large set of blocked characters, do we need a better
storage method than a character list?

My suggested approach is to comb their list looking for homographs of the
characters listed in comment 0, and just add those. But perhaps other people
have a different view.
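
On the storage question, one option is sorted code point ranges with a binary search. The Python sketch below is only an illustration: the ranges are a few copied from the lists quoted above (which ranges belong in a real blacklist is what this bug leaves open), the names are hypothetical, and the lookup assumes the ranges do not overlap.

```python
from bisect import bisect_right

# A few ranges copied from the quoted lists, for illustration only.
RANGES_TEXT = """
0021-0023;
002F; Forward slash
2000-206F; General punctuation
2215;
FF00-FFEF;
"""


def parse_ranges(text):
    """Parse 'XXXX-YYYY;' or 'XXXX;' lines into sorted inclusive (lo, hi) pairs."""
    ranges = []
    for line in text.splitlines():
        entry = line.split(";")[0].strip()  # drop any trailing description
        if not entry:
            continue
        lo, _, hi = entry.partition("-")
        ranges.append((int(lo, 16), int(hi or lo, 16)))
    return sorted(ranges)


RANGES = parse_ranges(RANGES_TEXT)
STARTS = [lo for lo, _ in RANGES]


def is_blocked(ch):
    """Binary-search the sorted, non-overlapping ranges for ord(ch)."""
    i = bisect_right(STARTS, ord(ch)) - 1
    return i >= 0 and RANGES[i][0] <= ord(ch) <= RANGES[i][1]
```

Compared with a flat character list, this keeps a block such as 2000-206F as a single entry and makes lookups logarithmic in the number of ranges.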

Gerv
Flags: blocking1.8b5+
Does this bug need to be security sensitive, given public bug 301694? Clearly
people know we need *a* blacklist, and making public which particular characters
are or are not on it would seem to benefit us: by the many-eyes theory, more
reviewers make it less likely that we miss any.

In fact, is this not a plain duplicate of bug 301694?
This list needs reconciling with the one in bug 309311, which is the canonical
list. Once that's done, we can dupe it.

Gerv
Gerv, can you carry the contents of this bug over to the other and dupe this?
We've only got a couple of days to get all these blacklisted chars in for beta.
If they're not in by Monday, they might not make it.
After talking with Darin, we're going to push this out to after the beta2 (first
RC).
Flags: blocking1.8b5+ → blocking1.8b5-
So what's the story here?  Are we done?
I don't think this bug has further intrinsic value.

Gerv

*** This bug has been marked as a duplicate of 309311 ***
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → DUPLICATE
Group: security
Whiteboard: [sg:investigate] → [sg:dupe 309311]