Closed Bug 301694 Opened 18 years ago Closed 18 years ago

Create IDN blacklist that include 'DIVISION SLASH'(U+2215) and 'FRACTION SLASH'(U+2044)

Categories

(Core :: Networking, defect, P1)

defect

Tracking

()

RESOLVED DUPLICATE of bug 309311
mozilla1.8beta4

People

(Reporter: masayuki, Assigned: masayuki)

References

Details

Attachments

(2 files, 6 obsolete files)

This is remain work of bug 283016.
Status: NEW → ASSIGNED
Flags: blocking1.8b4?
Priority: -- → P1
Target Milestone: --- → mozilla1.8beta4
Attached patch Patch rv1.0 (obsolete) — Splinter Review
Attachment #190114 - Flags: superreview?(darin)
Attachment #190114 - Flags: review+
Attachment #190114 - Flags: approval1.8b4?
Suggested comment:

// If a domain includes any of the following characters, it may be a spoof 
// attempt and so we always display the domain name as punycode. This would 
// override the settings "network.IDN_show_punycode" and 
// "netword.IDN.whitelist.*".
// The list currently contains the characters:
// U+2215 DIVISION SLASH
// U+2044 FRACTION SLASH

We should list the chars as not all editors can display them.

Gerv
Comment on attachment 190114 [details] [diff] [review]
Patch rv1.0

Please do not request approval until reviews are completed.
Attachment #190114 - Flags: approval1.8b4?
Attached patch Patch rv1.1 (obsolete) — Splinter Review
Attachment #190114 - Attachment is obsolete: true
Attachment #190137 - Flags: superreview?(darin)
Attachment #190137 - Flags: review+
Attachment #190114 - Flags: superreview?(darin)
Attachment #190114 - Flags: review-
Attachment #190114 - Flags: review+
'DIVISION SLASH' (U+2215) and 'FRACTION SLASH' (U+2044) are examples of unicode
Pattern_Syntax characters. In addition to these two, you should probably
consider all of the other Pattern_Syntax characters as illegal. There are a
total of 2006 of them (unfortunately). I've uploaded a complete list, extracted
from http://www.unicode.org/reports/tr36/draft/idn-chars.txt, and expressed as
a C array.
Tony: we are marking these two particularly as illegal because they can be used
to spoof URL separators. The IAB (Internet Architecture Board), the parent of
the IETF, has put together a working group to consider restricting the
characters permissible in IDN. We don't want to prejudice that work, so we are
keeping our blacklist limited to the more seriously problematic characters.

Gerv
Attachment #190137 - Flags: superreview?(darin) → superreview+
Attachment #190137 - Flags: approval1.8b4?
Attachment #190137 - Flags: approval1.8b4? → approval1.8b4+
checked-in
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Flags: blocking1.8b4?
Regarding the full list of 2006 characters: playing around with various
transformations and compressions of the list gives an apparent limiting
complexity of about 300 to 500 bytes for compressing the list (roughly 1.2 bits
per character).

I can get it down to about 500 - 750 bytes with some bitmap and range-table
compression (15-way switch statement for 8 MSBs, with bitmap lookups for the
LSBs, or none if entire MSB range is empty) but just doing a simple C switch
statement generates code that's about 1280 bytes on an x86 with gcc and
optimization, and this form of the code is much more readable and maintainable
and does not need a hand-hacked "pre-compiler" to generate the bitmaps etc.

At the moment, it does not look like the super-efficient implementation is
needed, unless there's a desperate need to save 500 bytes of code.

Challenge: can anyone do better than 750 bytes for the code+data for this?

If you need the super-efficient implementation for the full set of characters,
drop me an E-mail.


can nsIUGenCategory be used?
*** Bug 286535 has been marked as a duplicate of this bug. ***
The following are also possible Unicode homographs for (one or more) slashes,
and should be added to this blocklist, in addition to the existing U+2215 and
U+2044:

Combining long solidus overlay U+0338 &#x0338; <br>
Fullwidth solidus U+FF0F &#xff0f; <br>
Big solidus U+29F8 &#x29f8;<br>
Solidus with overbar U+29F6 &#x29f6;<br>
Triple solidus binary relation U+2AFB &#x2afb;<br>
Double solidus operator U+2AFD &#x2afd;<br>
Combining short solidus overlay U+0337 &#x0337;<br>
Integral extension U+23AE &#x23ae;<br>

Note: some of these are from math fonts, and some may be visible depending on
combining-character rendering. The "integral extension" character, in
particular, may be rendered either as a vertical bar or a slash-like sloping
line, depending on font. None of these (in my opinion) are reasonable characters
for use in identifiers.




This Python program generates the expanded blacklist as a standard C
hex-escaped UTF-8 string from a human-readable listing of the offending
characters.
This adds the new proposed characters, and uses Javascript \u escape format,
which makes for a more self-documenting blacklist string. See
http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Unicode for
documentation of this string escape format.
Comment on attachment 193309 [details] [diff] [review]
Created a patch to add new blacklist characters, with more self-documenting string format

Great - I like the new format. r=gerv.

Gerv
Attachment #193309 - Flags: superreview?(darin)
Attachment #193309 - Flags: review+
Reopening; but if there are any further patches, let's do them in new bugs.

Gerv
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Another thought: can we turn that pref into a regexp? That way, we could include
character ranges if that ever became necessary.

Gerv
gerv: not easily, no... the networking engine is rather far removed from
anything that does regexp processing
Where in the code is this check, that is expanded here, happening?
Comment on attachment 193309 [details] [diff] [review]
Created a patch to add new blacklist characters, with more self-documenting string format

sr=darin
Attachment #193309 - Flags: superreview?(darin) → superreview+
Attachment #193309 - Flags: approval1.8b4?
andreas: see the patch attached to bug 283016.

Gerv
that's in IDN code, and the only thing it does is show the host as punycode
instead of the unicode form, afaict. therefore, I think bug 304904 should be
fixed independently of this blacklist.
There is a check with isOnlySafeChars, but I suspect it will only get triggered
if the IDNService is used, which does not happen with ascii-only hostsnames I
think. Both checks are necessary, maybe it is possible to chare the ascii
portion of the blacklist.
Attachment #193309 - Flags: approval1.8b4? → approval1.8b4+
Fixed on trunk and branch. Any further updates in new bugs, please.

Checking in ./modules/libpref/src/init/all.js;
/cvsroot/mozilla/modules/libpref/src/init/all.js,v  <--  all.js
new revision: 3.591; previous revision: 3.590
done

---

Checking in ./modules/libpref/src/init/all.js;
/cvsroot/mozilla/modules/libpref/src/init/all.js,v  <--  all.js
new revision: 3.585.2.5; previous revision: 3.585.2.4
done

Gerv
Status: REOPENED → RESOLVED
Closed: 18 years ago18 years ago
Resolution: --- → FIXED
guys, why does this work? I'm told this causes an NS_WARNING in prefread.cpp,
due to lack of support for \u.
http://lxr.mozilla.org/seamonkey/source/modules/libpref/src/prefread.cpp#388

This file is not parsed by a JS interpreter, but by a specialized parser.

Did anyone test this patch?

reopening because of that
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: blocking1.8b5?
Please re-create or back out the latest patch.
The patch makes infinite loop on the test case.
See http://bugbug.cocolog-nifty.xn--cominfo-ef0d.nekodama.com/icons/nekodama64b.gif

usenet:
You *MUST* test before attaching a patch.
Gah... i'm also seeing the NS_WARNING in my trunk build.  I figured the code had
been tested when I reviewed this (and forgot that I never coded support for
\u-escape sequences in prefread.cpp) :(
I backed this patch out on the trunk.
I filed bug 307438 on adding support for \u escape sequences to prefread.cpp
Attachment #193309 - Flags: superreview+ → superreview-
Darin, is this critical to get backed out for Firefox 1.5 beta 1?
Asa: Hmm.. yes, I think so.  I would have backed it out had I realized that it
was commited on the 1.8 branch.  Apparently, no one added the fixed1.8 keyword! :(
OK, I backed out the patch from the 1.8 branch as well.
backout from 1.8 branch was with a=asa ("per the results of our driver
discussion" according to asa).
Aaargh. Mea culpa. My apologies: I should have tested, but mistakenly assumed
this was trivial. I won't make that mistake again.

Some punycoded test cases:
u'www.pyth\u0337on.org' -> www.xn--python-e8d.org
u'www.pyth\u0338on.org' -> www.xn--python-l8d.org
u'www.pyth\u2044on.org' -> www.xn--python-sq0c.org
u'www.pyth\u2215on.org' -> www.xn--python-se3c.org
u'www.pyth\u23aeon.org' -> www.xn--python-lq5c.org
u'www.pyth\u29f6on.org' -> www.xn--python-6w4d.org
u'www.pyth\u29f8on.org' -> www.xn--python-lx4d.org
u'www.pyth\u2afbon.org' -> www.xn--python-ef6d.org
u'www.pyth\u2afdon.org' -> www.xn--python-sf6d.org
u'www.pyth\uff0fon.org' -> www.pyth/on.org

UTF-8 versions in an attachment coming soon. I'll let you know when I've tested
them.
Here are some test cases. Although some are caught by the nameprep logic,
others are not, and although some are more convincing than others (depending on
your choice of font), some of them are quite believable spoofs, even one really
devious one where the slash is made up of a Canadian syllabics character and a
solidus overlay.
Also, note that URLs with nbsps in the domain name are displayed as clickable
links: they probably shouldn't be, since nameprep will convert them to ASCII
spaces, are not valid either in a conventional domain label, or in a Punycoded
IDN label (since Punycoding passes them straight through). 

This should probably also apply to any other URL where the nameprep'd version of
the FQDN has an ASCII space in it.
Attachment #195247 - Attachment mime type: text/html → text/html;charset=UTF-8
My sincere apologies for not testing this - as the guy who checked it in, the
buck stops here. And I forgot the fixed1.8 keyword as well :-(

usenet@tonal.clara.co.uk: any chance of configuring your real name in Bugzilla,
so we don't have to call you "usenet"? :-)

Gerv
usenet@tonal.clara.co.uk: are you able to work on bug 307438 and then this one
again?

Gerv
Alternatively, we could do the new list using the current encoding method. Given
that we are locked down for 1.5, this may be a better plan. Anyone got a
Unicode-compatible editor? :-)

Gerv
This is an (as yet untested) patch for this bug, just encoding the characters
as UTF-8. This was generated by a Python program, as some of these characters
cause some text editors (even emacs) to do funny things.

Please don't commit this until I've tested it.
Attachment #190137 - Attachment is obsolete: true
Flags: blocking1.8b5? → blocking1.8b5+
Attached patch Patch now testedSplinter Review
Note: this patch is not a proper CVS patch, since I can't currently make my
build environment work. However, when manually applied to all.js in an
installation of 1.5b1, it:
* passes the pre-checkin test
* blocks the display of all but one of the test cases, which I believe is a
separate issue from that addressed in this bug, also related to IDN
normalization, and will file as a new bug

Can someone with better Mozilla-fu than me confirm this, please, and generate a
proper patch prior to reviewing it?
Attachment #190146 - Attachment is obsolete: true
Attachment #193305 - Attachment is obsolete: true
Attachment #193309 - Attachment is obsolete: true
Attachment #196171 - Attachment is obsolete: true
*** Bug 308800 has been marked as a duplicate of this bug. ***
Gerv, can you drive this one? Looks like we need a proper patch and some reviews. 
See also
* Bug 309128
* Bug 309133

and bug 301694 as well, or there will some difficulty in adding these characters
in a maintainable way. 

Perhaps a half-way house would be representing trhe codepoints in the blocklist
in hex, and having a special-case decoder that only decodes bare hex number
separated by spaces just for this purpose?
Is it that much more complex to just fix it properly by decoding \u and \x
sequences?

Gerv
(In reply to comment #45)
> Is it that much more complex to just fix it properly by decoding \u and \x
> sequences?

That's bug 307438. Not going to decode \x, pref files are supposed to be utf8
and \x is just going to be misused.
Depends on: 307438
Bug 309311 has the canonical list of chars we are going to ban, including all of
those from this bug. We are going to use that set of chars as soon as bug
307438, which allows us to insert them into prefs.js in a sensible way, is fixed.

Gerv

*** This bug has been marked as a duplicate of 309311 ***
Status: REOPENED → RESOLVED
Closed: 18 years ago18 years ago
Resolution: --- → DUPLICATE
Flags: blocking1.8b5+
You need to log in before you can comment on or make changes to this bug.