Bug 403655: update eTLD file for gecko 1.9
Status: RESOLVED FIXED (opened 17 years ago, closed 17 years ago)
Categories: Core :: Networking, defect, P2
Tracking: mozilla1.9beta3
People: Reporter: dwitte, Assigned: pamg.bugs
Attachments (6 files, 9 obsolete files):
- 3.93 KB, message/rfc822
- 12.70 KB, application/zip
- 37.19 KB, text/plain
- 10.53 KB, patch (dwitte: review+)
- 9.77 KB, patch
- 21.74 KB, patch
now that more consumers (cookies, soon navhistory) are using the effective TLD service, it would be nice to review what we have and update it if necessary. (see bug 342314 for the original checkin of effective_tld_names.dat). there are also a few comments in the file indicating incomplete entries (e.g. the city.state.us system, but that's pretty lengthy to fix).
Jo, Ruben, do you have any interest in helping out here? i'm not sure how up to date the wiki (http://wiki.mozilla.org/TLD_List) is, but that seems like a good starting point for anyone who wants to help out.
Although I do not have any experience editing a wiki, the MozillaWiki TLD List currently indicates .AQ (Antarctica) is not used.
Running a quick Google search, two Antarctic-related sites using TLD .aq came up:
http://www.comnap.aq/
http://www.ats.aq/
Comment 4•17 years ago
I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't
found a diff or patch yet that can work with utf-8. Can I check in a complete
new file ?
I should also visit every TLD-registry (again ...) to see if something has been
changed. Note that the wiki-page is way out of date. This information is quite volatile.
Reporter
Comment 5•17 years ago
(In reply to comment #4)
> I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't
> found a diff or patch yet that can work with utf-8. Can I check in a complete
> new file ?
wow, nice work!
i tested my diff here (linux) and it seems to work with utf8... if you're having trouble, you can either attach the file here or email it to me and i'll diff it. (note that if bug 402013 lands, it might make sense to convert the TLD file to an IDN ASCII encoding, to reduce the legwork the normalizer has to do on read-in... which would have the side effect of solving your utf8 issues. ;)
Reporter
Comment 6•17 years ago
ftr, http://publicsuffix.org/ allows registries to submit modifications to the list.
Comment 7•17 years ago
(In reply to comment #4)
> I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't
> found a diff or patch yet that can work with utf-8. Can I check in a complete
> new file ?
$ echo $LANG
pt_BR.UTF-8
and diff works fine with utf-8.
Maybe your system language is not utf-8.
> I should also visit every TLD-registry (again ...) to see if something has been
> changed. Note that the wiki-page is way out of date. This information is quite
> volatile.
>
For Brazil it is still the same.
Comment 8•17 years ago
.aq is in use; for random reasons, I once owned a domain in .aq.
PublicSuffix.org does allow submission of fixes; they come to me :-) And I have some pending, which I can pass on to whoever takes on this task.
We also wanted to ping all the registries to help us with data gathering, and I arranged that the message could be sent out by ICANN - but we never got around to it, because publicsuffix.org wasn't finished. If anyone wants to take on the small bit of web design work required there, that would be great too.
As for converting from UTF-8 to ASCII, I'll leave that to the technical people, but even if it was ASCII, it would be nice to have the full IDN version of each domain as a comment.
Gerv
Reporter
Comment 9•17 years ago
(In reply to comment #8)
> PublicSuffix.org does allow submission of fixes; they come to me :-) And I have
> some pending, which I can pass on to whoever takes on this task.
would you be okay with filing bugs as you get these, maybe in batches, and cc'ing relevant people? alternatively, if someone here is willing to be your contact for updating this stuff, that might work. as long as someone gets this information. ;)
for now, could you attach this info here? assuming there's nothing private in it...
> We also wanted to ping all the registries to help us with data gathering, and I
> arranged that the message could be sent out by ICANN - but we never got around
> to it, because publicsuffix.org wasn't finished. If anyone wants to take on the
> small bit of web design work required there, that would be great too.
that would be great to do. just thinking out loud, does moco have any in-house web design guys that we could ping about doing this? (i can do the pinging, if necessary.) i'll also send Ruben an email in case he's interested.
> As for converting from UTF-8 to ASCII, I'll leave that to the technical people,
> but even if it was ASCII, it would be nice to have the full IDN version of each
> domain as a comment.
absolutely, that'd be no problem.
Reporter
Comment 10•17 years ago
i forgot to ask - what exactly needs to be done to get publicsuffix.org up to scratch? i searched bugzilla, but couldn't find anything apart from bug 373190.
Comment 11•17 years ago
Dan: thanks for picking up this ball and running with it :-) It's past 11pm here now, but I'll try and get you the info you need ASAP.
Gerv
Comment 12•17 years ago
I have received an email from Dan regarding this and have replied to him. The basic gist is that I have the second version of the site ready with the changes that Gerv suggested. I had a few problems with logging into SVN and therefore zipped up the site and sent it to Gerv some time ago. I'm guessing it must have got lost in the mountains of email, but I can email it again so that it can be uploaded.
Sorry I haven't spoken about this earlier - I have been very busy with university!
Ruben
Reporter
Comment 13•17 years ago
thanks Ruben! we should definitely get your svn problems cleared up; please file a bug under mozilla.org/server operations (even if you just need technical help), that's the fastest way to get someone to look at it.
Comment 14•17 years ago
Yeah the problems are at my end so I'll have to sort them out at some point. In the meantime, I'll attach a zipped copy of the new site that someone else can upload to SVN.
Comment 15•17 years ago
Comment 16•17 years ago
This is an email I got from an Apache guy with some updates to the list.
Comment 17•17 years ago
One guy commented on the site:
"I eventually found the 'format' section under the 'submit' link, but that's far from intuitive. The format and examples should get their own dedicated page."
The only other stuff I had was from Ruben, and it's now attached here. I do think the website needs a usability once-over - we need to make sure that we are correctly serving both registrars (who want to submit changes to their part) and list users (who want to see the whole list and learn the parsing rules). But I'm hoping you guys can run with this ball.
Thanks for picking this back up. This is important work; more and more things are relying on this list being correct for security.
Once the site is good and up to date, compose a _short_ email to send to the registrars, saying "We are doing this... Your bit of the list is attached... Lots of big companies are using it (name them)... It's in your interests to make sure it's up to date, because otherwise X, Y and Z will break... Here's a link to a page with the update procedure."
Then, I'll see about getting it sent out.
Gerv
Comment 18•17 years ago
Moved format and submissions to their own page and linked to it from the homepage.
Attachment #289672 -
Attachment is obsolete: true
Reporter
Comment 19•17 years ago
as a note, the .tv subdomains need to be added to the file too; there's currently just the "tv" tld in there.
Comment 20•17 years ago
(In reply to comment #19)
> as a note, the .tv subdomains need to be added to the file too; there's
> currently just the "tv" tld in there.
>
Actually, no. I haven't yet found an explanation on their NIC website (www.tv), but <http://en.wikipedia.org/wiki/.tv> tells us (for what it's worth) that second level domains are allowed. There should be a few third-level domains like gov.tv, but I haven't yet found such a list.
If you're talking about co.tv, then the "tv" rule will still suffice.
Reporter
Comment 21•17 years ago
(In reply to comment #20)
> If you're talking about co.tv, then the "tv" rule will still suffice.
oh, so we don't want to consider co.tv an eTLD?
Comment 22•17 years ago
Why would it be ? co.tv, www.co.tv and bubu.co.tv all point to the same host
(66.232.143.122) : co.tv is the domain name. If co.tv was an eTLD, then you couldn't surf to it. Try it with gov.tv and www.gov.tv.
Reporter
Comment 23•17 years ago
ok - i wasn't thinking of it being a legit site or not as the test, but more whether we want to prevent co.tv from being able to set domain cookies. (i.e. whether it serves also as an eTLD for other, unrelated sites.) if it doesn't, then don't mind me :)
Comment 24•17 years ago
Got the following email recently:
From: hsowa@bfk.de
Subject: some small errors in effective_tld_names.dat
Hi!
I found some small errors in the effective_tld_names.dat list supplied with firefox/mozilla.
1. Line 1750: gáŋgaviika (.no tld forgotten)
2. Line 1912: Lærdal (.no tld forgotten)
Thanks in advance!
bye,
Hannes
Jo: are you working on the list for Firefox 3? We are using this in anger now, so it needs to be good.
Gerv
Comment 25•17 years ago
(In reply to comment #16)
> Created an attachment (id=289922) [details]
> Email from Apache dude
>
> This is an email I got from an Apache guy with some updates to the list.
>
I'm incorporating these changes, but with these remarks :
- I took the 3 .us domains
- The .uk domains are already covered under *.uk.
The regular expressions are indeed difficult to understand - they don't point to domains, but to top-level domains. *.uk matches every eTLD, including ones that don't exist yet, like blah.uk
- The same goes for .tr domains (Turkey).
- Those 3 .hk domains are actually exceptions to the old *.hk rule. But since second level eTLD's are now permitted, it doesn't matter anymore.
- italy.it doesn't exist, according to Google.
- pro.az is now included
- All .au domains are already covered with *.au (it's even mentioned in the comment)
- eu.int isn't supposed to exist anymore, but is apparently used, so I added them again
- .br was covered in bug 370569
- gov.it & edu.it are now corrected
- all uppercase characters are now converted
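To make the wildcard point concrete, here is a minimal Python sketch of how a rule set with plain, wildcard and exception rules could pick the eTLD of a host name. It is not the nsEffectiveTLDService code: the function name `effective_tld` is hypothetical, and the `!` prefix for exception rules is an assumption about the file syntax.

```python
def effective_tld(hostname, rules):
    """Return the eTLD of hostname under a simplified rule set.

    rules is a set of strings: plain suffixes ("tv"), wildcards ("*.uk"),
    or exceptions ("!bl.uk"; the "!" syntax is assumed for illustration).
    """
    labels = hostname.lower().split(".")
    # Walk from the longest suffix to the shortest so the most specific rule wins.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        if "!" + suffix in rules:
            # An exception is itself registrable, so its parent is the eTLD.
            return parent
        if suffix in rules or (parent and "*." + parent in rules):
            return suffix
    # No rule matched: fall back to the last label.
    return labels[-1]
```

With only `*.uk` in the rule set, `www.blah.uk` resolves to the eTLD `blah.uk` even though nobody listed it, which is the behaviour described above. With only `tv` listed, `bubu.co.tv` resolves to `tv`.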
Comment 26•17 years ago
This is the new full eTLD list (not a diff yet). It contains the changes for bug 370569 (Brazil), bug 373013 (Korea), Russian changes from bug 342314 comment 20 and the list from comment 16.
But I'm having some trouble editing it, since various editors tend to destroy the UTF-8 characters, even when they're supposed to be compatible. I have been scratching my head to remember what editor I used last time. I think it was one on my old Mac, and might try again, when I'm at home.
The real editor that I used to edit the list, is Firefox itself. You can find the original document in <http://wiki.mozilla.org/User:Jhermans/scratch>, if you open that document to edit it. Select everything inside the form-field with select-all, then copy.
In the mean time, I'm going home now, and I'm trying to find the correct editor again. When I have the correct file, I'll try to generate the diff again.
Comment 27•17 years ago
Hmmm ... my editor apparently didn't save it in Unicode after all (I thought I checked it by opening it in Firefox first). Let's try again.
Note : the characters that give the most problems (not present in ISO-8859-1), are marked with the comment "utf8 !"
Attachment #299031 -
Attachment is obsolete: true
Comment 28•17 years ago
Remind me again why the list contains UTF-8 characters rather than punycode? (I guess, though, that if it was punycode, we'd probably want the UTF-8 just above it anyway so people knew what it said...)
Gerv
Comment 29•17 years ago
The previous editor apparently saved it as utf-16, but this should be the correct one. The only editor that I know that does it correctly, and the one I used before, is old TextEdit on Mac OS X. I guess that Seamonkey would be fine too.
Attachment #299033 -
Attachment is obsolete: true
Reporter
Comment 30•17 years ago
(In reply to comment #28)
> Remind me again why the list contains UTF-8 characters rather than punycode?
just historical - no good reason. Jo, if this really is giving you trouble across editors, please feel free to convert all the entries to punycode - i'll gladly r+ a patch to do that. (it'd be nice to leave the utf8 in a comment beside the entry, i'm guessing that would get corrupted - but how badly?).
Comment 31•17 years ago
(In reply to comment #28)
> Remind me again why the list contains UTF-8 characters rather than punycode? (I
> guess, though, that if it was punycode, we'd probably want the UTF-8 just above
> it anyway so people knew what it said...)
>
> Gerv
>
If nsEffectiveTLDService::AddEffectiveTLDEntry() is converting the UTF-8
strings to ACE, then I guess it would be possible. But note that I didn't
actually type all those names, I copied them from the various websites. I can
convert all strings if you like, but as you say, we would probably still need
to mention the utf-8 name in a comment, otherwise people wouldn't recognize
them.
Note : this time, attachment 299051 [details] is the correct one. I'll try tomorrow to
find a diff that doesn't mangle the data either.
Comment 32•17 years ago
My diff works fine, on Linux, configured with UTF-8 as default encoding system.
Maybe I'll even attach a diff before you wake up.
Comment 33•17 years ago
Attachment #299081 -
Flags: review?
Updated•17 years ago
Attachment #299081 -
Flags: review? → review?(dwitte)
Comment 34•17 years ago
Here: http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/effective_tld_names.dat#1889
lea?gaviika.no
should be:
leaŋgaviika.no
http://www.norid.no/domenenavnbaser/whois/?query=lea%26%23331%3Bgaviika.no&charset=iso-8859-1&sok=s%F8k
Comment 35•17 years ago
Converting to punycode, plus:
- addressing my last comment
- removing the UTF8 entirely, to make it easier for people on non UTF8 systems to make patches
- adding a comment at the start (feel free to correct my English or the sentence itself)
Attachment #299081 -
Attachment is obsolete: true
Attachment #299086 -
Flags: review?(dwitte)
Attachment #299081 -
Flags: review?(dwitte)
Comment 36•17 years ago
I've seen that the extra blank line at the end of the file is strictly required, otherwise the last rule won't be applied.
This is the same as last patch, without the last line removal.
Attachment #299086 -
Attachment is obsolete: true
Attachment #299092 -
Flags: review?(dwitte)
Attachment #299086 -
Flags: review?(dwitte)
Comment 37•17 years ago
This patch incorporates the changes from Jo's last list on top of the patch I've submitted.
Should attachment 299092 be approved, this patch would make a (possible) final list.
I'd suggest removing the "utf8 !" comments. If so, it's better to remove them here than in the other patch.
This way we make that file plain ascii and then everyone else can make diffs cleanly.
Reporter
Comment 38•17 years ago
Comment on attachment 299092 [details] [diff] [review]
Fixing a minor issue with previous patch
>Index: netwerk/dns/src/effective_tld_names.dat
>===================================================================
>@@ -1,8 +1,11 @@
>+// All entries on this file should be on punnycode (ACE)
>+// rather than UTF8.
Perhaps instead:
// All entries in this file should be in ASCII or ACE (punycode) encodings.
>-`øksnes.no
>+xn--`ksnes-bya.no
the ` looks like a slip of the keyboard - can you verify?
looks good, r=me, provided we remove all the "utf8!" comments in the followup patch. i didn't check all the conversions are correct - can you tell us how you did it? i'm not too worried about readability of the punycode entries, since there are websites that will do the conversion if people care (see e.g. http://idnaconv.phlymail.de/).
would be great to get this, and the followup, in for b3 - i'll see if we can get blanket sr/moa for these changes...
Attachment #299092 -
Flags: review?(dwitte) → review+
Reporter
Comment 39•17 years ago
followup note to Ruben: might need to note the file is ACE encoded on the website...
Comment 40•17 years ago
(In reply to comment #38)
> (From update of attachment 299092 [details] [diff] [review])
> >Index: netwerk/dns/src/effective_tld_names.dat
> >===================================================================
> >@@ -1,8 +1,11 @@
> >+// All entries on this file should be on punnycode (ACE)
> >+// rather than UTF8.
>
> Perhaps instead:
> // All entries in this file should be in ASCII or ACE (punycode) encodings.
OK.
> >-`øksnes.no
> >+xn--`ksnes-bya.no
The ` is ascii... so it passed the conversion. Thanks for the catch:
http://www.norid.no/domenenavnbaser/whois/index.php3?charset=UTF-8&query=%C3%B8ksnes.no&sok=s%C3%B8k
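A stray ASCII character like that backtick survives any UTF-8 to ACE conversion, so it has to be caught separately. The following Python sketch shows one way a converted rule could be checked; the helper name `invalid_labels` is hypothetical, and the assumption is that every label of a valid entry consists only of lowercase letters, digits and hyphens.

```python
import re

# Lowercase letters, digits and hyphens only; no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")

def invalid_labels(rule):
    """Return the labels of an ACE-encoded rule that are not plain LDH labels."""
    labels = rule.lstrip("!").split(".")
    # A leading "*" is a wildcard, not a label to validate.
    if labels and labels[0] == "*":
        labels = labels[1:]
    return [label for label in labels if not _LDH_LABEL.fullmatch(label)]
```

Running this over every rule after conversion would have flagged `xn--`ksnes-bya.no` because of the backtick, while leaving entries such as `*.uk` alone.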
> looks good, r=me, provided we remove all the "utf8!" comments in the followup
> patch. i didn't check all the conversions are correct - can you tell us how you
> did it? i'm not too worried about readability of the punycode entries, since
> there are websites that will do the conversion if people care (see e.g.
> http://idnaconv.phlymail.de/).
I modified nsEffectiveTLDService::LoadOneEffectiveTLDFile to write each rule out to a new file, so I got a new converted eTLD file on startup.
The most important line is this:
mIDNService->ConvertUTF8toACE(rule, rule);
Moving the write to file to below TruncateAtWhitespace will clean out the "utf8 !"'s.
It's possible to do it all in JS instead of C++ and put it in an extension, but that would take more time.
Comment 41•17 years ago
I had a better idea, since the first part of this patch wouldn't pass sr, anyway, since the file we were using had some issues other than those fixed on attachment 299094 [details] [diff] [review] and the one you mentioned.
We also gain a lot of time this way.
Compare with attachment 299051 [details].
Note: There were a few white-space changes. I've removed the spaces at end of line.
Attachment #299092 -
Attachment is obsolete: true
Attachment #299094 -
Attachment is obsolete: true
Attachment #299109 -
Flags: review?(dwitte)
Comment 44•17 years ago
(In reply to comment #39)
> followup note to Ruben: might need to note the file is ACE encoded on the
> website...
>
Thanks for the note; I'll make sure I change it and i'll upload the whole new site this weekend.
Comment 45•17 years ago
You've added 2nd level domains from CentralNIC, NetRegistry and eu.org. It seems to me that we shouldn't do that until we have an explicit request from the company concerned.
Not wanting to flip-flop on the UTF-8 thing, but it seems to me that putting them in the file in their full form does make the file a heck of a lot easier to read. Example: can someone tell me if bådåddjå.no is in the list? Well, not easily, because it's not written that way.
If the list were UTF-8 and Mozilla would need to convert the entire list to punycode on startup, I completely agree that we need to eliminate that unnecessary step. But I still think there is value in having the UTF-8 values in there as comments, even if we are repeating ourselves.
Gerv
Comment 46•17 years ago
(In reply to comment #45)
> If the list were UTF-8 and Mozilla would need to convert the entire list to
> punycode on startup, I completely agree that we need to eliminate that
> unnecessary step. But I still think there is value in having the UTF-8 values
> in there as comments, even if we are repeating ourselves.
This file isn't edited that often; can we agree that the win of easy readability trumps the lesser win of easy editability? Don't forget that repeating/duplicating makes it much easier to have divergence between the two.
Comment 47•17 years ago
(In reply to comment #45)
> Not wanting to flip-flop on the UTF-8 thing, but it seems to me that putting
> them in the file in their full form does make the file a heck of a lot easier
> to read. Example: can someone tell me if bådåddjå.no is in the list? Well,
> not easily, because it's not written that way.
>
> If the list were UTF-8 and Mozilla would need to convert the entire list to
> punycode on startup, I completely agree that we need to eliminate that
> unnecessary step. But I still think there is value in having the UTF-8 values
> in there as comments, even if we are repeating ourselves.
All utf8 entries are converted to punycode on each startup.
Do you want to try only converting the utf8 entries to punycode, without adding the other entries, and measure the startup perf impact?
Making diffs with the utf8 files has been a real pain for Jo on Mac.
Since a diff includes context lines, utf8 in comments would get in the way anyway.
And, how often will someone want to look at the list to see if a domain is missing?
Comment 48•17 years ago
> All utf8 entries are converted to punnycode on each startup.
Seriously? We should add a build step to do the conversion; I'm sure there's something to do it in Python.
> And, how often will someone wants to look at the list to see if a domain is
> missing?
I imagine audits wouldn't be that uncommon at release time and, say, if other browsers wanted to swipe this from us but still do their own QA to be safe.
Comment 49•17 years ago
(In reply to comment #48)
> Seriously? We should add a build step to do the conversion; I'm sure there's
> something to do it in Python.
http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#405
nsEffectiveTLDService::LoadOneEffectiveTLDFile is called on startup; it calls AddEffectiveTLDEntry for each non-comment, non-empty line, which processes the line and calls NormalizeHostname for the host itself:
http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#282
which checks whether it's ASCII, then converts it to lower case if so, or to ACE otherwise.
If the file is already normalized, there is no need to call NormalizeHostname.
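As a rough illustration of that normalization step (not the actual C++ code), the same two paths can be sketched in Python. The function name `normalize_hostname` is hypothetical, and Python's built-in `idna` codec stands in for the IDN service here.

```python
def normalize_hostname(host):
    """Lowercase ASCII input; convert anything else to ACE (punycode)."""
    if host.isascii():
        # Fast path: most entries in the list are plain ASCII.
        return host.lower()
    # Slow path: encode each non-ASCII label as "xn--...".
    return host.encode("idna").decode("ascii")
```

For example, `normalize_hostname("CO.UK")` gives `co.uk`, and `normalize_hostname("bücher.example")` gives `xn--bcher-kva.example`.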
> > And, how often will someone wants to look at the list to see if a domain is
> > missing?
>
> I imagine audits wouldn't be that uncommon at release time and, say, if other
> browsers wanted to swipe this from us but still do their own QA to be safe.
So they could use any web app, such as http://idnaconv.phlymail.de/, or some script, or someone could even build an extension.
Comment 50•17 years ago
(In reply to comment #49)
> nsEffectiveTLDService::LoadOneEffectiveTLDFile is called on startup, it calls
> AddEffectiveTLDEntry for each non-comment non empty line, which processes the
> line and call NormalizeHostname for the host itself:
> http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#282
>
> which checks if it's ascii, then converts it to lower case or convert to ACE
> otherwise.
>
> If the file is already normalized, there is no need to call NormalizeHostname.
Yowza. So we're "needlessly" converting/lowercasing the whole file at startup, every startup. That's pointless work we really don't need to do -- I'll take a look at that and see how we can make improvements here.
> So they could use any web app, such http://idnaconv.phlymail.de/, some script
> or even someone could build an extension.
That's not ease of use by any means. :-)
Comment 51•17 years ago
(In reply to comment #50)
> Yowza. So we're "needlessly" converting/lowercasing the whole file at startup,
> every startup. That's pointless work we really don't need to do -- I'll take a
> look at that and see how we can make improvements here.
It looks like the eTLD service can load arbitrary eTLD.dat files - at least from the profile folder. So we can't guarantee that the file will be normalized. But it's possible to convert the file to punycode while building, save the ascii+punycode file in the app folder and pass a flag saying there is no need to convert it again while loading.
We could even keep an eTLD.dat.txt and an eTLD.dat in the same folder, check timestamps, and rebuild eTLD.dat if needed.
I'm using this fact to generate the converted file on startup. It's trivial to make it add the UTF8 entries as comments before the punycode.
So, if there's consensus, I can do that.
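A conversion pass of that kind could look roughly like the Python sketch below. It assumes `//` comment lines and rules that end at the first whitespace; the generator name `convert_rules` is hypothetical, and Python's `idna` codec is used in place of the IDN service.

```python
def convert_rules(lines):
    """Yield normalized lines; each non-ASCII rule is preceded by a comment
    holding its original UTF-8 form."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            # Blank lines and comments pass through unchanged.
            yield stripped
            continue
        # Keep only the rule itself; this drops trailing notes such as "utf8 !".
        rule = stripped.split()[0]
        if rule.isascii():
            yield rule.lower()
        else:
            yield "// " + rule
            yield rule.encode("idna").decode("ascii")
```

Because the readable form is written as a comment directly above the ACE form, an audit can still search the file for the UTF-8 name.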
> That's not ease of use by any means. :-)
Would making Composer punycode-aware count as ease of use?
Reporter
Comment 52•17 years ago
(In reply to comment #50)
> Yowza. So we're "needlessly" converting/lowercasing the whole file
pretty sure you're chasing ghosts here - it may seem "shock! horror!" but the fact is this is gonna be small potatoes. the majority of lines will go through the fast path (ASCII), and the UTF8 entries won't be significant since there aren't many of them. turning off normalization on read-in is dangerous, in case a) something slips by us (we've found two obvious typos in the file already), b) the user replaces the tld file, or has supplemental ones.
however, i completely agree it'd be nice to do this better (and avoiding the file read altogether *would* be significant). preprocessing the file at build time, and using static data, would be nice - if we did that, such that we can guarantee the data is normalized, and we either remove or special-case the supplemental file capabilities, then it'd be viable.
Reporter
Comment 53•17 years ago
nominating for blocking - we should definitely roll in these updates before release (and we'll use this bug to cover all of them).
Flags: blocking1.9?
Comment 54•17 years ago
Bug 414122 moves the processing into the build process for great justice.
(In reply to comment #51)
> To make Composer punnycode aware would be easy of use?
Using a proper UTF-8-supporting editor is ease of use; there's not really a shortage of them these days. For example, TextWrangler works just peachy for me over here on Unicode.
Updated•17 years ago
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
Comment 55•17 years ago
(In reply to comment #54)
> Bug 414122 moves the processing into the build process for great justice.
>
> (In reply to comment #51)
> > To make Composer punnycode aware would be easy of use?
>
> Using a proper UTF-8-supporting editor is ease of use; there's not really a
> shortage of them these days. For example, TextWrangler works just peachy for
> me over here on Unicode.
>
TextWrangler does not run on any operating system I have access to. I have been using Firefox as an editor, but since it doesn't save any files (I guess Seamonkey might do the trick), I have been copying the data in various editors on Linux, Windows, Solaris and Mac OS X. But only TextEdit did what I asked : File->New. Edit->Paste, File->Save, without any changes to the data. You would be surprised how few I found that can complete such a simple task. Most decided to save it UTF-16 when they discovered a character that can't be present in ISO-8859-1 (the ŋ in .no domains, and the Taiwanese ones). Or they did it in UTF-8, but they mangled the character anyway.
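A quick check of the saved bytes can catch an editor that silently switched encodings before the file is attached. This is a minimal Python sketch under the assumption that the file should be UTF-8 without a byte order mark; the helper name `check_utf8` is hypothetical.

```python
import codecs

def check_utf8(data):
    """Return None if data is BOM-free UTF-8, else a short problem description."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "file starts with a UTF-16 byte order mark"
    if data.startswith(codecs.BOM_UTF8):
        return "file starts with a UTF-8 byte order mark"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid UTF-8 at byte offset {exc.start}"
    return None
```

It would be called on the raw file contents, for example `check_utf8(open(path, "rb").read())`. It does not detect a character that an editor already replaced with `?`, since that is still valid UTF-8.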
Comment 56•17 years ago
TextWrangler is OS X; gEdit also handles UTF-8, and that's Linux. How do both of these not work (especially the Linux one, since Linux at this point is basically pure UTF-8, no ASCII to be seen)?
Comment 57•17 years ago
Jo, you can try Emacs:
Mac OS X: http://aquamacs.org/
Win: http://www.ourcomments.org/Emacs/EmacsW32.html
Linux: Your distro package or http://ftp.gnu.org/pub/gnu/emacs/
There is a builtin diff that should work against CVS cleanly.
If you can't find a way to make the diff work for you with utf8, you can upload the whole file and someone else can make a diff.
Your last list should have the "lea?gaviika.no" and "`øksnes.no" corrected to
"leaŋgaviika.no" and "øksnes.no".
Other than that there are Gerv's comments about 2nd level eTLDs.
Comment 58•17 years ago
Comment on attachment 299109 [details] [diff] [review]
Addressing comments and joining both patches
It looks like most people would rather not convert the file, and Jeff is taking care of pre-processing the file in another bug.
Attachment #299109 -
Attachment is obsolete: true
Attachment #299109 -
Flags: review?(dwitte)
Comment 59•17 years ago
(In reply to comment #56)
> TextWrangler is OS X; gEdit also handles UTF-8, and that's Linux. How do both
> of these not work (especially the Linux one, since Linux at this point is
> basically pure UTF-8, no ASCII to be seen)?
>
I'm running OS X 10.2.8 - TextWrangler requires 4.0. And Linux - I'm using an emasculated version at the office (which is also why I can't find a diff that works correctly), where gEdit doesn't run. Neither does Firefox btw, since that requires a newer version of glib.
I even tried vi as an editor (I didn't bother with emacs; I haven't fired that up in at least 4 years) - but it still mangles the data (it saved either in ASCII or in UTF-16). PS : emacs is installed on OS X - no need to download it.
Your remark about "leaŋgaviika.no" and "øksnes.no" is exactly what I'm talking about - every time you think it's correct, and check and recheck the file, you always discover there's a problem. It doesn't even matter where or how it happens - the entire data is very fragile ! That's why I applaud all efforts to simplify it, like only using punycode in the input file.
Comment 60•17 years ago
Emacs 21 doesn't handle unicode very well, I needed some black magic to make it work.
But editing utf-8 files anywhere is possible without much effort... if your system is configured to use western encoding, vi and emacs will pick that as default.
Evaluating those:
(set-keyboard-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
or C-x RET t utf-8
and C-x RET k utf-8
I could make emacs open that eTLD file as utf8 on a system configured to use iso-8859-1.
I couldn't make the Composer on SM edit that file, it opens the file as read-only when it actually contains utf8 entities.
Another approach would be if Gerv and Jeff would be OK with making the list on the web-site the reference, and using punycode in the source code.
But I think everyone agreed we should move on to getting the list as complete as possible for b3.
It's better to leave optimizations and changes in the format for later, even though the pre-processing Jeff is doing would be nice for 1.9.
Comment 61•17 years ago
(In reply to comment #55)
> But only TextEdit did what I asked :
> File->New. Edit->Paste, File->Save, without any changes to the data. You would
> be surprised how many I found that can complete such a simple task. Most
> decided to save it UTF-16 when they discovered a character that can't be
> present in ISO-8859-1 (the ŋ in .no domains, and the Taiwanese ones). Or they
> did it in UTF-8, but they mangled the character anyway.
So just explicitly switch to UTF-8 before pasting? Or open the previously existing file, which should already be UTF-8? Either way I don't really see the problem.
Using UTF-8 rather than punycode makes the file better accessible and maintainable.
Reporter
Comment 62•17 years ago
alright - let's do this thing in UTF8. punycode would be nice, but Waldo's right, readability beats editability. sorry Jo - but, the least we can do is roll diffs for you if you attach full files here.
i have verbal moa=biesi for all updates to this file for 1.9. asrail (or Jo), if you want to roll together a final patch with Jo's changes, i'll review it and we can land for b3. (please do remove those existing "utf8!" comments - they look a bit silly!)
Comment 63•17 years ago
|
||
I'm not sure what the best thing to do is about Gerv's comments on 2nd level eTLDs,
namely the ones from
CentralNIC, NetRegistry and eu.org.
I can make a patch without that controversial part, which is minor.
That part could then land even after b3, if desired.
Reporter | ||
Comment 64•17 years ago
|
||
yeah, let's leave those out for now, can land later.
gerv, can we just ask them about adding those domains? seems like the responsibility is generally on us to keep our file up to date. (also, do we already have domains listed from those registrars? if so, and assuming we didn't ask before, why ask now?)
Comment 65•17 years ago
|
||
Attachment #299640 -
Flags: review?(dwitte)
Comment 66•17 years ago
|
||
Since there were a few whitespace changes, uploading a diff -w.
The other one is intended to land, this one is for comparison.
Comment 67•17 years ago
|
||
I've removed the "free za" subdomains, since they are in the same category as eu.org and others.
And Dan, we don't have domains from those registrars yet.
Reporter | ||
Comment 68•17 years ago
|
||
Comment on attachment 299640 [details] [diff] [review]
Jo's list without 2nd level eTLDs (checked in)
r=dwitte
Attachment #299640 -
Flags: review?(dwitte) → review+
Reporter | ||
Comment 69•17 years ago
|
||
Comment on attachment 299640 [details] [diff] [review]
Jo's list without 2nd level eTLDs (checked in)
landed this one - leaving this one open for followup updates for 1.9.
Attachment #299640 -
Attachment description: Jo's list without 2nd level eTLDs → Jo's list without 2nd level eTLDs (checked in)
Comment 70•17 years ago
|
||
(In reply to comment #64)
> gerv, can we just ask them about adding those domains? seems like the
> responsibility is generally on us to keep our file up to date. (also, do we
> already have domains listed from those registrars? if so, and assuming we
> didn't ask before, why ask now?)
Here's my thought. Ideally, the controller of each TLD or pseudo-TLD would be responsible for their section of the file. But until that is the case, we have a duty to be very cautious.
Say, for example, we-provide-great-urls.com currently makes domains available in country-specific areas, e.g.:
my-company.uk.we-provide-great-urls.com
someone-else.de.we-provide-great-urls.com
So we come along and put uk.we-provide-great-urls.com and de.we-provide-great-urls.com into the TLD list.
However, unbeknown to us, this company is planning to launch a great new service with even shorter URLs, allowing you to do:
mycompany.we-provide-great-urls.com
They launch this service on the same day Firefox 3 comes out. But disaster - no-one with one of these new domains can set cookies in Firefox. (That's what would happen, right?)
In other words: we should not be adding restrictions in the sub-domain space of private companies without their express permission.
Gerv
Reporter | ||
Comment 71•17 years ago
|
||
(In reply to comment #70)
> They launch this service on the same day Firefox 3 comes out. But disaster -
> no-one with one of these new domains can set cookies in Firefox. (That's what
> would happen, right?)
well, as you pose the problem, no - mycompany.we-provide-great-urls.com would still be able to set cookies and everything else, *unless* we added a *.we-provide-great-urls.com rule, but that really seems like something we wouldn't do (for exactly this reason). (wildcard rules should be used with care!)
so, i'm not sure what you think given this information, though i do agree we should be cautious. i just want to make sure things don't get stalled, since more and more consumers are using this list for security-related purposes.
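The distinction drawn above between exact rules and a wildcard rule can be sketched as follows. This is a simplified model under stated assumptions, not the actual eTLD service code: `effective_tld` and `is_treated_as_etld` are hypothetical names, exception rules are left out, and the rule sets are made up from the example in comment 70.

```python
from typing import Set


def effective_tld(host: str, rules: Set[str]) -> str:
    """Return the eTLD of host under a simplified rule set.

    Handles exact rules ("co.uk") and wildcard rules ("*.co") only.
    When no rule matches, the last label is used as the eTLD.
    """
    labels = host.lower().split(".")
    # Try the longest candidate suffix first so the most specific rule wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
        # "*.parent" matches any single label directly under parent.
        parent = ".".join(labels[i + 1:])
        if parent and "*." + parent in rules:
            return candidate
    return labels[-1]


def is_treated_as_etld(host: str, rules: Set[str]) -> bool:
    """True if the whole host is itself considered an eTLD."""
    return effective_tld(host, rules) == host


# Exact rules only, as proposed in comment 70.
exact_rules = {
    "com",
    "uk.we-provide-great-urls.com",
    "de.we-provide-great-urls.com",
}
# The same rules plus the wildcard that comment 71 warns against.
wildcard_rules = exact_rules | {"*.we-provide-great-urls.com"}

new_host = "mycompany.we-provide-great-urls.com"
print(effective_tld(new_host, exact_rules))
print(is_treated_as_etld(new_host, wildcard_rules))
```

With only the exact rules, the new short host falls under "com" and behaves like any ordinary .com site. Only the wildcard rule turns it into an eTLD of its own, which is the breakage scenario described in comment 70.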
Reporter | ||
Comment 72•17 years ago
|
||
Ruben - how's the site update going? can you ping me once it's live, and I (or you!) can draft up an email to send out to registrars (comment 17).
Comment 73•17 years ago
|
||
Dan, I've updated the site to mention that the list is in UTF-8 format (I gather that's the latest consensus) and I've uploaded everything to the SVN - it just needs to be replicated onto the live site now...
There's a small typo ;), whenever the eTLD.dat is next updated:
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/netwerk/dns/src/effective_tld_names.dat&rev=1.3&mark=1148#1148
Comment 75•17 years ago
|
||
(In reply to comment #74)
> There's a small typo ;), whenever the eTLD.dat is next updated:
>
> http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/netwerk/dns/src/effective_tld_names.dat&rev=1.3&mark=1148#1148
I needed to kick off the unit test boxen, so I just fixed it.
Updated•17 years ago
|
Flags: tracking1.9+
Comment 76•17 years ago
|
||
Beltzner: did you mean to mark this blocking1.9+? Removing the tracking-1.9+ flag without a comment makes it look like a mistake.
Flags: blocking1.9?
Comment 77•17 years ago
|
||
(In reply to comment #69)
> (From update of attachment 299640 [details] [diff] [review])
> landed this one - leaving this one open for followup updates for 1.9.
...
(In reply to comment #76)
> Beltzner: did you mean to mark this blocking1.9+? Removing the tracking-1.9+
> flag without a comment makes it look like a mistake.
Oops. I should have commented and marked this wanted-next+. I'm assuming based on comment 69 that we've got what we need to ship, any further updates would be sauce for the goose.
Flags: wanted-next+
Flags: blocking1.9?
Flags: blocking1.9-
Reporter | ||
Comment 78•17 years ago
|
||
yep, further updates here will be for the love of saucy geese.
Comment 79•17 years ago
|
||
The current .dat file has "*.co" as the rule for .co, but the corresponding page at http://en.wikipedia.org/wiki/.co seems to give a small set of specific TLDs. Is this the right bug to request a review on this? I'm not sure of the process here :(
Comment 80•17 years ago
|
||
It's not a problem, registrations are only possible at the third level in Colombia, so *.co matches all second level domains as effective TLDs. Or any future second level for that matter. This way we don't have to specify all of them.
The wildcard rules are a bit different from the regular expressions you may be used to, because they're supposed to match eTLDs, not domains.
Now, if we could convince those registrars not to mix 2nd level and 3rd level domains together, then the file could be a lot simpler ...
Comment 81•17 years ago
|
||
Sure, I understand the eTLD match versus domain match issue; my concern was that right now we would think that "foo.bar.co" was an address in the valid eTLD bar.co, when in fact it's not a valid address at all. It would be nice to distinguish these cases.
Reporter | ||
Comment 82•17 years ago
|
||
the purpose of the eTLD service isn't to determine if an address is valid, it's just to provide the (supposed) eTLD of a given host. providing a wrong answer in the case of an invalid host is acceptable.
Comment 83•17 years ago
|
||
The current eTLD file says for .ru: "there should be geo-names like msk.ru, but I didn't find a list". One of our users found the list:
http://www.ripn.net:8082/nic/dns/geo_list.html
http://www.ripn.net:8082/nic/dns/generic_domains.html
This patch adds the missing .ru geographic domains and ac.ru, which is also missing.
Comment 84•17 years ago
|
||
Can I suggest that the www.centralnic.com domains are added? People can register under the following "eTLDs":
eu.com
uk.com
uk.net
us.com
cn.com
de.com
jpn.com
kr.com
no.com
za.com
br.com
ar.com
ru.com
sa.com
se.com
se.net
hu.com
gb.com
gb.net
qc.com
uy.com
ae.org
Comment 85•17 years ago
|
||
Also, currently the eTLD system defaults to assuming that an unknown eTLD has minimum length (i.e. "www.example.invalid" is assumed to have an eTLD of "invalid"). Isn't this "failing dangerously?" - i.e. wouldn't it be safer to assume that in an unknown situation we should default to restrictive rather than permissive?
Comment 86•17 years ago
|
||
The centralnic domains are not true eTLDs, merely domain names for which a company sells out subdomains.
An eTLD of "invalid" in the example you give is correct, and doing this differently could have unknown consequences on intranet sites.
Comment 87•17 years ago
|
||
If the CentralNIC domains are not "true eTLDs" then either you or I must be considerably misunderstanding the meaning of the 'e' in "eTLDs". As I understand it, "domain names for which a company sells out subdomains" is precisely what this list *is*.
A web site of "example.uk.com" is *not* the same organisation as "otherexample.uk.com" and sharing cookies between them is a security risk. I don't particularly approve of the "uk.com" style domains either, but like it or not, it is a fact that they *are* eTLDs.
An eTLD of "invalid" in my example is not correct. However, I don't know the implications for intranet sites, it may be that a deliberate and reasoned decision has been made that breaking intranet sites is worse than the potential security risk for sites on unknown eTLDs, i.e. that for practical reasons the "wrong" answer is deliberately given - in which case I won't argue.
Comment 88•17 years ago
|
||
CentralNIC is not the registrar for .com, .net, or .org. It merely owns some domains in them. Even if it's selling subdomains to customers, foo.eu.com and bar.eu.com are in the same top-level domain, ".com". This is not different than any other ISP that supplies its customers with subdomains on the ISP's domain.
On what grounds do you say that "invalid" is not the correct eTLD for "www.example.invalid"? The effective top-level domain of a host that does not match any known rules is the last subcomponent. This is what browsers have always done--before the time of eTLD data the TLD of a site was _always_ considered to be the last subcomponent; this merely limits that fallback case to when we don't have more specific data. This is also compatible with new TLDs -- if ICANN or similar announces .coolnewdomain, existing Firefox instances will do the right thing with it even before they're upgraded to a version with more specific eTLD data.
Comment 89•17 years ago
|
||
Re CentralNIC, you still appear to be missing the significance of the "e" in "eTLD". Yes, "example.uk.com" and "otherexample.uk.com" are under the same "top-level domain". They are *not*, however, under the same "eTLD".
(For that matter, example.co.uk and otherexample.co.uk are under the same TLD.)
Also if ICANN announces ".coolnewdomain", existing Firefox instances will only "do the right thing" if it happens that end users register directly under the TLD. If they actually register under "a.coolnewdomain" and "b.coolnewdomain" (like ".pro", for example) then Firefox is certainly not "doing the right thing" - that's the whole point of the eTLD list in the first place.
As I said above, denying that there's a problem is pointless. Saying "yes there's a problem but it's insoluble and we've thought about it and decided upon what we believe is the least-worst solution" is fine, if it's true.
Are you actually a Mozilla committer or am I wasting my time arguing with you? ;-)
Assignee | ||
Comment 90•17 years ago
|
||
Your comments about CentralNIC are true, but they apply just as well to geocities.com or comcast.net or any other ISP or hosting service that lets users create their own virtual subdomains. We have to draw the line somewhere, and we've chosen to draw it at official registrars.
Comment 91•17 years ago
|
||
To follow Pam's comments...
You still haven't said what you think the correct TLD for no-rule scenarios is. You are right that our current behavior is suboptimal if a new domain is announced using registration only under subcomponents of that domain, but the scope of the damage is limited: basically, we fall back to the pre-eTLD behavior. What is your alternative?
Comment 92•17 years ago
|
||
Pam - I guess that's fair enough. I suggest that it might be an idea to write the policy down somewhere though, if a policy has been decided upon. Even just a comment at the top of the eTLD data file would be a big improvement. I'm not sure I 100% agree with it but I won't argue about it.
Peter - sorry, I thought I made it obvious in my first comment on the subject what I thought the behaviour should be: given "foo.bar.unknown", it should default to "most restrictive", i.e. refuse to share cookies with anything not ending in "foo.bar.unknown", just the same as if the site was "example.co.uk" or "example.com".
As already mentioned, if it's already been thought about and a policy decision has been made that the security benefits of this approach are outweighed by practical breakages that would occur, that's fine but I think it should be documented (if it isn't already and I haven't missed it!)
Comment 93•17 years ago
|
||
If you use the "most restrictive" policy on unknown TLDs, then there is the chance that if ICANN approves a new TLD, previous versions of Firefox etc will break sites under that TLD since they will not allow cookies to be set/read. It is true that many more people regularly update their Mozilla products than, say, IE, but there is still a chance which must be considered.
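The two fallback policies debated in comments 85 to 93 can be compared with a small Python sketch. It rests on assumptions: all function names are hypothetical, "may share" is modelled simply as "same eTLD plus one label", and the restrictive policy is one reading of the proposal in comment 92, not an implemented behaviour.

```python
from typing import Callable, Optional


def fallback_permissive(host: str) -> str:
    """Behaviour described in this bug: the last label is the eTLD."""
    return host.rsplit(".", 1)[-1]


def fallback_restrictive(host: str) -> str:
    """Alternative from comment 92: the whole host is its own eTLD."""
    return host


def base_domain(host: str, fallback: Callable[[str], str]) -> Optional[str]:
    """Return eTLD plus one label, or None if the host is itself the eTLD."""
    etld = fallback(host)
    if etld == host:
        return None
    prefix = host[: -len(etld) - 1]  # strip the trailing ".<etld>"
    return prefix.rsplit(".", 1)[-1] + "." + etld


def may_share(a: str, b: str, fallback: Callable[[str], str]) -> bool:
    """Two hosts may share cookies only under a common base domain."""
    base_a = base_domain(a, fallback)
    return base_a is not None and base_a == base_domain(b, fallback)


# Hypothetical new TLD where registrations happen at the third level.
site1 = "site1.a.coolnewdomain"
site2 = "site2.a.coolnewdomain"
print(may_share(site1, site2, fallback_permissive))
print(may_share(site1, site2, fallback_restrictive))
```

Under the permissive fallback the two unrelated sites land in the same base domain, which is the security concern raised in comment 89. Under the restrictive fallback no host on an unknown TLD has a base domain at all, which is the breakage concern raised in comment 93.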
Assignee | ||
Comment 94•17 years ago
|
||
Should this bug be retired and a new one created?
Here's a patch incorporating the previous addition (missing eTLDs for .ru), adding .rs and .me information, and updating info for several TLDs based on posted registrar information (with occasional small additions from Wikipedia).
Assignee: nobody → pamg.bugs
Attachment #318014 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #331017 -
Flags: review?
Assignee | ||
Updated•17 years ago
|
Attachment #331017 -
Flags: review? → review?(jo.hermans)
Comment 95•17 years ago
|
||
Yes, please create a new bug for further additions. Thanks :-)
Gerv
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•17 years ago
|
Attachment #331017 -
Flags: review?(jo.hermans)
Assignee | ||
Comment 96•17 years ago
|
||
Moved to bug 447815.