Closed Bug 342314 Opened 14 years ago Closed 13 years ago
Need effective-TLD file
We need a properly formatted effective-TLD rule file to go along with the new service created in bug 331510. See above URL.
Thanks Pam, I'll start working on it at the end of July when I'm back from my holiday. Some preliminary work is done on http://wiki.mozilla.org/TLD_List . Note that the biggest problem will not be writing all the rules, but looking up the various policies of the TLD registrars. It took me at least 2 weeks to process them all. A few of them are not even in English; it helps if you can read French and Spanish (I only found 1 East European one that I couldn't read). Some of the third-level domains are not even mentioned, but can be discovered with dig (especially mil.* and gov.*), although they're probably not our concern. The biggest job will be documenting all that; most URLs can be found on the wiki, or on Wikipedia.
Since it sounds like it's time-consuming to create the full list, might it be useful to put in a few of the more common entries immediately so that the service can be tested? I personally have an extension that will use this service and I'd like to see some real results coming back to make sure I'm using it properly.
There's a tiny sample (UTF-8) file attached to bug 331510: https://bugzilla.mozilla.org/attachment.cgi?id=224732
(In reply to comment #2)
> Since it sounds like it's time-consuming to create the full list, might it be
> useful to put in a few of the more common entries immediately so that the
> service can be tested?
>
> I personally have an extension that will use this service and I'd like to see
> some real results coming back to make sure I'm using it properly.

Absolutely, we only need a few rules (*.com, *.co.uk, etc.) to start testing. We'll have to add the rest of the rules later (and they might need to be updated frequently anyway). Even the most basic rule, *.tld, will suffice for the moment, because it describes the current situation for the cookie-sort routine, for instance.
Update: I'm currently working on the data file, and I hope that I will be finished this weekend. It's a huge undertaking; I'm already at 1711 lines, despite use of wildcards. The record-holder is Norway, with 760 rules on its own (sigh). Note: it will be a bit different from the examples cited in http://wiki.mozilla.org/Gecko:Effective_TLD_Service#Example . For example, the correct ruleset for the United Kingdom is:

*.uk
*.sch.uk
!bl.uk
!british-library.uk
!icnet.uk
!jet.uk
!nel.uk
!nls.uk
!national-library-scotland.uk
!parliament.uk
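For readers unfamiliar with the rule syntax, here is a minimal Python sketch of how wildcard (`*.`) and exception (`!`) rules like the UK set could be evaluated. It assumes the matching semantics later documented for the Public Suffix List: the longest matching rule wins, an exception rule overrides everything and drops its leftmost label, and a name that matches nothing falls back to its last label. The service from bug 331510 may differ in detail, and the function name `effective_tld` and the sample hostnames are illustrative only.

```python
# Sketch only: rule semantics are assumed to follow the later Public Suffix
# List algorithm, not necessarily the service implemented in bug 331510.
UK_RULES = [
    "*.uk", "*.sch.uk", "!bl.uk", "!british-library.uk", "!icnet.uk",
    "!jet.uk", "!nel.uk", "!nls.uk", "!national-library-scotland.uk",
    "!parliament.uk",
]


def effective_tld(hostname, rules):
    """Return the effective TLD of hostname under the given rule list."""
    labels = hostname.lower().rstrip(".").split(".")
    best = None  # labels covered by the longest matching non-exception rule
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = rule.lstrip("!").split(".")
        if len(rule_labels) > len(labels):
            continue
        tail = labels[-len(rule_labels):]
        if all(r == "*" or r == h for r, h in zip(rule_labels, tail)):
            if is_exception:
                # An exception names a registrable domain, so the effective
                # TLD is the rule minus its leftmost label.
                return ".".join(tail[1:])
            if best is None or len(tail) > len(best):
                best = tail
    if best is None:
        best = labels[-1:]  # implicit fallback: the last label
    return ".".join(best)


print(effective_tld("www.example.co.uk", UK_RULES))  # co.uk
print(effective_tld("www.bl.uk", UK_RULES))          # uk
```

Under these assumptions the exception rules are what keep bl.uk and the others registrable directly under .uk even though *.uk would otherwise treat them as effective TLDs.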
Jo: presumably the file does not need to be complete for it to be useful, right? The only downside is that some TLDs do not get the protection that the TLDs you've done get, i.e. for them, the situation is as now. So given the relevant deadlines, should we stabilise and check what we have and get it checked in? And then we can improve it incrementally as ship dates permit. Gerv
Yes please, if you've already got 1200 rules that's a great start. Attach it here even if incomplete and we can start reviewing it. I strongly believe we need this in FF2 for the globalStorage feature.
Ok, here's what I have so far. I'll post a newer version tomorrow, after I've verified more rules that I have in my draft version. The *.no rules contain many international characters - at least 5 of them still have errors in them, I'll still have to fix them.
Comment on attachment 234092
first version (very incomplete)

>// the following rules would be only valid under the geo-name, but we can't express that
>// *.*.us cities, counties, parishes, and townships (locality.state.us)
>// !ci.*.*.us city government agencies (subdomain under locality)
>// !town.*.*.us town government agencies (subdomain under locality)
>// !co.*.*.us county government agencies (subdomain under locality)
>// k12.*.us public school districts
>// pvt.k12.*.us private schools
>// cc.*.us community colleges
>// tec.*.us technical and vocational schools
>// lib.*.us state, regional, city, and county libraries
>// state.*.us state government agencies
>// gen.*.us general independent entities (groups not fitting into the above categories)

If I understand this comment correctly, the way to express it is to write out each of the rules for all 51 geographic names. With that, a rule like |!ci.*.ak.us| is valid. Not *convenient*, certainly, but valid.
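The expansion suggested above could be generated rather than typed by hand. The following Python sketch substitutes a state code for the state-level wildcard in each commented-out template. The templates are copied from the quoted comment; `SAMPLE_STATES` is a deliberately short, hypothetical stand-in for the full set of 51 geographic names, and `expand_us_rules` is an illustrative name, not part of any existing tool.

```python
# Templates taken from the quoted comment; the state is the wildcard
# immediately before ".us" in every one of them.
US_TEMPLATES = [
    "*.*.us", "!ci.*.*.us", "!town.*.*.us", "!co.*.*.us", "k12.*.us",
    "pvt.k12.*.us", "cc.*.us", "tec.*.us", "lib.*.us", "state.*.us",
    "gen.*.us",
]

# Hypothetical sample only; the real list would hold all 51 names.
SAMPLE_STATES = ["ak", "ca", "ny"]


def expand_us_rules(templates, states):
    """Replace the state-level wildcard of each template with a state code."""
    rules = []
    for state in states:
        for template in templates:
            # Everything before the final "*.us" is kept unchanged.
            head, _, _ = template.rpartition("*.us")
            rules.append(f"{head}{state}.us")
    return rules


for rule in expand_us_rules(US_TEMPLATES, ["ak"]):
    print(rule)
```

With all 51 names this would yield 561 lines for .us alone, which illustrates the "not convenient, but valid" trade-off.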
As noted elsewhere, this isn't a blocker, but we would consider a patch.
Flags: blocking1.8.1? → blocking1.8.1-
I gathered a list of .pl subdomains and put it here: http://wiki.mozilla.org/TLD_List:.pl in a format that should be compatible with attachment 234092. I never supposed that the .pl domain could be such a mess, though. :/
Thought: we should give registries the opportunity to review the entry for their registry. I suspect ICANN has an announce mailing list which should be able to reach people at all the registries. I'll try and find out what it is, and how we could post a message there. Gerv
This should be complete. UTF8 names can be found in the .kr domain (Korea), in .tw (Taiwan) and in .no (Norway). They're marked with the word utf8.
Attachment #234667 - Attachment is obsolete: true
Jo - mind if I assign this to you? - Just making sure these have proper owners.
Assignee: nobody → jo.hermans
Jo - is this in progress?
From talking to dveditz today: this is not wired up to anything on the 1.8 branch, and we don't have a patch ready to hook it up to cookies. So removing this from the 1.8 list...
Flags: blocking1.8.1+ → blocking1.8.1-
*** Bug 319643 has been marked as a duplicate of this bug. ***
Jo, there is a list of Russian regional domains here: http://www.bizhost.ru/domen/alldomen.php It also lists biz.ua which you don't have on your list so maybe you want to check the Ukrainian ones as well. The official list is at http://www.ripn.net:8080/nic/dns/geo_list.html but it doesn't have the .su domains.
Actually, these should be the most complete and comprehensive lists for Russia and Ukraine: http://www.whois-service.ru/domains/?id=russian http://www.whois-service.ru/domains/?id=ukrainian
Hi, I've attached a third version of tld.txt which includes all the second-level registries that I'm aware of. I would have submitted a diff, but due to some sort of line-endings problem the output of diff was enormous. Just in case, here is a list of the second-level registries I added:

// CentralNic : http://www.centralnic.com/
ae.org
br.com
cn.com
de.com
eu.com
gb.com
hu.com
jpn.com
kr.com
no.com
qc.com
ru.com
sa.com
se.com
uk.com
us.com
uy.com
web.com
za.com
se.net
gb.net
uk.net

// NetRegistry : http://www.netregistry.com
au.com
jp.com

// za.net
za.net

// eu.org - looking at the DNS, it looks like all the EU country-code .eu.orgs
// are delegated, so this list might grow in the future
eu.org
cy.eu.org
gr.eu.org
il.eu.org
nl.eu.org
pt.eu.org
dk.eu.org
ie.eu.org

Gavin Brown
CentralNic Ltd.
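The line-endings problem mentioned above is a common reason for a diff to flag every line as changed. One way around it, sketched here in Python under the assumption that the two versions differ only in CRLF versus LF endings plus the real additions, is to split both files into lines without their terminators before comparing. The function name `rule_diff` and the sample strings are illustrative.

```python
import difflib


def rule_diff(old_text, new_text):
    """Unified diff of two rule files, ignoring line-ending differences."""
    # str.splitlines() drops "\r\n", "\r" and "\n" alike, so a file saved
    # with Windows endings compares equal line by line to a Unix one.
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", lineterm=""))


old = "*.uk\r\n!bl.uk\r\n"        # CRLF endings
new = "*.uk\n!bl.uk\nuk.com\n"    # LF endings, one rule added
for line in rule_diff(old, new):
    print(line)
```

Only the genuinely added rule shows up as a change; the unchanged rules appear as context lines.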
Gavin: thanks very much, but if a domain gets on this list which shouldn't be, we could really inconvenience the domain owner. Therefore, I think we should have a policy of only putting domains on when specifically asked to by the person controlling them. And we should verify that by asking them to post the list to a page they control in their webspace. So, if CentralNIC wanted their domains added, they would need to put up a note to that effect, with a list, at something like www.centralnic.com/effective-tlds.html. Then we'd know the request was legitimate and public. (But please don't do that right now; I'm making up policy off the top of my head here, and this could change.) I need to work with Jo on some infrastructure to manage this list going forwards... Gerv
Gerv: you're absolutely right about not adding in domains unsolicited. Coincidentally, we do already publish a list of domains that we manage. There's a human readable list at http://www.centralnic.com/domains but we have a plain text list at http://toolkit.centralnic.com/srv/suffixes We will be happy to produce this information in another format if you need it.
Gavin: that's one half of the puzzle; the other half is confirming that email@example.com actually has authority to speak for CentralNIC :-) Please bear with us while we put a system in place for this. Gerv
Gervase: good point, I had not realised I was using a personal e-mail address. I'll use my "official" address for future comments.
No reason I can see. Do the people working on the effective TLD service want to check this in now? Jo, are you still working on the file? Are you happy to be its maintainer for the moment? Gerv
(In reply to comment #27)
> No reason I can see.
>
> Do the people working on the effective TLD service want to check this in now?
>
> Jo, are you still working on the file? Are you happy to be its maintainer for
> the moment?
>
> Gerv

The third version by Gavin (my second one + CentralNic) seems OK. There are still some comments from me that mark the UTF-8-encoded names (search for "utf8"). We'll probably have to be careful not to remove or damage them in the future. When I originally made the list with Firefox 1.5, I lost the UTF-8-encoded characters a few times. Since this is a UTF-8-encoded file, we should probably add a BOM character too (automatic encoding detection doesn't work in this case). I'm very busy at my day job, but I can expand the initial notes further. I guess it will be some kind of wiki?
Please do not include CentralNIC until we have a formal mechanism set up to ensure that we can't be spoofed by troublemakers who pretend to speak for domain owners and who don't. Let's stick to the known ICANN, gTLD and ccTLD structure. I believe that UTF-8 does not permit a BOM; it doesn't make sense, because there are not two possible byte orders. Gerv
(In reply to comment #29)
> I believe that UTF-8 does not permit a BOM; it doesn't make sense, because
> there are not two possible byte orders.

But you can include the UTF-8 version of the BOM, which will trigger most encoding detectors: EF BB BF
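As a small illustration of the byte sequence just mentioned, this Python sketch prepends the UTF-8 BOM to file contents and strips it again when reading. It uses the standard library's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the helper names and the sample rule text are illustrative, and whether the data file should actually carry a BOM is left open by this thread.

```python
import codecs

# The UTF-8 signature: the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def add_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with it."""
    return data if data.startswith(BOM) else BOM + data


def read_rules(data: bytes) -> str:
    """Decode UTF-8 rule data, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


raw = "// utf8 sample rule\nål.no\n".encode("utf-8")
marked = add_bom(raw)
print(marked[:3].hex())                        # efbbbf
print(read_rules(marked) == read_rules(raw))   # True
```

Decoding with `utf-8-sig` accepts input with or without the signature, so a reader written this way does not depend on the outcome of the BOM discussion.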
(In reply to comment #29)
> Please do not include CentralNIC until we have a formal mechanism set up to
> ensure that we can't be spoofed by troublemakers who pretend to speak for
> domain owners and who don't. Let's stick to the known ICANN, gTLD and ccTLD
> structure.

OK, then it will be attachment 234865 for the moment. I'm not too sure about the list found in comment 19 (Russia), but that's because I don't know if bizhost.ru is authoritative. Unfortunately, I don't speak Russian, so it's not easy to figure all this out. Wikipedia claims that it's RELCOM (<http://www.relcom.ru/English>), but they don't have such a list. Note that the presence of such a list is not necessarily an indication of what the official rules for a TLD are. They might be a reseller like CentralNIC though. I always tried to follow the official rules, by searching the documents at a country NIC.
Jo, ripn.net is official for Russia, see http://en.wikipedia.org/wiki/.ru. Unfortunately this document isn't available in English but http://translate.google.com/translate?u=www.ripn.net%3A8080%2Fnic%2Fdns%2Fgeo_list.html&langpair=ru%7Cen is mostly fine.
Has this been checked in to the trunk yet? If not, what's stopping it? Gerv
(In reply to comment #21)
> Created an attachment (id=241327)
> Third version with second-level registries added

There are still many of the second-level .kr domains missing. See: http://en.wikipedia.org/wiki/.kr
Toby: no-one claims the file is complete. But if it's not, we just get the same behaviour as now. We need to get it checked in as a start. If you have more update suggestions, please send them to Jo Hermans. Gerv
Comment on attachment 252142
build changes to add the effective TLD file

>Index: netwerk/dns/src/Makefile.in
>+libs::
>+ $(INSTALL) $(srcdir)/effective_tld_names.dat $(DIST)/bin/res

To avoid issues like bug 364599, the libs:: target should use SYSINSTALL IFLAGS1 just like the install:: target does.
Attachment #252142 - Flags: review?(benjamin) → review+
Checked into trunk with the change bsmedberg suggested in comment 37.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
There are missing subdomains in the .ar zone. The complete list is com.ar, org.ar, gov.ar, mil.ar, net.ar, int.ar. I can help with Spanish TLDs, if you need it.
(In reply to comment #39)
> There's missing subdomains in the .ar zone. The complete list is com.ar,
> org.ar, gov.ar, mil.ar, net.ar, int.ar
> I can help with spanish tld's, if you need it.

No, they're not missing: *.ar catches them (the pattern syntax is not the usual regular-expression syntax you might expect). Since all domains must be registered as third-level domains, it was easier to express it like this, while allowing for future expansion. That's also why there's a list of exceptions, like uba.ar for the University of Buenos Aires:

*.ar
!congresodelalengua3.ar
!educ.ar
!gobiernoelectronico.ar
!mecon.ar
!nacion.ar
!nic.ar
!promocion.ar
!retina.ar
!uba.ar

I speak Spanish too; my wife is Colombian :-)
filed bug 403655 for updating the effective tld file for gecko 1.9. if anyone here has an interest in helping out with parts of that, i'm sure it would be appreciated. ;)