Closed Bug 1433694 Opened 7 years ago Closed 4 years ago

Create a script to extract language/region names FTL file out of CLDR

Categories

(Core :: Internationalization, enhancement, P3)

RESOLVED WONTFIX

People

(Reporter: zbraniecki, Assigned: kekoariggin)

Priority: -- → P3
Answering questions from bug 1431324 comment 5:

> - What are the files that need to be processed and where can I find them?

You can read them from: http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/ - which uses its own format, but there's also JSON at https://github.com/unicode-cldr that may make it much easier (read from JSON, write to FTL).

> - What output files are we hoping to have? Am I understanding correctly that we need one file per locale per data file?

Fluent. You can find a parser/AST/serializer in https://github.com/projectfluent/python-fluent . I assume you'll read the source (CLDR format or JSON), produce an AST for Fluent, and then use the serializer to write the file.
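For illustration, here is a minimal sketch of the "read from JSON, write to FTL" step. It is not the script this bug asks for: it uses only the standard library and formats `id = value` lines directly instead of building a Fluent AST, so it does no escaping of special characters (one reason the python-fluent serializer would be the better tool). The inline sample only mimics the shape of a CLDR JSON languages file; the function name and the `language-name-` ID prefix are hypothetical.

```python
import json

# Inline sample mimicking the shape of a CLDR JSON languages file
# (main -> <locale> -> localeDisplayNames -> languages).
SAMPLE = json.loads("""
{
  "main": {
    "it": {
      "localeDisplayNames": {
        "languages": {
          "en": "inglese",
          "fr": "francese",
          "it": "italiano"
        }
      }
    }
  }
}
""")


def cldr_names_to_ftl(data, locale, section="languages", prefix="language-name"):
    """Return FTL source text with one message per CLDR display name.

    `prefix` is a hypothetical naming scheme for the message IDs.
    """
    names = data["main"][locale]["localeDisplayNames"][section]
    lines = []
    for code in sorted(names):
        # Fluent identifiers allow letters, digits, "_" and "-";
        # lowercase the code so the IDs are uniform (e.g. "en-GB" -> "en-gb").
        message_id = f"{prefix}-{code.lower()}"
        lines.append(f"{message_id} = {names[code]}")
    return "\n".join(lines) + "\n"


print(cldr_names_to_ftl(SAMPLE, "it"), end="")
```

Passing `section="territories"` with a matching prefix would handle region names the same way, assuming the territories file follows the same layout.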
Assignee: nobody → kekoariggin
We should cast a wider net around our decision to upstream our choice of region names. Back in bug 1203171, Gerv changed our region names to be more or less what GENC uses. Gerv, any concerns about moving that over to CLDR?
Flags: needinfo?(gerv)
Bug 1416148 has a lot more data about the differences we have, for both language and region names. I was planning to NI Gerv there, not sure about the relation these two bugs should have.
I think they are correctly scoped. This bug is about writing a script. That bug is about analyzing differences in terminology and scope. They don't depend on one another. Maybe "see also"?
We chose GENC for human-readable region names very carefully, after a lot of thought about the pros and cons of each option. It took some months to get that change approved, and so there would need to be a very good reason to reopen that discussion. Are you suggesting there is one? It is not surprising that there are a number of differences between GENC and other sources; if all the sources were the same, there would be no need for a discussion on which to use :-) But those differences are significant, and we have decided that a) making decisions on a per-region-name basis is a really bad idea; and b) the GENC list is the best list, considering all factors. Does that help? Gerv
Flags: needinfo?(gerv)
Note about languages: the current CLDR data is not usable as-is for some languages (and I've only checked French and Italian). Mozilla capitalizes the name (e.g. "Italiano" for Italian), while CLDR uses lowercase ("italiano"). The lowercase form works in the middle of a sentence, but not stand-alone (the current use in Mozilla) or at the beginning of a sentence. That means we'd need not just to import the data, but also to apply a transformation to those language names, and potentially a per-locale rule for the kind of capitalization to apply.
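A rough sketch of what such a per-locale capitalization step could look like. The rule table and function name are hypothetical; only French and Italian are listed because those are the two locales checked above.

```python
# Hypothetical per-locale rule table: locales whose CLDR language names are
# lowercase and need an initial capital when displayed stand-alone.
STANDALONE_RULES = {
    "fr": "titlecase-first",
    "it": "titlecase-first",
}


def standalone_name(name, locale):
    """Return `name` adjusted for stand-alone display in `locale`."""
    rule = STANDALONE_RULES.get(locale, "no-change")
    if rule == "titlecase-first":
        # Uppercase only the first character; leave the rest untouched.
        return name[:1].upper() + name[1:]
    return name


print(standalone_name("italiano", "it"))
```

Locales missing from the table fall through unchanged, so the imported data is only altered where a rule has been explicitly recorded.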
I'd prefer to talk about the unification around CLDR for L10n data in bug 1431324, and leave this bug just for the implementation if we decide to do this. I responded to Gerv in bug 1431324 comment 7.
I've found that CLDR includes rules for capitalizing language names in context, e.g. https://github.com/unicode-cldr/cldr-misc-full/blob/master/main/it/contextTransforms.json There are casing rules for territories, but I can't find them in the GitHub repository, only in SVN https://www.unicode.org/repos/cldr/tags/release-33/common/casing/it.xml
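As a sketch of how those in-context rules might be consumed: the sample below assumes the JSON nests as main -> locale -> contextTransforms -> usage -> context, with CLDR's `titlecase-firstword` / `no-change` values. The sample values are illustrative rather than copied from the linked file, and the helper names are hypothetical.

```python
# Illustrative data in the assumed shape of contextTransforms.json.
SAMPLE_TRANSFORMS = {
    "main": {
        "it": {
            "contextTransforms": {
                "languages": {
                    "uiListOrMenu": "titlecase-firstword",
                    "stand-alone": "titlecase-firstword",
                }
            }
        }
    }
}


def context_transform(data, locale, usage, context):
    """Look up the transform for a usage/context pair, defaulting to no-change."""
    usages = data["main"][locale].get("contextTransforms", {})
    return usages.get(usage, {}).get(context, "no-change")


def apply_transform(name, transform):
    """Apply a CLDR-style context transform to a display name."""
    if transform == "titlecase-firstword":
        return name[:1].upper() + name[1:]
    return name


transform = context_transform(SAMPLE_TRANSFORMS, "it", "languages", "stand-alone")
print(apply_transform("italiano", transform))
```

With a lookup like this, the per-locale capitalization decision would come from CLDR itself instead of a hand-maintained table, provided the data covers the usages needed.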

This is basically obsoleted by Intl.DisplayNames

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX