Closed Bug 792575 Opened 12 years ago Closed 11 years ago

Merge all Latin-based langGroups (x-western, x-central-euro, x-baltic, tr) into a single langGroup

Categories

(Core :: Internationalization, defect)

RESOLVED DUPLICATE of bug 756022

People

(Reporter: GPHemsley, Assigned: GPHemsley)

Details

(Whiteboard: [bcp47])

Attachments

(1 file, 1 obsolete file)

Per bug 586085 comment 20, x-central-euro does not contain all the characters necessary to display Kashubian text. It should be changed to x-unicode.

Given the difficulty of finding out this sort of information on the Internet (because I did check), I put together this list:

http://gphemsley.github.com/alphabets/Latn.html#lang:csb

To summarize, these are the non-ASCII characters in the Kashubian alphabet (per Yurek):

éëùòóôąãłżń
ÉËÙÒÓÔĄÃŁŻŃ

While certain subsets of them are in various ISO 8859 encodings, no single ISO 8859 encoding has all of them.
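The coverage claim can be checked mechanically. Here is a minimal sketch in Python, assuming the standard library's iso8859_N codecs faithfully reflect the ISO 8859 parts; the helper names (missing_chars, covering_encodings) are made up for illustration and are not part of any Mozilla code.

```python
# Non-ASCII letters of the Kashubian alphabet, as listed above.
KASHUBIAN_NON_ASCII = "éëùòóôąãłżń" + "ÉËÙÒÓÔĄÃŁŻŃ"

# Parts of ISO 8859 that exist as Python codecs (there is no ISO 8859-12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def missing_chars(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)


def covering_encodings(text):
    """Return the ISO 8859 codec names that can represent all of `text`."""
    names = [f"iso8859_{n}" for n in ISO_8859_PARTS]
    return [name for name in names if not missing_chars(text, name)]


if __name__ == "__main__":
    for n in ISO_8859_PARTS:
        name = f"iso8859_{n}"
        gaps = missing_chars(KASHUBIAN_NON_ASCII, name) or "(none)"
        print(f"{name}: missing {gaps}")
    print("full coverage:", covering_encodings(KASHUBIAN_NON_ASCII))
```

Running it lists, per encoding, which Kashubian letters are absent, so it is easy to see how close each part comes.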
Here's a patch that changes 'csb' from 'x-central-euro' to 'x-unicode'.

However, I notice that 'x-unicode' is actually the default value if none is specified, and 'csb' and 'haw' (also added in bug 586085) are the only ones in the file that list 'x-unicode' explicitly.

Thus, I wonder the following:

(1) Should both 'csb' and 'haw' simply be removed from the list instead?
(2) Should this file have been left alone, leaving all new languages to default to 'x-unicode'?
Assignee: nobody → gphemsley
Status: NEW → ASSIGNED
Attachment #662688 - Flags: review?(smontagu)
Does the current setting actually cause any problems? ISTM that it's more useful (from the point of view of font preferences) to group Kashubian with Central European languages than dump it into the catch-all "Other" (x-unicode).
(In reply to Jonathan Kew (:jfkthame) from comment #2)
> Does the current setting actually cause any problems? ISTM that it's more
> useful (from the point of view of font preferences) to group Kashubian with
> Central European languages than dump it into the catch-all "Other"
> (x-unicode).

Well, I'm increasingly unclear about what it is that langGroup is used for in the first place. Is it related to encoding at all? Or is it just an arbitrary way to group formatting preferences?
That's the function that comes to mind, at least. I think all that should be ripped out and re-done in a Unicode- and BCP-47-based way, now that legacy 8-bit charsets are no longer at the heart of text processing (see bug 556237), but in the meantime, I think grouping extended-Latin languages with other "generally similar" extended-Latin languages, regardless of precise character inventory, makes more sense than putting them in the x-unicode bucket.
(In reply to Jonathan Kew (:jfkthame) from comment #4)
> That's the function that comes to mind, at least. I think all that should be
> ripped out and re-done in a Unicode- and BCP-47-based way, now that legacy
> 8-bit charsets are no longer at the heart of text processing (see bug
> 556237), but in the meantime, I think grouping extended-Latin languages with
> other "generally similar" extended-Latin languages, regardless of precise
> character inventory, makes more sense than putting them in the x-unicode
> bucket.

In that case, perhaps it is Hawaiian that needs to change. But what group should it be in? (It uses letters with macrons.)

And what is the distinction between x-western and x-central-euro, if they're both Latin-based? Should we begin by consolidating the two groups (or however many Latin-based ones there are)?
(In reply to Gordon P. Hemsley [:gphemsley] from comment #5)
> And what is the distinction between x-western and x-central-euro, if they're
> both Latin-based? Should we begin by consolidating the two groups (or
> however many Latin-based ones there are)?

Yes, this is the question I was trying to ask (though plagued by intermittent connectivity this evening): bug 556237 is a large project which needs extensive infrastructure work and UI work, but what about an interim fix limited to consolidating all of x-western, x-central-euro, x-baltic... into x-Latn?
To me, it'd make more sense to classify Hawaiian as "Western" than as "Central European". I'd guess that users would normally expect the same font settings to apply to Hawaiian as English. (But maybe it'd be better to ask the Hawaiians.)

As for the distinction: they were originally associated with 8-bit charsets, which made sense in the days when fonts often supported just a single 8-bit charset, so there'd be separate Western European and Central European versions (remember Times CE, etc.?). So they enabled us to choose suitable generic fonts on the basis of what charset the page used. As Unicode takes over, I think we should now be managing font preferences in terms of Unicode or ISO 15924 scripts, not on the basis of a mish-mash of legacy charsets; but restructuring this will require re-thinking the user interface, too, so it's a non-trivial undertaking.

If we're not ready to tackle that, maybe consolidating Latin-script prefs would be a useful first step, though even that would require UI changes in the Preferences dialog. And if we're going to start messing with that, maybe we should take the bull by the horns and do it "right" (after discussing what "right" really should look like) after all.
Ah, yes, it's all coming into focus now.

So, basically, the x-western, x-central-euro, and x-baltic langGroups are merely remnants of the various ISO 8859 encodings that used to be prevalent.

Taking a look at the MXR coverage of those three langGroups, it looks like it would be pretty trivial to merge them. However, given the prominence of x-western, I wonder if it might be better to first just replace x-central-euro and x-baltic, making 'x-western' mean 'Latin' for a little while, and see what breaks? I imagine there are more things dependent on x-western than the other two.

Either way, I'm in favor of a merger of the three langGroups. The changes are relatively minimal, and I think it's a good way to get the ball rolling.

In the meantime, I've sent an e-mail to Greg Glind about getting some Test Pilot data about the usage of the various language/font/encoding user interfaces, so that we can make well-informed decisions. (It seems we may have to wait ~5 weeks for the study to get underway, though; I'll await his response for more details.)

What's the best place to have further in-depth discussion of this stuff? dev.i18n?

And what should we do about Kashubian and Hawaiian while we figure the rest of this out? Should I draw up a patch changing Hawaiian to x-western?
I think the most important question is whether there are any platforms where we have (intentionally) different default fonts for x-western, x-central-euro, x-baltic, and tr, which are all Latin-script.  (I think it makes sense to include all 4 in the merger.)  Figuring that out just requires trawling through all.js.

(Then at some point maybe we could consider switching to script codes instead of "lang groups" for the font preferences, though that's a lot more work.)
/font\..+?\.(x-western|x-central-euro|x-baltic|tr)/i

From what I can tell, the following are the same for all on Windows:

* font.minimum-size
* font.name.serif
* font.name.sans-serif
* font.name.monospace
* font.name.cursive
* font.default
* font.size.variable
* font.size.fixed

And on Mac:

* font.name.serif
* font.name.sans-serif
* font.name.monospace
* font.name.cursive
* font.name.fantasy
* font.name-list.cursive
* font.name-list.fantasy
* font.default
* font.size.variable
* font.size.fixed

And on OS/2:

* font.name.serif
* font.name.sans-serif
* font.name.monospace
* font.default
* font.size.variable
* font.size.fixed

And on Android:

* font.name.serif
* font.name.sans-serif
* font.name.monospace
* font.name-list.sans-serif
* font.default
* font.size.variable
* font.size.fixed

And on Unix:

* font.name.serif
* font.name.sans-serif
* font.name.monospace
* font.default
* font.size.variable
* font.size.fixed

The following have different values for x-western on Mac, but I don't see why those values can't be shared with the other three langGroups in question:

* font.name-list.serif
* font.name-list.sans-serif
* font.name-list.monospace

===

So there doesn't appear to be any conscious effort to differentiate between the four langGroups. (In fact, this evidence even suggests that the separation causes changes made to x-western to not propagate to the other three, simply due to human error/forgetfulness.)

(This file is confusing because of platform ifdefs; it's not immediately clear which preferences apply to single platforms and which are cross-platform/global.)
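For anyone who wants to repeat this comparison, here is a rough sketch in Python of how the trawl could be automated. It assumes prefs are written as pref("name", value); lines, and it deliberately ignores the platform #ifdef blocks, so a pref defined under several platforms simply keeps its last value. The sample text and font values are invented for illustration and are not copied from all.js; the function names are hypothetical too.

```python
import re
from collections import defaultdict

LATIN_GROUPS = ("x-western", "x-central-euro", "x-baltic", "tr")

# Based on the search pattern quoted above, extended to capture the pref's
# base name, its langGroup suffix, and its value.
PREF_RE = re.compile(
    r'pref\(\s*"(font\.[^"]+?)\.(x-western|x-central-euro|x-baltic|tr)"'
    r'\s*,\s*(.+?)\s*\)\s*;',
    re.IGNORECASE,
)


def collect(prefs_text):
    """Map each pref base name to {langGroup: raw value}."""
    table = defaultdict(dict)
    for base, group, value in PREF_RE.findall(prefs_text):
        table[base][group.lower()] = value
    return table


def differing(table):
    """Return prefs whose value is not identical across all four groups.

    A group with no entry counts as a difference (its value is None).
    """
    result = {}
    for base, by_group in table.items():
        values = {by_group.get(group) for group in LATIN_GROUPS}
        if len(values) > 1:
            result[base] = by_group
    return result


# Invented sample input, NOT real all.js content.
SAMPLE = '''
pref("font.name.serif.x-western", "Times New Roman");
pref("font.name.serif.x-central-euro", "Times New Roman");
pref("font.name.serif.x-baltic", "Times New Roman");
pref("font.name.serif.tr", "Times New Roman");
pref("font.name-list.serif.x-western", "Times, Times New Roman");
pref("font.name-list.serif.x-central-euro", "Times");
'''

if __name__ == "__main__":
    for base, by_group in differing(collect(SAMPLE)).items():
        print(base, by_group)
```

To get per-platform answers, the text would first have to be split on the ifdef blocks and each piece passed to collect() separately; that part is left out here.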
Incidentally, I did a trial run of removing references to x-baltic and/or converting them to x-western, and there doesn't appear to be any immediate fallout (so far):

https://tbpl.mozilla.org/?tree=Try&rev=6a0a1536dcfc
https://hg.mozilla.org/try/rev/c291a8ea4dd4

Of course, there are sections where my removal was a little hackish, because some of the graphics code requires things to be listed in a certain order to maintain indexes, and I didn't want to bother renumbering the indexes.

Also, there's a section of code that has some resemblance to actual Unicode script/blocks support, and I didn't know what to do with that (so I left it).
This may be related to bug 192636, but I'm not sure.
Summary: Change Kashubian [csb] langGroup to x-unicode → Merge all Latin-based langGroups (x-western, x-central-euro, x-baltic, tr) into a single langGroup
Attachment #662688 - Attachment is obsolete: true
Attachment #662688 - Flags: review?(smontagu)
No longer depends on: 586085
Status: ASSIGNED → NEW
Whiteboard: [bcp47]
Blocks: 556237
FYI, the patch from bug 793249 was checked in under this bug number.
https://hg.mozilla.org/integration/mozilla-inbound/rev/18e98f1040c2
So, do we agree that consolidating x-central-euro, x-baltic, and tr into x-western is a good first step, or do we want to take a different approach?
I haven't seen anyone disagree, or propose any different approach (given that you have answered dbaron's question from comment 9). I say go for it.
It's a good step, I think, but there's an additional aspect to consider: if we do this, I think we need to change the term used in the Prefs dialog (Content / Fonts&Colors / Advanced...) from "Western" to "Latin" or something like that. Probably should try to get UX feedback on what's the appropriate terminology to use there.

Maybe "Latin Script" would be best, although the other entries in the menu don't explicitly say "Script"; many of them could be either language or script names, as they're the name of both the script and the (only major) language that uses it. The two clear exceptions to this are "Cyrillic" and "Unified Canadian Syllabary", which are unambiguously scripts (not languages).

Should we consider changing "x-western" (i.e. the internal langGroup label we're using) to "x-latin" as part of this process, so you'd actually be consolidating all the Latin-script langGroups into this new code rather than folding them into the existing "x-western"?
(In reply to Jonathan Kew (:jfkthame) from comment #16)
> It's a good step, I think, but there's an additional aspect to consider: if we
> do this, I think we need to change the term used in the Prefs dialog
> (Content / Fonts&Colors / Advanced...) from "Western" to "Latin" or
> something like that. Probably should try to get UX feedback on what's the
> appropriate terminology to use there.

Well, I thought the idea was to make as little a splash as possible when making this change (which is why I suggested consolidating in x-western instead of doing something new), but perhaps that is not the right approach.

> Maybe "Latin Script" would be best, although the other entries in the menu
> don't explicitly say "Script"; many of them could be either language or
> script names, as they're the name of both the script and the (only major)
> language that uses it. The two clear exceptions to this are "Cyrillic" and
> "Unified Canadian Syllabary", which are unambiguously scripts (not
> languages).

Given that most of the other entries can be considered scripts (though not unambiguously), I think we'd be OK to just silently transition the contents of that menu from langGroups (or whatever it is now) to scripts. By that notion, "Latin" would be fine on its own.

We're still waiting on the data from Test Pilot, but I expect that the good majority of users never even see this menu, let alone use it.

> Should we consider changing "x-western" (i.e. the internal langGroup label
> we're using) to "x-latin" as part of this process, so you'd actually be
> consolidating all the Latin-script langGroups into this new code rather than
> folding them into the existing "x-western"?

I'm thinking the work should be done in stages, to make sure nothing bad happens. I suppose the rename could be one of the stages.

But why not jump straight to using the Script codes at this point? Why not use "Latn" (or "latn", if you prefer)?
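To make the staged approach concrete, here is a small sketch in Python of the kind of aliasing the consolidation amounts to. It is not how Gecko implements langGroups. The function names are hypothetical, and the merged label is a parameter precisely because the thread has not settled between "x-western", "x-latin", and "Latn".

```python
# The four Latin-script langGroups proposed for merging in this bug.
LEGACY_LATIN_GROUPS = frozenset({"x-western", "x-central-euro", "x-baltic", "tr"})

# Assumption for illustration: stage one folds everything into x-western.
# A later stage could pass merged="x-latin" or merged="Latn" instead.
MERGED_GROUP = "x-western"


def canonical_lang_group(lang_group, merged=MERGED_GROUP):
    """Map a legacy Latin-script langGroup to the merged group."""
    if lang_group.lower() in LEGACY_LATIN_GROUPS:
        return merged
    return lang_group


def canonical_pref_name(pref_name, merged=MERGED_GROUP):
    """Rewrite the langGroup suffix of a font pref, e.g. font.name.serif.tr."""
    base, sep, group = pref_name.rpartition(".")
    if not sep:
        return pref_name  # no dot, so there is no langGroup suffix to rewrite
    return base + sep + canonical_lang_group(group, merged)


if __name__ == "__main__":
    for name in ("font.name.serif.x-baltic", "font.size.fixed.tr",
                 "font.name.serif.x-cyrillic"):
        print(name, "->", canonical_pref_name(name))
```

Keeping the merged label as a parameter means the rename stage would only change one value, while the set of legacy names being folded stays the same.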
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
No longer blocks: 556237