Closed Bug 1014639 Opened 5 years ago Closed 5 years ago

text-transform: uppercase CSS directive is broken for Irish language text

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla32

People

(Reporter: kscanne, Assigned: jfkthame)

References

Details

(Keywords: dev-doc-needed)

Attachments

(5 files)

Irish has unusual rules for writing text in all uppercase.  In short, certain prefixed letters should remain in lowercase or be rendered as smallcaps.

I've provided further explanation and references in bug 830002 comment 17, and several examples in bug 854756.  I'll try and give a complete set of rules in a followup comment.
Here's the algorithm:

Turn every letter to uppercase in the usual way, but with certain exceptions that I'll phrase as follows.   "If the sequence in the first column appears at the beginning of a word, then that sequence should be replaced by the letters in the second column when uppercasing".  For brevity I'll write {v} for any lowercase vowel (a,á,e,é,i,í,o,ó,u,ú) and {V} for the corresponding uppercase vowels: (A,Á,E,É,I,Í,O,Ó,U,Ú).


bhf -> bhF
bhF -> bhF
bp  -> bP
bP  -> bP
dt  -> dT
dT  -> dT
gc  -> gC
gC  -> gC
h{V}  -> h{V}
mb  -> mB
mB  -> mB
n-{v} -> n{V}
n{V} -> n{V}
nd  -> nD
nD  -> nD
ng  -> nG
nG  -> nG
t-{v} -> t{V}
t{V} -> t{V}
tsl  -> tSL
tSl  -> tSL
tsn  -> tSN
tSn  -> tSN
tsr  -> tSR
tSr  -> tSR
tS{v} -> tS{V}
tS{V} -> tS{V}
The edge case I mentioned in bug 830002 comment 17 would be, in this notation,

h{v} -> h{V}

But it's not applicable 100% of the time so it's better to leave it out.    It works here:

muintir na háite -> MUINTIR NA hÁITE

but not here:

muintir na hataí -> MUINTIR NA HATAÍ

One would need a dictionary to get it right always.  

With the one rule h{V} -> h{V} I've given, we'll be fine in the typical cases where the text to be uppercased already looks like "Muintir na hÁite" or "Muintir na Hataí".
100+ test cases, covering each of the proposed rules and several negative examples as well.  When "A" is input, the corresponding "B" should be output.  And in fact, it would be worth testing that if a "B" is input, it isn't changed.
I forgot a few:

tSL  -> tSL
tSN  -> tSN
tSR  -> tSR
ts{v} -> tS{V}
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Comment on attachment 8430071 [details] [diff] [review]
part 0 - (preliminary cleanup) split GreekCasing out into its own file to reduce clutter in nsUnicharUtils.cpp.

Review of attachment 8430071 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/unicharutil/util/objs.mozbuild
@@ +7,5 @@
>  intl_unicharutil_util_lcppsrcs = []
>  
> +intl_unicharutil_util_lcppsrcs += [
> +    'GreekCasing.cpp',
> +]

I don't think this needs to be in its own block
Attachment #8430071 - Flags: review?(smontagu) → review+
Attachment #8430077 - Flags: review?(smontagu) → review+
Another edge case: suppose that for some reason the input had an n or t followed by a hyphen and then an uppercase vowel, e.g. Ceol Na n-Éan. With the current rules that would become CEOL NA NÉAN. Wouldn't it make more sense to make it CEOL NA nÉAN, or am I totally barking up the wrong tree?
Flags: needinfo?(kscanne)
Also, could deleting those hyphens when uppercasing cause issues for lowercasing?
Comment on attachment 8430073 [details] [diff] [review]
part 1 - implement IrishCasing to encapsulate the special Irish uppercasing rules.

Review of attachment 8430073 [details] [diff] [review]:
-----------------------------------------------------------------

The code looks good, modulo the questions above
Attachment #8430073 - Flags: review?(smontagu) → review+
Comment on attachment 8430074 [details] [diff] [review]
part 2 - use IrishCasing in nsCaseTransformTextRunFactory::TransformString to provide correct Irish uppercasing.

Review of attachment 8430074 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/generic/nsTextRunTransformations.cpp
@@ +233,1 @@
>  };

Nit: this list is getting long enough that it would be nice if it was alphabetized
Attachment #8430074 - Flags: review?(smontagu) → review+
(In reply to Simon Montagu :smontagu from comment #11)
> Another edge case: suppose that for some reason the input had an n or t
> followed by a hyphen and then an uppercase vowel, e.g. Ceol Na n-Éan. With
> the current rules that would become CEOL NA NÉAN. Wouldn't it make more
> sense to make it CEOL NA nÉAN, or am I totally barking up the wrong tree?

Hi Simon, you're definitely not barking up the wrong tree - this is an important point that I put some thought into.  In short, I think the rules are ok as they are.

One *does* sometimes see a hyphen included in contexts like this, but almost always in old texts before the orthography was standardized (in the 1950s), and so I decided not to include these cases among the rules.  Being conservative in this way avoids some "false positives" - for example, looking in a big corpus of Irish texts from the web, many of the cases with a hyphen are Scottish placenames (Bun na h-Abhainn, etc.) where the convention would be to preserve the hyphen.   And even in older texts, there are examples like "ar n-a gclárú" which should in fact uppercase to "AR N-A GCLÁRÚ".  

And incidentally, the way I've written the rules, Ceol Na n-Éan should uppercase to CEOL NA N-ÉAN (i.e. the usual uppercasing, no deletion of the hyphen since none of "my" rules apply) - this is better than CEOL NA NÉAN.
Flags: needinfo?(kscanne)
(In reply to Simon Montagu :smontagu from comment #12)
> Also, could deleting those hyphens when uppercasing cause issues for
> lowercasing?

Yes.  When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.

Ár nAthair -> ár n-athair
An tAsal -> an t-asal
(In reply to Kevin Scannell from comment #16)
> (In reply to Simon Montagu :smontagu from comment #12)
> > Also, could deleting those hyphens when uppercasing cause issues for
> > lowercasing?
> 
> Yes.  When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.
> 
> Ár nAthair -> ár n-athair
> An tAsal -> an t-asal

Ah....ok, could you file a new bug for the lowercase transform, please? That'll be an additional patch, so it's helpful if we can track each of them separately. Bonus points for including testcases there, too. :) Thanks!
(In reply to Simon Montagu :smontagu from comment #14)
> Comment on attachment 8430074 [details] [diff] [review]

> >  };
> 
> Nit: this list is getting long enough that it would be nice if it was
> alphabetized

Wow, I just noticed this review tool fail: this comment was on the following block:

 enum LanguageSpecificCasingBehavior {
   eLSCB_None,    // default non-lang-specific behavior
   eLSCB_Turkish, // preserve dotted/dotless-i distinction in uppercase
   eLSCB_Dutch,   // treat "ij" digraph as a unit for capitalization
-  eLSCB_Greek    // strip accent when uppercasing Greek vowels
+  eLSCB_Greek,   // strip accent when uppercasing Greek vowels
+  eLSCB_Irish    // keep prefix letters as lowercase when uppercasing Irish
 };
(In reply to Jonathan Kew (:jfkthame) from comment #17)
> (In reply to Kevin Scannell from comment #16)
> > (In reply to Simon Montagu :smontagu from comment #12)
> > > Also, could deleting those hyphens when uppercasing cause issues for
> > > lowercasing?
> > 
> > Yes.  When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.
> > 
> > Ár nAthair -> ár n-athair
> > An tAsal -> an t-asal
> 
> Ah....ok, could you file a new bug for the lowercase transform, please?
> That'll be an additional patch, so it's helpful if we can track each of them
> separately. Bonus points for including testcases there, too. :) Thanks!

Filed bug 1018805 - with lots of tests!
Flags: in-testsuite+
Depends on: CVE-2016-5270
I just noticed that https://developer.mozilla.org/en-US/docs/Web/CSS/text-transform discusses a bunch of language-specific variations in  text-transform behavior, but does not describe the Irish (Gaelic) support that was implemented here (for uppercase) and in bug 1018805 (lowercase).
Keywords: dev-doc-needed
I just added a bullet point about the special behavior for Irish on that page (there was also some stale text, with an example from Scottish Gaelic, not Irish, that I removed).
You need to log in before you can comment on or make changes to this bug.