Closed
Bug 1014639
Opened 11 years ago
Closed 11 years ago
text-transform: uppercase CSS directive is broken for Irish language text
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla32
People
(Reporter: kscanne, Assigned: jfkthame)
References
Details
(Keywords: dev-doc-needed)
Attachments
(5 files)
4.69 KB,
text/plain
|
Details | |
26.01 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
13.62 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
8.31 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
6.75 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
Irish has unusual rules for writing text in all uppercase. In short, certain prefixed letters should remain in lowercase or be rendered as smallcaps.
I've provided further explanation and references in bug 830002 comment 17, and several examples in bug 854756. I'll try and give a complete set of rules in a followup comment.
Reporter | ||
Comment 1•11 years ago
|
||
Here's the algorithm:
Turn every letter to uppercase in the usual way, but with certain exceptions that I'll phrase as follows. "If the sequence in the first column appears at the beginning of a word, then that sequence should be replaced by the letters in the second column when uppercasing". For brevity I'll write {v} for any lowercase vowel (a,á,e,é,i,í,o,ó,u,ú) and {V} for the corresponding uppercase vowels: (A,Á,E,É,I,Í,O,Ó,U,Ú).
bhf -> bhF
bhF -> bhF
bp -> bP
bP -> bP
dt -> dT
dT -> dT
gc -> gC
gC -> gC
h{V} -> h{V}
mb -> mB
mB -> mB
n-{v} -> n{V}
n{V} -> n{V}
nd -> nD
nD -> nD
ng -> nG
nG -> nG
t-{v} -> t{V}
t{V} -> t{V}
tsl -> tSL
tSl -> tSL
tsn -> tSN
tSn -> tSN
tsr -> tSR
tSr -> tSR
tS{v} -> tS{V}
tS{V} -> tS{V}
Reporter | ||
Comment 2•11 years ago
|
||
The edge case I mentioned in bug 830002 comment 17 would be, in this notation,
h{v} -> h{V}
But it's not applicable 100% of the time so it's better to leave it out. It works here:
muintir na háite -> MUINTIR NA hÁITE
but not here:
muintir na hataí -> MUINTIR NA HATAÍ
One would need a dictionary to get it right always.
With the one rule h{V} -> h{V} I've given, we'll be fine in the typical cases where the text to be uppercased already looks like "Muintir na hÁite" or "Muintir na Hataí".
Reporter | ||
Comment 3•11 years ago
|
||
100+ test cases, covering each of the proposed rules and several negative examples as well. When "A" is input, the corresponding "B" should be output. And in fact, it would be worth testing that if a "B" is input, it isn't changed.
Reporter | ||
Comment 4•11 years ago
|
||
I forgot a few:
tSL -> tSL
tSN -> tSN
tSR -> tSR
ts{v} -> tS{V}
Assignee | ||
Comment 5•11 years ago
|
||
Attachment #8430071 -
Flags: review?(smontagu)
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Assignee | ||
Comment 6•11 years ago
|
||
Attachment #8430073 -
Flags: review?(smontagu)
Assignee | ||
Comment 7•11 years ago
|
||
Attachment #8430074 -
Flags: review?(smontagu)
Assignee | ||
Comment 8•11 years ago
|
||
Attachment #8430077 -
Flags: review?(smontagu)
Assignee | ||
Comment 9•11 years ago
|
||
Tryserver run:
https://tbpl.mozilla.org/?tree=Try&rev=96a27f0b9764
Comment 10•11 years ago
|
||
Comment on attachment 8430071 [details] [diff] [review]
part 0 - (preliminary cleanup) split GreekCasing out into its own file to reduce clutter in nsUnicharUtils.cpp.
Review of attachment 8430071 [details] [diff] [review]:
-----------------------------------------------------------------
::: intl/unicharutil/util/objs.mozbuild
@@ +7,5 @@
> intl_unicharutil_util_lcppsrcs = []
>
> +intl_unicharutil_util_lcppsrcs += [
> + 'GreekCasing.cpp',
> +]
I don't think this needs to be in its own block
Attachment #8430071 -
Flags: review?(smontagu) → review+
Updated•11 years ago
|
Attachment #8430077 -
Flags: review?(smontagu) → review+
Comment 11•11 years ago
|
||
Another edge case: suppose that for some reason the input had an n or t followed by a hyphen and then an uppercase vowel, e.g. Ceol Na n-Éan. With the current rules that would become CEOL NA NÉAN. Wouldn't it make more sense to make it CEOL NA nÉAN, or am I totally barking up the wrong tree?
Flags: needinfo?(kscanne)
Comment 12•11 years ago
|
||
Also, could deleting those hyphens when uppercasing cause issues for lowercasing?
Comment 13•11 years ago
|
||
Comment on attachment 8430073 [details] [diff] [review]
part 1 - implement IrishCasing to encapsulate the special Irish uppercasing rules.
Review of attachment 8430073 [details] [diff] [review]:
-----------------------------------------------------------------
The code looks good, modulo the questions above
Attachment #8430073 -
Flags: review?(smontagu) → review+
Comment 14•11 years ago
|
||
Comment on attachment 8430074 [details] [diff] [review]
part 2 - use IrishCasing in nsCaseTransformTextRunFactory::TransformString to provide correct Irish uppercasing.
Review of attachment 8430074 [details] [diff] [review]:
-----------------------------------------------------------------
::: layout/generic/nsTextRunTransformations.cpp
@@ +233,1 @@
> };
Nit: this list is getting long enough that it would be nice if it was alphabetized
Attachment #8430074 -
Flags: review?(smontagu) → review+
Reporter | ||
Comment 15•11 years ago
|
||
(In reply to Simon Montagu :smontagu from comment #11)
> Another edge case: suppose that for some reason the input had an n or t
> followed by a hyphen and then an uppercase vowel, e.g. Ceol Na n-Éan. With
> the current rules that would become CEOL NA NÉAN. Wouldn't it make more
> sense to make it CEOL NA nÉAN, or am I totally barking up the wrong tree?
Hi Simon, you're definitely not barking up the wrong tree - this is an important point that I put some thought into. In short, I think the rules are ok as they are.
One *does* sometimes see a hyphen included in contexts like this, but almost always in old texts before the orthography was standardized (in the 1950s), and so I decided not to include these cases among the rules. Being conservative in this way avoids some "false positives" - for example, looking in a big corpus of Irish texts from the web, many of the cases with a hyphen are Scottish placenames (Bun na h-Abhainn, etc.) where the convention would be to preserve the hyphen. And even in older texts, there are examples like "ar n-a gclárú" which should in fact uppercase to "AR N-A GCLÁRÚ".
And incidentally, the way I've written the rules, Ceol Na n-Éan should uppercase to CEOL NA N-ÉAN (i.e. the usual uppercasing, no deletion of the hyphen since none of "my" rules apply) - this is better than CEOL NA NÉAN.
Flags: needinfo?(kscanne)
Reporter | ||
Comment 16•11 years ago
|
||
(In reply to Simon Montagu :smontagu from comment #12)
> Also, could deleting those hyphens when uppercasing cause issues for
> lowercasing?
Yes. When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.
Ár nAthair -> ár n-athair
An tAsal -> an t-asal
Assignee | ||
Comment 17•11 years ago
|
||
(In reply to Kevin Scannell from comment #16)
> (In reply to Simon Montagu :smontagu from comment #12)
> > Also, could deleting those hyphens when uppercasing cause issues for
> > lowercasing?
>
> Yes. When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.
>
> Ár nAthair -> ár n-athair
> An tAsal -> an t-asal
Ah....ok, could you file a new bug for the lowercase transform, please? That'll be an additional patch, so it's helpful if we can track each of them separately. Bonus points for including testcases there, too. :) Thanks!
Comment 18•11 years ago
|
||
(In reply to Simon Montagu :smontagu from comment #14)
> Comment on attachment 8430074 [details] [diff] [review]
> > };
>
> Nit: this list is getting long enough that it would be nice if it was
> alphabetized
Wow, I just noticed this review tool fail: this comment was on the following block:
enum LanguageSpecificCasingBehavior {
eLSCB_None, // default non-lang-specific behavior
eLSCB_Turkish, // preserve dotted/dotless-i distinction in uppercase
eLSCB_Dutch, // treat "ij" digraph as a unit for capitalization
- eLSCB_Greek // strip accent when uppercasing Greek vowels
+ eLSCB_Greek, // strip accent when uppercasing Greek vowels
+ eLSCB_Irish // keep prefix letters as lowercase when uppercasing Irish
};
Assignee | ||
Comment 19•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/672cf66438e2
https://hg.mozilla.org/integration/mozilla-inbound/rev/37d09834dd1e
https://hg.mozilla.org/integration/mozilla-inbound/rev/09a014d5834e
https://hg.mozilla.org/integration/mozilla-inbound/rev/edeea8eaa741
OS: Mac OS X → All
Hardware: x86 → All
Target Milestone: --- → mozilla32
Reporter | ||
Comment 20•11 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #17)
> (In reply to Kevin Scannell from comment #16)
> > (In reply to Simon Montagu :smontagu from comment #12)
> > > Also, could deleting those hyphens when uppercasing cause issues for
> > > lowercasing?
> >
> > Yes. When lowercasing n{V} or t{V}, a hyphen needs to be reinserted.
> >
> > Ár nAthair -> ár n-athair
> > An tAsal -> an t-asal
>
> Ah....ok, could you file a new bug for the lowercase transform, please?
> That'll be an additional patch, so it's helpful if we can track each of them
> separately. Bonus points for including testcases there, too. :) Thanks!
Filed bug 1018805 - with lots of tests!
Comment 21•11 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/672cf66438e2
https://hg.mozilla.org/mozilla-central/rev/37d09834dd1e
https://hg.mozilla.org/mozilla-central/rev/09a014d5834e
https://hg.mozilla.org/mozilla-central/rev/edeea8eaa741
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Flags: in-testsuite+
Assignee | ||
Updated•8 years ago
|
Depends on: CVE-2016-5270
Assignee | ||
Comment 22•8 years ago
|
||
I just noticed that https://developer.mozilla.org/en-US/docs/Web/CSS/text-transform discusses a bunch of language-specific variations in text-transform behavior, but does not describe the Irish (Gaelic) support that was implemented here (for uppercase) and in bug 1018805 (lowercase).
Keywords: dev-doc-needed
Reporter | ||
Comment 23•8 years ago
|
||
I just added a bullet point about the special behavior for Irish on that page (there was also some stale text, with an example from Scottish Gaelic, not Irish, that I removed).
You need to log in
before you can comment on or make changes to this bug.
Description
•