Bug 740477 - The dutch IJ digraph is not handled correctly by text-transform:capitalize
Status: RESOLVED FIXED
Keywords: dev-doc-complete
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: x86 Mac OS X
Importance: -- minor
Target Milestone: mozilla14
Assigned To: Jonathan Kew (:jfkthame)
http://en.wikipedia.org/wiki/IJ_%28di...
Depends on: 231162
Reported: 2012-03-29 10:42 PDT by Jean-Yves Perrier [:teoli]
Modified: 2012-04-16 17:25 PDT


Attachments
patch, implement Dutch-specific capitalization for "ij" (3.08 KB, patch)
2012-03-29 11:55 PDT, Jonathan Kew (:jfkthame)
smontagu: review-
reftest for Dutch "ij" capitalization (1.64 KB, patch)
2012-03-29 12:13 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
patch v2, implement Dutch-specific capitalization for "ij" (4.27 KB, patch)
2012-03-30 13:33 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
reftest for Dutch "ij" capitalization, v2 (1.74 KB, patch)
2012-03-30 13:34 PDT, Jonathan Kew (:jfkthame)
jfkthame: review+

Description Jean-Yves Perrier [:teoli] 2012-03-29 10:42:52 PDT
The Dutch language considers ij a digraph (see link for reference). This means that when a word starting with ij is capitalized, both letters are capitalized.

ijsland -> IJsland

The non-standard behaviour only affects text-transform: capitalize, not lowercase or uppercase. There are also a few words in Dutch where ij is not a digraph, but as far as I know the ij is never at the beginning of these words.
Comment 1 Jonathan Kew (:jfkthame) 2012-03-29 10:55:04 PDT
Note that you'll get the desired behavior if such words are spelled using the Unicode character U+0133 LATIN SMALL LIGATURE IJ, rather than the separate characters "i" and "j".

Compare:
data:text/html;charset=utf-8,<div style="text-transform: capitalize">ijsland
data:text/html;charset=utf-8,<div style="text-transform: capitalize">ĳsland

However, perhaps we should consider special-case handling for the sequence "ij" in content that is specifically tagged as lang="nl".
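
For illustration, the difference between the two spellings can be reproduced outside the browser. This is only a sketch using Python's built-in Unicode case mappings, not Gecko's code; the helper name capitalize_first is made up for the example. It shows that U+0133 has a single-character uppercase form, while the two-letter sequence only gets its "i" capitalized.

```python
# U+0133 LATIN SMALL LIGATURE IJ is a single character, so upper-casing the
# first character of the word upper-cases the whole digraph.
LIGATURE_WORD = "\u0133sland"  # spelled with the ligature character
PLAIN_WORD = "ijsland"         # spelled with separate "i" and "j"


def capitalize_first(word: str) -> str:
    """Upper-case only the first character, leaving the rest untouched.

    Hypothetical helper for this example; it ignores language entirely.
    """
    return word[:1].upper() + word[1:]


ligature_result = capitalize_first(LIGATURE_WORD)
plain_result = capitalize_first(PLAIN_WORD)

print(ligature_result)
print(plain_result)
```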
Comment 2 Jean-Yves Perrier [:teoli] 2012-03-29 11:15:13 PDT
Yes, the Unicode character does work, but its use is discouraged by Unicode (it is mainly there for legacy purposes): see Unicode 6.1, Ch3 D66 Compatibility decomposable character. Anyway, nobody uses it, as it is not on the Dutch keyboard layout (http://www.goodtyping.com/teclatDUT.htm).

I don't think that the Dutch-related Flemish languages have their own language codes, so "nl" alone should be OK.
Comment 3 Jonathan Kew (:jfkthame) 2012-03-29 11:55:44 PDT
Created attachment 610631
patch, implement Dutch-specific capitalization for "ij"

This implements the requested behavior for elements where lang="nl".

Note that it only applies the digraph-specific behavior (capitalizing the "j" as well) if both "i" and "j" were originally lowercase; thus, "ijsland" -> "IJsland", but "Ijsland" is unchanged by text-transform:capitalize, on the assumption that if it was already entered with mixed case in the "Ij" pair, this was a deliberate choice.
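
The rule described above could be sketched roughly as follows. This is not the patch itself, just a minimal single-word illustration; the function name capitalize_word and its lang parameter are invented here, and matching "nl-" subtags as well as plain "nl" is an assumption rather than something stated in this bug.

```python
def capitalize_word(word: str, lang: str = "") -> str:
    """Capitalize the first letter of a single word.

    For Dutch content, a leading lowercase "ij" is treated as a digraph and
    both letters are capitalized. Mixed-case "Ij" is left as the author wrote it.
    """
    if not word:
        return word
    tag = lang.lower()
    # Assumption: treat "nl" and "nl-*" (e.g. "nl-BE") as Dutch.
    is_dutch = tag == "nl" or tag.startswith("nl-")
    if is_dutch and word.startswith("ij"):
        return "IJ" + word[2:]
    return word[0].upper() + word[1:]


for sample, lang in [("ijsland", "nl"), ("Ijsland", "nl"), ("ijsland", "en")]:
    print(sample, lang, "->", capitalize_word(sample, lang))
```

Because the digraph check is case-sensitive, input that already starts with "Ij" falls through to the ordinary path and comes out unchanged.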
Comment 4 Jonathan Kew (:jfkthame) 2012-03-29 12:13:02 PDT
Created attachment 610637
reftest for Dutch "ij" capitalization
Comment 5 Simon Montagu :smontagu 2012-03-30 10:01:44 PDT
Comment on attachment 610631
patch, implement Dutch-specific capitalization for "ij"

Review of attachment 610631:
-----------------------------------------------------------------

This has a bug: when the "j" isn't adjacent to the "i", it still gets capitalized.

Maybe also, instead of adding another boolean dutchCasing (and in later bugs adding greekCasing, lithuanianCasing and I don't know what all else), have an enum of languages and a languageSpecificCasing variable (or some shorter name)? There will only ever be one applicable language, unless I am very much mistaken.
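
To make the suggestion concrete, here is a rough sketch of the enum idea in Python rather than the C++ of the actual patch. The names LanguageSpecificCasing and casing_for_lang are hypothetical, and only the Dutch case from this bug is filled in.

```python
from enum import Enum, auto


class LanguageSpecificCasing(Enum):
    """One value per special casing behaviour; at most one applies at a time."""
    NONE = auto()
    DUTCH = auto()
    # Later bugs could add further values here instead of new booleans.


def casing_for_lang(lang: str) -> LanguageSpecificCasing:
    """Pick the casing behaviour from a language tag (hypothetical helper)."""
    primary = lang.lower().split("-")[0]  # "nl-BE" -> "nl"
    if primary == "nl":
        return LanguageSpecificCasing.DUTCH
    return LanguageSpecificCasing.NONE


print(casing_for_lang("nl"))
print(casing_for_lang("en"))
```

A single variable of this type replaces a growing set of mutually exclusive boolean flags, so the casing code can branch on one value.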
Comment 6 Simon Montagu :smontagu 2012-03-30 10:09:07 PDT
Comment on attachment 610637
reftest for Dutch "ij" capitalization

Review of attachment 610637:
-----------------------------------------------------------------

Add a case with non-adjacent i/j to test the bug I mentioned in the previous comment.
Comment 7 Jonathan Kew (:jfkthame) 2012-03-30 13:33:14 PDT
Created attachment 610996
patch v2, implement Dutch-specific capitalization for "ij"

Good catch, thanks. Fixed in this version.
Comment 8 Jonathan Kew (:jfkthame) 2012-03-30 13:34:50 PDT
Created attachment 610998
reftest for Dutch "ij" capitalization, v2

Added a case with "ixj" to the test; carry forward r=smontagu.
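
As an illustration of why the "ixj" case matters, here is a small character-by-character sketch. It is not the code from the patch; capitalize_text and its state variables are invented for the example, and word boundaries are simplified to "any non-letter". The point is that the pending state must be cleared as soon as any other letter follows the "i".

```python
def capitalize_text(text: str, dutch: bool = False) -> str:
    """Capitalize the first letter of each word, with optional Dutch "ij" handling."""
    out = []
    at_word_start = True  # the next letter begins a word
    pending_j = False     # previous char was a word-initial lowercase "i"
    for ch in text:
        if not ch.isalpha():
            # Simplification: every non-letter ends the current word.
            out.append(ch)
            at_word_start = True
            pending_j = False
            continue
        if at_word_start:
            out.append(ch.upper())
            pending_j = dutch and ch == "i"
        elif pending_j and ch == "j":
            out.append("J")  # "j" directly after word-initial "i"
            pending_j = False
        else:
            out.append(ch)
            # Clearing the flag here keeps a later, non-adjacent "j" lowercase.
            pending_j = False
        at_word_start = False
    return "".join(out)


print(capitalize_text("ijsland ixj", dutch=True))
```

If the flag were left set in the final branch, the "j" in "ixj" would be capitalized too, which is the kind of behaviour the extra test case guards against.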
Comment 9 Gordon P. Hemsley [:GPHemsley] 2012-03-30 13:56:14 PDT
Just a (somewhat tangential) thought:
Perhaps it would be better to disentangle the name of the change from the language(s) associated with it?

This is more related to the Turkish transformation than the Dutch one, but it's possible for a transformation to be used by more than one language (as with dotless I). So why not name the transformations after what they do, rather than who uses them? Like capitalizedIJDigraph or capitalizeDotlessI, or eIJDigraph or eDotlessI, or something like that?
Comment 10 Jonathan Kew (:jfkthame) 2012-03-30 14:18:07 PDT
(In reply to Gordon P. Hemsley [:gphemsley] from comment #9)
> Just a (somewhat tangential) thought:
> Perhaps it would be better to disentangle the name of the change from the
> language(s) associated with it?

We could, although I think it's perfectly reasonable to use the name of a well-known exemplar language even though the behavior may be "borrowed" by other languages that have a similar writing system. If we were exposing this to users somehow, it would need to be carefully considered, but here it's just a question of naming a local variable within the code.

(Essentially the same thing happens for scripts, many of which are named after the "primary" language that uses them even if they get adopted for writing other languages as well.)

> This is more related to the Turkish transformation than the Dutch one, but
> it's possible for a transformation to be used by more than one language (as
> with dotless I). So why not name the transformations after what they do,
> rather than who uses them? Like capitalizedIJDigraph or capitalizeDotlessI,
> or eIJDigraph or eDotlessI, or something like that?

Personally, I find it most natural to label the behavior as "Turkish" even though it is used by several other languages; I think they have modeled their writing systems on the Turkish one. But I don't feel particularly strongly about it - Simon, any opinion?
Comment 11 Simon Montagu :smontagu 2012-03-30 14:44:26 PDT
At most I think we might add a comment that these are mnemonic names of exemplar languages that have the behaviour we are implementing. Getting it 100% right and pleasing everybody is an unattainable goal anyway -- we have enough problems already with user-facing names for various regions and languages.
Comment 12 Gordon P. Hemsley [:GPHemsley] 2012-03-30 14:45:11 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #10)
> (In reply to Gordon P. Hemsley [:gphemsley] from comment #9)
> > Just a (somewhat tangential) thought:
> > Perhaps it would be better to disentangle the name of the change from the
> > language(s) associated with it?
> 
> We could, although I think it's perfectly reasonable to use the name of a
> well-known exemplar language even though the behavior may be "borrowed" by
> other languages that have a similar writing system. If we were exposing this
> to users somehow, it would need to be carefully considered, but here it's
> just a question of naming a local variable within the code.

What happens when one language has multiple different transformation requirements, and then another language has one of those but not another? Wouldn't you wind up being in the same position there anyway?

Also, beyond that, having a more descriptive name would help developers who come along in the future who perhaps are not as familiar with the various idiosyncrasies of the language used to name the variable. Or what if a language that is used as a variable name then decides that they are no longer going to use that rule? Then you have a rule that is named after a language that doesn't even use it.

I am also trying to spread the notion of decoupling a language from its writing system or writing conventions. All languages can be written many different ways; it just makes more sense to name a particular convention after the convention itself, rather than any particular language that might be using it at any given time.

> (Essentially the same thing happens for scripts, many of which are named
> after the "primary" language that uses them even if they get adopted for
> writing other languages as well.)

True, but that's probably a different discussion for a different time. ;)
Comment 13 Jonathan Kew (:jfkthame) 2012-03-30 17:41:25 PDT
Pushed to inbound, with added comments for the enum values:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb53aec4a302
https://hg.mozilla.org/integration/mozilla-inbound/rev/324368cce885
Comment 15 Jean-Yves Perrier [:teoli] 2012-04-16 17:25:11 PDT
I've updated https://developer.mozilla.org/en/CSS/text-transform (summary, examples and the browser compatibility table)
and added a note in https://developer.mozilla.org/en/Firefox_14_for_developers.
