Bug 307039 - Greek text not converted correctly to Small-Caps.
Status: RESOLVED FIXED
Keywords: dev-doc-needed
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All All
Importance: -- normal
Target Milestone: mozilla15
Assigned To: Jonathan Kew (:jfkthame)
URL: http://simos.info/blog/
Depends on: 752176
Reported: 2005-09-04 09:18 PDT by Nikos Platis
Modified: 2012-05-07 08:30 PDT

Attachments
Testcase (1.15 KB, text/html)
2005-09-05 02:10 PDT, Nikos Platis
Screenshot of testcase (27.58 KB, image/png)
2005-09-05 08:31 PDT, Adam Guthrie
Working patch (4.56 KB, patch)
2011-09-19 07:00 PDT, Panos Astithas [:past]
Working patch v2 (5.28 KB, patch)
2011-09-30 02:06 PDT, Panos Astithas [:past]
Screenshot of the small-caps bug w/o TransformToUpperCase (133.32 KB, image/png)
2011-10-07 03:31 PDT, Panos Astithas [:past]
Working patch v3 (8.57 KB, patch)
2011-10-07 07:38 PDT, Panos Astithas [:past]
testcase using some decomposed sequences (269 bytes, text/html)
2011-10-07 10:15 PDT, Jonathan Kew (:jfkthame)
testcase with decomposed sequences (459 bytes, text/html)
2011-10-10 04:21 PDT, Panos Astithas [:past]
Working patch v4 (11.88 KB, patch)
2011-10-10 06:41 PDT, Panos Astithas [:past]
Working patch v5 (12.90 KB, patch)
2012-01-15 12:29 PST, Panos Astithas [:past]
dbaron: feedback-
patch, implement Greek-specific uppercasing for text-transform & small-caps (20.56 KB, patch)
2012-04-24 09:54 PDT, Jonathan Kew (:jfkthame)
reftest for Greek uppercasing in composed and decomposed forms (3.91 KB, patch)
2012-04-24 09:55 PDT, Jonathan Kew (:jfkthame)
patch v2, implement Greek-specific uppercasing for text-transform & small-caps (25.78 KB, patch)
2012-04-28 01:38 PDT, Jonathan Kew (:jfkthame)
past: review+
reftest for Greek uppercasing in composed and decomposed forms (3.91 KB, patch)
2012-04-28 01:39 PDT, Jonathan Kew (:jfkthame)
past: review+
reftest for Greek small-caps behavior (2.53 KB, patch)
2012-04-28 03:15 PDT, Jonathan Kew (:jfkthame)
past: review+
reftest for Greek uppercasing in composed and decomposed forms v2 (3.92 KB, patch)
2012-05-02 04:29 PDT, Jonathan Kew (:jfkthame)
jfkthame: review+

Description Nikos Platis 2005-09-04 09:18:53 PDT
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.6

When converting text to Small Caps (via the CSS font-variant directive), Greek
accented letters should be converted to the respective non-accented uppercase
letters. The required conversions are the following (in Unicode):
ά -> Α
έ -> Ε
ή -> Η
ί -> Ι
ΐ -> Ϊ
ό -> Ο
ύ -> Υ
ΰ -> Ϋ
ώ -> Ω
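
As an illustration of the single-letter conversions listed above, here is a minimal C++ sketch. The names GreekUpperChar and GreekUpperString and the use of std::u32string are assumptions made for this example; this is not the Gecko code, and it covers neither the diphthong rule nor decomposed input.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: uppercase one precomposed monotonic Greek letter,
// dropping the accent (tonos) but keeping the diaeresis.
char32_t GreekUpperChar(char32_t ch) {
  static const std::unordered_map<char32_t, char32_t> kAccented = {
      {U'\u03AC', U'\u0391'},  // ά -> Α
      {U'\u03AD', U'\u0395'},  // έ -> Ε
      {U'\u03AE', U'\u0397'},  // ή -> Η
      {U'\u03AF', U'\u0399'},  // ί -> Ι
      {U'\u0390', U'\u03AA'},  // ΐ -> Ϊ
      {U'\u03CC', U'\u039F'},  // ό -> Ο
      {U'\u03CD', U'\u03A5'},  // ύ -> Υ
      {U'\u03B0', U'\u03AB'},  // ΰ -> Ϋ
      {U'\u03CE', U'\u03A9'},  // ώ -> Ω
  };
  auto it = kAccented.find(ch);
  if (it != kAccented.end()) return it->second;
  if (ch == U'\u03C2') return U'\u03A3';  // final sigma ς -> Σ
  // Unaccented α..ω plus ϊ, ϋ (U+03B1..U+03CB) sit 0x20 above their capitals.
  if (ch >= U'\u03B1' && ch <= U'\u03CB') return static_cast<char32_t>(ch - 0x20);
  return ch;  // everything else is left untouched
}

// Apply the per-character mapping to a whole string.
std::u32string GreekUpperString(const std::u32string& in) {
  std::u32string out;
  out.reserve(in.size());
  for (char32_t ch : in) out.push_back(GreekUpperChar(ch));
  return out;
}
```

With this table, "ρήματα" becomes "ΡΗΜΑΤΑ" rather than "ΡΉΜΑΤΑ".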

Also diphthongs (two-vowel constructs) should be converted as follows, when the
first vowel is accented:
άι -> ΑΪ
έι -> ΕΪ
όι -> ΟΪ
ύι -> ΥΪ
άυ -> ΑΫ
έυ -> ΕΫ
ήυ -> ΗΫ
όυ -> ΟΫ

In the current implementation, this is not the case: Greek accented lowercase
letters are merely converted to the respective (accented) uppercase letters.

I should note that when converting some text to "Titling" (first letter of each
word capitalized) then the above does not hold, i.e. accented letters should be
converted to accented uppercase.


Reproducible: Always

Steps to Reproduce:
Comment 1 Adam Guthrie 2005-09-04 09:59:54 PDT
Could you provide a minimized testcase to demonstrate this?
Comment 2 Nikos Platis 2005-09-05 02:10:13 PDT
Created attachment 194893 [details]
Testcase
Comment 3 Adam Guthrie 2005-09-05 08:31:22 PDT
Created attachment 194920 [details]
Screenshot of testcase

If I'm understanding this correctly, you are correct, and we are messing this
up.
Comment 4 Simon Montagu :smontagu 2005-09-05 10:27:47 PDT
(In reply to comment #0)

> When converting text to Small Caps (via the CSS font-variant directive), Greek
> accented letters should be converted to the respective non-accented uppercase
> letters.

Do you have a source for this statement?
Comment 5 Nikos Platis 2005-09-05 13:42:06 PDT
It is a rule of Greek grammar that when writing in all capital letters (and,
consequently, in small-caps) no accents are used. The conversions presented in
the bug report are the required ones due to this fact.

I cannot provide a link to an online Greek grammar, but check, for example,
http://digital.tovima.gr/, the online edition of a Greek newspaper, and notice
that whatever is written in all-caps or small-caps has no accent.
Comment 6 Simon Montagu :smontagu 2005-09-06 01:29:23 PDT
I found a reference at
http://ptolemy.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html

  A general titlecase converter for Greek won't use mappings between Ll   
  (lowercase) and Lt; it will simply convert the first character to uppercase. And
  an uppercase converter will strip out all diacritics but diaeresis, and convert 
  the result to uppercase.
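
The uppercase rule quoted above ("strip out all diacritics but diaeresis, and convert the result to uppercase") could be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to be already decomposed, so that accents are separate combining marks in U+0300..U+036F, and the function names are hypothetical.

```cpp
#include <string>

// Combining Diacritical Marks block; assumes decomposed (NFD-style) input.
bool IsCombiningMark(char32_t ch) { return ch >= 0x0300 && ch <= 0x036F; }

// Toy one-to-one uppercase mapping for unaccented Greek letters only.
char32_t SimpleGreekUpper(char32_t ch) {
  if (ch == 0x03C2) return 0x03A3;  // final sigma ς -> Σ
  if (ch >= 0x03B1 && ch <= 0x03C9) return static_cast<char32_t>(ch - 0x20);
  return ch;
}

// Drop every combining mark except U+0308 (diaeresis), uppercase the rest.
std::u32string UpperStripAllButDiaeresis(const std::u32string& decomposed) {
  std::u32string out;
  for (char32_t ch : decomposed) {
    if (IsCombiningMark(ch) && ch != 0x0308) continue;  // accent removed
    out.push_back(SimpleGreekUpper(ch));
  }
  return out;
}
```

For example, α followed by U+0301 yields Α, while ι followed by U+0308 and U+0301 yields Ι followed by U+0308.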

Comment 7 Nikos Platis 2005-09-06 06:21:01 PDT
Your reference is fine!

In the next paragraph from the one you quote, the author presents the case of
diphthongs, as I mentioned them in my bug report:

> all-caps word need information on whether two vowels constitute a single 
> syllable or a diphthong, in order to insert diaereses, and that information 
> might only be forthcoming with reference to diacritics. (Thus, αὐλός aulós 
> 'flute' => ΑΥΛΟΣ, but ἄυλος áhulos 'incorporeal' => ΑΫΛΟΣ.)
Comment 8 Alfredos-Panagiotis Damkalis [:fredy] 2011-06-23 17:34:01 PDT
Hi guys, is there any progress on this bug?
Recently, I came across a similar bug in another CSS transformation (text-transform) and I was wondering whether it is related to this one.

PS: the reference linked above can now be found at http://www.opoudjis.net/unicode/unicode.html
Comment 9 Pascal Chevrel:pascalc(PTO until Sept 2) 2011-06-28 01:19:58 PDT
Note that our own mozilla sites in Greek are affected by this bug (bug 667430)
Comment 10 Panos Astithas [:past] 2011-09-19 07:00:53 PDT
Created attachment 560892 [details] [diff] [review]
Working patch

This patch properly converts the accented Greek characters, with the exception of the diphthongs. In order to fix those too, I need to mutate str, which is a const-ified aTextRun, but I'm not sure if I can remove the const. Is there any significant performance impact with such a change, or with the other changes in the patch?

I missed the attached test case so I also created another one at:

http://htmlpad.org/greek-css/
Comment 11 Panos Astithas [:past] 2011-09-29 02:27:52 PDT
Comment on attachment 560892 [details] [diff] [review]
Working patch

Simon, could you take a look at this patch?
Comment 12 George Fiotakis 2011-09-29 16:22:11 PDT
Panos, I'm not a coder, so I'm not sure if the patch takes into account all the Greek grammar rules.
Could you provide a test case with the phrase "Άκλιτα ρήματα ή άκλιτες μετοχές"?
Comment 13 Panos Astithas [:past] 2011-09-30 02:06:20 PDT
Created attachment 563695 [details] [diff] [review]
Working patch v2

Updated the patch to fix a bug highlighted by the test in comment 12. The problem with the diphthongs still remains, as noted in comment 10. I have another WIP patch that removes the const, but it seems rather invasive, so I'd appreciate any advice on how to mutate the whole text run (the following character to be more precise) instead of just the iterated character.
Comment 14 Panos Astithas [:past] 2011-09-30 02:08:30 PDT
(In reply to George Fiotakis from comment #12)
> Panos, I'm not a coder, so I'm not sure if the patch takes into account all
> the Greek grammar rules.
> Could you provide a test case with the phrase "Άκλιτα ρήματα ή άκλιτες
> μετοχές"?

It doesn't take all the rules into account currently. As I mentioned in comment 10 the diphthongs are not properly converted yet. Your test case highlighted that even the accented capital letters were not properly converted, since we had not been stripping the accents. I took care of that in the last revision of the patch, but I need guidance from Simon or someone else in the layout team on how to go about fixing the diphthong case.

I would appreciate it if you or anyone else can come up with other cases that this patch does not handle properly. Anyone can edit my test case by visiting http://htmlpad.org/greek-css/edit and adding more edge cases. In a couple of hours there should be builds with the patch applied to try out for all platforms at:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pastithas@mozilla.com-d5ee7ab95169
Comment 15 Alfredos-Panagiotis Damkalis [:fredy] 2011-09-30 03:27:11 PDT
Regarding comment 4, there is an official source supporting the statement that a word written in all capital letters should not be accented.
It is the official high-school grammar book: http://www.pi-schools.gr/books/gymnasio/grammatiki_a_b_c/s_1_200.pdf (in Greek). Paragraph 3.2 on page 23 says that when a word is written in all capital letters, it is _normally_ not accented.
Unfortunately, it does not provide rules for exceptions, such as what happens with the disjunctive eta (διαζευκτικό ήτα), which is always accented when it is not capital, or what happens with diphthongs.
Comment 16 George Fiotakis 2011-09-30 09:08:34 PDT
(In reply to comment #15)
As far as I can tell, words that start a sentence and whose first letter is accented should retain the accent even if written in all caps, and the disjunctive eta should be accented whether or not it is capitalised.
I'll post to the l18ngr list to see if we can get more insight on this, but I believe that if a character is already capital and accented, we should not try to convert it (most likely proper nouns such as names have to be accented if the accent is on the first letter as well, though I cannot be sure about it right now), and accented etas should retain their accent when there is a leading and a following space.
Comment 17 Panos Astithas [:past] 2011-10-01 03:45:50 PDT
The official rule is that words in all-caps are never accented, as can be seen in the reference in comment 15. Some people prefer to keep the accent on some occasions, such as the disjunctive eta, but the norm is to use the simple official rule. Check any newspaper, or even the power company bill, like I did. People with special preferences on particular occasions should not be using text transformations, but should rather write the text as they want it to appear. See also the Wikipedia article:

http://el.wikipedia.org/wiki/%CE%A4%CE%BF%CE%BD%CE%B9%CF%83%CE%BC%CF%8C%CF%82
Comment 18 Simon Montagu :smontagu 2011-10-06 03:37:54 PDT
Comment on attachment 563695 [details] [diff] [review]
Working patch v2

Review of attachment 563695 [details] [diff] [review]:
-----------------------------------------------------------------

I don't agree with changing the accented uppercase characters. I think we should maintain an invariant that transforming to uppercase/small-caps only affects lowercase characters and all others are unchanged.

As for the diphthongs, I don't understand why you need to modify str. Can't you just make all changes in convertedString?

::: layout/generic/nsTextRunTransformations.cpp
@@ +347,5 @@
>        } else {
>          if (styles[i]->GetStyleFont()->mFont.variant == NS_STYLE_FONT_VARIANT_SMALL_CAPS) {
>            PRUnichar ch = str[i];
>            PRUnichar ch2;
> +          ch2 = TransformToUpperCase(ch);

I'm not sure that you need to use TransformToUpperCase here at all, since the output is discarded anyway, and this pass through ToUpperCase is only used to test whether the input is lowercase.
Comment 19 Panos Astithas [:past] 2011-10-06 05:53:42 PDT
(In reply to Simon Montagu from comment #18)
> Comment on attachment 563695 [details] [diff] [review]
> Working patch v2
> 
> Review of attachment 563695 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't agree with changing the accented uppercase characters. I think we
> should maintain an invariant that transforming to uppercase/small-caps only
> affects lowercase characters and all others are unchanged.

Are you sure about that? What good is an invariant that does not hold for the Greek language (at least)? The spec says that "the actual transformation in each case is written language dependent":

http://www.w3.org/TR/CSS2/text.html#propdef-text-transform

If the purpose of these text transformations was not to render correct text in every language, but just to apply a mechanical conversion for comparison purposes or something, that would make sense. But as currently used in the web, people expect to use these style changes to render legible text.

> As for the diphthongs, I don't understand why you need to modify str. Can't
> you just make all changes in convertedString?

I've already thought about maintaining a flag inside the loop that won't require touching str, I just haven't had the time to do it yet. I only want to keep state in the diphthong cases in order to apply a different transformation on str[i+1], and my original thought was to fixup str[i+1] right away. I could append a sentinel in convertedString and check its presence for applying the different transformation, too. In any case, I'll have a fix for that soon.
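
A rough sketch of the flag idea described above: the input stays const, all output goes to a separate string, and the only state carried between iterations is the previous accented vowel. The function names are hypothetical, the diphthong pairs are the ones listed in the bug description, and only precomposed monotonic input is handled; this is not the code from any attached patch.

```cpp
#include <string>

// True when `cur` (ι or υ) follows an accented vowel that forms one of the
// pairs from the bug description and therefore needs a diaeresis.
bool NeedsDiaeresis(char32_t prevAccented, char32_t cur) {
  if (cur == 0x03B9)  // ι after ά, έ, ό, ύ
    return prevAccented == 0x03AC || prevAccented == 0x03AD ||
           prevAccented == 0x03CC || prevAccented == 0x03CD;
  if (cur == 0x03C5)  // υ after ά, έ, ή, ό
    return prevAccented == 0x03AC || prevAccented == 0x03AD ||
           prevAccented == 0x03AE || prevAccented == 0x03CC;
  return false;
}

std::u32string UpperWithDiphthongs(const std::u32string& str) {
  std::u32string convertedString;
  char32_t prevAccented = 0;  // 0 means "previous char was not an accented vowel"
  for (char32_t ch : str) {
    char32_t up = ch;
    bool accented = true;
    switch (ch) {
      case 0x03AC: up = 0x0391; break;  // ά -> Α
      case 0x03AD: up = 0x0395; break;  // έ -> Ε
      case 0x03AE: up = 0x0397; break;  // ή -> Η
      case 0x03AF: up = 0x0399; break;  // ί -> Ι
      case 0x03CC: up = 0x039F; break;  // ό -> Ο
      case 0x03CD: up = 0x03A5; break;  // ύ -> Υ
      case 0x03CE: up = 0x03A9; break;  // ώ -> Ω
      default:
        accented = false;
        if (ch == 0x03C2) up = 0x03A3;  // ς -> Σ
        else if (ch >= 0x03B1 && ch <= 0x03CB) up = static_cast<char32_t>(ch - 0x20);
    }
    if (NeedsDiaeresis(prevAccented, ch))
      up = (ch == 0x03B9) ? 0x03AA : 0x03AB;  // Ϊ or Ϋ
    convertedString.push_back(up);
    prevAccented = accented ? ch : 0;
  }
  return convertedString;
}
```

Under these assumptions "άυλος" becomes "ΑΫΛΟΣ" while "αυλός" becomes "ΑΥΛΟΣ", without writing to str.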

> ::: layout/generic/nsTextRunTransformations.cpp
> @@ +347,5 @@
> >        } else {
> >          if (styles[i]->GetStyleFont()->mFont.variant == NS_STYLE_FONT_VARIANT_SMALL_CAPS) {
> >            PRUnichar ch = str[i];
> >            PRUnichar ch2;
> > +          ch2 = TransformToUpperCase(ch);
> 
> I'm not sure that you need to use TransformToUpperCase here at all, since
> the output is discarded anyway, and this pass through ToUpperCase is only
> used to test whether the input is lowercase.

I remember that the small-caps transformation didn't work properly without this, but I'll double check.

Thanks!
Comment 20 Panos Astithas [:past] 2011-10-06 05:57:31 PDT
(In reply to Simon Montagu from comment #18)
> Comment on attachment 563695 [details] [diff] [review]
> Working patch v2
> 
> Review of attachment 563695 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't agree with changing the accented uppercase characters. I think we
> should maintain an invariant that transforming to uppercase/small-caps only
> affects lowercase characters and all others are unchanged.

Do you happen to know if such an invariant would also hold for languages like Chinese, Japanese or Arabic?
Comment 21 Simon Montagu :smontagu 2011-10-06 09:20:45 PDT
(In reply to Panos Astithas [:past] from comment #19)
> > I don't agree with changing the accented uppercase characters. I think we
> > should maintain an invariant that transforming to uppercase/small-caps only
> > affects lowercase characters and all others are unchanged.
> 
> Are you sure about that? What good is an invariant that does not hold for
> the Greek language (at least)? The spec says that "the actual transformation
> in each case is written language dependent":
> 
> http://www.w3.org/TR/CSS2/text.html#propdef-text-transform
> 
> If the purpose of these text transformations was not to render correct text
> in every language, but just to apply a mechanical conversion for comparison
> purposes or something, that would make sense. But as currently used in the
> web, people expect to use these style changes to render legible text.

Yes, I think you're right, though <thinking aloud> it depends a bit on what the input is. For example, Άνθρωπος to ΑΝΘΡΩΠΟΣ seems like the Right Thing To Do, but correcting ΆΝΘΡΩΠΟΣ to ΑΝΘΡΩΠΟΣ seems counter-intuitive. That said, in the former example we are doing something sensible to normal input, and the latter example is GIGO, so I've convinced myself :) </thinking aloud>

FWIW, this is called out as an open issue at http://www.w3.org/TR/css3-text/#text-transform
Comment 22 Panos Astithas [:past] 2011-10-07 03:31:21 PDT
Created attachment 565483 [details]
Screenshot of the small-caps bug w/o TransformToUpperCase

(In reply to Panos Astithas [:past] from comment #19)
> (In reply to Simon Montagu from comment #18)
> > ::: layout/generic/nsTextRunTransformations.cpp
> > @@ +347,5 @@
> > >        } else {
> > >          if (styles[i]->GetStyleFont()->mFont.variant == NS_STYLE_FONT_VARIANT_SMALL_CAPS) {
> > >            PRUnichar ch = str[i];
> > >            PRUnichar ch2;
> > > +          ch2 = TransformToUpperCase(ch);
> > 
> > I'm not sure that you need to use TransformToUpperCase here at all, since
> > the output is discarded anyway, and this pass through ToUpperCase is only
> > used to test whether the input is lowercase.
> 
> I remember that the small-caps transformation didn't work properly without
> this, but I'll double check.

The reason this was necessary is that ToUpperCase does not properly handle at least two cases: the lowercase accented upsilon (or iota) with diaeresis, and uppercase accented letters, as can be seen in the screenshot.

The behavior of ToUpperCase is arguably correct for uppercase accented letters, since it's a more general purpose method. In the small-caps transformation case though, we need to use a different rule.

The behavior of ToUpperCase for the case of lowercase accented upsilon (or iota) with diaeresis is probably a bug that presumably stems from the fact that there are no obvious uppercase equivalents (with both accent and diaeresis). The proper equivalent would be the uppercase letter with diaeresis, since no word can start with either of these letters, which would make the retention of the accent conditional (capitalize vs. uppercase): the diaeresis is used to convert a monophthong to a diphthong and is applied to the second vowel.

I will file a separate bug to fix ToUpperCase as well.
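
To make the two cases above concrete, here is a hedged sketch. SimpleToUpper is a toy stand-in for a one-to-one uppercase helper (it is not the real ToUpperCase), restricted to monotonic Greek; like Unicode's simple case mappings, it has no single-code-point uppercase for ΐ (U+0390) and ΰ (U+03B0). IsLowercase and SmallCapsUpper are hypothetical names showing how a small-caps pass could still detect and convert those letters.

```cpp
// Toy one-to-one uppercase mapping that keeps the accent.
char32_t SimpleToUpper(char32_t ch) {
  switch (ch) {
    case 0x0390: case 0x03B0: return ch;  // ΐ, ΰ: no one-to-one uppercase
    case 0x03AC: return 0x0386;  // ά -> Ά
    case 0x03AD: return 0x0388;  // έ -> Έ
    case 0x03AE: return 0x0389;  // ή -> Ή
    case 0x03AF: return 0x038A;  // ί -> Ί
    case 0x03CC: return 0x038C;  // ό -> Ό
    case 0x03CD: return 0x038E;  // ύ -> Ύ
    case 0x03CE: return 0x038F;  // ώ -> Ώ
    case 0x03C2: return 0x03A3;  // ς -> Σ
  }
  if (ch >= 0x03B1 && ch <= 0x03CB) return static_cast<char32_t>(ch - 0x20);
  return ch;
}

// "Is this lowercase?" test that does not rely on the mapping alone,
// so ΐ and ΰ are still recognised.
bool IsLowercase(char32_t ch) {
  return SimpleToUpper(ch) != ch || ch == 0x0390 || ch == 0x03B0;
}

// Remove the accent from an uppercase accented letter.
char32_t StripTonos(char32_t upper) {
  switch (upper) {
    case 0x0386: return 0x0391;  // Ά -> Α
    case 0x0388: return 0x0395;  // Έ -> Ε
    case 0x0389: return 0x0397;  // Ή -> Η
    case 0x038A: return 0x0399;  // Ί -> Ι
    case 0x038C: return 0x039F;  // Ό -> Ο
    case 0x038E: return 0x03A5;  // Ύ -> Υ
    case 0x038F: return 0x03A9;  // Ώ -> Ω
  }
  return upper;
}

// Character to display in small-caps: accent dropped, diaeresis kept.
char32_t SmallCapsUpper(char32_t ch) {
  if (ch == 0x0390) return 0x03AA;  // ΐ -> Ϊ
  if (ch == 0x03B0) return 0x03AB;  // ΰ -> Ϋ
  return StripTonos(SimpleToUpper(ch));
}
```

Here an already-uppercase Ά is not reported as lowercase, but SmallCapsUpper still maps it to Α.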
Comment 23 Panos Astithas [:past] 2011-10-07 07:38:39 PDT
Created attachment 565530 [details] [diff] [review]
Working patch v3

This version fixes all reported issues, including diphthongs, even a couple of rare corner cases. I also fixed an existing reftest that this patch broke. Try build results at:

https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=d165d3664478

Nightly builds for testing (for anyone who would like to test this fix) will appear soon at:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pastithas@mozilla.com-d165d3664478

I also intend to contact the W3C regarding the open issue at the CSS3 spec soon.
Comment 24 Jonathan Kew (:jfkthame) 2011-10-07 08:31:32 PDT
I have a couple of concerns with this....

First, the current patch doesn't look like it will handle text that is in "decomposed" form (encoded using base characters and combining diacritics, instead of precomposed accented letters). The two forms are defined to be canonically equivalent, so it's important that they are treated in the same way - users should not see any difference in behavior between the alternate normalization forms.

Second, what about all the other accented Greek letters (those in the U+1Fxx block)? I realize that modern monotonic Greek orthography doesn't normally use these, but we need to consider what to do with them - especially as there are canonical equivalences between these and U+03xx sequences. Is it appropriate to strip accents if uppercase or small-caps transformations are applied to polytonic Greek text?
Comment 25 Panos Astithas [:past] 2011-10-07 09:09:26 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #24)
> I have a couple of concerns with this....
> 
> First, the current patch doesn't look like it will handle text that is in
> "decomposed" form (encoded using base characters and combining diacritics,
> instead of precomposed accented letters). The two forms are defined to be
> canonically equivalent, so it's important that they are treated in the same
> way - users should not see any difference in behavior between the alternate
> normalization forms.

I tried to fix that case, too, but I couldn't find a way to test it. I don't know how to generate such text myself and my web search for a page with such characters came up empty. I agree that in principle this is the right thing to do, but is this a problem in practice?

> Second, what about all the other accented Greek letters (those in the U+1Fxx
> block)? I realize that modern monotonic Greek orthography doesn't normally
> use these, but we need to consider what to do with them - especially as
> there are canonical equivalences between these and U+03xx sequences. Is it
> appropriate to strip accents if uppercase or small-caps transformations are
> applied to polytonic Greek text?

I purposefully didn't consider this case. The mapping would be substantially larger due to the extra accents, and if we also take ancient Greek into account, it will require me to relearn it, because I haven't touched it for about 20 years :-)

I think the same basic principle holds for old Greek as well (accents should be stripped), and ancient Greek originally didn't have lowercase characters, but the corner cases could be numerous. Also, perhaps the rules differ depending on the context or use case (e.g. poetry vs. legal text), and I'm not sure what the primary use case for old/ancient Greek on the web is, if it is used at all. 

Since no browser today capitalizes modern Greek properly, getting old Greek fixed will definitely have a pretty low ROI. Can we file a followup bug for that and deal with it separately?
Comment 26 Jonathan Kew (:jfkthame) 2011-10-07 10:15:27 PDT
Created attachment 565572 [details]
testcase using some decomposed sequences
Comment 27 Panos Astithas [:past] 2011-10-10 04:21:29 PDT
Created attachment 565897 [details]
testcase with decomposed sequences

(In reply to Jonathan Kew (:jfkthame) from comment #26)
> Created attachment 565572 [details]
> testcase using some decomposed sequences

Thank you, that was very helpful. I've expanded on it a bit, adding all the relevant combining diacritic marks that I could find.
Comment 28 Panos Astithas [:past] 2011-10-10 06:41:41 PDT
Created attachment 565926 [details] [diff] [review]
Working patch v4

I've expanded the patch to handle combining diacritic marks as well and added an expanded version of Jonathan's test in the relevant reftest. Converting accented characters in a decomposed sequence is done with a substitution of the combining diacritic with the zero width space character (0x200B), in order to maintain the string length and not trigger numerous assertions. I wanted to use the word joiner character (0x2060), but it seems to change the line spacing in the reftest, causing it to fail.

Try results will appear at:

https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=9414664f98f6

Builds will emerge at:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pastithas@mozilla.com-9414664f98f6
Comment 29 Jonathan Kew (:jfkthame) 2011-10-10 07:00:18 PDT
(In reply to Panos Astithas [:past] from comment #28)
> Created attachment 565926 [details] [diff] [review]
> Working patch v4
> 
> I've expanded the patch to handle combining diacritic marks as well and
> added an expanded version of Jonathan's test in the relevant reftest.
> Converting accented characters in a decomposed sequence is done with a
> substitution of the combining diacritic with the zero width space character
> (0x200B), in order to maintain the string length and not trigger numerous
> assertions. I wanted to use the word joiner character (0x2060), but it seems
> to change the line spacing in the reftest, causing it to fail.

Replacing the combining diacritics in this way is not really satisfactory, for a couple of reasons. Your problem with U+2060 is (presumably) that it's not supported in the font being used, so fallback occurs and picks a different font with larger metrics, thus disturbing the line spacing. But this could also occur with U+200B or whatever other "invisible" character you might pick, depending on the specific fonts that happen to be involved.

Secondly, this will also inhibit any kerning that might be present in the font - for example, if the font contains kerning pairs such as "ΑΤ" and "ΑΥ" (which are quite likely in a well-designed Greek font), but you replace a diacritic following the Alpha with ZWSP (or WJ or whatever), the kerning will be broken.

If a transformation wants to remove combining diacritics from the text, it should _really_ delete them so that the adjacent characters that remain can interact properly at the OpenType layout level. Modifying the length will certainly make things trickier - but uppercasing "ß" as "SS" also changes the length, so there may be some existing code you can look at for guidance.
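
One way to picture a transformation whose output length differs from its input is sketched below. It deletes the combining acute accent (U+0301), expands "ß" to "SS", and records which input index produced each output character so that per-character data can be carried across. The Transformed struct and the function name are assumptions for this example and do not reflect the actual text-run data structures.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Transformed {
  std::u32string text;
  // sourceIndex[j] is the index in the input that produced text[j].
  std::vector<std::size_t> sourceIndex;
};

Transformed UpperDeletingAccents(const std::u32string& in) {
  Transformed out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t ch = in[i];
    if (ch == 0x0301) continue;  // combining acute: really deleted, output shrinks
    if (ch == 0x00DF) {          // ß -> "SS": output grows by one
      out.text += U"SS";
      out.sourceIndex.push_back(i);
      out.sourceIndex.push_back(i);
      continue;
    }
    if (ch == 0x03C2) ch = 0x03A3;                      // ς -> Σ
    else if (ch >= 0x03B1 && ch <= 0x03C9) ch -= 0x20;  // α..ω -> Α..Ω
    out.text.push_back(ch);
    out.sourceIndex.push_back(i);
  }
  return out;
}
```

For the decomposed input α, U+0301, τ the result is "ΑΤ" with source indices 0 and 2, so the Α and Τ end up adjacent and a font's kerning pair can apply.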
Comment 30 Panos Astithas [:past] 2012-01-15 12:29:17 PST
Created attachment 588764 [details] [diff] [review]
Working patch v5

I finally found some time to look into this again and the solution proved simpler than anticipated. Eliminating the accent in the combining diacritic case was not hard at all. Successful try run at: https://tbpl.mozilla.org/?tree=Try&rev=40b52d3fc85b
Comment 31 Gordon P. Hemsley [:GPHemsley] 2012-01-15 14:38:22 PST
As I recall, there are other languages, like Spanish, that have writing conventions that lose diacritics when capitalized. (I don't know where small-caps fits in with that, though.) I never understood why this was necessary in the modern days of computing, where there is no problem representing capital letters with diacritics—especially when losing them can actually create ambiguity.

So, I guess I have two questions:
(1) Is it truly necessary in the modern days of computing to lose diacritics in Greek?
(2) If so, shouldn't it be implemented in a way that expands easily to include other languages or writing systems that might have similar conventions?
Comment 32 Pascal Chevrel:pascalc(PTO until Sept 2) 2012-01-15 22:39:09 PST
(In reply to Gordon P. Hemsley [:gphemsley] from comment #31)
> As I recall, there are other languages, like Spanish, that have writing
> conventions that lose diacritics when capitalized. (I don't know where
> small-caps fits in with that, though.) 

That's actually not true for Spanish (same for French BTW): diacritics are, on the contrary, compulsory with capital letters:
http://www.rae.es/rae/gestores/gespub000018.nsf/%28voAnexos%29/arch8100821B76809110C12571B80038BA4A/$File/CuestionesparaelFAQdeconsultas.htm#ap22
Comment 33 Gordon P. Hemsley [:GPHemsley] 2012-01-15 22:44:55 PST
(In reply to Pascal Chevrel:pascalc from comment #32)
> (In reply to Gordon P. Hemsley [:gphemsley] from comment #31)
> > As I recall, there are other languages, like Spanish, that have writing
> > conventions that lose diacritics when capitalized. (I don't know where
> > small-caps fits in with that, though.) 
> 
> That's actually not true for Spanish (same for French BTW): diacritics are,
> on the contrary, compulsory with capital letters:
> http://www.rae.es/rae/gestores/gespub000018.nsf/%28voAnexos%29/
> arch8100821B76809110C12571B80038BA4A/$File/CuestionesparaelFAQdeconsultas.
> htm#ap22

Well, I'll be. A myth about orthography. Who'da thunk. (Glad to see the RAE agrees with me, then.)

But I don't think it invalidates my question, as I'm sure there are other languages/orthographies that have rules like that. Shouldn't this support be implemented in a way that is not inherently Greek-centric?
Comment 34 Panos Astithas [:past] 2012-01-15 23:46:26 PST
(In reply to Gordon P. Hemsley [:gphemsley] from comment #31)
> As I recall, there are other languages, like Spanish, that have writing
> conventions that lose diacritics when capitalized. (I don't know where
> small-caps fits in with that, though.) I never understood why this was
> necessary in the modern days of computing, where there is no problem
> representing capital letters with diacritics—especially when losing them can
> actually create ambiguity.
> 
> So, I guess I have two questions:
> (1) Is it truly necessary in the modern days of computing to lose diacritics
> in Greek?

I don't see how the modern days of computing have any relevance to this matter, but yes, Greek capital text has no diacritics. You won't see the text in the attached testcases appear in any book, newspaper, magazine, billboard or TV show the way it is rendered by modern browsers. The fact that all Greek web pages avoid these text transformations like the plague speaks volumes about the problem.

> (2) If so, shouldn't it be implemented in a way that expands easily to
> include other languages or writing systems that might have similar
> conventions?

That was my intention from the beginning. This is why I have kept the function's name generic (TransformToUpperCase) and moved all the language-specific logic in there. If there is something I missed, I will gladly fix it.
Comment 35 Jonathan Kew (:jfkthame) 2012-01-15 23:59:28 PST
For some relevant comments from an experienced typographer/font designer, see http://lists.w3.org/Archives/Public/www-style/2011May/0612.html.
Comment 36 Panos Astithas [:past] 2012-01-16 01:27:06 PST
(In reply to Jonathan Kew (:jfkthame) from comment #35)
> For some relevant comments from an experienced typographer/font designer,
> see http://lists.w3.org/Archives/Public/www-style/2011May/0612.html.

Interesting comments, though not terribly relevant to modern Greek. I'm not sure I fully understand the glyph-level alternative that he describes, but it's out of scope here, in any case. Also, can anyone explain the point of the losslessly reversible transformation requirement he mentions?
Comment 37 Jonathan Kew (:jfkthame) 2012-01-16 02:13:38 PST
(In reply to Panos Astithas [:past] from comment #36)
> (In reply to Jonathan Kew (:jfkthame) from comment #35)
> > For some relevant comments from an experienced typographer/font designer,
> > see http://lists.w3.org/Archives/Public/www-style/2011May/0612.html.
> 
> Interesting comments, though not terribly relevant to modern Greek.

But relevant to CSS and to browser implementations, which should not (IMO) limit themselves to considering one particular set of orthographic conventions. If "[omitting accents in uppercase] was not the norm for most of the history of accented Greek text, in which accents were frequently written above Greek uppercase letters...", it's not clear to me whether we should be adopting that convention to the exclusion of others.

Personally, I'd be happier if we had some guidance/consensus at least from the CSS working group before we commit to implementing this particular behavior.
Comment 38 Panos Astithas [:past] 2012-01-16 04:35:21 PST
(In reply to Jonathan Kew (:jfkthame) from comment #37)
> (In reply to Panos Astithas [:past] from comment #36)
> > (In reply to Jonathan Kew (:jfkthame) from comment #35)
> > > For some relevant comments from an experienced typographer/font designer,
> > > see http://lists.w3.org/Archives/Public/www-style/2011May/0612.html.
> > 
> > Interesting comments, though not terribly relevant to modern Greek.
> 
> But relevant to CSS and to browser implementations, which should not (IMO)
> limit themselves to considering one particular set of orthographic
> conventions. If "[omitting accents in uppercase] was not the norm for most
> of the history of accented Greek text, in which accents were frequently
> written above Greek uppercase letters...", it's not clear to me whether we
> should be adopting that convention to the exclusion of others.

I must be missing something obvious, because I cannot understand this part. Please, help me.

There is no set of orthographic conventions that concern the modern Greek language, just one single convention: the one that is embodied in this patch. Regarding the history of accented Greek text, my recollection is that uppercase was not accented most of the time, but I don't find it worthwhile to confirm or disprove, since it has no bearing on the issue at hand. And even if _some_ ancient texts have accents written above uppercase letters, they certainly didn't have them written to the left of the letter (as the commenter points out), so the current transformation still produces wrong results.

If I am not mistaken (and please correct me if I am wrong), this bug is about fixing the way browsers (and Firefox in particular) capitalize Greek text, because the way they do it currently makes it hard for Greeks to read and comprehend. Literally. And it is also wrong, if school books have any say on the matter.

I don't know of any web pages with ancient Greek, that contain the text in lowercase and rely on a transformation to present it as it was written back in the day. If rendering ancient Greek text properly capitalized is deemed an important goal, then let's file a followup bug for that and fix it there.

> Personally, I'd be happier if we had some guidance/consensus at least from
> the CSS working group before we commit to implementing this particular
> behavior.

I can certainly sympathize with that, although in my view the standard provides us with enough leeway to implement the right behavior, even if it is not mandated. I was planning to post there after this landed, but I'll try to find time to post tonight.
Comment 39 Panos Astithas [:past] 2012-01-22 23:40:55 PST
(In reply to Panos Astithas [:past] from comment #38)
> (In reply to Jonathan Kew (:jfkthame) from comment #37)
> > Personally, I'd be happier if we had some guidance/consensus at least from
> > the CSS working group before we commit to implementing this particular
> > behavior.
> 
> I can certainly sympathize with that, although in my view the standard
> provides us with enough leeway to implement the right behavior, even if it
> is not mandated. I was planning to post there after this landed, but I'll
> try to find time to post tonight.

FWIW, the thread I started in www-style confirmed that this is a valid behavior according to the standard:

http://lists.w3.org/Archives/Public/www-style/2012Jan/0827.html
Comment 40 Simon Montagu :smontagu 2012-02-07 11:55:35 PST
Comment on attachment 588764 [details] [diff] [review]
Working patch v5

Passing the request to Jonathan, since he seems to be more on the ball with the issues here than me :)
Comment 41 Jonathan Kew (:jfkthame) 2012-02-26 11:03:32 PST
I still have a couple of concerns with this, which make me hesitant to see it land as it stands:

(a) Is this the correct behavior from a spec point of view?

In response to your question on www-style, fantasai said[1]:
  "The corrected behavior is allowed by the spec, but since it isn't defined in Unicode's tables, it's not required..."
which IMO isn't all that helpful, as it leaves the door wide open for interop problems.

While I understand that this change would make the behavior follow Greek expectations better, it wouldn't really be much use to authors unless all browsers adopt it - and for that to happen, I think there needs to be something that specifies the correct behavior, not just a spec that's vague enough to "allow but not require" it.

Moreover, there was a different response from Christoph Päper, who said[2]:
  "I still believe the resolution for levels 2 and 3 should be that ‘text-transform’ and ‘font-variant’ are language-agnostic..."
I'm somewhat inclined to share this view (although he went on to overstate the case by mentioning "ß", whose standard uppercase form is "SS" in the Unicode standard). It seems to me that unless CSS _explicitly defines_ some other behavior for the uppercase transform, the most natural interpretation is that it should follow the default casing rules given by Unicode.

[1] http://lists.w3.org/Archives/Public/www-style/2012Jan/0852.html
[2] http://lists.w3.org/Archives/Public/www-style/2012Jan/0873.html

(b) If we do make this change, even without any clear spec to guide it, should we restrict it to lang="el"?

Supposing we decide to go ahead with this, I wonder if it would be appropriate to say that the Greek-specific behavior would apply _only_ to content that is explicitly tagged as lang="el"? To some extent, this would answer the objection that these properties "are language-agnostic".... they would remain language-agnostic, using only the standard Unicode-defined behavior, for content whose language is unknown or only guessed, but authors could opt in to the language-specific version of the behavior by explicitly marking the language of their text.

So on the whole, I think I'd be happier with taking the patch if the behavior were conditional on the language.

(c) Is it appropriate to strip accents from the already-uppercase characters?

I notice that the patch as it stands will map accented uppercase letters to their unaccented counterparts. I wonder if this is a good idea... if accented uppercase letters are present in the text, this suggests that the author deliberately chose to use accented uppercase, which is quite different from uppercase that appears as a stylistic choice - perhaps not even controlled by the original author - where the text was originally entered in lowercase.

(d) What about the rest of the accented Greek letters (in the 1Fxx block)?

AFAIK, these are not normally used in modern Greek, but they are used in classical/Biblical/scholarly materials, and in such contexts they will be intermingled with the 03xx letters whose behavior you're changing here. I don't think it's a good situation if we have an uppercase transform that takes the accent off U+03AC (ά), for example, but not off U+1F71 (ά), which is canonically equivalent; and if we strip that accent, then what about letters like U+1F04 (ἄ), which has the same accent but also a breathing mark?
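
To make the canonical-equivalence point concrete, here is a small Python sketch (an illustration only, not part of any patch on this bug; the helper name `nfd_codepoints` is made up). It uses the standard `unicodedata` module to show that U+1F71 and U+03AC decompose to the same sequence, while U+1F04 carries an extra breathing mark:

```python
import unicodedata

def nfd_codepoints(s: str) -> list[str]:
    """Return the code points of the canonical decomposition (NFD) of s."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

# U+03AC (alpha with tonos) and U+1F71 (alpha with oxia) decompose identically.
print(nfd_codepoints("\u03ac"))  # alpha + combining acute
print(nfd_codepoints("\u1f71"))  # same sequence as above
# U+1F04 (alpha with psili and oxia) has a breathing mark as well as the accent.
print(nfd_codepoints("\u1f04"))
```

Any rule keyed on the precomposed 03xx code points alone would therefore treat two canonically equivalent inputs differently unless the text is normalized first.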

--

Finally, regarding the implementation: given the need to handle a larger number of characters - point (d) above - I think it'd be worth finding a more data-driven way to write this, rather than the multiple nested switch()es. Looking at the rules you're implementing, I think it would be significantly more efficient to create a stateful "uppercasing iterator" that could pass over the string, only needing to read each character once, and recording enough state to be able to make the right choices for the characters where you're currently looking back at the preceding character(s).
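
As a rough illustration of what such a stateful, single-pass approach could look like, here is a Python sketch. It is not the code from any patch here, and the rules it applies are assumptions chosen for the example: drop the tonos from Greek letters, and add a dialytika to an iota or upsilon that directly follows a vowel whose tonos was dropped. The names `greek_upper` and `is_greek_letter` are hypothetical.

```python
import unicodedata

TONOS = "\u0301"      # combining acute accent (the modern Greek tonos)
DIALYTIKA = "\u0308"  # combining diaeresis

def is_greek_letter(ch: str) -> bool:
    # Assumption for this sketch: identify Greek letters by Unicode name.
    return "GREEK" in unicodedata.name(ch, "")

def greek_upper(text: str) -> str:
    """Uppercase text in one pass, reading each character once."""
    out = []
    base_is_greek = False  # state: is the most recent base letter Greek?
    lost_tonos = False     # state: did the most recent letter lose a tonos?
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if ch == TONOS and base_is_greek:
                lost_tonos = True  # drop the accent, but remember it
            elif ch == DIALYTIKA and out and out[-1] == DIALYTIKA:
                pass               # avoid doubling a dialytika we just added
            else:
                out.append(ch)     # keep every other combining mark
            continue
        # Base character: decide using the recorded state, then reset it.
        needs_dialytika = lost_tonos and ch in "\u03b9\u03c5"  # iota, upsilon
        base_is_greek = is_greek_letter(ch)
        lost_tonos = False
        out.append(ch.upper())
        if needs_dialytika:
            out.append(DIALYTIKA)
    return unicodedata.normalize("NFC", "".join(out))

print(greek_upper("\u03bc\u03ac\u03b9\u03bf\u03c2"))
```

Because the sketch works on the decomposed form, composed and decomposed input give the same result, and accents on non-Greek letters are left alone. It also shows the concern in (d): a letter such as U+1F04 loses its acute but keeps its breathing mark.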

--

I'm asking dbaron for feedback on this from a CSS spec point of view, particularly with regard to points (a) and (b) above, as I'm not at all sure what's the best way forward here. Should we ignore the interop concerns and just go ahead with what's convenient for Greek users - even though authors won't be able to reliably use the feature? Should we limit it to content that's explicitly tagged for language, which IMO makes it more reasonable to deviate from the standard Unicode behavior and do something language-specific? Other opinions are also welcome, of course...
Comment 42 Jonathan Kew (:jfkthame) 2012-02-28 09:28:55 PST
CSS2 was pretty vague about this, just saying "The actual transformation in each case is written language dependent", but checking the current text of CSS3-Text (http://www.w3.org/TR/css3-text/#text-transform), it says:

"The UA must use the full case mappings for Unicode characters, including any conditional casing rules, as defined in Default Case Algorithm section. If (and only if) the content language of the element is, according to the rules of the document language, known, then any appropriate language-specific rules must be applied as well."

This appears to support the idea that Greek-specific casing rules that deviate from the standard Unicode case mappings (in particular, the removal of accents) would be permitted/expected/required _only_ when the language of the element is known.
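
For reference, the default (language-independent) Unicode behavior can be seen with Python's built-in `str.upper()`, which follows the Unicode case mappings without any language tailoring. This is only an illustration of the default mappings, not of Gecko's code; `default_upper` is a made-up wrapper name.

```python
def default_upper(text: str) -> str:
    # str.upper() applies Unicode's default case mappings, with no
    # language-specific tailoring.
    return text.upper()

# The sharp s expands to two letters under the full case mappings.
print(default_upper("stra\u00dfe"))
# An accented Greek vowel keeps its accent: U+03AD maps to U+0388.
print(default_upper("\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"))
```

The second result keeps the accent on the uppercase epsilon, which is the output that Greek readers in this bug object to.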
Comment 43 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2012-02-28 13:37:46 PST
(In reply to Jonathan Kew (:jfkthame) from comment #41)
> I'm asking dbaron for feedback on this from a CSS spec point of view,
> particularly with regard to points (a) and (b) above, as I'm not at all sure
> what's the best way forward here. Should we ignore the interop concerns and
> just go ahead with what's convenient for Greek users - even though authors
> won't be able to reliably use the feature? Should we limit it to content
> that's explicitly tagged for language, which IMO makes it more reasonable to
> deviate from the standard Unicode behavior and do something
> language-specific? Other opinions are also welcome, of course...

I don't think I have any useful opinion on (b).

However, for (a), I think that if we think it's worthwhile to implement some behavior other than what Unicode says, we should write a spec for it, whether or not that spec actually progresses down the recommendation track, so that there's at least some documentation of the behavior we think is better, a chance for people to comment on that behavior, and a chance for other browsers to do the same.
Comment 44 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2012-02-28 13:38:29 PST
Comment on attachment 588764 [details] [diff] [review]
Working patch v5

I have no idea what marking this feedback+ or feedback- means, so picking feedback- somewhat arbitrarily.
Comment 45 Panos Astithas [:past] 2012-02-29 05:21:11 PST
(In reply to Jonathan Kew (:jfkthame) from comment #41)
> I still have a couple of concerns with this, which make me hesitant to see
> it land as it stands:
> 
> (a) Is this the correct behavior from a spec point of view?
> 
> In response to your question on www-style, fantasai said[1]:
>   "The corrected behavior is allowed by the spec, but since it isn't defined
> in Unicode's tables, it's not required..."
> which IMO isn't all that helpful, as it leaves the door wide open for
> interop problems.

Can you clarify what interop scenarios you are concerned about?

> While I understand that this change would make the behavior follow Greek
> expectations better, it wouldn't really be much use to authors unless all
> browsers adopt it - and for that to happen, I think there needs to be
> something that specifies the correct behavior, not just a spec that's vague
> enough to "allow but not require" it.

Since all current implementations are broken (wrt the expectations of a Greek user), I can't really see how it would be worse to have Firefox work as expected. Knowledgeable Greek users currently shy away from using these transforms, while others who do (like the Android Market), get broken results. Even if this new behavior is never publicized enough to inspire other vendors or effect change, it would at least help with the latter cases, wouldn't it?

> Moreover, there was a different response from Christoph Päper, who said[2]:
>   "I still believe the resolution for levels 2 and 3 should be that
> ‘text-transform’ and ‘font-variant’ are language-agnostic..."
> I'm somewhat inclined to share this view (although he went on to overstate
> the case by mentioning "ß", whose standard uppercase form is "SS" in the
> Unicode standard). It seems to me that unless CSS _explicitly defines_ some
> other behavior for the uppercase transform, the most natural interpretation
> is that it should follow the default casing rules given by Unicode.
> 
> [1] http://lists.w3.org/Archives/Public/www-style/2012Jan/0852.html
> [2] http://lists.w3.org/Archives/Public/www-style/2012Jan/0873.html
> 
> (b) If we do make this change, even without any clear spec to guide it,
> should we restrict it to lang="el"?
> 
> Supposing we decide to go ahead with this, I wonder if it would be
> appropriate to say that the Greek-specific behavior would apply _only_ to
> content that is explicitly tagged as lang="el"? To some extent, this would
> answer the objection that these properties "are language-agnostic".... they
> would remain language-agnostic, using only the standard Unicode-defined
> behavior, for content whose language is unknown or only guessed, but authors
> could opt in to the language-specific version of the behavior by explicitly
> marking the language of their text.
> 
> So on the whole, I think I'd be happier with taking the patch if the
> behavior were conditional on the language.

I will defer to your more experienced judgement on this matter. Since I'm not familiar with this part of the code, some implementation pointers would definitely help me get this changed.

> (c) Is it appropriate to strip accents from the already-uppercase characters?
> 
> I notice that the patch as it stands will map accented uppercase letters to
> their unaccented counterparts. I wonder if this is a good idea... if
> accented uppercase letters are present in the text, this suggests that the
> author deliberately chose to use accented uppercase, which is quite
> different from uppercase that appears as a stylistic choice - perhaps not
> even controlled by the original author - where the text was originally
> entered in lowercase.

If I understand correctly the use case you are considering, it is the garbage-in/garbage-out scenario that Simon mentioned in comment 21. Accented uppercase is wrong, so I would imagine that one would avoid any text transformations on it if he wanted it untouched for some reason.

> (d) What about the rest of the accented Greek letters (in the 1Fxx block)?
> 
> AFAIK, these are not normally used in modern Greek, but they are used in
> classical/Biblical/scholarly materials, and in such contexts they will be
> intermingled with the 03xx letters whose behavior you're changing here. I
> don't think it's a good situation if we have an uppercase transform that
> takes the accent off U+03AC (ά), for example, but not off U+1F71 (ά), which
> is canonically equivalent; and if we strip that accent, then what about
> letters like U+1F04 (ἄ), which has the same accent but also a breathing mark?

As I said in comment 38 I'm not looking to fix ancient and other old forms of Greek here. It's way too complicated and for way too little benefit IMO. Should transformation conformance among the historical variations of a language trump expected behavior in the majority of web content (in that language) out there?

> --
> 
> Finally, regarding the implementation: given the need to handle a larger
> number of characters - point (d) above - I think it'd be worth finding a
> more data-driven way to write this, rather than the multiple nested
> switch()es. Looking at the rules you're implementing, I think it would be
> significantly more efficient to create a stateful "uppercasing iterator"
> that could pass over the string, only needing to read each character once,
> and recording enough state to be able to make the right choices for the
> characters where you're currently looking back at the preceding character(s).
> 
> --
> 
> I'm asking dbaron for feedback on this from a CSS spec point of view,
> particularly with regard to points (a) and (b) above, as I'm not at all sure
> what's the best way forward here. Should we ignore the interop concerns and
> just go ahead with what's convenient for Greek users - even though authors
> won't be able to reliably use the feature? Should we limit it to content
> that's explicitly tagged for language, which IMO makes it more reasonable to
> deviate from the standard Unicode behavior and do something
> language-specific? Other opinions are also welcome, of course...

One thing I'd like to point out is that AFAIK we like to lead at Mozilla. We like to pave the way for a better future for the web, and I would consider this as the first step in getting other vendors to adopt the same behavior. For all I know they may never have had someone just sit down and compile the list of rules these transformations should follow.
Comment 46 Panos Astithas [:past] 2012-02-29 05:23:48 PST
(In reply to David Baron [:dbaron] from comment #43)
> (In reply to Jonathan Kew (:jfkthame) from comment #41)
> > I'm asking dbaron for feedback on this from a CSS spec point of view,
> > particularly with regard to points (a) and (b) above, as I'm not at all sure
> > what's the best way forward here. Should we ignore the interop concerns and
> > just go ahead with what's convenient for Greek users - even though authors
> > won't be able to reliably use the feature? Should we limit it to content
> > that's explicitly tagged for language, which IMO makes it more reasonable to
> > deviate from the standard Unicode behavior and do something
> > language-specific? Other opinions are also welcome, of course...
> 
> I don't think I have any useful opinion on (b).
> 
> However, for (a), I think that if we think it's worthwhile to implement some
> behavior other than what Unicode says, we should write a spec for it,
> whether or not that spec actually progresses down the recommendation track,
> so that there's at least some documentation of the behavior we think is
> better, a chance for people to comment on that behavior, and a chance for
> other browsers to do the same.

Would a page in wmo with the transformation explained suffice or be a good starting point, or were you thinking about something more involved? I would gladly help with that, if I can.
Comment 47 Jonathan Kew (:jfkthame) 2012-02-29 07:10:05 PST
(In reply to Panos Astithas [:past] from comment #45)
> Can you clarify what interop scenarios you are concerned about?

For example, if we implement a custom Greek casing behavior, authors who themselves use Firefox may start using text-transform:uppercase with Greek text, assuming that the result they see in Firefox will also be the result that their readers see. But as long as other UAs implement a standardized transform based on Unicode properties, users with those browsers won't be seeing what the author intended.

> > While I understand that this change would make the behavior follow Greek
> > expectations better, it wouldn't really be much use to authors unless all
> > browsers adopt it - and for that to happen, I think there needs to be
> > something that specifies the correct behavior, not just a spec that's vague
> > enough to "allow but not require" it.
> 
> Since all current implementations are currently broken (wrt the expectations
> of a Greek user), I can't really see how it would be worse to have Firefox
> work as expected. Knowledgeable Greek users currently shy away from using
> these transforms,

Which is sensible, as long as the transform doesn't work well for Greek. The danger with unilaterally implementing a better Greek behavior in Firefox, especially in the absence of a clear spec, is that some authors may begin to use it in preference to directly entering text in the casing form they intend it to be displayed, without realizing that users on other browsers will then get a _worse_ result.

> while others who do (like the Android Market), get broken
> results. Even if this new behavior is never publicized enough to inspire
> other vendors or effect change, it would at least help with the latter
> cases, wouldn't it?

Yes, I understand that it would improve the display in some cases, where authors have _not_ been conscious of the issues involved. But because it introduces a significant difference in behavior among browsers, where the behavior is currently pretty uniform (AFAIK), it also carries some risk of introducing confusion/incompatibility (as above).

> > (c) Is it appropriate to strip accents from the already-uppercase characters?

> If I understand correctly the use case you are considering, it is the
> garbage-in/garbage-out scenario that Simon mentioned in comment 21. Accented
> uppercase is wrong, so I would imagine that one would avoid any text
> transformations on it if he wanted it untouched for some reason.

AIUI, accented uppercase is not the normal practice in modern Greek, but I think the statement that "accented uppercase is wrong" is perhaps a little too sweeping. If someone creating Greek content goes to the trouble of including accents in all-uppercase text, I'd be inclined to assume this reflects a deliberate intent to have them there, and I don't think text-transform should then discard them.

Limiting the Greek-specific behavior to elements that are explicitly known to be lang="el" would help to alleviate this concern; it's easier to understand that the orthographic rules of a particular language's writing system should be applied, overriding default Unicode properties, in situations where that language has been specifically declared.

> > (d) What about the rest of the accented Greek letters (in the 1Fxx block)?

> As I said in comment 38 I'm not looking to fix ancient and other old forms
> of Greek here. It's way too complicated and for way too little benefit IMO.
> Should transformation conformance among the historical variations of a
> language trump expected behavior in the majority of web content (in that
> language) out there?

Not necessarily; but OTOH, is it acceptable for two Unicode strings that are _by definition_ canonically equivalent to be treated differently, such that the user perceives a distinction between them even though they should be completely equivalent and interchangeable?

> One thing I'd like to point out is that AFAIK we like to lead at Mozilla. We
> like to pave the way for a better future for the web, and I would consider
> this as the first step in getting other vendors to adopt the same behavior.
> For all I know they may never have had someone just sit down and compile the list
> of rules these transformations should follow.

Yes, I don't have any argument with this.

So, how to get there from here? I think the first step should be to document the expected behavior of the "uppercase" and "capitalize" transformations for Greek, _including_ what to do with the 1Fxx block - because of the presence of canonical equivalences, I don't think this can be optional - and then we can implement that behavior for content that is tagged with lang="el".

(And we can also take the opportunity to do the right thing for Turkish, Azeri, Lithuanian, etc., where language-specific casing behavior is also needed. But those will be separate followups; I'm not expecting this bug to extend to cover other languages!)
Comment 48 Lea Verou 2012-02-29 14:10:18 PST
These transforms are currently in widespread usage across the web anyway, despite their bad results. The reason is that many Greeks use premade templates which are built with the English language in mind. Not all of them are web developers who know how to change them; most are just non-techies who want to have a blog at e.g. Blogspot.

By the way, I strongly disagree that Greeks should capitalize these phrases by hand to avoid accents. In most cases, it's just stylistic, not semantic, and therefore should be done in CSS.

Honestly, Firefox would be doing the Greek web a favor by fixing this.
Comment 49 Panos Astithas [:past] 2012-03-02 07:40:31 PST
(In reply to Jonathan Kew (:jfkthame) from comment #47)
> (In reply to Panos Astithas [:past] from comment #45)
> > while others who do (like the Android Market), get broken
> > results. Even if this new behavior is never publicized enough to inspire
> > other vendors or effect change, it would at least help with the latter
> > cases, wouldn't it?
> 
> Yes, I understand that it would improve the display in some cases, where
> authors have _not_ been conscious of the issues involved. But because it
> introduces a significant difference in behavior among browsers, where the
> behavior is currently pretty uniform (AFAIK), it also carries some risk of
> introducing confusion/incompatibility (as above).

I understand your points perfectly well, but if I might add one last comment here: it would seem prudent I think, to optimize for the authors who are not conscious of these issues. I would expect that the conscious ones by definition would be better positioned to pick up the change.

> > > (c) Is it appropriate to strip accents from the already-uppercase characters?
> 
> > If I understand correctly the use case you are considering, it is the
> > garbage-in/garbage-out scenario that Simon mentioned in comment 21. Accented
> > uppercase is wrong, so I would imagine that one would avoid any text
> > transformations on it if he wanted it untouched for some reason.
> 
> AIUI, accented uppercase is not the normal practice in modern Greek, but I
> think the statement that "accented uppercase is wrong" is perhaps a little
> too sweeping. If someone creating Greek content goes to the trouble of
> including accents in all-uppercase text, I'd be inclined to assume this
> reflects a deliberate intent to have them there, and I don't think
> text-transform should then discard them.
> 
> Limiting the Greek-specific behavior to elements that are explicitly known
> to be lang="el" would help to alleviate this concern; it's easier to
> understand that the orthographic rules of a particular language's writing
> system should be applied, overriding default Unicode properties, in
> situations where that language has been specifically declared.

Pardon my ignorance, but if one were to author Ancient Greek text, shouldn't it be marked with lang="el" as well?

> > > (d) What about the rest of the accented Greek letters (in the 1Fxx block)?
> 
> > As I said in comment 38 I'm not looking to fix ancient and other old forms
> > of Greek here. It's way too complicated and for way too little benefit IMO.
> > Should transformation conformance among the historical variations of a
> > language trump expected behavior in the majority of web content (in that
> > language) out there?
> 
> Not necessarily; but OTOH, is it acceptable for two Unicode strings that are
> _by definition_ canonically equivalent to be treated differently, such that
> the user perceives a distinction between them even though they should be
> completely equivalent and interchangeable?

Yes, adding the exact equivalent letters is fine by me. I agree that for instance U+1F71 (ά) should be treated the same as U+03AC (ά), but letters like U+1F04 (ἄ) are a completely different matter.

> > One thing I'd like to point out is that AFAIK we like to lead at Mozilla. We
> > like to pave the way for a better future for the web, and I would consider
> > this as the first step in getting other vendors to adopt the same behavior.
> > For all I know they may never have had someone just sit down and compile the list
> > of rules these transformations should follow.
> 
> Yes, I don't have any argument with this.
> 
> So, how to get there from here? I think the first step should be to document
> the expected behavior of the "uppercase" and "capitalize" transformations
> for Greek, _including_ what to do with the 1Fxx block - because of the
> presence of canonical equivalences, I don't think this can be optional -
> and then we can implement that behavior for content that is tagged with
> lang="el".

The "capitalize" transformation works as is now, it's only the "uppercase" one that needs fixing. Shall I put up a draft wiki page and then you can help me flesh it out properly?

> (And we can also take the opportunity to do the right thing for Turkish,
> Azeri, Lithuanian, etc., where language-specific casing behavior is also
> needed. But those will be separate followups; I'm not expecting this bug to
> extend to cover other languages!)

Makes sense.
Comment 50 Gordon P. Hemsley [:GPHemsley] 2012-03-02 08:12:50 PST
(In reply to Panos Astithas [:past] from comment #49)
> > AIUI, accented uppercase is not the normal practice in modern Greek, but I
> > think the statement that "accented uppercase is wrong" is perhaps a little
> > too sweeping. If someone creating Greek content goes to the trouble of
> > including accents in all-uppercase text, I'd be inclined to assume this
> > reflects a deliberate intent to have them there, and I don't think
> > text-transform should then discard them.
> > 
> > Limiting the Greek-specific behavior to elements that are explicitly known
> > to be lang="el" would help to alleviate this concern; it's easier to
> > understand that the orthographic rules of a particular language's writing
> > system should be applied, overriding default Unicode properties, in
> > situations where that language has been specifically declared.
> 
> Pardon my ignorance, but if one were to author ancient greek text, shouldn't
> it be marked with lang="el" as well? 

No, Ancient Greek (to 1453) has the language subtag 'grc'. The 'el' subtag is for Modern Greek (1453 onward).

There are also language subtags for Cappadocian Greek [cpg], Mycenaean Greek [gmy], and Romano-Greek [rge], as well as Greek Sign Language [gss] and the Hellenic language family as a whole [grk].
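
A minimal sketch of how a language check along these lines might look, in Python for illustration only (Gecko's actual check is in C++ and is not shown in this bug; the function name `wants_modern_greek_casing` is hypothetical). It assumes the decision is based solely on the primary language subtag, so that 'el' and tags such as 'el-GR' opt in while 'grc' and untagged content do not:

```python
from typing import Optional

def wants_modern_greek_casing(lang: Optional[str]) -> bool:
    """Return True only when the declared language is Modern Greek ('el')."""
    if not lang:
        return False  # unknown language: keep the default Unicode behavior
    # Compare the primary subtag case-insensitively, ignoring region etc.
    primary = lang.split("-", 1)[0].lower()
    return primary == "el"

for tag in ("el", "el-GR", "grc", None):
    print(tag, wants_modern_greek_casing(tag))
```

Under this assumption, Ancient Greek content tagged 'grc' would keep its accents in uppercase, since it never enters the Modern Greek code path.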
Comment 51 Jonathan Kew (:jfkthame) 2012-03-08 05:44:46 PST
(In reply to Panos Astithas [:past] from comment #49)

> Yes, adding the exact equivalent letters is fine by me. I agree that for
> instance U+1F71 (ά) should be treated the same as U+03AC (ά), but letters
> like U+1F04 (ἄ) are a completely different matter.

What troubles me about this is that it means the result of applying "uppercase" to fully-accented (classical/koiné/etc) Greek will be to strip *some* of the accents and leave others untouched. I'd like to hear opinions - particularly from people involved in that area - on what's the most appropriate way to deal with this.

> > So, how to get there from here? I think the first step should be to document
> > the expected behavior of the "uppercase" and "capitalize" transformations
> > for Greek, _including_ what to do with the 1Fxx block - because of the
> > presence of canonical equivalences, I don't think this can be optional -
> > and then we can implement that behavior for content that is tagged with
> > lang="el".
> 
> The "capitalize" transformation works as is now, it's only the "uppercase"
> one that needs fixing. Shall I put up a draft wiki page and then you can
> help me flesh it out properly?

That sounds reasonable, thanks.
Comment 52 Jonathan Kew (:jfkthame) 2012-03-15 06:52:12 PDT
BTW, I've just posted a patch in bug 231162 that provides language-sensitive casing for Turkish (et al), showing how you can check the language of the current element.

For Greek, given the added complexity of the rules you're implementing, I still think we should refactor the implementation somewhat, as mentioned in comment #41. Then it might make sense to use the same (table-driven) custom mapping mechanism for the Turkish case as well as Greek, even though it only involves a couple of special cases.
Comment 53 Panos Astithas [:past] 2012-03-15 07:58:04 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #52)
> BTW, I've just posted a patch in bug 231162 that provides language-sensitive
> casing for Turkish (et al), showing how you can check the language of the
> current element.
> 
> For Greek, given the added complexity of the rules you're implementing, I
> still think we should refactor the implementation somewhat, as mentioned in
> comment #41. Then it might make sense to use the same (table-driven) custom
> mapping mechanism for the Turkish case as well as Greek, even though it only
> involves a couple of special cases.

Thanks, I'll take a look as soon as I can. I've been meaning to get back to this patch, but I've been swamped with debugger work.
Comment 54 Jonathan Kew (:jfkthame) 2012-04-24 09:54:14 PDT
Created attachment 617912 [details] [diff] [review]
patch, implement Greek-specific uppercasing for text-transform & small-caps

As I've been making quite a few changes to nsTextRunTransformations recently, I decided to have a go at restructuring this to work with the current code there. I believe the attached patch implements the same desired behavior, but in a somewhat cleaner way; it also restricts this to content that is explicitly tagged with lang="el", as per the current CSS3 Text spec[1].

This patch applies on top of the patches in bug 744357. Tryserver build will be at https://tbpl.mozilla.org/?tree=Try&rev=2675647c68c1.

I would still like to see some kind of documentation of the desired case-mapping behavior, as a reference to check our implementation against, as a guide for authors to know what to expect, and as a proposed standard behavior that others could also be encouraged to implement.

[1] http://dev.w3.org/csswg/css3-text/#text-transform
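
As a rough illustration of the lang="el" restriction described above, here is a minimal Python sketch. It is not the code from the attached patch: the function names are hypothetical, and it handles only the simplest rule (drop the tonos, keep the dialytika). It does not attempt the diphthong-related diaeresis rules or the polytonic 1Fxx accents.

```python
import unicodedata

TONOS = "\u0301"  # combining acute accent, which is what the tonos becomes after NFD


def greek_upper_simple(text):
    """Uppercase Greek text, dropping the tonos but keeping the dialytika.

    Simplified assumption: only U+0301 is stripped; other accents are untouched.
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_tonos = "".join(ch for ch in decomposed if ch != TONOS)
    return unicodedata.normalize("NFC", without_tonos.upper())


def upper_for_lang(text, lang):
    """Apply the Greek-specific mapping only when the content is tagged lang="el"."""
    if lang == "el":
        return greek_upper_simple(text)
    return text.upper()  # default, language-neutral uppercasing
```

Working on the decomposed form means precomposed and decomposed input go through the same path, which is the situation the reftests for composed and decomposed forms are meant to cover.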
Comment 55 Jonathan Kew (:jfkthame) 2012-04-24 09:55:41 PDT
Created attachment 617913 [details] [diff] [review]
reftest for Greek uppercasing in composed and decomposed forms
Comment 56 Panos Astithas [:past] 2012-04-26 20:20:21 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #54)
> Created attachment 617912 [details] [diff] [review]
> patch, implement Greek-specific uppercasing for text-transform & small-caps
> 
> As I've been making quite a few changes to nsTextRunTransformations
> recently, I decided to have a go at restructuring this to work with the
> current code there. I believe the attached patch implements the same desired
> behavior, but in a somewhat cleaner way; it also restricts this to content
> that is explicitly tagged with lang="el", as per current the CSS3 Text
> spec[1].
> 
> This patch applies on top of the patches in bug 744357. Tryserver build will
> be at https://tbpl.mozilla.org/?tree=Try&rev=2675647c68c1.
> 
> I would still like to see some kind of documentation of the desired
> case-mapping behavior, as a reference to check our implementation against,
> as a guide for authors to know what to expect, and as a proposed standard
> behavior that others could also be encouraged to implement.
> 
> [1] http://dev.w3.org/csswg/css3-text/#text-transform

Thank you for doing this, I really appreciate it. I expect to have the time to look at it in detail when I'm back from our work week, but from testing the patches against http://htmlpad.org/greek-css/ I can see that you missed one case in both text-transform: uppercase and font-variant: small-caps (the case in my patch with the comment "diaeresis is not needed if preceded by a diphthong"), and the accented capital letters case in small-caps (their accents are not stripped).
Comment 57 Jonathan Kew (:jfkthame) 2012-04-27 01:12:38 PDT
Thanks for checking - I'll try to look into that and update accordingly.
Comment 58 Jonathan Kew (:jfkthame) 2012-04-28 01:38:24 PDT
Created attachment 619262 [details] [diff] [review]
patch v2, implement Greek-specific uppercasing for text-transform & small-caps

This fixes the cases mentioned in comment 56.

(Note that to make capitals work as desired with font-variant:small-caps, it's not sufficient to just add them to the lowercase run, as that would result in them not just having accents stripped by the uppercase mapping, but *also* being rendered with the reduced-size font, which is incorrect. So they have to be treated as a new kind of child run - transformed, but not scaled.)
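
To make the three kinds of child run concrete, here is a small Python sketch. It is purely illustrative: the run names and helper functions are hypothetical and do not correspond to identifiers in nsTextRunTransformations, and it only looks for the tonos on precomposed capitals.

```python
import unicodedata
from itertools import groupby

TONOS = "\u0301"  # combining acute accent


def has_tonos(ch):
    """True if the character's canonical decomposition contains a tonos."""
    return TONOS in unicodedata.normalize("NFD", ch)


def run_kind(ch):
    if ch.islower():
        return "scaled"       # uppercased AND rendered with the reduced-size font
    if ch.isupper() and has_tonos(ch):
        return "transformed"  # accent stripped by the uppercase mapping, full size
    return "plain"            # left exactly as it is


def small_caps_runs(text):
    """Split text into consecutive (kind, substring) runs."""
    text = unicodedata.normalize("NFC", text)  # assumption: classify precomposed letters
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=run_kind)]
```

An accented capital at the start of a word ends up in its own "transformed" run, so it can lose its accent without also being shrunk along with the lowercase letters that follow it.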
Comment 59 Jonathan Kew (:jfkthame) 2012-04-28 01:39:42 PDT
Created attachment 619263 [details] [diff] [review]
reftest for Greek uppercasing in composed and decomposed forms

Corrected reftest to account for the after-diphthong case.
Comment 60 Jonathan Kew (:jfkthame) 2012-04-28 01:42:41 PDT
Comment on attachment 588764 [details] [diff] [review]
Working patch v5

Clearing r? on this version, as it's out-of-date w.r.t. current code, but it still serves as a source for the desired behavior that the newer patches are trying to implement, so not obsoleting for now.
Comment 61 Jonathan Kew (:jfkthame) 2012-04-28 03:15:12 PDT
Created attachment 619272 [details] [diff] [review]
reftest for Greek small-caps behavior

Also adding a reftest for the font-variant:small-caps case.
Comment 62 Panos Astithas [:past] 2012-05-02 02:11:04 PDT
Comment on attachment 619263 [details] [diff] [review]
reftest for Greek uppercasing in composed and decomposed forms

Review of attachment 619263 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/reftests/text-transform/greek-uppercase-1-ref.html
@@ +12,5 @@
> +</style>
> +</head>
> +<body lang="en">
> +<div>ΠΑΤΆΤΑ, ΑΈΡΑΣ, ΜΆΙΟΣ, ΆΥΛΟΣ, ΑΫΠΝΊΑ, ΜΑΐΟΥ, ΧΟΎΙ</div>
> +<div lang="el">ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΑΪΟΣ, ΑΫΛΟΣ, ΑΫΠΝΊΑ, ΜΑΪΟΥ, ΧΟΥΙ</div>

You missed the accent in αϋπνία here.
Comment 63 Panos Astithas [:past] 2012-05-02 02:54:07 PDT
Comment on attachment 619262 [details] [diff] [review]
patch v2, implement Greek-specific uppercasing for text-transform & small-caps

Review of attachment 619262 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!
Comment 64 Jonathan Kew (:jfkthame) 2012-05-02 04:29:52 PDT
Created attachment 620253 [details] [diff] [review]
reftest for Greek uppercasing in composed and decomposed forms v2

Fixed greek-uppercase-1-ref, thanks for catching that. (The error was masked when I tested because text-transform:uppercase was also applied; I've also removed that from the reference.)

Also updated the reference for the non-lang="el" section, as ΐ is now handled properly since bug 744357 landed.

Carrying forward r=past for the fixed-up tests.
Comment 65 Alfredos-Panagiotis Damkalis [:fredy] 2012-05-02 05:29:37 PDT
The link (and the site) in comment 15 seems to be offline.

For reference, you can find the official grammar book mentioned there at http://digitalschool.minedu.gov.gr/books/ebooks/Gymnasio/A_Gymnasiou.zip (~300MB; unfortunately the archive contains many other books) under the directory A_Gymnasiou/Neoellhnikh_Glwssa with filename Grammatiki_Neas_Ellinikis_Glossas.pdf. The book is in Greek.

About the disjunctive eta, I have found a reference in an online book for students which says that in this case the eta always carries an accent, whether it is capital or not: http://savalas.gr/pr2/21237.pdf (also in Greek), page 25.

Finally, there is an official reference that sets out the rules of the accent system, but it doesn't mention capital letters; maybe a later official document clarifies more on that issue.

The rules of the new (back then) system of accents are on the third page of http://www.et.gr/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wED8PRhve6aLndtvSoClrL8rmXZYcBY1t7tIl9LGdkF53UIxsx942CdyqxSQYNuqAGCF0IfB9HI6qSYtMQEkEHLwnFqmgJSA5UkHEKavWyL4FoKqSe4BlOTSpEWYhszF8P8UqWb_zFijDZ_b9Gi3rWkQbjRkQn9Xk-2aW0XwMA-RGzcO0jhv2-e.

One more thing on the *offtopic* question of what happens with the ancient Greek accents. There is an official grammar book at http://digitalschool.minedu.gov.gr/books/ebooks/Lykeio/A_Lykeiou.zip (~650MB; unfortunately the archive contains many other books) under the directory A_Lykeiou/Arxaia_Ellhnikh_Glwssa_kai_Grammateia with filename Grammatiki_Arxaia-Gymnasioy_Lykeiou.pdf, where page 23 discusses ancient capital letters. It says, in Greek, that no accents of any kind appear when all the words are written in capital letters. (Here too there is no mention of exceptions such as the disjunctive eta.)
Comment 66 Jonathan Kew (:jfkthame) 2012-05-03 01:04:14 PDT
Pushed the patch and testcases here to inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0de4cbfe2217
https://hg.mozilla.org/integration/mozilla-inbound/rev/5faf400155a4
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6a335cd2c94

It's not really clear to me whether we should be doing something special to handle the disjunctive eta (e.g. don't strip accent from eta when it stands alone?) If so, let's have a followup bug to deal with that, please; I think that'll make it easier to keep track of things.
Comment 67 Panos Astithas [:past] 2012-05-03 02:46:57 PDT
(In reply to Alfredos-Panagiotis Damkalis from comment #65)
> About the disjunctive eta, I have found a reference at an online book for
> students which says that at this case the eta is always with accent even if
> it is capital or or not. http://savalas.gr/pr2/21237.pdf (in greek too) at
> page 25.

This is the first reference to that rule in a book that I've seen, but it's not an official reference by any means.

> So At the third page of
> http://www.et.gr/idocs-nph/search/pdfViewerForm.
> html?args=5C7QrtC22wED8PRhve6aLndtvSoClrL8rmXZYcBY1t7tIl9LGdkF53UIxsx942Cdyqx
> SQYNuqAGCF0IfB9HI6qSYtMQEkEHLwnFqmgJSA5UkHEKavWyL4FoKqSe4BlOTSpEWYhszF8P8UqWb
> _zFijDZ_b9Gi3rWkQbjRkQn9Xk-2aW0XwMA-RGzcO0jhv2-e are the rules of the new
> (back then) system of accents.

Well, this is as official as it gets, but these rules are only for lowercase Greek. However, one could infer that the absence of any exceptions for accented uppercase is indeed the official rule.

(In reply to Jonathan Kew (:jfkthame) from comment #66)
> It's not really clear to me whether we should be doing something special to
> handle the disjunctive eta (e.g. don't strip accent from eta when it stands
> alone?) If so, let's have a followup bug to deal with that, please; I think
> that'll make it easier to keep track of things.

My opinion is still the same as expressed in comment 17. Some people prefer to accent disjunctive eta carrying over the exception from lowercase (non-disjunctive lowercase eta standing alone is not accented). This is not standard.

If you wanted to special-case disjunctive eta, you shouldn't limit the accented form based on whether it stands alone, nor whether it's at the beginning of a sentence or not. You should be doing complicated grammar checks to make sure a standalone eta is disjunctive.

If someone feels strongly about that, by all means file a followup.
Comment 68 George Fiotakis 2012-05-03 15:01:24 PDT
(In reply to Panos Astithas [:past] from comment #67)

> 
> If you wanted to special-case disjunctive eta, you shouldn't limit the
> accented form based on whether it stands alone, nor whether it's at the
> beginning of a sentence or not. You should be doing complicated grammar
> checks to make sure a standalone eta is disjunctive.
 
With all respect, I don't see why we would need a complicated grammar check for that.
In modern Greek, only the disjunctive eta is accented; any other standalone eta is an article and shouldn't be accented. In other forms of the language, the disjunctive eta is "ἦ" (ancient Greek) or "ἤ" (modern polytonic), while the article is "ἡ".
Comment 69 Panos Astithas [:past] 2012-05-04 00:10:30 PDT
(In reply to George Fiotakis from comment #68)
> (In reply to Panos Astithas [:past] from comment #67)
> 
> > 
> > If you wanted to special-case disjunctive eta, you shouldn't limit the
> > accented form based on whether it stands alone, nor whether it's at the
> > beginning of a sentence or not. You should be doing complicated grammar
> > checks to make sure a standalone eta is disjunctive.
>  
> With all respect, I don't see why would we need a complicated grammar check
> for that.
> In modern Greek, only the disjunctive eta is accented, otherwise, if it is a
> standalone eta, it is an article and it shouldn't be accented. In other
> forms of the language, the disjunctive eta is "ἦ"(ancient greek) or in
> modern polytonic "ἤ" while the article is "ἡ"

And how would the code test whether the eta is disjunctive or not?
Comment 70 Alfredos-Panagiotis Damkalis [:fredy] 2012-05-04 01:30:34 PDT
(In reply to Panos Astithas [:past] from comment #69)
> (In reply to George Fiotakis from comment #68)
> > (In reply to Panos Astithas [:past] from comment #67)
> > 
> > > 
> > > If you wanted to special-case disjunctive eta, you shouldn't limit the
> > > accented form based on whether it stands alone, nor whether it's at the
> > > beginning of a sentence or not. You should be doing complicated grammar
> > > checks to make sure a standalone eta is disjunctive.
> >  
> > With all respect, I don't see why would we need a complicated grammar check
> > for that.
> > In modern Greek, only the disjunctive eta is accented, otherwise, if it is a
> > standalone eta, it is an article and it shouldn't be accented. In other
> > forms of the language, the disjunctive eta is "ἦ"(ancient greek) or in
> > modern polytonic "ἤ" while the article is "ἡ"
> 
> And how would the code test whether the eta is disjunctive or not?

The only way I can think of for the code to test whether the eta is disjunctive is to check that it has a space before it and a space after it and, of course, that it carries an accent.

Unfortunately, I have no idea whether a check like this is easy or hard at the code level.
Comment 72 Lea Verou 2012-05-04 03:55:24 PDT
(In reply to Panos Astithas [:past] from comment #69)
> And how would the code test whether the eta is disjunctive or not?

Accented eta surrounded by word boundaries? (\b) With a regex it's trivial, but not sure how that relates to the Gecko source and what the perf implications are.
Comment 73 Panos Astithas [:past] 2012-05-04 04:04:48 PDT
(In reply to Lea Verou from comment #72)
> (In reply to Panos Astithas [:past] from comment #69)
> > And how would the code test whether the eta is disjunctive or not?
> 
> Accented eta surrounded by word boundaries? (\b) With a regex it's trivial,
> but not sure how that relates to the Gecko source and what the perf
> implications are.

Unfortunately this doesn't cover cases like:

Ή θα πολεμήσουμε, ή θα παραδοθούμε!
Comment 74 Lea Verou 2012-05-04 04:16:40 PDT
(In reply to Panos Astithas [:past] from comment #73)
> Unfortunately this doesn't cover cases like:
> 
> Ή θα πολεμήσουμε, ή θα παραδοθούμε!

I don't see why. Word boundaries match at the beginning of the string too. I'd provide a JS example to demonstrate, but JS regexps are not Unicode-aware. No idea about C++ regexps, but I'd imagine they are more advanced.
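
For what it's worth, the word-boundary idea can be tried in a language whose regexes do treat Greek letters as word characters. The following Python sketch is only a demonstration of the matching, with a hypothetical helper name; it says nothing about how this would be done in Gecko, and it cannot tell whether a standalone accented eta really is disjunctive.

```python
import re
import unicodedata

# Accented eta (lowercase U+03AE or capital U+0389) with a word boundary on both sides.
# In Python 3, \b is Unicode-aware for str patterns, so Greek letters count as word characters.
STANDALONE_ACCENTED_ETA = re.compile("\\b[\u03ae\u0389]\\b")


def find_standalone_accented_eta(text):
    """Return the offsets (in the NFC-normalized text) of standalone accented etas."""
    text = unicodedata.normalize("NFC", text)  # so decomposed input matches as well
    return [m.start() for m in STANDALONE_ACCENTED_ETA.finditer(text)]
```

On the sentence from comment 73 this finds both the capital eta at the very start of the string and the lowercase one after the comma, while the accented eta inside πολεμήσουμε is not matched because it has letters on both sides.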
Comment 75 Jonathan Kew (:jfkthame) 2012-05-04 04:28:17 PDT
I believe we could do this; it'd be a similar issue to the contextual lowercasing of Sigma that was implemented for bug 740120 (where Σ is lowercased to ς if it occurs at the end of a word, but to σ in isolation or within a word).
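
As a sketch of the kind of context check involved, here is a simplified Python version of the Sigma rule just described. It is an illustration only, not the code from bug 740120: the function name is hypothetical, and "within a word" is approximated by looking at the immediately adjacent characters, which is cruder than the Unicode Final_Sigma condition.

```python
CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"


def lower_with_final_sigma(text):
    """Lowercase text, choosing the final-sigma form only at the end of a word."""
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            prev_is_letter = i > 0 and text[i - 1].isalpha()
            next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
            # End of a word: a letter before, and no letter after.
            if prev_is_letter and not next_is_letter:
                out.append(FINAL_SIGMA)
            else:
                out.append(SMALL_SIGMA)  # isolated, word-initial, or word-internal
        else:
            out.append(ch.lower())
    return "".join(out)
```

A standalone-eta check would need the same two pieces of context, namely what precedes and what follows the character being mapped.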

However, if people think this is a refinement we should pursue, please file a new bug for it so we can track it as a separate issue rather than prolonging discussion in this bug.
