Bug 231162 - text-transform is not using language dependent casing rules
Status: RESOLVED FIXED
Keywords: dev-doc-complete
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All All
Importance: -- normal with 4 votes
Target Milestone: mozilla14
Assigned To: Jonathan Kew (:jfkthame)
: Makoto Kato [:m_kato]
Mentors:
Duplicates: 482095 552711 570333
Depends on: 905381
Blocks: css2.1-tests 772268 740477
Reported: 2004-01-16 11:59 PST by Ernest Cline
Modified: 2015-11-18 05:44 PST
CC List: 16 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
HTML Testcase (1.74 KB, text/html)
2004-01-16 12:00 PST, Ernest Cline
no flags
Remove unused code (5.35 KB, patch)
2004-01-19 17:17 PST, Simon Montagu :smontagu
jshin1987: review+
dbaron: superreview+
fix dependency problem (1.02 KB, patch)
2004-01-21 01:46 PST, Brian Ryner (not reading)
cls: review-
handle static library dependencies (971 bytes, patch)
2004-01-22 19:30 PST, Brian Ryner (not reading)
cls: review+
patch, add Turkish support for text-transform and small-caps (5.00 KB, patch)
2012-03-15 06:47 PDT, Jonathan Kew (:jfkthame)
smontagu: review+
reftests for Turkish casing behavior (4.19 KB, patch)
2012-03-15 07:49 PDT, Jonathan Kew (:jfkthame)
smontagu: review+

Description Ernest Cline 2004-01-16 11:59:22 PST
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

CSS calls for using language specific rules and has even since CSS 1.

I decided to write a test case involving the dotted and dotless i's.
In both Turkish and Azerbaijani, the normal casing rules are not followed
for the letters I (U+0049) and i (U+0069).  In those two languages, the
lowercase of U+0049 is ı (U+0131) and the uppercase of U+0069 is İ (U+0130).
Even after I downloaded the Turkish locale (tested with 1.5) and switched
to it, text marked as either language still used the default casing with
both text-transform: uppercase and text-transform: lowercase.


Reproducible: Always

Steps to Reproduce:
1. Mark up a section of HTML as either Turkish (lang="tr") or Azerbaijani (lang="az")
2. Apply a CSS text-transform to it.


Actual Results:  
text-transform:lowercase of U+0049 was displayed as U+0069.
text-transform:uppercase of U+0069 was displayed as U+0049.

Expected Results:  
text-transform:lowercase of U+0049 should have been displayed as U+0131.
text-transform:uppercase of U+0069 should have been displayed as U+0130.

This may affect other language-dependent casing as well, but I have only tested
the dotted/dotless I's of Turkish/Azerbaijani. I will also attach one test file I
used. A second test file, which placed the language info on the same element
that had the text-transform applied to it, will not be uploaded as it produced
the same non-result.
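
The actual and expected results above can be checked with a short script. This is only an illustrative sketch in Python, not Mozilla code: Python's built-in `str.lower()` and `str.upper()` are language-insensitive, so they reproduce the "Actual Results", while `EXPECTED_TURKIC` is a hypothetical table that simply restates the "Expected Results" for lang="tr" and lang="az".

```python
# Language-insensitive casing, as done by Python's built-in str methods.
# This mirrors the "Actual Results" reported above.
actual_lower = "\u0049".lower()   # LATIN CAPITAL LETTER I
actual_upper = "\u0069".upper()   # LATIN SMALL LETTER I

# Hypothetical table restating the "Expected Results" for Turkish/Azerbaijani.
EXPECTED_TURKIC = {
    ("lowercase", "\u0049"): "\u0131",  # I -> dotless small i
    ("uppercase", "\u0069"): "\u0130",  # i -> capital I with dot above
}


def code_point(ch):
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


print("actual   lowercase(U+0049):", code_point(actual_lower))
print("expected lowercase(U+0049):", code_point(EXPECTED_TURKIC[("lowercase", "\u0049")]))
print("actual   uppercase(U+0069):", code_point(actual_upper))
print("expected uppercase(U+0069):", code_point(EXPECTED_TURKIC[("uppercase", "\u0069")]))
```

In both cases the language-insensitive result differs from the Turkish/Azerbaijani expectation, which is the mismatch this bug describes.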
Comment 1 Ernest Cline 2004-01-16 12:00:34 PST
Created attachment 139214 [details]
HTML Testcase
Comment 2 Simon Montagu :smontagu 2004-01-16 15:43:01 PST
I assume that this is a regression, and it's odd because we have code to handle
it in intl/unicharutil/src/nsCaseConversionImp2.cpp (at least for lang="tr": we
also need to add lang="az"). I guess that code is somehow not being reached.
Comment 3 Simon Montagu :smontagu 2004-01-16 15:56:44 PST
As far as I can tell ToUpper() with the locale argument is just never being called.
Comment 4 Simon Montagu :smontagu 2004-01-16 23:06:56 PST
>CSS calls for using language specific rules and has even since CSS 1.

For the record, this is not strictly accurate: CSS3 is the first version to
require conformance to language specific rules.

CSS1: http://www.w3.org/TR/CSS1.html#text-transform
CSS1 core: UAs may ignore 'text-transform' (i.e., treat it as 'none') for
characters that are not from the Latin-1 repertoire and for elements in
languages for which the transformation is different from that specified by the
case-conversion tables of Unicode.

CSS2: http://www.w3.org/TR/REC-CSS2/text.html#caps-prop, unchanged in CSS2.1
http://www.w3.org/TR/CSS21/text.html#caps-prop
Conforming user agents may consider the value of 'text-transform' to be 'none'
for characters that are not from the Latin-1 repertoire and for elements in
languages for which the transformation is different from that specified by the
case-conversion tables of ISO 10646

CSS3: http://www.w3.org/TR/css3-text/#caps-prop
Conforming user agents MUST support case mapping rules according to the Unicode
Standard for all characters specified by that standard.
Comment 5 Ernest Cline 2004-01-17 04:34:31 PST
I could buy that, but in that case shouldn't Mozilla be leaving the dotted I and
dotless i alone since they aren’t in Latin-1?  Or are the two clauses supposed
to be independent?  I read it as saying if a UA doesn’t implement language
sensitive case mapping rules, it must case map Latin-1 only.  If the two were
supposed to be independent, I would expect an "or" instead of the "and" in the
quotes you gave from CSS 1 and CSS 2.

Of course, all this is a battle of semantics for no purpose, since the
dotted/dotless I case mapping for Turkish and Azerbaijani is mentioned in the
Unicode Standard and so CSS 3 clearly requires this be supported.
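
Whichever reading of the CSS 1 and CSS 2 wording is right, the Latin-1 half of the argument is easy to check. The sketch below assumes "the Latin-1 repertoire" means code points U+0000 through U+00FF; `in_latin1` is a hypothetical helper name, not something taken from the specifications.

```python
def in_latin1(ch):
    """True if the code point lies in U+0000..U+00FF (assumed Latin-1 range)."""
    return ord(ch) <= 0xFF


# I, i, capital I with dot above, dotless small i
for ch in "Ii\u0130\u0131":
    print(f"U+{ord(ch):04X} in Latin-1: {in_latin1(ch)}")
```

Under that assumption, U+0049 and U+0069 are inside Latin-1 while U+0130 and U+0131 are not.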
Comment 6 Simon Montagu :smontagu 2004-01-19 17:17:24 PST
Created attachment 139462 [details] [diff] [review]
Remove unused code

Let's begin by removing the existing code, because:
 (a) it isn't used
 (b) it wouldn't work correctly if it were used, due to errors such as:
	if(kDot_I == *s)
	     *s = kDot_I;
     so it isn't even useful as the basis for a working implementation.
 (c) having it in the tree is misleading and creates a superficial impression
     that we support Turkic casing.
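
To see why the fragment quoted in (b) cannot work, here is a rough Python transliteration. The name `K_DOT_I` follows the quoted `kDot_I`; its value (U+0130) and the replacement used in `lower_presumably_intended` are assumptions, since the thread does not show the rest of the removed file.

```python
K_DOT_I = "\u0130"    # assumed value of kDot_I: capital I with dot above
K_SMALL_I = "\u0069"  # hypothetical replacement target: small i


def lower_as_quoted(chars):
    """Transliteration of the quoted code: the assignment changes nothing."""
    for idx, ch in enumerate(chars):
        if K_DOT_I == ch:
            chars[idx] = K_DOT_I  # writes back the value that is already there
    return chars


def lower_presumably_intended(chars):
    """One guess at the intent: map the dotted capital to a different character."""
    for idx, ch in enumerate(chars):
        if K_DOT_I == ch:
            chars[idx] = K_SMALL_I
    return chars


word = "\u0130ZM\u0130R"
print("".join(lower_as_quoted(list(word))) == word)
print("".join(lower_presumably_intended(list(word))))
```

The first function returns its input unchanged for every possible string, which is the kind of error comment 6 points at.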
Comment 7 Simon Montagu :smontagu 2004-01-19 17:27:43 PST
Comment on attachment 139462 [details] [diff] [review]
Remove unused code

Requesting reviews for removing the unused and inaccurate version.
Comment 8 Jungshik Shin 2004-01-19 20:55:07 PST
Comment on attachment 139462 [details] [diff] [review]
Remove unused code

r=jshin

just a reminder (you may not need it, but just in case), we have bug 210501 in
which we have to overhaul the case conversion APIs anyway.
Comment 9 Simon Montagu :smontagu 2004-01-21 01:05:20 PST
Comment on attachment 139462 [details] [diff] [review]
Remove unused code

Checking this in made all non-clobber tinderboxen go orange, so I backed it out
again.
Comment 10 Brian Ryner (not reading) 2004-01-21 01:46:02 PST
Created attachment 139555 [details] [diff] [review]
fix dependency problem

We've had a longstanding dependency problem with the unicharutil_s static
library: things that use this library were not being relinked when it
changed.

Rather than go add EXTRA_DEPS in a couple dozen Makefiles, I opted to just
handle the dependency in rules.mk.

I think this is the cause of the dep tinderbox bustage.
Comment 11 cls 2004-01-22 17:10:18 PST
Comment on attachment 139555 [details] [diff] [review]
fix dependency problem

Grumblesmurf.  We didn't have the dependency problem when MOZ_UNICHARUTIL_LIBS
were used as part of SHARED_LIBRARY_LIBS.  I think the special casing is a bad
idea in the long run.  There should be a generic mechanism to track
dependencies from EXTRA_DSO_LDOPTS.  Something like:

DSO_LDOPTS_DEPS = $(filter %.$(LIB_SUFFIX) %$(DLL_SUFFIX), $(EXTRA_DSO_LDOPTS))

At some point, the VPATH issue would need to be fixed so that the -lfoo
dependencies could be tracked as well.
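
For readers less familiar with make, the suggested `$(filter ...)` expression keeps only those words of `EXTRA_DSO_LDOPTS` that end in a library or DLL suffix, so full library paths become dependencies while `-lfoo` and `-L` options are left out. The Python approximation below is only a sketch: the default suffix values (`a` and `.so`) and the example paths are assumptions chosen for illustration.

```python
def dso_ldopts_deps(extra_dso_ldopts, lib_suffix="a", dll_suffix=".so"):
    """Approximate $(filter %.$(LIB_SUFFIX) %$(DLL_SUFFIX), $(EXTRA_DSO_LDOPTS)).

    Keeps whitespace-separated words ending in ".<lib_suffix>" or "<dll_suffix>".
    """
    suffixes = ("." + lib_suffix, dll_suffix)
    return [word for word in extra_dso_ldopts.split() if word.endswith(suffixes)]


# Invented example value; real Makefiles will differ.
opts = "-L../../dist/lib ../../dist/lib/libunicharutil_s.a -lxpcom ../../dist/bin/libgkgfx.so"
print(dso_ldopts_deps(opts))
```

The `-lxpcom` word is dropped, which corresponds to the "-lfoo dependencies" that would still need the VPATH issue to be solved.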
Comment 12 Brian Ryner (not reading) 2004-01-22 19:21:12 PST
Right, but we don't want to use --whole-archive for this static library.

I'd be ok with adding a mechanism like you suggest, and it would fix this case
even if we don't address the -lfoo case right now.  Rather than trying to work
out a library search path from the linker's -L switches (ugh), I think I'd
rather just say that the full library name should be used when linking against
static libraries.  For shared libraries, chances are that nothing will break if
we don't relink: if a symbol was removed from the shared library, we should be
relinking anyway to remove references to it from the current module, and if new
symbols were added that the current module doesn't use, it won't affect
anything.
Comment 13 Brian Ryner (not reading) 2004-01-22 19:30:17 PST
Created attachment 139709 [details] [diff] [review]
handle static library dependencies
Comment 14 Brian Ryner (not reading) 2004-01-23 01:02:25 PST
Comment on attachment 139709 [details] [diff] [review]
handle static library dependencies

This is checked in. It should now be possible to land the original patch in
this bug without breaking the depend tinderboxes.
Comment 15 Simon Montagu :smontagu 2009-03-08 22:26:00 PDT
*** Bug 482095 has been marked as a duplicate of this bug. ***
Comment 16 zug_treno 2010-03-19 06:13:44 PDT
*** Bug 552711 has been marked as a duplicate of this bug. ***
Comment 17 Avram Lyon 2010-03-19 07:52:59 PDT
Before this lands, please note that there are other affected languages. The modern
Latin scripts for Crimean Tatar (crh), Volga Tatar (tt), and Bashkir (ba) all
use the Turkish/Azerbaijani-style i/İ and ı/I pairings. All three have both
Cyrillic and Latin scripts, and only the latter is affected, so perhaps this
would require the use of script variants (e.g., tt-Latn), but Azerbaijani also
has a well-represented Cyrillic script. I would like to see the new
text-transform behavior apply to tt, crh, and ba as well.
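
The script-variant question raised here can be made concrete with a small language-tag check. This is a hypothetical policy sketch, not what Gecko does: it assumes that `tr` and `az` default to Latin script when no script subtag is given, that `crh`, `tt`, and `ba` need an explicit `Latn` subtag, and that any four-letter alphabetic subtag is a script subtag (a simplification of BCP 47).

```python
ASSUME_LATIN = {"tr", "az"}          # assumption: Latin unless tagged otherwise
REQUIRE_LATN = {"crh", "tt", "ba"}   # assumption: need an explicit Latn subtag


def uses_turkic_i_casing(lang_tag):
    """Decide whether the dotted/dotless i casing pairs apply to a language tag."""
    subtags = lang_tag.lower().split("-")
    primary = subtags[0]
    # Simplification: treat the first 4-letter alphabetic subtag as the script.
    script = next((s for s in subtags[1:] if len(s) == 4 and s.isalpha()), None)
    if primary in ASSUME_LATIN:
        return script in (None, "latn")
    if primary in REQUIRE_LATN:
        return script == "latn"
    return False


for tag in ["tr", "tr-TR", "az-Cyrl", "az-Latn-AZ", "tt", "tt-Latn", "en"]:
    print(tag, uses_turkic_i_casing(tag))
```

Matching on the primary subtag rather than the whole tag means region-qualified tags such as `tr-TR` are treated the same as `tr`.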
Comment 18 David Baron :dbaron: ⌚️UTC-8 2010-06-05 12:59:08 PDT
*** Bug 570333 has been marked as a duplicate of this bug. ***
Comment 19 Riz 2010-06-05 13:19:09 PDT
I am the submitter of the duplicate bug 570333. I am sorry to have missed this bug when I searched the database.

I would like to add my comments if I may.

1- The Turkish/Turkic i/I case transform problem is NOT limited to the text-transform CSS property; it also affects font-variant:small-caps.

2- I can see in this thread that you guys have known about this bug since the beginning of 2004. That is SIX years! What is stopping it from being fixed, so that 100+ million people can use their language as fully as other Latin alphabet users?

There is a stark reality here: this bug stops Turks from formatting their own country's name, Turkiye, properly!

Shouldn't this be a priority?
Comment 21 Tantek Çelik 2012-01-23 15:08:02 PST
adding cc. Not that I'm personally biased/affected or anything. ;)
Comment 22 Jonathan Kew (:jfkthame) 2012-03-15 06:47:46 PDT
Created attachment 606188 [details] [diff] [review]
patch, add Turkish support for text-transform and small-caps

This adds support for the Turkish-style İ/i and I/ı casing behavior in text-run transformations. It's handled within the transforming text run, rather than by adding locale support to the low-level Unicode case mapping functions; I think it makes more sense at this level given the limited scope of the changes needed.

I'm expecting further changes and refactoring of nsCaseTransformTextRunFactory::RebuildTextRun in bug 307039 (for Greek support), but this can at least serve as a starting point for adding language-sensitive behavior.
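
The layering described in this comment can be sketched as follows. This is not the patch itself: `transform_run` and `generic_case` are hypothetical names, Python's `str.upper()`/`str.lower()` stand in for the low-level language-neutral mapping, and only the four single code points I, i, İ, and ı get special treatment for `tr` and `az`.

```python
TURKIC_LANGS = {"tr", "az"}  # assumption: languages that get the special pairs
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}


def generic_case(ch, transform):
    """Stand-in for the low-level case mapping, which knows nothing about language."""
    return ch.upper() if transform == "uppercase" else ch.lower()


def transform_run(text, transform, lang=""):
    """Apply text-transform to a run, handling Turkic i/I before the generic mapping."""
    if transform not in ("uppercase", "lowercase"):
        raise ValueError("this sketch only handles uppercase and lowercase")
    special = {}
    if lang.lower().split("-")[0] in TURKIC_LANGS:
        special = TURKIC_UPPER if transform == "uppercase" else TURKIC_LOWER
    return "".join(
        special[ch] if ch in special else generic_case(ch, transform)
        for ch in text
    )


print(transform_run("istanbul", "uppercase", "tr"))
print(transform_run("istanbul", "uppercase", "en"))
print(transform_run("ISPARTA", "lowercase", "tr"))
```

Keeping the special cases in the run-level function leaves `generic_case` untouched, which is the design trade-off the comment argues for given how few characters are affected.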
Comment 23 Jonathan Kew (:jfkthame) 2012-03-15 07:49:29 PDT
Created attachment 606212 [details] [diff] [review]
reftests for Turkish casing behavior
Comment 24 Simon Montagu :smontagu 2012-03-15 18:11:43 PDT
Comment on attachment 606188 [details] [diff] [review]
patch, add Turkish support for text-transform and small-caps

Review of attachment 606188 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't work when the language is specified with xml:lang (which means that the tests in http://www.w3.org/International/tests/html-css/list-text-transform#special fail).

I also don't understand how it handles I with U+307 COMBINING DOT ABOVE, though as far as I can tell it does do so correctly. Is the text normalized before it reaches this code?
Comment 25 Jonathan Kew (:jfkthame) 2012-03-23 10:33:05 PDT
(In reply to Simon Montagu from comment #24)
> This doesn't work when the language is specified with xml:lang 

This is due to bug 702121 (perhaps duplicating bug 234485), I think....

> (which means
> that the tests in
> http://www.w3.org/International/tests/html-css/list-text-transform#special
> fail).

....although according to bz in bug 234485 comment 40, "xml:lang is ignored in text/html content, as it should be", which makes me suspect some of those w3.org testcases (the non-XHTML ones) may be incorrect, as they're using xml:lang in html content.
Comment 26 Jonathan Kew (:jfkthame) 2012-03-26 14:10:48 PDT
(In reply to Simon Montagu from comment #24)
> I also don't understand how it handles I with U+307 COMBINING DOT ABOVE,
> though as far as I can tell it does do so correctly. Is the text normalized
> before it reaches this code?

No, we don't apply any normalization. The sequence <U+0049, U+0307> (İ) will be lowercased as <U+0069, U+0307> (i̇) regardless of whether the element is lang="tr" or not. Many, though not all, fonts will suppress the "extra" dot in this case, either by ligating the "i" and the dot to a simple "i" glyph, or by contextually replacing "i" by "ı" when followed by a diacritic above.

Arguably, it would be good to _remove_ the combining dot when lowercasing this sequence, so that it reliably displays as "i" without an extra dot, but this is a separate issue from the Turkish behavior.
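
The behavior described for the decomposed sequence can be reproduced with any language-insensitive lowercasing that does no normalization. The sketch below uses Python's `str.lower()` for that purpose; `lower_dropping_dot` is a hypothetical variant showing the "remove the combining dot" idea, not something the patch implements.

```python
COMBINING_DOT_ABOVE = "\u0307"

# <U+0049, U+0307> lowercased without normalization keeps the combining dot.
seq = "I" + COMBINING_DOT_ABOVE
lowered = seq.lower()
print([f"U+{ord(c):04X}" for c in lowered])


def lower_dropping_dot(text):
    """Hypothetical variant: drop U+0307 after I so the result is a plain 'i'."""
    return text.replace("I" + COMBINING_DOT_ABOVE, "i").lower()


print([f"U+{ord(c):04X}" for c in lower_dropping_dot(seq)])
```

With the variant, the result no longer depends on whether the font suppresses the extra dot.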
Comment 27 Simon Montagu :smontagu 2012-03-26 16:21:05 PDT
Comment on attachment 606188 [details] [diff] [review]
patch, add Turkish support for text-transform and small-caps

Review of attachment 606188 [details] [diff] [review]:
-----------------------------------------------------------------

r=me (assuming rebasing for bug 605021)
Comment 29 Jonathan Kew (:jfkthame) 2012-03-27 09:19:58 PDT
Once these patches for Turkish etc are merged to m-c, I think we should resolve this bug as fixed, and file followups for any additional languages that require special case-mapping treatment. (We already have work in progress in bug 307039 for Greek.)
Comment 32 Selim Şumlu 2012-04-03 00:39:19 PDT
I never thought this was going to be fixed. Thank you guys, on behalf of all Turkish web makers!

Is this fix live yet?
http://www.w3.org/International/tests/html-css/generate?test=text-transform-040&format=h5 doesn't seem to be working in the latest Nightly.
Comment 33 Simon Montagu :smontagu 2012-04-03 01:08:49 PDT
(In reply to Selim Sumlu from comment #32)
> Is this fix live yet?
> http://www.w3.org/International/tests/html-css/generate?test=text-transform-
> 040&format=h5 doesn't seem to be working in the latest Nightly.

See comment 25 above. Perhaps we should file a bug on the tests?
Comment 34 Selim Şumlu 2012-04-03 01:39:47 PDT
Sorry, I've missed that.

I've just tested a few more Turkish websites on Nightly and they all work well.
Comment 35 Jonathan Kew (:jfkthame) 2012-04-03 01:44:57 PDT
(In reply to Simon Montagu from comment #33)
> (In reply to Selim Sumlu from comment #32)
> > Is this fix live yet?
> > http://www.w3.org/International/tests/html-css/generate?test=text-transform-
> > 040&format=h5 doesn't seem to be working in the latest Nightly.
> 
> See comment 25 above. Perhaps we should file a bug on the tests?

I think that would be appropriate. I just checked the current text at http://dev.w3.org/html5/spec/single-page.html#the-lang-and-xml:lang-attributes, and it says (in part):

<quote>
Authors must not use the lang attribute in the XML namespace on HTML elements in HTML documents. To ease migration to and from XHTML, authors may specify an attribute in no namespace with no prefix and with the literal localname "xml:lang" on HTML elements in HTML documents, but such attributes must only be specified if a lang attribute in no namespace is also specified, and both attributes must have the same value when compared in an ASCII case-insensitive manner.

NOTE: The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing.
</quote>

AFAICS, the way xml:lang is used in the testcase mentioned (and other similar ones) violates this, and Firefox is correct to ignore it and just respect the lang="en" setting from the root element.
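
As a rough model of the rule quoted above, the sketch below resolves an element's language from its attributes. It is a simplification under stated assumptions: `effective_lang` is a hypothetical helper, attributes are given as a plain dict in which the key "xml:lang" stands for the attribute as written in markup, and the XML-document branch (xml:lang taking precedence over lang) reflects my reading of the HTML specification rather than the quoted passage.

```python
def effective_lang(attrs, is_html_document, inherited_lang=""):
    """Return the language that applies to an element.

    attrs: dict mapping attribute names as written in markup to their values.
    """
    if is_html_document:
        # In text/html, an attribute literally named "xml:lang" has no effect
        # on language processing; only lang counts.
        return attrs.get("lang", inherited_lang)
    # In an XML document, xml:lang is honored and wins over lang (assumption).
    if "xml:lang" in attrs:
        return attrs["xml:lang"]
    return attrs.get("lang", inherited_lang)


# Element with only xml:lang="tr" inside a root that has lang="en".
print(effective_lang({"xml:lang": "tr"}, is_html_document=True, inherited_lang="en"))
print(effective_lang({"xml:lang": "tr"}, is_html_document=False, inherited_lang="en"))
```

In the HTML case the element stays "en", so Turkish casing is not applied, which matches the behavior defended in this comment.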
Comment 36 Jonathan Kew (:jfkthame) 2012-04-03 01:49:13 PDT
cc'ing Richard Ishida as author of the testcase concerned.
Comment 37 Richard Ishida 2012-04-03 06:05:16 PDT
Thanks for pointing out that bug, Jonathan. http://www.w3.org/International/tests/html-css/generate?test=text-transform-040&format=h5 and associated tests should be fixed now.
Comment 38 Maximilian Franzke 2013-05-17 22:45:54 PDT
sadly this isn't totally solved right now - regarding the allowed value "tr-TR" for the lang-attribute, the described problem still occurs.
Comment 39 Jonathan Kew (:jfkthame) 2013-05-18 01:58:36 PDT
(In reply to Maximilian Franzke from comment #38)
> sadly this isn't totally solved right now - regarding the allowed value
> "tr-TR" for the lang-attribute, the described problem still occurs.

You mean it works as expected with "tr", but fails with "tr-TR"? If so, please file a new bug to track the remaining issue - thanks.
