Closed Bug 231162 Opened 20 years ago Closed 12 years ago

text-transform is not using language dependent casing rules

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

RESOLVED FIXED
mozilla14

People

(Reporter: ernestcline, Assigned: jfkthame)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: dev-doc-complete)

Attachments

(5 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

CSS calls for using language specific rules and has even since CSS 1.

I decided to write a test case involving the dotted and dotless i's.
In both Turkish and Azerbaijani, the normal casing rules are not followed
for the letters I (U+0049) and i (U+0069).  In those two languages,
lowercase U+0049 is ı (U+0131) and uppercase U+0069 is İ (U+0130).
Instead, even when I downloaded the Turkish locale (tested with 1.5)
and switched to it, text marked as either language still used the default
casing with both text-transform: uppercase and text-transform:lowercase.


Reproducible: Always

Steps to Reproduce:
1. Mark up a section of HTML as either Turkish (lang="tr") or Azerbaijani (lang="az")
2. Apply a CSS text-transform to it.


Actual Results:  
text-transform:lowercase of U+0049 was displayed as U+0069.
text-transform:uppercase of U+0069 was displayed as U+0049.

Expected Results:  
text-transform:lowercase of U+0049 should have been displayed as U+0131.
text-transform:uppercase of U+0069 should have been displayed as U+0130.
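
The expected mapping can be sketched in C++ roughly as follows. This only illustrates the rule described in this report and is not Gecko code: the names `usesTurkicCasing`, `toLowerForLang`, and `toUpperForLang` are hypothetical, the language check is an exact match on "tr" or "az", and only ASCII letters plus the two Turkic letters are handled.

```cpp
#include <cassert>
#include <string>

constexpr char32_t kDotlessSmallI = 0x0131;   // ı
constexpr char32_t kDottedCapitalI = 0x0130;  // İ

// Hypothetical helper (not a Gecko API): true for the two languages named in
// this report. Exact match only; tags such as "tr-TR" are not handled here.
bool usesTurkicCasing(const std::string& lang) {
    return lang == "tr" || lang == "az";
}

// Lowercase ASCII letters. For Turkic languages, I (U+0049) becomes U+0131
// and U+0130 becomes i (U+0069).
std::u32string toLowerForLang(const std::u32string& text, const std::string& lang) {
    const bool turkic = usesTurkicCasing(lang);
    std::u32string out;
    for (char32_t ch : text) {
        if (turkic && ch == U'I') {
            out += kDotlessSmallI;
        } else if (turkic && ch == kDottedCapitalI) {
            out += U'i';
        } else if (ch >= U'A' && ch <= U'Z') {
            out += static_cast<char32_t>(ch + 0x20);
        } else {
            out += ch;  // everything else is left untouched in this sketch
        }
    }
    return out;
}

// Uppercase ASCII letters. For Turkic languages, i (U+0069) becomes U+0130.
// U+0131 uppercases to I (U+0049) in every language.
std::u32string toUpperForLang(const std::u32string& text, const std::string& lang) {
    const bool turkic = usesTurkicCasing(lang);
    std::u32string out;
    for (char32_t ch : text) {
        if (turkic && ch == U'i') {
            out += kDottedCapitalI;
        } else if (ch == kDotlessSmallI) {
            out += U'I';
        } else if (ch >= U'a' && ch <= U'z') {
            out += static_cast<char32_t>(ch - 0x20);
        } else {
            out += ch;  // everything else is left untouched in this sketch
        }
    }
    return out;
}

// The Expected Results above, restated as checks.
void checkExpectedResults() {
    assert(toLowerForLang(U"I", "tr") == U"\u0131");
    assert(toUpperForLang(U"i", "tr") == U"\u0130");
    assert(toLowerForLang(U"I", "en") == U"i");
    assert(toUpperForLang(U"i", "en") == U"I");
}
```

Calling `checkExpectedResults()` exercises both the Turkic and the default paths; a real implementation would need full Unicode case mapping for the remaining characters.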

This may affect other language-dependent casing as well, but I have only tested
the dotted/dotless I's of Turkish/Azerbaijani. I will also attach one test file I
used. A second test, which placed the language info on the same element as the
text-transform, will not be uploaded since it produced the same non-result.
Attached file HTML Testcase
Assignee: nobody → smontagu
Status: UNCONFIRMED → NEW
Component: Layout: Fonts and Text → Internationalization
Ever confirmed: true
QA Contact: core.layout.fonts-and-text → amyy
I assume that this is a regression, and it's odd because we have code to handle
it in intl/unicharutil/src/nsCaseConversionImp2.cpp (at least for lang="tr": we
also need to add lang="az"). I guess that code is somehow not being reached.
As far as I can tell ToUpper() with the locale argument is just never being called.
Status: NEW → ASSIGNED
>CSS calls for using language specific rules and has even since CSS 1.

For the record, this is not strictly accurate: CSS3 is the first version to
require conformance to language specific rules.

CSS1: http://www.w3.org/TR/CSS1.html#text-transform
CSS1 core: UAs may ignore 'text-transform' (i.e., treat it as 'none') for
characters that are not from the Latin-1 repertoire and for elements in
languages for which the transformation is different from that specified by the
case-conversion tables of Unicode.

CSS2: http://www.w3.org/TR/REC-CSS2/text.html#caps-prop, unchanged in CSS2.1
http://www.w3.org/TR/CSS21/text.html#caps-prop
Conforming user agents may consider the value of 'text-transform' to be 'none'
for characters that are not from the Latin-1 repertoire and for elements in
languages for which the transformation is different from that specified by the
case-conversion tables of ISO 10646

http://www.w3.org/TR/css3-text/#caps-prop
Conforming user agents MUST support case mapping rules according to the Unicode
Standard for all characters specified by that standard.
OS: Windows XP → All
Hardware: PC → All
I could buy that, but in that case shouldn't Mozilla be leaving the dotted I and
dotless i alone since they aren’t in Latin-1?  Or are the two clauses supposed
to be independent?  I read it as saying if a UA doesn’t implement language
sensitive case mapping rules, it must case map Latin-1 only.  If the two were
supposed to be independent, I would expect an "or" instead of the "and" in the
quotes you gave from CSS 1 and CSS 2.

Of course, all this is a battle of semantics for no purpose, since the
dotted/dotless I case mapping for Turkish and Azerbaijani is mentioned in the
Unicode Standard and so CSS 3 clearly requires this be supported.
Let's begin by removing the existing code, because:
 (a) it isn't used
 (b) it wouldn't work correctly if it were used, due to errors such as:
	if(kDot_I == *s)
	     *s = kDot_I;
     so it isn't even useful as the basis for a working implementation.
 (c) having it in the tree is misleading and creates a superficial impression
     that we support Turkic casing.
Comment on attachment 139462
Remove unused code

Requesting reviews for removing the unused and inaccurate version.
Attachment #139462 - Flags: superreview?(dbaron)
Attachment #139462 - Flags: review?(jshin)
Attachment #139462 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 139462
Remove unused code

r=jshin

just a reminder (you may not need it, but just in case), we have bug 210501 in
which we have to overhaul the case conversion APIs anyway.
Attachment #139462 - Flags: review?(jshin) → review+
Comment on attachment 139462
Remove unused code

Checking this in made all non-clobber tinderboxen go orange, so I backed it out
again.
Attached patch fix dependency problem (obsolete)
We've had a longstanding dependency problem with the unicharutil_s static
library. We weren't relinking things that use this library when the
unicharutil library changed.

Rather than go add EXTRA_DEPS in a couple dozen Makefiles, I opted to just
handle the dependency in rules.mk.

I think this is the cause of the dep tinderbox bustage.
Attachment #139555 - Flags: review?(cls)
Comment on attachment 139555
fix dependency problem

Grumblesmurf.  We didn't have the dependency problem when MOZ_UNICHARUTIL_LIBS
were used as part of SHARED_LIBRARY_LIBS.  I think the special casing is a bad
idea in the long run.  There should be a generic mechanism to track
dependencies from EXTRA_DSO_LDOPTS.  Something like:

DSO_LDOPTS_DEPS = $(filter %.$(LIB_SUFFIX) %$(DLL_SUFFIX), $(EXTRA_DSO_LDOPTS))

At some point, the VPATH issue would need to be fixed so that the -lfoo
dependencies could be tracked as well.
Attachment #139555 - Flags: review?(cls) → review-
Right, but we don't want to use --whole-archive for this static library.

I'd be ok with adding a mechanism like you suggest, and it would fix this case
even if we don't address the -lfoo case right now.  And rather than trying to
track through the linker -L switches figuring out a library path to search in
(ugh), I think I'd rather just say that the full library name should be used for
linking against static libraries, and for linking against shared libraries,
chances are that nothing will break if we don't relink (in the case where a
symbol was removed from the shared library, we should be relinking anyway to
remove references to that from the current module, and if new symbols were added
to the shared library that the current module doesn't use, it won't affect
anything).
Attachment #139555 - Attachment is obsolete: true
Attachment #139709 - Flags: review?(cls)
Attachment #139709 - Flags: review?(cls) → review+
Comment on attachment 139709
handle static library dependencies

This is checked in. It should now be possible to land the original patch in
this bug without breaking the depend tinderboxes.
QA Contact: amyy → i18n
Before this lands, please note that there are other affected languages. The modern
Latin scripts for Crimean Tatar (crh), Volga Tatar (tt), and Bashkir (ba) all
use the Turkish/Azerbaijani-style i/İ and ı/I pairings. All three have both
Cyrillic and Latin scripts, and only the latter is affected, so perhaps this
would require the use of script variants (e.g., tt-Latn), but Azerbaijani also
has a well-represented Cyrillic script. I would like to see the new
text-transform behavior apply to tt, crh, and ba as well.
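
One way to read the script-variant suggestion above is sketched below. It is a hypothetical policy and not what Gecko does: `splitLangTag` and `wantsTurkicCasing` are made-up names, tr and az are assumed to default to Latin script, and tt, crh, and ba are assumed to opt in only with an explicit Latn subtag. Only the subtag directly after the language is inspected for a script.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Split a language tag such as "tt-Latn" into lowercase subtags.
std::vector<std::string> splitLangTag(const std::string& tag) {
    std::vector<std::string> subtags;
    std::stringstream stream(tag);
    std::string part;
    while (std::getline(stream, part, '-')) {
        std::transform(part.begin(), part.end(), part.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        subtags.push_back(part);
    }
    return subtags;
}

// Hypothetical policy: should the Turkish-style i/İ and ı/I pairing apply?
bool wantsTurkicCasing(const std::string& tag) {
    const std::vector<std::string> subtags = splitLangTag(tag);
    if (subtags.empty()) {
        return false;
    }
    const std::string& lang = subtags[0];

    // Treat a four-letter subtag right after the language as the script.
    std::string script;
    if (subtags.size() > 1 && subtags[1].size() == 4) {
        script = subtags[1];
    }

    if (lang == "tr" || lang == "az") {
        // Assumed to be Latin unless another script is given (e.g. az-Cyrl).
        return script.empty() || script == "latn";
    }
    if (lang == "tt" || lang == "crh" || lang == "ba") {
        // Assumed to need an explicit Latin script subtag.
        return script == "latn";
    }
    return false;
}

// A few tags run through the policy.
void checkScriptPolicy() {
    assert(wantsTurkicCasing("tr"));
    assert(wantsTurkicCasing("tt-Latn"));
    assert(!wantsTurkicCasing("tt"));
    assert(!wantsTurkicCasing("az-Cyrl"));
}
```

Whether a bare "tt" or "ba" should count as Latin or Cyrillic is exactly the open question raised in this comment, so the defaults here are placeholders.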
I am the submitter of the duplicate bug 570333. I am sorry to have missed this bug when I searched the database.

I would like to add my comments if I may.

1- The Turkish/Turkic i/I case transform problem is NOT limited to the text-transform CSS property; it affects font-variant:small-caps as well.

2- I can see in this thread that you guys have known about this bug since the beginning of 2004. That is SIX years! What is stopping it from being fixed, so that 100+ million people can use their language as fully as other Latin alphabet users?

There is a stark reality here: this bug stops Turks from formatting their own country's name, Turkiye, properly!

Shouldn't this be a priority?
Blocks: css2.1-tests
adding cc. Not that I'm personally biased/affected or anything. ;)
This adds support for the Turkish-style İ/i and I/ı casing behavior in text-run transformations. It's handled within the transforming text run, rather than by adding locale support to the low-level Unicode case mapping functions; I think it makes more sense at this level given the limited scope of the changes needed.

I'm expecting further changes and refactoring of nsCaseTransformTextRunFactory::RebuildTextRun in bug 307039 (for Greek support), but this can at least serve as a starting point for adding language-sensitive behavior.
Attachment #606188 - Flags: review?(smontagu)
Attachment #606212 - Flags: review?(smontagu)
Comment on attachment 606188
patch, add Turkish support for text-transform and small-caps

Review of attachment 606188:
-----------------------------------------------------------------

This doesn't work when the language is specified with xml:lang (which means that the tests in http://www.w3.org/International/tests/html-css/list-text-transform#special fail).

I also don't understand how it handles I with U+307 COMBINING DOT ABOVE, though as far as I can tell it does do so correctly. Is the text normalized before it reaches this code?
(In reply to Simon Montagu from comment #24)
> This doesn't work when the language is specified with xml:lang 

This is due to bug 702121 (perhaps duplicating bug 234485), I think....

> (which means
> that the tests in
> http://www.w3.org/International/tests/html-css/list-text-transform#special
> fail).

....although according to bz in bug 234485 comment 40, "xml:lang is ignored in text/html content, as it should be", which makes me suspect some of those w3.org testcases (the non-XHTML ones) may be incorrect, as they're using xml:lang in html content.
(In reply to Simon Montagu from comment #24)
> I also don't understand how it handles I with U+307 COMBINING DOT ABOVE,
> though as far as I can tell it does do so correctly. Is the text normalized
> before it reaches this code?

No, we don't apply any normalization. The sequence <U+0049, U+0307> (İ) will be lowercased as <U+0069, U+0307> (i̇) regardless of whether the element is lang="tr" or not. Many, though not all, fonts will suppress the "extra" dot in this case, either by ligating the "i" and the dot to a simple "i" glyph, or by contextually replacing "i" by "ı" when followed by a diacritic above.

Arguably, it would be good to _remove_ the combining dot when lowercasing this sequence, so that it reliably displays as "i" without an extra dot, but this is a separate issue from the Turkish behavior.
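
The idea of removing the combining dot could look roughly like this. It is a sketch of the suggestion only, not the patch under review: `lowerDroppingDotAfterI` is a hypothetical name, only ASCII letters are lowercased, and the dot is dropped only when U+0307 directly follows I with nothing in between.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Lowercase ASCII letters, collapsing the sequence <U+0049, U+0307> into a
// plain "i" so that no extra dot can be rendered.
std::u32string lowerDroppingDotAfterI(const std::u32string& text) {
    constexpr char32_t kCombiningDotAbove = 0x0307;
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t ch = text[i];
        const bool dotFollows =
            i + 1 < text.size() && text[i + 1] == kCombiningDotAbove;
        if (ch == U'I' && dotFollows) {
            out += U'i';
            ++i;  // skip the combining dot
        } else if (ch >= U'A' && ch <= U'Z') {
            out += static_cast<char32_t>(ch + 0x20);
        } else {
            out += ch;  // includes a U+0307 that follows any other letter
        }
    }
    return out;
}

// The sequence discussed above, with and without the combining dot.
void checkDotRemoval() {
    assert(lowerDroppingDotAfterI(U"I\u0307") == U"i");
    assert(lowerDroppingDotAfterI(U"I") == U"i");
}
```

Without the first branch, the same input would come out as <U+0069, U+0307>, which is the current behavior described above and leaves the result up to the font.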
Comment on attachment 606188
patch, add Turkish support for text-transform and small-caps

Review of attachment 606188:
-----------------------------------------------------------------

r=me (assuming rebasing for bug 605021)
Attachment #606188 - Flags: review?(smontagu) → review+
Attachment #606212 - Flags: review?(smontagu) → review+
Once these patches for Turkish etc are merged to m-c, I think we should resolve this bug as fixed, and file followups for any additional languages that require special case-mapping treatment. (We already have work in progress in bug 307039 for Greek.)
https://hg.mozilla.org/mozilla-central/rev/4e28b565455d
https://hg.mozilla.org/mozilla-central/rev/c510b7d0069c
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 740477
I never thought this was going to be fixed. Thank you guys, on behalf of all Turkish web makers!

Is this fix live yet?
http://www.w3.org/International/tests/html-css/generate?test=text-transform-040&format=h5 doesn't seem to be working in the latest Nightly.
(In reply to Selim Sumlu from comment #32)
> Is this fix live yet?
> http://www.w3.org/International/tests/html-css/generate?test=text-transform-
> 040&format=h5 doesn't seem to be working in the latest Nightly.

See comment 25 above. Perhaps we should file a bug on the tests?
Sorry, I've missed that.

I've just tested a few more Turkish websites on Nightly and they all work well.
(In reply to Simon Montagu from comment #33)
> (In reply to Selim Sumlu from comment #32)
> > Is this fix live yet?
> > http://www.w3.org/International/tests/html-css/generate?test=text-transform-
> > 040&format=h5 doesn't seem to be working in the latest Nightly.
> 
> See comment 25 above. Perhaps we should file a bug on the tests?

I think that would be appropriate. I just checked the current text at http://dev.w3.org/html5/spec/single-page.html#the-lang-and-xml:lang-attributes, and it says (in part):

<quote>
Authors must not use the lang attribute in the XML namespace on HTML elements in HTML documents. To ease migration to and from XHTML, authors may specify an attribute in no namespace with no prefix and with the literal localname "xml:lang" on HTML elements in HTML documents, but such attributes must only be specified if a lang attribute in no namespace is also specified, and both attributes must have the same value when compared in an ASCII case-insensitive manner.

NOTE: The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing.
</quote>

AFAICS, the way xml:lang is used in the testcase mentioned (and other similar ones) violates this, and Firefox is correct to ignore it and just respect the lang="en" setting from the root element.
cc'ing Richard Ishida as author of the testcase concerned.
Thanks for pointing out that bug, Jonathan. http://www.w3.org/International/tests/html-css/generate?test=text-transform-040&format=h5 and associated tests should be fixed now.
sadly this isn't totally solved right now - regarding the allowed value "tr-TR" for the lang-attribute, the described problem still occurs.
(In reply to Maximilian Franzke from comment #38)
> sadly this isn't totally solved right now - regarding the allowed value
> "tr-TR" for the lang-attribute, the described problem still occurs.

You mean it works as expected with "tr", but fails with "tr-TR"? If so, please file a new bug to track the remaining issue - thanks.
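
For reference, a tag such as "tr-TR" is usually matched against "tr" by prefix, in the style of RFC 4647 basic filtering. The sketch below shows that comparison only; `asciiLower` and `langTagMatches` are hypothetical names, and how the Gecko code actually compares the language is not shown in this bug.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// ASCII-only lowercase copy of a language tag.
std::string asciiLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// True if `tag` equals `range`, or starts with `range` followed by '-'.
// The comparison is ASCII case-insensitive.
bool langTagMatches(const std::string& tag, const std::string& range) {
    const std::string t = asciiLower(tag);
    const std::string r = asciiLower(range);
    if (r.empty() || t.size() < r.size()) {
        return false;
    }
    if (t.compare(0, r.size(), r) != 0) {
        return false;
    }
    return t.size() == r.size() || t[r.size()] == '-';
}

// The case reported above: "tr-TR" should be treated like "tr".
void checkRegionSubtag() {
    assert(langTagMatches("tr", "tr"));
    assert(langTagMatches("tr-TR", "tr"));
    assert(!langTagMatches("trv", "tr"));
}
```

An exact string comparison against "tr" would reject "tr-TR", which is one plausible cause of the behavior reported here.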
Depends on: 905381
See Also: → 1225827