Closed Bug 210501 Opened 21 years ago Closed 12 years ago

case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)

Categories

(Core :: Internationalization, defect)

RESOLVED FIXED
mozilla14

People

(Reporter: jshin1987, Assigned: jfkthame)

Details

(Keywords: intl)

Attachments

(2 files)

Christian asked about ToLower and ToUpper defined in intl/unicharutil.
They're implemented in two places, nsCaseConversionImpl.cpp and
nsUnicharUtils.h.

Simon and I have similar ideas about this. Either we need to change their
function signatures to accept and 'emit' PRUint32 (instead of PRUnichar) or make
new ones with PRUint32 and fix callers to select the 'right' ones depending on
the situation.

Currently, only the Deseret alphabet (and possibly plane 14 language tags that are
kinda obsolete) among non-BMP characters needs case-folding (case conversion).
According to Simon, Math letters in plane 1 don't have case-folding.
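
A minimal sketch of the 32-bit signatures discussed above, assuming Deseret (capitals U+10400..U+10427, small letters U+10428..U+1044F) is the only non-BMP range with case pairs. The names ToLowerUCS4 and ToUpperUCS4 are hypothetical, and the ASCII branch only stands in for the existing BMP tables; none of this is the code in intl/unicharutil.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names and constants, for illustration only.
constexpr uint32_t kDeseretCapitalFirst = 0x10400;  // DESERET CAPITAL LETTER LONG I
constexpr uint32_t kDeseretSmallFirst = 0x10428;    // DESERET SMALL LETTER LONG I
constexpr uint32_t kDeseretPairs = 0x28;            // 40 capital/small pairs

// Takes and returns a full code point instead of a 16-bit code unit.
uint32_t ToLowerUCS4(uint32_t aCh) {
  assert(aCh <= 0x10FFFF);
  // ASCII A..Z, standing in for the real BMP case tables.
  if (aCh >= 0x41 && aCh <= 0x5A) {
    return aCh + 0x20;
  }
  if (aCh >= kDeseretCapitalFirst &&
      aCh < kDeseretCapitalFirst + kDeseretPairs) {
    return aCh + kDeseretPairs;
  }
  return aCh;  // no case mapping
}

uint32_t ToUpperUCS4(uint32_t aCh) {
  assert(aCh <= 0x10FFFF);
  // ASCII a..z, standing in for the real BMP case tables.
  if (aCh >= 0x61 && aCh <= 0x7A) {
    return aCh - 0x20;
  }
  if (aCh >= kDeseretSmallFirst &&
      aCh < kDeseretSmallFirst + kDeseretPairs) {
    return aCh - kDeseretPairs;
  }
  return aCh;  // no case mapping
}
```

With these definitions, U+1D400 (MATHEMATICAL BOLD CAPITAL A) passes through unchanged, which matches the remark that the plane 1 Math letters have no case pairs.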

As for IsUpper, IsLower and other character properties, I'll file a separate bug
with a summary line that will probably read 'need to update unicharutils to
Unicode 4.0'.
Adding titlecase. 
Summary: case-folding support for non-BMP characters (ToUpper,ToLower) → case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase)
Related to this bug is bug 210502. 
Summary: case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase) → case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)
Blocks: 202208
Attached patch w-i-p
This still needs a lot of work.
No longer blocks: 202208
Simon, what's the size of static arrays for case folding? It seems like it has
about 200 elements so that the memory footprint will increase by 200 * 3 *
2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
way to avoid that, but I guess it's not much worth the trouble.
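
The estimate above can be checked with a small sketch. The entry layouts are assumptions (three fields per entry, field names invented here), and the count of 200 is only the rough figure from the comment.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical entry layouts: three fields per table entry.
struct CaseEntry16 { uint16_t a, b, c; };  // current 16-bit fields
struct CaseEntry32 { uint32_t a, b, c; };  // widened to 32-bit fields

constexpr std::size_t kApproxEntries = 200;  // rough count, not measured

// Extra bytes needed if every field grows from 16 to 32 bits.
constexpr std::size_t kExtraBytes =
    kApproxEntries * (sizeof(CaseEntry32) - sizeof(CaseEntry16));

static_assert(kExtraBytes == 200 * 3 * 2, "200 entries * 3 fields * 2 bytes");
```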
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *

s/it has/they have/. In addition, most new characters in the queue for addition
to Unicode don't require case-folding, so those arrays will not grow
significantly in the future.
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *
> 2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
> way to avoid that, but I guess it's not much worth the trouble.
> 

It occurs to me that since there is only the one range of non-BMP characters
with case folding, perhaps we could just put the low surrogates into the
arrays instead of the full UTF-32 forms, and then everything would Just Work
with no code changes or almost none. What do you think?
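
A rough sketch of the surrogate idea, under the assumption that Deseret is the only range involved. Every Deseret letter encodes in UTF-16 with the same high surrogate 0xD801, so only the second code unit differs between a capital and its small letter. The function name is hypothetical. Unlike the suggestion as worded, this sketch checks the preceding high surrogate, because the same low-surrogate values also occur in pairs for other non-BMP characters (U+10000 is 0xD800 0xDC00, for example).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// U+10400..U+10427 encode as 0xD801 0xDC00..0xDC27,
// U+10428..U+1044F encode as 0xD801 0xDC28..0xDC4F.
constexpr char16_t kDeseretHighSurrogate = 0xD801;
constexpr char16_t kCapitalLowFirst = 0xDC00;
constexpr char16_t kLowPairOffset = 0x28;

// Hypothetical helper: lowercases Deseret letters in a UTF-16 string by
// adjusting only the low surrogate of each matching pair.
void ToLowerDeseretUTF16(std::u16string& aStr) {
  for (std::size_t i = 1; i < aStr.size(); ++i) {
    const char16_t low = aStr[i];
    if (aStr[i - 1] == kDeseretHighSurrogate &&
        low >= kCapitalLowFirst &&
        low < kCapitalLowFirst + kLowPairOffset) {
      aStr[i] = static_cast<char16_t>(low + kLowPairOffset);
    }
  }
}
```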
QA Contact: amyy → i18n
Here's a possible approach, based on replacing the existing case table and mapping function with lookup tables added to nsUnicodeProperties, by extending the table-generation tool there. This provides upper/lower/titlecase mappings for the full Unicode character repertoire.

The data tables here are a bit larger than the old version would have been (about 10K or so). This is a deliberate tradeoff; by using this structure we can significantly simplify the code, completely eliminating nsCompressedMap with its binary-search of character ranges and cache of recently-used mappings - instead, all we have to do is a couple of simple array lookups. So we save some code size, and gain significantly faster case mappings (according to my timing tests on an OS X opt build, anyhow).
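
To illustrate what "a couple of simple array lookups" can look like, here is a minimal two-level table sketch. The page size, the table layout, the names, and the run-time builder are all assumptions made for this example; the real tables are produced by the generation tool and their layout may differ.

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout: a per-page index, then a per-character delta.
constexpr unsigned kPageBits = 7;  // 128 code points per page
constexpr uint32_t kPageSize = 1u << kPageBits;
constexpr uint32_t kMaxCodePoint = 0x10FFFF;
constexpr uint32_t kNumPages = (kMaxCodePoint >> kPageBits) + 1;

struct LowercasePages {
  // Every page defaults to index 0, the shared "no mapping" page.
  std::array<uint8_t, kNumPages> pageIndex{};
  std::array<std::array<int32_t, kPageSize>, 3> deltas{};
};

// Stand-in for generated data: only ASCII and Deseret capitals are filled in.
LowercasePages BuildLowercasePages() {
  LowercasePages t;
  t.pageIndex[0x41 >> kPageBits] = 1;
  for (uint32_t c = 0x41; c <= 0x5A; ++c) {
    t.deltas[1][c & (kPageSize - 1)] = 0x20;
  }
  t.pageIndex[0x10400 >> kPageBits] = 2;
  for (uint32_t c = 0x10400; c <= 0x10427; ++c) {
    t.deltas[2][c & (kPageSize - 1)] = 0x28;
  }
  return t;
}

// Two array lookups, no binary search and no cache.
uint32_t ToLowerViaPages(const LowercasePages& aTables, uint32_t aCh) {
  if (aCh > kMaxCodePoint) {
    return aCh;
  }
  const uint8_t page = aTables.pageIndex[aCh >> kPageBits];
  const int32_t delta = aTables.deltas[page][aCh & (kPageSize - 1)];
  return static_cast<uint32_t>(static_cast<int32_t>(aCh) + delta);
}
```

Pages with no case mappings all share one zero-filled page, which keeps the table small even though it covers the whole code space.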
Attachment #602827 - Flags: review?(smontagu)
Blocks: 605021
Comment on attachment 602827
patch, implement case mapping for the full Unicode repertoire

Review of attachment 602827:
-----------------------------------------------------------------

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +215,5 @@
> +        my $upper = hex $fields[12];
> +        my $lower = hex $fields[13];
> +        my $title = hex $fields[14];
> +        # we only store one mapping for each character,
> +        # but also record what kind of mapping it is

This is rather devious, but I guess it works out as a good trade-off between data size and performance.
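
The quoted Perl reads fields 12 to 14 of UnicodeData.txt, which hold the simple uppercase, lowercase and titlecase mappings. One way to read "one mapping plus its kind" is sketched below. The flag values, the names, and the stand-in table are assumptions for illustration, not the patch's actual encoding, and titlecase handling is left out.

```cpp
#include <cstdint>

// Hypothetical flag layout: the top bits say what kind of mapping is
// stored, the low 21 bits hold the mapped code point.
constexpr uint32_t kStoresLower = 0x80000000;  // char is upper, value is its lower
constexpr uint32_t kStoresUpper = 0x40000000;  // char is lower, value is its upper
constexpr uint32_t kMappedCharMask = 0x001FFFFF;

// Stand-in for the generated table: ASCII and Deseret only.
uint32_t CaseMapEntry(uint32_t aCh) {
  if (aCh >= 0x41 && aCh <= 0x5A) return kStoresLower | (aCh + 0x20);
  if (aCh >= 0x61 && aCh <= 0x7A) return kStoresUpper | (aCh - 0x20);
  if (aCh >= 0x10400 && aCh <= 0x10427) return kStoresLower | (aCh + 0x28);
  if (aCh >= 0x10428 && aCh <= 0x1044F) return kStoresUpper | (aCh - 0x28);
  return 0;  // no mapping recorded
}

// Each direction uses the stored value only if its kind matches.
uint32_t UpperFromEntry(uint32_t aCh) {
  const uint32_t entry = CaseMapEntry(aCh);
  return (entry & kStoresUpper) ? (entry & kMappedCharMask) : aCh;
}

uint32_t LowerFromEntry(uint32_t aCh) {
  const uint32_t entry = CaseMapEntry(aCh);
  return (entry & kStoresLower) ? (entry & kMappedCharMask) : aCh;
}
```

Storing a single value per character works for these pairs because an uppercase letter only needs its lowercase form and vice versa; the kind flag tells the caller which direction the stored value applies to.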
Attachment #602827 - Flags: review?(smontagu) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/edc0871b4e5b
Assignee: smontagu → jfkthame
Target Milestone: --- → mozilla14
https://hg.mozilla.org/mozilla-central/rev/edc0871b4e5b
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 736227