case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)

RESOLVED FIXED in mozilla14

Status

Core
Internationalization
RESOLVED FIXED
14 years ago
6 years ago

People

(Reporter: Jungshik Shin, Assigned: jfkthame)

Tracking

({intl})

Trunk
mozilla14
Details

Attachments

(2 attachments)

(Reporter)

Description

14 years ago
Christian asked about ToLower and ToUpper defined in intl/unicharutil.
They're implemented in two places, nsCaseConversionImpl.cpp and
nsUnicharUtils.h.

Simon and I have similar ideas about this. Either we need to change their
function signatures to accept and 'emit' PRUint32 (instead of PRUnichar) or make
new ones with PRUint32 and fix callers to select the 'right' ones depending on
the situation.
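
A minimal sketch of why a 32-bit signature matters for callers holding UTF-16 text: surrogate pairs have to be combined into one code point before mapping and split again afterwards. The names CodePointMapper and MapUTF16 are hypothetical, not existing Mozilla APIs, and std::u16string stands in for a PRUnichar buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical 32-bit mapper type: takes and returns a full code point,
// which is what a PRUint32-based ToUpper/ToLower would look like to callers.
using CodePointMapper = uint32_t (*)(uint32_t);

inline bool IsHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool IsLowSurrogate(char16_t u) { return (u & 0xFC00) == 0xDC00; }

// Apply a code-point mapper to UTF-16 text. Surrogate pairs are combined
// before the call and re-encoded afterwards; unpaired surrogates are passed
// to the mapper as-is.
std::u16string MapUTF16(const std::u16string& in, CodePointMapper map) {
  std::u16string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    uint32_t cp = in[i];
    if (IsHighSurrogate(in[i]) && i + 1 < in.size() &&
        IsLowSurrogate(in[i + 1])) {
      cp = 0x10000 + ((uint32_t(in[i]) - 0xD800) << 10) +
           (uint32_t(in[i + 1]) - 0xDC00);
      ++i;  // consumed the low surrogate too
    }
    cp = map(cp);
    if (cp >= 0x10000) {
      cp -= 0x10000;
      out.push_back(char16_t(0xD800 + (cp >> 10)));
      out.push_back(char16_t(0xDC00 + (cp & 0x3FF)));
    } else {
      out.push_back(char16_t(cp));
    }
  }
  return out;
}
```

A mapper that only sees one 16-bit unit at a time cannot do this, which is the motivation for either changing the existing signatures or adding 32-bit variants.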

Currently, only the Deseret alphabet (and possibly plane 14 language tags that
are kinda obsolete) among non-BMP characters needs case-folding (case conversion).
According to Simon, Math letters in plane 1 don't have case-folding.
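
For illustration, the Deseret case pairs sit at a fixed offset from each other: capitals are U+10400..U+10427 and small letters are U+10428..U+1044F, so the mapping is plus or minus 0x28. The sketch below hard-codes just that one range; the function names are hypothetical and this is not how the bug was eventually fixed.

```cpp
#include <cstdint>

// Deseret block boundaries; capital and small letters differ by 0x28.
constexpr uint32_t kDeseretCapitalFirst = 0x10400;
constexpr uint32_t kDeseretCapitalLast = 0x10427;
constexpr uint32_t kDeseretSmallFirst = 0x10428;
constexpr uint32_t kDeseretSmallLast = 0x1044F;
constexpr uint32_t kDeseretCaseOffset = 0x28;

// Hypothetical helpers: map within the Deseret block, pass everything else
// through unchanged.
constexpr uint32_t DeseretToLower(uint32_t cp) {
  return (cp >= kDeseretCapitalFirst && cp <= kDeseretCapitalLast)
             ? cp + kDeseretCaseOffset
             : cp;
}

constexpr uint32_t DeseretToUpper(uint32_t cp) {
  return (cp >= kDeseretSmallFirst && cp <= kDeseretSmallLast)
             ? cp - kDeseretCaseOffset
             : cp;
}
```

A plane 1 math letter such as U+1D400 falls outside both ranges and comes back unchanged.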

As for IsUpper, IsLower and other character properties, I'll file a separate bug
with a summary line that will probably read 'need to update unicharutils to
Unicode 4.0'.
(Reporter)

Comment 1

14 years ago
Adding titlecase. 
Summary: case-folding support for non-BMP characters (ToUpper,ToLower) → case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase)
(Reporter)

Comment 2

14 years ago
Related to this bug is bug 210502. 
Summary: case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase) → case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)

Updated

13 years ago
Blocks: 202208
Created attachment 178365 [details] [diff] [review]
w-i-p

This still needs a lot of work.

Updated

13 years ago
No longer blocks: 202208
(Reporter)

Comment 4

13 years ago
Simon, what's the size of static arrays for case folding? It seems like it has
about 200 elements so that the memory footprint will increase by 200 * 3 *
2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
way to avoid that, but I guess it's not much worth the trouble.
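
The estimate can be checked with a few lines. The entry count (about 200) and the three fields per entry are the approximate figures from this comment, not measured values, and ExtraBytesForWidening is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

// Extra bytes needed when every table field is widened from 16 to 32 bits.
constexpr std::size_t ExtraBytesForWidening(std::size_t entries,
                                            std::size_t fieldsPerEntry) {
  return entries * fieldsPerEntry * (sizeof(uint32_t) - sizeof(uint16_t));
}

// About 200 entries with 3 fields each: 200 * 3 * 2 bytes = 1200 bytes.
static_assert(ExtraBytesForWidening(200, 3) == 1200,
              "widening ~200 three-field entries costs about 1.2 kB");
```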
(Reporter)

Comment 5

13 years ago
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *

s/it has/they have/. In addition, most new characters in the queue for addition
to Unicode don't require case-folding, so those arrays will not grow
significantly in the future.
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *
> 2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
> way to avoid that, but I guess it's not much worth the trouble.
> 

It occurs to me that since there is only the one range of non-BMP characters
with case folding, perhaps we could just put the low surrogates into the
arrays instead of the full UTF-32 forms, and then everything would Just Work
with no code changes or almost none. What do you think?
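
A rough sketch of the surrogate idea, written as a standalone function rather than as entries in the existing arrays. Every Deseret letter shares the high surrogate 0xD801, and only the low surrogate differs between cases (0xDC00..0xDC27 for capitals, 0xDC28..0xDC4F for small letters). One assumption this sketch makes explicit: the preceding high surrogate is checked, because the same low-surrogate values also occur in unrelated characters such as U+10000. ToLowerDeseretInPlace is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

constexpr char16_t kDeseretHighSurrogate = 0xD801;
constexpr char16_t kLowCapitalFirst = 0xDC00;  // low surrogate of U+10400
constexpr char16_t kLowCapitalLast = 0xDC27;   // low surrogate of U+10427
constexpr char16_t kLowCaseOffset = 0x28;

// Lowercase Deseret capitals directly on UTF-16 code units by adjusting
// only the low surrogate; all other code units are left untouched.
void ToLowerDeseretInPlace(std::u16string& s) {
  for (std::size_t i = 0; i + 1 < s.size(); ++i) {
    if (s[i] == kDeseretHighSurrogate && s[i + 1] >= kLowCapitalFirst &&
        s[i + 1] <= kLowCapitalLast) {
      s[i + 1] = char16_t(s[i + 1] + kLowCaseOffset);
      ++i;  // skip the low surrogate just rewritten
    }
  }
}
```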
QA Contact: amyy → i18n
(Assignee)

Comment 7

6 years ago
Created attachment 602827 [details] [diff] [review]
patch, implement case mapping for the full Unicode repertoire

Here's a possible approach, based on replacing the existing case table and mapping function with lookup tables added to nsUnicodeProperties, by extending the table-generation tool there. This provides upper/lower/titlecase mappings for the full Unicode character repertoire.

The data tables here are a bit larger than the old version would have been (about 10K or so). This is a deliberate tradeoff; by using this structure we can significantly simplify the code, completely eliminating nsCompressedMap with its binary-search of character ranges and cache of recently-used mappings - instead, all we have to do is a couple of simple array lookups. So we save some code size, and gain significantly faster case mappings (according to my timing tests on an OS X opt build, anyhow).
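
To illustrate the "couple of simple array lookups" idea, here is a toy two-level table: the upper bits of the code point select a page, the lower bits select a slot in that page, and unused pages all share one zero-filled page. This is only a sketch of the general technique. The class name, the page size, and the choice to store signed deltas are assumptions and do not reflect the actual nsUnicodeProperties data layout, which is generated at build time.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr uint32_t kPageBits = 8;
constexpr uint32_t kPageSize = 1u << kPageBits;
constexpr uint32_t kMaxCodePoint = 0x10FFFF;

// Hypothetical two-level case table storing a signed delta per code point.
class ToyCaseTable {
 public:
  // Page 0 is the shared all-zero page ("no mapping").
  ToyCaseTable() : mPageIndex((kMaxCodePoint >> kPageBits) + 1), mPages(1) {}

  void SetDelta(uint32_t cp, int32_t delta) {
    uint16_t& slot = mPageIndex[cp >> kPageBits];
    if (slot == 0) {
      mPages.emplace_back();  // value-initialized, so all zeros
      slot = uint16_t(mPages.size() - 1);
    }
    mPages[slot][cp & (kPageSize - 1)] = delta;
  }

  // Two array lookups, no searching and no cache.
  uint32_t Map(uint32_t cp) const {
    if (cp > kMaxCodePoint) return cp;
    const auto& page = mPages[mPageIndex[cp >> kPageBits]];
    return cp + uint32_t(page[cp & (kPageSize - 1)]);
  }

 private:
  std::vector<uint16_t> mPageIndex;
  std::vector<std::array<int32_t, kPageSize>> mPages;
};
```

Lookup cost is constant regardless of how many mappings exist, at the price of storing whole pages for sparsely populated ranges, which matches the size-for-speed tradeoff described above.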
Attachment #602827 - Flags: review?(smontagu)
(Assignee)

Updated

6 years ago
Blocks: 605021
Comment on attachment 602827 [details] [diff] [review]
patch, implement case mapping for the full Unicode repertoire

Review of attachment 602827 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +215,5 @@
> +        my $upper = hex $fields[12];
> +        my $lower = hex $fields[13];
> +        my $title = hex $fields[14];
> +        # we only store one mapping for each character,
> +        # but also record what kind of mapping it is

This is rather devious, but I guess it works out as a good trade-off between data size and performance.
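
One way to picture "one mapping per character, plus the kind of mapping" is a packed 32-bit entry: the low 21 bits hold the other code point and a high bit records which direction the mapping goes. The constants and function names below are hypothetical and chosen for this sketch only; the real generated tables may pack things differently, and titlecase handling is omitted.

```cpp
#include <cstdint>

// Hypothetical entry layout: kind flag in the high bits, code point below.
constexpr uint32_t kKindLowerToUpper = 1u << 30;
constexpr uint32_t kKindUpperToLower = 1u << 31;
constexpr uint32_t kMappedCharMask = 0x001FFFFF;  // 21 bits cover U+10FFFF

constexpr uint32_t MakeEntry(uint32_t kind, uint32_t mapped) {
  return kind | (mapped & kMappedCharMask);
}

// The stored code point is used only when the entry's kind matches the
// requested conversion; otherwise the character maps to itself.
constexpr uint32_t ToUpperFromEntry(uint32_t ch, uint32_t entry) {
  return (entry & kKindLowerToUpper) ? (entry & kMappedCharMask) : ch;
}

constexpr uint32_t ToLowerFromEntry(uint32_t ch, uint32_t entry) {
  return (entry & kKindUpperToLower) ? (entry & kMappedCharMask) : ch;
}
```

Because a character is normally either upper or lower case but not both, one stored code point plus a flag is enough to answer both ToUpper and ToLower for it.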
Attachment #602827 - Flags: review?(smontagu) → review+
(Assignee)

Comment 9

6 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/edc0871b4e5b
Assignee: smontagu → jfkthame
Target Milestone: --- → mozilla14
https://hg.mozilla.org/mozilla-central/rev/edc0871b4e5b
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Depends on: 736227