Bug 210501 - case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper, ToLower, ToTitlecase)
Status: RESOLVED FIXED
: intl
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All All
Importance: -- normal with 2 votes
Target Milestone: mozilla14
Assigned To: Jonathan Kew (:jfkthame)
Depends on: 736227
Blocks: 605021
 
Reported: 2003-06-24 08:07 PDT by Jungshik Shin
Modified: 2012-04-02 00:52 PDT


Attachments
w-i-p (32.30 KB, patch)
2005-03-23 07:45 PST, Simon Montagu :smontagu
no flags
patch, implement case mapping for the full Unicode repertoire (135.75 KB, patch)
2012-03-05 00:42 PST, Jonathan Kew (:jfkthame)
smontagu: review+

Description Jungshik Shin 2003-06-24 08:07:07 PDT
Christian asked about ToLower and ToUpper defined in intl/unicharutil.
They're implemented in two places, nsCaseConversionImpl.cpp and 
nsUnicharUtils.h

Simon and I have similar ideas about this. Either we need to change their
function signatures to accept and 'emit' PRUint32 (instead of PRUnichar) or make
new ones with PRUint32 and fix callers to select the 'right' ones depending on
the situation.

Currently, only the Deseret alphabet (and possibly the plane 14 language tags,
which are kinda obsolete) among non-BMP characters needs case-folding (case
conversion). According to Simon, the Math letters in plane 1 don't have
case-folding.
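
For illustration, here is a minimal sketch of what 32-bit versions of the functions could look like if they only had to cover Deseret. The names ToLowerCodePoint and ToUpperCodePoint and the constants are hypothetical, not the functions in intl/unicharutil. The sketch assumes the Deseret ranges as of Unicode 4.0: capitals at U+10400..U+10427 and small letters 0x28 code points above them.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not the intl/unicharutil functions.
// Deseret capitals: U+10400..U+10427. Small letters: U+10428..U+1044F.
const uint32_t kDeseretUpperFirst = 0x10400;
const uint32_t kDeseretUpperLast = 0x10427;
const uint32_t kDeseretCaseOffset = 0x28;

// Accepts and returns a full code point, so non-BMP input is representable.
inline uint32_t ToLowerCodePoint(uint32_t aCh) {
  if (aCh >= kDeseretUpperFirst && aCh <= kDeseretUpperLast) {
    return aCh + kDeseretCaseOffset;
  }
  return aCh;  // every other character is left unchanged in this sketch
}

inline uint32_t ToUpperCodePoint(uint32_t aCh) {
  if (aCh >= kDeseretUpperFirst + kDeseretCaseOffset &&
      aCh <= kDeseretUpperLast + kDeseretCaseOffset) {
    return aCh - kDeseretCaseOffset;
  }
  return aCh;
}
```

A caller holding UTF-16 text would have to combine a surrogate pair into one code point before calling these, which is why the callers need fixing as well.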

As for IsUpper, IsLower and other character properties, I'll file a separate bug
with a summary line that will probably read 'need to update unicharutils to
Unicode 4.0'.
Comment 1 Jungshik Shin 2003-06-24 08:08:32 PDT
Adding titlecase. 
Comment 2 Jungshik Shin 2004-03-26 18:04:46 PST
Related to this bug is bug 210502. 
Comment 3 Simon Montagu :smontagu 2005-03-23 07:45:27 PST
Created attachment 178365
w-i-p

This still needs a lot of work.
Comment 4 Jungshik Shin 2005-03-31 21:33:26 PST
Simon, what's the size of static arrays for case folding? It seems like it has
about 200 elements so that the memory footprint will increase by 200 * 3 *
2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
way to avoid that, but I guess it's not much worth the trouble.
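
The estimate above can be checked mechanically. This is only a sketch of the arithmetic; the entry count (about 200) and the three values per entry are the figures quoted in the comment, not measured from the source.

```cpp
#include <cstddef>
#include <cstdint>

// Figures taken from the comment: about 200 entries, 3 values per entry.
const std::size_t kEntries = 200;
const std::size_t kValuesPerEntry = 3;

// Table size with 16-bit and with 32-bit elements.
const std::size_t kBytes16 = kEntries * kValuesPerEntry * sizeof(uint16_t);
const std::size_t kBytes32 = kEntries * kValuesPerEntry * sizeof(uint32_t);

// Growth from widening every element by 2 bytes: 1200 bytes, about 1.2 kB.
const std::size_t kGrowth = kBytes32 - kBytes16;
```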
Comment 5 Jungshik Shin 2005-03-31 21:37:09 PST
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *

s/it has/they have/. In addition, most new characters in the queue for addition
to Unicode don't require case-folding, so those arrays will not grow
significantly in the future.
Comment 6 Simon Montagu :smontagu 2005-04-01 22:43:20 PST
(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *
> 2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
> way to avoid that, but I guess it's not much worth the trouble.
> 

It occurs to me that since there is only the one range of non-BMP characters
with case folding, perhaps we could just put the low surrogates into the
arrays instead of the full UTF-32 forms, and then everything would Just Work
with no code changes or almost none. What do you think?
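
A rough sketch of that idea, working directly on UTF-16 code units. This is not code from the w-i-p patch, and ToLowerDeseretAware is a hypothetical name. It relies on every Deseret letter sharing the high surrogate 0xD801, with capitals at low surrogates 0xDC00..0xDC27 and small letters at 0xDC28..0xDC4F. The check on the preceding unit is an added assumption: the same low surrogates also appear in unrelated characters (U+10000 is 0xD800 0xDC00), so a purely per-unit table would alter those too.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: lowercase Deseret letters in place in UTF-16 text
// by adjusting only the low surrogate of each pair.
inline std::u16string ToLowerDeseretAware(const std::u16string& aIn) {
  std::u16string out(aIn);
  for (std::size_t i = 1; i < out.size(); ++i) {
    // Only touch a low surrogate that follows the Deseret high surrogate.
    if (out[i - 1] == 0xD801 && out[i] >= 0xDC00 && out[i] <= 0xDC27) {
      out[i] = static_cast<char16_t>(out[i] + 0x28);
    }
  }
  return out;
}
```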
Comment 7 Jonathan Kew (:jfkthame) 2012-03-05 00:42:30 PST
Created attachment 602827
patch, implement case mapping for the full Unicode repertoire

Here's a possible approach, based on replacing the existing case table and mapping function with lookup tables added to nsUnicodeProperties, by extending the table-generation tool there. This provides upper/lower/titlecase mappings for the full Unicode character repertoire.

The data tables here are a bit larger than the old version would have been (about 10K or so). This is a deliberate tradeoff; by using this structure we can significantly simplify the code, completely eliminating nsCompressedMap with its binary-search of character ranges and cache of recently-used mappings - instead, all we have to do is a couple of simple array lookups. So we save some code size, and gain significantly faster case mappings (according to my timing tests on an OS X opt build, anyhow).
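
To illustrate the "couple of simple array lookups", here is a minimal two-level table sketch. It is not the layout produced by the patch: the class name CaseTable, the 256-entry page size, and storing a signed delta per character are all assumptions made for the example, and the table is filled at run time here, whereas the patch generates its tables offline.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical two-level lookup table; the layout is illustrative only.
class CaseTable {
 public:
  static constexpr uint32_t kPageBits = 8;
  static constexpr uint32_t kPageSize = 1u << kPageBits;
  static constexpr uint32_t kMaxCodePoint = 0x10FFFF;

  CaseTable() : mPageIndex((kMaxCodePoint >> kPageBits) + 1, 0) {
    mPages.emplace_back();  // page 0 is shared by all blocks with no mappings
    mPages[0].fill(0);
  }

  // Record that aFrom lowercases to aTo, allocating a page on first use.
  void AddLower(uint32_t aFrom, uint32_t aTo) {
    uint16_t& page = mPageIndex[aFrom >> kPageBits];
    if (page == 0) {
      mPages.emplace_back();
      mPages.back().fill(0);
      page = static_cast<uint16_t>(mPages.size() - 1);
    }
    mPages[page][aFrom & (kPageSize - 1)] =
        static_cast<int32_t>(aTo) - static_cast<int32_t>(aFrom);
  }

  // The lookup: one read of the page index, one read of the page.
  // No binary search and no cache of recent results.
  uint32_t ToLower(uint32_t aCh) const {
    if (aCh > kMaxCodePoint) {
      return aCh;
    }
    int32_t delta = mPages[mPageIndex[aCh >> kPageBits]][aCh & (kPageSize - 1)];
    return static_cast<uint32_t>(static_cast<int32_t>(aCh) + delta);
  }

 private:
  std::vector<uint16_t> mPageIndex;                    // code point block -> page number
  std::vector<std::array<int32_t, kPageSize>> mPages;  // per-character deltas
};
```

Blocks of 256 code points with no case mappings all point at the shared zero page, which is what keeps a table covering the whole repertoire small.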
Comment 8 Simon Montagu :smontagu 2012-03-13 12:22:47 PDT
Comment on attachment 602827
patch, implement case mapping for the full Unicode repertoire

Review of attachment 602827:
-----------------------------------------------------------------

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +215,5 @@
> +        my $upper = hex $fields[12];
> +        my $lower = hex $fields[13];
> +        my $title = hex $fields[14];
> +        # we only store one mapping for each character,
> +        # but also record what kind of mapping it is

This is rather devious, but I guess it works out as a good trade-off between data size and performance.
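
One way to read "store one mapping, plus what kind it is" is sketched below. The flag names, bit positions, and helper functions are hypothetical and are not taken from the patch; titlecase is left out to keep the example short. The assumption is that a capital letter only needs its lowercase form stored and a small letter only its uppercase form, so a single code point plus a flag is enough.

```cpp
#include <cstdint>

// Hypothetical packed entry: flag bits on top, mapped code point below.
enum : uint32_t {
  kMapsToLower = 0x40000000,    // stored code point is the lowercase form
  kMapsToUpper = 0x20000000,    // stored code point is the uppercase form
  kCodePointMask = 0x001FFFFF   // 21 bits cover U+0000..U+10FFFF
};

inline uint32_t PackMapping(uint32_t aKind, uint32_t aTarget) {
  return aKind | (aTarget & kCodePointMask);
}

// Each conversion uses the stored code point only if the flag matches.
inline uint32_t ToLowerPacked(uint32_t aCh, uint32_t aEntry) {
  return (aEntry & kMapsToLower) ? (aEntry & kCodePointMask) : aCh;
}

inline uint32_t ToUpperPacked(uint32_t aCh, uint32_t aEntry) {
  return (aEntry & kMapsToUpper) ? (aEntry & kCodePointMask) : aCh;
}
```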
Comment 10 Marco Bonardo [::mak] (Away 6-20 Aug) 2012-03-15 07:55:27 PDT
https://hg.mozilla.org/mozilla-central/rev/edc0871b4e5b
