Closed Bug 210501 Opened 21 years ago Closed 12 years ago

case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)

Status:

RESOLVED FIXED

Milestone:

mozilla14

People

(Reporter: jshin1987, Assigned: jfkthame)

References

Details

(Keywords: intl)

Attachments

(2 files)

w-i-p (19 years ago, Simon Montagu :smontagu; 32.30 KB, patch)
patch, implement case mapping for the full Unicode repertoire (12 years ago, Jonathan Kew [:jfkthame]; 135.75 KB, patch; smontagu: review+)

Jungshik Shin

Reporter

Description

•

21 years ago

Christian asked about ToLower and ToUpper defined in intl/unicharutil.
They're implemented in two places, nsCaseConversionImpl.cpp and
nsUnicharUtils.h.

Simon and I have similar ideas about this. Either we need to change their
function signatures to accept and 'emit' PRUint32 (instead of PRUnichar) or make
new ones with PRUint32 and fix callers to select the 'right' ones depending on
the situation.
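
To illustrate the second option (new 32-bit functions next to the 16-bit ones), here is a minimal sketch. All names are hypothetical and are not the actual intl/unicharutil API, and the toy mapping only covers ASCII and the Deseret capitals. It assumes the caller holds UTF-16 text and wants surrogate pairs decoded before the per-code-point mapping is applied.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names throughout; not the actual intl/unicharutil API.

// Toy per-code-point mapping covering ASCII and Deseret only.
inline uint32_t ToLowerCodePoint(uint32_t aCh) {
  if (aCh >= 'A' && aCh <= 'Z') return aCh + 0x20;
  // Deseret capital letters U+10400..U+10427 map to U+10428..U+1044F.
  if (aCh >= 0x10400 && aCh <= 0x10427) return aCh + 0x28;
  return aCh;
}

inline bool IsHighSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
inline bool IsLowSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Lowercase a UTF-16 string, decoding surrogate pairs so that the
// 32-bit mapping is applied to whole code points.
inline std::u16string ToLowerUtf16(const std::u16string& aIn) {
  std::u16string out;
  out.reserve(aIn.size());
  for (std::size_t i = 0; i < aIn.size(); ++i) {
    char16_t c = aIn[i];
    if (IsHighSurrogate(c) && i + 1 < aIn.size() &&
        IsLowSurrogate(aIn[i + 1])) {
      uint32_t cp = 0x10000 + ((uint32_t(c) - 0xD800) << 10) +
                    (uint32_t(aIn[i + 1]) - 0xDC00);
      // In this toy mapping a non-BMP input always yields a non-BMP output.
      uint32_t lower = ToLowerCodePoint(cp) - 0x10000;
      out.push_back(char16_t(0xD800 + (lower >> 10)));
      out.push_back(char16_t(0xDC00 + (lower & 0x3FF)));
      ++i;  // skip the low surrogate we just consumed
    } else {
      // BMP character or unpaired surrogate: map the single unit.
      out.push_back(char16_t(ToLowerCodePoint(c)));
    }
  }
  return out;
}
```

Callers that only ever see single 16-bit units would keep using the existing signatures; callers that iterate over strings would be the ones to switch.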

Currently, only the Deseret alphabet (and possibly plane 14 language tags that are
kinda obsolete) among non-BMP characters needs case-folding (case conversion).
According to Simon, Math letters in plane 1 don't have case-folding.

As for IsUpper, IsLower and other character properties, I'll file a separate bug
with a summary line that will probably read 'need to update unicharutils to
Unicode 4.0'.

Jungshik Shin

Reporter

Comment 1

•

21 years ago

Adding titlecase.

Summary: case-folding support for non-BMP characters (ToUpper,ToLower) → case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase)

Jungshik Shin

Reporter

Comment 2

•

20 years ago

Related to this bug is bug 210502.

Summary: case-folding support for non-BMP characters (ToUpper,ToLower, ToTitlecase) → case-folding support for non-BMP (Unicode plane 1 and above) characters (ToUpper,ToLower, ToTitlecase)

Simon Montagu :smontagu

Updated

•

19 years ago

Blocks: 202208

Simon Montagu :smontagu

Comment 3

•

19 years ago

Attached patch: w-i-p

This still needs a lot of work.

Simon Montagu :smontagu

Updated

•

19 years ago

No longer blocks: 202208

Jungshik Shin

Reporter

Comment 4

•

19 years ago

Simon, what's the size of static arrays for case folding? It seems like it has
about 200 elements so that the memory footprint will increase by 200 * 3 *
2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
way to avoid that, but I guess it's not much worth the trouble.
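
The estimate can be checked with a few compile-time constants. This is only a sketch of the arithmetic; the figure of about 200 entries with three values each is taken from the estimate above, and the constant names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed figures from the estimate: about 200 entries, three values each.
constexpr std::size_t kEntries = 200;
constexpr std::size_t kValuesPerEntry = 3;

constexpr std::size_t TableBytes(std::size_t aValueSize) {
  return kEntries * kValuesPerEntry * aValueSize;
}

constexpr std::size_t kBytes16 = TableBytes(sizeof(uint16_t));  // 16-bit values
constexpr std::size_t kBytes32 = TableBytes(sizeof(uint32_t));  // 32-bit values
constexpr std::size_t kGrowth = kBytes32 - kBytes16;

// Widening each value by 2 bytes adds 200 * 3 * 2 = 1200 bytes.
static_assert(kGrowth == 1200, "widening adds 2 bytes per value");
```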

Jungshik Shin

Reporter

Comment 5

•

19 years ago

(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *

s/it has/they have/. In addition, most new characters in the queue for addition
to Unicode don't require case-folding, so those arrays will not grow
significantly in the future.

Simon Montagu :smontagu

Comment 6

•

19 years ago

(In reply to comment #4)
> Simon, what's the size of static arrays for case folding? It seems like it has
> about 200 elements so that the memory footprint will increase by 200 * 3 *
> 2bytes = 1.2kB(PRUint16 -> PRUint32). There might be a clever (and complicated)
> way to avoid that, but I guess it's not much worth the trouble.
> 

It occurs to me that since there is only the one range of non-BMP characters
with case folding, perhaps we could just put the low surrogates into the
arrays instead of the full UTF-32 forms, and then everything would Just Work
with no code changes or almost none. What do you think?
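
A minimal sketch of that idea, under stated assumptions: the function name is hypothetical, ASCII stands in for the existing BMP table, and the check on the preceding high surrogate is an addition of this sketch rather than something proposed in the comment. In UTF-16 the Deseret capitals U+10400..U+10427 are D801 DC00..DC27 and the small letters U+10428..U+1044F are D801 DC28..DC4F, so only the second code unit changes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: case-map Deseret by rewriting only the low surrogate.
inline std::u16string ToLowerUnits(const std::u16string& aIn) {
  std::u16string out(aIn);
  for (std::size_t i = 0; i < out.size(); ++i) {
    char16_t c = out[i];
    if (c >= u'A' && c <= u'Z') {
      // Stand-in for the existing BMP case table.
      out[i] = static_cast<char16_t>(c + 0x20);
    } else if (c >= 0xDC00 && c <= 0xDC27 && i > 0 && out[i - 1] == 0xD801) {
      // Low surrogate of a Deseret capital: shift to the small letter.
      // The high-surrogate check keeps other planes' pairs untouched.
      out[i] = static_cast<char16_t>(c + 0x28);
    }
  }
  return out;
}
```

Without the check on the preceding unit, a pair such as D800 DC00 (U+10000) would also be rewritten, which is why the sketch does not map low surrogates in isolation.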

Phil Ringnalda (:philor)

Updated

•

15 years ago

QA Contact: amyy → i18n

Jonathan Kew [:jfkthame]

Assignee

Comment 7

•

12 years ago

Attached patch: patch, implement case mapping for the full Unicode repertoire

Here's a possible approach, based on replacing the existing case table and mapping function with lookup tables added to nsUnicodeProperties, by extending the table-generation tool there. This provides upper/lower/titlecase mappings for the full Unicode character repertoire.

The data tables here are a bit larger than the old version would have been (about 10K or so). This is a deliberate tradeoff; by using this structure we can significantly simplify the code, completely eliminating nsCompressedMap with its binary-search of character ranges and cache of recently-used mappings - instead, all we have to do is a couple of simple array lookups. So we save some code size, and gain significantly faster case mappings (according to my timing tests on an OS X opt build, anyhow).
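
The "couple of simple array lookups" can be pictured as a two-level paged table. The sketch below is an illustration only: the page size, the table layout, and every name in it are assumptions and do not reflect the generated nsUnicodeProperties data. The toy data covers ASCII and the Deseret capitals.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical layout: the high bits of a code point select a page,
// the low bits select a slot within that page.
constexpr uint32_t kMaxCodePoint = 0x10FFFF;
constexpr unsigned kPageBits = 7;
constexpr uint32_t kPageSize = 1u << kPageBits;
constexpr uint32_t kPageMask = kPageSize - 1;

using Page = std::array<int32_t, kPageSize>;  // delta to the lowercase form

struct CaseTables {
  std::vector<uint8_t> pageIndex;  // one entry per 128-code-point block
  std::array<Page, 3> pages;       // page 0 is the shared "no mapping" page
};

inline const CaseTables& GetTables() {
  static const CaseTables tables = [] {
    CaseTables t;
    t.pageIndex.assign((kMaxCodePoint >> kPageBits) + 1, 0);
    for (auto& p : t.pages) p.fill(0);
    // Page 1: ASCII capitals.
    t.pageIndex[0x0000 >> kPageBits] = 1;
    for (uint32_t ch = 'A'; ch <= 'Z'; ++ch) t.pages[1][ch & kPageMask] = 0x20;
    // Page 2: Deseret capitals U+10400..U+10427.
    t.pageIndex[0x10400 >> kPageBits] = 2;
    for (uint32_t ch = 0x10400; ch <= 0x10427; ++ch) {
      t.pages[2][ch & kPageMask] = 0x28;
    }
    return t;
  }();
  return tables;
}

// Two array lookups, no searching and no cache.
inline uint32_t ToLower(uint32_t aCh) {
  if (aCh > kMaxCodePoint) return aCh;
  const CaseTables& t = GetTables();
  int32_t delta = t.pages[t.pageIndex[aCh >> kPageBits]][aCh & kPageMask];
  return static_cast<uint32_t>(static_cast<int32_t>(aCh) + delta);
}
```

Blocks with no case mappings all share page 0, which is what keeps the data from growing in proportion to the full code space.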

Attachment #602827 - Flags: review?(smontagu)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

12 years ago

Blocks: 605021

Simon Montagu :smontagu

Comment 8

•

12 years ago

Comment on attachment 602827
patch, implement case mapping for the full Unicode repertoire

Review of attachment 602827:
-----------------------------------------------------------------

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +215,5 @@
> +        my $upper = hex $fields[12];
> +        my $lower = hex $fields[13];
> +        my $title = hex $fields[14];
> +        # we only store one mapping for each character,
> +        # but also record what kind of mapping it is

This is rather devious, but I guess it works out as a good trade-off between data size and performance.
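
For readers unfamiliar with the trick, here is a rough sketch of what "one mapping plus its kind" could look like. The bit layout, the flag names, and the toy lookup function are all assumptions made for illustration; they are not taken from the patch. The example entries are the Deseret letters and the DŽ/Dž/dž digraph (U+01C4, U+01C5, U+01C6).

```cpp
#include <cstdint>

// Hypothetical encoding: the low 21 bits hold the single mapped code point,
// the high bits record which kind of mapping it is.
constexpr uint32_t kLowerToUpper = 0x80000000;
constexpr uint32_t kUpperToLower = 0x40000000;
constexpr uint32_t kTitleToUpper = 0x20000000;
constexpr uint32_t kCharMask = 0x001FFFFF;

// Toy stand-in for a generated table lookup.
inline uint32_t GetCaseMapValue(uint32_t aCh) {
  if (aCh >= 0x10400 && aCh <= 0x10427) return kUpperToLower | (aCh + 0x28);
  if (aCh >= 0x10428 && aCh <= 0x1044F) return kLowerToUpper | (aCh - 0x28);
  if (aCh == 0x01C4) return kUpperToLower | 0x01C6;
  if (aCh == 0x01C5) return kTitleToUpper | 0x01C4;
  if (aCh == 0x01C6) return kLowerToUpper | 0x01C4;
  return 0;  // no case mapping
}

inline uint32_t ToUpper(uint32_t aCh) {
  uint32_t v = GetCaseMapValue(aCh);
  if (v & (kLowerToUpper | kTitleToUpper)) return v & kCharMask;
  return aCh;
}

inline uint32_t ToLower(uint32_t aCh) {
  uint32_t v = GetCaseMapValue(aCh);
  if (v & kUpperToLower) return v & kCharMask;
  // A titlecase character stores only its uppercase form, so its
  // lowercase form is reached through that uppercase character.
  if (v & kTitleToUpper) return ToLower(v & kCharMask);
  return aCh;
}
```

The cost of storing a single value is the extra hop for titlecase characters, which are rare.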

Attachment #602827 - Flags: review?(smontagu) → review+

Jonathan Kew [:jfkthame]

Assignee

Comment 9

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/edc0871b4e5b

Assignee: smontagu → jfkthame

Target Milestone: --- → mozilla14

Marco Bonardo [:mak]

Comment 10

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/edc0871b4e5b

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Karl Tomlinson (:karlt)

Updated

•

12 years ago

Depends on: 736227


Categories

(Core :: Internationalization, defect)
