Closed Bug 121281 Opened 23 years ago Closed 23 years ago

need to change MacOS X collation code

Categories

(Core :: Internationalization, defect)

PowerPC
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.9.9

People

(Reporter: ftang, Assigned: nhottanscp)

Details

(Keywords: intl)

Attachments

(2 files, 5 obsolete files)

here are some reply form apple Here is some more info for you on estimating the space required for a CollationKey: At 5:31 PM -0800 1/16/02, Yung-Fong Tang wrote: >Then at least tell us how other application, such as "Mac OS Finder", prepare for the space ? Do you use a formual to estimate the length? or an algorithm to call the function several time as trial-and-error to find out the necessary length ? If I have a length 10 unicode string, how many memory I need to parepare before I call the function. I don't think the Finder uses the UCGetCollationKey (I think it just uses UCCompareText). But here is some more info on how to estimate the length required for a CollationKey. - First, you would need to estimate the length of the Unicode string if it undergoes the compatibility decomposition as specified by the Unicode Consortium (that is, apply all the compatibility decompositions, then recursively apply all the canonical decompositions). TEC can generate the canonical decomposition of an arbitrary Unicode string, but not (currently) the compatibility decomposition. - Then, you would need to multiply that length by 4 (to account for 4 different levels of sorting data) and add 3 (to account for the separators between the data for each level. The result is the value you should pass as maxKeySize, the maximum dimension of the collationKey. The worst case is typically Korean composed Hangul. A string of 10 composed Hangul could decompose into 30 conjoining jamo; multiplying this by 4 and then adding 3 gives 123 as the value that you should pass for maxKeySize. This does seem rather large, especially since this is a dimension for an array of UInt32. We will look into reducing the size, but that is the current situation. >Do you need this kind of memory requirment for all different collating options? can one option combination require less memory and the other require more ? for certain collating operation, the precision is less important than the memory requirement. Is that possible you can use less memory for some option combination and use more for the others? The maximum does depend on the type of characters in the string as well as the selected options. However, the worst case value (for a string of all composed Hangul) is the same for all options currently. >>If LCMapStringW generates something much smaller, I have to suspect that it may not be supporting non-BMP characters, etc. >> >Probably, but why the caller need to pay for such cost if the data it pass in do NOT have surrogate characters? Should you only require such cost when the data have surrogate characters ? The requirements for the key are based not just on what the characters are in the string it is generated from, but on all the other possible strings that could be used to generate keys that it may be compared with. So if some CollationKey X is generated only from BMP characters, but it could be compared with another key Y that was generated from text that had non-BMP characters, then the sorting values used in X must handle non-BMP characters too. -Peter -- ---------------------------------------------- Peter Edberg . . . . Apple Computer, Inc. Mac OS Engineering: International & Text Group [below written by ftang] based on the description above, I think UCGetCollationKey is not ready for prime time yet. We should swtich back to use either the old MacOS 9 code or use only UCCompareText instead.
->nhotta
Assignee: yokoyama → nhotta
Keywords: intl
Status: NEW → ASSIGNED
Implementation Note for MacOS X collation API problems: MacOS X collation API requires the caller to allocate the memory for a key but the API to estimate the correct key length is not provided. Here the list of possible problems. 1) The key may be truncated if the estimation is not correct (correctness). 2) Larger buffer may be allocated than the actual key length (memory foot print). 3) Takes more time if the input string has to be preprocessed (e.g. decomposition) to get a correct estimation of the buffer length (performance). options: 1) Turn off and use MacOS 9 collation. 2) Use an original string as key and use CompareString in the comparison code (decrease performance). 3) Loop and increase the buffer until the API returns success (increase memory foot print). 4) Use fixed length buffer and ignore the error for the large input string (i.e. the result key is truncated after it exceeds a certain length). no performance loss, foot print improve by allocating for an actual key length instead of the estimated lenght, correctness problem for an input string larger than the internal buffer. The current plan is to take the fourth option to pass the fix buffer for the MacOS collation API. Using that option, we can allocate the exact memory after calling the API with the fixed buffer. This can be done by calling the MacOS API in our length estimation API. In addition to return the estimated length, we can cache the result key to avoid calling the collation API again in our create key API. One problem is that the key could be truncated if its length exceeds the fixed buffer. Usually, that is not significant, the strings can be differenciated by first 10 characters or so. The current plan is to have a buffer for 512 characters which is 512 * 4 = 2048 bytes.
Keywords: nsbeta1
Hardware: PC → All
Target Milestone: --- → mozilla0.9.9
Hardware: All → Macintosh
Summary: need to change mac os x collation code → need to change MacOS X collation code
Attached patch repost diff with '-u' option (obsolete) — Splinter Review
Attachment #68473 - Attachment is obsolete: true
Attachment #68476 - Attachment is obsolete: true
Comment on attachment 68647 [details] [diff] [review] Fixed to make sure to set out length for all the cases. r=ftang looks good after change according to earily feedback
Attachment #68647 - Flags: review+
Nit: +const PRUint32 cCacheSize = 80; Usually prefer 'k' for const prefix, so this becomes kCacheSize. This could also be more descriptive -- what cache? kCollationKeyCacheSize ? + // Set the default value for an output length. // According to the document, the length of the key should typically be // at least 5 * textLength "According to the documentation" + if (mCachedKeyLen && + !memcmp((void *) mCachedString, + (void *) (const UniChar*) PromiseFlatString(stringIn).get(), mCachedKeyLen)) Is this memcmp actually required, since it could be somewhat expensive.
>Is this memcmp actually required, since it could be somewhat expensive. This is for in case if the input string is different from the one which was used in the length estimation function, but we do not have this case by the existing callers.
> but we do not have this case by the existing callers. but are there ways of calling the interfaces that would allow it? If not, then just use NS_ASSERTION and only slow down debug builds.
Attachment #68647 - Attachment is obsolete: true
There are separate functions to get a key and to create a key, so it is possible.
Comment on attachment 68905 [details] [diff] [review] Added/changed comment, use 'k' for constant. sr=sfraser
Attachment #68905 - Flags: superreview+
Attachment #68905 - Flags: review+
The patch was checked in but I found a problem. I will attach a patch.
Comment on attachment 69188 [details] [diff] [review] Use a length of the input string to determine if the key is cached. r=ftang
Attachment #69188 - Flags: review+
Simon, could you review the last patch?
Shouldn't you test for length equality first, and only then do the memcmp?
I agree. I will revise the patch (I need to remember the cached string length).
I changed to store the length and do length check before the memcmp. Also increase the string length to remember because I saw many messages in my inbox still not cached with the size 80. Simon, please review the latest patch.
Comment on attachment 69300 [details] [diff] [review] Changed to store the input string lengh to use for length checking later, also increased the number of characters to cache. r=ftang
+ *outLen = (1 + mCachedStringLen) * 5 * sizeof(UCCollationValue); + kCacheSize * 5, + PRUint32 keyLength = (1 + stringInLen) * 5 * sizeof(UCCollationValue); What is this magic number 5 that keeps cropping up? How about a const for that?
Attachment #69300 - Attachment is obsolete: true
Simon, please review the latest patch.
Comment on attachment 69520 [details] [diff] [review] Changed to use a constant. Thanks. sr=sfraser (and moving r= forward)
Attachment #69520 - Flags: superreview+
Attachment #69520 - Flags: review+
checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Should IQA be involved in this? if so, could you please provide some test case?
Switching qa contact to ftank for now. Frank, can you verify this bug within development or provide IQA with a test case? Thanks.
QA Contact: ruixu → ftang
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: