Closed Bug 415209 Opened 17 years ago Closed 6 years ago

Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla65

People

(Reporter: BenB, Assigned: smontagu)

References

Details

(Keywords: intl)

Attachments

(2 files)

We have isAsciiAlpha() which gives true for a-zA-Z and false for all else. Not even 'ü' etc., which is in ISO-8859-1 in the upper byte, is recognized. Much less in UTF-8, or Japanese etc.. isUTFAlpha() *must* return false for whitespace, special characters like '!', ',', '"' (also in their Unicode variants), symbols (Wingdings etc.) etc.. It should return true only for what is considered a letter in an alphabeth, i.e. a-zA-Z, ü, ß, accented characters (à, ô etc.), Japanese/Chinese/Korean word signs etc.. If in doubt, return false. This implies that it needs a long list of characters, or at least list of character ranges, which hinders its feasiblity.
Blocks: 106028
Summary: Need isUTFAlpha() for Unicode/non-ASCII alpha characters → Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters
See what spellchecker does at http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/extensions/spellcheck/src/mozEnglishWordUtils.cpp&rev=1.11&mark=157-161#155 I don't know what the comment "this needs vast improvement" refers to exactly. There certainly has been *some* improvement since that comment was written. The full list of characters which this would return true for is the combination of: http://www.fileformat.info/info/unicode/category/Lu/list.htm http://www.fileformat.info/info/unicode/category/Ll/list.htm http://www.fileformat.info/info/unicode/category/Lt/list.htm http://www.fileformat.info/info/unicode/category/Lm/list.htm http://www.fileformat.info/info/unicode/category/Lo/list.htm ... but the last URI is a little misleading: it lists U+3400 <CJK Ideograph Extension A, First> U+4DB5 <CJK Ideograph Extension A, Last> U+4E00 <CJK Ideograph, First> U+9FBB <CJK Ideograph, Last> U+AC00 <Hangul Syllable, First> U+D7A3 <Hangul Syllable, Last> U+20000 <CJK Ideograph Extension B, First> U+2A6D6 <CJK Ideograph Extension B, Last> as if these were all individual characters, when in fact the whole range from each "First" to the corresponding "Last" are defined characters with the Lo category. This makes the "# of characters 7963" at http://www.fileformat.info/info/unicode/category/Lo/index.htm much too small.
OS: Linux → All
Hardware: x86 → All
Simon (assignee), it would be great if you could pick this up again and finish it, and perhaps it will be even easier to do now as there might be some new helpful code/lists etc. since this was filed. Bug 106028 which depends on this bug features 16 duplicates currently.
Flags: needinfo?(smontagu)
Attached patch 415209.diffSplinter Review
This builds, but I have no clue how to test it
Attachment #796907 - Flags: review?(jfkthame)
Flags: needinfo?(smontagu)
The latest patch in bug 106028 includes a unit test for this.
Comment on attachment 796907 [details] [diff] [review] 415209.diff Review of attachment 796907 [details] [diff] [review]: ----------------------------------------------------------------- I think I'd prefer to see these defined in nsUnicodeProperties.h (e.g. adding them right after GetGenCategory), so that other potential users are more likely to run across and use them rather than reinventing them. I'd also favor names that make it more explicit exactly what they're testing. Something like "bool IsUnicodeLetterCategory(uint32_t aChar)" and "bool IsUnicodeNumberCategory(uint32_t aChar)", maybe? I know it's verbose, but seeing plain IsAlpha() in the code would tend to leave me wondering whether the function is in fact unicode-aware, or some legacy ASCII-only thing. And "letter category" rather than "alpha", as the latter could be understood to exclude Han ideographs (which are not "alphabetic", although they are categorized as "letters"). These would be defined as inline functions in the mozilla::unicode namespace, and then imported with "using" to reduce verbosity at the point of use. If you're OK with that approach, I think it's likely to be more widely useful.
I prefer more generic names IsAlpha(), because I don't want all callers to be change each time the function is changed. IsAlpha() or IsLetter() is expresses exactly what the caller wants. Whether it implements ASCII or Unicode is then up to the function.
Status: NEW → ASSIGNED
(In reply to Ben Bucksch (:BenB) from comment #7) > I prefer more generic names IsAlpha(), because I don't want all callers to > be change each time the function is changed. IsAlpha() or IsLetter() is > expresses exactly what the caller wants. Whether it implements ASCII or > Unicode is then up to the function. I'm not sure I understand your concern here. Why would we expect to repeatedly change them? "Whether it implements ASCII or Unicode" is a critical thing for the caller to know, IMO ... bugs happen if people make false assumptions about what's actually implemented by a vaguely-named API. I suppose mozilla::unicode::IsLetter() and mozilla::unicode::IsNumber() would be OK, as the mozilla::unicode namespace provides a good clue that these are Unicode-based functions. We could then import them with "using namespace..." and use the short names if desired, but that would be up to the individual caller.
(In reply to Jonathan Kew (:jfkthame) from comment #8) +1
jfkthame, the functions are local to the file.
See first point in comment 6.
It's great to see progress on this, thanks to everyone involved.
jfkthame, if you want to extract the functions into another file, that's fine with me and you can call them however you want. In that case, I would still like to have functions IsAlpha() and IsDigit() in mozTXTToHTMLConv.cpp, in order to be able to improve or exchange the implementation without modifying all callers.
(No speed penality, because such one-line functions should be inlined by the compiler, or we can mark them "inline".)
No longer blocks: 106028
Blocks: 106028
I've just updated your bitrotted patch.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Attachment #796907 - Flags: review?(jfkthame)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: