Closed
Bug 415209
Opened 17 years ago
Closed 6 years ago
Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla65
People
(Reporter: BenB, Assigned: smontagu)
References
Details
(Keywords: intl)
Attachments
(2 files)
1.80 KB,
patch
|
Details | Diff | Splinter Review | |
1.87 KB,
patch
|
Details | Diff | Splinter Review |
We have isAsciiAlpha() which gives true for a-zA-Z and false for all else.
Not even 'ü' etc., which is in ISO-8859-1 in the upper byte, is recognized. Much less in UTF-8, or Japanese etc..
isUTFAlpha() *must* return false for whitespace, special characters like '!', ',', '"' (also in their Unicode variants), symbols (Wingdings etc.) etc..
It should return true only for what is considered a letter in an alphabeth, i.e. a-zA-Z, ü, ß, accented characters (à, ô etc.), Japanese/Chinese/Korean word signs etc..
If in doubt, return false.
This implies that it needs a long list of characters, or at least list of character ranges, which hinders its feasiblity.
Reporter | ||
Updated•17 years ago
|
Summary: Need isUTFAlpha() for Unicode/non-ASCII alpha characters → Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters
Assignee | ||
Comment 1•17 years ago
|
||
See what spellchecker does at http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/extensions/spellcheck/src/mozEnglishWordUtils.cpp&rev=1.11&mark=157-161#155
I don't know what the comment "this needs vast improvement" refers to exactly. There certainly has been *some* improvement since that comment was written.
The full list of characters which this would return true for is the combination of:
http://www.fileformat.info/info/unicode/category/Lu/list.htm
http://www.fileformat.info/info/unicode/category/Ll/list.htm
http://www.fileformat.info/info/unicode/category/Lt/list.htm
http://www.fileformat.info/info/unicode/category/Lm/list.htm
http://www.fileformat.info/info/unicode/category/Lo/list.htm
... but the last URI is a little misleading: it lists
U+3400 <CJK Ideograph Extension A, First>
U+4DB5 <CJK Ideograph Extension A, Last>
U+4E00 <CJK Ideograph, First>
U+9FBB <CJK Ideograph, Last>
U+AC00 <Hangul Syllable, First>
U+D7A3 <Hangul Syllable, Last>
U+20000 <CJK Ideograph Extension B, First>
U+2A6D6 <CJK Ideograph Extension B, Last>
as if these were all individual characters, when in fact the whole range from each "First" to the corresponding "Last" are defined characters with the Lo category. This makes the "# of characters 7963" at http://www.fileformat.info/info/unicode/category/Lo/index.htm much too small.
Updated•12 years ago
|
OS: Linux → All
Hardware: x86 → All
Comment 2•11 years ago
|
||
Simon (assignee), it would be great if you could pick this up again and finish it, and perhaps it will be even easier to do now as there might be some new helpful code/lists etc. since this was filed.
Bug 106028 which depends on this bug features 16 duplicates currently.
Flags: needinfo?(smontagu)
Assignee | ||
Comment 3•11 years ago
|
||
This builds, but I have no clue how to test it
Attachment #796907 -
Flags: review?(jfkthame)
Flags: needinfo?(smontagu)
Comment 4•11 years ago
|
||
Maybe easiest to test it in connections with your fix for bug 106028.
See http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/test/unit/test_mozTXTToHTMLConv.js
Assignee | ||
Comment 5•11 years ago
|
||
The latest patch in bug 106028 includes a unit test for this.
Comment 6•11 years ago
|
||
Comment on attachment 796907 [details] [diff] [review]
415209.diff
Review of attachment 796907 [details] [diff] [review]:
-----------------------------------------------------------------
I think I'd prefer to see these defined in nsUnicodeProperties.h (e.g. adding them right after GetGenCategory), so that other potential users are more likely to run across and use them rather than reinventing them.
I'd also favor names that make it more explicit exactly what they're testing. Something like "bool IsUnicodeLetterCategory(uint32_t aChar)" and "bool IsUnicodeNumberCategory(uint32_t aChar)", maybe? I know it's verbose, but seeing plain IsAlpha() in the code would tend to leave me wondering whether the function is in fact unicode-aware, or some legacy ASCII-only thing. And "letter category" rather than "alpha", as the latter could be understood to exclude Han ideographs (which are not "alphabetic", although they are categorized as "letters").
These would be defined as inline functions in the mozilla::unicode namespace, and then imported with "using" to reduce verbosity at the point of use.
If you're OK with that approach, I think it's likely to be more widely useful.
Reporter | ||
Comment 7•11 years ago
|
||
I prefer more generic names IsAlpha(), because I don't want all callers to be change each time the function is changed. IsAlpha() or IsLetter() is expresses exactly what the caller wants. Whether it implements ASCII or Unicode is then up to the function.
Updated•11 years ago
|
Status: NEW → ASSIGNED
Comment 8•11 years ago
|
||
(In reply to Ben Bucksch (:BenB) from comment #7)
> I prefer more generic names IsAlpha(), because I don't want all callers to
> be change each time the function is changed. IsAlpha() or IsLetter() is
> expresses exactly what the caller wants. Whether it implements ASCII or
> Unicode is then up to the function.
I'm not sure I understand your concern here. Why would we expect to repeatedly change them?
"Whether it implements ASCII or Unicode" is a critical thing for the caller to know, IMO ... bugs happen if people make false assumptions about what's actually implemented by a vaguely-named API.
I suppose mozilla::unicode::IsLetter() and mozilla::unicode::IsNumber() would be OK, as the mozilla::unicode namespace provides a good clue that these are Unicode-based functions. We could then import them with "using namespace..." and use the short names if desired, but that would be up to the individual caller.
Comment 9•11 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #8)
+1
Reporter | ||
Comment 10•11 years ago
|
||
jfkthame, the functions are local to the file.
Comment 11•11 years ago
|
||
See first point in comment 6.
Comment 12•11 years ago
|
||
It's great to see progress on this, thanks to everyone involved.
Reporter | ||
Comment 13•11 years ago
|
||
jfkthame, if you want to extract the functions into another file, that's fine with me and you can call them however you want.
In that case, I would still like to have functions IsAlpha() and IsDigit() in mozTXTToHTMLConv.cpp, in order to be able to improve or exchange the implementation without modifying all callers.
Reporter | ||
Comment 14•11 years ago
|
||
(No speed penality, because such one-line functions should be inlined by the compiler, or we can mark them "inline".)
Comment 15•8 years ago
|
||
I've just updated your bitrotted patch.
Comment 16•6 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Updated•6 years ago
|
Attachment #796907 -
Flags: review?(jfkthame)
You need to log in
before you can comment on or make changes to this bug.
Description
•