Closed Bug 415209 Opened 17 years ago Closed 6 years ago

Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla65

People

(Reporter: BenB, Assigned: smontagu)

References

Details

(Keywords: intl)

Attachments

(2 files)

415209.diff 11 years ago Simon Montagu :smontagu 1.80 KB, patch		Details \| Diff \| Splinter Review
415209_v1b.diff de-bitrotted version of 415209.diff 8 years ago Alfred Peters [:infofrommozilla] 1.87 KB, patch		Details \| Diff \| Splinter Review

Ben Bucksch (:BenB)

Reporter

Description

•

17 years ago

We have isAsciiAlpha() which gives true for a-zA-Z and false for all else. Not even 'ü' etc., which is in ISO-8859-1 in the upper byte, is recognized. Much less in UTF-8, or Japanese etc.. isUTFAlpha() *must* return false for whitespace, special characters like '!', ',', '"' (also in their Unicode variants), symbols (Wingdings etc.) etc.. It should return true only for what is considered a letter in an alphabeth, i.e. a-zA-Z, ü, ß, accented characters (à, ô etc.), Japanese/Chinese/Korean word signs etc.. If in doubt, return false. This implies that it needs a long list of characters, or at least list of character ranges, which hinders its feasiblity.

Ben Bucksch (:BenB)

Reporter

Updated

•

17 years ago

Blocks: 106028

Ben Bucksch (:BenB)

Reporter

Updated

•

17 years ago

Summary: Need isUTFAlpha() for Unicode/non-ASCII alpha characters → Need isUnicodeAlpha() for Unicode/non-ASCII alpha characters

Simon Montagu :smontagu

Assignee

Comment 1

•

17 years ago

See what spellchecker does at http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/extensions/spellcheck/src/mozEnglishWordUtils.cpp&rev=1.11&mark=157-161#155 I don't know what the comment "this needs vast improvement" refers to exactly. There certainly has been *some* improvement since that comment was written. The full list of characters which this would return true for is the combination of: http://www.fileformat.info/info/unicode/category/Lu/list.htm http://www.fileformat.info/info/unicode/category/Ll/list.htm http://www.fileformat.info/info/unicode/category/Lt/list.htm http://www.fileformat.info/info/unicode/category/Lm/list.htm http://www.fileformat.info/info/unicode/category/Lo/list.htm ... but the last URI is a little misleading: it lists U+3400 <CJK Ideograph Extension A, First> U+4DB5 <CJK Ideograph Extension A, Last> U+4E00 <CJK Ideograph, First> U+9FBB <CJK Ideograph, Last> U+AC00 <Hangul Syllable, First> U+D7A3 <Hangul Syllable, Last> U+20000 <CJK Ideograph Extension B, First> U+2A6D6 <CJK Ideograph Extension B, Last> as if these were all individual characters, when in fact the whole range from each "First" to the corresponding "Last" are defined characters with the Lo category. This makes the "# of characters 7963" at http://www.fileformat.info/info/unicode/category/Lo/index.htm much too small.

Magnus Melin [:mkmelin]

Updated

•

12 years ago

OS: Linux → All

Hardware: x86 → All

Thomas D. (:thomas8)

Comment 2

•

11 years ago

Simon (assignee), it would be great if you could pick this up again and finish it, and perhaps it will be even easier to do now as there might be some new helpful code/lists etc. since this was filed. Bug 106028 which depends on this bug features 16 duplicates currently.

Flags: needinfo?(smontagu)

Simon Montagu :smontagu

Assignee

Comment 3

•

11 years ago

Attached patch 415209.diff — Details — Splinter Review

This builds, but I have no clue how to test it

Attachment #796907 - Flags: review?(jfkthame)

Flags: needinfo?(smontagu)

Magnus Melin [:mkmelin]

Comment 4

•

11 years ago

Maybe easiest to test it in connections with your fix for bug 106028. See http://mxr.mozilla.org/comm-central/source/mozilla/netwerk/test/unit/test_mozTXTToHTMLConv.js

Simon Montagu :smontagu

Assignee

Comment 5

•

11 years ago

The latest patch in bug 106028 includes a unit test for this.

Jonathan Kew [:jfkthame]

Comment 6

•

11 years ago

Comment on attachment 796907 [details] [diff] [review] 415209.diff Review of attachment 796907 [details] [diff] [review]: ----------------------------------------------------------------- I think I'd prefer to see these defined in nsUnicodeProperties.h (e.g. adding them right after GetGenCategory), so that other potential users are more likely to run across and use them rather than reinventing them. I'd also favor names that make it more explicit exactly what they're testing. Something like "bool IsUnicodeLetterCategory(uint32_t aChar)" and "bool IsUnicodeNumberCategory(uint32_t aChar)", maybe? I know it's verbose, but seeing plain IsAlpha() in the code would tend to leave me wondering whether the function is in fact unicode-aware, or some legacy ASCII-only thing. And "letter category" rather than "alpha", as the latter could be understood to exclude Han ideographs (which are not "alphabetic", although they are categorized as "letters"). These would be defined as inline functions in the mozilla::unicode namespace, and then imported with "using" to reduce verbosity at the point of use. If you're OK with that approach, I think it's likely to be more widely useful.

Ben Bucksch (:BenB)

Reporter

Comment 7

•

11 years ago

I prefer more generic names IsAlpha(), because I don't want all callers to be change each time the function is changed. IsAlpha() or IsLetter() is expresses exactly what the caller wants. Whether it implements ASCII or Unicode is then up to the function.

Thomas D. (:thomas8)

Updated

•

11 years ago

Status: NEW → ASSIGNED

Jonathan Kew [:jfkthame]

Comment 16

•

6 years ago

Fixed by bug 1505911: https://hg.mozilla.org/mozilla-central/rev/0c40ca38958f

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla65

Jorg K (CEST = GMT+2)

Updated

•

6 years ago

Attachment #796907 - Flags: review?(jfkthame)

You need to log in before you can comment on or make changes to this bug.