Closed
Bug 258974
Opened 20 years ago
Closed 8 years ago
Add Unicode character class patterns to JavaScript Regular Expressions (RegExp)
Categories
(Core :: JavaScript Engine, enhancement)
RESOLVED
DUPLICATE
of bug 1361876
mozilla1.9alpha1
People
(Reporter: gekacheka, Assigned: tedders1)
References
Details
(Keywords: intl)
Attachments
(1 file)
522 bytes,
application/vnd.mozilla.xul+xml
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040910 Firefox/0.10
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040910 Firefox/0.10
JavaScript RegExp only supports ASCII character classes, so it cannot easily be
used for internationalized forms/apps where users may enter non-ASCII
characters.
While it is possible to explicitly list character ranges such as
[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BA...]+ for a small number of locales,
it is tedious and impractical to list them for many/all locales.
(e.g., Alphabetic letters cover 429(!) ranges within 0000..FFFF listed in
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt)
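To make the workaround concrete, here is a minimal sketch of the explicit-range approach. It covers only Basic Latin plus the Latin-1 Supplement letters, so the range list is deliberately incomplete. The names `LATIN1_WORD` and `isLatin1Word` are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: accepts strings made only of Basic Latin and Latin-1 letters.
// Ranges: A-Z, a-z, U+00C0-U+00D6, U+00D8-U+00F6, U+00F8-U+00FF.
// U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped on purpose.
const LATIN1_WORD = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

function isLatin1Word(text) {
  return LATIN1_WORD.test(text);
}

console.log(isLatin1Word("A\u00F1o"));            // true: U+00F1 is inside the listed ranges
console.log(isLatin1Word("\u0141\u00F3d\u017A")); // false: U+0141 and U+017A are not listed
```

Every additional script or block needs more ranges appended by hand, which is the maintenance burden described above.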
Reproducible: Always
Steps to Reproduce:
1. JavaScript console: new RegExp("^\\p{L}+([\\d\\p{Ll}]+)$").exec("Año12a3b")
Actual Results:
Error: invalid quantifier {
Expected Results:
Año12a3b, 12a3b
(returned array elements)
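For comparison, engines that implement the property escapes later standardized in ECMAScript 2018 produce the expected result once the `u` flag is supplied. This is a sketch assuming such an engine; it does not describe the build reported here, and `matchWord` is a hypothetical helper name.

```javascript
// Assumes an engine with ECMAScript 2018 Unicode property escapes.
// The "u" flag is required; without it, \p is not treated as a property escape.
function matchWord(text) {
  const pattern = new RegExp("^\\p{L}+([\\d\\p{Ll}]+)$", "u");
  const result = pattern.exec(text);
  // Copy into a plain array so only the matched strings remain.
  return result === null ? null : Array.from(result);
}

console.log(matchWord("A\u00F1o12a3b")); // [ 'Año12a3b', '12a3b' ]
console.log(matchWord("12"));            // null: the string must start with a letter
```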
ICU provides an open source RegExp that implements patterns such as \p{L} to
match any Unicode letter.
http://oss.software.ibm.com/icu/
http://oss.software.ibm.com/icu/userguide/regexp.html
http://oss.software.ibm.com/icu/userguide/icufaq.html#faq-intro-5
(It is not a drop-in replacement, e.g., it does not implement the /g flag, but
might be wrapped in compatibility code to implement the /g.)
Java java.util.regex.Pattern
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
Perl
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html
Python: changes the definition of \w depending on locale or unicode flag
http://docs.python.org/lib/re-syntax.html
Updated•20 years ago
Assignee: smontagu → general
Component: Internationalization → JavaScript Engine
QA Contact: amyy → pschwartau
Comment 1•20 years ago
Comment 2•20 years ago
Can the JS engine use nsIUGenCategory? Maybe not... If not, we may have to find a
way to share data and code somehow.
nsIUGenDetailCategory is not yet implemented, but nsIUGenCategory is implemented
(although it's not up to date. I'm gonna update it soon).
Keywords: intl
Comment 3•20 years ago
jshin: the JS engine has its own character classification tables, see
http://lxr.mozilla.org/mozilla/source/js/src/jsstr.c#2944. This bug is really
against ECMA-262, looking for a reasonable extension to Edition 3. I'll add
this to the list that ECMA TG1 should address for Edition 4.
/be
Comment 4•20 years ago
(In reply to comment #3)
> jshin: the JS engine has its own character classification tables, see
> http://lxr.mozilla.org/mozilla/source/js/src/jsstr.c#2944.
Gosh, those tables seem to have hardly changed since 1998, when the latest
version of Unicode was 2.0
Comment 5•20 years ago
smontagu: indeed -- but how would they change? Aren't the changes since then to
the non-BMP planes, which JS blissfully ignores per ECMA-262 Edition 3?
/be
Comment 6•20 years ago
A whole bunch of characters (at least a few thousand) have been added to the
BMP since 1998, and the properties of some characters that were in Unicode 2.0
have changed.
Updated•19 years ago
OS: Windows 2000 → All
Hardware: PC → All
Target Milestone: --- → mozilla1.9alpha
Comment 8•16 years ago
Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited in comment 6. Need a spin-off bug.
/be
Comment 9•16 years ago
(In reply to comment #8)
> Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited
> in comment 6. Need a spin-off bug.
That would be bug 394604, no?
Comment 10•16 years ago
(In reply to comment #9)
> (In reply to comment #8)
> > Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited
> > in comment 6. Need a spin-off bug.
>
> That would be bug 394604, no?
Thanks!
This bug (here, the one I'm commenting in: bug 258974) seems misplaced. It would be better filed at http://bugs.ecmascript.org/. We can keep this bugzilla report alive to track a ticket there. I'll file shortly.
/be
Reporter
Comment 11•16 years ago
The ECMAScript 4th edition Draft 2 spec includes Unicode character classes
such as [\p{L}]
http://wiki.ecmascript.org/lib/exe/fetch.php?id=spec%3Aspec&cache=cache&media=spec:library-d2.html#RegExp%20grammar
Fixed spec bug: Implement Unicode character class matching
http://bugs.ecmascript.org/ticket/45
original proposal wiki
http://wiki.ecmascript.org/doku.php?id=proposals:extend_regexps#extending_regexps_for_unicode_ranges
original proposal wiki discussion
http://wiki.ecmascript.org/doku.php?id=discussion:extend_regexps#extending_regexps_for_unicode_ranges
Comment 13•11 years ago
Any news about this?
UTS #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/index.html
Comment 14•11 years ago
(In reply to lennart.borgman from comment #13)
> Any news about this?
>
> UTS #18: Unicode Regular Expressions
> http://www.unicode.org/reports/tr18/index.html
ECMAScript 6 doesn't specify Unicode handling in regular expressions. IIUC, a future version of the ECMAScript internationalization spec will rectify that. CCing Norbert for verification.
Comment 15•11 years ago
(In reply to Till Schneidereit [:till] from comment #14)
Thanks Schneidereit.
Comment 16•11 years ago
ES6 specifies *some* Unicode handling in regular expressions, e.g., case insensitive matching and the new /u flag for supporting the full Unicode character set including supplementary characters. But it still doesn't specify support for character properties or character classes outside of ASCII. To the extent that such support is added to regular expressions, it would most likely have to happen in the ECMAScript Language Specification, not the Internationalization API Specification.
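The effect of the `/u` flag on supplementary characters can be seen in a short sketch, assuming an ES6-capable engine. `matchesSingleChar` and `GRINNING_FACE` are hypothetical names used only for this illustration.

```javascript
// U+1F600 lies outside the BMP, so a JavaScript string stores it as two UTF-16 code units.
const GRINNING_FACE = "\u{1F600}";

// Tests whether the whole string is matched by a single "." under the given flags.
function matchesSingleChar(text, flags) {
  return new RegExp("^.$", flags).test(text);
}

console.log(GRINNING_FACE.length);                  // 2
console.log(matchesSingleChar(GRINNING_FACE, ""));  // false: "." consumes one code unit
console.log(matchesSingleChar(GRINNING_FACE, "u")); // true: "." consumes one code point
```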
Updated•10 years ago
Assignee: general → nobody
Assignee
Comment 17•10 years ago
Okay, I'm about to start working on this, because it would greatly help with some bidi work I'm about to do for Gaia.
To be clear, this isn't part of ES6, although it is a Harmony proposal. That shouldn't stop us. For example, we implement the /y "sticky" flag for regular expressions, even though it too is only a proposal and not part of ES6. This feature has been a part of Perl and Java for more than a decade, and is long overdue for JavaScript. (This bugzilla ticket was created in 2004!) It is also recommended by the Unicode committee in UTS #18.
Because we don't yet know what will end up in the ECMAScript standard, I intend to implement only a very minimal subset of the notations supported by Perl and Java.
Specifically, I intend to implement the \p{property=value} and \P{property=value} notations, with the following details:
* Either an equal-sign or a colon can be used.
* Both the 'property' and the 'value' are case-insensitive, and any spaces, hyphens, or underscores will be ignored.
* The braces are not optional.
* For the names of 'property' and 'value', I will support the variant names given in PropertyAliases.txt and PropertyValueAliases.txt, so you can say either \p{Bidi_Class=Left_to_Right} or \p{bc=l}.
* For binary properties (which can only be true or false), the notation would be \p{property} or \P{property}, with no 'value' specified. So you would just write \p{Upper}, not \p{Upper=True} or \p{Upper: Yes}.
* The 'property' can be omitted if it is General_Category (gc). So you can write \p{Lu} instead of \p{gc=Lu} or \p{General_Category=Uppercase_Letter}.
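The notation rules above can be sketched as a small parser for the text between the braces. This is only an illustration of the proposal as described in this comment, not the eventual implementation. The alias tables are a tiny hypothetical excerpt in the spirit of PropertyAliases.txt and PropertyValueAliases.txt, binary properties are left out, and the names `loosen` and `parsePropertyExpression` are invented for the example.

```javascript
// Hypothetical excerpt of the alias data, keyed by loosely normalized names.
const PROPERTY_ALIASES = new Map([
  ["gc", "General_Category"], ["generalcategory", "General_Category"],
  ["sc", "Script"], ["script", "Script"],
  ["bc", "Bidi_Class"], ["bidiclass", "Bidi_Class"],
]);
const VALUE_ALIASES = new Map([
  ["General_Category", new Map([["lu", "Uppercase_Letter"], ["uppercaseletter", "Uppercase_Letter"]])],
  ["Script", new Map([["latn", "Latin"], ["latin", "Latin"]])],
  ["Bidi_Class", new Map([["l", "Left_To_Right"], ["lefttoright", "Left_To_Right"]])],
]);

// Loose matching: ignore case, spaces, hyphens, and underscores.
function loosen(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

// Parses the body of \p{...}; returns null when the property or value is unknown.
function parsePropertyExpression(body) {
  const parts = body.split(/[=:]/); // either "=" or ":" separates property and value
  if (parts.length > 2) return null;
  // With no explicit property, General_Category is assumed.
  const property = parts.length === 2
    ? PROPERTY_ALIASES.get(loosen(parts[0]))
    : "General_Category";
  if (property === undefined) return null;
  const value = VALUE_ALIASES.get(property).get(loosen(parts[parts.length - 1]));
  return value === undefined ? null : { property, value };
}

console.log(parsePropertyExpression("Bidi_Class=Left_to_Right"));
console.log(parsePropertyExpression("bc=l"));
console.log(parsePropertyExpression("Lu"));
```

Under these assumptions the first two calls resolve to the same property and value, and the third resolves to General_Category with the value Uppercase_Letter.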
At this point, the only properties I intend to support are General_Category, Script, and Bidi_Class. General_Category and Script are probably the two which are most needed by developers. I need Bidi_Class for my bidi work.
Other properties which would be easy to implement (meaning that our code in intl/unicharprops/util/ currently supports them) are East_Asian_Width, Vertical_Orientation, Numeric_Value, Combining_Class, Hangul_Syllable_Type, Bidi_Mirror, and Bidi_Control. However, I don't know if there's much developer demand for them.
I wanted to support the Uppercase, Lowercase, and Alphabetic properties, but it turns out our C++ code doesn't currently keep track of those properties. (In practice, Lu, Ll, and L are usually sufficient.) Also note that our code doesn't currently support the Block or Name properties.
Note that I don't intend to support the following notations:
* Using the caret for negation, e.g. \p{Script=^Latin}. Perl supports this, but Java does not. Also, you can just use \P{Script=Latin}
* The In/Is/In_/Is_ prefixes.
* Writing '=true', '=false', '=yes', or '=no' for binary properties.
* Perl allows you to omit the 'script' property, so you can write \p{Arabic} instead of \p{Script=Arabic}. But Java doesn't support this, and I'm worried it might be confused with \p{Block=Arabic}.
* None of that intersection/union/subtraction ****. Java, Perl, and the Unicode committee all have different notations for this, and it's complicated and unnecessary. The same thing can already be achieved using character sets and lookahead assertions.
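As a sketch of the last point, set subtraction and intersection can be emulated with lookahead assertions in front of an ordinary character set. ASCII ranges are used here so the example runs without any property-escape support. The names `CONSONANT`, `H_TO_M`, and `matchesOf` are hypothetical.

```javascript
// Subtraction: lowercase ASCII letters minus the vowels.
const CONSONANT = /(?![aeiou])[a-z]/g;

// Intersection: characters that are in both [a-m] and [h-z], i.e. h through m.
const H_TO_M = /(?=[a-m])[h-z]/g;

// Returns every match of a global pattern as an array (empty when there is none).
function matchesOf(pattern, text) {
  return text.match(pattern) || [];
}

console.log(matchesOf(CONSONANT, "unicode")); // [ 'n', 'c', 'd' ]
console.log(matchesOf(H_TO_M, "alphabet"));   // [ 'l', 'h' ]
```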
If anyone objects to this, speak up now because I'm about to start coding this. (NB: I might completely ignore your objection.)
Status: NEW → ASSIGNED
Comment 18•10 years ago
(In reply to Ted Clancy [:tedders1] from comment #17)
> Okay, I'm about to start working on this, because it would greatly help with
> some bidi work I'm about to do for Gaia.
>
> To be clear, this isn't part of ES6, although it is a Harmony proposal.
There is no formal proposal yet to add \p{…} to /u-enabled regular expressions. There is some discussion in https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2, but no spec text yet.
> That shouldn't stop us. For example, we implement the /y "sticky" flag for
> regular expressions, even though it too is only a proposal and not part of
> ES6.
The `/y` flag is part of ES6.
Assignee
Comment 19•10 years ago
> The `/y` flag is part of ES6.
I stand corrected. Still, we had support for it back in 2008, well before ES6 was released.
Assignee
Comment 20•10 years ago
> There is no formal proposal yet to add \p{…} to /u-enabled regular
> expressions. There is some discussion in
> https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2, but
> no spec text yet.
I was referring to this thing, which was referenced in Comment 11:
http://wiki.ecmascript.org/doku.php?id=proposals:extend_regexps#extending_regexps_for_unicode_ranges
It's independent of the /u stuff.
The discussion at https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2 is interesting too.
Comment 21•10 years ago
(In reply to Ted Clancy [:tedders1] from comment #17)
> I'm about to start working on this, because it would greatly help with
> some bidi work I'm about to do for Gaia.
No objections to experimentation on my part.
But note: until there's spec language substantially along in the standards process, anything that is implemented will be limited to nightly-only. The days when we unilaterally implemented and shipped JS extensions to broad audiences (covering /y most relevantly, but also covering let, const, non-function* generators, array and generator comprehensions, and many many other things) are gone. Now, we implement things when there's general agreement as to what's being implemented, so that we don't burn ourselves too hard when the standardized form of any given extension turns out to differ in critical ways from whatever we'd previously done. (Or when the thing we implemented ends up never being standardized, leaving us with painful security and compatibility burdens.)
So go wild implementing, but don't expect this to be something Gaia can use immediately when you're finished.
Assignee
Comment 22•10 years ago
Hi Jeff,
> until there's spec language substantially along in the standards process
I don't know what you consider to be "substantially along", but a version of this feature was formally specified as part of the ECMAScript 4 draft, as described in Comment 11. When ECMAScript 4 was abandoned, the plan was that ECMAScript 4 features would become part of ES6, but somehow this feature hasn't yet been added to ES6. I don't know why.
https://bugs.ecmascript.org/show_bug.cgi?id=764
> Now, we implement things when there's general agreement as to what's being implemented, so
> that we don't burn ourselves too hard when the standardized form of any given extension turns
> out to differ in critical ways from whatever we'd previously done
The features I'm talking about implementing already have "general agreement". Everything I want to implement is:
a) Based on Unicode Technical Standard #18; AND
b) Is a widely-used feature in other languages like Perl.
I've deliberately left out lesser-used features, features which aren't necessary, and features which differ between implementations. (For example, even though UTS #18 says that "\p{Script=Arabic}" can be abbreviated to "\p{Arabic}", and even though Perl supports it, Java doesn't allow it so I left it out.)
What I'm talking about implementing is a bare-bones, uncontroversial, and minimal set of useful features that will almost certainly be part of ECMAScript eventually.
There is almost no chance that the ECMAScript committee will decide upon something different. Javascript borrows its regular expression syntax from other languages. It would be a gross mistake to introduce a new syntax that differed from the ones used by Perl and Java.
[If someone from the ECMAScript committee could back me up here, it would be appreciated.]
> The days when we unilaterally implemented and shipped JS extensions to broad audiences are gone.
Respectfully, we're going to need to make an exception here. This is a feature that is 10 years overdue, and I need it in order to fix a broad class of bidi problems that we're having on Gaia. We're about to launch Firefox OS in the Middle East, so bidi issues are piling up fast. (We've been trying to work around these problems with band-aids, but we can't apply band-aids fast enough.)
With Firefox OS, the question was "Can we build an OS purely out of Javascript?". I'd hate to answer "No, we can't, because the ECMAScript committee keeps dropping the ball on a feature every other language had 10 years ago."
Assignee
Comment 23•10 years ago
Hi again, Jeff.
Uh, never mind. My boss says we can probably include this in Firefox OS without including it in the release version of Firefox for the time being.
Comment 24•10 years ago
(In reply to Ted Clancy [:tedders1] from comment #23)
> Uh, nevermind. My boss says we can probably include this in Firefox OS
> without including it in the release version of Firefox for the time being.
Are you saying do this as a JS library, without touching Gecko/SpiderMonkey at all? Or are you suggesting forking SpiderMonkey for b2g? I think everyone on SpiderMonkey -- and many, many people in Gecko and beyond -- would agree that it's an extraordinarily perilous idea to fork the JS language solely for b2g, and then have a body of use that is flat-out not the Web.
(In reply to Ted Clancy [:tedders1] from comment #22)
> a version of this feature was formally specified as part of the
> ECMAScript 4 draft, as described in Comment 11.
ES4's complete abandonware, dropt when ES5 happened.
> When ECMAScript 4 was abandoned, the plan was that
> ECMAScript 4 features would become part of ES6, but somehow this feature
> hasn't yet been added to ES6. I don't know why.
Recent consensus on adding the feature on es-discuss, and a start at spec text in ES7 (note new features aren't being added to ES6 now), seems like about the right baseline to me. A bug report that hasn't been touched in almost two and a half years is consensus timeout.
> What I'm talking about implementing is a bare-bones, uncontroversial, and
> minimal set of useful features that will almost certainly be part of
> ECMAScript eventually.
Sure. (To be clear, I'm not commenting on your proposal *technically* at all.) I'm not at all confident such a set of features for RegExp Unicode character classes exists right now, tho. It'd be a shame if you spent awhile implementing something, only to find out at the end that what seemed obvious to you, wasn't obvious to everyone else in TC39. I can count any number of things I thought were sure deals for future ECMAScript, that now have no chance of ever becoming reality; predictions about specs can easily turn out to be wrong.
If you have the time (notwithstanding your boss's words -- or even if you don't, mailing list discussion is cheap so maybe worth it), I suggest you send mail to the es-discuss list https://mail.mozilla.org/listinfo/es-discuss to get the ball rolling again.
And if you have even more time, writing spec language for everything you'd want to implement, and making your proposal incorporate that, would be the best way to get buy-in on something that *would* be implementable and shippable soonest.
For inexact comparison, an Atomics spec is being implemented (still Nightly-only) and has https://docs.google.com/document/d/1NDGA_gZJ7M7w1Bh8S0AoDyEqwDdRh4uSoTPSNn77PFk/edit to lend it support. If it's accepted -- it may not be, although I think momentum is in its favor -- that spec will have a lot to do with acceptance.
Comment 25•10 years ago
(In reply to Ted Clancy [:tedders1] from comment #20)
> It's independent of the /u stuff.
When this topic was discussed on es-discuss or during TC39 meetings, concerns were raised that simply adding \p{...} is not web-compatible. For example /\p{Lu}/ currently matches the string "p{Lu}" in all major engines, and most likely this behaviour needs to be preserved for backward compatibility.
TC39 instead decided to restrict escape sequences when Unicode mode is enabled (/u flag), so future ECMAScript editions can add support for \p{...} (but only in Unicode mode!). That means support for Unicode RegExps needs to be implemented in SpiderMonkey before \p{...} can be added.
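The compatibility concern can be checked directly. This is a minimal sketch assuming an engine that implements both the `u` flag and the property escapes standardized later, neither of which SpiderMonkey had at the time of this comment. `classify` is a hypothetical helper name.

```javascript
// Compares the same pattern text with and without the u flag.
function classify(text) {
  return {
    legacy: /^\p{Lu}$/.test(text),   // no u flag: \p is just "p", the braces are literal
    unicode: /^\p{Lu}$/u.test(text), // u flag: \p{Lu} is an uppercase-letter class
  };
}

console.log(classify("p{Lu}")); // { legacy: true, unicode: false }
console.log(classify("A"));     // { legacy: false, unicode: true }
```

Because the legacy interpretation still has to match the literal text, the new meaning can only be attached to patterns that opt in with the flag.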
Assignee
Comment 26•10 years ago
(In reply to André Bargull from comment #25)
> TC39 instead decided to restrict escape sequences when Unicode mode is
> enabled (/u flag), so future ECMAScript editions can add support for \p{...}
> (but only in Unicode mode!). That means support for Unicode RegExps needs to
> be implemented in SpiderMonkey before \p{...} can be added.
Thanks for that information André.
Do you have any idea of what should happen if someone uses an unrecognized escape sequence in a (unicode) regex? Should the escape sequence just be ignored, or should it cause an error?
Comment 27•10 years ago
(In reply to Ted Clancy [:tedders1] from comment #26)
> Do you have any idea of what should happen if someone uses an unrecognized
> escape sequence in a (unicode) regex? Should the escape sequence just be
> ignored, or should it cause an error?
In `/u` regexps it causes an error.
Relevant resource: https://mathiasbynens.be/notes/es6-unicode-regex
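A short sketch of that behaviour, assuming an engine that implements the ES6 `u` flag. The helper name `compiles` is hypothetical, and `\a` is used simply as an example of an escape with no defined meaning.

```javascript
// Returns true when the pattern source compiles under the given flags,
// and false when the engine rejects it with a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

console.log(compiles("\\a", ""));  // true: without u, \a is an identity escape for "a"
console.log(compiles("\\a", "u")); // false: with u, an unrecognized escape is an error
```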
Assignee
Comment 28•10 years ago
Thanks, Matt!
Updated•8 years ago
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE