Closed Bug 258974 Opened 20 years ago Closed 7 years ago

Add Unicode character class patterns to JavaScript Regular Expressions (RegExp)

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

RESOLVED DUPLICATE of bug 1361876
mozilla1.9alpha1

People

(Reporter: gekacheka, Assigned: tedders1)

References

Details

(Keywords: intl)

Attachments

(1 file)

522 bytes, application/vnd.mozilla.xul+xml
Details
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040910 Firefox/0.10
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040910 Firefox/0.10

JavaScript RegExp only supports ASCII character classes, so it is not possible
to easily use it for internationalized forms/apps where users may enter
non-ASCII characters.

While it is possible to explicitly list character ranges such as
[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BA...]+ for a small number of locales,
it is tedious and not practical to list them for many/all locales.
(e.g., Alphabetic letters cover 429(!) ranges within 0000..FFFF listed in
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt)
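
To illustrate the workaround described above, here is a minimal sketch that assembles a character class from an explicit range table. The names `ranges`, `escapeCodePoint`, and `buildClass` are hypothetical, and the table holds only the five ranges quoted in this report, not the full Alphabetic list.

```javascript
// Hypothetical, partial range table: only the ranges quoted in the report.
const ranges = [
  [0x0041, 0x005a], // A-Z
  [0x0061, 0x007a], // a-z
  [0x00c0, 0x00d6],
  [0x00d8, 0x00f6],
  [0x00f8, 0x01ba],
];

// Format a BMP code point as a \uXXXX escape for use inside a pattern.
function escapeCodePoint(cp) {
  return "\\u" + cp.toString(16).padStart(4, "0");
}

// Build a class such as "[\u0041-\u005a...]" from the table.
function buildClass(table) {
  const body = table
    .map(([lo, hi]) => escapeCodePoint(lo) + "-" + escapeCodePoint(hi))
    .join("");
  return "[" + body + "]";
}

const lettersOnly = new RegExp("^" + buildClass(ranges) + "+$");
console.log(lettersOnly.test("A\u00f1o"));   // "Año"   -> true
console.log(lettersOnly.test("A\u00f1o12")); // "Año12" -> false
```

Extending this to every locale would mean maintaining hundreds of such ranges by hand, which is the burden the report is about.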


Reproducible: Always
Steps to Reproduce:
1. JavaScript console: new RegExp("^\\p{L}+([\\d\\p{Ll}]+)$").exec("Año12a3b")
Actual Results:  
Error: invalid quantifier {

Expected Results:  
Año12a3b, 12a3b
(returned array elements)
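
For comparison, engines that implement Unicode property escapes (standardized later, in ES2018, and only available together with the /u flag) return the expected result for an equivalent pattern. This is a sketch of that later behaviour, not of anything available when this bug was filed. The ñ is written as the precomposed U+00F1 so that it counts as a single letter.

```javascript
// Requires an engine with ES2018 Unicode property escapes.
const pattern = new RegExp("^\\p{L}+([\\d\\p{Ll}]+)$", "u");
const result = pattern.exec("A\u00f1o12a3b"); // "Año12a3b"

console.log(result[0]); // whole match:   "Año12a3b"
console.log(result[1]); // capture group: "12a3b"
```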

ICU provides an open source RegExp that implements patterns such as \p{L} to
match any Unicode letter.
http://oss.software.ibm.com/icu/
http://oss.software.ibm.com/icu/userguide/regexp.html
http://oss.software.ibm.com/icu/userguide/icufaq.html#faq-intro-5
(It is not a drop-in replacement, e.g., it does not implement the /g flag, but
might be wrapped in compatibility code to implement the /g.)

Java java.util.regex.Pattern
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

Perl
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

Python: changes the definition of \w depending on locale or unicode flag
http://docs.python.org/lib/re-syntax.html
Assignee: smontagu → general
Component: Internationalization → JavaScript Engine
QA Contact: amyy → pschwartau
Attached file testcase
Can the JS engine use nsIUGenCategory? Maybe not...If not, we may have to find a
way to share data and code somehow

nsIUGenDetailCategory is not yet implemented, but nsIUGenCategory is implemented
(although it's not up to date. I'm gonna update it soon). 
Keywords: intl
jshin: the JS engine has its own character classification tables, see
http://lxr.mozilla.org/mozilla/source/js/src/jsstr.c#2944.  This bug is really
against ECMA-262, looking for a reasonable extension to Edition 3.  I'll add
this to the list that ECMA TG1 should address for Edition 4.

/be
(In reply to comment #3)
> jshin: the JS engine has its own character classification tables, see
> http://lxr.mozilla.org/mozilla/source/js/src/jsstr.c#2944.

Gosh, those tables seem to have hardly changed since 1998, when the latest
version of Unicode was 2.0
smontagu: indeed -- but how would they change?  Aren't the changes since then to
the non-BMP planes, which JS blissfully ignores per ECMA-262 Edition 3?

/be
A whole bunch of characters (at least a few thousand) have been added to the
BMP since 1998 and the properties of some characters that were in Unicode 2.0
have changed.
-> default qa
QA Contact: pschwartau → general
OS: Windows 2000 → All
Hardware: PC → All
Target Milestone: --- → mozilla1.9alpha
Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited in comment 6. Need a spin-off bug.

/be
(In reply to comment #8)
> Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited
> in comment 6. Need a spin-off bug.

That would be bug 394604, no?
(In reply to comment #9)
> (In reply to comment #8)
> > Whatever we do in Ecma TC39 for this bug, there's a Mozilla-specific bug cited
> > in comment 6. Need a spin-off bug.
> 
> That would be bug 394604, no?

Thanks!

This bug (here, the one I'm commenting in: bug 258974) seems misplaced. It would be better filed at http://bugs.ecmascript.org/. We can keep this bugzilla report alive to track a ticket there. I'll file shortly.

/be
Any news about this?

UTS #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/index.html
(In reply to lennart.borgman from comment #13)
> Any news about this?
> 
> UTS #18: Unicode Regular Expressions
> http://www.unicode.org/reports/tr18/index.html

ECMAScript 6 doesn't specify Unicode handling in regular expressions. IIUC, a future version of the ECMAScript internationalization spec will rectify that. CCing Norbert for verification.
(In reply to Till Schneidereit [:till] from comment #14)

Thanks Schneidereit.
ES6 specifies *some* Unicode handling in regular expressions, e.g., case insensitive matching and the new /u flag for supporting the full Unicode character set including supplementary characters. But it still doesn't specify support for character properties or character classes outside of ASCII. To the extent that such support is added to regular expressions, it would most likely have to happen in the ECMAScript Language Specification, not the Internationalization API Specification.
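
A short sketch of the difference the /u flag makes for supplementary characters, assuming an ES6-capable engine. U+1F600 is an arbitrary example of a code point outside the BMP.

```javascript
const s = "\u{1F600}"; // one code point, stored as two UTF-16 code units
console.log(s.length); // 2

// Without /u, "." matches a single code unit, so one "." cannot span the pair.
console.log(/^.$/.test(s)); // false

// With /u, "." matches a whole code point.
console.log(/^.$/u.test(s)); // true
```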
Assignee: general → nobody
Assignee: nobody → tclancy
Blocks: 1152074
Blocks: 1154438
Okay, I'm about to start working on this, because it would greatly help with some bidi work I'm about to do for Gaia.

To be clear, this isn't part of ES6, although it is a Harmony proposal. That shouldn't stop us. For example, we implement the /y "sticky" flag for regular expressions, even though it too is only a proposal and not part of ES6. This feature has been a part of Perl and Java for more than a decade, and is long overdue for JavaScript. (This Bugzilla ticket was created in 2004!) It is also recommended by the Unicode committee in UTS #18.

Because we don't yet know what will end up in the ECMAScript standard, I intend to implement only a very minimal subset of the notations supported by Perl and Java.

Specifically, I intend to implement the \p{property=value} and \P{property=value} notations, with the following details:

* Either an equal-sign or a colon can be used.

* Both the 'property' and the 'value' are case-insensitive, and any spaces, hyphens, or underscores will be ignored.

* The braces are not optional.

* For the names of 'property' and 'value', I will support the variant names given in PropertyAliases.txt
and PropertyValueAliases.txt, so you can say either \p{Bidi_Class=Left_to_Right} or \p{bc=l}.

* For binary properties (which can only be true or false), the notation would be
\p{property} or \P{property}, with no 'value' specified. So you would just write \p{Upper}, not \p{Upper=True} or \p{Upper: Yes}.

* The 'property' can be omitted if it is General_Category (gc). So you can write \p{Lu} instead of \p{gc=Lu} or \p{General_Category=Uppercase_Letter}.
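
The notation rules above could be sketched roughly as follows. This is only an illustration of the loose name matching, not the planned C++ implementation: `parsePropertyBody`, `loose`, and both alias tables are hypothetical names, and the tables contain just a handful of entries that a real version would generate from PropertyAliases.txt and PropertyValueAliases.txt. Binary properties such as \p{Upper} are not handled here.

```javascript
// Hypothetical, tiny alias tables keyed by loosely normalized names.
const PROPERTY_ALIASES = new Map([
  ["gc", "General_Category"], ["generalcategory", "General_Category"],
  ["sc", "Script"], ["script", "Script"],
  ["bc", "Bidi_Class"], ["bidiclass", "Bidi_Class"],
]);
const VALUE_ALIASES = new Map([
  ["General_Category", new Map([
    ["l", "Letter"], ["letter", "Letter"],
    ["lu", "Uppercase_Letter"], ["uppercaseletter", "Uppercase_Letter"],
  ])],
  ["Script", new Map([
    ["latn", "Latin"], ["latin", "Latin"],
    ["arab", "Arabic"], ["arabic", "Arabic"],
  ])],
  ["Bidi_Class", new Map([
    ["l", "Left_To_Right"], ["lefttoright", "Left_To_Right"],
  ])],
]);

// Loose matching: ignore case, spaces, hyphens, and underscores.
function loose(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

// Parse the text between the braces of \p{...} or \P{...}.
// Returns { property, value }, or null if a name is not recognized.
function parsePropertyBody(body) {
  const sep = body.search(/[=:]/); // either "=" or ":" separates the parts
  const property = sep === -1
    ? "General_Category" // property omitted: assume General_Category
    : PROPERTY_ALIASES.get(loose(body.slice(0, sep)));
  if (property === undefined) return null;
  const value = VALUE_ALIASES.get(property).get(loose(body.slice(sep + 1)));
  return value === undefined ? null : { property, value };
}

console.log(parsePropertyBody("Bidi_Class=Left_to_Right"));
console.log(parsePropertyBody("bc=l"));
console.log(parsePropertyBody("Lu"));
```

With these tables, the first two calls resolve to the same property and value, and the third resolves to General_Category with the value Uppercase_Letter.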

At this point, the only properties I intend to support are General_Category, Script, and Bidi_Class. General_Category and Script are probably the two which are most needed by developers. I need Bidi_Class for my bidi work.

Other properties which would be easy to implement (meaning that our code in intl/unicharprops/util/ currently supports them) are East_Asian_Width, Vertical_Orientation, Numeric_Value, Combining_Class, Hangul_Syllable_Type, Bidi_Mirror, and Bidi_Control. However, I don't know if there's much developer demand for them.

I wanted to support the Uppercase, Lowercase, and Alphabetic properties, but it turns out our C++ code doesn't currently keep track of those properties. (In practice, Lu, Ll, and L are usually sufficient.) Also note that our code doesn't currently support the Block or Name properties.

Note that I don't intend to support the following notations:
* Using the caret for negation, e.g. \p{Script=^Latin}. Perl supports this, but Java does not. Also, you can just use \P{Script=Latin}
* The In/Is/In_/Is_ prefixes.
* Writing '=true', '=false', '=yes', or '=no' for binary properties.
* Perl allows you to omit the 'script' property, so you can write \p{Arabic} instead of \p{Script=Arabic}. But Java doesn't support this, and I'm worried it might be confused with \p{Block=Arabic}.
* None of that intersection/union/subtraction ****. Java, Perl, and the Unicode committee all have different notations for this, and it's complicated and unnecessary. The same thing can already be achieved using character sets and lookahead assertions.
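
As a rough illustration of the last point, set intersection and subtraction can be emulated with lookahead assertions, and negation with \P{...}. The sketch below uses the ES2018 property-escape syntax (which requires the /u flag) purely as a stand-in for the notation proposed here; the variable names are made up for the example.

```javascript
// Intersection: characters that are Greek script AND letters.
const greekLetters = /^(?:(?=\p{Script=Greek})\p{L})+$/u;

// Subtraction: letters, except uppercase letters.
const nonUpperLetters = /^(?:(?!\p{Lu})\p{L})+$/u;

// Negation via \P instead of a caret inside the braces.
const noLatin = /^\P{Script=Latin}+$/u;

console.log(greekLetters.test("\u03b1\u03b2\u03b3")); // "αβγ" -> true
console.log(greekLetters.test("abc"));                // false
console.log(nonUpperLetters.test("abc"));             // true
console.log(nonUpperLetters.test("aBc"));             // false
console.log(noLatin.test("123"));                     // true
console.log(noLatin.test("a1"));                      // false
```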

If anyone objects to this, speak up now because I'm about to start coding this. (NB: I might completely ignore your objection.)
Status: NEW → ASSIGNED
(In reply to Ted Clancy [:tedders1] from comment #17)
> Okay, I'm about to start working on this, because it would greatly help with
> some bidi work I'm about to do for Gaia.
> 
> To be clear, this isn't part of ES6, although it is a Harmony proposal.

There is no formal proposal yet to add \p{…} to /u-enabled regular expressions. There is some discussion in https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2, but no spec text yet.

> That shouldn't stop us. For example, we implement the /y "sticky" flag for
> regular expressions, even though it too is only a proposal and not part of
> ES6.

The `/y` flag is part of ES6.
> The `/y` flag is part of ES6.

I stand corrected. Still, we had support for it back in 2008, well before ES6 was released.
> There is no formal proposal yet to add \p{…} to /u-enabled regular
> expressions. There is some discussion in
> https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2, but
> no spec text yet.

I was referring to this thing, which was referenced in Comment 11: 
http://wiki.ecmascript.org/doku.php?id=proposals:extend_regexps#extending_regexps_for_unicode_ranges

It's independent of the /u stuff.

The discussion at https://github.com/mathiasbynens/es6-unicode-regexp-proposal/issues/2 is interesting too.
(In reply to Ted Clancy [:tedders1] from comment #17)
> I'm about to start working on this, because it would greatly help with
> some bidi work I'm about to do for Gaia.

No objections to experimentation on my part.

But note: until there's spec language substantially along in the standards process, anything that is implemented will be limited to nightly-only.  The days when we unilaterally implemented and shipped JS extensions to broad audiences (covering /y most relevantly, but also covering let, const, non-function* generators, array and generator comprehensions, and many many other things) are gone.  Now, we implement things when there's general agreement as to what's being implemented, so that we don't burn ourselves too hard when the standardized form of any given extension turns out to differ in critical ways from whatever we'd previously done.  (Or when the thing we implemented ends up never being standardized, leaving us with painful security and compatibility burdens.)

So go wild implementing, but don't expect this to be something Gaia can use immediately when you're finished.
Hi Jeff,

> until there's spec language substantially along in the standards process

I don't know what you consider to be "substantially along", but a version of this feature was formally specified as part of the ECMAScript 4 draft, as described in Comment 11. When ECMAScript 4 was abandoned, the plan was that ECMAScript 4 features would become part of ES6, but somehow this feature hasn't yet been added to ES6. I don't know why.

https://bugs.ecmascript.org/show_bug.cgi?id=764

> Now, we implement things when there's general agreement as to what's being implemented, so 
> that we don't burn ourselves too hard when the standardized form of any given extension turns 
> out to differ in critical ways from whatever we'd previously done

The features I'm talking about implementing already have "general agreement". Everything I want to implement is:
a) Based on Unicode Technical Standard #18; AND
b) Is a widely-used feature in other languages like Perl.

I've deliberately left out lesser-used features, features which aren't necessary, and features which differ between implementations. (For example, even though UTS #18 says that "\p{Script=Arabic}" can be abbreviated to "\p{Arabic}", and even though Perl supports it, Java doesn't allow it so I left it out.)

What I'm talking about implementing is a bare-bones, uncontroversial, and minimal set of useful features that will almost certainly be part of ECMAScript eventually.

There is almost no chance that the ECMAScript committee will decide upon something different. JavaScript borrows its regular expression syntax from other languages. It would be a gross mistake to introduce a new syntax that differed from the ones used by Perl and Java.

[If someone from the ECMAScript committee could back me up here, it would be appreciated.]

> The days when we unilaterally implemented and shipped JS extensions to broad audiences are gone.

Respectfully, we're going to need to make an exception here. This is a feature that is 10 years overdue, and I need it in order to fix a broad class of bidi problems that we're having on Gaia. We're about to launch Firefox OS in the Middle East, so bidi issues are piling up fast. (We've been trying to work around these problems with band-aids, but we can't apply band-aids fast enough.)

With Firefox OS, the question was "Can we build an OS purely out of JavaScript?". I'd hate to answer "No, we can't, because the ECMAScript committee keeps dropping the ball on a feature every other language had 10 years ago."
Hi again, Jeff.

Uh, nevermind. My boss says we can probably include this in Firefox OS without including it in the release version of Firefox for the time being.
(In reply to Ted Clancy [:tedders1] from comment #23)
> Uh, nevermind. My boss says we can probably include this in Firefox OS
> without including it in the release version of Firefox for the time being.

Are you saying do this as a JS library, without touching Gecko/SpiderMonkey at all?  Or are you suggesting forking SpiderMonkey for b2g?  I think everyone on SpiderMonkey -- and many, many people in Gecko and beyond -- would agree that it's an extraordinarily perilous idea to fork the JS language solely for b2g, and then have a body of use that is flat-out not the Web.

(In reply to Ted Clancy [:tedders1] from comment #22)
> a version of this feature was formally specified as part of the
> ECMAScript 4 draft, as described in Comment 11.

ES4's complete abandonware, dropt when ES5 happened.

> When ECMAScript 4 was abandoned, the plan was that
> ECMAScript 4 features would become part of ES6, but somehow this feature
> hasn't yet been added to ES6. I don't know why.

Recent consensus on adding the feature on es-discuss, and a start at spec text in ES7 (note new features aren't being added to ES6 now), seems like about the right baseline to me.  A bug report that hasn't been touched in almost two and a half years is consensus timeout.

> What I'm talking about implementing is a bare-bones, uncontroversial, and
> minimal set of useful features that will almost certainly be part of
> ECMAScript eventually.

Sure.  (To be clear, I'm not commenting on your proposal *technically* at all.)  I'm not at all confident such a set of features for RegExp Unicode character classes exists right now, tho.  It'd be a shame if you spent awhile implementing something, only to find out at the end that what seemed obvious to you, wasn't obvious to everyone else in TC39.  I can count any number of things I thought were sure deals for future ECMAScript, that now have no chance of ever becoming reality; predictions about specs can easily turn out to be wrong.

If you have the time (notwithstanding your boss's words -- or even if you don't, mailing list discussion is cheap so maybe worth it), I suggest you send mail to the es-discuss list https://mail.mozilla.org/listinfo/es-discuss to get the ball rolling again.

And if you have even more time, writing spec language for everything you'd want to implement, and making your proposal incorporate that, would be the best way to get buy-in on something that *would* be implementable and shippable soonest.

For inexact comparison, an Atomics spec is being implemented (still Nightly-only) and has https://docs.google.com/document/d/1NDGA_gZJ7M7w1Bh8S0AoDyEqwDdRh4uSoTPSNn77PFk/edit to lend it support.  If it's accepted -- it may not be, although I think momentum is in its favor -- that spec will have a lot to do with acceptance.
(In reply to Ted Clancy [:tedders1] from comment #20)
> It's independent of the /u stuff.

When this topic was discussed on es-discuss or during TC39 meetings, concerns were raised that simply adding \p{...} is not web-compatible. For example /\p{Lu}/ currently matches the string "p{Lu}" in all major engines, and most likely this behaviour needs to be preserved for backward compatibility.

TC39 instead decided to restrict escape sequences when Unicode mode is enabled (/u flag), so future ECMAScript editions can add support for \p{...} (but only in Unicode mode!). That means support for Unicode RegExps needs to be implemented in SpiderMonkey before \p{...} can be added.
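
The compatibility concern can be seen directly in an engine that supports both modes. This is a small sketch assuming ES2018 property escapes are available for the /u case; the input string is arbitrary.

```javascript
// Without /u, "\p" is an identity escape for "p" and "{Lu}" is literal text.
const legacy = /\p{Lu}/.exec("xp{Lu}");
console.log(legacy[0]); // "p{Lu}"

// With /u, \p{Lu} matches a single uppercase letter instead.
const unicode = /\p{Lu}/u.exec("xp{Lu}");
console.log(unicode[0]); // "L"
```

The same source text therefore means two different things depending on the flag, which is why the new meaning is confined to Unicode mode.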
(In reply to André Bargull from comment #25)
> TC39 instead decided to restrict escape sequences when Unicode mode is
> enabled (/u flag), so future ECMAScript editions can add support for \p{...}
> (but only in Unicode mode!). That means support for Unicode RegExps needs to
> be implemented in SpiderMonkey before \p{...} can be added.

Thanks for that information André.

Do you have any idea of what should happen if someone uses an unrecognized escape sequence in a (unicode) regex? Should the escape sequence just be ignored, or should it cause an error?
(In reply to Ted Clancy [:tedders1] from comment #26)
> Do you have any idea of what should happen if someone uses an unrecognized
> escape sequence in a (unicode) regex? Should the escape sequence just be
> ignored, or should it cause an error?

In `/u` regexps it causes an error.

Relevant resource: https://mathiasbynens.be/notes/es6-unicode-regex
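
A minimal sketch of that difference, using "\j" as an arbitrary escape that has no defined meaning in either mode:

```javascript
// Without /u, an unrecognized escape such as "\j" is an identity escape.
console.log(new RegExp("\\j").test("j")); // true

// With /u, the same pattern is rejected when the RegExp is constructed.
let error = null;
try {
  new RegExp("\\j", "u");
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true
```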
Thanks, Matt!
No longer blocks: 1154438
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE