Frum: Marc Guillemot - view profile
Dete-a: Thurs 2 Aug 2007 08:39
Email: Marc Guillemot <mguille...@yahoo.fr>
I've mentioned for some time that it would be interesting to use
java.util.regex to implement RegExp in Rhino. I think that this would be
faster, safer and would simplify Rhino code.
I've now introduced in HtmlUnit a custom RegExp proxy to catch calls to
String.replace(*, String) and interpret them using java.util.regex. For
our use this fixed some RegExp bugs and is really faster.
I was interested in a special regexp with non capturing group. A small
test with String.replace(myRegExp, String) gave me impressive results:
Test string repeated 100 times:
Pure Rhino: 25 ms
String.replace using java.util.regex: 7 ms
Test string repeated 1000 times:
Pure Rhino: 440 ms
String.replace using java.util.regex: 15 ms
I've written a post about this and the usage of a current custom
RexExpProxy in HtmlUnit:
If this motivate someone to use java.util.regex in Rhino's RegExp, I
would be quite happy to use it. Otherwise I will surely propose a patch
for that one day, but it may take some time.
I mentioned this issue on jvm-languages Google group and got this very helpful reply from John Cowan:
Message from discussion on on message Rhino status (was: CommunityOne talk on JVM language implementers):
On Tue, Apr 22, 2008 at 9:52 AM, Norris Boyd <norrisb...@gmail.com> wrote:
> We welcome contributions and contributors; see
Your list asks about ECMAscript regular expressions. As far as I can
tell by closely comparing the 3rd Edition with the Javadoc for
java.util.regex.Pattern (supplemented by a few experiments), they are
a proper subset of Java regular expressions with the following three
Java does not support the \v escape: use \ck instead.
Java does not support the \0 escape: use \x00 instead.
Java does not support the \b escape within character classes: for
[...\b...] read [...\ch...].
Java also provides the following extensions over ECMAscript:
Octal escapes (\0d, \0dd, \01dd)
\a (same as \cg) and \e (same as \x1b)
Posix, Unicode, and Java-specific character classes with \p and \P
\A (beginning of input), \z (end of input), and \Z (end of input
except for final line terminator)
Possessive quantifiers ?+, *+, ++ (match as much as possible even if
other parts fail as a result)
\Q and \E (force all characters in between to be escaped)
(?<=X) and (?<!X) for positive and negative lookbehind
(?idmnsux) Turn on special matching flags
(?idmnsux:X) Turn on special matching flags in this group
Character class union (by concatenation) and intersection (with &&)
The Java syntax for character class union and intersection provokes
incompatible interpretations in certain cases: for example,
[a-z&&[^d-f]] is the same as [a-cg-z] in Java (modulo locale issues),
but in ECMAscript it should match any of a-z&^[ followed by ].
However, this is a very improbable way of writing that regular
expression in ECMAscript (or any non-Java regular expression
language), so the syntax is *in effect* backward compatible.
Likewise, [a-z is invalid in Java (erroneous nested character class)
but should match any of a-z or [ in ECMAscript.
This is a really interesting information!
Looking at the first steps of my js2javaRegex conversion method, it seems to me that it doesn't cover everything, for instance \12 should be converted to \012.
I have a patch ready replacing current RegExp support with Java RegExp based support which seems to not break any currently passing test and which fixes the currently opened issues:
I'd like some input on the preferred way to get my patch included in Rhino code base: on one side my patch seems to work as desired but on the other side related clean up can be performed to simplify Rhino code base. Additionally, I don't know if we have to care to some compatibility issues here. My preferred way would be to see a first version of the patch applied and then to propose additional patch to improve the code.
Can some Rhino committer provide some feedback please?
I think it would be a good idea to move forward with this issue. Marc, could you post your patch here so we can have a look?
If/once we make the step to java.util.regex we should probably keep the old implementation as optional implementation for a few release cycles like we did (and still do) with the old E4X implementation.
Announcement of comment #3 was before I discovered that the unit tests reported many failed tests as success (see bug #460739). This means that I definitely still have errors but I haven't had time to look at it again.
In the mean time it woud be interesting to define precisely the transition rules. For instance when you say that old implementation should be kept as optional implemenation, should both be able to communicate? Should scripts compiled with old implementation still be able to run correctly?
JRuby went through quite an odyssey with their regular expression engine, switching implementations several times. They started with java.util.regex, but had problems with its recursive nature. A few links i dug out:
I have implemented what I believe to be a fully compliant regexp engine for rhino using java.util.regex. The performance improvements are approx 10-times for regex in the sunspider and v8 benchmarks.
Patch can be found at:
The patch caters for all the inconsistencies noted by John Cowan (as well as the additional change in ecma v 5 for '\s' to include unicode byte-order-mark) by rewriting JS regexps into the equivalent jre syntax.
The patch does not include the 'static' RegExp properties (leftContext, rightContext, etc) which were previously in NativeRegExpCtor. As far as I can tell, these are not part of ecma. They could probably be added if considered required / useful.
I have made further changes available at https://github.com/joelhockey/rhino-mirror. If anyone wants to commit this code, I can create a patch against CVS HEAD if that would help.
I have got both java.util.regex (jur) and joni engines working. Joni is 100% compliant; jur has one incompatibility related to case-folding unicode chars.
The big decision for the core rhino team now is which engine makes most sense for the future. java.util.regex has the advantage that it keeps rhino footprint small and automatically gets any improvements that oracle make. Joni has the advantages that its code is modifiable and the jruby team are keen to help rhino. The compiled footprint of joni is ~400k.
Rhino could even support all 3 engines via config.
The changes I have made are:
1/ improve performance of the original rhino/spidermonkey code by caching the input string and avoiding java.lang.String.toCharArray() calls when the same input string is used repeatedly. This reduced times by about 75% (2700ms to 630ms) in the sunspider benchmark, and about 20% for v8 (1300ms - 1100ms).
This is a very simple change to add a 'str' and 'charArray' field to RECompiled, and modify NativeRegExp.executeRegExp so that it compares the input 'str' against the current value of re.str and only does the java.lang.String.toCharArray call if the strings do not match
Further (smaller) improvements were made by removing the SubString class and using java's standard String.substring method to produce substrings. Once again, the saving is caused by reducing string->charArray conversions.
2/ fix bug where the static value RegExp.input was not being set. Rhino was setting 'leftContext', 'rightContext', etc, but not 'input'.
3/ created new RegExpEngine interface to allow multiple implementations of regexp engines, and moved engine code out of NativeRegExp into RERhino class. Apart from cutting the engine code out of NativeRegExp, the only change required to use the new engine interface is to modify the compileRE method to return an instance of RegExpEngine, and replace the single call to matchRegExp to call the engine instance.
4/ created java.util.regex (jur) and joni implementations. I was able to make both engines mostly compliant with the js re syntax by parsing and rewriting regexps into the equivalent syntax of jur and joni. The only tricky area is unicode case-folding. ECMA262 has peculiar requirements about how to do case-insensitive comparisons.
I was able to make joni 100% compliant by modifying it (jcodings project) and adding a config flag to stop unicode case-folding to ascii (as required by ecma). I will ask for this patch to be added to the core joni project. I believe that joni is BSD licenced. I'm not very familiar with the MPL, so I don't know if it is ok to include joni with rhino or not. I have talked to the jruby team who ported oniguruma from c to java, and they are very happy for us to use it.
The java.util.regex engine will fails ecma requirements by not matching case-insensitive unicode chars by default. E.g. \u01f1 (LATIN CAPITAL LETTER DZ) will not match \u01f3 (LATIN SMALL LETTER DZ)
This can be overcome by setting the java.util.Pattern.UNICODE_CASE flag, but then unicode chars will fold to ascii when they shouldn't. E.g. \u017f (LATIN SMALL LETTER LONG S) (un)folds to \u0053 (ascii capital s).
I haven't attempted to extract the java.util.regexp source and modify it to make it compliant since I expect this would be in breach of sun/oracle's licence. I don't expect that it would take much to change it and add an extra regexp flag to support this mode if oracle wanted to help us though. As a workaround, if we really wanted to use j.u.r, I would suggest to not use the Pattern.UNICODE_CASE flag and allow support for case-insensitive unicode matching on individual RegExp objects when required by using java's regexp syntax (?u:...). The current parsing between js-syntax and jur-syntax allows extended jur options to be used.
Both the joni and java.util.regex engines perform better than rhino/spidermonkey (but not as much as I thought once I added the RECompiled input caching to RERhino). joni performs fastest (joni=360ms, jur=450, rhino=1100) for v8 which uses real-world regexp matching (regexps from top 50 websites), and java.util.regex performs fastest for sunspider (jur=260, joni=530, rhino=620) which matches many different regexps against the same input string.
The reason I'm holding back with this patch is that without input caching, the new Rhino regexp implementation seems quite a bit slower than the old one (about 10%). I'm currently trying to find out why this is so.