Closed Bug 1592302 Opened 5 years ago Closed 4 years ago

Import V8 regexp code into SpiderMonkey

Categories

(Core :: JavaScript Engine, task, P1)

task

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox73 --- fixed

People

(Reporter: iain, Assigned: iain)

References

Details

Attachments

(3 files)

The first step of refreshing our regexp code is to pull the current V8 code into our tree.

This is an unmodified copy of https://github.com/v8/v8/tree/master/src/regexp, with the following files removed:

regexp/OWNERS
regexp/regexp.cc
regexp/regexp-utils.cc
regexp/regexp-utils.h
regexp/s390/*
regexp/ppc/*

This patch adds a python script to update #includes from V8 to SM, and runs the script over each of the V8 source files. This automates 90%+ of the changes that are necessary to prepare V8 source for inclusion in SM. The remaining changes will come in a subsequent patch. Almost all of them simply involve adding a shim #include in the right place.

Depends on D51929

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d5b26400bc5d
Part 1: Import regexp directory from v8 r=mgaudet,mhoye
https://hg.mozilla.org/integration/autoland/rev/9197fffa4083
Part 2: Add new regexp files to build behind config option r=mgaudet
https://hg.mozilla.org/integration/autoland/rev/f2a72cda0843
Part 3: Update #includes in new regexp files r=mgaudet

V8 is calling ICU's C++ API, whereas we're using only the C API, because we allow embedders to bring their own ICU copy (see MOZ_SYSTEM_ICU and --with-system-icu build flag). (C++ doesn't have a stable ABI, so using C is required for this use case.)

Do we already have a plan how to handle this issue?

We do not have a plan. I haven't looked at the ICU code very closely yet; I was building with --disable-intl-api until last week.

In the short term, we might have to turn off ICU support for regexps when built with MOZ_SYSTEM_ICU. It looks like it is mostly (only?) used for case-insensitive comparisons, which we currently don't support at all, so embedders shouldn't see a regression vs the status quo.

(In reply to Iain Ireland [:iain] from comment #7)

It looks like it is mostly (only?) used for case-insensitive comparisons, which we currently don't support at all, so embedders shouldn't see a regression vs the status quo.

status quo w.r.t. the new imported irregexp code? Because the old irregexp code supports case-insensitive comparisons (with and without the Unicode flag).

IIRC in addition to Unicode RegExps with the case-insensitive flag, V8 also uses ICU for the Unicode properties character class escapes (\p and \P). But we don't yet support those.

Sorry, I confused myself about which unicode-related feature we don't already support. Where I said "case-insensitive unicode comparisons" above, please read "unicode character class escapes". Let me try to lay out my current understanding, and you can explain where I am still wrong.

SpiderMonkey currently does not use ICU for anything regexp-related.

V8's implementation of irregexp does use ICU, for at least two features: case-insensitive unicode comparisons, and character class escapes. Case-insensitive unicode comparisons have a non-ICU backup implementation. (See, for example, the V8 commit that implemented them in the first place. It looks like there is no equivalent backup implementation for character classes.

If we turn off V8's ICU support with MOZ_SYSTEM_ICU, my understanding is that case-insensitive unicode comparisons will use the non-ICU fallback path, and character class escapes will not be supported. This is, I think, equivalent to the SM status quo.

Does this make sense? There is a very good chance I am missing something.

(In reply to Iain Ireland [:iain] from comment #9)

Sorry, I confused myself about which unicode-related feature we don't already support. Where I said "case-insensitive unicode comparisons" above, please read "unicode character class escapes". Let me try to lay out my current understanding, and you can explain where I am still wrong.

Ah, I see.

SpiderMonkey currently does not use ICU for anything regexp-related.

That's correct.

V8's implementation of irregexp does use ICU, for at least two features: case-insensitive unicode comparisons, and character class escapes.

Yep.

Case-insensitive unicode comparisons have a non-ICU backup implementation. (See, for example, the V8 commit that implemented them in the first place. It looks like there is no equivalent backup implementation for character classes.

If we turn off V8's ICU support with MOZ_SYSTEM_ICU, my understanding is that case-insensitive unicode comparisons will use the non-ICU fallback path, and character class escapes will not be supported. This is, I think, equivalent to the SM status quo.

The backup implementation isn't spec-compliant, because it simply ignores the cases where Unicode compliant case-comparisons are needed. (See "test/mjsunit/es6/unicode-regexp-ignore-case.js" and "test/mjsunit/es6/unicode-regexp-ignore-case-noi18n.js" in the V8 sources.) So we'll have a regression when either --without-intl-api or --with-system-icu is used, because then case-comparisons will be wrong for certain strings.
I can't tell if it's worth spending time to ensure this won't regress, because I don't know how many embedders use either --without-intl-api or --with-system-icu. And there's already functionality which isn't spec-compliant when either --without-intl-api (cf. final Sigma handling for String.prototype.toLowerCase) or --with-system-icu is used (time zone handling in vm/DateTime), so this isn't a totally new situation for us. If you ask me, I'd say it's okay if certain things don't work correctly when the in-tree ICU version isn't used, as long as we are aware of that situation.

And just like said, not supporting Unicode property class character escapes when MOZ_SYSTEM_ICU should be acceptable, because we currently don't support it at all (bug 1361876).

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: