Closed Bug 1594545 Opened 5 years ago Closed 4 years ago

Implement NativeRegExpMacroAssembler for new regexp import based on irregexp/NativeRegExpMacroAssembler.cpp

Categories

(Core :: JavaScript Engine, task, P1)

task

Tracking

()

RESOLVED FIXED
mozilla76
Tracking Status
firefox76 --- fixed

People

(Reporter: iain, Assigned: iain)

References

Details

Attachments

(13 files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

My initial plan for handling the code-generation side of the fresh irregexp import was to shim the MacroAssembler surface used by the V8 code. This was a fairly large task, especially because: a) V8 and SM disagree on Intel vs AT&T syntax, so every call has the arguments in the wrong order, and b) V8's macroassembler API is architecture-specific, so we would have to implement a separate set of shims for every architecture we support.

Jakob Gruber suggested an alternative, which in retrospect is a much better idea: we can implement the NativeRegExpMacroAssembler API. This is a higher-level API with a smaller surface area (instead of lea/cmp/addq/etc, it has Backtrack/CheckCharacter/Fail/etc). Even better, we implemented an older version of this API when we first imported irregexp (see NativeRegExpMacroAssembler.cpp), which will save us a lot of work.

This bug is to track the work necessary to implement the current API, borrowing heavily from our previous implementation.

Irregexp defines a RegExpMacroAssembler API (with methods like "CheckCharacter", "SetRegister", and "Backtrack"). This API is used as a backend for the RegexpCompiler. In V8, the API is implemented by the bytecode generator for the interpreter, and by a native code generator for each of their supported architecture.

We are reusing V8's bytecode generator, but we need our own native codegen. This patch stack implements the RegExpMacroAssembler API using masm, folowing Brian Hackett's approach from the previous import of irregexp (and copying a lot of his code). Where possible, I based my implementation off Brian's implementation (in js/src/irregexp/NativeRegExpMacroAssembler.cpp). I also relied heavily on V8's x86 implementation (https://github.com/v8/v8/blob/master/src/regexp/ia32/regexp-macro-assembler-ia32.cc).

This patch gets the files set up and implements a couple of trivial methods. Our old import kept the V8 license, despite being heavily rewritten. These files are dual-licensed with both the Mozilla license and the V8 license, which is our current preference.

Future patches in this stack add extra functionality in roughly increasing order of complexity. I've made sure that each patch compiles, but in a few places there are comments referring to things that don't exist yet. If anything is unclear without context, let me know.

This patch defines our register usage, and implements a few simple ops that don't rely on anything else.

Depends on D68412

This patch implements labels and backtracking.

Pushing the address of a label onto the backtrack stack is somewhat tricky, because PushBacktrack is generally called before Bind. This patch adds code to store the CodeOffset that must be patched alongside the Label. (This works, because we only ever push a given label on the backtrack stack in a single place.) The actual machinery of patching will be added in a later patch.

As part of backtracking, we also implement backtrack stack limit checks. The OOL helper that is called if those checks fail will be added in a later patch.

Depends on D68414

Now that we have labels set up, we can add a bunch of Check* methods.

Depends on D68415

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5cf4e0a8f205
Part 1: Add SMRegExpMacroAssembler r=jandem
https://hg.mozilla.org/integration/autoland/rev/153dea7044cc
Part 2: Push, Pop, register allocation r=jandem
https://hg.mozilla.org/integration/autoland/rev/bcc25f6e2edf
Part 3: Labels and backtracking r=jandem
https://hg.mozilla.org/integration/autoland/rev/e69279c8674f
Part 4: Simple checks r=jandem
Keywords: leave-open

Some less frequently used data is stored in the stack frame. This patch adds the accessors for that data and implements a few methods that use it.

Note: V8 stores inputStartMinusOne instead of inputStart. (Our old implementation does the same thing.) This is because we use inputStart - 1 when initializing or clearing capture registers. After spending multiple hours convincing myself that back-references in lookbehind assertions did not look past the beginning of the string, I got fed up and decided to simplify. Saving one subtraction when clearing registers is not worth the mental overhead.

The irregexp API has a bunch of methods that work with "registers". They are stored on the stack, and in practice they work like local variables.

Depends on D68627

CheckBitInTable looks up the current character (mod 128) in a table to decide whether or not to branch.

The ownership story of the table is somewhat complex. In V8, the ByteArray is a GC thing, rooted in a HandleScope higher up the stack. In SM, because we need to be able to allocate a ByteArray while generating masm, the ByteArray lives on the C++ stack. Its ownership is managed via unique pointer. Initially, the ByteArray is owned by the HandleScope, and will be freed when it goes out of scope. When we use a table in a CheckBitInTable, the macroassembler takes ownership of the ByteArray. Once we are done compiling, the RegExpShared will take ownership of the tables it uses.

Depends on D68628

This method implements faster code where possible for various character classes. It is always safe to return false.

Depends on D68629

This is the last of the MacroAssembler interface, aside from GetCode, which handles everything else: prologue, epilogue, OOL handlers, and linking.

Note: non-unicode case insensitive comparisons (/i, not /iu) are weird in JS. V8 didn't follow the spec very well until I uplifted some fixes. (SM was at least a little better; I caught the V8 problem when one of our non262 tests failed.) There's a similar problem in backreferences that I didn't catch at the time. For example, "/(.)\1/i" should not match "kK" (where the second letter is KELVIN SIGN), but it does, for both V8 and SM.

Getting this right requires CheckNotBackReferenceIgnoreCase to behave differently if the regexp doesn't have the unicode flag set. Ironically, Jakob Gruber landed a patch less than a month ago to remove that information. It looks like the fix will cut across several parts of irregexp, so I'm putting it off until later.

GetCode is called at the end of code generation. It generates the prologue and epilogue code, as well as some OOL handlers if necessary, then links the code and returns a JitCode object. In our old implementation this is a single 400 line function, but for the sake of my sanity I broke it up into chunks. This patch contains the skeleton. The prologue, epilogue, and handlers are in the next patches.

The V8 interface requires that we return a Handle here. This makes OOM handling slightly awkward. This is exacerbated by masm's ability to quietly OOM and not report it until the end, which makes it non-trivial to put an AutoEnterOOMUnsafeRegion in the right place. In most places in the irregexp port I crash on OOM, but here the least gross approach appeared to be to return a fake handle pointing to UndefinedValue, which can be handled cleanly in the caller.

InputOutputData will eventually need to be visible in CodeGenerator.cpp when calling from Ion into irregexp, so I put the definition in a new header file.

Depends on D68999

V8 (and our old import) supports a global cache optimization where we find multiple global matches at once and then iterate over those when finding the next match, instead of jumping back into the generated code each time. We never actually hooked that optimization up to anything, though, and I'm not sure whether it would add enough performance to justify the complexity.

I've pulled all of that logic out, so the success handler here is significantly simpler than it was in the old implementation.

(I had already written most of the code by the time I realized what was going on, so I have an optional patch in my stack to re-add the support if we decide we want to implement the global cache.)

Depends on D69000

The stack overflow handler is based very heavily on the old implementation. There may be a nicer way to write it.

After this patch, we have a complete RegExpMacroAssembler implementation.

Depends on D69001

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/96331c813c4b
Part 9: CheckNotBackReference r=jandem
Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5e82a9d26f0e
Part 10: GetCode r=jandem
https://hg.mozilla.org/integration/autoland/rev/a6d1292d28aa
Part 11: Allocate and initialize stack frame r=jandem
https://hg.mozilla.org/integration/autoland/rev/df132fbab8e6
Part 12: Success and exit handlers r=jandem
https://hg.mozilla.org/integration/autoland/rev/44264ec48d76
Part 13: Stack overflow and backtrack handlers r=jandem
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: