Implement NativeRegExpMacroAssembler for new regexp import based on irregexp/NativeRegExpMacroAssembler.cpp
Categories
(Core :: JavaScript Engine, task, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox76 | --- | fixed |
People
(Reporter: iain, Assigned: iain)
References
Details
Attachments
(13 files)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review |
My initial plan for handling the code-generation side of the fresh irregexp import was to shim the MacroAssembler surface used by the V8 code. This was a fairly large task, especially because: a) V8 and SM disagree on Intel vs AT&T syntax, so every call has the arguments in the wrong order, and b) V8's macroassembler API is architecture-specific, so we would have to implement a separate set of shims for every architecture we support.
Jakob Gruber suggested an alternative, which in retrospect is a much better idea: we can implement the NativeRegExpMacroAssembler API. This is a higher-level API with a smaller surface area (instead of lea/cmp/addq/etc, it has Backtrack/CheckCharacter/Fail/etc). Even better, we implemented an older version of this API when we first imported irregexp (see NativeRegExpMacroAssembler.cpp), which will save us a lot of work.
This bug is to track the work necessary to implement the current API, borrowing heavily from our previous implementation.
Assignee | ||
Comment 1•4 years ago
|
||
Irregexp defines a RegExpMacroAssembler API (with methods like "CheckCharacter", "SetRegister", and "Backtrack"). This API is used as a backend for the RegexpCompiler. In V8, the API is implemented by the bytecode generator for the interpreter, and by a native code generator for each of their supported architecture.
We are reusing V8's bytecode generator, but we need our own native codegen. This patch stack implements the RegExpMacroAssembler API using masm, folowing Brian Hackett's approach from the previous import of irregexp (and copying a lot of his code). Where possible, I based my implementation off Brian's implementation (in js/src/irregexp/NativeRegExpMacroAssembler.cpp). I also relied heavily on V8's x86 implementation (https://github.com/v8/v8/blob/master/src/regexp/ia32/regexp-macro-assembler-ia32.cc).
This patch gets the files set up and implements a couple of trivial methods. Our old import kept the V8 license, despite being heavily rewritten. These files are dual-licensed with both the Mozilla license and the V8 license, which is our current preference.
Future patches in this stack add extra functionality in roughly increasing order of complexity. I've made sure that each patch compiles, but in a few places there are comments referring to things that don't exist yet. If anything is unclear without context, let me know.
Assignee | ||
Comment 2•4 years ago
|
||
This patch defines our register usage, and implements a few simple ops that don't rely on anything else.
Depends on D68412
Assignee | ||
Comment 3•4 years ago
|
||
This patch implements labels and backtracking.
Pushing the address of a label onto the backtrack stack is somewhat tricky, because PushBacktrack is generally called before Bind. This patch adds code to store the CodeOffset that must be patched alongside the Label. (This works, because we only ever push a given label on the backtrack stack in a single place.) The actual machinery of patching will be added in a later patch.
As part of backtracking, we also implement backtrack stack limit checks. The OOL helper that is called if those checks fail will be added in a later patch.
Depends on D68414
Assignee | ||
Comment 4•4 years ago
|
||
Now that we have labels set up, we can add a bunch of Check* methods.
Depends on D68415
Pushed by iireland@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5cf4e0a8f205 Part 1: Add SMRegExpMacroAssembler r=jandem https://hg.mozilla.org/integration/autoland/rev/153dea7044cc Part 2: Push, Pop, register allocation r=jandem https://hg.mozilla.org/integration/autoland/rev/bcc25f6e2edf Part 3: Labels and backtracking r=jandem https://hg.mozilla.org/integration/autoland/rev/e69279c8674f Part 4: Simple checks r=jandem
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 6•4 years ago
|
||
Some less frequently used data is stored in the stack frame. This patch adds the accessors for that data and implements a few methods that use it.
Note: V8 stores inputStartMinusOne instead of inputStart. (Our old implementation does the same thing.) This is because we use inputStart - 1 when initializing or clearing capture registers. After spending multiple hours convincing myself that back-references in lookbehind assertions did not look past the beginning of the string, I got fed up and decided to simplify. Saving one subtraction when clearing registers is not worth the mental overhead.
Assignee | ||
Comment 7•4 years ago
|
||
The irregexp API has a bunch of methods that work with "registers". They are stored on the stack, and in practice they work like local variables.
Depends on D68627
Assignee | ||
Comment 8•4 years ago
|
||
CheckBitInTable looks up the current character (mod 128) in a table to decide whether or not to branch.
The ownership story of the table is somewhat complex. In V8, the ByteArray is a GC thing, rooted in a HandleScope higher up the stack. In SM, because we need to be able to allocate a ByteArray while generating masm, the ByteArray lives on the C++ stack. Its ownership is managed via unique pointer. Initially, the ByteArray is owned by the HandleScope, and will be freed when it goes out of scope. When we use a table in a CheckBitInTable, the macroassembler takes ownership of the ByteArray. Once we are done compiling, the RegExpShared will take ownership of the tables it uses.
Depends on D68628
Assignee | ||
Comment 9•4 years ago
|
||
This method implements faster code where possible for various character classes. It is always safe to return false.
Depends on D68629
Comment 10•4 years ago
|
||
bugherder |
Comment 11•4 years ago
|
||
Pushed by iireland@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/049b6c8d1276 Part 5: FrameData r=jandem https://hg.mozilla.org/integration/autoland/rev/ff4dd0cd8877 Part 6: Registers r=jandem https://hg.mozilla.org/integration/autoland/rev/6f0d82db1e4e Part 7: CheckBitInTable r=jandem https://hg.mozilla.org/integration/autoland/rev/4493eec3591a Part 8: CheckSpecialCharacterClass r=jandem
Assignee | ||
Comment 12•4 years ago
|
||
This is the last of the MacroAssembler interface, aside from GetCode, which handles everything else: prologue, epilogue, OOL handlers, and linking.
Note: non-unicode case insensitive comparisons (/i, not /iu) are weird in JS. V8 didn't follow the spec very well until I uplifted some fixes. (SM was at least a little better; I caught the V8 problem when one of our non262 tests failed.) There's a similar problem in backreferences that I didn't catch at the time. For example, "/(.)\1/i" should not match "kK" (where the second letter is KELVIN SIGN), but it does, for both V8 and SM.
Getting this right requires CheckNotBackReferenceIgnoreCase to behave differently if the regexp doesn't have the unicode flag set. Ironically, Jakob Gruber landed a patch less than a month ago to remove that information. It looks like the fix will cut across several parts of irregexp, so I'm putting it off until later.
Comment 13•4 years ago
|
||
bugherder |
Assignee | ||
Comment 14•4 years ago
|
||
GetCode is called at the end of code generation. It generates the prologue and epilogue code, as well as some OOL handlers if necessary, then links the code and returns a JitCode object. In our old implementation this is a single 400 line function, but for the sake of my sanity I broke it up into chunks. This patch contains the skeleton. The prologue, epilogue, and handlers are in the next patches.
The V8 interface requires that we return a Handle here. This makes OOM handling slightly awkward. This is exacerbated by masm's ability to quietly OOM and not report it until the end, which makes it non-trivial to put an AutoEnterOOMUnsafeRegion in the right place. In most places in the irregexp port I crash on OOM, but here the least gross approach appeared to be to return a fake handle pointing to UndefinedValue, which can be handled cleanly in the caller.
Assignee | ||
Comment 15•4 years ago
|
||
InputOutputData will eventually need to be visible in CodeGenerator.cpp when calling from Ion into irregexp, so I put the definition in a new header file.
Depends on D68999
Assignee | ||
Comment 16•4 years ago
|
||
V8 (and our old import) supports a global cache optimization where we find multiple global matches at once and then iterate over those when finding the next match, instead of jumping back into the generated code each time. We never actually hooked that optimization up to anything, though, and I'm not sure whether it would add enough performance to justify the complexity.
I've pulled all of that logic out, so the success handler here is significantly simpler than it was in the old implementation.
(I had already written most of the code by the time I realized what was going on, so I have an optional patch in my stack to re-add the support if we decide we want to implement the global cache.)
Depends on D69000
Assignee | ||
Comment 17•4 years ago
|
||
The stack overflow handler is based very heavily on the old implementation. There may be a nicer way to write it.
After this patch, we have a complete RegExpMacroAssembler implementation.
Depends on D69001
Comment 18•4 years ago
|
||
Pushed by iireland@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/96331c813c4b Part 9: CheckNotBackReference r=jandem
Comment 19•4 years ago
|
||
bugherder |
Comment 20•4 years ago
|
||
Pushed by iireland@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5e82a9d26f0e Part 10: GetCode r=jandem https://hg.mozilla.org/integration/autoland/rev/a6d1292d28aa Part 11: Allocate and initialize stack frame r=jandem https://hg.mozilla.org/integration/autoland/rev/df132fbab8e6 Part 12: Success and exit handlers r=jandem https://hg.mozilla.org/integration/autoland/rev/44264ec48d76 Part 13: Stack overflow and backtrack handlers r=jandem
Assignee | ||
Updated•4 years ago
|
Comment 21•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Description
•