1064374 - Add Maximal Munch to the lowering phase, so instruction-fusion can be implemented

Reporter

Description

•

10 years ago

Attached patch AddTiles-r0.patch — Details — Splinter Review

There are many optimizations that combine a sequence of instructions into a single instruction. Maximal munch is an algorithm that is easily extensible, and allows pretty arbitrary fusion. While writing a few fusions, I had some ideas to clean up the code, and in general, make it easier to use, but I didn't want to get caught up in feature-bloat-hell, so I'd like to save any non-essential refactorings for a later patch.

Attachment #8485796 - Flags: feedback?(sunfish)

Marty Rosenberg [:mjrosenb]

Reporter

Comment 1

•

10 years ago

Attached patch AddMLA-r0.patch — Details — Splinter Review

ARM can do d <- a + (b*c) in a single instruction, this generates that sequence. Sadly, it only generates it for (b*c)+a at the moment.

Attachment #8485801 - Flags: feedback?(sunfish)

Marty Rosenberg [:mjrosenb]

Reporter

Comment 2

•

10 years ago

Attached patch AddRotate-r0.patch — Details — Splinter Review

recognizes (x>>>y) OP (x<<(32-y)), for OP \in {|,+,^} as a rotate right instruction. This needs a bit more work to reach its full potential, namely, it needs to be able to handle (x>>>y) OP expr OP (x<<(32-y)), where expr is an arbitrary chain of OP's.

Attachment #8485804 - Flags: feedback?(sunfish)

Marty Rosenberg [:mjrosenb]

Reporter

Comment 3

•

10 years ago

Attached patch AddFShift-r0.patch — Details — Splinter Review

All of ARM's ALU operations can accept an arbitrary shift to its second operand. This implements *most* of that. Currently missing is support for Rotate (mostly because rotate doesn't exist before lowering, but I have a derently elegant plan for this)

Attachment #8485810 - Flags: feedback?(sunfish)

Marty Rosenberg [:mjrosenb]

Reporter

Comment 4

•

10 years ago

Attached patch AddScaleRandom-r0.patch — Details — Splinter Review

This is mostly a 'because I can" patch. I noticed that Math.floor(Math.random()*c) is a rather common pattern, where c is a small, integer constant. Since Math.random() is implemented as: Generate52BitRandom() / (double) (1LL << 52) this should be equivalent to: Generate52BitRandom() * c >> 52 which I believe is much easier to implement in jit-code than the current vanilla Math.random(), since that requires an int64->float64 conversion. This avoids all double code, and thus just needs a 64-bit integer multiply (by a small constant, so 2 32-bit multiplies should be sufficient).

Attachment #8485815 - Flags: feedback?(sunfish)

Marty Rosenberg [:mjrosenb]

Reporter

Comment 5

•

10 years ago

Future Plans: * RInstruction integration: Currently, none of these optimizations trigger if any of the intermediate results get captured in a resume point. Most of these can be implemented as part of the recover instructions framework. * Better matching system than providing a hard-coded MDefinition::Opcode, templates will likely be involved. This will cut down on the large number of twigs that are currently used for all of fused shift-alu ops. * Offline storage for match state: Currently, I don't implement commutative operations because that information can't be stored anywhere, and without that, I don't really trust the multiple passes to generate the same tiling. If the best-tile stored an index, rather than a pointer, we could store this in the extra bits, after the index. * Automatically generate the huge table of twigs. I know how to do this, but it requires some c++-14 features, which are a while off from being supported atm.

Marty Rosenberg [:mjrosenb]

Reporter

Comment 6

•

10 years ago

Oh yes. the const work, how could I forget about the const work? on ARM, and x64, not all relevant constants can be used in any instruction. My idea for dealing with this is to more or less remove useRegisterOrConstant, let GVN do its thing with constants, and then in lowering, fuse appropriate constants into the instruction. This way, if there is a tight loop adding 257 to a number, we can trivially put 257 into a register outside the loop, rather than inlining the constant, then having the MacroAssembler manually put it into a scratch register, since 257 doesn't fit into a single instruction. Another optimization on top of this is in the register allocator. Rather than ever spilling a const to the stack, we should probably just have the register allocator re-synthesize the const when it is needed. This may already be done, but it doesn't matter as much right now.

Lars T Hansen [:lth]

Comment 7

•

10 years ago

One thing to note: with the shared memory and atomics work some patches are coming that will depend on certain reorderings of instructions not happening, even when there are no explicit fence instructions in the object code. x86 and SPARC-TSO will generally not have fence instructions, they only require them after certain stores. We might have to need to have a way to communicate this to the muncher.

Luke Wagner [:luke]

Comment 8

•

10 years ago

Would those constraints already be expressed in the MIR nodes to prevent other reordering optimizations?

Marty Rosenberg [:mjrosenb]

Reporter

Comment 9

•

10 years ago

The muncher currently doesn't do any reorderings, but I guess if you have |t1 <- foo(a,b); FENCE; t2 <- bar(t1, c)|, and have a twig for foobar, it will result in |FENCE; foobar(a,b,c)| I suspect all of the twigs I wrote so far can't have any particularly bad interactions with fences, since they're all math operations (except Math.random)

Dan Gohman [:sunfish]

Comment 10

•

10 years ago

Comment on attachment 8485796 [details] [diff] [review] AddTiles-r0.patch Review of attachment 8485796 [details] [diff] [review]: ----------------------------------------------------------------- Ok, here are some comments based on an initial pass through the patch. There's more to come here, but this should get things started. Overall, I think this is a cool new direction! One high-level question: Are tables of twigs the right approach here? An alternative way to do this is just to have each target define an MDefinitionVisitor which would take an MInstruction* and return a twig. If a target has multiple twigs for a single opcode, it just checks for them one after another. An advantage of this approach is that it makes it easier to factor out checks that are redundant between twigs. I don't have an opinion about what's best yet; I just wanted to hear your thoughts about how we should handle the twigs. ::: js/src/jit/LIR.cpp @@ +554,5 @@ > +LAllocation > +LInstructionTwig::useRegisterOrConstant(LIRGenerator *gen, MDefinition *mir) > +{ > + return gen->useRegisterOrConstant(mir); > +} These wrapper functions don't appear to add a lot of value. And, between AtStart variants and use vs useRegister vs useAny vs useFixed etc., there are a lot of these functions, so it'd be unfortunate to have boilerplate wrappers for all of them. ::: js/src/jit/LIR.h @@ +597,5 @@ > class LSafepoint; > class LInstructionVisitor; > +template <size_t Defs, size_t Operands, size_t Temps> > +class LInstructionHelper; > +class LInstructionTwig { Style nit: Please add a blank line before this class definition. Also, I think I saw you explain what a twig is on irc, but it'd be a good thing to mention in a comment. @@ +602,5 @@ > + protected: > + MDefinition::Opcode iMatch; > + MIRType iMatchType; > + // Both tell if the given instruction matches > + // the input, and give the match weight for the generated instructions. Style nit: Comments should go above their associated code rather than below it. @@ +604,5 @@ > + MIRType iMatchType; > + // Both tell if the given instruction matches > + // the input, and give the match weight for the generated instructions. > + public: > + MOZ_CONSTEXPR LInstructionTwig(MDefinition::Opcode op, MIRType type) : iMatch(op), iMatchType(type) {}; Style nit: this semicolon is unnecessary. @@ +617,5 @@ > + sum += inst->getOperand(i)->getWeight(); > + } > + return sum; > + } > + virtual void choose(MInstruction *inst) const { This function should have a comment briefly describing what it does. It looks like the |inst| argument is redundant with the |this| argument. Or will there be cases where it's different? @@ +958,5 @@ > }; > > +// Helper for fused instructions of the form: > +// foo <- def_op(a,b,c,d...) > +// bar,baz, quux <- use_op(o1, o2, .., o_n = foo, ...). This comment is tantalizing, but it's not clear what the relationship between a,b,c,d and op1,op2,... is. Are the operand lists concatenated? @@ +1028,5 @@ > + continue; > + if (!ndef->hasOneUse()) > + continue; > + if (inst->block() != ndef->block()) > + continue; I'm curious why we're requiring inst and ndef to be in the same block here. ::: js/src/jit/Lowering.cpp @@ +3755,5 @@ > + for (uint32_t i = 0; (*iter)->applicableTwigs()[i] != nullptr; i++) { > + const LInstructionTwig *t = (*iter)->applicableTwigs()[i]; > + int32_t curWeight = t->matchWeight(*iter); > + if (curWeight != -1 && (bestIdx == -1 || curWeight < bestWeight)) { > + (*iter)->setTwig(i, curWeight); Should there be an assignment to bestWeight here? Also, it seems confusing to call setTwig inside the loop. I think it would be clearer to just compute the best bestWeight value in the loop, and then after the loop call setTwig once using that value. @@ +3764,5 @@ > + for (MInstructionReverseIterator riter = block->rbegin(); riter != block->rend(); riter++) { > + if ((*riter)->getTwig() != nullptr) { > + (*riter)->getTwig()->choose(*riter); > + } > + } Ok, the algorithm here is that it's doing a match for each instruction in a top-down walk, and then clearing the twig field for instructions that are matched as part of patterns in lower roots. Would you get the same effect from a single pass which just starts at the last instruction in the block, picks its twig, and marks all the instructions that participate in the twig as done, and then walk up until an instruction which isn't done yet is reached, and then pick its twig, and so on? If so, would it be simpler and possibly faster to do that? If not, then it may be interesting to understand why not. ::: js/src/jit/MIR.cpp @@ +3633,5 @@ > +#endif > +#ifndef ARCH_BitNot_TWIGS > +#define ARCH_BitNot_TWIGS > +#endif > +const LInstructionTwig *MAsmJSNeg::ApplicableTwigs_[] = {&LInstructionTwig::DefaultTwig, nullptr}; const nit: there should be a const after the * also, to say that the pointers themselves are const.

Attachment #8485796 - Flags: feedback?(sunfish)

Dan Gohman [:sunfish]

Comment 11

•

10 years ago

Comment on attachment 8485815 [details] [diff] [review] AddScaleRandom-r0.patch Review of attachment 8485815 [details] [diff] [review]: ----------------------------------------------------------------- I'd really like to focus on the underlying maximal-munch framework and the more obvious use cases here, like the mul-add patch. Math.random brings in a bunch of concerns that I'm afraid risk being a distraction here.

Attachment #8485815 - Flags: feedback?(sunfish)

Lars T Hansen [:lth]

Comment 12

•

10 years ago

(In reply to Luke Wagner [:luke] from comment #8) > Would those constraints already be expressed in the MIR nodes to prevent > other reordering optimizations? They are, with suitable combinations of setGuard, !setMovable, isEffectful, and alias sets (currently very conservative but could be refined).

Lars T Hansen [:lth]

Comment 13

•

10 years ago

(In reply to Marty Rosenberg [:mjrosenb] from comment #9) > The muncher currently doesn't do any reorderings, but I guess if you have > |t1 <- foo(a,b); FENCE; t2 <- bar(t1, c)|, and have a twig for foobar, it > will result in |FENCE; foobar(a,b,c)| That would be Wrong, and especially it would be easy to get this wrong if FENCE is not an actual instruction, as is the case on x86 and SPARC-TSO.

Marty Rosenberg [:mjrosenb]

Reporter

Comment 14

•

10 years ago

(In reply to Dan Gohman [:sunfish] from comment #10) > Comment on attachment 8485796 [details] [diff] [review] > AddTiles-r0.patch > > Review of attachment 8485796 [details] [diff] [review]: > ----------------------------------------------------------------- > > Ok, here are some comments based on an initial pass through the patch. > There's more to come here, but this should get things started. Overall, I > think this is a cool new direction! > > One high-level question: Are tables of twigs the right approach here? An > alternative way to do this is just to have each target define an > MDefinitionVisitor which would take an MInstruction* and return a twig. If a > target has multiple twigs for a single opcode, it just checks for them one > after another. An advantage of this approach is that it makes it easier to > factor out checks that are redundant between twigs. I don't have an opinion > about what's best yet; I just wanted to hear your thoughts about how we > should handle the twigs. > I think tables are certainly an easy approach. One of my goals was to have adding a new twig/optimization be as easy as possible. Attempting to weave a new check into a pre-existing visitor is something that I've always found to be a bit awkward. That being said, the explicit tables are very ugly, and updating them is straightforward, but tedious. With variadic templates, we can derive all of the tables at compile time, unfortunately, I'm pretty sure that variadic templates are required. > ::: js/src/jit/LIR.cpp > @@ +554,5 @@ > > +LAllocation > > +LInstructionTwig::useRegisterOrConstant(LIRGenerator *gen, MDefinition *mir) > > +{ > > + return gen->useRegisterOrConstant(mir); > > +} > > These wrapper functions don't appear to add a lot of value. And, between > AtStart variants and use vs useRegister vs useAny vs useFixed etc., there > are a lot of these functions, so it'd be unfortunate to have boilerplate > wrappers for all of them. They don't! the only reason they are there is because those methods are protected, and I thought that adding every single twig as a friend would be tedious, and error prone. The internet was not able to give me a better solution, but I am all ears. > @@ +617,5 @@ > > + sum += inst->getOperand(i)->getWeight(); > > + } > > + return sum; > > + } > > + virtual void choose(MInstruction *inst) const { > > This function should have a comment briefly describing what it does. > > It looks like the |inst| argument is redundant with the |this| argument. Or > will there be cases where it's different? this will be an LInstructionTwig, not an MInstruction. > > @@ +958,5 @@ > > }; > > > > +// Helper for fused instructions of the form: > > +// foo <- def_op(a,b,c,d...) > > +// bar,baz, quux <- use_op(o1, o2, .., o_n = foo, ...). > > This comment is tantalizing, but it's not clear what the relationship > between a,b,c,d and op1,op2,... is. Are the operand lists concatenated? > Yeah. I didn't put as much thought into that as I should have. The comment deals mainly with the matching phase, not the emitting phase, so it does't deal with the operands for the generated LInstruction, which will have something-like concatenation. > @@ +1028,5 @@ > > + continue; > > + if (!ndef->hasOneUse()) > > + continue; > > + if (inst->block() != ndef->block()) > > + continue; > > I'm curious why we're requiring inst and ndef to be in the same block here. > Paranoia, plain and simple **. Imagine the gains we could get later from just removing it! Also, it has the ability to be bad. An add instruction is one cycle, a multiply-accumulate instruction is 3 cycles. If we move a multiply into a hot loop, by fusing it into a multiply-accumulate, we'll slow things down a lot. > ::: js/src/jit/Lowering.cpp > @@ +3755,5 @@ > > + for (uint32_t i = 0; (*iter)->applicableTwigs()[i] != nullptr; i++) { > > + const LInstructionTwig *t = (*iter)->applicableTwigs()[i]; > > + int32_t curWeight = t->matchWeight(*iter); > > + if (curWeight != -1 && (bestIdx == -1 || curWeight < bestWeight)) { > > + (*iter)->setTwig(i, curWeight); > > Should there be an assignment to bestWeight here? > > Also, it seems confusing to call setTwig inside the loop. I think it would > be clearer to just compute the best bestWeight value in the loop, and then > after the loop call setTwig once using that value. > Sounds reasonable. > @@ +3764,5 @@ > > + for (MInstructionReverseIterator riter = block->rbegin(); riter != block->rend(); riter++) { > > + if ((*riter)->getTwig() != nullptr) { > > + (*riter)->getTwig()->choose(*riter); > > + } > > + } > > Ok, the algorithm here is that it's doing a match for each instruction in a > top-down walk, and then clearing the twig field for instructions that are > matched as part of patterns in lower roots. > > Would you get the same effect from a single pass which just starts at the > last instruction in the block, picks its twig, and marks all the > instructions that participate in the twig as done, and then walk up until an > instruction which isn't done yet is reached, and then pick its twig, and so > on? If so, would it be simpler and possibly faster to do that? If not, then > it may be interesting to understand why not. > I don't think that'll work. This is part of why the match function returns a weight. Without knowing the weight of every operand, and all their ancestors, you can't really know the minimal weight of the current node. Let's say I've implemented the ARM instruction sbfx (signed bit-field extract), which does |rd <- (rs >> C1) & ((1<<C2) - 1)| and we have the expression |d = ((a & 0xff0) >> 4) ^ (b << 5)|. there are a few choices here. We can fuse the |<< 5| into the xor, or we can fuse the |>>4| into the xor. The best choice here would be to fuse |<<5| in, so that |(a&0xff0) >> 4| can be fused into a single |sbfx rd, ra, #4, #8| If the >>4 got folded into the xor, then we couldn't generate the sbfx instruction. We could reduce this to two passes, if we could somehow or other generate the LInstructions for each block backwards, and at the same time combine choose and emit. I suspect that would not work though. **RE: same block, I also wanted to make sure that all operands were processed before their instruction. I'm pretty sure that nothing would break if we looked at instructions in other blocks, but on the off chance the blocks got processed out of order, I didn't feel like risking it.

Lars T Hansen [:lth]

Comment 15

•

10 years ago

(In reply to Lars T Hansen [:lth] from comment #13) > (In reply to Marty Rosenberg [:mjrosenb] from comment #9) > > The muncher currently doesn't do any reorderings, but I guess if you have > > |t1 <- foo(a,b); FENCE; t2 <- bar(t1, c)|, and have a twig for foobar, it > > will result in |FENCE; foobar(a,b,c)| > > That would be Wrong, and especially it would be easy to get this wrong if > FENCE is not an actual instruction, as is the case on x86 and SPARC-TSO. Let me rephrase that: that would be wrong if foo and bar both have memory references to shared memory.

Dan Gohman [:sunfish]

Comment 16

•

10 years ago

(In reply to Marty Rosenberg [:mjrosenb] from comment #14) > (In reply to Dan Gohman [:sunfish] from comment #10) > > Comment on attachment 8485796 [details] [diff] [review] > > AddTiles-r0.patch > > > > Review of attachment 8485796 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > Ok, here are some comments based on an initial pass through the patch. > > There's more to come here, but this should get things started. Overall, I > > think this is a cool new direction! > > > > One high-level question: Are tables of twigs the right approach here? An > > alternative way to do this is just to have each target define an > > MDefinitionVisitor which would take an MInstruction* and return a twig. If a > > target has multiple twigs for a single opcode, it just checks for them one > > after another. An advantage of this approach is that it makes it easier to > > factor out checks that are redundant between twigs. I don't have an opinion > > about what's best yet; I just wanted to hear your thoughts about how we > > should handle the twigs. > > > I think tables are certainly an easy approach. One of my goals was to have > adding a new twig/optimization be as easy as possible. Yes, I agree with this goal! > Attempting to weave > a new check into a pre-existing visitor is something that I've always found > to be a bit awkward. That being said, the explicit tables are very ugly, > and updating them is straightforward, but tedious. With variadic templates, > we can derive all of the tables at compile time, unfortunately, I'm pretty > sure that variadic templates are required. The table approach runs each pattern independently of the others, so if we did a non-table approach and established a convention like this (rough pseudo-code): bool patternMatch(MInstruction *ins) { return patternMatchA(ins) || patternMatchB(ins) || patternMatchC(ins) || ... } bool patternMatchA(MInstruction *ins) { ... it seems like we'd get about the same effect, with roughly the same ease of adding new patterns, and less overall boilerplate and/or variadic template sorcery. I'd also suggest we create a new source file for holding pattern-matching code, to make it easier for people to navigate. This approach may even be more efficient, because you wouldn't need a per-pattern virtual function call, as the table approach does. What do you think of this approach? > > ::: js/src/jit/LIR.cpp > > @@ +554,5 @@ > > > +LAllocation > > > +LInstructionTwig::useRegisterOrConstant(LIRGenerator *gen, MDefinition *mir) > > > +{ > > > + return gen->useRegisterOrConstant(mir); > > > +} > > > > These wrapper functions don't appear to add a lot of value. And, between > > AtStart variants and use vs useRegister vs useAny vs useFixed etc., there > > are a lot of these functions, so it'd be unfortunate to have boilerplate > > wrappers for all of them. > They don't! the only reason they are there is because those methods are > protected, and I thought that adding every single twig as a friend would be > tedious, and error prone. The internet was not able to give me a better > solution, but I am all ears. I'd prefer to make those methods public than to wrap them all in that way. > > > @@ +617,5 @@ > > > + sum += inst->getOperand(i)->getWeight(); > > > + } > > > + return sum; > > > + } > > > + virtual void choose(MInstruction *inst) const { > > > > This function should have a comment briefly describing what it does. > > > > It looks like the |inst| argument is redundant with the |this| argument. Or > > will there be cases where it's different? > this will be an LInstructionTwig, not an MInstruction. Oh, oops. Yes, you're right. > > > > > @@ +958,5 @@ > > > }; > > > > > > +// Helper for fused instructions of the form: > > > +// foo <- def_op(a,b,c,d...) > > > +// bar,baz, quux <- use_op(o1, o2, .., o_n = foo, ...). > > > > This comment is tantalizing, but it's not clear what the relationship > > between a,b,c,d and op1,op2,... is. Are the operand lists concatenated? > > > Yeah. I didn't put as much thought into that as I should have. The comment > deals mainly with the matching phase, not the emitting phase, so it does't > deal with the operands for the generated LInstruction, which will have > something-like concatenation. Ok. This is one of the more "interesting" parts of this patch, so anything you can do to clarify what's going on here would be helpful. > > @@ +1028,5 @@ > > > + continue; > > > + if (!ndef->hasOneUse()) > > > + continue; > > > + if (inst->block() != ndef->block()) > > > + continue; > > > > I'm curious why we're requiring inst and ndef to be in the same block here. > > > Paranoia, plain and simple **. Imagine the gains we could get later from > just removing it! > Also, it has the ability to be bad. An add instruction is one cycle, a > multiply-accumulate instruction is 3 cycles. If we move a multiply into a > hot loop, by fusing it into a multiply-accumulate, we'll slow things down a > lot. Sounds reasonable, especially for now. > > ::: js/src/jit/Lowering.cpp > > @@ +3755,5 @@ > > > + for (uint32_t i = 0; (*iter)->applicableTwigs()[i] != nullptr; i++) { > > > + const LInstructionTwig *t = (*iter)->applicableTwigs()[i]; > > > + int32_t curWeight = t->matchWeight(*iter); > > > + if (curWeight != -1 && (bestIdx == -1 || curWeight < bestWeight)) { > > > + (*iter)->setTwig(i, curWeight); > > > > Should there be an assignment to bestWeight here? > > > > Also, it seems confusing to call setTwig inside the loop. I think it would > > be clearer to just compute the best bestWeight value in the loop, and then > > after the loop call setTwig once using that value. > > > Sounds reasonable. > > @@ +3764,5 @@ > > > + for (MInstructionReverseIterator riter = block->rbegin(); riter != block->rend(); riter++) { > > > + if ((*riter)->getTwig() != nullptr) { > > > + (*riter)->getTwig()->choose(*riter); > > > + } > > > + } > > > > Ok, the algorithm here is that it's doing a match for each instruction in a > > top-down walk, and then clearing the twig field for instructions that are > > matched as part of patterns in lower roots. > > > > Would you get the same effect from a single pass which just starts at the > > last instruction in the block, picks its twig, and marks all the > > instructions that participate in the twig as done, and then walk up until an > > instruction which isn't done yet is reached, and then pick its twig, and so > > on? If so, would it be simpler and possibly faster to do that? If not, then > > it may be interesting to understand why not. > > > I don't think that'll work. This is part of why the match function returns a > weight. Without knowing the weight of every operand, and all their > ancestors, you can't really know the minimal weight of the current node. > Let's say I've implemented the ARM instruction sbfx (signed bit-field > extract), which does |rd <- (rs >> C1) & ((1<<C2) - 1)| > and we have the expression |d = ((a & 0xff0) >> 4) ^ (b << 5)|. there are a > few choices here. We can fuse the |<< 5| into the xor, or we can fuse the > |>>4| into the xor. The best choice here would be to fuse |<<5| in, so that > |(a&0xff0) >> 4| can be fused into a single |sbfx rd, ra, #4, #8| If the > >>4 got folded into the xor, then we couldn't generate the sbfx instruction. > We could reduce this to two passes, if we could somehow or other generate > the LInstructions for each block backwards, and at the same time combine > choose and emit. I suspect that would not work though. Ah, thanks for the very useful example. The question then is whether this benefit will justify the extra compile time of doing the extra pattern matching that this presumably entails. Have you measured the impact on compile time of the current patches? Of course, we also have to consider the compile time in the future when we add more patterns. Would it be feasible to implement both approaches, controlled by a flag, so that we can compare the two? > **RE: same block, I also wanted to make sure that all operands were > processed before their instruction. I'm pretty sure that nothing would > break if we looked at instructions in other blocks, but on the off chance > the blocks got processed out of order, I didn't feel like risking it. Makes sense. A brief comment mentioning this and the concern about loops etc. would be useful. Thanks!

Dan Gohman [:sunfish]

Comment 17

•

10 years ago

Comment on attachment 8485804 [details] [diff] [review] AddRotate-r0.patch Review of attachment 8485804 [details] [diff] [review]: ----------------------------------------------------------------- As an overall comment, this patch and what it represents is really cool! ::: js/src/jit/LIR-Common.h @@ +2349,3 @@ > return js_CodeName[op_]; > } > + virtual bool isRotate() { This function appears to only be called in asserts. Can it be guarded by #ifdef DEBUG? Otherwise, whenever I see it I'll wonder what it's for :-). @@ +2355,5 @@ > + > +class LRotateI : public LShiftI { > + LIR_HEADER(RotateI); > + public: > + LRotateI() : LShiftI(JSOP_FAKE_LIMIT) {} Random suggestion (feel free to ignore it): What if we used JSOP_OR here, since JSOP_OR is one of the nodes that would be at the root of a rotate expression? Alternatively, perhaps we should introduce a new enum for LRotateI. It may become desirable in any case if we also do a rotate-left in addition to rotate-right. @@ +2356,5 @@ > +class LRotateI : public LShiftI { > + LIR_HEADER(RotateI); > + public: > + LRotateI() : LShiftI(JSOP_FAKE_LIMIT) {} > + virtual bool isRotate() { Ditto about ifdef DEBUG? @@ +2364,5 @@ > + return "Rotate"; > + } > + // Rotate is a bit more complex than the standard fusion instructions, > + // so I need to implement this independently. > + static MInstruction *getOp(MInstruction *inst, int id) { This getOp returns the operand if it's an instruction. It's not obvious whether this is needed, instead of having code just call getOperand directly. MDefinition is where op() and isFoo() and toFoo() and hasOneUse() etc. are defined, which I naively expect is most of what the pattern-matching code actually needs. ::: js/src/jit/LIR.cpp @@ +610,5 @@ > + > +int > +LRotateI::LRotateITwig::matchWeight(MInstruction *inst) const > +{ > + Style nit: no blank line here @@ +622,5 @@ > + case MDefinition::Op_Add: > + break; > + default: > + return -1; > + } Is this switch necessary? The rotate twig is only added to the tables for bitor, bitxor, and add, so it looks like matchWeight shouldn't ever get called on anything else. Or am I missing something? @@ +628,5 @@ > + MInstruction *right = getOp(inst, 1); > + if (!left || !right) { > + return -1; > + } > + if (right->isLsh() && left->isUrsh()) { Talking about getOp above; here, we end up tesing isLsh() and isUrsh() immediately, so the isInstruction checks that getOp did are redundant. And ditto for most of the getOp calls below. @@ +631,5 @@ > + } > + if (right->isLsh() && left->isUrsh()) { > + MInstruction *tmp = left; > + left = right; > + right = tmp; You can use mozilla::Swap to do this. @@ +642,5 @@ > + MInstruction *baseR = getOp(right, 0); > + if (!baseL || !baseR) { > + return -1; > + } > + if (!baseL->congruentTo(baseR)) { On getOp again, here the isInstruction checks that getOp did are needlessly preventing this from forming rotates when the input to the rotate would be a phi. Also, it should be sufficient to test baseL == baseR here instead of calling congruentTo, since they both must dominate the base of this expression tree, so GVN should have merged them by this point. Ditto for the congruentTo calls below. @@ +661,5 @@ > + if (!c->toConstant()->value().isInt32()) > + return -1; > + if (c->toConstant()->value().toInt32() != 32) > + return -1; > + MInstruction *possibleAmt = getOp(amt, 1); We're just pattern-matching rotate-right here, and not rotate-left, right? That's fine for now, but it doesn't look right to pattern-match a "32 - amt" on the right-shift in that case. Am I missing something? @@ +698,5 @@ > +} > +void > +LRotateI::LRotateITwig::choose(MInstruction *inst) const > +{ > + Style nit: no blank line @@ +703,5 @@ > + MInstruction *l = getOp(inst, 0); > + MInstruction *r = getOp(inst, 1); > + MInstruction *compAmt = nullptr; > + JS_ASSERT(l && r); > + if (l->hasOneUse()) { Is there a reason why we're testing hasOneUse() in the choose function, rather than the matchWeight function? @@ +711,5 @@ > + } > + } > + > + if (r->hasOneUse()) { > + Style nit: no blank line @@ +713,5 @@ > + > + if (r->hasOneUse()) { > + > + r->clearTwig(); > + if (r->isLsh()) { If the hasOneUse() checks are moved into matchWeight, then we can skip this check, because it will be part of the pattern. @@ +718,5 @@ > + compAmt = getOp(r, 1); > + } > + } > + > + if (compAmt && compAmt->hasOneUse()) { If we do the hasOneUse checks in matchWeight, could the null check for compAmt be removed? @@ +723,5 @@ > + compAmt->clearTwig(); > + } > +} > +bool > +LRotateI::LRotateITwig::emit(LIRGenerator *gen, MInstruction *inst) const The somewhat convention in Lowering circles is to name the MInstruction/MNode/etc. being lowered "mir". @@ +740,5 @@ > +#ifdef JS_CODEGEN_ARM > + lir->setOperand(0, useRegister(gen, base)); > + lir->setOperand(1, useRegisterOrConstant(gen, amt)); > + return define(gen, lir, inst, > + LDefinition(LDefinition::TypeFrom(inst->type()), LDefinition::REGISTER)); At a glance, this doesn't appear to be doing anything that ARM's lowerForShift isn't doing. Can we just call lowerForShift here for all targets, instead of having an ifdef? @@ +743,5 @@ > + return define(gen, lir, inst, > + LDefinition(LDefinition::TypeFrom(inst->type()), LDefinition::REGISTER)); > +#else > + return gen->lowerForShift(lir, inst, base, amt); > +#endif Otherwise, if we're going to do an ifdef, I think it makes sense to do #ifdef JS_CODEGEN_X86 || JS_CODEGEN_X64, since x86/x64 really are the ones with special requirements here. ARM is just being normal here :-). ::: js/src/jit/MIR.h @@ +1088,5 @@ > class MConstant : public MNullaryInstruction > { > Value value_; > + public: > + MConstant(const Value &v, types::CompilerConstraintList *constraints); Why is this being made public? ::: js/src/jit/arm/CodeGenerator-arm.cpp @@ +505,5 @@ > masm.ma_mla(ToRegister(acc), ToRegister(lhs), ToRegister(rhs), ToRegister(dest)); > return true; > } > > +bool Er, is this fixing a syntax error? It'd be nice to fix this in the patch which introduced it. ::: js/src/jit/shared/Lowering-x86-shared.h @@ +13,5 @@ > namespace jit { > > class LIRGeneratorX86Shared : public LIRGeneratorShared > { > + friend class LRotateI; I hope this isn't a trend... ;-). Do we just need to make a bunch of stuff in LIRGenerator and its subclasses public?

Attachment #8485804 - Flags: feedback?(sunfish)

Dan Gohman [:sunfish]

Comment 18

•

10 years ago

Comment on attachment 8485810 [details] [diff] [review] AddFShift-r0.patch Review of attachment 8485810 [details] [diff] [review]: ----------------------------------------------------------------- Someone with more ARM experience than I should look at the ARM parts here; I just have a few high-level comments. ::: js/src/jit/arm/LIR-arm.h @@ +481,5 @@ > LIR_HEADER(FMulAddI); > static const FMulAddITwig MyTwig; > }; > > +template <class ALU, class LIR> What is the LIR template parameter for here? @@ +488,5 @@ > + MDefinition::Opcode shiftop_; > + public: > + typedef LFusedInstructionHelper<LShiftI, ALU, 0> Parent; > + MDefinition::Opcode bitop() const {return bitop_;} > + MDefinition::Opcode shiftop() const {return shiftop_;} Style nit: put a space after { and before }. @@ +505,5 @@ > +class LFShiftAddI : public LFShiftALUI<LAddI, LFShiftAddI> { > + public: > + // nothing extra needed here, just need to declare the twigs. > + LIR_HEADER(FShiftAddI); > + static const FShiftALUITwig STwig; You marked the constructor as MOZ_CONSTEXPR. Can these instances be marked MOZ_CONSTEXPR too (if we're staying with the tables approach)?

Attachment #8485810 - Flags: feedback?(sunfish)

Dan Gohman [:sunfish]

Comment 19

•

10 years ago

Comment on attachment 8485801 [details] [diff] [review] AddMLA-r0.patch Review of attachment 8485801 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/arm/LIR-arm.h @@ +466,5 @@ > return getTemp(0); > } > }; > > +class LFMulAddI : public LFusedInstructionHelper<LMulI, LAddI, 0> { There's an odd naming ambiguity here with the word "fused" and "muladd" making it sound like this might be related to a "fused muladd". Please add a comment mentioning that that's not the case here. @@ +478,5 @@ > + // TODO: I almost certainly need a few extra checks here for this to be safe. > + }; > + public: > + LIR_HEADER(FMulAddI); > + static const FMulAddITwig MyTwig; MOZ_CONSTEXPR?

Attachment #8485801 - Flags: feedback?(sunfish)

Dan Gohman [:sunfish]

Comment 20

•

10 years ago

> @@ +478,5 @@ > > + // TODO: I almost certainly need a few extra checks here for this to be safe. > > + }; > > + public: > > + LIR_HEADER(FMulAddI); > > + static const FMulAddITwig MyTwig; > > MOZ_CONSTEXPR? Or rather, MOZ_CONSTEXPR_VAR, since this should be const even in non-constexpr compilers.

Dan Gohman [:sunfish]

Updated

•

10 years ago

Blocks: 1048641

Dan Gohman [:sunfish]

Comment 21

•

10 years ago

FYI, I just filed bug 1067484 with a patch for MIR-node pattern-matching which complements the main patch here and aims to make pattern-matching code easier to write.

Douglas Crosher [:dougc]

Comment 22

•

10 years ago

Fwiw when exploring the optimization asm.js heap access, and moving constant offsets into the load or store instruction, it was useful to be able to include part of the offset in the instruction and to emit separate MIR elements for the remainder. Often the compiler could hoist the remainder, for example when there are multiple offsets that differ by only a small amount. Since this depends on hoisting the remainders it would need to be done at the MIR stage, not at lowering. This does require dealing with instruction specific matters when emitting the MIR which does not seem very clean from the point of view of abstraction. (In reply to Marty Rosenberg [:mjrosenb] from comment #6) > Oh yes. the const work, how could I forget about the const work? > on ARM, and x64, not all relevant constants can be used in any instruction. > My idea for dealing with this is to more or less remove > useRegisterOrConstant, let GVN do its thing with constants, and then in > lowering, fuse appropriate constants into the instruction. This way, if > there is a tight loop adding 257 to a number, we can trivially put 257 into > a register outside the loop, rather than inlining the constant, then having > the MacroAssembler manually put it into a scratch register, since 257 > doesn't fit into a single instruction. > Another optimization on top of this is in the register allocator. Rather > than ever spilling a const to the stack, we should probably just have the > register allocator re-synthesize the const when it is needed. This may > already be done, but it doesn't matter as much right now.

Marty Rosenberg [:mjrosenb]

Reporter

Comment 23

•

10 years ago

(In reply to Dan Gohman [:sunfish] from comment #17) > Comment on attachment 8485804 [details] [diff] [review] > AddRotate-r0.patch > > Review of attachment 8485804 [details] [diff] [review]: > ----------------------------------------------------------------- > > As an overall comment, this patch and what it represents is really cool! > > ::: js/src/jit/LIR-Common.h > @@ +2349,3 @@ > > return js_CodeName[op_]; > > } > > + virtual bool isRotate() { > > This function appears to only be called in asserts. Can it be guarded by > #ifdef DEBUG? Otherwise, whenever I see it I'll wonder what it's for :-). I could have sworn that I used this. Either way, ion > > @@ +2355,5 @@ > > + > > +class LRotateI : public LShiftI { > > + LIR_HEADER(RotateI); > > + public: > > + LRotateI() : LShiftI(JSOP_FAKE_LIMIT) {} > > Random suggestion (feel free to ignore it): What if we used JSOP_OR here, > since JSOP_OR is one of the nodes that would be at the root of a rotate > expression? > Not the worst idea I've heard, I'll try it after I get the rest of this patch tested and in working order > Alternatively, perhaps we should introduce a new enum for LRotateI. It may > become desirable in any case if we also do a rotate-left in addition to > rotate-right. > This is probably the way forward. > @@ +2356,5 @@ > > +class LRotateI : public LShiftI { > > + LIR_HEADER(RotateI); > > + public: > > + LRotateI() : LShiftI(JSOP_FAKE_LIMIT) {} > > + virtual bool isRotate() { > > Ditto about ifdef DEBUG? > > @@ +2364,5 @@ > > + return "Rotate"; > > + } > > + // Rotate is a bit more complex than the standard fusion instructions, > > + // so I need to implement this independently. > > + static MInstruction *getOp(MInstruction *inst, int id) { > > This getOp returns the operand if it's an instruction. It's not obvious > whether this is needed, instead of having code just call getOperand > directly. MDefinition is where op() and isFoo() and toFoo() and hasOneUse() > etc. are defined, which I naively expect is most of what the > pattern-matching code actually needs. > using MDefinition is fine. I just found MInstruction, and declared "This is what I'm looking for, I shall use it!" > ::: js/src/jit/LIR.cpp > @@ +610,5 @@ > > + > > +int > > +LRotateI::LRotateITwig::matchWeight(MInstruction *inst) const > > +{ > > + > > Style nit: no blank line here > > @@ +622,5 @@ > > + case MDefinition::Op_Add: > > + break; > > + default: > > + return -1; > > + } > > Is this switch necessary? The rotate twig is only added to the tables for > bitor, bitxor, and add, so it looks like matchWeight shouldn't ever get > called on anything else. Or am I missing something? > We should probably have some safeguards against placing a twig in a context where it doesn't belong. > @@ +628,5 @@ > > + MInstruction *right = getOp(inst, 1); > > + if (!left || !right) { > > + return -1; > > + } > > + if (right->isLsh() && left->isUrsh()) { > > Talking about getOp above; here, we end up tesing isLsh() and isUrsh() > immediately, so the isInstruction checks that getOp did are redundant. And > ditto for most of the getOp calls below. yup. getOp is gone. > > @@ +631,5 @@ > > + } > > + if (right->isLsh() && left->isUrsh()) { > > + MInstruction *tmp = left; > > + left = right; > > + right = tmp; > > You can use mozilla::Swap to do this. TIL! > > @@ +642,5 @@ > > + MInstruction *baseR = getOp(right, 0); > > + if (!baseL || !baseR) { > > + return -1; > > + } > > + if (!baseL->congruentTo(baseR)) { > > On getOp again, here the isInstruction checks that getOp did are needlessly > preventing this from forming rotates when the input to the rotate would be a > phi. > really rubbing this getOp thing in, aren't you :-D > Also, it should be sufficient to test baseL == baseR here instead of calling > congruentTo, since they both must dominate the base of this expression tree, > so GVN should have merged them by this point. Ditto for the congruentTo > calls below. I had high hopes. My initial plan was to extract X, then generate 32-X, and test congruence between that and the other shift amount. The goal was that I could teach GVN about the equivalence of various sequences, then get them for free in this code. e.g. 32 - (x&31) == 31 ^ (x&31); 32 - (x&31) == (32-(x&31))&31 > > @@ +661,5 @@ > > + if (!c->toConstant()->value().isInt32()) > > + return -1; > > + if (c->toConstant()->value().toInt32() != 32) > > + return -1; > > + MInstruction *possibleAmt = getOp(amt, 1); > > We're just pattern-matching rotate-right here, and not rotate-left, right? > That's fine for now, but it doesn't look right to pattern-match a "32 - amt" > on the right-shift in that case. Am I missing something? > ... continuing the list, 32-(x-5) == 37 - x, 32 - 17 = 15. While it is less efficient to rotate right by 32-x, this is a totally legit amount to rotate right by, so for y << x | y >>> (32-x), we can still generate |ror y, (32-x)|, which we'd miss without this. Obviously on architectures that support left rotates, that would be preferable. > @@ +698,5 @@ > > +} > > +void > > +LRotateI::LRotateITwig::choose(MInstruction *inst) const > > +{ > > + > > Style nit: no blank line > > @@ +703,5 @@ > > + MInstruction *l = getOp(inst, 0); > > + MInstruction *r = getOp(inst, 1); > > + MInstruction *compAmt = nullptr; > > + JS_ASSERT(l && r); > > + if (l->hasOneUse()) { > > Is there a reason why we're testing hasOneUse() in the choose function, > rather than the matchWeight function? yes. This won't affect whether or not we use the twig, but it will affect if we eliminate the unused operand. If it has more than one use, then we can't eliminate it, if it has only one use, then we can eliminate it. A better idea would be to literally decrement the use count, so if we have 32-x that is used in 5 rotates, we'll emit 5 rotates, and 0 subtractions... This will require some heuristics to determine if we can actually eliminate all uses of the value. > > @@ +713,5 @@ > > + > > + if (r->hasOneUse()) { > > + > > + r->clearTwig(); > > + if (r->isLsh()) { > > If the hasOneUse() checks are moved into matchWeight, then we can skip this > check, because it will be part of the pattern. > Ditto. > > @@ +723,5 @@ > > + compAmt->clearTwig(); > > + } > > +} > > +bool > > +LRotateI::LRotateITwig::emit(LIRGenerator *gen, MInstruction *inst) const > > The somewhat convention in Lowering circles is to name the > MInstruction/MNode/etc. being lowered "mir". > > @@ +740,5 @@ > > +#ifdef JS_CODEGEN_ARM > > + lir->setOperand(0, useRegister(gen, base)); > > + lir->setOperand(1, useRegisterOrConstant(gen, amt)); > > + return define(gen, lir, inst, > > + LDefinition(LDefinition::TypeFrom(inst->type()), LDefinition::REGISTER)); > > At a glance, this doesn't appear to be doing anything that ARM's > lowerForShift isn't doing. Can we just call lowerForShift here for all > targets, instead of having an ifdef? > > @@ +743,5 @@ > > + return define(gen, lir, inst, > > + LDefinition(LDefinition::TypeFrom(inst->type()), LDefinition::REGISTER)); > > +#else > > + return gen->lowerForShift(lir, inst, base, amt); > > +#endif > > Otherwise, if we're going to do an ifdef, I think it makes sense to do > #ifdef JS_CODEGEN_X86 || JS_CODEGEN_X64, since x86/x64 really are the ones > with special requirements here. ARM is just being normal here :-). I should probably refactor lowerForShift, so I can just use it in the lowering for fused-shift-alu instructions. > > ::: js/src/jit/MIR.h > @@ +1088,5 @@ > > class MConstant : public MNullaryInstruction > > { > > Value value_; > > + public: > > + MConstant(const Value &v, types::CompilerConstraintList *constraints); > > Why is this being made public? > See pie-in-the-sky plans for using GVN :-) > ::: js/src/jit/arm/CodeGenerator-arm.cpp > @@ +505,5 @@ > > masm.ma_mla(ToRegister(acc), ToRegister(lhs), ToRegister(rhs), ToRegister(dest)); > > return true; > > } > > > > +bool > > Er, is this fixing a syntax error? It'd be nice to fix this in the patch > which introduced it. I have moved this line around like 20 times while rebasing :-( > > ::: js/src/jit/shared/Lowering-x86-shared.h > @@ +13,5 @@ > > namespace jit { > > > > class LIRGeneratorX86Shared : public LIRGeneratorShared > > { > > + friend class LRotateI; > > I hope this isn't a trend... ;-). Do we just need to make a bunch of stuff > in LIRGenerator and its subclasses public? Probably. It would be cool if we could just friend *all* twig classes in one fell swoop, but that seems unlikely to work.

Dan Gohman [:sunfish]

Comment 24

•

10 years ago

(In reply to Marty Rosenberg [:mjrosenb] from comment #23) > > Also, it should be sufficient to test baseL == baseR here instead of calling > > congruentTo, since they both must dominate the base of this expression tree, > > so GVN should have merged them by this point. Ditto for the congruentTo > > calls below. > I had high hopes. My initial plan was to extract X, then generate 32-X, and > test congruence between that and the other shift amount. > The goal was that I could teach GVN about the equivalence of various > sequences, then get them for free in this code. > e.g. 32 - (x&31) == 31 ^ (x&31); 32 - (x&31) == (32-(x&31))&31 Oh, interesting. In that case, I'll reserve judgement until I see it in action :-). > > @@ +661,5 @@ > > > + if (!c->toConstant()->value().isInt32()) > > > + return -1; > > > + if (c->toConstant()->value().toInt32() != 32) > > > + return -1; > > > + MInstruction *possibleAmt = getOp(amt, 1); > > > > We're just pattern-matching rotate-right here, and not rotate-left, right? > > That's fine for now, but it doesn't look right to pattern-match a "32 - amt" > > on the right-shift in that case. Am I missing something? > > > ... continuing the list, 32-(x-5) == 37 - x, 32 - 17 = 15. While it is less > efficient to rotate right by 32-x, this is a totally legit amount to rotate > right by, so for y << x | y >>> (32-x), we can still generate |ror y, > (32-x)|, which we'd miss without this. Obviously on architectures that > support left rotates, that would be preferable. This includes such architectures as x86 and ARM :-). I'm ok with not doing every last rotate optimization right now, I just wanted to make sure I wasn't misunderstanding something. > > @@ +703,5 @@ > > > + MInstruction *l = getOp(inst, 0); > > > + MInstruction *r = getOp(inst, 1); > > > + MInstruction *compAmt = nullptr; > > > + JS_ASSERT(l && r); > > > + if (l->hasOneUse()) { > > > > Is there a reason why we're testing hasOneUse() in the choose function, > > rather than the matchWeight function? > yes. This won't affect whether or not we use the twig, but it will affect > if we eliminate the unused operand. If it has more than one use, then we > can't eliminate it, if it has only one use, then we can eliminate it. A > better idea would be to literally decrement the use count, so if we have > 32-x that is used in 5 rotates, we'll emit 5 rotates, and 0 subtractions... > This will require some heuristics to determine if we can actually eliminate > all uses of the value. Makes sense, thanks.

Lars T Hansen [:lth]

Comment 25

•

8 years ago

Deprioritizing for now (no movement for 2 years, Marty no longer at Mozilla). Probably this will be superseded by Cretonne if that is successful: https://github.com/stoklund/cretonne.

Priority: -- → P5

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

AddTiles-r0.patch 10 years ago Marty Rosenberg [:mjrosenb] 42.43 KB, patch		Details \| Diff \| Splinter Review
AddMLA-r0.patch 10 years ago Marty Rosenberg [:mjrosenb] 6.05 KB, patch		Details \| Diff \| Splinter Review
AddRotate-r0.patch 10 years ago Marty Rosenberg [:mjrosenb] 21.02 KB, patch		Details \| Diff \| Splinter Review
AddFShift-r0.patch 10 years ago Marty Rosenberg [:mjrosenb] 12.51 KB, patch		Details \| Diff \| Splinter Review
AddScaleRandom-r0.patch 10 years ago Marty Rosenberg [:mjrosenb] 15.75 KB, patch		Details \| Diff \| Splinter Review