601135 - Change LIR_d2i to prefer truncation over rounding

Reporter

Description

•

15 years ago

The Tamarin project wants to use asm_d2i (only on SSE capable machines) and prefers if d2i does truncation instead of rounding. 1.9 should become 1.0 instead of 2.0. This requires changing CVTSD2SI into CVTTSD2SI for asm_d2i on i386 + x64.

Werner Sharp

Reporter

Updated

•

15 years ago

Blocks: 600585

Werner Sharp

Reporter

Comment 1

•

15 years ago

Attached patch Switch asm_d2i to call CVTTSD2SI (obsolete) — Details — Splinter Review

Attachment #480121 - Flags: review?(nnethercote)

Werner Sharp

Reporter

Comment 2

•

15 years ago

Untested ARM code for truncating d2i (Tamarin never calls this): +#define FTOSIZD(_Sd,_Dm) do { \ + underrunProtect(4); \ + NanoAssert(ARM_VFP); \ + NanoAssert(((_Sd) == S14) && IsFpReg(_Dm)); \ + *(--_nIns) = (NIns)( COND_AL | (0xEBD<<16) | (0x7<<12) | (0xBC<<4) | FpRegNum(_Dm) ); \ + asm_output("ftosizd s14, %s", gpn(_Dm)); \ + } while (0) +

Nicholas Nethercote [inactive]

Comment 3

•

15 years ago

Comment on attachment 480121 [details] [diff] [review] Switch asm_d2i to call CVTTSD2SI >-OP___(d2i, 113, Op1, I, 1) // convert double to int (no exceptions raised, platform rounding rules) >+OP___(d2i, 113, Op1, I, 1) // convert double to int (no exceptions raised, platform truncation rules preferred but not guaranteed) Sorry, "preferred but not guaranteed" is not acceptable. What you really mean is "I've made it truncate on i386, X64 and ARM and I don't want to think about the other back-ends right now". If you want truncation, *require* truncation. As we discussed, there's nothing stopping you from doing so, because no-one's relying on the rounding behaviour currently. When you do that, please specify exactly what "truncation" means using some positive and negative examples. That obviously won't fit in a single line, so use a multi-line comment above the d2i line (see 'modi' for a similar example). The rest of the patch looks fine, but you'll need to consider the other back-ends; at the very least, make them assert in asm_d2i() so the maintainers can fix them later. (An aside: AFAICT, changing d2i to truncate removes the one bit of undefined behaviour in LIR, which I think is a good thing.)

Attachment #480121 - Flags: review?(nnethercote) → review-

Werner Sharp

Reporter

Comment 4

•

15 years ago

As we discussed (in email), there is no clean way to support truncate on non-SSE x86. Do we want to emit FP control word logic for non-SSE? "The main stumbling point I see is non-SSE x86 support which would need to deal with FP control word logic. I don't think we want to support this: uint16_t oldcw, newcw; int32_t intval; _asm fnstcw [oldcw]; _asm mov ax,[oldcw]; _asm or ax,0xc3f; _asm mov [newcw],ax; _asm fldcw [newcw]; _asm fld [value]; _asm fistp [intval]; _asm fldcw [oldcw]; return intval;" Since Tamarin only uses the non-SSE variant, that is currently all we care about it. If i assert in the x86 non-SSE variant, is someone really going to fix it somehow? Or just remove the assert? I realize this is not the cleanest opcode design. x64 - truncates x86+SSE - truncates x86+x97 - rounds ARM - rounds No other back ends support this opcode currently.

Nicholas Nethercote [inactive]

Comment 5

•

15 years ago

So you are saying that Tamarin will only rely on the truncation behaviour if the machine is a SSE2-capable x86/x86_64 machine? Probably the most sensible thing is to specify the varying rounding semantics clearly in LIRopcode.tbl, ie. this: > x64 - truncates > x86+SSE - truncates > x86+x97 - rounds > ARM - rounds But fleshed out with the remaining platforms, some of which are not-yet-implemented, and with examples/clear descriptions of exactly what "rounding" and "truncating" means. It's kind of gross specifying the semantics of LIR on a platform-by-platform basis, but that's what you want, right?

Steven Johnson

Comment 6

•

15 years ago

(In reply to comment #5) > It's kind of gross specifying the semantics of LIR on a platform-by-platform > basis, but that's what you want, right? This would be an icky precedent to set. Unless there's a compelling performance reason to have different semantics on different platforms, settling on one everywhere is much preferable IMHO.

Nicholas Nethercote [inactive]

Comment 7

•

15 years ago

(In reply to comment #6) > (In reply to comment #5) > > It's kind of gross specifying the semantics of LIR on a platform-by-platform > > basis, but that's what you want, right? > > This would be an icky precedent to set. Unless there's a compelling performance > reason to have different semantics on different platforms, settling on one > everywhere is much preferable IMHO. I agree! It's Adobe that wants this change :P Another possibility: we already have some x86/x86_64-only opcodes (divi, modi). We could create d2itrunc (truncd2i?) and only define it on x86/x86_64. Not sure it's much better than per-platform semantics, though, and adding new opcodes requires additional work.

Edwin Smith

Comment 8

•

15 years ago

(In reply to comment #5) > So you are saying that Tamarin will only rely on the truncation behaviour if > the machine is a SSE2-capable x86/x86_64 machine? Yes. The piece of code we want to inline is already a big ifdef nest, so its okay. The next platform of interest is ARM, but nobody has looked yet. > Probably the most sensible thing is to specify the varying rounding semantics > clearly in LIRopcode.tbl, ie. this: > > > x64 - truncates > > x86+SSE - truncates > > x86+x97 - rounds > > ARM - rounds > > But fleshed out with the remaining platforms, some of which are > not-yet-implemented, and with examples/clear descriptions of exactly what > "rounding" and "truncating" means. +1. Putting the cpu-specific instruction in the comment would make it really specific. The SSE2 comment should say that the escape value is 0x80000000 since frontends have to test for it. > It's kind of gross specifying the semantics of LIR on a platform-by-platform > basis, but that's what you want, right? Yep. Nailing down LIR_d2i's behavior on SSE2 and leaving the other platforms as-is is less work than adding a new LIR instruction just for CVTTSD2SI. If we had to generalize to more CPUs, then I think adding new instructions (maybe d2itrunc) would be worth the work.

Julian Seward [:jseward]

Comment 9

•

15 years ago

I jumped through essentially the same hoops when doing arch neutral IR representations for floating point conversions in Valgrind. My 2 Euro-cents / random comments: > It's kind of gross specifying the semantics of LIR on a platform-by-platform > basis, but that's what you want, right? Urr, don't go there. Semantic swamp ahead. regarding > uint16_t oldcw, newcw; > int32_t intval; > _asm fnstcw [oldcw]; > _asm mov ax,[oldcw]; > _asm or ax,0xc3f; > _asm mov [newcw],ax; > _asm fldcw [newcw]; > _asm fld [value]; > _asm fistp [intval]; > _asm fldcw [oldcw]; > return intval;" as a Plan B. One thing nobody has said (afaics) is that fldcw is an expensive operation (like, pipeline-flushingly-expensive), especially on older hardware. You really don't want to be doing that too often. ----------- IEEE754 defines 4 rounding modes: to +inf, to -inf, to nearest, to zero (P, M, N, Z). 'to nearest' is the default. One thing you could do is clone d2i for as many of these as required, viz: d2i_P, d2i_M, d2i_N, d2i_Z. That keeps the LIR arch-neutral w.r.t. rounding, whilst allowing you to specify whatever kind you want. The backends should assume that the processor runs in N (to nearest) by default. So for a d2i_N you can just emit a conversion insn and nothing else. For a d2i_M (which is what Werner wants) you can emit either CVTTSD2SI or the long, expensive sequence above, if that's what's necessary. The tricky bit is that this assumes that any helper functions called from JITted code don't change the rounding mode away from _N. On at least some platforms I've seen, the ABI supposedly guarantees this but hmm ... Valgrind actually checks the rounding mode every time it leaves JIT generated code, to catch any snafus, and we have caught one, although it was a bug in the Linux kernel not preserving the mode across context switches (amazingly). It also means that _M (truncate) conversions are expensive unless the hardware supports a _M-style conversion instruction that doesn't involve messing with the FPU control word. These are becoming more common but by no means universal and don't exist on remotely oldish hardware or on some lower-end hardware where maxing out FP performance isn't a priority. Rather than "trunc" etc prefixes, I'd suggest using the 754 terminology (+Inf/-Inf/Zero/Nearest, P/M/Z/N), so we don't have to keep remembering which of those "trunc" (and the unadorned version) map to. ------ The above proposal tidies up the rounding issue a bit. It still fudges on the overflow detection, because some platforms produce 0x80..00 for overflows, and others produce 0x80..00 or 0x7F..FF depending on the sign.

Werner Sharp

Reporter

Comment 10

•

15 years ago

The basic stumbling point is pre-SSE2 support on x86. Would it be possible to just drop support for this architecture for asm_d2i? We could put in an assert and/or just do rounding and all other platforms will truncate. Or support it with rounding mode and a very large warning comment. If we want consistency for one opcode, whether or not we use d2i or introduce a new opcode, we need to somehow support x87/non-SSE2. If we need to emit those 8 instructions (which are terribly slow as Julian mentions), we will need to add a bunch of new instruction support the to native i386 backend. I can write this for d2i but the caveat is it will be much slower than rounding mode for your i2d(d2i() check.

Steven Johnson

Comment 11

•

15 years ago

(In reply to comment #10) > The basic stumbling point is pre-SSE2 support on x86. Does Mozilla still support pre-SSE2? Flash/AIR still do, but they are becoming a pretty small percentage of the market, so part of me feels like we don't need to use heroic measures for hardware that is generally >5 years old and no longer made...

Edwin Smith

Comment 12

•

15 years ago

There's no more stumbling block. All we want (and Nick and I agree on this) is for the SSE2 case to use CVTTSD2SI, and for the other existing cases (x87, and other cpu's), to stay the way they are, and for the comments to be super clear about what you get in each case. As for what the comments say, I liked Julian's suggested terminology (and opcode names) since it derives from the IEEE specs. I hadn't made the connection that CVTTSD2SI == truncate == round-towards -Inf, but its nice if the comments are grounded in spec-based specifics like that.

Jeff Walden [:Waldo]

Comment 13

•

15 years ago

Maybe I'm just more unread in the ways of IEEE754 than I thought, but what njn deems N for nearest I'd always heard described as round to nearest even, which N does not immediately imply to me. "even" is the key word there for me, not nearest -- just saying this to keep it in mind for naming.

Edwin Smith

Comment 14

•

15 years ago

(In reply to comment #12) > I hadn't made the > connection that CVTTSD2SI == truncate == round-towards -Inf Oops, I misspoke. CVTTSD2SI = truncate = round-towards-zero. (d2i_Z, rather than _M, if i'm reading the CVTTSD2SI spec right).

Julian Seward [:jseward]

Comment 15

•

15 years ago

(In reply to comment #14) > (In reply to comment #12) > > I hadn't made the > > connection that CVTTSD2SI == truncate == round-towards -Inf > > Oops, I misspoke. CVTTSD2SI = truncate = round-towards-zero. (d2i_Z, rather > than _M, if i'm reading the CVTTSD2SI spec right). Yeh, I was just going to say that. Except that it was my mistake to begin with :) I must say, the Intel spec is a bit loosely worded, but I think _Z is right.

Julian Seward [:jseward]

Comment 16

•

15 years ago

(In reply to comment #13) The 754 doc I have ("Reaffirmed May 21, 1991") says Round to Nearest: [...] In this mode the representable value nearest to the infinitely precise result shall be delivered; if the two nearest representable values are equally near, the one with its least significant bit zero shall be delivered. So the even thing is kinda in there, but only for tiebreaker purposes. (I must say, I really like '754. It's a model of conciseness.)

Jeff Walden [:Waldo]

Comment 17

•

15 years ago

Ah, my slight mis-memory. I think the sticking point for me is that *all* the rounding modes round to nearest in general. It's only in the equally-close special case that they differ, and how they differ there (up, down, even, zero) is what's most important when thinking about rounding modes.

Werner Sharp

Reporter

Comment 18

•

15 years ago

I talked to Ed over IM who also talked to Andreas. The current proposal: 1. Leave LIR_d2i as is. 2. Add LIR_d2i_z which supports round to zero behavior. Specifically document CVTTSD2SI usage and also 0x80000000 failure value which we depend upon. 3. LIR_d2i_z will start with support on x86+SSE and x64. Non-SSE x86 will assert. (ARM should be easy with code from comment 2 but I have no way to test) 4. Add NJ_D2I_Z_SUPPORTED for x86 and x86 5. Rename NJ_F2I_SUPPORTED to NJ_D2I_SUPPORTED. 6. While I'm creating a patch, add NJ_DIVI_SUPPORTED define for platforms that support LIR_divi. I need this for bug 570476.

Werner Sharp

Reporter

Comment 19

•

15 years ago

Attached patch 2nd round (obsolete) — Details — Splinter Review

This patch contains my proposal in comment #18. I hadn't realize that LIR_d2i was supported on a bunch of platforms besides x86, arm, x64. LIR_d2i_z is currently only supported on x86+SSE and x64. I did remove a bit of LIR_d2i support in ExprFilter::ins1. case LIR_d2i: - if (oprnd->isImmD()) - return insImmI(int32_t(oprnd->immD())); A int32_t(double) cast is definitely different than what d2i performs in the JIT. It is closer to what d2i_z performs (maybe identical?) but seems iffy to leave this as-is. Tamarin does not currently depend upon this code.

Attachment #480121 - Attachment is obsolete: true

Attachment #481337 - Flags: review?(nnethercote)

Edwin Smith

Comment 20

•

15 years ago

(In reply to comment #18) > I talked to Ed over IM who also talked to Andreas. The current proposal: I'm guilty of waffling on this today, after IM'ing with Nick last night and Julian this afternoon. The new proposed name isn't firm, but the idea is to get something specific that becomes CVTTSD2SI on SSE2, at least. Also, ARM has a dedicated round-to-zero conversion instruction, which we will probably want soonish. The clincher (for me at least) comes after re-reading Nick's most excellent post: http://blog.mozilla.com/nnethercote/2010/02/01/a-win-for-code-hygiene/ The whole post is spot-on, but these two "lessons learned" stand out: * Clean IR semantics are worth the effort... * Listen to Julian, especially when he talks about correctness... > I did remove a bit of LIR_d2i support in ExprFilter::ins1. > > case LIR_d2i: > - if (oprnd->isImmD()) > - return insImmI(int32_t(oprnd->immD())); > > A int32_t(double) cast is definitely different than what d2i performs in the > JIT. It is closer to what d2i_z performs (maybe identical?) but seems iffy to > leave this as-is. Tamarin does not currently depend upon this code. Speaking of getting it right (!). Could we preserve the optimization when int32_t(d) == d?

Rick Reitmaier

Comment 21

•

15 years ago

A bit of heresy but would it make sense to have a callout to a helper function on platforms that aren't currently supported? This way we don't have to play macro games (esp. in the front end) and the behaviour is consistent across all platforms.

Nicholas Nethercote [inactive]

Comment 22

•

15 years ago

I think this is all overkill. What Julian says in comment 9 is great in the Valgrind context, when an arch-neutral IR needs to represent every possible type of rounding instruction. And if we found ourselves in the position of needing multiple rounding modes, that's the right direction in which to head. But we only need one specific rounding mode, and it's not even needed on all platforms. As for naming ("truncate" vs. "round towards zero" or whatever) I'm happy to go with IEEE terminology or whatever others deem appropriate. But that's moot if we go with the original simple suggestion, as we'd just stick with 'd2i'. > * Clean IR semantics are worth the effort... > * Listen to Julian, especially when he talks about correctness... IMO, that case was a bit different to this one. If we go with the simple solution here, we'll have semantics that are a bit odd but still well-defined. That blog post was talking about the old LIR_ov instruction which subverted the semantics by involving the condition codes which weren't modelled in the IR. As for non-SSE2 machines, AFAIK it's not yet decided if Mozilla will support them for Firefox 4.0, so we should assume for the moment that we will suppport them. I don't care if the non-SSE2 code is bad, though.

Werner Sharp

Reporter

Comment 23

•

15 years ago

Nick, what happened to your thought that "It's kind of gross specifying the semantics of LIR on a platform-by-platform basis". Is that your proposed solution? Or should we add an entirely new opcode? Those seem like the two choices and either way is fine with me. There are a lot of platforms that support d2i (more than I thought originally) and I don't know exactly what their "platform rounding rules" are currently so LIR_d2i is already somewhat vague. Who can make a final decision on how to proceed with this bug?

Nicholas Nethercote [inactive]

Comment 24

•

15 years ago

(In reply to comment #23) > Nick, what happened to your thought that "It's kind of gross specifying the > semantics of LIR on a platform-by-platform basis". I think it's the least worst solution. > Who can make a final decision on how to proceed with this bug? Ed, I guess!

Nicholas Nethercote [inactive]

Comment 25

•

15 years ago

Comment on attachment 481337 [details] [diff] [review] 2nd round As I said earlier, I think a new opcode is needless complexity. Ed, do you want to weigh in again?

Attachment #481337 - Flags: review?(nnethercote) → review-

Edwin Smith

Comment 26

•

15 years ago

I thought I already did, but I obviously don't see my own comment here (oops! dang.) Earlier this week I wrote that I value consensus more than the extra opcode, and we should just go with the first patch that just uses CVTTSD2SI on SSE2 cpus with no changes to other cpus, and comments fixed as suggested to clarify the behavior for each cpu. I also suggest a comment pointing to this bug saying something to the effect of: "If rounding-mode-specific d2i opcodes are needed in the future, bug XXX has discussion and a proof of concept"

Nicholas Nethercote [inactive]

Comment 27

•

15 years ago

(In reply to comment #26) > > and comments fixed as suggested to clarify > the behavior for each cpu. Mentioning the exact machine instruction used on each platform would be a good thing to include in those comments.

Werner Sharp

Reporter

Comment 28

•

15 years ago

Attached patch change x64/x86, add comments for all platforms — Details — Splinter Review

Attachment #481337 - Attachment is obsolete: true

Attachment #487951 - Flags: review?(nnethercote)

Nicholas Nethercote [inactive]

Comment 29

•

15 years ago

Comment on attachment 487951 [details] [diff] [review] change x64/x86, add comments for all platforms Looks good, thanks for persevering.

Attachment #487951 - Flags: review?(nnethercote) → review+

Werner Sharp

Reporter

Comment 30

•

15 years ago

NC submit: http://hg.mozilla.org/projects/nanojit-central/rev/6e1ed02c675e

Nicholas Nethercote [inactive]

Comment 31

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/3e76412d3749

Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey

Rick Reitmaier

Comment 32

•

15 years ago

Werner http://hg.mozilla.org/tamarin-redux/rev/635a3d71cf90

Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit,fixed-in-tracemonkey,fixed-in-tamarin

Robert Sayre

Comment 33

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/3e76412d3749

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: Core → Core Graveyard

Switch asm_d2i to call CVTTSD2SI 15 years ago Werner Sharp 10.30 KB, patch	n.nethercote : review-	Details \| Diff \| Splinter Review
2nd round 15 years ago Werner Sharp 27.01 KB, patch	n.nethercote : review-	Details \| Diff \| Splinter Review
change x64/x86, add comments for all platforms 15 years ago Werner Sharp 10.35 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review