Cleanup of X87 FP stack code

RESOLVED FIXED

Status

RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: wmaddox, Assigned: wmaddox)

Tracking

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)

Attachments

(2 attachments)

(Assignee)

Description

8 years ago
1) We do not use the x87 stack in its full generality.  Instead,
the top of the stack is treated as the single allocatable FP
register.  This requires some fancy footwork to keep the stack
balanced, such as popping the stack when the FST0 "register"
becomes dead.
The "register" FST0 is the sole member of the x87regs register
class, and there are many places in the code that rely on this.
In many places, however, the code is written so as to strongly
suggest that there might be multiple such registers.  This makes
the code confusing and difficult to follow.  This patch removes
such conceits, replacing expressions such as (rmask(r) & x87regs)
with (r == FST0), etc.

2) prepareResultReg() has been slightly refactored to make the x87
stack fiddling a bit easier to follow.  Additionally, the code appeared
to assume that if FST0 was in 'allow', that FST0 would in fact become
the allocated result register.  While this might hold true, it is not
obvious, and I've changed the logic to look at the result of the
findRegFor(), i.e, the actual allocated register, instead.

3) The "pop" argument, being a very IA32-specific artifact of the x87,
is just noise when passed to asm_spill() on non-IA32 platforms.  The
now distinguishes the idiosyncratic IA32 version of asm_spill() from
the generic case by giving them different signatures.

4) In multiple cases, boolean values were normalized to 0 or 1 by an
expressions such as (p?1:0), only to be passed as a bool argument and
normalized again.  These redundancies are removed by the patch.

5) The consistency check for stack depth vs. the allocation status of
FST0 has been commented.

A follow-up patch is planned to eliminate the dead pops that can be
generated following unconditional branches.
(Assignee)

Updated

8 years ago
Assignee: nobody → wmaddox
(Assignee)

Comment 1

8 years ago
Created attachment 466528 [details] [diff] [review]
Cleanup of x87 stack manipulation
Attachment #466528 - Flags: review?(nnethercote)
Attachment #466528 - Flags: feedback?(edwsmith)
Comment on attachment 466528 [details] [diff] [review]
Cleanup of x87 stack manipulation

This all looks good except for...

>             // since we generate backwards the depth is negative
>             inline void fpu_push() {
>-                debug_only( ++_fpuStkDepth; NanoAssert(_fpuStkDepth<=0); )
>+                debug_only( ++_fpuStkDepth; NanoAssert(_fpuStkDepth <= 0); )
>             }
>             inline void fpu_pop() {
>-                debug_only( --_fpuStkDepth; NanoAssert(_fpuStkDepth<=0); )
>+                debug_only( --_fpuStkDepth; NanoAssert(_fpuStkDepth >= -7); )
>             }

Where does the -7 come from?  I long have thought (without knowing for sure)
that the minimum value for _fpuStkDepth was -1, in which case asserting that
it is 0 or -1 in both fpu_push() and fpu_pop() would be good.
(Assignee)

Comment 3

8 years ago
(In reply to comment #2)
> Comment on attachment 466528 [details] [diff] [review]
> Cleanup of x87 stack manipulation

> Where does the -7 come from?  I long have thought (without knowing for sure)
> that the minimum value for _fpuStkDepth was -1, in which case asserting that
> it is 0 or -1 in both fpu_push() and fpu_pop() would be good.

Between instructions, it is so constrained.  The expansion of a LIR instruction may use unmanaged temporaries internally, including deeper levels of the x87 stack.  Thus, the consistency check verifies 0 or -1, depending on FST0 allocation status, but the check here checks against the full depth as supported in hardware.

See the non-sse2 case for asm_cmpd().  Also, note that FCOMPP pops twice.

Comment 4

8 years ago
Comment on attachment 466528 [details] [diff] [review]
Cleanup of x87 stack manipulation

I'm somewhat ambivalent towards this code; if it has a bug, lets fix it, but otherwise lets push the cruft into Nativei386.h/cpp wherever possible.

Could the fpu_push/pop code move from Assembler.h to Nativei386.h?

it might be out of scope for this bug, but also see bug 491084 and the
workaround for it in asm_farg() in Nativei386.cpp.

Updated

8 years ago
Attachment #466528 - Flags: feedback?(edwsmith) → feedback+

Updated

8 years ago
OS: Mac OS X → Windows XP

Comment 5

8 years ago
Is Firefox even supporting pre-SSE2 systems anymore? (I know Flash is, and probably will be for a while...)
(Assignee)

Comment 6

8 years ago
(In reply to comment #4)
> Comment on attachment 466528 [details] [diff] [review]

> Could the fpu_push/pop code move from Assembler.h to Nativei386.h?

The quirkyness of the x87 touches on many other places in Assembler.cpp, and which do not seem amenable to clean abstraction.  I'd rather centralize all of this in Assembler.h/Assembler.cpp than to distribute parts of it to Nativei386.h/Nativei386.cpp in an unprincipled way -- it just makes it easier to follow what is going on when trying to grok the register allocator.  I'd feel differently if there appeared to be a clean way to express what the x87 needs as a parameterization of a general register allocation scheme.
It does seem that there is too much platform-dependent cruft in the Assembler.h/Assembler.cpp, but the contract between the platform-independent and platform-dependent files is not at all clearly delineated.

I could move the fpu_push/pop declarations into the DECLARE_PLATFORM_ASSEMBLER() macro.  Once I tried doing this, however, a bunch of other definitions begged for the same treatment.  I'm hesitant to start pulling on this thread.  All of this stuff is incredibly ugly, but the intent of the patch is to be a conservative and localized improvement that went with the flow, rather than a grand reorganization or the beginnings of such a refactoring arc.

Why do we not subclass Assembler for each platform?

Comment 7

8 years ago
Sounds fine.  

(In reply to comment #6)
> Why do we not subclass Assembler for each platform?

I dont think at any point in Assembler's lifetime, anyone has come along and proposed a base-class/sub-class reorganization that a) improved things enough to do the work, and b) didn't introduce unwanted abstraction overhead, like virtual calls.

I've contemplated doing that as a way of isolating the legacy-386 support from the x86-sse2 code, since the two cases are substantially different... but see above.

Comment 8

8 years ago
(In reply to comment #6)
> (In reply to comment #4)
> > Comment on attachment 466528 [details] [diff] [review] [details]
>  I'd
> feel differently if there appeared to be a clean way to express what the x87
> needs as a parameterization of a general register allocation scheme.

I took a crack at this a while back and it just exploded into a bunch of code that no other architectures would ever use.

If we needed such an abstraction for at least one other architecture I'd be tempted to try again; but alas there isn't
Attachment #466528 - Flags: review?(nnethercote) → review+
(Assignee)

Comment 9

8 years ago
Created attachment 468749 [details] [diff] [review]
Add commentary to further explain usage of the x87 FPU stack
Attachment #468749 - Flags: review?(nnethercote)
(Assignee)

Updated

8 years ago
Whiteboard: fixed-in-nanojit
Comment on attachment 468749 [details] [diff] [review]
Add commentary to further explain usage of the x87 FPU stack

Nice!  I like big high-level comments.
Attachment #468749 - Flags: review?(nnethercote) → review+
http://hg.mozilla.org/tracemonkey/rev/31ce22a238ac
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey

Comment 14

8 years ago
http://hg.mozilla.org/tamarin-redux/rev/f2fa20a391ca
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
(Assignee)

Updated

8 years ago
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.