Closed Bug 413787 Opened 17 years ago Closed 16 years ago

NSToIntRound, NSToCoordRound code generation

Categories

(Core Graveyard :: GFX, enhancement)

x86
Windows XP
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9

People

(Reporter: mmoy, Assigned: mmoy)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(2 files, 6 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20080101 BonEcho/2.0.0.11 (mmoy CE K8C-X03)
Build Identifier: Trunk

At -O1 optimization, NSToIntRound generates this code in nsAccessible:

?NS_lroundf@@YAHM@Z PROC				; NS_lroundf, COMDAT

; 62   :     return x >= 0.0f ? PRInt32(x + 0.5f) : PRInt32(x - 0.5f);

	fldz
	fld	DWORD PTR _x$[esp-4]
	fcom	ST(1)
	fnstsw	ax
	fstp	ST(1)
	test	ah, 1
	jne	SHORT $LN3@NS_lroundf
	fadd	QWORD PTR __real@3fe0000000000000
$LN10@NS_lroundf:
	jmp	__ftol2_sse
$LN3@NS_lroundf:
	fsub	QWORD PTR __real@3fe0000000000000
	jmp	SHORT $LN10@NS_lroundf
?NS_lroundf@@YAHM@Z ENDP				; NS_lroundf
_TEXT	ENDS

I'd guess that __ftol2_sse checks for sse support to determine
whether to use SSE or f87 to convert the result to an integer.

so that an actual invocation results in this:

; 337  :   return NSToIntRound(float(aAppUnits) / aAppUnitsPerPixel);

	fild	DWORD PTR _aAppUnits$[ebp]
	push	ecx
	fidiv	DWORD PTR _aAppUnitsPerPixel$[ebp]
	fstp	DWORD PTR tv79[ebp]
	fld	DWORD PTR tv79[ebp]
	fstp	DWORD PTR [esp]
	call	?NS_lroundf@@YAHM@Z			; NS_lroundf
	pop	ecx

The source code for NS_lroundf in nsmathutils.h is:

inline NS_HIDDEN_(PRInt32) NS_lroundf(float x)
{
    return x >= 0.0f ? PRInt32(x + 0.5f) : PRInt32(x - 0.5f);
}

with the compiler choosing not to inline the code despite the inline tag in the source. The thing is that f87 has hardware float to integer with rounding support as does SSE2 but there's no language support to get to the f87 instructions. The rounding mode is controlled by the RC field of the FPU control word but in actual practice with Mozilla, it's always been normal rounding from what I've seen. This would be one way to do it in MSVC and it should be faster than what is there now, even with the additional stack overhead of inline assembler. 

inline PRInt32 NSToIntRound(float aValue)
{
int a;
  __asm{
     fld      aValue
     fistp    a
     }
  return a;
 }

On Mac OSX (Intel) and Windows x64, this could be used:

inline nscoord NSToCoordRound(float aValue)
{
  return (_mm_cvtss_si32(_mm_load_ss(&aValue)));
 }

as these platforms guarantee SSE2.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Status: UNCONFIRMED → NEW
Component: XPCOM → GFX
Ever confirmed: true
QA Contact: xpcom → general
Version: unspecified → Trunk
Cripes! That gets called a lot. Will the asm or intrinsics inhibit optimization significantly?
There's a good paper at http://ldesoras.free.fr/doc/articles/rounding_en.pdf for hardware, MSVC and C on ceil, floor and on rounding in general.

The inline assembler above is very cheap but this paper indicates that the default rounding mode is really round towards the even integer. But it does provide ceil, floor and rounding functions which are pretty efficient. The round_int routine that it provides matches NSToCoordRound in my testing. I didn't see an equivalent of NSToIntRound though and I'm curious if there's a way to do it without using any conditionals.

These are the results that I found. Quick is the inline assembler above. lroundf is used by NSToIntRound, and round_int is from the paper. Perhaps a starting approach would be to use round_int for NSToCoordRound on Win32.

Number   Quick  lroundf roundint NSToCoordRound
 -3.00     -3     -3       -3          -3
 -2.75     -3     -3       -3          -3
 -2.50     -2     -3       -2          -2
 -2.25     -2     -2       -2          -2
 -2.00     -2     -2       -2          -2
 -1.75     -2     -2       -2          -2
 -1.50     -2     -2       -1          -1
 -1.25     -1     -1       -1          -1
 -1.00     -1     -1       -1          -1
 -0.75     -1     -1       -1          -1
 -0.50      0     -1        0           0
 -0.25      0      0        0           0
  0.00      0      0        0           0
  0.25      0      0        0           0
  0.50      0      1        1           1
  0.75      1      1        1           1
  1.00      1      1        1           1
  1.25      1      1        1           1
  1.50      2      2        2           2
  1.75      2      2        2           2
  2.00      2      2        2           2
  2.25      2      2        2           2
  2.50      2      3        3           3
  2.75      3      3        3           3
  3.00      3      3        3           3
The rounding behaviour is actually quite important, see
https://bugzilla.mozilla.org/show_bug.cgi?id=410748#c14
So we need round_int.
In particular, rounding towards the nearest even integer would mean a 1px-wide border covering [0.5, 1.5) would be drawn 2px wide covering [0, 2).

lroundf is rounding away from 0 and has a similar problem, [-0.5, 0.5) becomes [-1, 1].
cairo has a special lround() which claims to do what we want and fast, and already in the tree...

http://lxr.mozilla.org/seamonkey/source/gfx/cairo/cairo/src/cairo-misc.c#218

It only works on doubles, but float to double conversion should be cheap?  Maybe similar functions could be done for floor() and ceil().
Oooh, nice catch. I bet a float version of it is possible, even actually slightly more efficient. Anyone want to take that on? It's tricky but on the bright side you can exhaustively test all 2^32 possible float values :-).
I took a look at the generated assembler code for Win32 at -O1 for lround and found that it took around 50 cycles on AMD K8. roundint took about 30 cycles. I think that things might be a little closer on Intel processors as Intel has higher memory latencies and roundint has one more memory access compared to lround.

I'm going to try to write a non inline assembler version of roundint to get rid of the inline assembler stack overhead to trim a few cycles.
I may have been comparing apples to oranges on the latencies if lroundf is comparable to lround instead of NSToCoordRound.
NSToCoordRound is used in 666 modules so it seems that it would be worthwhile to use the routine referenced in the paper. I sent an email to the author (he puts out open source software) to see if he has any restrictions on the code in his paper.

An SSE2 solution using intrinsics for Mac OSX looks possible.
_cairo_lround is basically the same as NSToCoordRound and roundint. lroundf is not what we want.

So it sounds like you're saying roundint wins. The only thing is that _cairo_lround works on doubles whereas we only need a version that works on floats, so it may be possible to shave some cycles off the _cairo_lround approach.
The roundint approach takes six instructions compared to about 36 for lroundf and I got roundint down to about 21 cycles. Unfortunately I just realized that roundint doesn't cover one billion to two billion as adding the number to itself results in losing a bit. The fix costs nine cycles so I'm back to 30 again.

I have the code for NSToIntRound too and that's about 30 cycles.

I'll take a look at cairo_lround for single-precision.
I had a closer look at _cairo_lround and it looks to me to take 38 cycles not including a few pushes and pops and the call overhead. I think that the routine is too large to be inlined by the compiler without using __forceinline. Going from double to single precision looks like it would save about 5 cycles. Three for not having to do put the lower doubleword and higher doubleword together (2 shifts and and 1 or) and the partial savings of a move doubleword as the entire number is in one doubleword for a cost of about 33 cycles.

roundint uses about 20 cycles but doesn't support numbers above 2^30. roundint does inline on -O1 and the code is pretty small so there should be minimal impact on image size. The current NSToCoordRound routine gets inlined but calls out to ftol which is the expensive part. The inlined part alone is probably slower than roundint. It may be valid using roundint for coordinates even if it doesn't support numbers above 2^30 as the mantissa in a single precision floating point number is only 23 bits.

I modified roundint with a fix for numbers between 2^30 and 2^31 by writing out the integer as a 64-bit quadword and then grabbing the lowest bit off the high quad and the upper 31 bits of the low quad and or-ing them together at a cost of three cycles or 23 in all. The resulting routine is not inlined at -O1 optimization unless __forceinline is used.

And I did an NSToIntRound which should be pretty efficient compared to what we are doing now. I'm going to post the patch with the accurate routines for those that want to take a look. I think that roundint should work fine but would be interested in the opinions of others.

The _cairo_lround just requires figuring out the correct constants to use for single instead of double and getting rid of the 64-bit stuff that isn't needed with single precision.
Attached patch Accurate version of roundint (obsolete) — Splinter Review
This version of roundint takes about 23 cycles on an AMD K8 chip. An NSToIntRound solution is also provided in the patch and it too is an accurate version.
Here's the single precision version of -cairo_lround. It has a limit somewhere between 1,000,000 and 10,000,000. Numbers higher (in absolute value) than the limit return zero. The double version doesn't do this because it has more than 32 bits to work with in the mantissa. Single precision only has 23 bits so it can't represent numbers over the limit. The routine only does right-shift even though the shift_amount can be negative which would make the numbers bigger instead of smaller. A modification could be put in to check the sign of the shift_amount and shift to the left if the shift_amount is negative.

The routine here is 32 cycles compiled with -O1.

int cairo_lround(float x)
{
  unsigned int *i = (int *) &x;
  unsigned int j, k, top, shift_amount;

  j = *i;
  top = j >> 23;
  shift_amount = 149 - (top & 0xff);
  top >>= 8;
  j &= 0x7fffff;   /* mantissa */
  j |= 0x800000;   /* implicit 1 */
  j -= top;
  top--;
  j >>= shift_amount;
  j = (j >> 1) + (j & 1);
  j &= ((shift_amount > 31) - 1);
  j = (j & top) - (j & ~top);
  return j;
}

The patch that I attached doesn't compile in Mozilla as it apparently doesn't handle the float already in the stack so I need to play around with the routines a little.
Attached patch Accurate version of roundint (obsolete) — Splinter Review
Mixed up the pointer nomenclature between destination and source operands.
Attachment #307535 - Attachment is obsolete: true
This is the patch that doesn't handle values above 2^30.
Assignee: nobody → mmoy
Status: NEW → ASSIGNED
Noming since this looks like a good perf win.
Flags: blocking1.9?
Coords are limited to 2^30 so the issue with NSToCoordRound isn't an issue. I changed the NSToIntRound back to use the accurate version
Wouldn't block the release for this, but do want to get it in, especially since Michael has a patch -- ask for review, probably from roc?  I had a look as well, and the patch looked fine.  Do we know whether this will get inlined for sure?  If not, do we want to do the force_inline bits to make sure it is?

After roc's review, request approval-1.9 on it to get it in, thanks!
Flags: wanted1.9+
Flags: blocking1.9?
Flags: blocking1.9-
Attachment #307781 - Flags: review?(roc)
Here's a snippet from the assembler listing of nsHTMLReflowState at -O1 which shows that NSToCoordRound is inlined. NSToIntRound isn't inlined unless -O2 is used.

; 2002 :       normalLineHeight = NSToCoordRound(emHeight * NORMAL_LINE_HEIGHT_FACTOR);

	fild	DWORD PTR _emHeight$[ebp]
	fmul	QWORD PTR __real@3ff3333340000000
	fstp	DWORD PTR $T48755[ebp]
	fld	DWORD PTR $T48755[ebp]
	fadd	ST(0), ST(0)
	fadd	DWORD PTR _round_to_nearest
	fistp	DWORD PTR _i$48753[ebp]
	mov	eax, DWORD PTR _i$48753[ebp]
	sar	eax, 1
If you replace
+    mov     eax, dword ptr i
+    sar     eax, 1
with "return i/2", does that generate just as good code but create more potential for optimization, maybe?

+#if defined(XP_WIN32) && defined(_M_IX86) && !defined(__GNUC__)

Would be nice to factor this out so we have #defines identifying various inline assemblers we support. And it would be nice to have this code on Linux and Mac too ... these can be followup issues.

+    mov     eax, dword ptr i
+    shr     eax, 1
+    test    aValue, -2147483648
+    je      positive
+    neg     eax
Likewise, I wonder if writing C code to implement this would give as good code, while giving the compiler more freedom to optimize around.

We should really hunt down users of NSToIntRound and see if they'd be as well off or better using the other rounding mode.

Maybe it would be a good idea to define these function in nsMathUtils and let NSToCoordRound and NSToIntRound be wrappers around them. This isn't really anything to do with coords.
Yeah it looks to me like pretty much all users of NSToIntRound should be using the other rounding mode.
http://mxr.mozilla.org/seamonkey/search?string=NSToIntRound
Maybe we can just get rid of it.
Good points on reverting to C code for as much work as possible. I only wish that Microsoft had intrinsics for f87. It would have been easier to do this in intrinsics and that's what I plan to do with Mac OSX and Win64 as they guarantee SSE2. I'm not that good with GCC-style inline assembler but it's a good thing to learn.

I used NS_lroundf2 for the nsMathUtils.h file and am running a build right now with the modified code for the first inline assembler routine. I'll do the other one later on.

One difference between NSToIntRound and NSToCoordRound is that the latter isn't supported above 2^30 while the former is. There are comments in nsCoord.h to this affect on doing conversions between the two.

On the #ifs for the different kinds of inline assembler and intrinsics, perhaps a bug should be opened for this.
Hmm, good point. I guess we need two functions, one which works on the full range of integers and one that only works on valid nscoords.
I tried using the compiler for the division stage and the compiler inserts additional instructions (CDQ and SUB) and uses an additional register in the process so I reverted to the inline assembler-only code. If you would still prefer the other code, let me know and I'll change it back.
Attachment #307544 - Attachment is obsolete: true
Attachment #307610 - Attachment is obsolete: true
Attachment #307781 - Attachment is obsolete: true
Attachment #308240 - Flags: review?(roc)
Attachment #307781 - Flags: review?(roc)
Bummer. Oh well.

The code looks fine, but we need better names for these things, and some documentation.

Instead of NS_lroundf2, how about NS_lroundup30? 30 meaning 30 bits of precision, up meaning 0.5 always goes up.

And instead of NS_roundf2, why aren't we just replacing NS_lroundf?
Instead of dividing by 2, I can just do

return i >> 1;

for NSToCoordRound and something similar for NSToIntRound.
Attachment #308240 - Attachment is obsolete: true
Attachment #308612 - Flags: review?(roc)
Attachment #308240 - Flags: review?(roc)
+/*
 inline NS_HIDDEN_(PRInt32) NS_lroundf(float x)
 {
     return x >= 0.0f ? PRInt32(x + 0.5f) : PRInt32(x - 0.5f);
 }
+*/

I would just remove this.

+#if defined(XP_WIN32) && defined(_M_IX86) && !defined(__GNUC__)
+const float round_to_nearest = 0.5f;
+#endif

Move this into NS_lroundf and NS_lroundup30 and make it static in each one. Avoids namespace pollution and excessive #ifdefs.

+    if (*xp & -2147483648)

Would be more readable in hex.

+        return -(*ip >> 1);
+    else
+        return (*ip >> 1);

I'd share the *ip >> 1 in a local variable.

Otherwise great.
I removed the replacement for NSToIntRound. VS2005 improves the conversion from float to int quite a bit and I couldn't beat it. The replacement for NSToCoordRound takes about one-seventh the time than the current version.
Attachment #308612 - Attachment is obsolete: true
Attachment #310454 - Flags: review?(roc)
Attachment #308612 - Flags: review?(roc)
We should get this in for beta5 if we can.   
This code on Mac OSX uses about 1/3rd the time on my tests compared to NSToCoordRound.

__inline int roundint(float x)
{
  static const float round_to_nearest[1] = {0.5};
 int i;
 __m128 a, b, c;

 a = _mm_load_ss(&x);
 b = _mm_load_ss(round_to_nearest);
 a = _mm_add_ss(a,a);
 a = _mm_add_ss(a,b);

 return (_mm_cvtss_si32(a)) >> 1;
}
Attachment #310454 - Flags: approval1.9b5?
Attachment #310454 - Flags: approval1.9?
Keywords: perf
Michael, one more thing that would be nice is an NS_ASSERTION that the input value is a legal nscoord and not NS_UNCONSTRAINEDSIZE.

I'm kinda wondering whether the fact that this mangles NS_UNCONSTRAINEDSIZE into something negative whereas the old function passed it through could cause problems when it interacts with code that would be technically wrong but currently works. Maybe we should slip it to the next release because of that.
Keywords: perf
It would be good if we had Talos perf or codesize numbers to help make that decision.
Keywords: perf
Comment on attachment 310454 [details] [diff] [review]
Final patch (hopefully)

I don't think we want to take this post-freeze, the risk isn't well-understood enough at this point.
Attachment #310454 - Flags: approval1.9b5? → approval1.9b5-
(In reply to comment #33)
> Michael, one more thing that would be nice is an NS_ASSERTION that the input
> value is a legal nscoord and not NS_UNCONSTRAINEDSIZE.
> 
> I'm kinda wondering whether the fact that this mangles NS_UNCONSTRAINEDSIZE
> into something negative whereas the old function passed it through could cause
> problems when it interacts with code that would be technically wrong but
> currently works. Maybe we should slip it to the next release because of that.
> 

Is there a way to mitigate that?  

Any ideas on the possible perf impacts here?
(In reply to comment #36)
> Is there a way to mitigate that?

Not without eating some of the perf improvement.

> Any ideas on the possible perf impacts here?

Not yet.
Can we do a try server build to test impact?
All tests and reftests pass with this patch on win32; it doesn't compile out of the box on linux/OSX:

../../dist/include/xpcom/nsMathUtils.h: In function 'PRInt32 NS_lroundup30(float)':
../../dist/include/xpcom/nsMathUtils.h:96: error: 'NS_floorf' was not declared in this scope
../../dist/include/xpcom/nsMathUtils.h:96: error: 'nscoord' was not declared in this scope

but that looks like a missing #include.

Tp may have dropped slightly, but the tryserver talos numbers aren't in line with the normal tinderbox ones (for this patch, Tp was 444.. previous 2 runs on the tryserver were 446 and 448; main tinderbox Tp is in the 480 range).
For the #include issue, I'll have to move the arch/compiler selection up to nscoord.h instead of nsMathUtils.h, but it's straightforward.  I'm thinking not for b5 at this point, though.
The problem with the patch in OSX and Linux is that the definitions for the non-Windows case needs functions defined later in the module.

Single-precision floating point numbers only have accuracy to about 8 million or about 4 million when you include the sign. Beyond that, you get coarse rounding.

I came up with another alternative that handles numbers from -(2^31) to +(2^31) with performance comparable to the previous assembler routine without any assembler code that inlines at -O1 on Windows.

inline nscoord NSToCoordRound(float aValue)
{
  return nscoord(((unsigned int)(x + 2147483648.5) - (unsigned int) 0x80000000));
}

It produces this code as a function. Of course the first and last instructions can be optimized inline by the compiler.

	fld	DWORD PTR _x$[esp-4]
	fadd	QWORD PTR __real@41e0000000100000
	call	__ftol2
	sub	eax, -2147483648			; 80000000H

The approach is simple for rounding: add 2^31 and then round and then subtract out the 2^31. This will provide the floor behaviour for positive and negative numbers in the range as it makes negative numbers positive.

I need to do some testing on the Mac OSX side with this.

As an interesting aside, I tried this with -O2 optimization and found that the compiler generates slower code for converting floating points to longs with truncation. It uses floating point control word code with -O2 instead of using the faster library code. With -O1, my bench took 3 seconds, with -O2, it too 6 seconds and with the current floorf approach, it took 14 seconds. 
Some statistics on Windows at -O1 and -O2. The program also checks values from -(2^31) to 2^31. The test program is the next attachment. I'm going to try it on Mac OSX next.

F:\mozilla>round1
249999999
249999999
start = 0, end1 = 3765, end2 = 30640
roundint = 3765 milliseconds  NSToCoordRound = 26875 milliseconds
roundint percentage of NSToCoordRound = 14.009302

checking from -2147483648.000000 to -1000000000.000000 by 0.25
checking from -1000000000.000000 to 0.000000 by 0.25
checking from 0.000000 to 1000000000.000000 by 0.25
checking from 1000000000.000000 to 2147483648.000000 by 0.25

F:\mozilla>cl -FAs -O2 round1.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

round1.c
Microsoft (R) Incremental Linker Version 8.00.50727.42
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:round1.exe
round1.obj

F:\mozilla>round1
249999999
249999999
start = 0, end1 = 3796, end2 = 15125
roundint = 3796 milliseconds  NSToCoordRound = 11329 milliseconds
roundint percentage of NSToCoordRound = 33.506929
Michael - does this new version produce results exactly the same as the old one?
Comment on attachment 312209 [details]
Test program for timing and checking results

I'd like to try the new version to see affects on Tp - if anything goes screwy let's back it out asap and differ to .next..
Attachment #312209 - Flags: approval1.9+
Keywords: checkin-needed
hmm, schrep, did you perhaps mean to approve attachment 310454 [details] [diff] [review] instead?
Keywords: checkin-needed
Attachment #310454 - Flags: approval1.9? → approval1.9+
Yes - my greasemonkey script got the best of me..
Keywords: checkin-needed
A few comments:

I ran a comparison of values from -2^31 to 2^31 on Windows and found no differences in values. The test checks values in increments of 0.25. That's what the test program is for besides benchmarking. I modified the program this morning to directly use floats as the current attachment program uses a double in the loop and I needed to test with floats as that's the argument type. Everything was fine.

Things are quite different on Mac OSX though. On Mac OSX, my code is 2% slower because the codegen adds the floating point control stuff to my code and the floor function is apparently much faster on Mac OSX. I also ran the verification stuff and I think that the Mac OSX results are incorrect for the current code - that is the rounding isn't done correctly for some intervals below about -8 million and above +8 million. I'd have to do this code for just Windows as I think that I want to understand better what's going on with Mac OSX.
Checking in gfx/public/nsCoord.h;
/cvsroot/mozilla/gfx/public/nsCoord.h,v  <--  nsCoord.h
new revision: 1.13; previous revision: 1.12
done
Checking in xpcom/ds/nsMathUtils.h;
/cvsroot/mozilla/xpcom/ds/nsMathUtils.h,v  <--  nsMathUtils.h
new revision: 3.2; previous revision: 3.1
done
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9
I think this busted me on linux and several on the Firefox tbox:

In file included from ../../dist/include/layout/nsIPresShell.h:58,
                 from /work/mozilla/builds/1.9.0/mozilla/uriloader/base/nsDocLoader.cpp:58:
../../dist/include/gfx/nsCoord.h: In function ?nscoord NSToCoordRound(float)?:
../../dist/include/gfx/nsCoord.h:298: error: ?x? was not declared in this scope
No updated patch was attached that addressed comment #39, so this definitely burned when it hit the tree since I didn't see that comment. I took vlad's advice in comment #40 and fixed the problem that way, but I forgot to rename a variable, so that's what that issue is. I've renamed the variable, so it should (hopefully) work now.
WE'RE NOW FAILING ACID2, and I think that this change to round-off calculations did it.

On the small black squares on either side of the chin (not the forehead),, we now display TALLER than the reference image. The way the chin div height is calculated is influenced by the div inside of it- we round up to 13 px.
I see that new bug # 426616 has been created for this regression. I'm not sure that this code caused the change, but let's have someone smarter than me check it out...
Depends on: 426616
No longer depends on: 426616
I think that something is changing the FPU rounding mode and isn't changing it back which means that this assumption (that noone changes it) is incorrect. Could someone point me how I can run ACID2?
Thanks, ROC, for taking a look and determining that this WASN'T the cause. I was unsure if the "hopefully final" patch of March 19 had ever actually been applied. And yes, you're right that I DID NOT not the failure occur right around the 4/1 patch-- it's been present for at least a week, and maybe months.

I'll head over to 426616 and see what smarter heads than mine have been brought to bear over there. :))
Product: Core → Core Graveyard
Depends on: 1024274
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: