Closed Bug 97861 Opened 23 years ago Closed 14 years ago

Rounding error in nsTransform2D

Categories

(Core :: Layout, defect)

x86
OS/2
defect
Not set
major

Tracking

()

RESOLVED FIXED
Future

People

(Reporter: jhpedemonte, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: testcase)

Attachments

(4 files, 3 obsolete files)

On OS/2, there are several off-by-one pixel errors where images or text will
shift by one pixel when they are clicked.  I tracked this down to an issue in
gfx/src/nsTransform2D.cpp, where in some cases m20 = 241.4999985 (and got
rounded to 241.5), and in other cases m20 = 241.5000000 (and got rounded to 242,
a difference of one pixel).  

This is due to the way the values of m20 and m21 are calculated.  They are
stored as floats, but the CPU does calculations using doubles.  So while all the
variables are floats, when doing float arithmetic, everything gets converted to
double, then mutltiplied/added/etc, and then saved to a float.  This last step
causes some information to be truncated.  Take this example, where:

From the code in nsTransform2D::AddTranslate, where [ m20 += ptX * m00 ]:

 First time through:
  m00 = 8.33333358e-02
  ptX = 2.76000000e+02
  m20 = 0
  => When multiplying [ ptX * m00 ], these two float values get changed to
doubles in order to the the mutliplication, and then added to (double)m20, which
is 0.0f.  The result is 2.3000000685453415e+01.  This then gets stored in a
float, so the result gets truncated to 2.30000000e+01.  Now lets look at the
second pass.
 
 Second time through:
  m00 =  8.33333358e-02
  ptX = -2.76000000e+02
  m20 =  2.30000000e+01
  => Again, in order to the mathematical operations, the floats get converted to
doubles, with the result of [ ptX * m00 ] being -2.3000000685453415e+01.  This
then gets added to double-converted m20, which is 2.3000000000000000e+01. 
Therefore, the result of their addition is -6.85453415e-07, although we would
expect zero.

The problem here is that we are not properly considering of the most significant
bits.  In the second pass, we should have cast the result of [ ptX * m00 ] to a
float, in order to properly truncate the irrelevant data at the end of the
number.  By saving [ ptX * m00 ] to a temporary float variable, and then adding
to m20, we get the correct result of zero.
Attached file Test case
The above test case should compile on any platform, and should display the error
on any intel based platform (os/2, windows, linux, etc).  As you can see, saving
to temporary float variable removes the errors from the mutliplication.

From what I can tell, there are two solutions to this problem.  One is to change
the internal members of the nsTranform2D class (m00, m01, m10, m11, m20, m21, as
well as the temporary variables used throughout) to double.  This stores the
'errors' from the mutliplication and produces the correct results.  I have tried
this on OS/2, and it fixes the one-pixel problems I was seeing.

The better solution, though, is to make sure that when adding or subtracting we
store the results of mutliplications in temporary float variables, in order to
truncate the 'error' bits.
dcone:

This appears similar to http://bugzilla.mozilla.org/show_bug.cgi?id=19601

What do you think of this?

I also have looked at some of these problems and have found the following case.


When you scrolled a page in the Y direction the element looked like it was 
jiggling because the width of this element would vary by a pixel and this was 
caused from the calculation of the starting position and the error introduced 
there.  For example.. if the starting y position was 5.3, and the width is 5.4, 
the right side of the element would be 10.7, rounded you get 11, for a width of 
11-5 or... 6.  If the starting position was 5 and the width is 5.4 the pixel 
that the right would land on would be 10.4, rounded would be 10 and the width of 
the element would be 10-5 or... 5.  So the width of this element would jiggle 
depending on how and were this starting position was calculated.  The solution I 
used to fix this.. was cache this error and use it for the calculation of the 
width.. to keep the right side from moving, or calculate the width at the time 
your calculating the left side.  We sometimes calculate the right side, then the 
left side and I have tried to hunt those down and fix them.  Using double will 
not fix this problem.. but I am not sure this is contributing to your bug.  
Attached patch possible fix (obsolete) — Splinter Review
I have attached one a solution here, that makes sure to change the results of
any multiplication to float before doing the addition.  In this case, I've only
made the changes when calculating m20 and m21.  If you add printfs to output m20
at the end of the changed functions, you can see how this fix makes things
better:  the instances of m20 ending with .4999997 have disappeared.
I think this is the wrong approach. The fundamental issue is converting
from floating point to integer. That's where the problems actually occur
and that's where the problems should be addressed. Fiddling with
floating point is just masking the issue.

In a certain sense the problem is too much precision in the
calculations. The example here shows a loss of two decimal digits of
precision (between 6 and 7 bits) but even a ieee float has 23 bits so
that leaves a lot of bits.

I think a better solution is to retain some extra bits when converting
to integers and do the final rounding in the integer domain. Something
like this:

#define Real float
#define EXTRABITS 8
#define N (2<<EXTRABITS)
#define NR ((Real)N)

int toInt(Real f)
{
        int i = NR*f + (Real)0.5;
        return (i + N/2) >> EXTRABITS;
}

This is a clear candidate for inlining or macrozing.

Just a suggestion.
That is right.. the problem comes from to much presision.. as I mentioned above.  
If you calculate a position of an two different elements.. relative to a point, 
let say is 0,0, the error of rounding to a pixel location will not be consistent 
from one element to another. See my comments up above and in the related bug to 
understand more deeply.  The only way to consistently place elements is to 
always use the error from previously calculated elements and add in those 
error.. a virtual screen if you will, but these errors can not just be thrown 
away.. or rounded a different way to solve this problem.
->layout
Assignee: trudelle → attinasi
Component: XP Toolkit/Widgets → Layout
QA Contact: jrgm → petersen
Over to rendering, and specifically, to dcone.
Assignee: attinasi → dcone
dcone:  then is this solved by making m20 and m21 doubles?  This way, they
'save' the error across function calls, and it gets factored in when doing the
calculations.
Keywords: testcase
*** Bug 102391 has been marked as a duplicate of this bug. ***
This is also causing our bug where scrolling some images causes them 
to have white lines across them.
I think modifying cross-platform code is a bad idea. All the ideas, so
far, make implicit assumptions about how the compiler works and that's
bound to cause trouble.

On Linux, gcc works differently than the compiler you're using.

For the original example code, gcc reproduces the supposed errors
without optimization but, with optimization, gives "correct" results.

The *toFloat helper functions in the proposed patch are quite small so
they could well be prime candidates for inlining by an optimizing
compiler. If that happens, then you must be concerned that the compiler
might notice that all it's doing is popping an fpu stack value into a
temporary register and pushing that value back onto the fpu stack so it
might decide that's a waste of time and eliminate it. Gcc does inline
the functions but does not eliminate them. However, I haven't tried all
possible combinations of command-line options so I certainly wouldn't
want to depend on this behavior.

Changing all the floats to doubles doesn't fix anything; it just makes
the problem less likely. On Linux, the fpu generally runs in extended
precision so even doubles are prone to this sort of round-off error. I
have some examples where gcc gives surprising results for seemingly
simple double precision calculations.

Also, doubles are twice as big as floats so you have to consider the
possibility of stack overflows.

If you want a cross-platform solution, you almost certainly need to
examine the users of these classes. It might be faster to just change
platform-specific stuff instead.
You are correct about the doubles;  i spoke without thinking.  However, we keep
seeing more and more problems in our build are related to this issue, so we
would really like for this to be fixed.  The ToFloat fix was one solution that I
came up with.  I have tried to implement dcone's suggestion (saving the error
across function calls), but I have not been successful.
As an aside, I need some clarification on how nsTransform2D is used in the code.
 Basically, the original problem stems from using two different nsTransform2Ds
(which are calculated differently, but should end up with the same values) when
dealing with the same image.  Why is this?  Shouldn't the same transform be used
every time we deal with that image?
This problem can be recreated on windows.  If you hardcode mTwipsToPixels in
nsDeviceContext to be 1/12.0 (what it is set to on OS/2, while windows is
normally set to 1/15.0), you will see instances of text and images moving one
pixel when you click or highlight it. 

Also, I don't think saving the error across function calls will work.  In this
case, the error would just get compounded over each iteration.

The problem here is one of Least Significant Digit.  When we do the
multiplication of the floats, the result is a double (until it gets saved to a
float member var).  We then add this double to a float.  However, this is not
the correct thing to do.  We are introducing error into the equation.  I coded
ToFloat in order to deal with this issue.  We have two floats being multiplied,
so the end result (before any addition/subtraction) should be a float.  The
'error' should be immediately discarded before any further calculation.
The problem described is similar to the one-dimensional problem of maintaining a
checking account balance in float or double.  You can add and subtract dollars
and cents but the balance is only an approximation because pennies (0.01) are
not exactly representable in float or double.  You can add and subtract many
transactions with a net sum of zero, but if you examine the final balance it can
be less than zero, equal to zero, or greater than zero due to rounding
approximations at each transaction for a given path.  How does one resolve this
problem?  The simplest and most direct solution is to maintain the balance in
pennies stored in an integer.  The balance is always precise and correct.

How does this relate to the problem at hand? Adding and subtracting x and y
translations that are imprecise into an accumulator that is imprecise will cause
the balance to be imprecise.  If the positional translation is in an integer
type e.g., milli pixels and the translations occur in milli pixels, the balance
will always be exact.  One will return to the origin with moves that sum to
zero.  A similar strategy could also be adopted for rotation (milli degrees) and
absolute magnification.

The transformations will be a bit more complex, but the inconsistencies observed
will go away.  No amount of extra precision or rounding strategies will solve
the problem presented in the first paragraph.
Attached patch patch based on khanson's post (obsolete) — Splinter Review
I tried to come up with a patch based on khanson's post above.  I made the
private member variables all integers.  m20 and m21 now represent hundredths. 
The others are multiplied/divided by 60;  I chose this number since it is evenly
divisible by 12 and 15 (the twips-to-pixels values for OS/2 and win32,
respectively).  It has fixed most of the issues I was seeing, although the
printf's have shown that m20 can still end with .48 or .49 (just not as often as
before).  I am still looking into this.  Obviously, this is a work in progess. 
Let me know what you guys think.
I wrote this next bit before the latest comments.

Integer arithmetic is attractive and it may well solve a number of
problems but it has issues of its own.

C/C++ only permits integer arithmetic modulo some power of two.  
Overflows are silently truncated and the language does not tell you when
this happens. Since losing the most significant bits is a bad idea here,
it is necessary to limit the size of the input operands so overflows
cannot happen. For an underlying n-bit integer, the inputs must be n/2-1
bit integers or smaller. If you internally scale the inputs to help
control round-off errors then the inputs must be even smaller.
Furthermore, avoiding overflow imposes stringent conditions on the type
and order of operations that are allowed. For example, a*(b+c)/d might
be allowed but a*(b+c+d)/d might not.

Obviously this has been done before so it's not impossible. It is,
however, more difficult that just changing floats to integers. Note also 
that all the testing and whatnot may be slower that using floats.

As for the latest patch, I hate to say it but the fact that using
integers doesn't fix the round-off problems shows that neither the
problem nor the solution lie here. I think we all need to look at who 
uses these functions and how they use them.
Actually, I found the last remaining issue, which seems to fix the remaining
rounding problems.  m20 & m21 may need to represent sixths, so I changed the
factor involved with them from 100 to 60;  so now all the member vars are
multiplied/divided by 60, simplifying some of the calculations.

Yes, I agree that overflow may be a problem in this case, and we will continue
to look into it.

As for the calling functions, many of the x and y and dimension coordinates come
in as float, although for all functions but one they are actaully integers cast
as floats and then passed in.  So maybe those should all in fact be changed to
integers.

We have an issue on OS/2, though (and possibly other platforms), that causes
visual inconsitencies that we would like to fix as soon as possible.  If you are
talking about changing all the calling functions, then maybe that should be
opened up as a different bug that is more longterm, while using this bug to come
up with a shortterm solution to fix the OS/2 problems.
Attached patch better patch (obsolete) — Splinter Review
I agree there seem to be two issues. The idea of caching errors seems 
odd. Floats are perfectly fine containers for this sort of data. That 
most likely should be addressed higher up.

I renew my objections to fixing OS/2 problems in cross-platform code. 
Can you do it in OS/2-specific code instead?
This isn't an OS/2 specific problem.

As Javier mentioned, it is just a coincidence that only OS/2 sees it. If Windows 
were 120 DPI instead of 96 DPI at the OS level, it would see this problem as 
well. You can show this by making the change Javier mentioned.

The bug IS cross platform. Only OS/2 is visually showing the bug.
Attachment #52971 - Attachment is obsolete: true
Attachment #53123 - Attachment is obsolete: true
Added a patch to test for overflow in every multiplication and addition.  Ran
with this patch for a while, and also ran some of the browser buster automated
tests.  Did not run into any of the overflow assertions during that time.
I believe dcone is on sabbatical.

I'm reassigning the bug to pavlov, cc'ing roc and waterson for expertise and
super-review.  I don't see how anyone can argue that the code does not have
problems on any platform, given the right DPI and other constraints.  I think we
should try to fix this soon, and if necessary, file a long-term fix sequel bug.

Pav, I'll stop by your cube.

/be
Assignee: dcone → pavlov
Keywords: mozilla0.9.6
It appears that most of the consumers of this class are using the
integer *Coord methods, e.g. TransformCoord and all these methods depend
on nsToCoordRound in xpcom/ds/nsUnitConversion.h. The coordinate
functions there are rather simplistic.

I suspect that the quickest fix would be to replace these coordinate
functions with something more sophisticated.
mkaply: I seem to recall you mentioning that changing the member variables to
doubles also solved the problem.  Is that still the case?
Changing to doubles just hides the problem temporarily, it doesn't fix it.

What really needs to happen here is to use integers for these types of 
computations.
BTW, you can't assume anything other than float precision in the 
floating point because mozilla neither checks nor sets the fpu control 
word.
Brendan, I already knew about the FIX_FPU macro but it's a nullity on
Linux and other platforms and we're talking cross-platform code here. I 
tend to think there's a js bug here.

There's also the possibility that a third-party library could play with
the control word.
tenthumbs: can you provide a patch?  We've never had troubles in 5+ years
outside of Windows (wherefore FIX_FPU).

/be
Brendan, I filed bug 109286 on the js issue.
More information.

We found an image painting error that also only happens when DPI is 1/12 - it 
doesn't happen with 1/15.

I don't know what to do with these problems. Can someone tell me what 
mTwipsToPixels is set to on other platforms (Mac, Linux)
hinting back and forth between this and bug 104544
Is this the right layer to worry about this?  If the transform classes are
defined to operate over floats, then the users of those classes must take
the appropriate care when converting to/from floats.  In specific, in the
context of bug 104544, when float coordinates are used for purposes of
refresh, the rendering code should be *pessimistic* and round outward.
For other purposes, rounding downward may be more appropriate.  My point is
that the decision over conversion policy may best be decided by the user of
these transform classes, and not handled as an internal matter.
... and regarding that patch (and applying it to
nsDeviceContextOS2::CommonInit), I believe that change is necessary in order to
fix all rounding errors.  See my explanation in bug 16200 comment 16.  (What I'm
not sure of is whether the rounding errors you're seeing are necessarily caused
by that problem.)
Hmm.  I'm going to read these in more detail, but on first flush this sounds
similar to some issues where moving the mouse over or using UI (XUL) and
sometimes page elements would cause them to shift position by a pixel.  I've had
that on and off on FreeBSD for a year or more; I think I reported it at least
once.
Take a look at the definitions of CEIL_CONST_FLOAT and
ROUND_EXCLUSIVE_CONST_FLOAT in mozilla/xpcom/ds/nsUnitConversion.h,
particularly the comments.

The constants are far too precise for single precision and, on Linux,
gcc compiles them to exactly 1.0 and 0.5 so many of the functions in the
file are seriously broken.

I redefined them to this
  #define CEIL_CONST_FLOAT (1.0f - FLT_EPSILON)
  #define ROUND_EXCLUSIVE_CONST_FLOAT (0.5f*CEIL_CONST_FLOAT)
and ran some tests on the functions and there are a _lot_ of
differences.

If these things are busted, nothing else can possibly work right.
brendan, sorry to let you down, but I don't have a lot to add here. I notice
that my p2t is 19. (Which seems to mean that we'd need to change the patch to
multiply by 1140 instead of 60 if I'm to be happy? And lord knows how we'll
please dbaron, who has a p2t of 15 1/6.)
This bug probably depends on bug 118117 or bug 120690. At the very least 
they're related.
i'm seeing something exactly like this on macOSX with both gcc and codewarrior.
UI elements are shifted by a pixel on hover and click, but only _sometimes_. 

It's driving me crazy, and i've isolated it to issues in the transform2D with
m20 and m21. Smells a lot like this.
Severity: normal → major
Keywords: mozilla0.9.9
OS: OS/2 → All
Hardware: PC → All
Yes, I have noticed the same issues on MacOSX.  Unfortunately, as pointed out
earlier, the patch above which uses 60 as a common multiple only works for the
default cases of windows and OS/2.  It will not work correctly for DPI for other
platforms or 'custom' dpi settings.  We need someone who is familiar with this
code to come up with a better solution.  I tried, but I don't know it well enough.
i applied the patch and it didn't fix my problems on osx. i still see jiggling.
What is the PixelToTwips value in OSX?  On windows, it is 15.  On OS/2, it is
12.  Both divide evenly into 60.  That's why I chose that number.  Take the
PixelToTwips value on OSX and multiply it by 60.  Change all the '60's to this
number in both the cpp and h file.  Does this fix the jiggling?
on mac, the pixelsToTwips value is <drum roll> 17.1458266

gotta love it ;)
pinkerton: If it's not an integer, the problems you're seeing are likely just
bug 16200, and you should probably port the fix to that bug to the mac.
dbaron, ok, doing that fixed some of my issues (jiggling checkboxes) but not all
(i still have jiggling scrollbars). Any ideas?
so making sure mPixelsToTwips helped on the cfm mac builds, where it was
14.99999 and is now 15.0. However, on the mach-o builds it was 17.xxx and is now
17.0. 17 isn't a friendly multiple, so we still get the same behavior of jiggling.

I'll try the 17*60 trick with the patch in a few.
*** Bug 123235 has been marked as a duplicate of this bug. ***
Target Milestone: --- → Future
Is this the underlying cause of bug 83289 ? It sure sounds like it.

Just a side note: On Windows with Large Fonts my Pixel to Twips = 15.0 and with
Small fonts = 12.0. It can be different values on the same OS. 

I can also confirm that I get lines in images/text during scrolling on
particular pages with p2t = 15.0, but not at p2t = 12.0.

Attachment #48908 - Attachment is obsolete: true
Given that this problem is not going to be fixed any time soon (we probably need 
to remove twips from Mozilla completely) and given that it makes the nightly 
OS/2 builds look like crap, I'd like to propose that we create an 
nsTransform2DOS2.cpp that is only built on OS/2 so we can put in the fixes we 
have that make OS/2 better.

What do people think of that?

Note it would have to live in gfx/src, since gfx/src and gfx/src/os2 are two 
different DLLs (shared libs)
I have tried changing my p2t to 16, and that seems to help. (it was 20.) I think
the reason is that t2p is now 1/16, which can be written exactly in binary.
(0.0001). 
I am still seeing some problems, but not nearly as much. My testcases for bug
83289 are fixed (without the patch in that bug).
This is on linux, build from cvs, around 20020318.
Blocks: 134942
OK, so this bug won't be fixed anytime soon and Mozilla 1.0 is approaching
fast.

I want to create an nsTransform2DOS2.cpp and use it only for OS/2.

I am sick and tired of the fact that OS/2 nightlies suck eggs because of this
problem.

Looking for review.
I am certainly in favour of anything that allows me not to 
wonder whether or not I'm wearing my contact lenses.

I don't think any change could make it much worse, so I 
think the suggestion is a goer.
Comment on attachment 81544 [details] [diff] [review]
Temporary fix for OS/2

r=cls
Attachment #81544 - Flags: review+
This appears to be an OS/2 only fix. Adding adt1.0.0+ for 1.0 checkin.
Keywords: adt1.0.0+
I've checked the OS/2 only fix onto branch and trunk.
Keywords: adt1.0.0+fixed1.0.0
I don't access to a 0S/2 machine to verify this issue. Anyone willing to check
this on trunk and branch ?
Blocks: 63336
Blocks: pixels
Assignee: pavlov → nobody
QA Contact: chrispetersen → layout
Could someone verify this now after the unit fix and the border rewrite landed? I think it should be gone now.
https://bugzilla.mozilla.org/show_bug.cgi?id=173051 also seems to have fixed an nsTransform2D rounding error. Is this still a bug?
Changing the OS to make this bug appears in the Warpzilla searches, getting it visible for those who can verify it. And who still care about OS/2.
OS: All → OS/2
Hardware: All → x86
nsTransform2DOS2 was checked in on 2002-05-15 20:42 and was again removed in bug 156402 / a checkin on 2003-01-15 15:15 (see http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/gfx/src/nsTransform2DOS2.cpp).
Whatever happened this temporary thing fixed the problem before a more thorough patch was created. I don't see a reason to leave this open.

--> FIXED in 2002
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: