Closed Bug 654858 Opened 13 years ago Closed 13 years ago

Perma-orange on Linux opt reftest with -O3 for img-widthAndHeight-meet-2.html, img-widthAndHeight-slice-2.html, img-novb-widthAndHeight-meet-1-em.html, img-novb-widthAndHeight-slice-1-em.html

Categories

(Core :: SVG, defect)

x86
Linux
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla7

People

(Reporter: dholbert, Assigned: dholbert)

References

Details

Attachments

(2 files)

Attached file reftest failure log
Since some point this morning, we've had perma-orange in these four reftests:
layout/reftests/svg/as-image/img-widthAndHeight-meet-2.html
layout/reftests/svg/as-image/img-widthAndHeight-slice-2.html
layout/reftests/svg/as-image/img-novb-widthAndHeight-meet-1-em.html
layout/reftests/svg/as-image/img-novb-widthAndHeight-slice-1-em.html

Attaching reftest log, suitable for pasting into:
http://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml

For most of today, this has been mis-starred as Bug 604621.  (see Bug 604621 comment 3 through Bug 604621 comment 8)

This sudden perma-orange seems likely to be related to the GCC PGO disabling this morning (bug 654700).
Whiteboard: [orange]
(In reply to comment #0)
> This sudden perma-orange seems likely to be related to the GCC PGO disabling
> this morning (bug 654700).

...in that, on m-c, it started being orange on the push that disabled PGO, and it's been orange ever since.

(Note that there was a complimentary build machine config change at the same time as that push (~9:30 AM MV time) -- glandium tells me that the m-c push was just a marker for m-c.  That explains why this shows up in tryserver builds that started after that point, whether or not they had that cset as an ancestor.)
/me removes [orange] whiteboard to keep tbplbot from finding this.  AIUI, [orange] is for random oranges, not permanent oranges.  I'd rather not have every single reftest run on all trees trigger a TBPLbot stamp on this bug from now until whenever it's sorted out.

(Feel free to manually star reftest runs with this bug number - we just don't gain anything by having the tbplbot noise here, since there's no randomness)

Barring any objections on IRC, I'll be closing the tree in a few minutes, until this bug is resolved.
Whiteboard: [orange]
(ok, closed the tree)

glandium suggests that this was likely caused by this cset (adding -O3):
   http://hg.mozilla.org/mozilla-central/rev/bc80c46f185d
which was pushed last week, simultaneously with the PGO enabling.

The logic is this -- when we had neither -O3 nor PGO on m-c, we were fine.  When we had both of them together on m-c, we were fine.  (They balanced each other out, or something.)  But now we've disabled PGO, leaving us with just the -O3, and that apparently causes rounding issues in these reftests.

I pushed a backout of the -O3 cset to TryServer (and then another cset without the backout, as a control), and I'm waiting on results from those before backing this out on m-c.
That seems to have fixed the perma-orange on TryServer, so I pushed the backout to m-c.  Resolving this as FIXED; we can reopen if this crops up again.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 590181
For what it's worth, this perma orange doesn't appear on try with gcc 4.5 and -O3 and without PGO on cset dd0f92e2ea92.
The backout was fine for fixing the perma-orange, but I'm reopening this because as glandium mentions in bug 590181 comment 147, if this test fails at -O3, this failure may reappear if the PGO optimizer decides to optimize this code at -O3 at a later date. We should fix these tests to not fail at -O3.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Perma-orange on Linux opt reftest for img-widthAndHeight-meet-2.html, img-widthAndHeight-slice-2.html, img-novb-widthAndHeight-meet-1-em.html, img-novb-widthAndHeight-slice-1-em.html → Perma-orange on Linux opt reftest with -O3 for img-widthAndHeight-meet-2.html, img-widthAndHeight-slice-2.html, img-novb-widthAndHeight-meet-1-em.html, img-novb-widthAndHeight-slice-1-em.html
Status: REOPENED → NEW
My bisection is getting near an end, the trigger frame is between 0e8c23e50c6c
 and c5c1cac00a06 (the latter being a build-system commit ; dd0f92e2ea92 was actually on build-system). All the commits in that range are from bug 641426 except 0e8c23e50c6c, but that one only touch sync.
Blocks: 641426
Does that mean bug 641426 has a bug, or it just triggers a gcc bug?
(In reply to comment #10)
> Does that mean bug 641426 has a bug, or it just triggers a gcc bug?

At this point, I'm tempted to say bug 641426 introduces rounding issues.
Maybe. Did you narrow down to a specific patch or should I do that? I might have to update my Linux distro...
Still bisecting. I'm down to 12f37ad53b41 which is still orange. Next iteration will tell which one of 36f62297c1e1 and 848c520e9158 is the root cause.
848c520e9158 is the one (part 1 of bug 641426)
I don't see any smoking guns there :-(.

That patch can be quite easily split up further; for starters the gfxSize/gfxPoint/gfxIntSize changes can be separated from the nsSize/nsPoint/nsIntSize changes. Can you try that?
The nsSize/nsPoint/nsIntSize half alone triggers it.
The patch I tested: http://hg.mozilla.org/try/rev/a194ef8a8f12
I'm tempted to say that could be related to the use of floats in nsSize::ConvertAppUnits/nsPoint::ConvertAppUnits.
I wouldn't think that ConvertAppUnits would get used unless zooming is involved, and looking at the reftests it doesn't seem that zooming would be involved.
That is to stay, it could be used, but only with aFromAPP == aToAPP.
I verified that by setting break points and running those reftests.
(In reply to comment #20)
> I verified that by setting break points and running those reftests.

Me too, for the nsSize one -- but nsPoint::ConvertAppUnits seems to be called very frequently (e.g. on every window focus / unfocus), and one of its calls during pageload is inside of SVGDrawingCallback::operator() (which draws SVG images), 15 frames down, in a call to nsIFrame::GetOffsetToCrossDoc().

I wasn't using the reftest harness, but I don't think the behavior of SVGDrawingCallback::operator() should care about that.

So it seems possible that nsPoint::ConvertAppUnits might be involved here.
nsPoint::ConvertAppUnits will be called, but the (aFromAPP != aToAPP) case was never hit for me. Was that case hit for you?
You're right, sorry - I was just posting that as a followup.  (I initially misread comment 20 as having verified comment 18, not comment 19)

I've now conditioned my breakpoint on "aFromAPP != aToAPP", and that keeps it from ever being hit.

So, I agree with tn - seems like those ConvertAppUnits methods aren't involved.
Thanks Mike...

It shouldn't be hard to narrow down that patch further to a specific class, and then operator.

I've looked at that patch pretty hard and it still looks innocuous to me :-(.
Now that's interesting. It fails if I only patch nsSize.h, *and* if I only patch nsPoint.h. However, it doesn't fail if I only patch nsIntSize, which would suggest the problem is in nsSize *and* nsPoint. Narrowing down further.
Wait a second, patching nsSize.h only does fail the same way, but patching nsPoint.h only does *not*. It actually fails with different errors.
This: http://hg.mozilla.org/try/diff/ae3fa6d7a6a6/gfx/src/nsSize.h
fails. It looks like the main difference it makes, in the resulting binary, is the way ConvertAppUnits is inlined. My guess is that considering the function is inlined, your breakpoints just don't catch all these inlined instances.
There is only one use of nsSize::ConvertAppUnits in the tree: http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.cpp#7308 It only gets called if we receive a context menu event (and zooming is involved).

I tried editing nsSize::ConvertAppUnits and making it dereference null unconditionally and then running the entire layout/reftests/svg/as-image/ directory of reftests and it didn't crash. So I don't think nsSize::ConvertAppUnits is getting called during these tests.
You can simplify the test patch even more by moving the BaseSize declaration into nsSize.h. Then try removing the Sub template parameter, try changing T to nscoord in BaseSize::width/height, and then remove the entire template declaration.

Moving the nsSize::ConvertAppUnits declaration to not be inlinable might be interesting too.

Mike, when you say you're suspicious about the float usage of ConvertAppUnits, do you mean that you think we might be depending on getting extra float precision in some places but not others, depending on compiler optimizations?
(In reply to comment #30)
> You can simplify the test patch even more by moving the BaseSize declaration
> into nsSize.h. Then try removing the Sub template parameter, try changing T
> to nscoord in BaseSize::width/height, and then remove the entire template
> declaration.
>
> Moving the nsSize::ConvertAppUnits declaration to not be inlinable might be
> interesting too.

Still failing when nsSize is also modified, but I'll also try without nsSize modification.

> Mike, when you say you're suspicious about the float usage of
> ConvertAppUnits, do you mean that you think we might be depending on getting
> extra float precision in some places but not others, depending on compiler
> optimizations?

Yes, something along these lines. Because visually, these errors look like rounding problems. And ConvertAppUnits is about the last remaining thing that involves rounding.
This works:
http://hg.mozilla.org/try/rev/15a6a082aab0

This doesn't:
http://hg.mozilla.org/try/rev/37dbda0cee63

As for not inlining ConvertAppUnits when nsSize if unmodified, it doesn't change anything.
Well, I'm stumped. If the problem is legitimate effects of inlining or optimization decisions, why would it only show up on gcc 4.5 -O3 when we work with all sorts of other compilers, versions and options?

Have you looked at the difference in the binaries for the builds in comment #32?

Is it worth trying -ffloat-store or -fexcess-precision to test the theory that it's related to float precision?
There are a whole lot of differences, so it's really difficult to narrow down to something precisely. However, in one instance of code differences I looked at, a division by a register became a division by a value from memory.

I'll try -ffloat-store.
You could try deleting nsSize::ConvertAppUnits entirely, just delete that one line in nsPresShell.cpp that uses it and see if that changes anything.
-ffloat-store is confirmed to solve the issue, as well as the rounding errors we fixed by changing a few test cases in bug 590181. I'll rerun a test with -ffloat-store on current m-c and see if that makes a big difference on Talos.
Is there a reason why excess float precision would be bad?
Isn't this suggesting that the test is ill-conditioned?
(In reply to comment #37)
> Isn't this suggesting that the test is ill-conditioned?

Just to be clear -- these tests just check that <img> and <embed> render identically in a number of different situations.

If it turns out that the test "correctly" renders <img> and <embed> differently when we add extra float precision, then I'd think that'd indicate a rounding bug in our SVG code, rather than a bug in the test.
We don't want to actually ship -ffloat-store, I'm very sure.

So I guess we just need to reduce the failing test and then debug the bad binary to figure out where the difference creeps in. Daniel, can you do that?
Yup, I'll look into that.
Applying -ffloat-store to specific object files may help narrow things down.

(In reply to comment #38)
> If it turns out that the test "correctly" renders <img> and <embed>
> differently when we add extra float precision, then I'd think that'd
> indicate a rounding bug in our SVG code, rather than a bug in the test.

It's possible that img and embed are "incorrectly" rounded with different algorithms, but, when floating point arithmetic and pixel snapping are involved, if the test arranges for any edges to fall midway between pixels, floating point arithmetic could go either way in some situations.

(In reply to comment #37)
> Is there a reason why excess float precision would be bad?

I guess equality comparisons where one argument has been truncated through a store but the other hasn't would be one situation.  Perhaps a good reason to prefer fixed-point arithmetic.
Assignee: nobody → dholbert
Status: NEW → ASSIGNED
Note that e.g. img-widthAndHeight-meet-2.html looks blurry when rendered with a failing build while it doesn't with a working build.

It could be interesting to see which particular optimization -O3 enables is responsible for the code change.
From http://glandium.org/blog/?p=1998&cpage=1#comment-52757 :

"BTW 64bit control word should be matter of adding -mpc64 to your linking flags. Windows default to that so doing that should reduce disprepancy in between your Windows and Linux x86 test results involving floats. (x86-64 don’t care because of -mfpmath=sse default)"

I also saw -mpc64 suggested a long time ago in another similar precision problem in libxslt.
Nope, -mpc64 doesn't help.
I've been trying to generate a local build that I can reproduce this in, but I've been unable to do so, so far.

I can reproduce it with an affected tinderbox build[1], so I tried compiling with the exact m-c revision and mozconfig of that build (such that about:buildconfig is identical[2] between my local build and the tinderbox one).  However, I still can't reproduce in my locally-made build.

[1] http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1304542786/firefox-6.0a1.en-US.linux-i686.tar.bz2
[2] identical, save for the exact gcc path and version.
The tinderbox build uses "/tools/gcc-4.5/bin/gcc || gcc version 4.5.2 (GCC)",
whereas mine is "gcc-4.5 || gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)"

I wonder if this boils down to the specific sub-revision of GCC 4.5.2 that I'm using on Ubuntu 11.04 vs. the GCC 4.5.2 that's on the tinderbox builds' Fedora installation?  I don't know what else could be a factor. (perhaps some other system library that gets used in compilation that differs between Fedora and Ubuntu?)  I may try to spin up a Fedora installation like the one on tinderbox, to test this.

(BTW, I'm using "img-novb-widthAndHeight-meet-1-em.html" to test for reproducibility.  That test (and its "-slice-" variant) are the simplest to see the bug in -- buggy builds have 3 distinct row-widths, whereas correctly-working builds have only 2 distinct row-widths.)
Yeah, I can't reproduce with the gcc from Debian either. AFAIK we don't patch our gcc, so it's a vanilla upstream gcc 4.5.2.
FWIW, I've been able to reproduce with our gcc 4.5. I wonder if bug 633072 is the same kind of problem (in which case I think the "fix" there just hides a real problem)
the thing bug 633072 changes is to avoid the blending that takes place when two colour regions merge.
There have been a couple of intermittent Linux failures of this starred in bug 604621. I'm guessing it's just GCC's optimizer occasionally deciding to optimize some code at -O3 instead of -Os, and hitting this bug?
This actually looks like it might be perma-orange on the build-system & services-central tbpl pages, ever since some time yesterday.  (Those trees have a low enough push-rate that there have only been a few pushes since then (just 1 on services-central), and they all seem to have hit this...)
The 1 services-central push that hit this was just a merge from m-c.  I triggered 2 more reftest runs on that push, and they were both orange (though tbpl doesn't show them).  I also triggered 1 more reftest run on the previous push (also a merge), and that one remained green.
Yeah, I mentioned that in bug 604621. It went orange on my b-s merge from m-c, and then orange on the next two pushes (neither of which were C++ code changes).
The permaorange is from bug 658452, which switched the generic mozconfig to using gcc-4.5 without PGO.
(In reply to comment #40)
> Yup, I'll look into that.

All right, I've finally gotten a locally-built vanilla gcc 4.5.2 successfully compiling buggy firefox builds, in an Ubuntu VM.  (I'd initially tried to use our build team's GCC rpm on a fedora VM, but my VM was missing some additional configuration & couldn't build firefox with that gcc version for some reason.)

It looks like the issue (at least in a reduced testcase I made) is that the values in our outer SVG frame's intrinsic ratio end up being off-by-one due to float-to-nscoord rounding, here:
> 267 nsSVGOuterSVGFrame::GetIntrinsicRatio()
[...]
> 278     nsSize ratio(width.GetAnimValue(content), height.GetAnimValue(content));
http://mxr.mozilla.org/mozilla-central/source/layout/svg/base/src/nsSVGOuterSVGFrame.cpp#278

Those GetAnimValue methods return floats, whereas the nsSize constructor wants nscoord values.  So we're implicitly doing float-to-int conversion here, and leaving ourselves open for rounding error.

If I add NSToCoordRoundWithClamp() around the floats here, that appears to fix this.
Attached patch fixSplinter Review
(this patch applies the same fix to another chunk with the same problem, a little lower in the same function)
Attachment #538957 - Flags: review?(roc)
I pushed two PGO-disabled builds to try (hopefully matching the config on build-system & services-central trees that are hitting this). First push lacked fix:
 http://tbpl.mozilla.org/?tree=Try&rev=da3e33dd18a1
...and the second push had fix:
 http://tbpl.mozilla.org/?tree=Try&rev=fb41562e71fd
Hopefully the first will be orange & the second will be green...
(In reply to comment #56)
> Hopefully the first will be orange & the second will be green...
...and that's exactly how it turned out. So, patch confirmed to work on tinderbox. Woot!
Blocks: 658313
Comment on attachment 538957 [details] [diff] [review]
fix

Review of attachment 538957 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #538957 - Flags: review?(roc) → review+
Landed on m-i: http://hg.mozilla.org/integration/mozilla-inbound/rev/b3e5dd43281f
(wanted to land directly on m-c, but it's pretty orange right now)
Target Milestone: --- → mozilla7
Whiteboard: [inbound]
http://hg.mozilla.org/mozilla-central/rev/b3e5dd43281f
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Thanks for tracking this down and fixing it!
Setting resolution to Verified Fixed on Mozilla/5.0 (X11; Linux x86_64; rv:7.0) Gecko/20100101 Firefox/7.0 beta 1

Based on today's TBPL http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Beta/1314181061.1314181851.11237.gz&fulltext=1
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: