Closed Bug 675474 Opened 13 years ago Closed 13 years ago

svg/as-image failures with OpenGL layers on linux

Categories

(Core :: Graphics, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla9

People

(Reporter: mattwoodrow, Unassigned)

References

Details

(Whiteboard: [inbound])

Attachments

(2 files, 3 obsolete files)

With OpenGL layers enabled on linux, I'm getting a bunch of intermittent failures in svg/as-image. These are bottom parts of the image and look like some sort of synchronization issue.

Example reftest log: http://tinderbox.mozilla.org/showlog.cgi?log=Try/1311929219.1311931062.4099.gz&fulltext=1

Example failure: http://pastebin.mozilla.org/1286043 (in case the reftest log is deleted)

I can't reproduce any of these on my system.

I've tried all sorts of synchronization tests when reading the GL buffer back to software, including usleep(), XSync() and GLXWaitGL(). We also do an XSync() before binding the X pixmap to a texture.
Interestingly adding usleep(100000) into the bind texture call (after XSync) fixes the problem. I also tried adding reftest-wait and 1 second setTimeout, and this had no effect.

What synchronization could we be waiting on, that isn't fixed by XSync() and would have changed with GL?

Is it possible that there is more than one X display, and XSync(DefaultXDisplay(), False) is incorrect?
The function DefaultXDisplay (https://mxr.mozilla.org/mozilla-central/source/gfx/src/X11Util.h#66) uses GDK_DISPLAY that is deprecated since 2.22:
http://developer.gnome.org/gdk/stable/gdk-X-Window-System-Interaction.html#GDK-DISPLAY:CAPS

Maybe using this new function would correct this behaviour? Or even better, you could try using XOpenDisplay(NULL), that should work with either GTK or QT.

However maybe using other functions than XSync could be better in terms of performance (http://www.opengl.org/sdk/docs/man/xhtml/glXWaitX.xml or the last lines of http://glprogramming.com/red/appendixg.html)
Thanks Marco. I'm currently more worried about correctness than performance though.

As you correctly point out, XSync() should be overkill for this operation (and something that we should change once this bug is sorted). Instead I'm needing to manually sleep to get the correct synchronization.

I can think of two possible causes:

- The surface's XDisplay() is different to that of DefaultXDisplay(), and we're syncing the wrong display connection. (Does this even make a difference?) I'll run a tryserver build soon to test this possibility.

- These svg/as-image tests are waiting on something separate. Are svg data urls decoded in a separate thread? Is it possible that by requiring that surface is reading slightly earlier (at thebes layer render time, instead of XSync'ing at the end of compositing) that we are just attempting to draw before the decode is done?

Any other possible ideas would be appreciated, I am very much guessing here.
The problem of the wrong display sounds strange, I think this isn't the cause of the failed reftest.

It could be a synchronization problem, above all if SVGs are drawn in separate threads with OpenGL. In this case infact I think XSync isn't enough for synchronization. I think the best thing to do is to have a unique context and use a main thread to dispatch OpenGL commands.

http://developer.amd.com/gpu_assets/GDC2005_OpenGL_Performance.pdf - From page 33 to page 37.

http://www.equalizergraphics.com/documentation/parallelOpenGLFAQ.html

http://abral.altervista.org/code/?c=glthreads - This is an example from Mesa of multiple rendering threads.

These three links are useful only if SVGs are drawn in different threads with OpenGL. In this case we can also try to disable OpenGL acceleration from Cairo (that, I suppose, draws SVGs) and see what happens.

Have you tried also glXWaitX() or glFlush() or glFinish() instead of XSync()? glFinish() should wait until every OpenGL command is complete, so this should be the slowest but should guarantee strict synchronization.
We only have compositing happening with OpenGL. All SVG drawing will be happening in software, through normal X API's.

Indeed the X Display thing appears to not be an issue.
I've tried glXWaitX, and this patch to draw all thebes content before compositing.

This should improve performance (reduce the number of synchronization points we need), and I was hoping it would reduce the problem by delaying our bind texture calls.

The problems persist on tryserver and I've run out of ideas at this point.
Attachment #553675 - Flags: feedback?(bjacob)
Hrm, I may have mixed up my tryserver emails. Looks like I didn't get any failures with glXWaitX.

Benoit: Can you test this to see if it fixes the problem you're seeing in bug 678134?

I'll test more thoroughly with tryserver when the trees reopen.
Attachment #553675 - Attachment is obsolete: true
Attachment #553675 - Flags: feedback?(bjacob)
Attachment #553688 - Flags: feedback?(bjacob)
Comment on attachment 553688 [details] [diff] [review]
Draw all thebes content before compositing v2

Had to apply the patch from bug 675532 first.

This patch fixes bug 678134 for me! \o/
Attachment #553688 - Flags: feedback?(bjacob) → feedback+
Comment on attachment 553688 [details] [diff] [review]
Draw all thebes content before compositing v2

Tryserver looks good! Hooray
Attachment #553688 - Flags: review?(bjacob)
Comment on attachment 553688 [details] [diff] [review]
Draw all thebes content before compositing v2

Adding :roc for reviewing the Layers bits.

>   mGLContext->fClearColor(0.0, 0.0, 0.0, 0.0);
>   mGLContext->fClear(LOCAL_GL_COLOR_BUFFER_BIT | LOCAL_GL_DEPTH_BUFFER_BIT);
> 

This isn't part of your patch, just context, but I would like to know: the mGLContext here, could it ever be the GL context underlying a WebGL context? Like, in the case of a CanvasLayer corresponding to a WebGL context? I ask because, in this case, we would really need to call GLContext::ClearSafely() here instead of the above naive clearing code.

>-  // Drawing thebes layers can change the current context, reset it.
>+void
>+ThebesLayerOGL::RenderLayer(int aPreviousFrameBuffer,
>+                            const nsIntPoint& aOffset)
>+{
>+  NS_ABORT_IF_FALSE(mBuffer, "should have a buffer here");
>+  
>+  gfxMatrix transform2d;
>+  PRUint32 flags = 0;
>+  if (GetEffectiveTransform().Is2D(&transform2d)) {
>+    if (transform2d.HasNonIntegerTranslation()) {
>+      flags |= ThebesLayerBufferOGL::PAINT_WILL_RESAMPLE;
>+    }
>+  } else {
>+    flags |= ThebesLayerBufferOGL::PAINT_WILL_RESAMPLE;
>+  }
>+
>   gl()->MakeCurrent();
>+  gl()->fActiveTexture(LOCAL_GL_TEXTURE0);

Why is a new fActiveTexture call needed here? If it wasn't needed before.

> 
>     PRBool SwapBuffers()
>     {
>         if (!mDoubleBuffered)
>             return PR_FALSE;
>         sGLXLibrary.xSwapBuffers(mDisplay, mDrawable);
>+        fFinish();
>         sGLXLibrary.xWaitGL();
>         return PR_TRUE;
>     }

Why is the finish() needed here? Also, have you measured the impact on performance?

There are GL contexts that never interfere with XRender rendering, such as WebGL contexts. Why should they pay the price of having a finish() at each swapBuffer()?
Attachment #553688 - Flags: review?(roc)
> There are GL contexts that never interfere with XRender rendering, such as
> WebGL contexts. Why should they pay the price of having a finish() at each
> swapBuffer()?

Hm, I guess this was a stupid question as WebGL contexts render to a single framebuffer object so they're not concerned with SwapBuffer. Ignore that.
Comment on attachment 553688 [details] [diff] [review]
Draw all thebes content before compositing v2

Review of attachment 553688 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/opengl/ThebesLayerOGL.cpp
@@ +729,5 @@
> +      flags |= ThebesLayerBufferOGL::PAINT_WILL_RESAMPLE;
> +    }
> +  } else {
> +    flags |= ThebesLayerBufferOGL::PAINT_WILL_RESAMPLE;
> +  }

Change both occurrences of this to the simpler
  if (!GetEffectiveTransform().Is2D(&transform2d) ||
      transform2d.HasNonIntegerTranslation()) {
    flags |= ThebesLayerBufferOGL::PAINT_WILL_RESAMPLE;
  }

... or better still, add gfx3DMatrix::HasNonIntegerTranslation() and call that.
Attachment #553688 - Flags: review?(roc) → review+
> This isn't part of your patch, just context, but I would like to know: the
> mGLContext here, could it ever be the GL context underlying a WebGL context?
> Like, in the case of a CanvasLayer corresponding to a WebGL context? I ask
> because, in this case, we would really need to call GLContext::ClearSafely()
> here instead of the above naive clearing code.

No it can't, |this| is a LayerManagerOGL, so mGLContext is always a context wrapping a window.

> 
> Why is a new fActiveTexture call needed here? If it wasn't needed before.

It was always there, just before the Thebes drawing code.

> Why is the finish() needed here? Also, have you measured the impact on
> performance?
> 
> There are GL contexts that never interfere with XRender rendering, such as
> WebGL contexts. Why should they pay the price of having a finish() at each
> swapBuffer()?

This is not intentionally part of the patch, will remove it :)
Removed fFinish call, and tidied the resampling check.

Note that I didn't implement gfx3DMatrix::HasNonIntegerTranslation for the sake of my 3d transforms patches. I'm happy to add this later if you like.
Attachment #554284 - Flags: review?(bjacob)
Attachment #553688 - Attachment is obsolete: true
Attachment #553688 - Flags: review?(bjacob)
Comment on attachment 554284 [details] [diff] [review]
Draw all thebes content before compositing v3

Carrying forward r=roc
Attachment #554284 - Flags: review+
Attachment #554284 - Flags: review?(bjacob) → review+
As a side note, it's interesting that swapping XSync for glXWaitX fixes this bug. The texture_from_pixmap spec (http://www.opengl.org/registry/specs/EXT/texture_from_pixmap.txt) says that either can be used to synchronize before binding a texture.

It appears that XSync is flushing the X command pipeline, but not waiting for the (assumedly OpenGL) XRender pipeline to complete.
I believe one of the changesets in that push (bug 594876, bug 675474 or bug 675532) has caused an OSX 10.6 reftest perma-orange, so needs backing out:
http://tbpl.allizom.org/?tree=Mozilla-Inbound&usebuildbot=1&rev=0a920411e64c
Note that Talos reported massive perf regressions on Linux for the push this was part.
Looks like this patch also broke some tests on mac.

Splitting this into the glXWaitX() part which is needed for GLX, and the Validate() cycle which is an optimization for all platforms.

Carrying forward r=bjacob
Attachment #554284 - Attachment is obsolete: true
Attachment #555276 - Flags: review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/99c5e7b68207


Landed the first part on inbound
Whiteboard: [inbound]
http://hg.mozilla.org/mozilla-central/rev/99c5e7b68207
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
Depends on: 687831
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: