Closed Bug 1072140 Opened 10 years ago Closed 10 years ago

glDrawArrays() call on fbo leaves un-drawn gaps (and may draw past end-of-buffer)

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: cjones, Unassigned)

References

Details

Attachments

(1 file)

Attached image Layerscope dump
See the attached screenshot. This is a layerscope dump of a dead-simple page with some text on top of a background gradient. So gecko is using one intermediate surface for the page. Layerscope shows a distinctive pattern of banding on all the tiles. This is somewhat reminiscent of the distortion in bug 1072138, so possibly related. (However, I don't 100% trust the display of layerscope here --- I'm not sure where in the pipeline this corruption could be happening. So first thing to do is verify layerscope.)
I should add that the pattern of pink lines isn't entirely static --- sometimes more or fewer pink regions spring into or out of existence. (And yes, that screams "cache coherency" again.)
The pink in the readback surfaces is perfect pink, #ff00ff. The origin seems to be http://mxr.mozilla.org/mozilla-central/source/gfx/gl/GLReadTexImageHelper.cpp#719 which strongly implies the renderbuffer isn't being drawn entirely. Pixel format mismatches seem unlikely because the valid content is color-correct. Cache-coherence/write-flushing problems still look very very suspicious. (But I'm still going to double-check that layerscope is doing the right thing.)
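The sentinel reasoning above can be sketched roughly as follows. This is a minimal illustration, not Gecko's code: it assumes a tightly packed RGBA8888 readback buffer whose target was filled with opaque #ff00ff before the draw, and the names kSentinel and CountUndrawnPixels are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel: opaque magenta (#ff00ff) in R, G, B, A byte order.
constexpr std::uint8_t kSentinel[4] = {0xff, 0x00, 0xff, 0xff};

// Counts pixels in a tightly packed RGBA8888 buffer that still hold the
// sentinel colour, i.e. pixels that the draw call apparently never wrote.
std::size_t CountUndrawnPixels(const std::vector<std::uint8_t>& rgba) {
  assert(rgba.size() % 4 == 0);  // assumption: 4 bytes per pixel, no row padding
  std::size_t undrawn = 0;
  for (std::size_t i = 0; i + 3 < rgba.size(); i += 4) {
    if (rgba[i] == kSentinel[0] && rgba[i + 1] == kSentinel[1] &&
        rgba[i + 2] == kSentinel[2] && rgba[i + 3] == kSentinel[3]) {
      ++undrawn;
    }
  }
  return undrawn;
}
```

Comparing this count across several readbacks of the same frame would show whether the undrawn area is stable or fluctuating. Legitimately magenta content would be miscounted, which is acceptable only for a test page known not to contain that colour.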
I sprinkled in some QPU cache clear/disable and ARM cache flushes around the readback, but nothing really changed. However, with a glFlush/glFinish pair just before the readback, I see a change both in the readback renderbuffer and the screen fb. The renderbuffer has much more variance in "pink" unrendered area, and the screen fb has more non-black (rendered?) area, although it's still pretty corrupted. This might mean we're not waiting on the render jobs to finish hard enough? And, with GL debugging turned on, adding a glFlush just before the glFinish we make to check the error status causes what looks to be a reproducible crash in the GL impl. glFlush definitely seems to make interesting things happen.
> And, with GL debugging turned on, adding a glFlush just before the glFinish we make to check the error status causes what looks to be a reproducible crash in the GL impl. glFlush definitely seems to make interesting things happen.
I should add, just the glFinish() doesn't change the rendered buffers or trigger the crash (as reliably anyway), even though glFinish should be equivalent to glFlush/glFinish.
> This might mean we're not waiting on the render jobs to finish hard enough?
Verified that the PCS register says everything is done on the 3d unit after the render-complete interrupt fires. So it seems that glFlush() is doing (or starting to do) something that glFinish() ought to be doing too, but that something is corrupting the heap. The hwcomposer swap usage in GonkDisplayICS::SwapBuffers() also involves an explicit glFinish(). That probably explains why routing around that code path makes bug 1072145 less frequent.
I was able to verify that the glReadPixels() we do to copy from the renderbuffer to the image surface sent back to the layerscope UI is doing the right thing. It ends up on a mind-numbingly slow fallback copy path. The renderbuffer looks like this: [image] The purple parts with the white streaks aren't corrupted; rather they're a special tiled image format used for the GPU. However, the big white blocks *are* the problem; they should have similar content. So next step is to see whether the source texture itself is missing content, or it's the texture->renderbuffer draw call that's losing the content.
Big lead: stepping through more GL impl guts showed code that looked like it maybe should have been used but was #ifdef'd out (or was correctly not used, wasn't sure). Tracing back through the chain of build flags proved very interesting: AFAICT the brcm_usrlib/dag code drop was configured (on android build flavors) for the bcm21553 chip (Athena?), which seems to have a newer/different VideoCore chip than the RPI's 2853. Referencing other users of the brcm_usrlib/dag code was ... interesting, but another story. (If I'm right about the other story, we may have another way to weasel out of these bugs.) I tried configuring the build flags for 2853, but b2g gets into a GL_OUT_OF_MEMORY error loop in the resulting build. I think the problem is that numerous parts of the code assume that Athena == Android, so there may be a fair number of fixups needed. Too many for tonight :).
> Tracing back through the chain of build flags proved very interesting: AFAICT the brcm_usrlib/dag code drop was configured (on android build flavors) for the bcm21553 chip (Athena?), which seems to have a newer/different VideoCore chip than the RPI's 2853
The 2853 definitely doesn't seem to be "Athena". It's unclear whether it's the "BCM2708A0" device. Adjusting the build flags to be "not-Athena" but also "not-BCM2708A0" seems to give the same results as default flags. Making the build "not-Athena" and "is-BCM2708A0" results in several unrelated problems, including two deadlocks in early startup in the GL code. This may be worth debugging, but for now I've put it back aside; see below.
> Referencing other users of the brcm_usrlib/dag code was ... interesting
Actually, it turns out that the Raspbian gfx-challenge code seems to be building as "not-Athena" and "not-BCM2708A0" (but also of course "not-ANDROID"). So my build opts and that code's are approximately the same now.
> we may have another way to weasel out of these bugs
The code drop seems to include all the code needed to build an RPC-to-VideoCore GL backend, going through vcihq. I took a crack at compiling this, but it's very far from building, and the RPC frontend in the code drop is highly unlikely to speak the same protocol version as the latest VC firmware. Another option, but not a promising one.
Going back to stepping through the GL bowels turned up another useful nugget: in GrallocTextureSourceOGL::Lock(), where we bind the gralloc segment's EGLImage to a texture, the EGLImage appears to have the correct contents (albeit in a special tiled rgb565 format): [image] Note that there aren't empty blocks like in the capture of the framebuffer contents above. So the incorrect content shown by layerscope almost certainly is coming from the draw call to render texture -> framebuffer.
It's possible there's an intermediate texture allocated by the backend for an rgb565->rgbx8888 conversion, and if so it's possible the corruption happens there. Some kind of rgb565->rgbx8888 flub looks most probable.
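For reference, a per-pixel rgb565 -> rgbx8888 expansion usually looks something like the sketch below. This is only an illustration of the standard bit-replication technique; whether the VideoCore backend converts this way (or at all) is not known, and the function name and the 0xRRGGBBXX output layout are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Expands one RGB565 pixel to RGBX8888, returned as 0xRRGGBBXX with the
// unused X byte set to 0xff. The output layout is an assumption.
std::uint32_t Rgb565ToRgbx8888(std::uint16_t p) {
  const std::uint32_t r5 = (p >> 11) & 0x1f;
  const std::uint32_t g6 = (p >> 5) & 0x3f;
  const std::uint32_t b5 = p & 0x1f;
  // Replicate the high bits into the low bits so that full intensity in
  // 5 or 6 bits maps to exactly 0xff rather than 0xf8 or 0xfc.
  const std::uint32_t r8 = (r5 << 3) | (r5 >> 2);
  const std::uint32_t g8 = (g6 << 2) | (g6 >> 4);
  const std::uint32_t b8 = (b5 << 3) | (b5 >> 2);
  assert(r8 <= 0xff && g8 <= 0xff && b8 <= 0xff);  // sanity check
  return (r8 << 24) | (g8 << 16) | (b8 << 8) | 0xffu;
}
```

A mistake in a step like this would typically show up as wrong colors everywhere rather than as missing blocks, which is worth keeping in mind when weighing the conversion theory against the captures.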
I should add that I tried forcing gecko to only allocate rgbx8888 surfaces, avoiding rgb565, to see if this made the fb-draw issue go away. But for some reason, bug 1072145 has started biting hard enough that I can't get layerscope data :/.
> I should add that I tried forcing gecko to only allocate rgbx8888 surfaces, avoiding rgb565, to see if this made the fb-draw issue go away. But for some reason, bug 1072145 has started biting hard enough that I can't get layerscope data :/.
OK, the crashes randomly went away again and I was able to test this. Like forcing a 32-bit framebuffer at the driver level (which should have done essentially the same thing), this didn't make a difference in the content in the on-screen fb or the gralloc textures.
(Updating title to reflect that gralloc surfaces and EGLImages created from them seem to have correct content; it's the fbo's that are incorrect.)
Summary: Content in gralloc buffers seems partially wrong → glDrawArrays() call on fbo leaves un-drawn gaps (and may draw past end-of-buffer)
WONTFIX'ing because I couldn't resolve the issues in this stack, and there's no long-term support plan for the code.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX