Closed Bug 784244 Opened 7 years ago Closed 7 years ago

Intermittent "snow" artifacts in content

Categories: Core :: Graphics: Layers (defect)
Platform: ARM / Gonk (Firefox OS)
Status: VERIFIED FIXED in mozilla18
blocking-basecamp: +
People: Reporter: cjones, Assigned: cjones

Whiteboard: [WebAPI:P0][LOE:S]

Attachments: 2 files, 3 obsolete files

This happens intermittently on different content.  I suspect it's related to intermittent errors I see about locking gralloc buffers.
While working on bug 778261, I happened to notice that we still have genlock disabled even though our kernels should support it now.  That's likely the cause of these artifacts.  Testing now.
I still see these snow artifacts after reverting the genlock disable patch.  (And maybe a new type of color artifact.)  I'm not sure I backed out everything I need to, but will resume tomorrow.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #2)
> I still see these snow artifacts after reverting the genlock disable patch. 
> (And maybe a new type of color artifact.)  I'm not sure I backed out
> everything I need to, but will resume tomorrow.

Shouldn't matter - our copy of genlock comes from the vendor.
The patch (which I haven't read) is to the qcom gralloc HAL code, so I assume it disables attempting to talk to the driver.
This is reproducible on the lock screen. Dragging the handle and releasing it always produces a ghost image in the opposite direction.
blocking-basecamp: --- → ?
I think that's a separate bug.
blocking-basecamp: ? → +
Whiteboard: [WebAPI:P0]
Assignee: nobody → jones.chris.g
Whiteboard: [WebAPI:P0] → [WebAPI:P0][LOE:S]
This seems to have gotten worse lately.  Not sure why.  May need to bump priority.

This is especially obvious in the map view of the maps app.  It also seems especially bad in the "Scanning ..." UI in the gallery.
Hm, bug 780330 (which was an innocuous and correct optimization) may have magnified the underlying problem here.
Tried a new Nexus S build, and I do see snow there.  The snow isn't correlated with any error messages in logcat.

So the finger is pointed at gecko, but based on the kind of artifacts we see, it sure looks like a cache coherency problem somewhere ...
Adding code to gecko that does an (apparently non-standard; following example in gralloc.so) cacheflush() on gralloc buffers before unlocking them doesn't affect the snow.

AutoOpenSurface::~AutoOpenSurface()
{
  if (mSurface) {
    gfxImageSurface* img = static_cast<gfxImageSurface*>(mSurface.get());
    cacheflush(intptr_t(img->Data()),
               intptr_t(img->Data()) + img->GetDataSize(),
               0/*?*/);
  }
}

So the problem doesn't seem to be CPU-cache<-->RAM coherency, unless I'm doing something wrong here.  That was the most likely hypothesis, so this result is a bit worrisome.

I see there's a cache-flushing pmem IOCTL too, which is going to be quite a pain to test.  But I assume that would eliminate the hypothesis of RAM<-->GPU-cache coherency problems.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #7)
 
> This is especially obvious in the map view of the maps app.  It also seems
> especially bad in the "Scanning ..." UI in the gallery.

And the calculator app
IIRC, we should make sure we know exactly which caches a given function is flushing. On the ARM, L2 cache is normally virtually tagged and L1 is normally physically tagged. So often times, the cache flushing code only does L2 flushes. This is sufficient to ensure coherency between processes, but not sufficient to ensure coherency between HW.
Alright - shoot me. I got L1 and L2 exactly reversed in my previous comment (I'm going to blame it on trying to write something technical before my first coffee)
Good point, need to check that --- cacheflush() is a red-headed stepchild in the kernel API, and bionic makes it worse.
If I did not get completely lost in the kernel sources, the handling of the cacheflush syscall starts at [1] and the respective assembly code for flushing the caches on ARMv7 is located at [2]. I haven't yet decoded what the assembly does.

[1] http://lxr.free-electrons.com/source/arch/arm/kernel/traps.c?v=3.0#L501
[2] http://lxr.free-electrons.com/source/arch/arm/mm/cache-v7.S#L166
So yeah, those functions only flush the L1 cache.

The outer_xxx functions (http://lxr.linux.no/#linux+v3.0.8/arch/arm/include/asm/outercache.h) are the ones which deal with the L2 cache. They're set up here:

http://lxr.linux.no/#linux+v3.0.8/arch/arm/mm/cache-l2x0.c#L341

I see some references over here:
http://lxr.linux.no/#linux+v3.0.8/drivers/staging/msm/memory.c#L96

but I'm not sure if any of that code is in our kernel.
OK, good to know.  I'll try an experiment with a manual cache flusher.  (That'll affect timing, but at least would be evidence.)
Marked as P1 given it's a blocker.
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P1][LOE:S]
Chris,

This was WebAPI:P0 because it's not a WebAPI bug.  The WebAPI team prioritized only its own bugs, because we didn't want to impose on anyone else's priorities.

(FWIW, most to all of the bugs that we prioritized are blockers, so the fact that a WebAPI bug is a blocker doesn't make it an automatic WebAPI:P1.)

Like I'd said in another bug, if you want to have a separate, parallel set of priorities, you're welcome to create your own whiteboard tag.  Or alternatively, I doubt anyone would mind if you used Bugzilla's built-in priorities for your own prioritization scheme.
Whiteboard: [WebAPI:P1][LOE:S] → [WebAPI:P0][LOE:S]
Attached image Screen capture
The "unit" of the snow is a 16-pixel strip.  That means the strips represent either 32 or 64 bytes being read by some processor.

The ARMv7-A cache line size is 64 bytes, so this is fairly suspicious.
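The arithmetic behind that inference can be checked directly. This is a sketch; the pixel formats are assumptions (a 32-bit format like ARGB for most content, and the 16-bit r5g6b5 format mentioned for gmaps later in the thread):

```cpp
#include <cstddef>

// Bytes occupied by a horizontal strip of `pixels` pixels at
// `bytesPerPixel` bytes per pixel. A 16-pixel strip is 64 bytes in a
// 32-bit format (e.g. ARGB) and 32 bytes in a 16-bit format
// (e.g. r5g6b5), exactly the candidate cache-line sizes under
// discussion here.
constexpr std::size_t StripBytes(std::size_t pixels,
                                 std::size_t bytesPerPixel) {
  return pixels * bytesPerPixel;
}
```

So a 16-pixel strip is consistent with one cache line in either format, which is why the strip width alone can't yet distinguish 32-byte from 64-byte lines.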
It's also possible that the GPU is finishing a composition with a buffer that's (supposed to be) read-locked, the renderer write-locks it for repaint and starts scribbling on it, but somehow the synchronization isn't working as it's supposed to.
If you slowly pan left/right in the ocean in gmaps, there are also patterns that seem correlated to the gmaps tile size.  Might be a clue.
Attached patch Some experiments (obsolete) — Splinter Review
I'm about 90% sure this is a L2 coherency issue now.

The pipeline here basically looks like

  content       compositor
 ---------     ------------
  begin txn
  draw
  (A)
  forward txn 
                recv txn
                schedule composite
                ...
  begin txn     composite
  (B)
  draw
  [etc.]        [etc.]

I tried 4 experiments
 1. put a poor man's cacheflush (scribble 4MB RAM) at (A)
    -> reliably did away with snow
 2. poor man's cacheflush at (B)
    -> didn't fix the snow; ameliorated but didn't remove in gmaps
 3. usleep(50000) at (A)
    -> not much change
 4. usleep(50000) at (B)
    -> similar result as (2); less amelioration

If synchronization were the problem, I would have expected (2) or (4) to mostly make this go away.  Didn't happen.

Looks like the pixels aren't making it to RAM before the GPU starts reading them.
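For illustration, the "poor man's cacheflush" from experiment 1 might have looked something like the sketch below; this is a hypothetical reconstruction, not the actual patch. The idea is that walking a scratch buffer much larger than L2 (256 KB on this SoC, per the dmesg above) evicts previously cached lines, forcing dirty pixel data out to RAM as a side effect:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a "poor man's cacheflush": dirtying one byte
// per 32-byte cache line across a 4 MB scratch buffer evicts whatever
// was previously cached. Returns a checksum so the loop can't be
// optimized away.
std::size_t PoorMansCacheFlush() {
  static std::vector<uint8_t> scratch(4 * 1024 * 1024);
  std::size_t sum = 0;
  for (std::size_t i = 0; i < scratch.size(); i += 32) {  // one touch per line
    scratch[i]++;  // write, so the line is dirtied and something else evicts
    sum += scratch[i];
  }
  return sum;
}
```

Note this perturbs timing as well as cache state, which is why experiments 3 and 4 (pure usleep) were needed as controls.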
(In reply to Chris Jones [:cjones] [:warhammer] from comment #20)
> Created attachment 661085 [details]
> Screen capture
> 
> The "unit" of the snow is a 16-pixel strip.  That means they represent
> either 32 or 64 bytes being read by some processor.
> 
> The arm v7a cache line size is 64 bytes, so this is fairly suspicious.

So, it actually seems to depend on the processor. The Cortex-A9 has 32-byte cache lines. It looks like the Cortex-A8 has 64-byte cache lines, and I seem to recall that the otoro is an A8, so that would make your statement correct for L1 cache.

The PL310 only supports 32-byte cache lines.
http://lxr.linux.no/#linux+v3.5.3/arch/arm/mm/cache-l2x0.c#L29
(and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246h/Beifajbj.html 0b00 is the only value listed for cache line length)
Looking through the vendor source, we should be ending up here [1]

  ioctl(fd, PMEM_CLEAN_INV_CACHES, &pmem_addr)

after a lock(READ|WRITE).  Now to see what that's supposed to do ...

[1] https://www.codeaurora.org/gitweb/quic/la/?p=platform/hardware/qcom/display.git;a=blob;f=libgralloc/pmemalloc.cpp;h=01628ecfa2d5ce062a284b07d4a3c2adfb98873c;hb=refs/heads/ics_chocolate#l100
So that bounces into either arch/arm/mach-msm/memory.c or drivers/staging/memory.c.  I assume the former.  As far as I can tell, it's doing the right thing.  (I double checked and our kernel is configured with CONFIG_OUTER_CACHE=y.)

This one's getting scary.  Will check the assumption of comment 25.
I instrumented gralloc.so and there are no errors from genlocking/genunlocking or the pmem flush.  So userspace (on the renderer side) is doing all the right things.

Afraid we might need to ask for help on this one.
I doubt it matters much, but I noticed while testing that the artifacts in settings are 1/2 the length of those in gmaps.  So that means gmaps is using an r5g6b5 surface and the strip is 32 bytes.
So then the question becomes which virtual address is being used? A virtual address is only valid within exactly one process.

Is the virtual address being allocated, used, cache-flushed all from the same process?
Our /proc/config.gz says CONFIG_ARCH_MSM_CORTEX_A5
ARM says that the Cortex A5 has 32-byte cache lines:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0434b/BABCFDAH.html
It also bothers me that dmesg shows:

<6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled
<6>[01-01 00:00:00.000] [0: swapper]L310 cache controller enabled
<6>[01-01 00:00:00.000] [0: swapper]l2x0: 8 ways, CACHE_ID 0x410054c7, AUX_CTRL 0x00540000, Cache size: 262144 B

So why does "L310 cache controller enabled" show up twice, while the next line shows up only once?
Are folks seeing this in GB builds too?
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]
Is this really LOE:S?
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:?][caf:help]
chris: I sent my GB otoro back to MV, so I can't check for you.
Yes.
Whiteboard: [WebAPI:P0][LOE:?][caf:help] → [WebAPI:P0][LOE:S][caf:help]
There's one more hypothesis we should eliminate on the gecko side.  I assumed it wasn't possible that we were failing to unlock the buffers, because we previously got errors on attempted reflexive write locks.  But I realized there's a crazy way we might be skipping unlock ...
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]
I directly tried patching the crazy thing, and then wrote a small patch and python script to balance lock/unlock.  No effect from former, and all good on the latter.  Gecko is indeed doing what it's supposed to be doing.
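The balance check described here amounts to verifying that every lock has a matching unlock over a trace of events. A minimal, hypothetical version of that bookkeeping (the real check was a gecko logging patch plus a python script over the output; these names are invented for illustration):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lock/unlock balancer: feed it a trace of
// ("lock", buffer) / ("unlock", buffer) events and it reports which
// buffers ended the trace with a nonzero lock depth. This is just the
// invariant being tested, not the actual tooling from this bug.
std::vector<std::string> UnbalancedBuffers(
    const std::vector<std::pair<std::string, std::string>>& events) {
  std::map<std::string, int> depth;
  for (const auto& e : events) {
    depth[e.second] += (e.first == "lock") ? 1 : -1;
  }
  std::vector<std::string> bad;
  for (const auto& d : depth) {
    if (d.second != 0) bad.push_back(d.first);
  }
  return bad;
}
```

An empty result means every buffer's locks and unlocks paired up, which is the "all good on the latter" outcome reported above.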
Whiteboard: [WebAPI:P0][LOE:S] → [WebAPI:P0][LOE:S][caf:help]
After constructing a trivial app that demonstrates the issue, I found that while I see "snow" when it runs OOP, there is no snow when the app is blacklisted from OOP.  A hint?  Next I will look into the call-flow differences between these two cases from the POV of the bits beneath gecko.
That said, globally disabling OOP in settings is not a panacea, so maybe c38 will get us nowhere...
I've been seeing snow in in-process content (like settings before this week), the browser load indicator, popup windows ...
Let me try again, with English enabled.

It's quite easy to reproduce snow in in-process content.  For example,
 - hold down power button
 - tap-down on "Power Off" in popup menu, hold finger down
 - slide finger off "Power Off"
Not that we didn't know this already, but it's safe to say that this is 100% an ARM-->RAM caching issue, as the snow is completely gone if |pmem[id].cached = 0| is forced in the kernel pmem driver.

Ref: https://www.codeaurora.org/gitweb/quic/la/?p=kernel/msm.git;a=blob;f=drivers/misc/pmem.c;h=2f48d716d6a77286168793c378a5e635a76284a8;hb=ics_strawberry#l2644
Ball's back in our court --- it's indeed a cache coherency issue, but it's caused by us writing to the buffer after unlocking for write.

Which is ... disturbing.
Whiteboard: [WebAPI:P0][LOE:S][caf:help] → [WebAPI:P0][LOE:S]
Would love to see this upstreamed in some form.
Attachment #661102 - Attachment is obsolete: true
HAH!  You guys are going to love this one.
So are you just going to leave us hanging...
I'm a little embarrassed because I thought it was something cute, but my first guess was wrong.  It's actually just a run-of-the-mill subtle logic bug.
I think that debugging patch has a bug; it was still kicking in after this fix.

Anyway, it's a hot day in hell, snow.  Not my finest hour but a lot of good came out of this.
Attachment #663544 - Attachment is obsolete: true
Comment on attachment 663586 [details] [diff] [review]
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces

There's a bug here that I fixed locally.

The goal now is figure out why bug 792663 bounced on inbound.
Attachment #663586 - Attachment is obsolete: true
To resummarize the problem here: we must always use the buffer tracker to unmap buffers used by BasicThebesLayer, because the buffers can be mapped from several different locations.  What was happening here is that we were using an AutoOpenSurface to do the front->back sync that *wasn't* tracked by the buffer tracker.  The sequence of events goes something like this

 - transaction 1
   * thebes layer T has no buffers
   * enter BasicShadowableThebesLayer::Paint
     . put BufferTracker on the stack
     . ThebesLayerBuffer::BeginPaint() causes buffer B1 to be created; BufferTracker B1->lock()s it
     . paint to B1
     . exit Paint(); BufferTracker goes out of scope, B1->unlock()
   * send B1 to the compositor, get it back as the read-only front buffer

 - transaction 2
   * thebes layer T has no back buffer, and readOnlyFront = B1
   * enter BasicShadowableThebesLayer::Paint
     . put BufferTracker on the stack
     . enter SyncFrontBufferToBackBuffer
       o no back buffer, so allocate back buffer = B2
       o put AutoOpenSurface(B2) on the stack, B2->lock()
       o *** SET ThebesLayerBuffer.mBuffer = B2 IN PREP FOR COPY ***
       o copy front->back
       o exit Sync(); AutoOpenSurface(B2) goes out of scope, *** B2->unlock() ***
     . ThebesLayerBuffer::BeginPaint() sees it already has mBuffer; doesn't ask mBufferProvider to map it
     . paint to B2 *** outside the lock ***
     . exit Paint(); BufferTracker goes out of scope; doesn't do anything because it didn't see B2 being created

Whups.
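The failure mode in that walkthrough reduces to a small RAII sketch. The names below are simplified stand-ins, not the real gecko classes: the inner scoped lock releases B2 when the sync step returns, and because the buffer pointer is already set, nothing relocks it, so the paint happens at lock depth zero.

```cpp
// Simplified stand-ins for the real gecko classes (hypothetical names).
struct Buffer {
  int lockDepth = 0;
};

struct ScopedLock {  // plays the role of AutoOpenSurface
  Buffer& b;
  explicit ScopedLock(Buffer& buf) : b(buf) { ++b.lockDepth; }
  ~ScopedLock() { --b.lockDepth; }
};

// Reproduces the sequence from the walkthrough: the sync step locks B2
// only for the front->back copy; then "BeginPaint" sees the buffer
// already set and never relocks it, so painting observes lockDepth 0.
int LockDepthDuringPaint(Buffer& b2) {
  {
    ScopedLock sync(b2);  // SyncFrontBufferToBackBuffer's AutoOpenSurface
    // ... copy front->back while locked ...
  }  // *** B2 unlocked here, before the paint ***
  // BeginPaint: buffer already set, so no new lock is taken.
  return b2.lockDepth;  // depth observed while "painting"
}
```

Routing every map/unmap through one tracker, as the fix title says, makes this class of mismatch structurally impossible rather than something each call site must get right.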
Comment on attachment 663613 [details] [diff] [review]
We really, seriously, can only let the BufferTracker take care of our required AutoOpenSurfaces

Nick, you probably understand this code as well as anyone now :).  I want to get this landed asap, so will try to find someone to steal while you enjoy your weekend ;).
Attachment #663613 - Flags: review?(ncameron)
I don't like to nag about review requests, but this bug is probably the most-complained-about issue with b2g at the moment, so everyone really wants to get it off our plates.
Attachment #663613 - Flags: review?(ncameron) → review?(roc)
https://hg.mozilla.org/mozilla-central/rev/beed670216f6
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Keywords: verifyme
QA Contact: jsmith
Verified on 9/30 Otoro build.
Status: RESOLVED → VERIFIED
Keywords: verifyme
(In reply to Chris Jones [:cjones] [:warhammer] from comment #52)
> I don't like to nag about review requests, but this bug is probably the
> most-complained-about issue with b2g at the moment, so everyone really wants
> to get it off our plates.

Sorry Chris, I was on vacation and away from the internets last week.

Thanks for the review roc!
(In reply to Nick Cameron [:nrc] from comment #56)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #52)
> > I don't like to nag about review requests, but this bug is probably the
> > most-complained-about issue with b2g at the moment, so everyone really wants
> > to get it off our plates.
> 
> Sorry Chris, I was on vacation and away from the internets last week.

No worries at all!  roc told me you were on PTO right after I wrote this comment.
Duplicate of this bug: 796532