Closed Bug 984618 Opened 6 years ago Closed 6 years ago

[B2G] [Marketplace] [AccuWeather] Rendering errors while navigating around the site, especially on long pages

Categories

(Core :: Graphics, defect)

30 Branch
ARM
Gonk (Firefox OS)
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla31
blocking-b2g 1.4+
Tracking Status
firefox29 --- unaffected
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.3 --- unaffected
b2g-v1.4 --- verified
b2g-v2.0 --- verified

People

(Reporter: ckreinbring, Assigned: nical)

References

Details

(Keywords: regression, Whiteboard: dogfood1.4)

Attachments

(7 files, 6 obsolete files)

5.50 KB, text/plain
Details
29.86 KB, image/jpeg
Details
5.68 KB, text/plain
Details
895 bytes, patch
cwiiis
: review+
Details | Diff | Splinter Review
15.80 KB, patch
bas.schouten
: review-
Details | Diff | Splinter Review
692 bytes, patch
bas.schouten
: review+
Details | Diff | Splinter Review
38.87 KB, patch
Details | Diff | Splinter Review
Description:
Sections of AccuWeather pages show as either blank white or as other sections of the already-loaded page.  Longer pages such as world weather and extended forecasts have multiple blank spots.

Repro Steps:
1) Update Buri to Build ID: 20140317040204
2) Download the AccuWeather app from the Marketplace.
3) Return to the homescreen and launch AccuWeather.
4) Tap World Weather, or tap the search textbox and enter a city name that will generate a "choose city" page.
5) Observe the page as it loads.

Actual:
When the page loads, parts of it will either fail to load (blank white section) or different parts of the page load on the sections.

Expected:
The entire page loads with no rendering issues.

Environmental Variables
Device: Buri 1.4 mozilla RIL
Build ID: 20140317040204
Gecko: https://hg.mozilla.org/mozilla-central/rev/25cfa01ba054
Gaia: 8f802237927c7d5e024fb7dca054dd5efef6b2e6
Platform Version: 30.0a1
Firmware Version: V1.2-device.cfg

Notes:
Repro frequency: 70%
See attached screenshot and logcat logs
Does not occur on the Everything.me version of the site.
Most likely to occus after a fresh flash and first download.
Does not repro on 1.3 mozilla RIL.

Build ID: 20140314004002
Gecko: https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/0479f5480378
Gaia: 932789749406a79c5630c27c724f1856bd3c3f10
Platform Version: 28.0
Firmware Version: v1.2-device.cfg
blocking-b2g: --- → 1.4?
Component: General → Graphics
Product: Firefox OS → Core
Version: unspecified → 30 Branch
Might be the same as bug 984531.
Does this reproduce with tiling disabled?
Bas, this looks like weird things on tiling boundary.
Assignee: nobody → bas
blocking-b2g: 1.4? → 1.4+
Assigning to nical for diagnosis (and possible fixing)
Assignee: bas → nical.bugzilla
Status: NEW → ASSIGNED
(In reply to Jason Smith [:jsmith] from comment #3)
> Might be the same as bug 984531.

Indeed, same genlock failures in the logcat.
Can you try switching the pref 'layers.overzealous-gralloc-unlocking' to true and seeing if you can reproduce?
Flags: needinfo?(ckreinbring)
It does not reproduce with tiling disabled in 1.4
Keywords: qawanted
With 'layers.overzealous-gralloc-unlocking' to true, (I pulled the prefs.js, added user_pref("layers.overzealous-gralloc-unlocking", true); line, and pushed it back using modPref.js from https://gist.github.com/edmoz/5596162)

After the reboot (with tiling enabled), I do see a lot of checkerboarding when drawing the screen, but I do not see the same type of rendering issue i saw before.  in my case, portions of the screen were black where there should be graphics.

Gaia      c03a6af9028c4b74a84b5a98085bbb0c07261175
Gecko     https://hg.mozilla.org/mozilla-central/rev/082761b7bc54
BuildID   20140318160201
Version   31.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013
Flags: needinfo?(ckreinbring)
Blocks: b2g-tiling
(In reply to npark from comment #10)
> With 'layers.overzealous-gralloc-unlocking' to true, (I pulled the prefs.js,
> added user_pref("layers.overzealous-gralloc-unlocking", true); line, and
> pushed it back using modPref.js from https://gist.github.com/edmoz/5596162)
> 
> After the reboot (with tiling enabled), I do see a lot of checkerboarding
> when drawing the screen, but I do not see the same type of rendering issue i
> saw before.  in my case, portions of the screen were black where there
> should be graphics.
> 
> Gaia      c03a6af9028c4b74a84b5a98085bbb0c07261175
> Gecko     https://hg.mozilla.org/mozilla-central/rev/082761b7bc54
> BuildID   20140318160201
> Version   31.0a1
> ro.build.version.incremental=eng.tclxa.20131223.163538
> ro.build.date=Mon Dec 23 16:36:04 CST 2013

So it looks like this has to do wit the overzealous unlocking stuff again :s. That's tricky as I'm not sure what our options here are, what do you think Chris?
(In reply to Bas Schouten (:bas.schouten) from comment #11)
> (In reply to npark from comment #10)
> > With 'layers.overzealous-gralloc-unlocking' to true, (I pulled the prefs.js,
> > added user_pref("layers.overzealous-gralloc-unlocking", true); line, and
> > pushed it back using modPref.js from https://gist.github.com/edmoz/5596162)
> > 
> > After the reboot (with tiling enabled), I do see a lot of checkerboarding
> > when drawing the screen, but I do not see the same type of rendering issue i
> > saw before.  in my case, portions of the screen were black where there
> > should be graphics.
> > 
> > Gaia      c03a6af9028c4b74a84b5a98085bbb0c07261175
> > Gecko     https://hg.mozilla.org/mozilla-central/rev/082761b7bc54
> > BuildID   20140318160201
> > Version   31.0a1
> > ro.build.version.incremental=eng.tclxa.20131223.163538
> > ro.build.date=Mon Dec 23 16:36:04 CST 2013
> 
> So it looks like this has to do wit the overzealous unlocking stuff again
> :s. That's tricky as I'm not sure what our options here are, what do you
> think Chris?

Fixing this properly is tricky... If we could have some kind of message to know when the GrallocTexture has no more users on the host-side, we could use that to stop locks succeeding on the client-side (when there are still users left). Then I suppose we'd need to introduce some 'retry' code for when opening the lock fails client-side to just get a different texture (or wait, I suppose, though that's a bit dangerous).

nical, what do you think of the above?
Flags: needinfo?(nical.bugzilla)
(In reply to Chris Lord [:cwiiis] from comment #12)
> Fixing this properly is tricky... If we could have some kind of message to
> know when the GrallocTexture has no more users on the host-side, we could
> use that to stop locks succeeding on the client-side (when there are still
> users left). Then I suppose we'd need to introduce some 'retry' code for
> when opening the lock fails client-side to just get a different texture (or
> wait, I suppose, though that's a bit dangerous).
> 
> nical, what do you think of the above?

Thinking about this some more, this is essentially what the read locks are doing, but they unlock immediately after the render of the flipped buffer. Ideally, they'd unlock on EndFrame though, as this is really when the buffers are unlocked.

I guess we could introduce a Flush method on the Compositor's TexturePool that could be called before unlocking, but that feels kind of nasty to me... That said, given that this is 'fixed' by overzealous unlocking, it kind of insinuates that this isn't the problem and that we're locking a texture for writing that's currently being composited.

Is there a situation other than the over-production case where this can happen? Although it seems unlikely, I'd really like to rule out that case. I'll see if I can reproduce this and work on a fix for that.
My hamachi is on a one-week-old-ish revision of mozilla-central and I can't reproduce the bug. I am currently looking at the last patches that may have caused the regression. In the mean time it would be great if someone could build a slightly older version of gecko that has tiling and confirm whether or not the bug was already there.
Flags: needinfo?(nical.bugzilla)
I can reproduce this and confirm that it's due to the genlock failures. I doubt very much this is overproduction, however - looking into it.
I can't reproduce this if I disable the texture client pool (by reporting all clients as lost in Discard(Front|Back)Buffer). This would hint that the problem may be re-use of a TextureClient that was returned before it was unlocked.
Further good news, I can't reproduce this at all if I disable ReturnTextureClientDeferred (and just report those clients as lost). Given that this is a known issue with this code, I'll just go ahead and see if I can get a fix going (I have one in mind).
Assignee: nical.bugzilla → chrislord.net
See Also: → 984673
See Also: 984673
Blocks: 960372
This isn't beautiful code and I'm not saying this is the best method - I'm pretty much expecting an r-... But it does fix the problem, and it is something we need to do some way or another.
Attachment #8394453 - Flags: review?(bas)
Comment on attachment 8394453 [details] [diff] [review]
Guarantee that locked TextureClients aren't reused

Review of attachment 8394453 [details] [diff] [review]:
-----------------------------------------------------------------

In general this is fine, and if it fixes the bug that's a good thing to have on hand, and we could even land it mostly in its current form to paper over the issue if we can't figure out why this bug is happening, there's 3 things we need though:

1) Figure out if FlushRendering works right (and therefor, if it will prevent overproduction)
2) If it does work right, figure out how this problem can even occur if we're not overproducing. (Because there's a reasonable chance this patch is papering over another bug in that case)
3) If there is no overproduction, but this patch does fix the bug, figure out if with the patch we're still properly re-using texture clients, or if it just fixes the bug by spending a long time in blocking gralloc allocations.

::: gfx/layers/client/TextureClientPool.cpp
@@ +138,5 @@
> +TextureClientPool::ReturnDeferredTextureClients(bool aDestroy)
> +{
> +  MOZ_ASSERT(aData.callback);
> +
> +  // Guard against re-entry

What could cause re-entry? That's scary.

::: gfx/layers/client/TiledContentClient.cpp
@@ +491,5 @@
>  TileClient::DiscardFrontBuffer()
>  {
>    if (mFrontBuffer) {
>      MOZ_ASSERT(mFrontLock);
> +    TextureClientPool* pool = mManager->GetTexturePool(mFrontBuffer->AsTextureClientDrawTarget()->GetFormat());

Filter this logic out into a separate function.
Here's a good log that only adds to the mystery sadly, notice how it doesn't seem like there's overproduction. And it sort of seems mostly business as usual. (Note the time between Wanting to start drawing, and starting to draw is the time it's making sure the compositor finishes drawing that I added with a little code).

https://gist.github.com/nhirata/9677272
A new log here gives some very interesting data:

https://gist.github.com/nhirata/9677890

Note the two different processes we're dealing with, 1229 is getting the failures, that particular process isn't overproducing at all! We're seeing the main/composition process do a lot of compositions before, and after 1229 has started its transaction. So at this point I'm still not sure what's going on.

Note that apparently the bug became a lot harder to produce with all my printf's in, so it might be very timing sensitive. Although it could've just been a coincidence.

03-20 18:47:00.689: I/Gecko(1229): XXX - Bas - Wanting to draw client side!
03-20 18:47:00.689: I/Gecko(1229): XXX - Bas - Starting to draw client side!
03-20 18:47:00.829: I/Gecko(136): XXX - Bas - Wanting to draw client side!
03-20 18:47:00.829: I/Gecko(136): XXX - Bas - Starting to draw client side!
03-20 18:47:00.829: I/Gecko(136): XXX - Bas - Finished validation!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Starting to draw host side!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Finished compositing host side!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Starting to draw host side!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Forwarded transaction!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Finished drawing client side!
03-20 18:47:00.839: I/Gecko(136): XXX - Bas - Finished compositing host side!
03-20 18:47:00.969: I/Gecko(136): XXX - Bas - Wanting to draw client side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Starting to draw client side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Finished validation!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Starting to draw host side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Finished compositing host side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Starting to draw host side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Forwarded transaction!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Finished drawing client side!
03-20 18:47:00.979: I/Gecko(136): XXX - Bas - Finished compositing host side!
03-20 18:47:01.369: I/Gecko(1283): ###################################### forms.js loaded
03-20 18:47:01.389: I/Gecko(1283): ############################### browserElementPanning.js loaded
03-20 18:47:01.409: I/Gecko(1283): ######################## BrowserElementChildPreload.js loaded
03-20 18:47:01.799: E/libgenlock(1229): perform_lock_unlock_operation: GENLOCK_IOC_DREADLOCK failed (lockType0x1, err=Connection timed out fd=61)
03-20 18:47:01.799: E/msm7627a.gralloc(1229): gralloc_lock: genlock_lock_buffer (lockType=0x2) failed
03-20 18:47:01.799: W/GraphicBufferMapper(1229): lock(...) failed -22 (Invalid argument)
03-20 18:47:02.799: E/libgenlock(1229): perform_lock_unlock_operation: GENLOCK_IOC_DREADLOCK failed (lockType0x1, err=Connection timed out fd=64)
03-20 18:47:02.799: E/msm7627a.gralloc(1229): gralloc_lock: genlock_lock_buffer (lockType=0x2) failed
03-20 18:47:02.799: W/GraphicBufferMapper(1229): lock(...) failed -22 (Invalid argument)
03-20 18:47:03.799: E/libgenlock(1229): perform_lock_unlock_operation: GENLOCK_IOC_DREADLOCK failed (lockType0x1, err=Connection timed out fd=43)
03-20 18:47:03.799: E/msm7627a.gralloc(1229): gralloc_lock: genlock_lock_buffer (lockType=0x2) failed
03-20 18:47:03.799: W/GraphicBufferMapper(1229): lock(...) failed -22 (Invalid argument)
03-20 18:47:04.129: E/QCALOG(191): [MessageQ] ProcessNewMessage: [XTWiFi-PE] unknown deliver target [OS-Agent]
03-20 18:47:04.129: E/QCALOG(191): [MessageQ] ProcessNewMessage: [XT-CS] unknown deliver target [OS-Agent]
03-20 18:47:04.149: E/QCALOG(191): [MessageQ] ProcessNewMessage: [XTWWAN-PE] unknown deliver target [OS-Agent]
03-20 18:47:04.469: D/wpa_supplicant(595): RX ctrl_iface - hexdump(len=11): 53 49 47 4e 41 4c 5f 50 4f 4c 4c
03-20 18:47:04.469: D/wpa_supplicant(595): nl80211: survey data missing!
03-20 18:47:04.799: E/libgenlock(1229): perform_lock_unlock_operation: GENLOCK_IOC_DREADLOCK failed (lockType0x1, err=Connection timed out fd=67)
03-20 18:47:04.799: E/msm7627a.gralloc(1229): gralloc_lock: genlock_lock_buffer (lockType=0x2) failed
03-20 18:47:04.799: W/GraphicBufferMapper(1229): lock(...) failed -22 (Invalid argument)
03-20 18:47:04.799: I/Gecko(1229): XXX - Bas - Finished validation!
These are probably stupid questions, but:

Do you often see the pre-allocated process launch prior to the genlock failures?  Or is that just because you captured just after app launch in these cases.  See the messages about loading form.js, etc.

I assume the pre-allocated app does not do anything with gfx resources, but just thought I would ask.

Also, what app was terminated in the second gist?  You can see the IPC delivery failures.  Just curious.
Flags: needinfo?(nhirata.bugzilla)
(In reply to Ben Kelly [:bkelly] (PTO Mar 21, back Mar 24) from comment #22)
> These are probably stupid questions, but:
> 
> Do you often see the pre-allocated process launch prior to the genlock
> failures?  Or is that just because you captured just after app launch in
> these cases.  See the messages about loading form.js, etc.
> 
> I assume the pre-allocated app does not do anything with gfx resources, but
> just thought I would ask.
> 
> Also, what app was terminated in the second gist?  You can see the IPC
> delivery failures.  Just curious.

I didn't make this trace, I just ordered the printf's, fwiw, I'm not sure what most of those questions mean :). I don't know what the 'pre-allocated' process is, if you mean 1283 in this case, it isn't in the log attached to the bug, fwiw? I think it's just because the bug repro's on app-launch generally.
Version of the patch with comments addressed.

With regards to debugging this problem, it seems to stem from the ReturnTextureClientDeferred call in DiscardFrontBuffer.

Looking at when this is called, it can happen when the format changes, when a write-lock fails and it will happen if we fail to add the texture client to the compositable client. In these three cases, I think the texture ought to be reported as lost, rather than returned to the pool (or alternatively, we should wait on the lock, then return it).

The other time it gets called is when we do a full tile update, which I think is fine.

My patch will have us wait on the lock if the cases I mention above ever occur, otherwise it should result in roughly the same behaviour as before - Although I agree that it's papering over the issue to some extent, I think this is an easier way of dealing with the issue and it guarantees that we never return a locked client to the pool.

For the record, I think we should *also* block on over-production on non-progressive updates, but my logging and yours too show that this is a very rare case on mobile (may be more of an issue on desktop) and so not a high priority right now (imo).
Attachment #8394453 - Attachment is obsolete: true
Attachment #8394453 - Flags: review?(bas)
Attachment #8394755 - Flags: review?(bas)
Blocks: 984577
Blocks: 984531
Blocks: 984482
Blocks: 985170
Blocks: 985779
Blocks: 983883
Blocks: 985162
Blocks: 986103
Naoki, do you think you could try to reproduce with the patch attached to this bug? I can't reproduce it, but Bas was saying that this can become very hard to reproduce when the timing is affected, so I'd like to know if it actually fixes it (which I think it does), or if it's just a red herring.

I'll use my patch to debug this issue on Monday (with it, I can track if locked textures are being returned to the pool and when/why) and see if I can figure out the underlying cause.
I will try as well, and CC-ing No-Jun in case he can take a look sooner.
Flags: needinfo?(npark)
It seems I've lost the ability to reproduce this bug on plain m-c. My previous steps that hit it instantly no longer hit it, so if anyone has decent STR, let me know...
(In reply to Chris Lord [:cwiiis] from comment #27)
> It seems I've lost the ability to reproduce this bug on plain m-c. My
> previous steps that hit it instantly no longer hit it, so if anyone has
> decent STR, let me know...

Do you try the STR in any of the blocking bugs?
I did some more logging around the issue and here is what I see:

when looking up 'genlock' in the log, you can look at the last TextureClient lock that went in the log _within the same process as the failure_ (client 3799)

one of these is the lock of TextureClient [id:13]
going back in time we can see that TextureHost [id:13] was locked and its TextureSource uses the gl texture [tex:196].

The next time EndFrame is called, [tex:196] is not in the "unused" part of the pool.
When EndFrame is called again after that, [tex:196] is deleted, but it happens *after* the genlock failure.

here are the relevant bits of the logcat in order:

I/Gecko   ( 3598):  -- GrallocTextureHostOGL(0x446dd280)::Lock [id:13]
I/Gecko   ( 3598):  -- GrallocTextureSourceOGL: using [tex:196]

I/Gecko   ( 3598):  -- PerFrameTexturePoolOGL::EndFrame

I/Gecko   ( 3799):  ** Pool: Return deferred TextureClient [id:13] (refcnt:1)

I/Gecko   ( 3799):  ** Pool: Return TextureClient [id:13] (refcnt:1)

I/Gecko   ( 3799):  ** Pool: GetTextureClient from pool [id:13] (refcnt:1)

I/Gecko   ( 3799):  ** GrallocTextureClientOGL::Lock [id:13]

E/libgenlock( 3799): perform_lock_unlock_operation: GENLOCK_IOC_DREADLOCK failed (lockType0x1, err=
Connection timed out fd=71)
E/msm7627a.gralloc( 3799): gralloc_lock: genlock_lock_buffer (lockType=0x2) failed
W/GraphicBufferMapper( 3799): lock(...) failed -22 (Invalid argument)

I/Gecko   ( 3598):  -- PerFrameTexturePoolOGL::EndFrame
I/Gecko   ( 3598):  -- EndFrame: delete gl texture [tex:196]
Attached patch logging patch (obsolete) — Splinter Review
Looking at the same log and rethinking about when the lagginess shows: we seem to be in places where we followed a link, and the client side is doing a lot of throwing around and recycling TextureClients, while the compositor side is doing close to nothing. This isn't your typical overproduction case but it looks kinda similar. Maybe we don't schedule compositions when just destroy layers (which means we don't do the pool's EndFrame?
This is similar to the first patch in intention, but slightly different in design. This patch changes nothing about our old semantics, it just guarantees that we don't return locked textures, and if ever that situation appears, it asserts (but will continue working).

I think this is desirable as it lets us track the issue (via the asserts and extra points to add logging), but also stops them from having quite so nasty side-effects (of course, if there's a leak, the app will eventually crash, but misuse of gralloc surfaces quite often tends to have worse side-effects than that (like the whole phone crashing)).
Attachment #8394755 - Attachment is obsolete: true
Attachment #8394755 - Flags: review?(bas)
Attachment #8395927 - Flags: review?(bas)
For the record, I don't think these lock failures come up as a result of inconsistency of information between client/host, but more likely a situation like what nical describes in comment #32.
(In reply to Chris Lord [:cwiiis] from comment #34)
> For the record, I don't think these lock failures come up as a result of
> inconsistency of information between client/host, but more likely a
> situation like what nical describes in comment #32.

Yeah, have been doing a lot of logging today and it always looks the same: we are throwing away layers and rebuilding new ones immediately, making us both put tiles in the pool and recyce them before we do an EndFrame on the compositor side to release the gralloc textures.
I haven't yet found out whether there is a better place for us to do the Compositor's gl texture pool EndFrame call, but I am pretty sure we need to add a call to the gl pool's EndFrame somewhere specifically for this case, or make sure we don't return the deferred textures in the client-side pool in this specific case (either of the two).
(In reply to Nicolas Silva [:nical] from comment #35)
> (In reply to Chris Lord [:cwiiis] from comment #34)
> > For the record, I don't think these lock failures come up as a result of
> > inconsistency of information between client/host, but more likely a
> > situation like what nical describes in comment #32.
> 
> Yeah, have been doing a lot of logging today and it always looks the same:
> we are throwing away layers and rebuilding new ones immediately, making us
> both put tiles in the pool and recyce them before we do an EndFrame on the
> compositor side to release the gralloc textures.
> I haven't yet found out whether there is a better place for us to do the
> Compositor's gl texture pool EndFrame call, but I am pretty sure we need to
> add a call to the gl pool's EndFrame somewhere specifically for this case,
> or make sure we don't return the deferred textures in the client-side pool
> in this specific case (either of the two).

In the logs the client side that we care about in this bug (i.e. The pool we're dealing with) is not throwing around transactions at all. And the client transactions we are seeing a lot of should be using a different pool. So I'm still not clear on why this would happen, could you describe (stepwise) a situation that could cause this, that is consistent with the logs we're seeing?
(In reply to Nicolas Silva [:nical] from comment #35)
> (In reply to Chris Lord [:cwiiis] from comment #34)
> > For the record, I don't think these lock failures come up as a result of
> > inconsistency of information between client/host, but more likely a
> > situation like what nical describes in comment #32.
> 
> Yeah, have been doing a lot of logging today and it always looks the same:
> we are throwing away layers and rebuilding new ones immediately, making us
> both put tiles in the pool and recyce them before we do an EndFrame on the
> compositor side to release the gralloc textures.
> I haven't yet found out whether there is a better place for us to do the
> Compositor's gl texture pool EndFrame call, but I am pretty sure we need to
> add a call to the gl pool's EndFrame somewhere specifically for this case,
> or make sure we don't return the deferred textures in the client-side pool
> in this specific case (either of the two).

Specifically, I can't see this explaining either the log, or the failure of FlushRendering to fix it.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(chrislord.net)
Attached file logcat.txt
With cwiis' v3 patch, I had to come up with new steps to reproduce the bug.

1. Launch Accuweather
2. change to landscape orientation on the phone
3. As soon as the page change the orientation to landscape pick one of the options ( World Weather, Africa... )
4. wait til load

I think you don't necessarily have to have memory pressured in order to reproduce it this way.
Flags: needinfo?(nhirata.bugzilla)
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #38)
> Created attachment 8396091 [details]
> logcat.txt
> 
> With cwiis' v3 patch, I had to come up with new steps to reproduce the bug.
> 
> 1. Launch Accuweather
> 2. change to landscape orientation on the phone
> 3. As soon as the page change the orientation to landscape pick one of the
> options ( World Weather, Africa... )
> 4. wait til load
> 
> I think you don't necessarily have to have memory pressured in order to
> reproduce it this way.

I'll try to confirm this myself - but this would make me change my mind entirely, there should be no possibility of reusing a locked client with this patch (unless I got it wrong, which I'll verify), in which case it would greatly support Bas's theory.
Flags: needinfo?(chrislord.net)
(In reply to Chris Lord [:cwiiis] from comment #39)
> (In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from
> comment #38)
> > Created attachment 8396091 [details]
> > logcat.txt
> > 
> > With cwiis' v3 patch, I had to come up with new steps to reproduce the bug.
> > 
> > 1. Launch Accuweather
> > 2. change to landscape orientation on the phone
> > 3. As soon as the page change the orientation to landscape pick one of the
> > options ( World Weather, Africa... )
> > 4. wait til load
> > 
> > I think you don't necessarily have to have memory pressured in order to
> > reproduce it this way.
> 
> I'll try to confirm this myself - but this would make me change my mind
> entirely, there should be no possibility of reusing a locked client with
> this patch (unless I got it wrong, which I'll verify), in which case it
> would greatly support Bas's theory.

Ugh, and I realise, I can't possibly reproduce this because the buri has no accelerometer/gyro :| I'll see if I can reproduce on a Keon...
and it appears my build doesn't have working motion events anyway...? (Keon that was previously fine now isn't doing auto-rotation or delivery devicemotion events :/)
So I've reflashed my Keon, devicemotion works correctly, but I don't get any screen rotation. No idea why, but it means I'm unable to reproduce these lock failures anymore :/
The following is all on a Keon.

So, replacing NS_NOTREACHED with a macro that just prints the error, I can get TextureClients that are locked to attempt to re-use (which the patch prevents, but warns about), but I have no reasonable STR for such (I managed it by browsing to planet.mozilla.org and panning quickly quite far down, then all the way back to the top).

I can get genlock failures to occur due to out of memory, I assume from a situation that causes too many tiles to be created and fds getting exhausted (the lock fails, rather than the allocate), but this is only when zooming in to extreme levels, and I assume caused by bug 957668.

I've spent a reasonable amount of time trying to reproduce this, I need more solid steps to reproduce that don't involve rotation. Also, perhaps the m-c rev may be useful.

Whether it fixes it or not, I think it's a reasonable idea to have a patch similar to this to at least assert if this situation gets hit.
Jeff and Benoit, let's continue on this today here, maybe with the QRD we have.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bgirard)
A nice way to help debug this issue, stick this at the top of TextureClientPool.cpp:

#define NS_NOTREACHED(s) printf_stderr("XXX: %s\n", s)

then,

adb logcat|grep -E 'XXX|gralloc'

The warnings you get would likely occur when you *would* have had genlock errors (you can confirm that by changing 'return TextureClientPool::DEFER' to 'return TextureClientPool::READY' in TiledContentClient.cpp.

Any other lock errors you get are suspect and should be investigated - if they are resolved by overzealous unlocking, the errors stem from using tiles, otherwise they stem from another gralloc user.
I can't reproduce the bug with this (embarrassingly) simple patch \o/
Attachment #8396450 - Flags: review?(chrislord.net)
Comment on attachment 8396450 [details] [diff] [review]
Also clear the compositor's unused textures when the frame was composited by hwcomposer

Review of attachment 8396450 [details] [diff] [review]:
-----------------------------------------------------------------

Nicely done.
Attachment #8396450 - Flags: review?(chrislord.net) → review+
Flags: needinfo?(bgirard)
For those that have been able to reproduce this bug, it would mean that turning off hardware composer in the developer prefs should make the problem go away, if you want to test before the patch shows up in the nightly.
Flags: needinfo?(jmuizelaar)
When hardware composer is turned off, the bug was no longer reproducible on below Gaia build. (and I was able to consistently reproduce this bug on this build)

Gaia      a2a88d0638594a6510f878d2c5e99a6ead7520ad
Gecko     https://hg.mozilla.org/releases/mozilla-aurora/rev/67bdb575d833
BuildID   20140325000201
Version   30.0a2
Flags: needinfo?(npark)
Comment on attachment 8395927 [details] [diff] [review]
Guarantee that locked TextureClients aren't reused v3

Review of attachment 8395927 [details] [diff] [review]:
-----------------------------------------------------------------

I've decided that with the actual cause found, I'm not so sure this patch is a good idea. Currently the gralloc lock is not timing out, but it seems to be hit occasionally, which is good since it will throttle our production based on our rate of consumption in the compositor. This patch would undo that and instead make us sometimes create new buffers, requiring a round-trip to the compositor and in the end a locker blocking period. That does not seem desirable.
Attachment #8395927 - Flags: review?(bas) → review-
I think nical's identification of the problem means we shouldn't hit this now. However, there are situations where we start writing to a locked tile before it's unlocked on the compositor side. When we hit these, it's usually a sign of some kind of over-production, so we'd rather just wait on the lock and what we do is fine.

There's a small chance that this could cause problems though (and a larger one on PVR I think), so this patch doesn't change any behaviour, but for debug builds, will output a warning when we attempt to reuse a locked tile. It also gives us a few more points where we can inject debugging code, which may come in handy later.
Attachment #8395927 - Attachment is obsolete: true
Attachment #8396660 - Flags: review?(bas)
(In reply to Nicolas Silva [:nical] from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/0c0cfce48311

I forgot to remind - when things land on b2g-inbound, we get the build cached per check-in, helps with further testing.  Either way, we have the fix :)
I tried just https://hg.mozilla.org/integration/mozilla-inbound/rev/0c0cfce48311 and the issue appears still.

I turned the hwc off and the issue no longer appeared using steps in both comment 38 and comment 0.

Gaia      c76ca4811cb80e7fe7082d2d3b97b189b4463b1d
Gecko     e6d096d7473abeda806cc1534b4e78f05ead0a6a
BuildID   20140325123144
Version   31.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013

Maybe I did something wrong when building...
That (you doing a bad build) may be too much to hope for, but let's see what it looks like in the nightly tomorrow.
https://hg.mozilla.org/mozilla-central/rev/0c0cfce48311
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
No-Jun - Can you verify this when you get into work tomorrow to see if this works correctly on trunk?
Flags: needinfo?(npark)
Keywords: verifyme
On latest Trunk (1.5.0), this is still reproducible when Accuweather is downloaded and executed.

Gaia      80af23f8c74d9d2e9388d8ed3c204040b5c528ec
Gecko     https://hg.mozilla.org/mozilla-central/rev/c69c55582faa
BuildID   20140326040202
Version   31.0a1
ro.build.version.incremental=eng.cltbld.20140306.073741
ro.build.date=Thu Mar  6 08:02:39 EST 2014
Flags: needinfo?(npark)
It looks like CompsoitorOGL::AbortFrame should also call the pool's EndFrame.
Assignee: chrislord.net → nical.bugzilla
Status: RESOLVED → REOPENED
Attachment #8397107 - Flags: review?(bas)
Resolution: FIXED → ---
Attachment #8397107 - Attachment is patch: true
Attachment #8397107 - Flags: review?(bas) → review+
(In reply to Nicolas Silva [:nical] from comment #60)
> Created attachment 8397107 [details] [diff] [review]
> Clear the compositor's unused textures when the frame was aborted
> 
> It looks like CompsoitorOGL::AbortFrame should also call the pool's EndFrame.

From where does CompsoitorOGL::AbortFrame() get called?
(In reply to npark from comment #59)
> On latest Trunk (1.5.0), this is still reproducible when Accuweather is
> downloaded and executed.
> 
> Gaia      80af23f8c74d9d2e9388d8ed3c204040b5c528ec
> Gecko     https://hg.mozilla.org/mozilla-central/rev/c69c55582faa
> BuildID   20140326040202
> Version   31.0a1
> ro.build.version.incremental=eng.cltbld.20140306.073741
> ro.build.date=Thu Mar  6 08:02:39 EST 2014

I haven't been able to reproduce it on the trunk with the local build.
Attached patch Logging patch v2Splinter Review
Attachment #8395830 - Attachment is obsolete: true
Attachment #8395832 - Attachment is obsolete: true
Attachment #8396388 - Attachment is obsolete: true
Keywords: verifyme
Tested the patch w/ Nical's build/patch seems to resolve the issue.

From irc convo
nical: so the build I gave nhirata_ contains the first EndFrame fix (the one that fixes it for me) + a second somewhat speculative fix that is still on inbound, but that I suspect is fixing dead code + the logging
[11:00am] nical: With this logging I could still easily reproduce the bug before the first fix, so it most likely means that the bug is fixed but was shadowing another bug which is harder to reproduce.
(In reply to Milan Sreckovic [:milan] from comment #63)
> I haven't been able to reproduce it on the trunk with the local build.

I spent today trying to reproduce as well and could not (optimized build with non logging). I am closing this bug because I am pretty certain that attachment 8396450 [details] [diff] [review] fixes the problem that we diagnosed here. If the symptoms show up again, it means it is a separate bug that we should followup on in a different bugzilla ticket.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Here's a weird thing.  I can reproduce this with the latest nightly build (which should have the first EndFrame fix only.)  I can't reproduce with my local build (which should have the first EndFrame fix only.) Nical's tracing seems to point to the change somehow not being in the nightly build as well. May have to check tomorrow.  In the meantime, Nical, can you uplift the (first) fix to Aurora when you have a chance?
Nical was right; It turns out looking at the nightly checkins the nightly build doesn't have the fix it seems.

Making my own build, I don't see an issue when in the app; I'm seeing issues when I place the app in the background.  We should check this again with tomorrow's build or a build from later today.
Confused via the some of the comments above. Have we confirmed that this is verified or not on trunk? There seems to be conflicting information on the testing above.
Flags: needinfo?(npark)
https://hg.mozilla.org/releases/mozilla-aurora/rev/bfcdcc4b9c83 is in today's 1.4 build:
Gaia      7d716de0c186416b5b123baa1f3242e23d50529b
Gecko     https://hg.mozilla.org/releases/mozilla-aurora/rev/69e896713b11
BuildID   20140327074806
Version   30.0a2
ro.build.version.incremental=324
ro.build.date=Thu Dec 19 14:04:55 CST 2013

We should be able to test there
Right - but let's first verify on trunk :)
This is no longer reproducible with / without HWC enabled on today's master branch:

│ Gaia      9da1b9c11bf518bce882be305ae121c44c5d1e05                         │
│ Gecko     https://hg.mozilla.org/mozilla-central/rev/9afe2a1145bd          │
│ BuildID   20140327040202                                                   │
│ Version   31.0a1
Status: RESOLVED → VERIFIED
Flags: needinfo?(npark)
No longer blocks: 983883
No longer blocks: 985162
Duplicate of this bug: 985162
Verified on Aurora with
Gaia      7d716de0c186416b5b123baa1f3242e23d50529b
Gecko     https://hg.mozilla.org/releases/mozilla-aurora/rev/69e896713b11
BuildID   20140327074806
Version   30.0a2
ro.build.version.incremental=324
ro.build.date=Thu Dec 19 14:04:55 CST 2013
Buri

Verified with Mako on : 
Gaia      9da1b9c11bf518bce882be305ae121c44c5d1e05
Gecko     https://hg.mozilla.org/mozilla-central/rev/9afe2a1145bd
BuildID   20140327040202
Version   31.0a1
ro.build.version.incremental=eng.cltbld.20140327.073722
ro.build.date=Thu Mar 27 07:37:33 EDT 2014
Mako

Note: 1.5 Buri has issues with today's build.
No longer blocks: 984531
Duplicate of this bug: 984531
No longer blocks: 984577
Duplicate of this bug: 984577
No longer blocks: 985170
Duplicate of this bug: 985170
No longer blocks: 986103
Duplicate of this bug: 986103
No longer blocks: 985779
Duplicate of this bug: 985779
Comment on attachment 8396660 [details] [diff] [review]
Warn when locked texture clients are reused (in debug builds)

Review of attachment 8396660 [details] [diff] [review]:
-----------------------------------------------------------------

I'm a little concerned about most of this code running in release builds but not actually doing anything useful :s.
Attachment #8396660 - Flags: review?(bas) → review-
No longer blocks: 984482
Duplicate of this bug: 984482
You need to log in before you can comment on or make changes to this bug.