Closed Bug 1416015 Opened 7 years ago Closed 7 years ago

Broken tiles and crashes with SurfaceTexture tiles enabled

Categories

(Core :: Graphics: Layers, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- wontfix
firefox59 --- fixed

People

(Reporter: esawin, Assigned: jnicol)

References

(Blocks 1 open bug, )

Details

Attachments

(4 files)

STR (Nexus 6P, Android 8.0, gfx.use-surfacetexture-textures enabled)
1. Open an article on zeit.de (reproduced on http://www.zeit.de/politik/ausland/2017-11/saudi-arabien-mohammed-bin-salman-massenverhaftungen)
2. Pinch-zoom in
3. Slowly pinch-zoom out until surrounding tiles stop updating [1] (remain white or grey)
4. (Repeat 2. and 3. until issue occurs)

At this point the browser tab is unresponsive, reloading won't work, neither will opening new tabs (remain white). Only restarting the browser fixes the state.

Sometimes the issue reproduces with a crash instead [2].

[1] https://photos.google.com/share/AF1QipNEW4QMmQiELMS3J3L-5vC-VEXpVJOjMZvsFlM2nk_fF9edm74oqvw252e33zmmkQ?key=MGktRmFTcUxqeHFtcThRU2k0NENhSkNEc2pmV0RR
[2] https://crash-stats.mozilla.com/report/index/00b1a079-9bd6-4f7c-b550-e32990171109
Attached file Crash
Attachment #8927048 - Attachment mime type: text/html → text/plain
Thanks for testing.

The crash is an oom, so ANativeWindow_lock fails, and we MOZ_CRASH.

We need to handle the failure properly. Also need to investigate whether we're doing something to cause the oom in the first place too.

Nothing in the log for the hang. Will dig deeper on monday. To confirm, can you reproduce this easily?
Assignee: nobody → jnicol
Yes, I can easily reproduce it when pinch-zooming. It also seems independent from the content, as I can reproduce it on most sites that allow zooming that I've tried.
Eugen do you get a bunch of failures in logcat related releaseTexImage()? If that goes south when we're destroying a tile, it could mean the memory doesn't get freed correctly. I remember there was an issue where we weren't destroying with the right GLContext current, but I thought maybe that was fixed? Something to look at at least.
Whoops I really meant detachFromGLContext(), which we call during destruction. And there are lots of failures there in the logs.
Attached file Crash 2
This one seems to have better logs, including

E GLConsumer: [SurfaceTexture-0-13300-693] detachFromContext: invalid current EGLDisplay

E GLConsumer: [SurfaceTexture-0-13300-735] syncForReleaseLocked: error dup'ing native fence fd: 0x3000
E GLConsumer: [SurfaceTexture-0-13300-735] syncForReleaseLocked failed (slot=1), err=-2147483648

W GeckoSurfaceTexture: releaseTexImage() failed
W GeckoSurfaceTexture: java.lang.RuntimeException: Error during updateTexImage
OK that looks like we're also running out of fds, which indicates a leak.
I'm first going to make this resilient to out-of-memory situations, then we can find and fix the leak.

ANativeWindow_lock returns -ENOMEM when we run out of memory. We should simply return false rather than crashing. That will fix the crash.

The hang occurs when ReleaseTexImage (in SurfaceTextureHost::NotifyNotUsed) fails.

> W Adreno-GSL: <gsl_syncobj_dup_fd:265>: 0xb28e4cd8 dup(212) error 24 Too many open files
> GLConsumer: [SurfaceTexture-0-20515-729] syncForReleaseLocked: error dup'ing native fence fd: 0x3000
> E GLConsumer: [SurfaceTexture-0-20515-729] syncForReleaseLocked failed (slot=1), err=-2147483648
> W GeckoSurfaceTexture: releaseTexImage() failed
> W GeckoSurfaceTexture: java.lang.RuntimeException: Error during updateTexImage (see logcat for details)
> W GeckoSurfaceTexture: 	at android.graphics.SurfaceTexture.nativeReleaseTexImage(Native Method)
> W GeckoSurfaceTexture: 	at android.graphics.SurfaceTexture.releaseTexImage(SurfaceTexture.java:252)
> W GeckoSurfaceTexture: 	at org.mozilla.gecko.gfx.GeckoSurfaceTexture.releaseTexImage(GeckoSurfaceTexture.java:111)

(There is a copy paste error in the android source which makes it say "error during updateTexImage" when it means "error during releaseTexImage" - https://android.googlesource.com/platform/frameworks/base/+/master/core/jni/android/graphics/SurfaceTexture.cpp#342)

ANativeWindow_lock hangs waiting for a buffer, since it was not released. I'm not sure how best to fix this. It might be possible to get the TextureHost to not release its read lock, making the TiledContentClient then discard the buffer.
Blocks: 1413142
See Also: → 1419351
An explanation of the problem for if future people come across this bug:

ReleaseTexImage() was failing to create a sync object, because we ran out of FDs. We discovered this occurred because the fds are only closed when the SurfaceTexture object was finalized by the garbage collector, and it was quite easy to reach the limit before this happens.
Comment on attachment 8930292 [details]
Bug 1416015 - Ensure SurfaceTexture desctruction happens correctly

https://reviewboard.mozilla.org/r/201364/#review206856

::: mobile/android/geckoview/src/main/java/org/mozilla/gecko/gfx/GeckoSurfaceTexture.java:48
(Diff revision 1)
>      private GeckoSurfaceTexture(int handle, boolean singleBufferMode) {
>          super(0, singleBufferMode);
>          init(handle, singleBufferMode);
>      }
>  
> +    @Override

I'm not sure how things are done in Java, but should this perhaps be protected, and "throws Throwable" (with the try-catch at the call site) so that it matches the super class' definition?

::: mobile/android/geckoview/src/main/java/org/mozilla/gecko/gfx/GeckoSurfaceTexture.java:185
(Diff revision 1)
> +            }
> -                }
> +        }
> -            }
> +    }
>  
> -            release();
> +    @WrapForJNI
> +    public static void destroyUnused(long context) {

I don't see what doing this gets us over simply calling DetachFromGLContext when the TextureHost drops its reference to the SurfaceTexture. Especially now that the TextureHost is responsible for attaching it in the first place, that seems easier, and conceptually cleaner to me, and frees the resources sooner.

We could assert that mAttachedContext == 0 when the final reference is dropped, to catch any mistakes.
Comment on attachment 8930292 [details]
Bug 1416015 - Ensure SurfaceTexture desctruction happens correctly

https://reviewboard.mozilla.org/r/201364/#review206856

> I'm not sure how things are done in Java, but should this perhaps be protected, and "throws Throwable" (with the try-catch at the call site) so that it matches the super class' definition?

Yup, these are good points, I'll change it.

> I don't see what doing this gets us over simply calling DetachFromGLContext when the TextureHost drops its reference to the SurfaceTexture. Especially now that the TextureHost is responsible for attaching it in the first place, that seems easier, and conceptually cleaner to me, and frees the resources sooner.
> 
> We could assert that mAttachedContext == 0 when the final reference is dropped, to catch any mistakes.

The problem is that the SurfaceTexture is ref counted, and the Compositor isn't always the last reference holder. The lifetime of the Surface on the content side is shared with that of its SurfaceTexture, so sometimes the content thread/process is the last to unref a SurfaceTexture. In that case, the decrementUse() is called on a binder thread and not the compositor thread, so this destroyUnused() pattern is used to make sure we do the destruction from the compositor. It would be nicer if we could just queue a Runnable to run on the Compositor thread, but we don't have a Java Looper or anything like that to allow this.
Comment on attachment 8930292 [details]
Bug 1416015 - Ensure SurfaceTexture desctruction happens correctly

https://reviewboard.mozilla.org/r/201364/#review206918
Attachment #8930292 - Flags: review?(jnicol) → review+
Pushed by jwillcox@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/efc5b1e03ead
Ensure SurfaceTexture desctruction happens correctly r=jnicol
https://hg.mozilla.org/mozilla-central/rev/efc5b1e03ead
https://hg.mozilla.org/mozilla-central/rev/840cf9044de3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
I think this needs backed out, I'm still getting hangs.
I guess this can ride the trains to 59, please go ahead and request uplift to beta if not.
This is still off behind a pref anyway so no need to uplift
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: