Closed Bug 817067 Opened 8 years ago Closed 8 years ago

Many many concurrency problems in the thumbnailing code

Categories

(Firefox for Android :: General, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 20
Tracking Status
firefox19 --- fixed
firefox20 --- fixed

People

(Reporter: kats, Assigned: kats)

References

Details

Attachments

(1 file, 3 obsolete files)

I was doing some code inspection to investigate bug 812752 and I found way too many concurrency issues in the thumbnailing code. I think it's time for a rewrite! The current problems can I believe variously result in exceptions, memory corruption, garbage thumbnails, and so on.

For example, I was able to reproduce the crash in bug 812752 by inserting a Thread.sleep(10000) in GeckoApp.getAndProcessThumbnailForTab here:

@@ -729,13 +730,20 @@ abstract public class GeckoApp
 
         int dw = Tabs.getThumbnailWidth();
         int dh = Tabs.getThumbnailHeight();
-        GeckoAppShell.sendEventToGecko(GeckoEvent.createScreenshotEvent(tab.getId(), 0, 0, 0, 0, 0, 0, dw, dh, dw, dh, ScreenshotHandler.SCREENSHOT_THUMBNAIL, tab.getThumbnailBuffer()));
+        GeckoEvent e = GeckoEvent.createScreenshotEvent(tab.getId(), 0, 0, 0, 0, 0, 0, dw, dh, dw, dh, ScreenshotHandler.SCREENSHOT_THUMBNAIL, tab.getThumbnailBuffer());
+        try {
+            Thread.sleep(10000);
+        } catch (InterruptedException ie) {
+            ie.printStackTrace();
+        }
+        GeckoAppShell.sendEventToGecko(e);
     }
 
     void handleThumbnailData(Tab tab, ByteBuffer data) {
         if (shouldUpdateThumbnail(tab)) {

and then doing the following:

1) adb shell am start -a android.intent.action.VIEW -n org.mozilla.fennec_kats/.App -d "http://people.mozilla.com/"
2) wait until the above sleep is hit (I had some extra logging to tell me when). Then:
3) adb shell am start -a android.intent.action.VIEW -n org.mozilla.fennec_kats/.App -d "about:home"
4) Open tabs tray
5) Wait

Once the thumbnails are done, boom:

11-30 11:04:32.365 W/dalvikvm( 9662): threadid=11: thread exiting with uncaught exception (group=0x40c2e930)
11-30 11:04:32.365 E/GeckoAppShell( 9662): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 591 ("GeckoBackgroundThread")
11-30 11:04:32.365 E/GeckoAppShell( 9662): java.lang.RuntimeException: Buffer not large enough for pixels
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at android.graphics.Bitmap.copyPixelsFromBuffer(Bitmap.java:417)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at org.mozilla.gecko.GeckoApp.handleThumbnailData(GeckoApp.java:747)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at org.mozilla.gecko.ScreenshotHandler$2.run(ScreenshotHandler.java:370)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at android.os.Handler.handleCallback(Handler.java:725)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at android.os.Handler.dispatchMessage(Handler.java:92)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at android.os.Looper.loop(Looper.java:137)
11-30 11:04:32.365 E/GeckoAppShell( 9662): 	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:31)
11-30 11:04:32.373 F/libc    ( 9662): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 9675 (Thread-591)


An few of the specific problems I found are:
- improper synchronization on Tabs.getThumbnailWidth/Tabs.setThumbnailWidth
- improper synchronization across the Tabs.getThumbnailWidth() and tab.getThumbnailBuffer() calls in GeckoApp.getAndProcessThumbnailForTab
- being able to invoke Tab.getThumbnailBuffer and destroy an existing buffer while it is in the middle of being used
Version: unspecified → Trunk
Attached patch Rewrite stuff (obsolete) — Splinter Review
Needs more testing, but early feedback welcome.
Assignee: nobody → bugmail.mozilla
Attached patch Rewrite stuff (obsolete) — Splinter Review
Updated with some better comments and more robust error-handling. This works fine on my Galaxy Nexus but I still need to test it on a tablet. Will do that next week but requesting review now anyway since I'm fairly confident in this code.
Attachment #687253 - Attachment is obsolete: true
Attachment #687275 - Flags: review?(gpascutto)
Attachment #687275 - Flags: review?(cpeterson)
Comment on attachment 687275 [details] [diff] [review]
Rewrite stuff

Review of attachment 687275 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good, but I have some questions:

::: mobile/android/base/Tab.java
@@ +166,5 @@
> +        if (mThumbnailBitmap == null
> +            || mThumbnailBitmap.getWidth() != width
> +            || mThumbnailBitmap.getHeight() != height)
> +        {
> +            mThumbnailBitmap = Bitmap.createBitmap(width, height, Bitmap.Config.RGB_565);

Should we mThumbnailBitmap.recycle() before calling createBitmap()? This code path will overwrite mThumbnailBitmap (when the bitmap dimensions change on a non-Honeycomb device). The old mThumbnailBitmap's native buffer won't be freed until GC.

::: mobile/android/base/ThumbnailHelper.java
@@ +24,5 @@
> + * Changes to the thumbnail width are stashed in mPendingWidth and the change is
> + * applied between thumbnail processing. This allows a single thumbnail buffer to
> + * be used for all thumbnails.
> + */
> +public final class ThumbnailHelper {

I think you can make ThumbnailHelper a "package-private" (default visibility) class instead of public. The class is only used within the org.mozilla.gecko package.

@@ +27,5 @@
> + */
> +public final class ThumbnailHelper {
> +    private static final String LOGTAG = "GeckoThumbnailHelper";
> +
> +    public static final float THUMBNAIL_ASPECT_RATIO = 0.714f;

Can you add a comment explaining where the 0.714 aspect ratio comes from?

@@ +37,5 @@
> +    public static ThumbnailHelper getInstance() {
> +        if (sInstance == null) {
> +            sInstance = new ThumbnailHelper();
> +        }
> +        return sInstance;

Should getInstance() be a synchronized method? If it is only called by the UI thread, I think you should guard it with a GeckoApp.assertOnGeckoThread() check. If you need to synchronize but want to use the Double-Checked Locking pattern to avoid making the entire method synchronized, you will need to make sInstance volatile.

@@ +42,5 @@
> +    }
> +
> +    // instance stuff
> +
> +    private final LinkedList<Tab> mPendingThumbnails;    // synchronized access only

How many thumbnails do you expect will be pending at any time? If the number is very small, an ArrayList might be faster and use less memory than a LinkedList.

Also, I would consider declaring mPendingThumbnails's type as the java.util.List interface (to hide its actual implementation).

@@ +55,5 @@
> +        mHeight = -1;
> +        mPendingWidth = -1;
> +    }
> +
> +    public void getAndProcessThumbnailFor(Tab tab) {

Since the ThumbnailHelper class was added to fix concurrency bugs, I think you should consider guarding the class's public methods with GeckoApp.assertOnGeckoThread() or GeckoApp.assertOnUiThread() debug checks, where appropriate.

@@ +197,5 @@
> +                bitmap = BitmapFactory.decodeByteArray(compressed, 0, compressed.length);
> +            }
> +            tab.updateThumbnail(bitmap);
> +        } catch (OutOfMemoryError ome) {
> +            Log.w(LOGTAG, "setTabThumbnail: decoding byte array of length " + compressed.length + " ran out of memory");

Do you want to mPendingThumbnails.clear() on OOM here like in requestThumbnailFor() above?
Attachment #687275 - Flags: review?(cpeterson) → feedback+
Attached patch Rewrite stuff (v2) (obsolete) — Splinter Review
(In reply to Chris Peterson (:cpeterson) from comment #3)
> Should we mThumbnailBitmap.recycle() before calling createBitmap()? This
> code path will overwrite mThumbnailBitmap (when the bitmap dimensions change
> on a non-Honeycomb device). The old mThumbnailBitmap's native buffer won't
> be freed until GC.

Done. The old code wasn't doing that either so I wasn't sure if I should change that or not, but hey, that's what rewrites are for :)

> > +public final class ThumbnailHelper {
> 
> I think you can make ThumbnailHelper a "package-private" (default
> visibility) class instead of public. The class is only used within the
> org.mozilla.gecko package.

Done.

> > +    public static final float THUMBNAIL_ASPECT_RATIO = 0.714f;
> 
> Can you add a comment explaining where the 0.714 aspect ratio comes from?

I have no idea where it comes from. I got it from the code in Tabs.java that I deleted.

> Should getInstance() be a synchronized method? If it is only called by the
> UI thread, I think you should guard it with a GeckoApp.assertOnGeckoThread()
> check. If you need to synchronize but want to use the Double-Checked Locking
> pattern to avoid making the entire method synchronized, you will need to
> make sInstance volatile.

Made it synchronized. It can be called by any thread so I didn't add thread checks.

> > +    private final LinkedList<Tab> mPendingThumbnails;    // synchronized access only
> 
> How many thumbnails do you expect will be pending at any time? If the number
> is very small, an ArrayList might be faster and use less memory than a
> LinkedList.
> 
> Also, I would consider declaring mPendingThumbnails's type as the
> java.util.List interface (to hide its actual implementation).

This one annoyed me as well. I really wanted to make it a Queue because that's what it is semantically but then I needed the lastIndexOf function which doesn't exist on the Queue interface so I ended up using the specific LinkedList implementation instead. I could use List but then I don't get peek() and instead need to do isEmpty()/get(0) combinations which is uglier. For now I've left this as-is but I can change it to use List if you really want me to.

> > +    public void getAndProcessThumbnailFor(Tab tab) {
> 
> Since the ThumbnailHelper class was added to fix concurrency bugs, I think
> you should consider guarding the class's public methods with
> GeckoApp.assertOnGeckoThread() or GeckoApp.assertOnUiThread() debug checks,
> where appropriate.

The idea was to make it thread-safe internally so that any public method can be called from any thread and everything will still work right. However I did look over it again and noticed that the handleThumbnailData was public when it should really only be called from the one spot in ScreenshotHandler that's currently calling it, so I scoped that down to package-scope and added a assertOnGeckoThread check there.

> > +        } catch (OutOfMemoryError ome) {
> > +            Log.w(LOGTAG, "setTabThumbnail: decoding byte array of length " + compressed.length + " ran out of memory");
> 
> Do you want to mPendingThumbnails.clear() on OOM here like in
> requestThumbnailFor() above?

I don't think that's necessary in this case, because the decodeByteArray gets called on the code path where the thumbnail data is coming out of BrowserDB rather than from processing thumbnail requests. So most likely at that point the mPendingThumbnails will be empty anyway and if it's not then it's still quite possible that they will succeed even if the decompression here fails.
Attachment #687275 - Attachment is obsolete: true
Attachment #687275 - Flags: review?(gpascutto)
Attachment #687357 - Flags: review?(gpascutto)
Attachment #687357 - Flags: review?(cpeterson)
(In reply to Kartikaya Gupta (:kats) from comment #4)
> > > +    public static final float THUMBNAIL_ASPECT_RATIO = 0.714f;
> > 
> > Can you add a comment explaining where the 0.714 aspect ratio comes from?
> 
> I have no idea where it comes from. I got it from the code in Tabs.java that
> I deleted.

:wesj, do you mind explaining where this comes from?
Just a UX decision.
Ok, thanks, I can add a comment to that effect.
Comment on attachment 687357 [details] [diff] [review]
Rewrite stuff (v2)

Review of attachment 687357 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/ThumbnailHelper.java
@@ +46,5 @@
> +    private final LinkedList<Tab> mPendingThumbnails;    // synchronized access only
> +    private int mWidth;
> +    private int mHeight;
> +    private ByteBuffer mBuffer;
> +    private int mPendingWidth;  // always synchronize on "this" when touching this

Why not make it an AtomicInteger? Gets rid of the explicit locking and avoids "bad" accesses, also simplifies the code.

@@ +96,5 @@
> +    public void setThumbnailWidth(int width) {
> +        width = IntSize.nextPowerOfTwo(width);
> +        synchronized (this) {
> +            mPendingWidth = width;
> +        }

mPendingWidth.set(width);

@@ +106,5 @@
> +            if (mPendingWidth >= 0) {
> +                mWidth = mPendingWidth;
> +                mPendingWidth = -1;
> +            }
> +        }

mWidth = mPendingWidth.get(); (and remove everything else, you don't actually care if it got updated or not)

@@ +109,5 @@
> +            }
> +        }
> +
> +        if (mWidth < 0) {
> +            mWidth = (int)GeckoApp.mAppContext.getResources().getDimension(R.dimen.tab_thumbnail_width);

Is it possible to move this to the constructor? (and initialize mWidth = mPendingWidth right away if we're using AtomicInt)

If not, using the AtomicInt is a little more difficult:
int defaultWidth = (int)GeckoApp.mAppContext.getResources().getDimension(R.dimen.tab_thumbnail_width);
mPendingWidth.compareAndSet(-1, defaultWidth);  // Don't care for success, just don't overwrite if it's no longer the default.
mWidth = mPendingWidth.get();
Attachment #687357 - Flags: review?(gpascutto) → review+
Oh cool, I didn't know AtomicInteger even existed. That does simplify the code quite a bit. Re-requesting review just to be sure the changes are good.
Attachment #687357 - Attachment is obsolete: true
Attachment #687357 - Flags: review?(cpeterson)
Attachment #687760 - Flags: review?(gpascutto)
Attachment #687760 - Flags: review?(cpeterson)
Comment on attachment 687760 [details] [diff] [review]
Rewrite stuff (v3)

Review of attachment 687760 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. FYI I didn't know about AtomicInteger before doing the review either. Normal integers in Java are already atomic, and it looked to me like the locking wasn't needed with just adding a volatile or memory barrier in there, so I was checking the Java docs for that. But AtomicInteger is a much clearer API.
Attachment #687760 - Flags: review?(gpascutto) → review+
Comment on attachment 687760 [details] [diff] [review]
Rewrite stuff (v3)

Review of attachment 687760 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM!
Attachment #687760 - Flags: review?(cpeterson) → review+
(In reply to Kartikaya Gupta (:kats) from comment #4)
> However I
> did look over it again and noticed that the handleThumbnailData was public
> when it should really only be called from the one spot in ScreenshotHandler
> that's currently calling it, so I scoped that down to package-scope and
> added a assertOnGeckoThread check there.

Oh whoops. Turns out this code is run on the background thread, not on the gecko thread. There doesn't appear to be an assertOnGeckoBackgroundThead assert though, presumably because we don't bother keeping a handle to the thread object.
https://hg.mozilla.org/mozilla-central/rev/810a5eea090b
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 20
Comment on attachment 687760 [details] [diff] [review]
Rewrite stuff (v3)

[Approval Request Comment]
Bug caused by (feature/regressing bug #): various bugs. crashes were aggravated by bug 787765 which is in 19 and up
User impact if declined: various crashes in thumbnail code, like bug 812752
Testing completed (on m-c, etc.): on nightly for a few days now, crash volume has dropped and no regressions observed yet
Risk to taking this patch (and alternatives if risky): low-risk (patch applies cleanly); touches fennec only.
String or UUID changes made by this patch: none
Attachment #687760 - Flags: approval-mozilla-aurora?
Comment on attachment 687760 [details] [diff] [review]
Rewrite stuff (v3)

low risk, mobile-only, approving.
Attachment #687760 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 819953
You need to log in before you can comment on or make changes to this bug.