Closed Bug 857353 Opened 11 years ago Closed 3 years ago

ANR: deadlock between UI and Gecko threads inside gecko.Tab/gecko.Tabs

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jchen, Unassigned)

Details

(Whiteboard: [ANR])

The ANR reporter caught a deadlock between Tab and Tabs, when they're being accessed simultaneously on both the UI and Gecko threads:

>"main" prio=5 tid=1 MONITOR
>  | group="main" sCount=1 dsCount=0 obj=0x40a0f460 self=0xc17888
>  | sysTid=10244 nice=0 sched=0/0 cgrp=default handle=1074967688
>  | schedstat=( 0 0 0 ) utm=207 stm=24 core=0
>  at org.mozilla.gecko.Tab.onChange(Tab.java:~109)
>  - waiting to lock <0x41209978> (a org.mozilla.gecko.Tab) held by tid=14 (Gecko)
>  at org.mozilla.gecko.Tabs.onTabChanged(Tabs.java:541)
>  at org.mozilla.gecko.Tabs.access$300(Tabs.java:31)
>  at org.mozilla.gecko.Tabs$4.run(Tabs.java:509)
>  at android.app.Activity.runOnUiThread(Activity.java:4289)
>  at org.mozilla.gecko.Tabs.notifyListeners(Tabs.java:506)
>  at org.mozilla.gecko.Tabs.notifyListeners(Tabs.java:501)
>  at org.mozilla.gecko.Tabs.selectTab(Tabs.java:212)
>  at org.mozilla.gecko.Tabs.loadUrl(Tabs.java:643)
>  at org.mozilla.gecko.Tabs.loadUrl(Tabs.java:590)
>  at org.mozilla.gecko.AboutHomeContent$13$1.onClick(AboutHomeContent.java:705)
>  at android.view.View.performClick(View.java:3540)
>  at android.view.View$PerformClick.run(View.java:14167)
>  at android.os.Handler.handleCallback(Handler.java:605)
>  at android.os.Handler.dispatchMessage(Handler.java:92)
>  at android.os.Looper.loop(Looper.java:137)
>  at android.app.ActivityThread.main(ActivityThread.java:4486)
>  at java.lang.reflect.Method.invokeNative(Native Method)
>  at java.lang.reflect.Method.invoke(Method.java:511)
>  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:784)
>  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:551)
>  at dalvik.system.NativeStart.main(Native Method)

>"Gecko" prio=5 tid=14 MONITOR
>  | group="main" sCount=1 dsCount=0 obj=0x411f9678 self=0xf7e3d8
>  | sysTid=10308 nice=0 sched=0/0 cgrp=default handle=14804680
>  | schedstat=( 0 0 0 ) utm=267 stm=60 core=1
>  at org.mozilla.gecko.Tabs.getActivity(Tabs.java:~334)
>  - waiting to lock <0x4116a728> (a org.mozilla.gecko.Tabs) held by tid=1 (main)
>  at org.mozilla.gecko.Tabs.notifyListeners(Tabs.java:506)
>  at org.mozilla.gecko.Tabs.notifyListeners(Tabs.java:501)
>  at org.mozilla.gecko.Tab.updateTitle(Tab.java:250)
>  at org.mozilla.gecko.Tab.handleLocationChange(Tab.java:528)
>  at org.mozilla.gecko.Tabs.handleMessage(Tabs.java:405)
>  at org.mozilla.gecko.util.EventDispatcher.dispatchEvent(EventDispatcher.java:75)
>  at org.mozilla.gecko.GeckoAppShell.handleGeckoMessage(GeckoAppShell.java:1853)
>  at org.mozilla.gecko.mozglue.GeckoLoader.nativeRun(Native Method)
>  at org.mozilla.gecko.mozglue.GeckoLoader.nativeRun(Native Method)
>  at org.mozilla.gecko.GeckoAppShell.runGecko(GeckoAppShell.java:290)
>  at org.mozilla.gecko.GeckoThread.run(GeckoThread.java:104)
Is this new, Jim? Can you specify a version or commit?
(My temptation to just completely rip out all of our tabs code and rewrite it from scratch to be immutable grows ever stronger…)
(In reply to Richard Newman [:rnewman] from comment #1)
> Is this new, Jim? Can you specify a version or commit?

This specific report was from Fennec 22 Nightly (3/27).

But considering the ANR reporter was turned on only last week, this bug most likely have been a long-standing issue, and we just didn't know about it because it results in a hang, not crash :-/
I dread to think what it would have reported prior to Bug 844407!
Ugh, I hadn't followed the chain deep enough when I was looking at this code (either that, or bnicholson's work didn't touch it enough).

Yeah, there are interleaved calls back and forth between the two objects here, which is a perfect opportunity for deadlock.

It is *probably* sufficient to (a) remove most of the Tabs accesses from Tab, and (b) add a layer of indirection for the rest.

None of the uses I can see really make any sense.

For example, why do we call myTab.readerMode(), and it asks Tabs to do the work? Why not just ask Tabs to open a new tab? Why are Tab instances responsible for notifying Tabs's listeners *on this thread*?

So let's think about getting rid of those, so that accesses to each tab and Tabs only happen on the threads we're expecting them to. One way to do that is to have a thread-safe 'pipe' for tab events (e.g., for those notifications, for new tab events, etc.).

bnicholson, do you want to tackle this, or should I?
(In reply to Richard Newman [:rnewman] from comment #5)
> Ugh, I hadn't followed the chain deep enough when I was looking at this code
> (either that, or bnicholson's work didn't touch it enough).

I should clarify: not casting aspersions on bnicholson's work here. By "didn't touch it enough" I mean "I didn't think I needed to read it, because his patches didn't change it".
(In reply to Richard Newman [:rnewman] from comment #5)

> For example, why do we call myTab.readerMode(), and it asks Tabs to do the
> work? Why not just ask Tabs to open a new tab?

Because we would miss setting the "entering reader mode" state. We could make a Tabs.loadReaderMode(tab), but it would still need to pull and push state into the Tab.

> Why are Tab instances responsible for notifying Tabs's listeners *on this thread*?

Do you mean forcing notifications on the UI thread?
http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/Tabs.java#507

> So let's think about getting rid of those, so that accesses to each tab and
> Tabs only happen on the threads we're expecting them to. One way to do that
> is to have a thread-safe 'pipe' for tab events (e.g., for those
> notifications, for new tab events, etc.).

I don't follow this.

I do think we could move out some of the code in Tabs/Tab. Code like loadUrl (and overloads):
http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/Tabs.java#605

I think that starts to blur Tabs as a state management class and starts adding too much behavior to it.

The showXxxHistory code in Tab could be in the same category:
http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/Tab.java#407
(In reply to Mark Finkle (:mfinkle) from comment #7)

> > For example, why do we call myTab.readerMode(), and it asks Tabs to do the
> > work? Why not just ask Tabs to open a new tab?
> 
> Because we would miss setting the "entering reader mode" state. We could
> make a Tabs.loadReaderMode(tab), but it would still need to pull and push
> state into the Tab.

I contend that entering reader mode should create a new Tab (whether or not that corresponds to a visual tab) and replace the old one (or push onto a stack on top of it, or whatever).

All of these threading problems come from mutable state… or, rather, the distribution of mutable state around the program.

To enter reader mode involves a lot of hopping around. loadUrl is driven by flags. This is very brittle.


> > Why are Tab instances responsible for notifying Tabs's listeners *on this thread*?
> 
> Do you mean forcing notifications on the UI thread?
> http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/Tabs.
> java#507

We do that… but in order to do so Tab first calls into Tabs.notifyListeners *on this thread*, which calls getActivity on this thread, which is synchronized, which results in deadlock: the other thread is already notifying listeners, and is holding and requesting locks all over the place.

My suggestion below is "don't rely on taking a lock to do something like notify listeners", because of course notifying listeners can result in arbitrary work being done while holding locks. There are several ways we can do that -- make it async by using some kind of queue, consolidate logic, apply immutability like some kind of beautiful sculpture -- but we need to do one of them.


> > So let's think about getting rid of those, so that accesses to each tab and
> > Tabs only happen on the threads we're expecting them to. One way to do that
> > is to have a thread-safe 'pipe' for tab events (e.g., for those
> > notifications, for new tab events, etc.).
> 
> I don't follow this.

Essentially a safe way for Tab to ask Tabs to do something (like notify its listeners).


> I think that starts to blur Tabs as a state management class and starts
> adding too much behavior to it.

Right now we have some state management in Tabs (which knows about Tab), and some in Tab (which knows about the singleton Tabs, as well as its own state).

This is a mess (and causes threading problems); we would be better off moving all Tab logic into Tabs, and having a Tab be merely a 'token' that you hand over to Tabs to do the work, or can inspect.
(In reply to Richard Newman [:rnewman] from comment #8)

> Right now we have some state management in Tabs (which knows about Tab), and
> some in Tab (which knows about the singleton Tabs, as well as its own state).
> 
> This is a mess (and causes threading problems); we would be better off
> moving all Tab logic into Tabs, and having a Tab be merely a 'token' that
> you hand over to Tabs to do the work, or can inspect.

Can we do this in a way that doesn't look like crap? Tabs will become an ugly kitchen-sink, while Tab becomes an integer ID. I mean all the methods of Tab move to Tabs:

Tabs.getReaderModeForTab(Tab)
Tabs.getDesktopModeForTab(Tab)

Is this the best we can do in a threaded world? Making all Tab methods synchronized is just the opposite extreme, and just as unappealing.
Immutable Tab objects mean we can't hold onto a Tab for any length of time anywhere else in the app. This might be an appealing approach. I don't think we hold onto Tab objects.
(In reply to Mark Finkle (:mfinkle) from comment #9)
> (In reply to Richard Newman [:rnewman] from comment #8)

> Making all Tab methods
> synchronized is just the opposite extreme, and just as unappealing.

And it won't solve the problem. Adding synchronization eliminates races in exchange for deadlock.

This is why I tend to refer to "just add synchronized" as sprinkling magic thread-safety dust.


> Is this the best we can do in a threaded world? 

There might be a good solution here, given enough time to come up with a design. (Which I won't have for a couple of weeks.)

How prevalent are tabs-related hangs?

I agree that dumping the kitchen sink into Tabs is not a good answer -- it's ugly, and its only advantage is that it isolates most of the problematic code in one place, which makes it easier to reason about.

That said, we have a system involving multiple threads that was not really designed to be thread-safe, and that has grown organically. I would find it hard to say that the current state is a good answer, either! Such is software engineering in the real world.

(Just look at the amount of state in Tab -- seven flags, three different bits of favicon state, a bunch of IDs, a bunch of plugin data, an informal state, as well as the obvious URL and title. And more besides. All of which has to be modified consistently by different components on different threads. Eek.)

The 'proper' solution to all of this is likely to involve stripping out everything to do with Tabs and Tab, and designing a ground-up safe, non-singleton tab implementation that meets current needs, with a clear internal API -- here's how you open a tab, here's how you read info, here're the notifications you get.

That will probably involve concepts like (to coin some names for domain concepts) Visit, Tab (a sequential history of Visits), TabCollection (an aggregation of open Tabs), TabFavicon, and TabEventListener (to handle all of those title changes etc.). Many of those things can be immutable, or have well-defined mutation interfaces: Tab allows you to add a new Visit, for example, perhaps by asking the -- immutable -- topmost Visit for a new ReaderModeVisit. A Tab would be given a TabNotificationChannel by its TabCollection on creation. And so on.


> Immutable Tab objects mean we can't hold onto a Tab for any length of time
> anywhere else in the app. This might be an appealing approach. I don't think
> we hold onto Tab objects.

They certainly seem to act as tokens -- "persist all the open tabs", "open this one in Reader Mode", etc.

This is why I'm optimistic about a targeted rewrite: the surface area of the interface seems quite small.
FWIW, now we have more ANR data, this does not appear to be a frequent deadlock.
Whiteboard: [ANR]
We have completed our launch of our new Firefox on Android. The development of the new versions use GitHub for issue tracking. If the bug report still reproduces in a current version of [Firefox on Android nightly](https://play.google.com/store/apps/details?id=org.mozilla.fenix) an issue can be reported at the [Fenix GitHub project](https://github.com/mozilla-mobile/fenix/). If you want to discuss your report please use [Mozilla's chat](https://wiki.mozilla.org/Matrix#Connect_to_Matrix) server https://chat.mozilla.org and join the [#fenix](https://chat.mozilla.org/#/room/#fenix:mozilla.org) channel.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.