Closed Bug 1238856 Opened 8 years ago Closed 8 years ago

removeCurrentTab method of gBrowser locks up when called inside a response to a sync message from the content process

Categories

(Firefox :: Extension Compatibility, defect)

46 Branch
x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: snuz.gary, Unassigned)

References

Details

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:46.0) Gecko/20100101 Firefox/46.0
Build ID: 20160111030207

Steps to reproduce:

Call the removeCurrentTab() method of gBrowser from my addon's chrome process while a e10s tab is current. running in Nightly of 1/9/2016.


Actual results:

the browser hangs for several seconds, no errors in console, no exception thrown, and most importantly the current tab remains open. this behaviour was intermittent for a few months, but as of 1/2016 it seems to be consistent.


Expected results:

current tab should close as it does in non e10s tabs
OS: Unspecified → Windows XP
Hardware: Unspecified → x86
Until I have time to make a test case, you could get my very tiny (~1KB) add-on "Microgesture" from AMO. Make the back gesture from a tab which has no history to go back to such as a tab that has just been opened, this should call removeCurrentTab() and close the tab -- but it won't, unless it's a non e10s tab like about:config or about:addons which still work.

I think this is bug 1100700 related and have posted some comments there.
Blocks: e10s-addons
Component: Untriaged → Extension Compatibility
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to snuz_2@yahoo.com from comment #1)
> Until I have time to make a test case, you could get my very tiny (~1KB)
> add-on "Microgesture" from AMO. Make the back gesture from a tab which has
> no history to go back to such as a tab that has just been opened, this
> should call removeCurrentTab() and close the tab -- but it won't, unless
> it's a non e10s tab like about:config or about:addons which still work.
> 
> I think this is bug 1100700 related and have posted some comments there.

FWIW, I don't see any comments by you on that bug. Did you mean a different bug?

The issue here seems to be that you're dispatching a sync message to the parent from your frame script, and so you block on hearing back from the parent, but the parent then sends an async message to the child and spins the event loop, so you never hear back from the parent, and the parent is waiting to hear from the child that is stuck. I'm surprised the entire process doesn't deadlock. :-\

Bill, I don't know anything about the e10s internals here, but is there some way we can detect this and avoid this from the removeCurrentTab code?

snuz_2: you could avoid this by doing the removeCurrentTab in a setTimeout(, 0) or some other way of making the sync return of your message not block on the tab removal having happened. Ideally, don't use sync messages from the content process at all, but I don't know if you can avoid this... from a brief look at the code, it looks like you want to know "canGoBack" and "canGoForward". Both of those are available on nsIWebNavigation, so in the content process you can do:

let webnav = docShell.QueryInterface(Components.interfaces.nsIWebNavigation);
let canGoForward = webnav.canGoForward;
let canGoBack = webnav.canGoBack;

then you can decide what to do with the gesture in the content process, and can just use an async message to get the parent process to act on that if necessary (ie remove the tab, etc.).

It also seems like you use a pref for the down gesture, and do some other checks on the message content. That too could live on the client/content side of things (prefs are readable in the content process, but not writable).

Hopefully that helps you fix this in the short term. We should still look at whether there's anything we can do to avoid this kind of issue on the Firefox side... let's see what Bill has to say about that.
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(wmccloskey)
Summary: removeCurrentTab method of gBrowser broken in chrome process of e10s tabs → removeCurrentTab method of gBrowser locks up when called inside a response to a sync message from the content process
I may be looking at a different add-on since Gijs's comments don't seem quite right to me. Is it this one?
https://addons.mozilla.org/en-US/firefox/addon/microgesture-tab-edition/

Looking at the code, I don't understand why the message needs to be sync. The add-on doesn't examine canGoBack. It does call goBack, which I guess does nothing if canGoBack is false. But I don't see any reason not to call it unconditionally. The doIt function doesn't return anything, which is usually the reason for using sendSyncMessage.

As to a general way of fixing this, that would be bug 1191143. It cancels CPOWs in the child when the parent spins a nested event loop. It's pretty hacky, but so are CPOWs. I backed it out because it seemed like it was causing some IPC crashes, but I think I have those under control now. Hopefully I can land it in the next week or two.
Flags: needinfo?(wmccloskey)

@Gijs: sorry, I read the wrong tab, the bug I thought this could be related to is 967873, and that's where I left comments. It was you who sent from there to here in response to that comment. Once I saw that in your explanation of the problem that "you" refers to the FF internal code rather myself, I understand what is going on. Maybe I never caught this behaviour, but I think it works in the current FF beta... if that's the case, why does it suddenly break?

@Bill: that's a different variant of the addon in question. but it also can call removeCurrentTab, so it will have the same problem. yeah, I don't think it needs to by sync in either addon's case. I'll look into it.
(In reply to Bill McCloskey (:billm) from comment #3)
> I may be looking at a different add-on since Gijs's comments don't seem
> quite right to me. Is it this one?
> https://addons.mozilla.org/en-US/firefox/addon/microgesture-tab-edition/

I looked at: https://addons.mozilla.org/en-US/firefox/addon/microgesture/

Surprisingly, Bing search is more useful here than Google, IME (which suggested all-in-one gestures and a host of other add-ons).

> Looking at the code, I don't understand why the message needs to be sync.
> The add-on doesn't examine canGoBack. It does call goBack, which I guess
> does nothing if canGoBack is false. But I don't see any reason not to call
> it unconditionally. The doIt function doesn't return anything, which is
> usually the reason for using sendSyncMessage.

Right, in the add-on I linked to it does return something, though my initial reading wasn't quite correct - it depends on other things than canGoBack... but all things that should be available on the child process side. Sounds like snuz is looking into that.

> As to a general way of fixing this, that would be bug 1191143.
> ...
> Hopefully I can land it in the next week or two.

\o/


(In reply to snuz_2@yahoo.com from comment #4)
> 
> @Gijs: sorry, I read the wrong tab, the bug I thought this could be related
> to is 967873, and that's where I left comments. It was you who sent from
> there to here in response to that comment.

Right, OK, back on the same page now. :-)

> Once I saw that in your
> explanation of the problem that "you" refers to the FF internal code rather
> myself, I understand what is going on. Maybe I never caught this behaviour,
> but I think it works in the current FF beta... if that's the case, why does
> it suddenly break?

beta doesn't use e10s. We needed to use messaging for e10s' implementation of "onbeforeunload", so closing a tab will now try to talk to the child process before closing. That's related to the issue you're seeing.

> @Bill: that's a different variant of the addon in question. but it also can
> call removeCurrentTab, so it will have the same problem. yeah, I don't think
> it needs to by sync in either addon's case. I'll look into it.

Great! Feel free to ping me here or over email if you need more help.
Depends on: 1191143
The reason for using chrome/content split is not for testing canGoBack/canGoForward but to access the gBrower methods for opening and closing tabs, loading URL's and actual goBack/goForward navigation methods based on events in the frame. It was my understanding that these methods are now only available at the chrome level. Are you saying that there is still a way to access them from the content environment? that would make things a bit simpler!

the "tab edition" of the addon doesn't return anything from the chrome process since it does all its work there.

However, the "regular edition" returns a code that informs the content process to scroll the tab. I did this because I couldn't find a way to implement "scroll" from the chrome process, if such a method existed, it wouldn't need to return anything and could be async. the chrome process currently determines if the gesture is applied to a link ( in which case it loads it into a tab ) or the page as a whole ( pass back code to content process to handle scrolling). If the content environment could load a tab, then i wouldn't need to do this. Otherwise, I was thinking of trying to move the test of whether the gesture is over a link or not into the content process, and do the scrolling there if necessary, so that the pass-back can go away. I haven't actually looked at the code in months, so I don't remember too clearly. 

Anyway, if you know of any ways to access some of those gBrowser methods or their equivalents from content, please let me know as this would make the job very easy.
I released new versions of both addons with async calls. they work fine on the todays's nightly.
based on comment 7, closing as working
Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
(In reply to :shell escalante from comment #8)
> based on comment 7, closing as working

Hmm, the bug is still there, I just changed my addons to avoid triggering it. anyone who tries to do the same thing and actually needs the synchronous call will run into the same problem. 

It's your call whether to close the bug or not, I am just concerned that you don't understand the problem here and based on comment 7 decided it's no longer an issue, I think it is still an issue. Especially because of the behaviour it causes in the browser. It's surprising that it doesn't completely lock up with no warning or diagnostic whatsoever. ( it almost does) I wouldn't want that lurking in my OS...
Bill, is there a reason not to reland the patch in bug 1191143? :-)
Flags: needinfo?(wmccloskey)
It's not even clear we're going to ship CPOWs in a release build, so I'd rather not land something risky in order to make them work a little better.
Flags: needinfo?(wmccloskey)
You need to log in before you can comment on or make changes to this bug.