Closed Bug 676780 Opened 13 years ago Closed 13 years ago

Fennec is unable to load webpages and close tabs

Categories

(Firefox for Android Graveyard :: General, defect, P1)

Firefox 9
ARM
Android
defect

Tracking

(firefox6 fixed, firefox7 fixed, firefox8 fixed, firefox9 fixed, fennec+)

RESOLVED FIXED
Tracking Status
firefox6 --- fixed
firefox7 --- fixed
firefox8 --- fixed
firefox9 --- fixed
fennec + ---

People

(Reporter: xti, Assigned: mfinkle)

References

Details

(Keywords: relnote)

Attachments

(2 files)

Build id : Mozilla/5.0 (Android;Linux armv7l;rv:6.0)Gecko/20110804
Firefox/6.0 Fennec/6.0
Device: HTC Desire Z
OS: Android 2.3.3

Steps to reproduce:
1. Open Fennec App
2. Go to Preferences > Language
3. Change the language pack a couple of times
4. Install Quit Fennec add-on
5. Go to Preferences > Feedback Tools tab and enable the Error Console
6. Browse to www.google.com
7. Close the tab opened at step 6.

Expected result:
After step 6, new tab is opened and the page is loaded completely.
After step 7, the tab is closed.

Actual result:
Fennec is unable to load any webpage, except those with special protocols (like about:, file:, etc.). The X button on the left side of the tab thumbnails doesn't work at all.

Notes:
- Please see the following video: http://www.youtube.com/user/qaioana#p/u/27/nV4oyPLOFYY
- After step 7, if the Fennec process is killed and the app is reopened, all installed add-ons are visibly disabled even though they are listed in the Add-ons Manager, and the Preferences/Beta tab is blank. The Error Console tab is also missing.
Whiteboard: [fennec 6.0b5]
Build: Mozilla/5.0 (Android; Linux armv7l; rv:6.0) Gecko/20110804 Firefox/6.0 Fennec/6.0
Device: Samsung Galaxy SII
OS: Android 2.3.3

I can reproduce this, so I am raising the severity. The steps to reproduce are convoluted, though. Cristian, can you simplify them?
Severity: normal → blocker
Priority: -- → P1
After force-stopping the application and restarting it, all I get is a white screen, and I am unable to interact with the browser at all.
(In reply to Aaron Train [:aaronmt] from comment #2)
> Force stopping the application, restarting it all I get is a white screen
> unable to interact with the browser whatsoever

What happens after clearing the profile using Settings > Applications ...  ?
I'm working on a simpler way to reproduce this issue. But even if the profile is cleared, performing those steps again makes the problem occur again. So far, I have been able to reproduce it 100% of the time.
I can reproduce this with simpler STR

New Profile
- Change to deutsch, restart
- After first-run animation, tap URL bar, tap magnifying glass and tap Google
Severity: blocker → major
As mentioned on IRC: "Quitting" Firefox Beta (via the add-on install) restores functionality, and there may be a corrupt file involved.
Unable to reproduce on Aurora (7.0a2) (08/05)
I cannot reproduce this in 6.0b4 or 6.0b5 using clean installs with new profiles on Samsung Galaxy Tab 7" (Android 2.2), HTC T-Mobile G2 (2.2), or Motorola Xoom (3.2), using the steps from comment 5.
tracking-fennec: --- → ?
STR:
1. go to about:firefox
2. change language to french, restart
3. long tap on Assistance, open in a new tab (first selection)
Seems to only occur on 2.3 (used Flyer); does not seem to occur on 2.2 (thunderbolt), nor 3.x (thrive)
per irc in #mobile, this has been seen before in b3, b4. This is not a new regression in b5.
(In reply to Naoki Hirata :nhirata from comment #10)
> Seems to only occur on 2.3 (used Flyer); does not seem to occur on 2.2
> (thunderbolt), nor 3.x (thrive)

I was able to reproduce this issue on LG Optimus 2X - Android 2.2
No longer blocks: 675732
Chris, do you have a clue what is going on here? Perhaps this is related to bug 669289?
We need to get to the bottom of this. If we are going to respin it needs to be today or tomorrow.
Could not reproduce in Firefox Beta on a Samsung Galaxy Tab running Gingerbread.
(In reply to Christian Legnitto [:LegNeato] from comment #14)
> We need to get to the bottom of this. If we are going to respin it needs to
> be today or tomorrow.

We are collecting more data about the bug, but given the current situation, this bug does not block any release.

If new data on frequency and ease-of-reproducing come to light, we can re-assess.
Bug 660185, which is the same thing as this, referenced an error:

uncaught exception: [Exception... "Node was not found"  code: "8" nsresult: "0x80530008 (NS_ERROR_DOM_NOT_FOUND_ERR)"  location: "chrome://browser/content/browser.js Line: 2781"]
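
For readers decoding the `nsresult` in the message above: XPCOM packs a failure bit, a module number, and an error code into 32 bits. The following is a minimal Python sketch of that layout. The function name `decode_nsresult` is made up for illustration, and the module offset (0x45) and DOM module number (14) are recalled from Mozilla's `nsError.h`, so treat them as assumptions to verify.

```python
# Layout assumed from Mozilla's nsError.h (verify before relying on it).
NS_ERROR_MODULE_BASE_OFFSET = 0x45
NS_ERROR_MODULE_DOM = 14


def decode_nsresult(value: int) -> dict:
    """Split a 32-bit nsresult into its failure bit, module, and code."""
    return {
        "failure": bool((value >> 31) & 1),
        "module": ((value >> 16) & 0x1FFF) - NS_ERROR_MODULE_BASE_OFFSET,
        "code": value & 0xFFFF,
    }


parts = decode_nsresult(0x80530008)
print(parts)  # {'failure': True, 'module': 14, 'code': 8}
```

Under those assumptions, the value decodes to a failure in the DOM module with code 8, which matches the `code: "8"` and `NS_ERROR_DOM_NOT_FOUND_ERR` fields in the exception text.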
The symptoms described here indicate that the child process is 'stuck', that is, not responding properly. However it is responding enough for the parent process to not know it is in a bad state (in which case the parent would have restarted it). We saw this in the past with threading problems in the audio code for example.

Given the severity of this situation (can't close or use  tabs, can't easily get to a working state), perhaps we should add code to check if the child is in this state, and kill it if it is (or do we already have that)? Something like sending a high-level message (same level as 'close tab', which fails in this situation) and making sure we get a valid response after some (long) time. Or perhaps there is some lower-level way to do this. cjones, what do you think?
I can't reproduce this on Nightly on a Galaxy S or linux desktop.

(In reply to Aaron Train [:aaronmt] from comment #5)
> I can reproduce this with simpler STR
> 
> New Profile
> - Change to deutsch, restart
> - After first-run animation, tap URL bar, tap magnifying glass and tap Google

Why is the first-run animation showing in this case? This is the second run (during first run in this profile you change the language) or am I misunderstanding these STR?
(In reply to Alon Zakai (:azakai) from comment #19)
> I can't reproduce this on Nightly on a Galaxy S or linux desktop.
> 
> (In reply to Aaron Train [:aaronmt] from comment #5)
> > I can reproduce this with simpler STR
> > 
> > New Profile
> > - Change to deutsch, restart
> > - After first-run animation, tap URL bar, tap magnifying glass and tap Google
> 
> Why is the first-run animation showing in this case? This is the second run
> (during first run in this profile you change the language) or am I
> misunderstanding these STR?

It lands on about:firstrun after the restart
Aurora (8.0a1) HTC Evo

Following STR in comment #5 I get 
http://www.flickr.com/photos/ozten/6026742774/in/photostream

Cleared Data.

Following STR in comment #9, I could not reproduce.

Switched to Deutsch. Again same error as above.
(In reply to Austin King [:ozten] from comment #21)
> Aurora (8.0a1) HTC Evo
> 
> Following STR in comment #5 I get 
> http://www.flickr.com/photos/ozten/6026742774/in/photostream
> 
> Cleared Data.
> 
> Following STR in comment #9, I could not reproduce.
> 
> Switched to Duetch. Again same error as above.

Unrelated; that is bug 674830.
(In reply to Alon Zakai (:azakai) from comment #18)
> The symptoms described here indicate that the child process is 'stuck', that
> is, not responding properly. However it is responding enough for the parent
> process to not know it is in a bad state (in which case the parent would
> have restarted it). We saw this in the past with threading problems in the
> audio code for example.

I don't know what state you're referring to here, so I don't know how we would detect or fix it.  Do you have a link to one of the old bugs you're referring to?

If anyone is able to reproduce this, we should attach gdb and see what's going on.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #23)
> (In reply to Alon Zakai (:azakai) from comment #18)
> > The symptoms described here indicate that the child process is 'stuck', that
> > is, not responding properly. However it is responding enough for the parent
> > process to not know it is in a bad state (in which case the parent would
> > have restarted it). We saw this in the past with threading problems in the
> > audio code for example.
> 
> I don't know what state you're referring to here, so I don't know how we
> would detect or fix it.  Have to a link to one of old bugs you're referring
> to?
> 
> If anyone is able reproduce this, we should attach gdb and see what's going
> on.

There's a handful of folks in qa that said they can reproduce.  What do you need from us?   You're welcome to borrow any device if you're local.
Sorry, remote.  I would like to know what the plugin-container process is doing.  If someone can attach gdb to plugin-container, then get the output of |thread apply all backtrace| and attach it here, that would help quite a bit.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #23)
> (In reply to Alon Zakai (:azakai) from comment #18)
> > The symptoms described here indicate that the child process is 'stuck', that
> > is, not responding properly. However it is responding enough for the parent
> > process to not know it is in a bad state (in which case the parent would
> > have restarted it). We saw this in the past with threading problems in the
> > audio code for example.
> 
> I don't know what state you're referring to here, so I don't know how we
> would detect or fix it.  Have to a link to one of old bugs you're referring
> to?
> 

Sorry for not being clearer. I finally managed to find the bug I meant before, bug 634407. The main thread deadlocked there, leading to exactly the same symptoms as in this bug (pages don't load, can't close tabs, etc. - child is frozen but not crashed).

Do we have any 'keepalive' type checks in the IPC code? (I mean, that the parent decides the child must be restarted if it doesn't respond to some periodic message?)

> If anyone is able reproduce this, we should attach gdb and see what's going
> on.

I agree that that is the way to go for this bug, so just to clarify, my questions above are more regarding a general approach to prevent such problems in the future. If you agree that some type of 'keepalive' check that would catch cases like this might make sense, I'll file a separate bug for that.
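
To make the keepalive idea concrete, here is a minimal Python sketch of the bookkeeping the parent would need. It is not taken from any Mozilla code: the class name `KeepaliveMonitor`, the `timeout_s` parameter, and the injectable clock are all hypothetical, and the hard parts raised in this thread (battery cost, legitimate long blocking waits) are deliberately left out.

```python
import time


class KeepaliveMonitor:
    """Tracks when the child last answered a ping (hypothetical sketch)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        # Treat creation as the first sign of life from the child.
        self.last_reply = clock()

    def on_reply(self) -> None:
        """Call whenever the child answers a periodic ping."""
        self.last_reply = self.clock()

    def child_is_hung(self) -> bool:
        """True if no reply has arrived within the timeout window."""
        return self.clock() - self.last_reply > self.timeout_s
```

The parent would send a ping on a timer, call `on_reply()` when the answer arrives, and restart the child once `child_is_hung()` returns true. Passing the clock in makes the timeout logic testable without real waiting.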
A keepalive check is a good idea, but seems to me like it would be hard to implement well (wrt battery, allowed blocking waits, etc.).

Firefox-on-desktop is vulnerable to these kinds of deadlock bugs too, but we haven't had the need to implement a main-thread keepalive or watchdog there yet (except for OOPP, but that's an easy problem).  I wonder why.  Maybe without content processes, these deadlocks are so bad as to be stop-the-world-and-fix, but with content processes things seem to /almost/ work well enough that perceived priority is lower.

Before trying something really hard (keepalive), maybe an easier solution would work well enough: if what we lack is feedback from users on these kinds of problems, maybe we can implement a "Kill content process" button, maybe just for unofficial builds.  The button would kill -11 the content process, which would give us a crash report and unwedge whatever the user was trying to do.  We could note in the crash report that it was generated by the "kill switch".

How hard would it be to add a kill switch to the UI?
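
As a rough illustration of what such a button would do underneath, the Python sketch below sends signal 11 (SIGSEGV) to a process, which is what `kill -11 <pid>` does on a POSIX system. The names `kill_content_process` and `demo`, and the shape of the returned annotation, are hypothetical; a real implementation would hand the annotation to the crash reporter, which is not modeled here.

```python
import os
import signal
import subprocess
import sys


def kill_content_process(pid: int) -> dict:
    """Send SIGSEGV to `pid` and return a note for the crash report.

    POSIX only. The annotation format is an assumption for illustration.
    """
    os.kill(pid, signal.SIGSEGV)
    return {"pid": pid, "reason": "kill switch"}


def demo() -> int:
    """Spawn a throwaway child, kill it, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    kill_content_process(child.pid)
    return child.wait(timeout=10)
```

On POSIX, `subprocess` reports a child terminated by a signal as a negative return code, so `demo()` is expected to return `-signal.SIGSEGV`.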
Attached patch: patch 1
This patch is for mozilla-beta.

Aaron was able to get a copy of his profile folder in the corrupted state. One thing I noticed in sessionstore.js was the "selected" index was out-of-bounds. It was greater than the stored number of tabs.

The session restore code wasn't protecting against that. I made a Fennec 6.0b5 build on Linux desktop. I used Aaron's sessionstore.js file and was able to experience the busted state.

Putting the "selected" check in the code kept the busted state from happening.

I do not know if this will fix the problem when running on a Android phone. I have not been able to reproduce it. Likewise, no one has been able to reproduce it using my test builds. However, it's a good start.
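
The kind of check described above can be sketched as follows. This is a Python illustration, not the patch itself (which is JavaScript in Fennec's session restore code). It assumes that `selected` is a 1-based index into the window's `tabs` list and that falling back to the first tab is an acceptable way to ignore a corrupt value; both are assumptions, and `clamp_selected` is a made-up name.

```python
import json


def clamp_selected(window: dict) -> dict:
    """Return a copy of a session window with an in-range 'selected' index.

    Assumption: 'selected' is 1-based, so valid values are 1..len(tabs).
    """
    tabs = window.get("tabs", [])
    selected = window.get("selected", 1)
    if not isinstance(selected, int) or not 1 <= selected <= len(tabs):
        # Ignore the corrupt value: first tab, or 0 when there are no tabs.
        selected = 1 if tabs else 0
    return {**window, "tabs": tabs, "selected": selected}


# A window shaped like the corrupt case: one stored tab, selected index 3.
corrupt = json.loads('{"tabs": [{"url": "about:home"}], "selected": 3}')
print(clamp_selected(corrupt)["selected"])  # 1
```

A valid file passes through unchanged, which is consistent with the claim that the check has no effect on a non-corrupt sessionstore.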
Assignee: nobody → mark.finkle
Attachment #552002 - Flags: review?(mbrubeck)
Comment on attachment 552002 [details] [diff] [review]
patch 1

r=mbrubeck

Nominating for approval-mozilla-beta.  This patch is mobile-only and extremely low-risk.  For a non-corrupt sessionstore file, it will have no effect.  For the specific corruption we were able to reproduce, it simply ignores the corrupt value.
Attachment #552002 - Flags: review?(mbrubeck)
Attachment #552002 - Flags: review+
Attachment #552002 - Flags: approval-mozilla-beta?
Built this with the patch: http://people.mozilla.org/~kbrosnan/tmp/676780/fennec-6.0.en-US.eabi-arm.apk
It is en-US only, but I think you can add language packs from the previous beta to test: http://ftp.mozilla.org/pub/mozilla.org/mobile/releases/latest-beta/linux/
(In reply to Kevin Brosnan [:kbrosnan] from comment #30)
> Built this with the patch
> http://people.mozilla.org/~kbrosnan/tmp/676780/fennec-6.0.en-US.eabi-arm.apk
> it is en-US only but I think you can add language packs from the previous
> beta to test
> http://ftp.mozilla.org/pub/mozilla.org/mobile/releases/latest-beta/linux/

I've installed several language packs and performed the STR, and it works fine. The patch seems to have fixed this issue.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #27)
> How hard would it be to add a kill switch to the UI?

I think it would be bad UX to have one, actually. We should check if content processes are alive routinely, just like we do with plugin processes, and kill/restart them (and send a hang report pair) if they don't react to keep-alive pings from the main process within a certain timeout.
Attached patch: patch for m-c
This patch is for mozilla-central. It adds the same "selected" clamp check. It also adds some better protection for failed session restores, making sure the browser is left in a "good" state.

(borrowed the session restore addition from Wes)
Attachment #552069 - Flags: review?(mbrubeck)
Attachment #552069 - Flags: feedback?(wjohnston)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #32)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #27)
> > How hard would it be to add a kill switch to the UI?
> 
> I think it would be bad UX to have one, actually. We should check if content
> processes are alive routinely, just like we do with plugin processes

I think there's a misunderstanding: (i) we don't do that for plugin processes; (ii) that's hard to do right; (iii) "kill switch" proposal is for nightlies, not release (not a UX concern).

But since the data we could gather with that wouldn't have helped identify this bug, no need to bother.
Blocks: 669289
Comment on attachment 552069 [details] [diff] [review]
patch for m-c

Review of attachment 552069 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine to me (but I wrote it). "fail" is probably a bit general. We could send back a real error code if we want to be fancy.
Attachment #552069 - Flags: feedback?(wjohnston) → feedback+
Comment on attachment 552002 [details] [diff] [review]
patch 1

We are taking this and will spin a build for QA to test. Land on mozilla-beta.
Attachment #552002 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Chris Jones [:cjones] [:warhammer] from comment #34)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #32)
> > (In reply to Chris Jones [:cjones] [:warhammer] from comment #27)
> > > How hard would it be to add a kill switch to the UI?
> > 
> > I think it would be bad UX to have one, actually. We should check if content
> > processes are alive routinely, just like we do with plugin processes
> 
> I think there's a misunderstanding: (i) we don't do that for plugin
> processes; (ii) that's hard to do right; (iii) "kill switch" proposal is for
> nightlies, not release (not a UX concern).
> 
> But since the data we could gather with that wouldn't have helped identify
> this bug, no need to bother.

We can add a kill switch for nightlies or as an addon, and that could help with manual investigation of future problems that have these symptoms, once we are aware of those specific problems.

But it wouldn't help prevent normal users from getting stuck with those symptoms, that is, with an unusable browser. And it wouldn't get us automatic reports from those users about that kind of problem. An automatic keepalive system could do both of those things I think.
tracking-fennec: ? → 6+
(In reply to Alon Zakai (:azakai) from comment #37)
> But it wouldn't help prevent normal users from getting stuck with those
> symptoms, that is, with an unusable browser. And it wouldn't get us
> automatic reports from those users about that kind of problem. An automatic
> keepalive system could do both of those things I think.

Yep.  My only concern is that a keepalive impl doesn't seem trivial at all, so would want commensurate benefit to attack.  It wouldn't have caught this bug, e.g.
Comment on attachment 552069 [details] [diff] [review]
patch for m-c

This mobile-only change is also needed on Aurora for Firefox 7 to fix this bug and bug 669289.  Both bugs render Firefox unusable and can cause data loss when the user needs to wipe their profile to continue.
Attachment #552069 - Flags: review?(mbrubeck)
Attachment #552069 - Flags: review+
Attachment #552069 - Flags: approval-mozilla-aurora?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #34)
> I think there's a misunderstanding: (i) we don't do that for plugin
> processes;

Erm, what else is it that kills plugins when they don't react and sends hang reports to us? Still, I feel like this belongs in a followup bug.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #42)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #34)
> > I think there's a misunderstanding: (i) we don't do that for plugin
> > processes;
> 
> Erm, what else is it that kills plugins when they don't react and sends hang
> reports to us? Still, I feel like this belongs into a followup bug.

That's a very narrow and simple mechanism: when the browser makes a blocking request to the plugin (on behalf of web content), and the plugin process doesn't reply within X seconds, then we declare it hung and invoke all that machinery.  If there were never any blocking requests browser-->plugin, we would never detect hangs within the current system.  A keepalive mechanism is quite different.
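
The difference from a keepalive can be seen in a small sketch of the mechanism described above: the hang is only noticed while a blocking request is outstanding. This Python version uses a thread pool to stand in for the plugin process; `blocking_request`, `PluginHang`, and `timeout_s` are hypothetical names, and none of this reflects the actual IPC code.

```python
import concurrent.futures
import time


class PluginHang(Exception):
    """Raised when a blocking request gets no reply within the timeout."""


def blocking_request(executor, handler, timeout_s: float):
    """Run `handler` on the executor and wait at most `timeout_s` for it.

    The executor stands in for the plugin process in this sketch.
    """
    future = executor.submit(handler)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No reply within X seconds: declare the plugin hung.
        raise PluginHang(f"no reply within {timeout_s} s") from None
```

If the browser never calls `blocking_request`, nothing ever times out, so a wedged process goes unnoticed. That is the gap a separate keepalive would have to cover.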
Filed bug 678073 to discuss the keepalive idea.
Blocks: 678159
Merged:
http://hg.mozilla.org/mozilla-central/rev/b7d5fd20d40a
No longer blocks: 678159
Status: NEW → RESOLVED
Closed: 13 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Whiteboard: [fennec 6.0b5]
Target Milestone: --- → Firefox 8
Version: Firefox 6 → Trunk
Blocks: 678159
Attachment #552069 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
http://hg.mozilla.org/releases/mozilla-aurora/rev/a3d8fc803419
Severity: major → blocker
tracking-fennec: 6+ → ---
Target Milestone: Firefox 8 → Firefox 6
Version: Trunk → Firefox 6
I'm able to reproduce this issue on Firefox 6 RC on a clean profile. I will reopen this bug.

--
Build id : Mozilla/5.0 (Android;Linux armv7l;rv:6.0)Gecko/20110811
Firefox/6.0 Fennec/6.0
Device: HTC Desire Z
OS: Android 2.3.3

Build id : Mozilla/5.0 (Android;Linux armv7l;rv:6.0)Gecko/20110811
Firefox/6.0 Fennec/6.0
Device: LG Optimus 2X
OS: Android 2.2
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Cristian Nicolae (:xti) from comment #48)
> I'm able to reproduce this issue on Firefox 6 RC on a clean profile. I will
> reopen this bug.
> 
> --
> Build id : Mozilla/5.0 (Android;Linux armv7l;rv:6.0)Gecko/20110811
> Firefox/6.0 Fennec/6.0
> Device: HTC Desire Z
> OS: Android 2.3.3
> 
> Build id : Mozilla/5.0 (Android;Linux armv7l;rv:6.0)Gecko/20110811
> Firefox/6.0 Fennec/6.0
> Device: LG Optimus 2X
> OS: Android 2.2

Can we reproduce this in-house?   Can we confirm the RC was built with this fix?
Yes, the RC was built with this fix. We aren't just seeing bug 678159?
(In reply to Christian Legnitto [:LegNeato] from comment #50)
> Yes, the RC was built with this fix. We aren't just seeing bug 678159?

AFAIK, they are the same bug. There are two parts to it: a corruption in the session profile, and a hang in the process (or something similar) when the session profile is corrupt.

1) The fix here covers the quitting part, but it is still possible to get into the state where remote pages won't load. With this fix, when you quit the app, you no longer have to clear the app's data to get Fennec working again.

2) However, the part where the session hangs still exists, so you can still hit that state. The workaround is to quit the app and restart it at that point.
I can reproduce this 100% of the time doing the following on the Samsung Galaxy SII/2.3.3

1. Clear profile of Firefox Beta 6 RC
2. Reboot device
3. Immediately following startup and back to home-screen, launch Firefox Beta
4. Change language, Russian, Restart
5. After first-run animation, tap awesome-bar, tap magnifying glass and tap Google

Steps in YouTube -> http://www.youtube.com/watch?v=hip4JFcrlYg
I was able to reproduce this issue without a change of languages but strictly with adding and disabling addons.

1. go to about:home
2. add cleary, clear mobile history, homeskin, Bigger Text, personas, Phony, quit Firefox for Mobile, URL Fixer, readability.
3. restart the app
4. once the app starts, quickly long-tap on about:home and select open in new tab
5. close the new tab while it's loading

Can't close the tab.
Force-quitting the app restores it from a backup of the session state. There is slight data loss (anything since the session state was last backed up), but that is better than before, when the user was forced to clear the app's data.
I can reproduce this only when there is one tab with an about: page, I can't reproduce when the first tab is a http:// url or there is more than 1 tab and one of the tabs is a http:// url.
Serious, but no RC blocker for me, given the circumstances it takes to reproduce the issue.
Added relnote keyword
Keywords: relnote
Stumbled upon another STR to reproduce this from testing out bug 686901

1. http://www.epsn.com (should load the mobile site)
2. Scroll to bottom, tap epsn.com (should load the desktop site)
3. On the top right, tap 'Sign In'

You should see a popup frame. When this happens, attempt to close the active tab or visit any new URL.
Status: REOPENED → NEW
tracking-fennec: --- → ?
Target Milestone: Firefox 6 → ---
Version: Firefox 6 → Firefox 9
In bug 686901, I mentioned a regression range for which the issue in comment 58 also falls.
tracking-fennec: ? → 9+
tracking-fennec: 9+ → +
There are no new patches for this bug, but automation thinks we should land the approved patches to aurora and beta, which we have already done.

I am closing this bug. Open a new bug if this is still an issue and we can triage it in the new bug.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED