Closed Bug 1016746 Opened 10 years ago Closed 10 years ago

A few hours monkey test,the homescreen turns up a black screen,there is a lot of activity-window under windows div

Categories

(Core :: DOM: Core & HTML, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32
blocking-b2g 1.3T+
Tracking Status
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: yang.zhao, Assigned: bkelly)

References

Details

(Keywords: regression, Whiteboard: [sprd313838])

Attachments

(5 files)

Attached image 2012-01-02-23-14-09.png
After a few hours monkey test ,several phones got the issue that the homescreen turns up a black screen .Press Home button can't make the phone return to homescreen.
From the app-manager ,you can see a lot of activity-window are active.
Here is the app-manager screenshot.
Hi,fabrice
Is it related with the inproc homescreen ?
Flags: needinfo?(james.zhang)
Flags: needinfo?(fabrice)
Flags: needinfo?(alive)
I guess it's because homescreen inproc -> homescreen's activity is inproc -> activitydone is not fired well at this case. But I am not sure.
Flags: needinfo?(alive)
Can we check which activity this is to help reproduce?
Flags: needinfo?(fabrice)
Here are some of the activity windows
ok thanks. I'll look at that tomorrow.
Assignee: nobody → fabrice
blocking-b2g: --- → 1.3?
If we can't fix it quickly, we'll enable home screen oop and cause performance issue again.
Assignee: fabrice → nobody
blocking-b2g: 1.3? → 1.3T?
Flags: needinfo?(james.zhang)
After these things happen,when I press the power key to lock the phone,the following log will be printed out:

E/GeckoConsole(  599): [JavaScript Error: "TypeError: this.frame is undefined" {file: "app://system.gaiamobile.org/js/window.js" line: 125}]

Press the power key to light the phone,then unlock it,the following log will be printed out:

E/GeckoConsole(  599): [JavaScript Error: "TypeError: this.frame is undefined" {file: "app://system.gaiamobile.org/js/window.js" line: 120}]
E/GeckoConsole(  599): [JavaScript Error: "NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [inIDOMUtils.setContentState]" {file: "chrome://global/content/BrowserElementPanning.js" line: 59}]
Blocks: 1014272
I found the following logs:
E GeckoConsole: [JavaScript Error: "TypeError: this.enableWithoutCover is not a function" {file: "app://system.gaiamobile.org/js/lockscreen.js" line: 446}]

hi,Anthony Ricaud
Since you removed this function in commit 449e0cd70ed999ffc4ae9e6152d57ab01193376d.It seems that you didn't remove clearly.Please help to check it.Thank you!
Flags: needinfo?(anthony)
blocking-b2g: 1.3T? → 1.3T+
Keywords: regression
I can't reproduce locally. Activities triggered both from the homescreen and from an oop app are closed as expected, in all cases (success, cancel or going back to the homescreen).

One thing I note is that in the screenshot from comment 4, activity frames are direct children of <div id="windows">. This is not what I see locally, where they are children of the wrapper div for the app that called MozActivity().

Alive, any idea how we can end up in this situation?
Flags: needinfo?(alive)
We can meet this issue every day with 12 hours monkey test.
Danny, do you also meet it on your side?
Flags: needinfo?(ttsai)
Flags: needinfo?(styang)
Flags: needinfo?(kkuo)
Flags: needinfo?(dliang)
(In reply to Fabrice Desré [:fabrice] from comment #9)
> I can't reproduce locally. Activities triggered both from the homescreen and
> from an oop app are closed as expected, in all cases (success, cancel or
> going back to the homescreen).
> 
> One thing I note is that in the screenshot from comment 4, activity frames
> are direct children of <div id="windows">. This is not what I see locally,
> where they are children of the wrapper div for the app that called
> MozActivity().
> 
> Alive, any idea how we can end up in this situation?

Good catch.
We render the activity inside caller in bug https://bugzilla.mozilla.org/show_bug.cgi?id=977934,
and it's also in 1.3t.
but in the screenshot it seems they didn't take this patch.

Yang, could you confirm?
Flags: needinfo?(alive)
(In reply to Fabrice Desré [:fabrice] from comment #9)
> I can't reproduce locally. Activities triggered both from the homescreen and
> from an oop app are closed as expected, in all cases (success, cancel or
> going back to the homescreen).
> 
> One thing I note is that in the screenshot from comment 4, activity frames
> are direct children of <div id="windows">. This is not what I see locally,
> where they are children of the wrapper div for the app that called
> MozActivity().
> 
> Alive, any idea how we can end up in this situation?

...And from the screenshot, check the parentapp property of the mozbrowser iframe: some is gallery and some is sms.
But it's also strange that this activity is not rendered inside gallery and sms.
Hi! Alive,

Could you help? Based on the log yang.zhao provided.

--
Keven
Flags: needinfo?(kkuo) → needinfo?(alive)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #11)
> (In reply to Fabrice Desré [:fabrice] from comment #9)
> > I can't reproduce locally. Activities triggered both from the homescreen and
> > from an oop app are closed as expected, in all cases (success, cancel or
> > going back to the homescreen).
> > 
> > One thing I note is that in the screenshot from comment 4, activity frames
> > are direct children of <div id="windows">. This is not what I see locally,
> > where they are children of the wrapper div for the app that called
> > MozActivity().
> > 
> > Alive, any idea how we can end up in this situation?
> 
> Good catch.
> We render the activity inside caller in bug
> https://bugzilla.mozilla.org/show_bug.cgi?id=977934,
> and it's also in 1.3t.
> but in the screenshot it seems they didn't take this patch.
> 
> Yang, could you confirm?

The patch in 977934 already landed on v1.3t,the commit is:1188d95f1c903098bfc9026efb9d5251566c0036
Be noted activity is found leak in bug 1007520.
Hi, could you see if the patch of bug 1007520 helps? I cannot reproduce locally too.
Flags: needinfo?(alive)
(In reply to Ting-Yu Chou [:ting] from comment #15)
> Be noted activity is found leak in bug 1007520.
Hi,Ting-Yu 
   Does the three parts of patch in bug 1007520 are all needed?If not,please tell me which patch should I use for testing this issue.Thank you!
Flags: needinfo?(tchou)
(In reply to James Zhang from comment #10)
> We can meet this issue every day with 12 hours monkey test.
> Danny, do you also meet it on your side?

We also met this issue, 4 devices all are black screen under 24 hr monkey test.
My build included part 1~3 of bug 1007520 and some debug patches, detail in https://github.com/dannyliang/tarako-patches
Flags: needinfo?(dliang)
Whiteboard: [sprd313838]
(In reply to yang.zhao from comment #17)
> Hi,Ting-Yu 
>    Does the three parts of patch in bug 1007520 are all needed?If not,please
> tell me which patch should I use for testing this issue.Thank you!

According to comment 18, the patches do not help.
Flags: needinfo?(tchou)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #16)
> Hi, could you see if the patch of bug 1007520 helps? I cannot reproduce
> locally too.

hi,Alive
  Is it possible that it's Bug 1014704 - '[B2G][Tarako][E.me] Everything.me apps are not adding to homescreen when selecting 'Add to Homescreen' causes the issue ?
Flags: needinfo?(alive)
(In reply to yang.zhao from comment #7)
> E/GeckoConsole(  599): [JavaScript Error: "TypeError: this.frame is
> undefined" {file: "app://system.gaiamobile.org/js/window.js" line: 120}]

Any idea that how can the frame be undefined? If an ActivityWindow is instantiated with a caller without frame, the containerElement will be "windows", which matches what Fabrice found in comment 9.
(In reply to Ting-Yu Chou [:ting] from comment #21)
> (In reply to yang.zhao from comment #7)
> > E/GeckoConsole(  599): [JavaScript Error: "TypeError: this.frame is
> > undefined" {file: "app://system.gaiamobile.org/js/window.js" line: 120}]
> 
> Any idea that how can the frame be undefined? If an ActivityWindow is
> instantiated with a caller without frame, the containerElement will be
> "windows", which matches what Fabrice found in comment 9.

Might be, but I really don't know why this will happen.
1. Does inproc homescreen causes this bug?
2. Could you log down something in activity_window?
Flags: needinfo?(alive)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #22)
> (In reply to Ting-Yu Chou [:ting] from comment #21)
> > (In reply to yang.zhao from comment #7)
> > > E/GeckoConsole(  599): [JavaScript Error: "TypeError: this.frame is
> > > undefined" {file: "app://system.gaiamobile.org/js/window.js" line: 120}]
> > 
> > Any idea that how can the frame be undefined? If an ActivityWindow is
> > instantiated with a caller without frame, the containerElement will be
> > "windows", which matches what Fabrice found in comment 9.
> 
> Might be, but I really don't know why this will happen.
> 1. Does inproc homescreen causes this bug?
> 2. Could you log down something in activity_window?

This issues happens after the patch for inproc homescreen,but I don't think we could say that inproc homescreen caused this bug.
     I reverted the patch in  Bug 1014704 ,and run monkey test on two phones,it seems that this issue didn't happen.
     If you want some log in activity_window,you can add some log,and I will help to run the monkey test again.
We have removed bug 1014704 with WIP patch.
(In reply to James Zhang from comment #24)
> We have removed bug 1014704 with WIP patch.

Not a good idea imho. You should rather take the followup in bug 1017331
Flags: needinfo?(yang.zhao)
(In reply to Fabrice Desré [:fabrice] from comment #25)
> (In reply to James Zhang from comment #24)
> > We have removed bug 1014704 with WIP patch.
> 
> Not a good idea imho. You should rather take the followup in bug 1017331

Already get bug 1014704 back.
Flags: needinfo?(yang.zhao)
Fabrice, wonder if you are the right person to take this bug? Thanks
Flags: needinfo?(fabrice)
We believe this is a Gecko issue and bkelly is working on a patch.
Assignee: nobody → bkelly
Flags: needinfo?(fabrice)
So our theory is that ContentParent::NotifyTabDestroyed() is being executed in the activity case which runs ContentParent::ShutDownProcess() from a runnable.  If ShutDownProcess() completes prior to ActorDestroy() then mMessageManager is set to null preventing the 'child-process-shutdown' message from being sent.

Typically the Close() call in ShutDownProcess() should trigger ActorDestroy().  It only does this once, though, setting a flag to avoid running ActorDestroy() again.

So the theory is we get this sequence:

1) ContentParent::NotifyTabDestroyed() is putting runnable on event queue
2) ShutDownProcess() is called as normal
3) ContentParent::Close() is called
4) MessageChannel::Close()
5) MessageChannel flushes pending runnables in event queue triggering ShutDownProcess() again via runnable from (1)
6) Second ShutdownProcess() call clears mMessageManager
7) ActorDestroy() runs from first ShutDownProcess() call and fails to send 'child-process-shutdown'
8) First ShutDownProcess() call completes

I have not been able to verify this actually occurs.  Danny, can you run an overnight monkey test with this patch plus the patches from bug 1007520?
Attachment #8431738 - Flags: review?(khuey)
Attachment #8431738 - Flags: feedback?(dliang)
(In reply to Ben Kelly (PTO, back Mon June 9) [:bkelly] from comment #29)
> Created attachment 8431738 [details] [diff] [review]
> Consistently send 'child-process-shutdown' from ContentParent destruction.
> (v0)
> 
> So our theory is that ContentParent::NotifyTabDestroyed() is being executed
> in the activity case which runs ContentParent::ShutDownProcess() from a
> runnable.  If ShutDownProcess() completes prior to ActorDestroy() then
> mMessageManager is set to null preventing the 'child-process-shutdown'
> message from being sent.
> 
> Typically the Close() call in ShutDownProcess() should trigger
> ActorDestroy().  It only does this once, though, setting a flag to avoid
> running ActorDestroy() again.
> 
> So the theory is we get this sequence:
> 
> 1) ContentParent::NotifyTabDestroyed() is putting runnable on event queue
> 2) ShutDownProcess() is called as normal
> 3) ContentParent::Close() is called
> 4) MessageChannel::Close()
> 5) MessageChannel flushes pending runnables in event queue triggering
> ShutDownProcess() again via runnable from (1)
> 6) Second ShutdownProcess() call clears mMessageManager
> 7) ActorDestroy() runs from first ShutDownProcess() call and fails to send
> 'child-process-shutdown'
> 8) First ShutDownProcess() call completes

My flush step in (5) may be bogus.

Instead it appears if the Close() encounters an error we re-enter ShutDownProcess() and call CloseWithError().  Since CloseWithError() is async we then proceed with steps 6 to 8 here.

In theory.  We need to test the patch.
The feedback I got on IRC was that due to deadlines, timezones, and the test turn-around, I should land this for now and mark the bug leave-open.  The patch should cause no harm, even if it does not fix the root issue here.
Keywords: leave-open
Comment on attachment 8431738 [details] [diff] [review]
Consistently send 'child-process-shutdown' from ContentParent destruction. (v0)

Requesting for b2g28_v1.3t, although I don't see a specific flag for that.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Possibly bug 895204
User impact if declined: Memory leak
Testing completed: Light device testing on v1.3t to verify nothing obvious broke.  Need long running monkey test to test if this really fixes the bug here.
Risk to taking this patch (and alternatives if risky): Very low.
String or UUID changes made by this patch: None

The patch needed some minor rebasing for v1.3t.  See attachment 8431739 [details] [diff] [review].
Attachment #8431738 - Flags: approval-mozilla-b2g28?
Attachment #8431738 - Flags: approval-mozilla-b2g28?
TBPL b2g-inbound looks relatively good for the commit in comment 33, but I am hesitant to push to b2g28_v1_3t before it merges to mozilla-central.  Also, I'm leaving on PTO basically now, so I would not be around to fix any unexpected breakage.

So I'd like to leave the uplift to b2g28_v1_3t to the sheriffs if possible.

Also, throwing this over to Kyle while I'm gone.

Summary of next steps is to get this in v1.3t branch and re-run the monkey test to see if it helps.  It would be nice to include the patches from bug 1007520 in the test as well.
Assignee: bkelly → khuey
Component: Gaia → DOM
Product: Firefox OS → Core
With below commit of bug 1017331 and revert nothing,
commit e68858693b71d917c9c5ee7e215f7ceea04635f7
Author: Fabrice Desre <fabrice@desre.org>
Date:   Thu May 29 21:24:21 2014 -0700

    Bug 1017331 - [Tarako] Trying to create a contact from the SMS results does not create a contact. r=alive

the result is positive on 24hrs monkey test w/ 4 devices. We don't find any black homescreen in monkey test.
(In reply to Ben Kelly (PTO, back Mon June 9) [:bkelly] from comment #29)
> Created attachment 8431738 [details] [diff] [review]
> Consistently send 'child-process-shutdown' from ContentParent destruction.
> (v0)
> 
> So our theory is that ContentParent::NotifyTabDestroyed() is being executed
> in the activity case which runs ContentParent::ShutDownProcess() from a
> runnable.  If ShutDownProcess() completes prior to ActorDestroy() then
> mMessageManager is set to null preventing the 'child-process-shutdown'
> message from being sent.
> 
> Typically the Close() call in ShutDownProcess() should trigger
> ActorDestroy().  It only does this once, though, setting a flag to avoid
> running ActorDestroy() again.
> 
> So the theory is we get this sequence:
> 
> 1) ContentParent::NotifyTabDestroyed() is putting runnable on event queue
> 2) ShutDownProcess() is called as normal
> 3) ContentParent::Close() is called
> 4) MessageChannel::Close()
> 5) MessageChannel flushes pending runnables in event queue triggering
> ShutDownProcess() again via runnable from (1)
> 6) Second ShutdownProcess() call clears mMessageManager
> 7) ActorDestroy() runs from first ShutDownProcess() call and fails to send
> 'child-process-shutdown'
> 8) First ShutDownProcess() call completes
> 
> I have not been able to verify this actually occurs.  Danny, can you run an
> overnight monkey test with this patch plus the patches from bug 1007520?

Monkey test has setup today, it included part 1~3 of bug 1007520, let's wait result tomorrow.
hi Danny, can you please udpate the results? Thanks
Flags: needinfo?(dliang)
Flags: needinfo?(styang)
(In reply to Danny Liang [:dliang] from comment #39)
> (In reply to Ben Kelly (PTO, back Mon June 9) [:bkelly] from comment #29)
> > Created attachment 8431738 [details] [diff] [review]
> > Consistently send 'child-process-shutdown' from ContentParent destruction.
> > (v0)
> > 
> > So our theory is that ContentParent::NotifyTabDestroyed() is being executed
> > in the activity case which runs ContentParent::ShutDownProcess() from a
> > runnable.  If ShutDownProcess() completes prior to ActorDestroy() then
> > mMessageManager is set to null preventing the 'child-process-shutdown'
> > message from being sent.
> > 
> > Typically the Close() call in ShutDownProcess() should trigger
> > ActorDestroy().  It only does this once, though, setting a flag to avoid
> > running ActorDestroy() again.
> > 
> > So the theory is we get this sequence:
> > 
> > 1) ContentParent::NotifyTabDestroyed() is putting runnable on event queue
> > 2) ShutDownProcess() is called as normal
> > 3) ContentParent::Close() is called
> > 4) MessageChannel::Close()
> > 5) MessageChannel flushes pending runnables in event queue triggering
> > ShutDownProcess() again via runnable from (1)
> > 6) Second ShutdownProcess() call clears mMessageManager
> > 7) ActorDestroy() runs from first ShutDownProcess() call and fails to send
> > 'child-process-shutdown'
> > 8) First ShutDownProcess() call completes
> > 
> > I have not been able to verify this actually occurs.  Danny, can you run an
> > overnight monkey test with this patch plus the patches from bug 1007520?
> 
> Monkey test has setup today, it included part 1~3 of bug 1007520, let's wait
> result tomorrow.

We still found the leak from content-parent by yesterday's monkey test.

DUT1: No leak from content-parent
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=12337, open channel, 0x48e51400, refcnt=16)
└──0 (100.0%) ── content-parent(Video, pid=12298, open channel, 0x474ea000, refcnt=157)
DUT2:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x444b2800, refcnt=1)
└──0 (100.0%) ── content-parent(Usage, pid=-1, closed channel, 0x43e8ec00, refcnt=1)
DUT3:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=14574, open channel, 0x46979000, refcnt=16)
├──0 (100.0%) ── content-parent(Clock, pid=13802, open channel, 0x46e15c00, refcnt=17)
└──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x4386d800, refcnt=1)
DUT4: No leak from content-parent
0 (100.0%) -- queued-ipc-messages
└──0 (100.0%) ── content-parent((Preallocated), pid=15919, open channel, 0x40b72c00, refcnt=16)
Flags: needinfo?(dliang)
Comment on attachment 8431738 [details] [diff] [review]
Consistently send 'child-process-shutdown' from ContentParent destruction. (v0)

By test result, there is still leak of content-parent.
Attachment #8431738 - Flags: feedback?(dliang)
Was this patch in the monkey test build?
Flags: needinfo?(dliang)
So, actually, the most recent log doesn't have any activity-window-N divs in it.  It appears we fixed the issue in comment 0 \o/

Let's move everything else back to bug 1007520.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(dliang)
Resolution: --- → FIXED
Comment on attachment 8431738 [details] [diff] [review]
Consistently send 'child-process-shutdown' from ContentParent destruction. (v0)

We should take this on 1.4.  It fixes a leak that happens during an edge case in shutdown of app processes.  Those leaks can build up in the parent process and slow down the phone.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: Possible leaks if apps shut down in strange ways.
Testing completed: Tested on tarako and m-c.
Risk to taking this patch (and alternatives if risky): Low risk
String or UUID changes made by this patch: None
Attachment #8431738 - Flags: approval-mozilla-b2g30?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #43)
> Was this patch in the monkey test build?

Yes, I based on gecko commit 426342f00390d0bd2831e782245dd035477bbfcd which included this patch.

commit 426342f00390d0bd2831e782245dd035477bbfcd
Author: B2G Bumper Bot <release+b2gbumper@mozilla.com>
Date:   Mon Jun 2 19:19:26 2014 -0700

    Bumping manifests a=b2g-bump

commit 78a21287b3cdab6e4c97ac1b9be2856945e441cd
Author: Ben Kelly <ben@wanderview.com>
Date:   Mon Jun 2 19:08:09 2014 -0700

    Bug 1016746: Consistently send 'child-process-shutdown' from ContentParent destruction. r=khuey a=1.3T+
Flags: needinfo?(anthony)
Keywords: leave-open
Target Milestone: --- → mozilla32
Please provide testing details and explain technical risks.
Flags: needinfo?(praghunath)
Attachment #8431738 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g30+
Flags: needinfo?(praghunath)
Please run dolphin monkey test to verify.
Flags: needinfo?(yang.zhao)
(In reply to James Zhang (Spreadtrum) from comment #50)
> Please run dolphin monkey test to verify.
It seems no black screen but have other issues when running monkey test,such as bug 1019419.
Flags: needinfo?(yang.zhao)
Flags: needinfo?(ttsai)
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: