Closed Bug 988110 Opened 10 years ago Closed 10 years ago

[tarako] touch event filtering can get stuck when up/down actions are unbalanced

Categories

(Core :: IPC, defect, P1)

28 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla31
blocking-b2g 1.3T+
Tracking Status
b2g-v1.3T --- fixed

People

(Reporter: yaoyao.wu, Assigned: viralwang)

References

Details

(Keywords: regression, Whiteboard: [priority])

Attachments

(16 files, 2 obsolete files)

14.44 KB, text/plain
Details
801.53 KB, application/x-bzip
Details
123.15 KB, image/png
Details
130.75 KB, text/x-log
Details
127.13 KB, text/x-log
Details
7.56 KB, text/plain
Details
130.07 KB, text/x-log
Details
15.84 KB, image/jpeg
Details
9.06 KB, image/png
Details
1.08 MB, application/zip
Details
1.14 MB, video/quicktime
Details
31.61 KB, application/zip
Details
4.40 MB, video/quicktime
Details
1.44 MB, application/zip
Details
6.11 MB, application/zip
Details
2.61 KB, patch
Details | Diff | Splinter Review
changeset from 191
=========================
gaia
commit 943ff7984dbb8fbe833c3daef18b09a62c139c1e
Merge: ea0653b f513242
Author: Ryan VanderMeulen <ryanvm@gmail.com>
Date:   Fri Mar 21 10:53:54 2014 -0400

    Merge branch 'v1.3' into v1.3t

commit ea0653b1329d78a569f2e9a19f763ad8881449e2
Merge: 1917b47 e33d80e
Author: Ryan VanderMeulen <rvandermeulen@mozilla.com>
Date:   Fri Mar 21 08:40:25 2014 -0400

    Merge pull request #17422 from timdream/keyboard-mempressure-race
    
    Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd

commit 1917b47b7099de1fab2c9786ae04219122966049
Merge: a170c6f 28e0261
Author: Yuren Ju <yurenju@gmail.com>
Date:   Fri Mar 21 18:25:08 2014 +0800

    Merge pull request #17433 from evanxd/v1.3t-bug-986345
    
    Bug 986345 - Fix CI_ACTION=marionette_js failure on 1.3/1.3t branch. a=testonly

commit f5132420bef0d4dc27dfcb32bb9c5df2ab80530a
Merge: 864957c e66437c
Author: Yuren Ju <yurenju@gmail.com>
Date:   Fri Mar 21 18:25:00 2014 +0800

    Merge pull request #17432 from evanxd/v1.3-bug-986345
    
    Bug 986345 - Fix CI_ACTION=marionette_js failure on 1.3/1.3t branch. a=testonly

commit a170c6f78a2a710e843e68243691ac39746d3456
Merge: a6919e4 ea4731f
Author: Yuren Ju <yurenju@gmail.com>
Date:   Fri Mar 21 18:16:42 2014 +0800

    Merge pull request #17436 from yurenju/wappush-v1.3t
    
    Bug 986380 - fix unit test case "auth_test.js" in wappush app a=testonly

commit 864957ce10921155f702aa60413fddcf0c6b078e
Merge: 001f2e6 5b9c7ff
Author: Yuren Ju <yurenju@gmail.com>
Date:   Fri Mar 21 18:08:20 2014 +0800

    Merge pull request #17435 from yurenju/wappush
    
    Bug 986380 - fix unit test case "auth_test.js" in wappush app a=testonly

commit e66437c5522499cd98d349081691349d29bce4b4
Author: Evan Xd <itoyxd@gmail.com>
Date:   Fri Mar 21 17:17:25 2014 +0800

    Bug 986345 - Add black list on Travis.
Summary: [tarako] B2G_NOOPT=1 build, monkey test无响应 → [tarako] B2G_NOOPT=1 build, 测试2个小时左右会出现monkey test无响应的问题
Summary: [tarako] B2G_NOOPT=1 build, 测试2个小时左右会出现monkey test无响应的问题 → [tarako] gaia no respond after monkey test start about 2 hours
Flags: needinfo?(ttsai)
Flags: needinfo?(ehung)
Attached file modification.txt
Spreadtrum hudson#191 build first met this issue, and the attachment is change list.

We can still meet this issue in latest hudson build. It block our monkey test now.
Do we know in which state gaia is blocked? can we get a screenshot?
(In reply to Fabrice Desré [:fabrice] from comment #2)
> Do we know in which state gaia is blocked? can we get a screenshot?

Homescreen.
Yaoyao, please provide slog here. Thanks.
Flags: needinfo?(yaoyao.wu)
Add Yang here for analyzing.
Attached file hudson#198 slog
Flags: needinfo?(yaoyao.wu)
Can you attach a debugger to the hanging phone? It would be great to get stack traces for the hang situation. And maybe ps ax.
(In reply to Andreas Gal :gal from comment #6)
> Can you attach a debugger to the hanging phone? It would be great to get
> stack traces for the hang situation. And maybe ps ax.

Hi Andreas,
   Please see Yaoyao's attachment. ADB works well and no gecko/gaia error log, adb shell getevent can show touch panel log, home/power key also can work, so I think it's gaia issue.
Yang, please tell yaoyao how to run local monkey test and use gdb to attach homescreen process and print backtrace, thanks.
Flags: needinfo?(yaoyao.wu)
Flags: needinfo?(yang.zhao)
We can't use gdb to attach hudson build, please build in local PC and ./run-gdb.sh attach b2g-process or homescreen-process, and get backtrace and comment here.
Thanks.
Evelyn, can you please update to this bug on what you suggested over email to our partner? Thanks
Flags: needinfo?(yaoyao.wu)
Flags: needinfo?(yang.zhao)
Flags: needinfo?(ttsai)
Joe, please traige this bug. Its unconfirmed and its in the wrong component.
Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd
Bug 959966: prevent the preallocated process from being killed by the lowmem killer by rasing its priority. r=khuey,gsvelto a=1.3t+

I recommend temporarily backing out these two patches to see if that fixes the problem.
(In reply to Joe Cheng [:jcheng] from comment #10)
> Evelyn, can you please update to this bug on what you suggested over email
> to our partner? Thanks

My suggestion over email aligned with comment 12, I also suspect these two bugs. Please try to remove these two patches and try your monkey test again.
Flags: needinfo?(ehung)
Run the monkey test,after about 3 hours,the phone stayed in a browser page and had no response at all.
adb shell b2g-info
                                     |     megabytes     |
           NAME   PID  PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
            b2g 13504     1 2429.5    0 49.7 51.3 53.5 179.6       0 root
         (Nuwa) 13556 13504    5.3    0  0.2  0.9  2.1  50.6       0 root
(Preallocated a 18276 13556    0.7   18  7.1  8.7 11.0  57.6       1 root

System memory info:

            Total 97.5 MB
     Used - cache 80.5 MB
  B2G procs (PSS) 60.8 MB
    Non-B2G procs 19.7 MB
     Free + cache 17.0 MB
             Free  2.2 MB
            Cache 14.8 MB

Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6 18432 KB
        8 18432 KB
       10 24576 KB
And backtrace in gdb is:

#0  ~nsCOMPtr_base (this=0xbed22dd0, __in_chrg=<value optimized out>) at /home/yangzhao/B2G/B2G6821_test/gecko/media/webrtc/signaling/./src/peerconnection/PeerConnectionImpl.h:629
#1  0x413a2ba0 in ~nsCOMPtr (aNode=0x4678be70) at ../../../dist/include/nsCOMPtr.h:473
#2  nsContentUtils::IsUserFocusIgnored (aNode=0x4678be70) at /home/yangzhao/B2G/B2G6821_test/gecko/content/base/src/nsContentUtils.cpp:5925
#3  0x4142785e in nsEventStateManager::PostHandleEvent (this=0x460a3d40, aPresContext=<value optimized out>, aEvent=0xbed23590, aTargetFrame=0x47f16580, aStatus=0xbed2309c)
    at /home/yangzhao/B2G/B2G6821_test/gecko/content/events/src/nsEventStateManager.cpp:3231
#4  0x415ff574 in PresShell::HandleEventInternal (this=0x44120d60, aEvent=0xbed23590, aStatus=<value optimized out>)
    at /home/yangzhao/B2G/B2G6821_test/gecko/layout/base/nsPresShell.cpp:7149
#5  0x415ff704 in PresShell::HandlePositionedEvent (this=0x44120d60, aTargetFrame=0x4678be70, aEvent=0xbed23590, aEventStatus=0xbed2309c)
    at /home/yangzhao/B2G/B2G6821_test/gecko/layout/base/nsPresShell.cpp:6867
#6  0x415ffddc in PresShell::HandleEvent (this=<value optimized out>, aFrame=0x47f16580, aEvent=0xbed23590, aDontRetargetEvents=<value optimized out>, aEventStatus=0xbed2309c)
    at /home/yangzhao/B2G/B2G6821_test/gecko/layout/base/nsPresShell.cpp:6666
#7  0x4138ccbe in nsViewManager::DispatchEvent (this=<value optimized out>, aEvent=0xbed23590, aView=<value optimized out>, aStatus=<value optimized out>)
    at /home/yangzhao/B2G/B2G6821_test/gecko/view/src/nsViewManager.cpp:753
#8  0x4138b9c4 in nsView::HandleEvent (this=<value optimized out>, aEvent=0xbed23590, aUseAttachedEvents=<value optimized out>)
    at /home/yangzhao/B2G/B2G6821_test/gecko/view/src/nsView.cpp:1084
#9  0x4122dfbc in nsWindow::DispatchEvent (this=<value optimized out>, aEvent=0x0, aStatus=@0xbed230c4) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/gonk/nsWindow.cpp:478
#10 0x4122e956 in nsWindow::DispatchInputEvent (aEvent=..., aWasCaptured=0x0) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/gonk/nsWindow.cpp:282
#11 0x4122d67a in sendMouseEvent (this=0xbed230e8) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/gonk/nsAppShell.cpp:175
#12 GeckoInputDispatcher::dispatchOnce (this=0xbed230e8) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/gonk/nsAppShell.cpp:599
#13 0x4122c514 in nsAppShell::ProcessNextNativeEvent (this=0x435554c0, mayWait=<value optimized out>) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/gonk/nsAppShell.cpp:927
#14 0x41235126 in nsBaseAppShell::DoProcessNextNativeEvent (this=0xbed23590, mayWait=false, recursionDepth=0)
    at /home/yangzhao/B2G/B2G6821_test/gecko/widget/xpwidgets/nsBaseAppShell.cpp:137
#15 0x412351ec in nsBaseAppShell::OnProcessNextEvent (this=0x435554c0, thr=0x4031e8e0, mayWait=false, recursionDepth=0)
    at /home/yangzhao/B2G/B2G6821_test/gecko/widget/xpwidgets/nsBaseAppShell.cpp:283
#16 0x40d4fb7c in nsThread::ProcessNextEvent (this=0x4031e8e0, mayWait=false, result=0xbed237df) at /home/yangzhao/B2G/B2G6821_test/gecko/xpcom/threads/nsThread.cpp:585
#17 0x40d229c0 in NS_ProcessNextEvent (thread=0x435554c0, mayWait=false) at /home/yangzhao/B2G/B2G6821_test/gecko/xpcom/glue/nsThreadUtils.cpp:263
#18 0x40e7b4b0 in mozilla::ipc::MessagePump::Run (this=0x40302d30, aDelegate=0x4034c0c0) at /home/yangzhao/B2G/B2G6821_test/gecko/ipc/glue/MessagePump.cpp:85
#19 0x40e71458 in MessageLoop::RunInternal (this=0x0) at /home/yangzhao/B2G/B2G6821_test/gecko/ipc/chromium/src/base/message_loop.cc:222
#20 0x40e714d6 in MessageLoop::RunHandler (this=0x4034c0c0) at /home/yangzhao/B2G/B2G6821_test/gecko/ipc/chromium/src/base/message_loop.cc:215
#21 MessageLoop::Run (this=0x4034c0c0) at /home/yangzhao/B2G/B2G6821_test/gecko/ipc/chromium/src/base/message_loop.cc:189
#22 0x41234bdc in nsBaseAppShell::Run (this=0x435554c0) at /home/yangzhao/B2G/B2G6821_test/gecko/widget/xpwidgets/nsBaseAppShell.cpp:161
#23 0x41867d60 in nsAppStartup::Run (this=0x43560130) at /home/yangzhao/B2G/B2G6821_test/gecko/toolkit/components/startup/nsAppStartup.cpp:276
#24 0x4183f9e8 in XREMain::XRE_mainRun (this=0xbed23a14) at /home/yangzhao/B2G/B2G6821_test/gecko/toolkit/xre/nsAppRunner.cpp:4059
#25 0x418423ba in XREMain::XRE_main (this=0xbed23a14, argc=<value optimized out>, argv=<value optimized out>, aAppData=0x21170)
    at /home/yangzhao/B2G/B2G6821_test/gecko/toolkit/xre/nsAppRunner.cpp:4127
#26 0x41842524 in XRE_main (argc=1, argv=0xbed25bf4, aAppData=0x21170, aFlags=<value optimized out>) at /home/yangzhao/B2G/B2G6821_test/gecko/toolkit/xre/nsAppRunner.cpp:4337
#27 0x000098de in do_main (argc=1, argv=0xbed25bf4) at /home/yangzhao/B2G/B2G6821_test/gecko/b2g/app/nsBrowserApp.cpp:163
#28 main (argc=1, argv=0xbed25bf4) at /home/yangzhao/B2G/B2G6821_test/gecko/b2g/app/nsBrowserApp.cpp:256
(In reply to Evelyn Hung [:evelyn] from comment #13)
> (In reply to Joe Cheng [:jcheng] from comment #10)
> > Evelyn, can you please update to this bug on what you suggested over email
> > to our partner? Thanks
> 
> My suggestion over email aligned with comment 12, I also suspect these two
> bugs. Please try to remove these two patches and try your monkey test again.

Yaoyao will provide test result of remove patch build.
Flags: needinfo?(yaoyao.wu)
Doesn't seem to be affected by the JavaScript engine, giving a quick glance at the backtrace. Thanks for checking whether tarako bugs are provoked by the JavaScript engine or not before triaging them in this component.
Component: JavaScript Engine → General
Product: Core → Firefox OS
Version: 28 Branch → unspecified
Assignee: nobody → yaoyao.wu
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Component: General → Gaia
triage: minus for now, please renom if we have further information
blocking-b2g: 1.3T? → -
(In reply to Evelyn Hung [:evelyn] from comment #13)
> (In reply to Joe Cheng [:jcheng] from comment #10)
> > Evelyn, can you please update to this bug on what you suggested over email
> > to our partner? Thanks
> 
> My suggestion over email aligned with comment 12, I also suspect these two
> bugs. Please try to remove these two patches and try your monkey test again.
hi,Evelyn
    I reverted these two bugs:
Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd
Bug 959966: prevent the preallocated process from being killed by the lowmem killer by rasing its 

     It seems ok after running over a night monkey test.I'll test it once again.
Flags: needinfo?(ehung)
Our QA has 60% reproduce rate, please check these revert commit.

Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd
Bug 959966: prevent the preallocated process from being killed by the lowmem killer by rasing its
blocking-b2g: - → 1.3T?
Need Gaia help, we still meet this issue when monkey test.
We need connect nightly to debug this issue. 
But currently there is a prompt window when nightly connect!!!
Flags: needinfo?(ttsai)
Flags: needinfo?(styang)
Flags: needinfo?(kkuo)
Flags: needinfo?(gal)
We can meet this issue every day.
Depends on: 989234
Hi! Tim, Cervantes,

According to partner, the monkey test can go smoothly with reverting following patches.

Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd
Bug 959966: prevent the preallocated process from being killed by the lowmem killer by rasing its 

Need your help to check those patches again. Thanks

--
Keven
Flags: needinfo?(timdream)
Flags: needinfo?(kkuo)
Flags: needinfo?(cyu)
Can you please check with patch breaks the release? And then lets re-open and back out the offending patch.
Flags: needinfo?(gal)
:dwi2, please also monitor this issue in China next week. thanks.
Flags: needinfo?(styang) → needinfo?(tzhuang)
(In reply to yang.zhao from comment #19)
> (In reply to Evelyn Hung [:evelyn] from comment #13)
> > (In reply to Joe Cheng [:jcheng] from comment #10)
> > > Evelyn, can you please update to this bug on what you suggested over email
> > > to our partner? Thanks
> > 
> > My suggestion over email aligned with comment 12, I also suspect these two
> > bugs. Please try to remove these two patches and try your monkey test again.
> hi,Evelyn
>     I reverted these two bugs:
> Bug 984238 - Don't kill the keyboard when when it's being launched +shepherd
> Bug 959966: prevent the preallocated process from being killed by the lowmem
> killer by rasing its 
> 
>      It seems ok after running over a night monkey test.I'll test it once
> again.

Today's monkey test is not ok.The phone stays on lockscreen ,and couldn't unlock the phone.There is no response at all.From the nightly,you can see that when you try to unlock the phone,there is no "touched" class added to lockscreen.But in a normal phone,there will add a class named "touched" when you put your finger on the unlock area.
Tim can you guys look at this?
One of the phone with the latest version is not ok.You could unlock the phone normally,but the homescreen has no response.From the nightly,I found that there were two "<div id="homescreen"...></div>“,the second one is empty.When I set "disply:none" to the second div,the phone come back to normal.Please see the picture in the attachment.
I think maybe there is something wrong in window_manager control.
Thanks for the diagnosis.
(In reply to Andreas Gal :gal from comment #24)
> Can you please check with patch breaks the release? And then lets re-open
> and back out the offending patch.

We can reproduce this issue today without these patch. So we should fix it by Yang's nightly debug information.
Assignee: yaoyao.wu → anygregor
Component: Gaia → Gaia::System::Window Mgmt
Alive may know what happened in comment 28.
Flags: needinfo?(ehung) → needinfo?(alive)
I remember we workaround a similar issue on v1.1hd before. Window manager before v1.4 is not robust enough to deal with randomly creating/switching windows by monkey test. As I remember, workaround in v1.1hd was - we force to clean up those windows which are in an unknown states in window manager while every visibility change (window transition happens).

Luke can find the patch we had on v1.1hd .
Flags: needinfo?(lchang)
The patch in bug 959966 is just priority changes to the preallocated process to make it less likely to fall victim to the lowmem killer. This should not have impact to functionality.
Flags: needinfo?(cyu)
(In reply to yang.zhao from comment #28)
> Created attachment 8399055 [details]
> two div with the id= homescreen
> 
> One of the phone with the latest version is not ok.You could unlock the
> phone normally,but the homescreen has no response.From the nightly,I found
> that there were two "<div id="homescreen"...></div>“,the second one is
> empty.When I set "disply:none" to the second div,the phone come back to
> normal.Please see the picture in the attachment.
> I think maybe there is something wrong in window_manager control.

I don't know how to do so called monkey test, but you could do some debugging here:
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L149

The view is injected by render() and it's called by
1. constructor
2. ensure
3. restart

Since one of the #homescreen div is empty I guess render is called twice and kill()[https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L142] doesn't happen for some reason.
Flags: needinfo?(alive)
What exactly we are looking at in this bug?
Flags: needinfo?(timdream)
(In reply to Evelyn Hung [:evelyn] from comment #32)
> I remember we workaround a similar issue on v1.1hd before. Window manager
> before v1.4 is not robust enough to deal with randomly creating/switching
> windows by monkey test. As I remember, workaround in v1.1hd was - we force
> to clean up those windows which are in an unknown states in window manager
> while every visibility change (window transition happens).
> 
> Luke can find the patch we had on v1.1hd .

Bug 921829 is what Evelyn mentioned. However, the patch I attached in that bug is still not working for that issue so I'm not sure it's workable in this case.
Flags: needinfo?(lchang)
(In reply to Andreas Gal :gal from comment #29)
> Thanks for the diagnosis.
One phone with the latest version stays in lockscreen and has no response after running monkey test again.And when I try to unlock the phone ,the "touchstart" event hasn't been triggered and handled by handleEvent in lockscreen.js ,but you can see the touchevent does received from 'adb shell getevent -l'.So maybe something wrong in gecko and cause the event couldn't be passed to gaia,I guess.
triage: we should not block before we have more actionable information.
let's not block on this before we have more information. thanks
blocking-b2g: 1.3T? → -
(In reply to yang.zhao from comment #37)
> (In reply to Andreas Gal :gal from comment #29)
> > Thanks for the diagnosis.
> One phone with the latest version stays in lockscreen and has no response
> after running monkey test again.And when I try to unlock the phone ,the
> "touchstart" event hasn't been triggered and handled by handleEvent in
> lockscreen.js ,but you can see the touchevent does received from 'adb shell
> getevent -l'.So maybe something wrong in gecko and cause the event couldn't
> be passed to gaia,I guess.

 Reproduced again.From the nightly,I can see the homescreen div is recreated all the time."watch adb shell b2g-ps",you can see the homescreen process restarted again and again.
The b2g-info is as follows:
                            |     megabytes     |
           NAME   PID  PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER    
            b2g   776     1 4895.8    0 31.4 33.4 37.0 160.1       0 root    
 Communications  4919     1  619.5    0  0.6  1.3  3.1  82.9       1 app_4919
         (Nuwa) 15409   776    6.6    0  0.8  3.4  7.5  50.5       0 root    
(Preallocated a 17056 15409    0.5   18  4.5  7.2 11.4  55.5       2 root    

System memory info:

            Total 98.5 MB
     Used - cache 71.9 MB
  B2G procs (PSS) 45.2 MB
    Non-B2G procs 26.7 MB
     Free + cache 26.6 MB
             Free  8.9 MB
            Cache 17.7 MB

Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6 18432 KB
        8 18432 KB
       10 24576 KB
Does anyone can tell me why the homescreen couldn't be created sucessfully,and why the process restarted again and again?
Flags: needinfo?(james.zhang)
Flags: needinfo?(alive)
Run monkey test with latest version,and the phone has no response after about 1 hour.
The b2g info:
APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              root      691   1     151776 38720 ffffffff 40033518 S /system/b2g/b2g
(Nuwa)           root      759   691   51688  752   ffffffff 4010c518 S /system/b2g/plugin-container
Homescreen       app_4519  4519  759   69556  12100 ffffffff 4010c518 S /system/b2g/plugin-container
Usage            app_5106  5106  759   64176  16920 ffffffff 4010c518 S /system/b2g/plugin-container

b2g:
#0  js::VisitGrayWrapperTargets (zone=0x1, callback=0x466368c0, closure=0x421e8d10)
    at /home/bingjiechen/B2G6821/gecko/js/src/jsfriendapi.cpp:662
#1  0x423c32ac in mozilla::dom::UIEventBinding::InterfaceObjectClass ()
   from /home3/bingjiechen/B2G6821/objdir-gecko/dist/bin/libxul.so
#2  0x423c32ac in mozilla::dom::UIEventBinding::InterfaceObjectClass ()
   from /home3/bingjiechen/B2G6821/objdir-gecko/dist/bin/libxul.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


Homescreen:
#0  epoll_wait () at bionic/libc/arch-arm/syscalls/epoll_wait.S:10
#1  0x40096fd4 in __wrap_epoll_wait (epfd=<value optimized out>, events=<value optimized out>, maxevents=<value optimized out>, 
    timeout=<value optimized out>) at /home/bingjiechen/B2G6821/gecko/mozglue/build/Nuwa.cpp:962
#2  0x40c3dc2c in virtual thunk to GeckoInputReaderPolicy::~GeckoInputReaderPolicy() ()
    at /home/bingjiechen/B2G6821/gecko/widget/gonk/nsAppShell.cpp:967
#3  0x00000000 in ?? ()
Flags: needinfo?(yaoyao.wu)
Do you have the phone still in this state? Could I come over and take a look?
(In reply to Andreas Gal :gal from comment #42)
> Do you have the phone still in this state? Could I come over and take a look?

Yes,the phone still in this state.You can come to take a look.Thank you!
(In reply to Andreas Gal :gal from comment #42)
> Do you have the phone still in this state? Could I come over and take a look?

I'm in Shanghai now,and my seat is near James.I'll keep the phone till you come here.
Actually I won't make it to you in time and to my next meeting so we will have to do this over bugzilla. Can you please post a full stack, then continue, and then break again, and post a stack again? I am trying to see whether Gecko is hanging in VisitGrayWrapperTargets or whether it was a coincidence that we broke in there.
Could we have the adb logcat and dmesg of the phone that is restarting the homescreen repeatedly?
Attached file dmesg.log
Attached file logcat.log
Flags: needinfo?(cyu)
Hi! Cervantes,

Both logcat and dmesg are captured under situation of restarting homescreen repeatly.
Nuwa is still alive.

--
Keven
It seems we have two bugs here. A homescreen restart and a homescreen hang. Please split the bug in 2 and assign people to investigate both cases.
We can't not reproduce the homescreen keep restarting issue.
From log, memory is not hit OOM criteria, so it's not because Homescreen is killed and system app's trying to restart it. Cervantes is helping on this restarting issue.

At the same time, we may be able to reproduce two homescreen <div> case after our monkey test, looking into detail now.
(In reply to Andreas Gal :gal from comment #50)
> It seems we have two bugs here. A homescreen restart and a homescreen hang.
> Please split the bug in 2 and assign people to investigate both cases.

agree, I will file a bug for two homescreen div, and leave this one for homescreen restarting issue.
See Also: → 990365
blocking-b2g: - → 1.3T?
(In reply to Evelyn Hung [:evelyn] from comment #52)
> (In reply to Andreas Gal :gal from comment #50)
> > It seems we have two bugs here. A homescreen restart and a homescreen hang.
> > Please split the bug in 2 and assign people to investigate both cases.
> 
> agree, I will file a bug for two homescreen div, and leave this one for
> homescreen restarting issue.

bug 990365 filed.
May I know where to get your monkey test build? I found our QA doesn't have this kind of build, so it's hard to reproduce these issues.
Flags: needinfo?(yang.zhao)
(In reply to Evelyn Hung [:evelyn] from comment #54)
> May I know where to get your monkey test build? I found our QA doesn't have
> this kind of build, so it's hard to reproduce these issues.

I build this in my local machine so that I could use gdb.I updated the code  yesterday.
Flags: needinfo?(yang.zhao)
(In reply to yang.zhao from comment #55)
> (In reply to Evelyn Hung [:evelyn] from comment #54)
> > May I know where to get your monkey test build? I found our QA doesn't have
> > this kind of build, so it's hard to reproduce these issues.
> 
> I build this in my local machine so that I could use gdb.I updated the code 
> yesterday.

And run "./_monkey.sh test-config/6821-hudson-config test" under folder monkey.test ,the time is 12 hours,so I run it in the night.
(In reply to yang.zhao from comment #56)
> (In reply to yang.zhao from comment #55)
> > (In reply to Evelyn Hung [:evelyn] from comment #54)
> > > May I know where to get your monkey test build? I found our QA doesn't have
> > > this kind of build, so it's hard to reproduce these issues.
> > 
> > I build this in my local machine so that I could use gdb.I updated the code 
> > yesterday.
> 
> And run "./_monkey.sh test-config/6821-hudson-config test" under folder
> monkey.test ,the time is 12 hours,so I run it in the night.

Thanks, we are trying to make a monkey build which all settings (remote debug + DEVICE_DEBUG=1) are enabled. So once we can reproduce the issue, we are able to look into details.
From your dmesg log:


cervantes@cyu-dt:/tmp$ grep Homescreen dmesg.log

(many lines removed for brevity)

<4>0[48789.036494] select 30361 (Homescreen), adj 8, size 4744, to kill
<4>0[48789.036570] send sigkill to 30361 (Homescreen), adj 8, size 4744
<4>0[48806.850460] select 30383 (Homescreen), adj 8, size 5433, to kill
<4>0[48806.850525] send sigkill to 30383 (Homescreen), adj 8, size 5433
<4>0[48813.106587] select 30422 (Homescreen), adj 8, size 6172, to kill
<4>0[48813.106658] send sigkill to 30422 (Homescreen), adj 8, size 6172
<4>0[48837.716777] select 30457 (Homescreen), adj 8, size 5227, to kill
<4>0[48837.716849] send sigkill to 30457 (Homescreen), adj 8, size 5227
<4>0[48845.512687] select 30495 (Homescreen), adj 8, size 5988, to kill
<4>0[48845.512755] send sigkill to 30495 (Homescreen), adj 8, size 5988
<6>0[48883.209845] [30536] 40536 30536    17528     4285   0       2           134 Homescreen
<4>0[49190.955490] select 30536 (Homescreen), adj 8, size 5376, to kill
<4>0[49190.955558] send sigkill to 30536 (Homescreen), adj 8, size 5376
<4>0[49220.400303] select 30724 (Homescreen), adj 8, size 5692, to kill
<4>0[49220.400375] send sigkill to 30724 (Homescreen), adj 8, size 5692

From logcat:

04-01 10:41:23.270   776   776 I Gonk    : Setting nice for pid 30361 to 18
04-01 10:41:23.270   776   776 I Gonk    : Changed nice for pid 30361 from 0 to 18.
04-01 10:41:23.280   776   776 D Sensors : activate handle=3; drv=1
04-01 10:41:23.290   776   776 D Sensors : SensorTMD2771::setEnable=0= en=0; newState=0; what=0; mEnabled=1
04-01 10:41:23.290   776   776 D Sensors : taos ---Light::TAOS_IOCTL_ALS_OFF=== err=0
04-01 10:41:23.290   776   776 D Sensors : SensorTMD2771::setEnable=1= en=0; newState=0; what=0; mEnabled=0
04-01 10:41:23.380   776   776 I GeckoDump: XXX FIXME : Got a mozContentEvent: nfc-hardware-state-change
04-01 10:41:23.410   776   776 I power   : *** set_screen_state 1
04-01 10:41:23.420   776   776 D Sensors : activate handle=3; drv=1
04-01 10:41:23.420   776   776 D Sensors : SensorTMD2771::setEnable=0= en=1; newState=1; what=0; mEnabled=0
04-01 10:41:23.530   776   776 D Sensors : taos ---Light::TAOS_IOCTL_ALS_ON=== err=0
04-01 10:41:23.530   776   776 D Sensors : SensorTMD2771::setEnable=1= en=1; newState=1; what=0; mEnabled=1
04-01 10:41:23.530   776   776 D Sensors : SensorTMD2771::getEnable===mEnabled=0x1; enable=1
04-01 10:41:23.620   776   776 I GeckoDump: XXX FIXME : Got a mozContentEvent: nfc-hardware-state-change
04-01 10:41:23.950   776   776 I Gonk    : Setting nice for pid 30361 to 18
04-01 10:41:23.950   776   776 I Gonk    : Changed nice for pid 30361 from 18 to 18.
04-01 10:41:23.950   776   776 I Gecko   : [Parent 776] WARNING: waitpid failed pid:30361 errno:10: file /home/yangzhao/B2G/B2G6821_test/gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
04-01 10:41:23.950   776   776 I power   : *** set_screen_state 0
04-01 10:41:24.000   776   816 D Sensors : SensorTMD2771::readEvents()======data_fd=52,count=16
04-01 10:41:24.000   776   816 D Sensors : SensorTMD2771::processEvent()=====code=40::value=97 
04-01 10:41:24.010   776   816 D Sensors : SensorTMD2771: mPendingEvents[Light].light = 97.000000
04-01 10:41:24.440   776   776 E [Gralloc-ERROR]: int gralloc_register_buffer(const gralloc_module_t*, const native_handle_t*):79 Unable to register handle 0x4a3f8d30 coming from different process: 776
04-01 10:41:24.440   776   776 E [Gralloc-ERROR]: int gralloc_register_buffer(const gralloc_module_t*, const native_handle_t*):79 Unable to register handle 0x4a3f8dd0 coming from different process: 776
04-01 10:41:25.290 30361 30361 I Gecko   : ###################################### forms.js loaded
04-01 10:41:25.310 30361 30361 I Gecko   : ############################### browserElementPanning.js loaded
04-01 10:41:25.330 30361 30361 I Gecko   : ######################## BrowserElementChildPreload.js loaded
04-01 10:41:25.640   776   776 I Gonk    : Setting nice for pid 30361 to 18
04-01 10:41:25.640   776   776 I Gonk    : Changed nice for pid 30361 from 18 to 18.
04-01 10:41:25.850   776   776 I Gecko   : 
04-01 10:41:25.850   776   776 I Gecko   : ###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
04-01 10:41:25.850   776   776 I Gecko   : 
04-01 10:41:25.960   776   794 I Gecko   : [Parent 776] WARNING: waitpid failed pid:30361 errno:10: file /home/yangzhao/B2G/B2G6821_test/gecko/ipc/chromium/src/base/process_util_posix.cc, line 254

The PIDs for the killed homescreen app are very close. We see the priority change from 0 to 18 and stays in 18 then the process got killed. So this means we restarted the homescreen app in BACKGROUND_HOMESCREEN priority and it got killed repeatedly. We need to check why the homescreen is restarted when it's in background.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #58)
> We need to check why the homescreen is restarted
> when it's in background.

In case we could reproduce, check the callstack from these two functions:

https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L116
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L173
Flags: needinfo?(alive)
(In reply to Alive Kuo [:alive][NEEDINFO!][God bless Taiwan.] from comment #59) 
> https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/
> homescreen_window.js#L116
> https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/
> homescreen_window.js#L173

Before we can reproduce the issue here, Tzu-lin can help provide a patch to print some logs or trace stack via remote debugger.

@Tzu-lin, please update your finding.
Yang Zhao showed me the phone which homescreen continuously restart. I set two break points by app-manager at 
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L116
and 
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L173

I found out that HomescreenWindow.prototype.ensure() is kept invoked repeatedly.
So I made a patch (https://github.com/dwi2/gaia/commit/a4155c43fc2f6b65c4f27d9b5ebeea9bddf66e5a) to print out caller of ensure() and try to get more detail.

Partners are working on running monkey test against the patch now.
Flags: needinfo?(tzhuang)
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #62)
> the print log patch in comment 61 should be this link
> https://github.com/dwi2/gaia/commits/bug988110_printlog
reproduce again with your first patch.The following logs appeared repeatedlly.



E/GeckoConsole(  644): Content JS LOG at app://system.gaiamobile.org/js/homescreen_window.js:178 in hw_ensure: THIS.ELEMENT  ID = homescreen
E/GeckoConsole(  644): Content JS WARN at app://system.gaiamobile.org/js/window.js:700 in aw_setOrientation: screen.mozLockOrientation() returned false for app://homescreen.gaiamobile.org orientation default
E/GeckoConsole(  644): Content JS LOG at app://system.gaiamobile.org/js/homescreen_window.js:123 in hw_restart: CALLER IS function (evt) {
E/GeckoConsole(  644):       if (evt.detail.type == 'fatal') {
E/GeckoConsole(  644):         evt.stopImmediatePropagation();
E/GeckoConsole(  644):         self.restart();
E/GeckoConsole(  644):       }
E/GeckoConsole(  644):     }
E/GeckoConsole(  644): Content JS LOG at app://system.gaiamobile.org/js/homescreen_window.js:175 in hw_ensure: CALLER IS function (next, evt) {
E/GeckoConsole(  644):     this.ensure();
E/GeckoConsole(  644):     if (!AttentionScreen.isFullyVisible())
E/GeckoConsole(  644):       this.setVisible(true);
E/GeckoConsole(  644):     this.resetTransition();
E/GeckoConsole(  644):     this.setOrientation();
E/GeckoConsole(  644):     this.resize();
E/GeckoConsole(  644):   }
(In reply to yang.zhao from comment #63)
> (In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #62)
> > the print log patch in comment 61 should be this link
> > https://github.com/dwi2/gaia/commits/bug988110_printlog
> reproduce again with your first patch.The following logs appeared
> repeatedlly.
> 
> 
> 
> E/GeckoConsole(  644): Content JS LOG at
> app://system.gaiamobile.org/js/homescreen_window.js:178 in hw_ensure:
> THIS.ELEMENT  ID = homescreen
> E/GeckoConsole(  644): Content JS WARN at
> app://system.gaiamobile.org/js/window.js:700 in aw_setOrientation:
> screen.mozLockOrientation() returned false for
> app://homescreen.gaiamobile.org orientation default
> E/GeckoConsole(  644): Content JS LOG at
> app://system.gaiamobile.org/js/homescreen_window.js:123 in hw_restart:
> CALLER IS function (evt) {

this is event handler of 'mozbrowsererror'
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L77
Which means homescreen app is crashed or killed.

> E/GeckoConsole(  644):       if (evt.detail.type == 'fatal') {
> E/GeckoConsole(  644):         evt.stopImmediatePropagation();
> E/GeckoConsole(  644):         self.restart();
> E/GeckoConsole(  644):       }
> E/GeckoConsole(  644):     }
> E/GeckoConsole(  644): Content JS LOG at
> app://system.gaiamobile.org/js/homescreen_window.js:175 in hw_ensure: CALLER
> IS function (next, evt) {
This is one of the callbacks of transition state change
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L183
I think ensure() is the place where homescreen really restart. But I am still try to get to know the logic of these transition state change callbacks. Anyone has idea?

> E/GeckoConsole(  644):     this.ensure();
> E/GeckoConsole(  644):     if (!AttentionScreen.isFullyVisible())
> E/GeckoConsole(  644):       this.setVisible(true);
> E/GeckoConsole(  644):     this.resetTransition();
> E/GeckoConsole(  644):     this.setOrientation();
> E/GeckoConsole(  644):     this.resize();
> E/GeckoConsole(  644):   }

I'll embed more log to see if homescreen app is in foreground or not when restar() is called after crash.
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #64)
> (In reply to yang.zhao from comment #63)
> > (In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #62)
> > > the print log patch in comment 61 should be this link
> > > https://github.com/dwi2/gaia/commits/bug988110_printlog
> > reproduce again with your first patch.The following logs appeared
> > repeatedlly.
> > 
> > 
> > 
> > E/GeckoConsole(  644): Content JS LOG at
> > app://system.gaiamobile.org/js/homescreen_window.js:178 in hw_ensure:
> > THIS.ELEMENT  ID = homescreen
> > E/GeckoConsole(  644): Content JS WARN at
> > app://system.gaiamobile.org/js/window.js:700 in aw_setOrientation:
> > screen.mozLockOrientation() returned false for
> > app://homescreen.gaiamobile.org orientation default
> > E/GeckoConsole(  644): Content JS LOG at
> > app://system.gaiamobile.org/js/homescreen_window.js:123 in hw_restart:
> > CALLER IS function (evt) {
> 
> this is event handler of 'mozbrowsererror'
> https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/
> homescreen_window.js#L77
> Which means homescreen app is crashed or killed.
> 
> > E/GeckoConsole(  644):       if (evt.detail.type == 'fatal') {
> > E/GeckoConsole(  644):         evt.stopImmediatePropagation();
> > E/GeckoConsole(  644):         self.restart();
> > E/GeckoConsole(  644):       }
> > E/GeckoConsole(  644):     }
> > E/GeckoConsole(  644): Content JS LOG at
> > app://system.gaiamobile.org/js/homescreen_window.js:175 in hw_ensure: CALLER
> > IS function (next, evt) {
> This is one of the callbacks of transition state change
> https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/
> homescreen_window.js#L183
> I think ensure() is the place where homescreen really restart. But I am
> still try to get to know the logic of these transition state change
> callbacks. Anyone has idea?
> 

Alive, any thought?
Flags: needinfo?(alive)
1.3T+, partner priority bug
blocking-b2g: 1.3T? → 1.3T+
Whiteboard: [priority]
The state machine is similar to v1.4:
http://alivedise.github.io/gaia-system-jsdoc/AppTransitionController.html
When the state is from closed to opening, _leave_closed function would be invoked.

It seems the homescreen is repeatedly killed 'at foreground' so homescreen_window would try to relaunch it again right away. But that shouldn't cause the problem if this.element is already there (https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L174).

I wonder that if we hit some bad LMK issue:
A mozbrowser iframe is inited as foreground state and if it's killed right away by some reason we will see homescreen is launched repeatedly.

The possible solution maybe to call ensure() only when home button is pressed. But I am not sure there's side effect or not. Maybe the user will see homescreen empty.
Flags: needinfo?(alive)
After discussion with :cyu and :eve,
I think homescreen_window mistakenly think homescreen is foreground because the decision is
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L126
and we will assume homescreen is launched as foreground:
https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L150

So I guess that:
0. Homescreen is opened(active, foreground)
1. Incoming call
2. After 3sec, Homescreen is sent to background(https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/window_manager.js#L1118) but still has 'active' class.
3. Homescreen is killed by LMK.(https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L78)
4. Homescreen is restarted since it's having active class.(https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L125)
5. Homescreen is killed right away. (gecko issue)

Repeated.

The possible solution is:
(1) In AppWindow.prototype.setVisible, add |this._gaiaVisibilityState = true/false| after setting.
(2) Check this._gaiaVisibilityState in restart function.
It's because |this._visibilityState| is changed by asynchronous event callback(https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L87) and gecko may kill the app before it sends the event so it's not reliable.
Also we in gaia assume a new mozbrowser iframe is foreground. If there's no setVisible call we won't get visibilitychange event.
Hi Alive,

Thanks, I'll try this patch on partner side monkey test.
It is still happen after applying https://github.com/mozilla-b2g/gaia/pull/17868

Also I saw that b2g process and comms app consume lots of CPU

root@android:/ # b2g-info
                           |     megabytes     |
           NAME   PID PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER     
            b2g  1343    1  786.0    0 26.4 28.9 33.6 143.4       0 root     
         (Nuwa)  1395 1343   18.2    0  0.1  0.2  0.5  50.5       0 root     
 Communications  2359 1395   81.7    0  0.8  1.4  3.3  84.1       1 app_2359 
     Homescreen 19317 1395    6.2   18  8.3 11.2 16.4  65.0       2 app_19317
(Preallocated a 19359 1395    2.7   18  1.7  3.6  7.8  57.5       1 root     

System memory info:

            Total 98.5 MB
     Used - cache 72.3 MB
  B2G procs (PSS) 45.3 MB
    Non-B2G procs 27.0 MB
     Free + cache 26.2 MB
             Free  6.8 MB
            Cache 19.4 MB

Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6 18432 KB
        8 18432 KB
       10 24576 KB
the log attached is generate with the patch https://github.com/dwi2/gaia/compare/536870d239...4cea602cf9

There is something weird that:
in hw_kill, before we remove homescreen iframe, 'this.element' did hold the iframe (see line 2), however, 'this.containerElement' did not have iframe (this.element) inside of it (see line 3 to line 16).
another weird part is that:
in hw_render (see line 48 of attachment 8401061 [details]), after inject <div id="homescreen" ...> into 'this.containerElement', there is still no homescreen div in it.
We can still meet this issue after apply alive's patch!
Flags: needinfo?(james.zhang)
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #74)
> another weird part is that:
> in hw_render (see line 48 of attachment 8401061 [details]), after inject
> <div id="homescreen" ...> into 'this.containerElement', there is still no
> homescreen div in it.

Tzu-lin's printing log patch here: https://github.com/dwi2/gaia/commit/17cc1ba7bc784b5f5658334606df5e01ffe75767
I am looking at this, will try to produce another patch (& patch for log) today.
https://github.com/timdream/gaia/compare/mozilla-b2g:v1.3t...v1.3t-crazy-monkey

Please try this. We suspect home screen is mistakenly relaunched when attention screen is being animated. There is an additional patch on top of existing logging commits and tentitive fix for it.
Severity: major → blocker
Priority: -- → P1
Attachment #8401061 - Attachment mime type: text/x-log → text/plain
Attached image picture1
Attached image picture2
hi,all
   I found another interesting phenomenon.
   1)launch an app,such as contacts,keep it foreground.Kill homescreen by adb shell kill PID
   2)Start homescreen by clicking "START" from the nightly.
   3)You'll see picture1 in the attachment
   4 [review])Press Home button,you'll see picture2 in the attachment

I don't know whether it's helpfull,just afford if for your information.
(In reply to yang.zhao from comment #80)
>    2)Start homescreen by clicking "START" from the nightly.

What you did is expected because you are launching Homescreen as a usual app, so it shows up in the cards view. It's not related to this bug.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #78)
> https://github.com/timdream/gaia/compare/mozilla-b2g:v1.3t...v1.3t-crazy-
> monkey
> 
> Please try this. We suspect home screen is mistakenly relaunched when
> attention screen is being animated. There is an additional patch on top of
> existing logging commits and tentitive fix for it.

I just push a new commit to the branch to count the # of restarts.

I think even if we fix the window management state management part of this issue, we still need to figure out why Gecko doesn't allow us to launch app repetitively within a short time. Gecko should have a better behavior for Gaia System to depend on.
Flags: needinfo?(tzhuang)
Hi Tim,

Thanks, I've ask partner to use the latest patch that you provided to run monkey test. Also I slightly modify some of log printing to avoid truncating. You can see them here https://github.com/dwi2/gaia/compare/mozilla-b2g:v1.3t...bug988110_tims_patch (I modify it on my commit only)
Flags: needinfo?(tzhuang)
Flags: needinfo?(james.zhang)
I have started new build with your latest patch.
Flags: needinfo?(james.zhang)
What's the log for the last run? Even without the new logging commits, it would be still nice if we know what was going on on that build.
Attached file latest log
Hi Tim,

I've attach latest logs. Partner runs monkey test with the patch you provide last night.
You can find gaia output in logcat.log in the attachement. I also get dmesg and b2g-info for investigation.

The screen still hangs during monkey test. However, there is no repetitive homescreen restart, because homescreen app is no longer existed (see b2g-info), and there is no homescreen div in system. (see app-manager-screenshot.png)
Attached video IMG_1074.MOV
Here is a video showing the screen hanged
NI Tim and Alive.
Since you are discussing with Tzu-Lin.
Tzu-Lin mentioned homescreen recovered after 1 min.

--
Keven
Flags: needinfo?(timdream)
Flags: needinfo?(alive)
Here are b2g-info, logcat, dmesg output when I call "LockScreen.unlock()" in app-manager on the screen-hanged device. Also the DOM structure under #screen > #windows is in the file name DOM_screen_windows.

Homescreen did recover in 40 seconds after I call LockScreen.unlock.
Attached video FORCE_UNLOCK.MOV
Here is the video show what happened in comment 89
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #90)
> Created attachment 8401721 [details]
> FORCE_UNLOCK.MOV
> 
> Here is the video show what happened in comment 89

I saw the bug with the phone from tzhuang.
We can unlock the screen ,enter into homscreen ,lunch app by app-manager, and the app seemed run ok.
But the touch event seemed lost from that time, all touch operation were not responsed.
I think we have solved the window management issue here. From comment 88, we need to find out why lock screen does not respond to touches. Do we log hardware touches in dmesg and console? Do we see them? If so we need to add logs to lockscreen touch event listeners and run it again.

Besides that, from comment 90, we have serious memory issue. The phone evidently runs slower as the test progresses. We need platform engineers to solve that, which IMHO is the real issue.

Thanks Tzu-Lin!
(In reply to ying.xu from comment #91)
> I saw the bug with the phone from tzhuang.
> We can unlock the screen ,enter into homscreen ,lunch app by app-manager,
> and the app seemed run ok.
> But the touch event seemed lost from that time, all touch operation were not
> responsed.

There might be touch event focusing issue here. Maybe touch events is being sent to the wrong frame?
(This happens before)
Flags: needinfo?(timdream)
Update on comment 92. Tzu-lin just said we can still receive touches with |adb shell getevent|.

So we need to look at Gecko and maybe Gaia now.
I sent the patch https://github.com/dwi2/gaia/compare/mozilla-b2g:v1.3t...bug988110_tims_patch to partner to run another round of monkey test.
I'd added debug log at touch event handler in lockscreen, homescreen and utility tray. But I think I still need all kind of methods to debug possible touch event problem no  matter what layer it is.
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #95)
> I sent the patch
> https://github.com/dwi2/gaia/compare/mozilla-b2g:v1.3t...
> bug988110_tims_patch to partner to run another round of monkey test.
> I'd added debug log at touch event handler in lockscreen, homescreen and
> utility tray. But I think I still need all kind of methods to debug possible
> touch event problem no  matter what layer it is.

With your patch, we reproduce this issue.
We can't find you gecko log and can find getevent log. Is it gecko issue ?
Please see modification and revert gecko commit.

diff --git a/dom/ipc/TabChild.cpp b/dom/ipc/TabChild.cpp
index 5e20e9f..e5fccd0 100644
--- a/dom/ipc/TabChild.cpp
+++ b/dom/ipc/TabChild.cpp
@@ -72,7 +72,9 @@
 #include "nsILoadContext.h"
 #include "ipc/nsGUIEventIPC.h"
 #include "UnitTransforms.h"
-
+#include <android/log.h>
+using namespace android;
+#define LLOG(args...)  __android_log_print(ANDROID_LOG_INFO, "tzhuang ==>" , ## args)
 #ifdef DEBUG
 #include "PCOMContentPermissionRequestChild.h"
 #endif /* DEBUG */
@@ -1890,6 +1892,7 @@ bool
 TabChild::RecvRealTouchEvent(const WidgetTouchEvent& aEvent,
                              const ScrollableLayerGuid& aGuid)
 {
+ LLOG("Touch Receive: %d", aEvent.message);
   WidgetTouchEvent localEvent(aEvent);
   for (size_t i = 0; i < localEvent.touches.Length(); i++) {
     aEvent.touches[i]->mRefPoint = APZCCallbackHelper::ApplyCallbackTransform(aEvent.touches[i]->mRefPoint, aGuid, mWidget->GetDefaultScale());
diff --git a/dom/ipc/TabParent.cpp b/dom/ipc/TabParent.cpp
index 2ff44b6..d0ee96c 100644
--- a/dom/ipc/TabParent.cpp
+++ b/dom/ipc/TabParent.cpp
@@ -58,7 +58,9 @@
 #include "TabChild.h"
 #include "nsNetCID.h"
 #include <algorithm>
-
+#include <android/log.h>
+using namespace android;
+#define LLOG(args...)  __android_log_print(ANDROID_LOG_INFO, "tzhuang ==>" , ## args)
 using namespace mozilla::dom;
 using namespace mozilla::ipc;
 using namespace mozilla::layers;
@@ -786,6 +788,7 @@ bool TabParent::SendRealKeyEvent(WidgetKeyboardEvent& event)

 bool TabParent::SendRealTouchEvent(WidgetTouchEvent& event)
 {
+ LLOG("Touch Send: %d", event.message);
   if (mIsDestroyed) {
     return false;
   }
Hi! James,

Is this still due to lack of ZRAM?
Will Homescreen recover in 40 seconds after call LockScreen.unlock?

--
Keven
Keven,

Homescreen recovers in 40 seconds after I call LockScreen.unlock() through app-manager.

dmesg, b2g-info and logcat are in the attachement. This monkey test runs with the patch https://github.com/dwi2/gaia/compare/mozilla-b2g:v1.3t...bug988110_tims_patch and the patch James posted on comment 96
Alive, let's clone a bug and have your window management patch lands on 1.3t/master with tests first.

People can continuing working on touch event issue here.
Component: Gaia::System::Window Mgmt → General
Component: General → IPC
Product: Firefox OS → Core
Version: unspecified → 28 Branch
Whiteboard: [priority] → [priority][MP_Blocker]
We did another round of monkey test at partner's office. Here's the log. This time we add debug log of touch event handling in both gecko and gaia (please refer to patches/ in attachment)

After several hour of monkey test. the screen hanged. (Lockscreen still could be unlocked with app-manager). |adb shell getevent| shows that we did receive touch events. But there seems no touch events passed in gecko and gaia.
Flags: needinfo?(alive)
So Viral is working on investigating why touch event didn't pass to Gecko.
Flags: needinfo?(vwang)
Whiteboard: [priority][MP_Blocker] → [priority]
(In reply to Tzu-Lin Huang [:dwi2][:tzhuang] from comment #100)
> Created attachment 8402442 [details]
> with_gecko_touch_log_2.zip
> 
> We did another round of monkey test at partner's office. Here's the log.
> This time we add debug log of touch event handling in both gecko and gaia
> (please refer to patches/ in attachment)
> 
> After several hour of monkey test. the screen hanged. (Lockscreen still
> could be unlocked with app-manager). |adb shell getevent| shows that we did
> receive touch events. But there seems no touch events passed in gecko and
> gaia.

log shows GeckoInputDispatcher didn't dispatch event when issue happened.
next step is to check EventHub.cpp and InputReader.cpp in gonk layer.
Now I'm setting two devices for monkey test and try to reproduce this bug.
Flags: needinfo?(vwang)
When the touch events were not responded. the key press can be handled.

And they share the same code flow.
Next time this issue is reproduced, can someone attach the orangutan script that reproduces it?
Were these monkey tests run to completion, or were they interrupted after a few hours?
I ran run-monkey.sh for several hours until it completed and did not see this issue.
(In reply to Michael Wu [:mwu] from comment #105)
> Were these monkey tests run to completion, or were they interrupted after a
> few hours?

Yes, currently it will run for 12hours
hi,all
   I/Gonk    (13087): bug988110 dispatchOnce data.type: 0
   I/Gonk    (13087): bug988110 dispatchOnce MOTION mTouchEventsFiltered: 1

   The above is the logcat output when the phone reproduced this issue.It seems that the touch event has been filtered.The log patch is the belows:

  diff --git a/widget/gonk/nsAppShell.cpp b/widget/gonk/nsAppShell.cpp
index 2cd7dae..d87b797 100644
--- a/widget/gonk/nsAppShell.cpp
+++ b/widget/gonk/nsAppShell.cpp
-
+     LOG("bug988110 dispatchOnce data.type: %d", data.type);
     switch (data.type) {
     case UserInputData::MOTION_DATA: {
         if (!mTouchDownCount) {
             // No pending events, the filter state can be updated.
             mTouchEventsFiltered = isExpired(data);
         }
-
         int32_t action = data.action & AMOTION_EVENT_ACTION_MASK;
         switch (action) {
         case AMOTION_EVENT_ACTION_DOWN:
@@ -561,7 +561,7 @@ GeckoInputDispatcher::dispatchOnce()
         default:
             break;
         }
-
+        LOG("bug988110 dispatchOnce MOTION mTouchEventsFiltered: %d", mTouchEventsFiltered);
         if (mTouchEventsFiltered) {
             return;
         }
@@ -604,7 +604,7 @@ GeckoInputDispatcher::dispatchOnce()
             // No pending events, the filter state can be updated.
             mKeyEventsFiltered = isExpired(data);
         }
-
+        LOG("bug988110 dispatchOnce KEY mTouchEventsFiltered: %d", mTouchEventsFiltered);
         mKeyDownCount += (data.action == AKEY_EVENT_ACTION_DOWN) ? 1 : -1;
         if (mKeyEventsFiltered) {
             return;
I have monkey test in two devices.
One is still running over 12 hours and one met the case similar to comment #108.

In the defected device, log shows GeckoInputDispatcher did dispatch but gecko didn't send touch event in TabParent. (the result is not consistent with comment #100)

I would like to add more log between GeckoInputDispatcher and system app and do another round testing.
just figure out comment #108 is great help on this bug.
since this filter land on 3/21, shall we have any information like this bug only happened after 3/21?
Alright, so here is my theory:

Due to a bug in the driver we aren't seeing on other silicon mTouchDownCount gets out of balance and doesn't go back to 0, as a result mTouchEventsFiltered remains true and we filter all touch events.
Actually this might be some sort of a race condition since the touch events are sent by orangutan and not the actual driver. In any case the mTouchDownCount counting seems to be not stable and we should solve that somehow differently.
I think the filter Viral means is this patch: https://bugzilla.mozilla.org/show_bug.cgi?id=971633
(In reply to viral [:viralwang] from comment #110)
> just figure out comment #108 is great help on this bug.
> since this filter land on 3/21, shall we have any information like this bug
> only happened after 3/21?
(In reply to Andreas Gal :gal from comment #111)
> Alright, so here is my theory:
> 
> Due to a bug in the driver we aren't seeing on other silicon mTouchDownCount
> gets out of balance and doesn't go back to 0, as a result
> mTouchEventsFiltered remains true and we filter all touch events.
We do met mTouchDownCount out of balance and we will miss all touch events after that.
(failed rate: 2/2)

01-01 18:03:42.029 I/GonkTouch(   82): timeNowMs: 10221068, data.timeMs: 10188343  //mTouchEventsFiltered=1
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 1  //finger down
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 9, mTouchDownCount: 1  //hover enter
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 10, mTouchDownCount: 1 //hover exit
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 2  //finger up
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 1, mTouchDownCount: 1
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 2
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 1, mTouchDownCount: 1
bug 989234 is not blocking this bug and this could be an regression of bug 971633.
Depends on: 971633
No longer depends on: 989234
Keywords: regression
Are we sure it's a regression of bug 971633 ? Please only set dependencies while we have them verified!
So this is a bug in the driver. action is 0 for up and should be 1.

01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 1  //finger down
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 9, mTouchDownCount: 1  //hover enter
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 10, mTouchDownCount: 1 //hover exit
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 2 

James can you guys fix this?
Flags: needinfo?(james.zhang)
Summary: [tarako] gaia no respond after monkey test start about 2 hours → [tarako] touch event filtering can get stuck when up/down actions are unbalanced
Looking at the sequence here the 2nd finger down event is simply bogus. We should fix that in the driver but Android likely tolerates this, and we should do the same, because other vendors probably make the same mistake.

01-01 18:03:42.029 I/GonkTouch(   82): timeNowMs: 10221068, data.timeMs: 10188343  //mTouchEventsFiltered=1
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 1  //finger down
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 9, mTouchDownCount: 1  //hover enter
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 10, mTouchDownCount: 1 //hover exit
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0, mTouchDownCount: 2  //bogus!
01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 1, mTouchDownCount: 1
(In reply to Andreas Gal :gal from comment #117)
> So this is a bug in the driver. action is 0 for up and should be 1.

AIUI, the monkey testing here involves events generated by a test program rather than by the driver itself.
I won't have time to work on this, please feel free to steal the patch and land it if it works.
Yeah, this might be an artificial failure that doesn't happen with real driver input. We have to unblock monkey testing either way though, so we can as well land the patch and make this work.
We make a test build with revised attachment 8403767 [details] [diff] [review] to monkey test again. We expect this revised patch can fix the side-effect of mTouchDownCount.
Flags: needinfo?(ttsai)
Hi Michael,

We already start another round testing now but still need hours to verify.
Could you please help to review it?
Thanks.
Attachment #8403861 - Flags: review?(mwu)
What's the rationale of fixing this in Gecko instead of in the monkey test program?
(In reply to Andreas Gal :gal from comment #117)
> So this is a bug in the driver. action is 0 for up and should be 1.
> 
> 01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0,
> mTouchDownCount: 1  //finger down
> 01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 9,
> mTouchDownCount: 1  //hover enter
> 01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 10,
> mTouchDownCount: 1 //hover exit
> 01-01 18:03:42.029 I/GonkTouch(   82): dispatchOnce action: 0,
> mTouchDownCount: 2 
> 
> James can you guys fix this?
Flags: needinfo?(james.zhang) → needinfo?(ying.xu)
This event is not come from touch panel driver, it come from monkey test tool -- orng.
Without bug 971633 commit, our monkey test has run about 6 hours, it won't stuck.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #125)
> What's the rationale of fixing this in Gecko instead of in the monkey test
> program?

defense in depth. Be liberal with what you receive, and careful with what you send.
Attachment #8403861 - Flags: review?(mwu) → review+
Thanks for the fix. We should definitely defend against this in Gecko. Drivers will get this wrong, too. Ship it!
We already have two devices pass over 12 hours monkey test and touch still working.
I will prepare a patch for landing.
Confirmed comment #131.
Attached patch bug988110.patchSplinter Review
update the patch for landing.
I think both 1.3t and master should apply this patch.
Assignee: anygregor → vwang
Attachment #8403767 - Attachment is obsolete: true
Attachment #8403861 - Attachment is obsolete: true
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/0db1de4da2ee
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Flags: needinfo?(ying.xu)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: