Closed Bug 971747 Opened 10 years ago Closed 6 years ago

Phone freezes while starting and stopping the b2g process

Categories

(Firefox OS Graveyard :: Vendcom, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: viorela, Unassigned)

References

Details

(Keywords: regression)

Attachments

(2 files)

Attached file logcat.txt
Phone is freezing with a black screen while running tests, after trying to restart.
In order to reuse the phone, we need to do an adb reboot.
I couldn't get a stacktrace as I waited about half an hour to see the test failing with an error, but it didn't. I attached a logcat of the issue. 

Gaia      4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6
Gecko     https://hg.mozilla.org/mozilla-central/rev/802d87c77e76
BuildID   20140211160209
Version   30.0a1
ro.build.version.incremental=eng.archermind.20131114.105818
ro.build.date=Thu Nov 14 10:58:33 CST 2013
You need to be more descriptive about what 'restart' the phone is because I know you don't mean using the restart button on the top of the device.
Also what is the effect of this?

It is freezing the device and stopping all following automation from running.
Component: Gaia::UI Tests → Gaia::System
I was able to reproduce this fail manually, by running adb shell stop b2g, adb shell start b2g repeatedly.
Summary: Phone freezes while trying to restart the phone → Phone freezes while starting and stopping the b2g process
On Buri device I am seeing a crash shortly after the device freezes: https://crash-stats.mozilla.com/report/index/be95802e-4f66-4c78-9449-06c822140212

I restarted the phone, and then it did not start. After a battery pull it restarted but the crash report came up. I will file a separate bug for the crash in a moment.
Just to clarify, Marcia found that bug by restarting using the power menu.

I replicated as per comment #3 and got no crash report.
The process is still running after it has frozen, a'la comment #3.
---
USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
root      1     0     428    288   c017b404 00008728 S /init
root      2     0     0      0     c00e9618 00000000 S kthreadd
root      3     2     0      0     c00d26fc 00000000 S ksoftirqd/0
root      4     2     0      0     c00e3d98 00000000 S kworker/0:0
root      5     2     0      0     c006a058 00000000 D kworker/u:0
root      6     2     0      0     c010b8ac 00000000 S migration/0
root      7     2     0      0     c00e555c 00000000 S khelper
root      8     2     0      0     c00e555c 00000000 S suspend_sys_syn
root      9     2     0      0     c00e555c 00000000 S suspend
root      10    2     0      0     c01498d4 00000000 S sync_supers
root      11    2     0      0     c014a8e4 00000000 S bdi-default
root      12    2     0      0     c00e555c 00000000 S kblockd
root      13    2     0      0     c034fde4 00000000 S khubd
root      14    2     0      0     c00e555c 00000000 S l2cap
root      15    2     0      0     c00e555c 00000000 S a2mp
root      16    2     0      0     c00e555c 00000000 S modem_notifier
root      17    2     0      0     c00e555c 00000000 S smd_channel_clo
root      18    2     0      0     c00e555c 00000000 S smsm_cb_wq
root      19    2     0      0     c00e3d98 00000000 S kworker/0:1
root      20    2     0      0     c00e3d98 00000000 S kworker/u:1
root      21    2     0      0     c00e555c 00000000 S rpcrouter
root      22    2     0      0     c00e555c 00000000 S rpcrotuer_smd_x
root      23    2     0      0     c006c6e8 00000000 S krpcserversd
root      25    2     0      0     c00e555c 00000000 S msm_adsp_probe
root      26    2     0      0     c006b4f0 00000000 D voicememo_rpc
root      27    2     0      0     c006b6e8 00000000 S kadspd
root      28    2     0      0     c00e555c 00000000 S rmt_storage
root      29    2     0      0     c01422cc 00000000 S kswapd0
root      30    2     0      0     c019d8dc 00000000 S fsnotify_mark
root      31    2     0      0     c00e555c 00000000 S crypto
root      43    2     0      0     c00e555c 00000000 S mdp_dma_wq
root      44    2     0      0     c00e555c 00000000 S mdp_vsync_wq
root      45    2     0      0     c00e555c 00000000 S mdp_pipe_ctrl_w
root      46    2     0      0     c00e555c 00000000 S k_hsuart
root      48    2     0      0     c00e555c 00000000 S kgsl-3d0
root      49    2     0      0     c010cd9c 00000000 S irq/273-1-0039
root      50    2     0      0     c0333098 00000000 S mtdblock0
root      51    2     0      0     c0333098 00000000 S mtdblock1
root      52    2     0      0     c0333098 00000000 S mtdblock2
root      53    2     0      0     c0333098 00000000 S mtdblock3
root      54    2     0      0     c0333098 00000000 S mtdblock4
root      55    2     0      0     c0333098 00000000 S mtdblock5
root      56    2     0      0     c0333098 00000000 S mtdblock6
root      57    2     0      0     c0333098 00000000 S mtdblock7
root      58    2     0      0     c0333098 00000000 S mtdblock8
root      59    2     0      0     c0333098 00000000 S mtdblock9
root      60    2     0      0     c0333098 00000000 S mtdblock10
root      61    2     0      0     c0333098 00000000 S mtdblock11
root      62    2     0      0     c0333098 00000000 S mtdblock12
root      63    2     0      0     c0333098 00000000 S mtdblock13
root      64    2     0      0     c0333098 00000000 S mtdblock14
root      65    2     0      0     c0333098 00000000 S mtdblock15
root      66    2     0      0     c0333098 00000000 S mtdblock16
root      67    2     0      0     c0333098 00000000 S mtdblock17
root      68    2     0      0     c0333098 00000000 S mtdblock18
root      69    2     0      0     c0333098 00000000 S mtdblock19
root      70    2     0      0     c0333098 00000000 S mtdblock20
root      71    2     0      0     c0333098 00000000 S mtdblock21
root      72    2     0      0     c0333098 00000000 S mtdblock22
root      73    2     0      0     c0333098 00000000 S mtdblock23
root      74    2     0      0     c00e555c 00000000 S k_rmnet_mux_wor
root      75    2     0      0     c00e555c 00000000 S f_mtp
root      76    2     0      0     c03932b4 00000000 S file-storage
root      77    2     0      0     c00e3d98 00000000 S kworker/0:2
root      78    2     0      0     c00e555c 00000000 S uether
root      79    2     0      0     c00e555c 00000000 S diag_wq
root      80    2     0      0     c00e555c 00000000 S diag_cntl_wq
root      81    2     0      0     c010cd9c 00000000 S irq/304-ft5x06_
root      82    2     0      0     c006b4f0 00000000 D krtcclntd
root      83    2     0      0     c006dcac 00000000 D krtcclntcbd
root      84    2     0      0     c00e555c 00000000 S kfmradio
root      85    2     0      0     c006b4f0 00000000 D kbatteryclntd
root      86    2     0      0     c006dcac 00000000 D kbatteryclntcbd
root      87    2     0      0     c00e555c 00000000 S battery_queue
root      88    2     0      0     c00e555c 00000000 S msm_adc
root      89    2     0      0     c00e555c 00000000 S dalrpc_rcv_DAL0
root      90    2     0      0     c00e555c 00000000 S iewq
root      91    2     0      0     c00e9588 00000000 D kinteractiveup
root      92    2     0      0     c00e555c 00000000 S binder
root      93    2     0      0     c00e3d98 00000000 S kworker/u:3
root      94    2     0      0     c0598a80 00000000 S krfcommd
root      95    2     0      0     c00e555c 00000000 S msm-cpufreq
root      96    2     0      0     c006b4f0 00000000 D khsclntd
root      97    2     0      0     c00e555c 00000000 S deferwq
root      98    1     312    4     c017b404 00008728 S /sbin/ueventd
root      99    2     0      0     c043650c 00000000 S mmcqd/0
root      100   2     0      0     c02394f4 00000000 S yaffs-bg-1
root      102   2     0      0     c02394f4 00000000 S yaffs-bg-1
root      103   2     0      0     c02394f4 00000000 S yaffs-bg-1
root      104   2     0      0     c02394f4 00000000 S yaffs-bg-1
system    129   1     852    224   c044d378 400f18b0 S /system/bin/servicemanager
root      130   1     4036   556   ffffffff 40067eb4 S /system/bin/vold
root      133   1     3336   504   ffffffff 400658b0 S /system/bin/fakeperm
root      136   1     7832   1080  ffffffff 4006ceb4 S /system/bin/netd
root      137   1     720    216   c0487240 400bc1b8 S /system/bin/debuggerd
radio     138   1     20340  3240  ffffffff 40047eb4 S /system/bin/rild
media     139   1     18660  5396  ffffffff 400928b0 S /system/bin/mediaserver
bluetooth 140   1     1352   532   c017b404 400c567c S /system/bin/dbus-daemon
keystore  141   1     1748   496   c0487240 4005b1b8 S /system/bin/keystore
system    153   1     1924   272   ffffffff 4004e0e8 S /system/bin/mm-pp-daemon
system    157   1     3052   688   c017b404 4003667c S /system/bin/mm-qcamera-daemon
root      158   1     1188   428   c017b404 4006f67c S /system/bin/gpu_dcvsd
root      159   2     0      0     c00e555c 00000000 S k_gserial
root      160   2     0      0     c00e555c 00000000 S k_gsmd
root      161   2     0      0     c00e555c 00000000 S k_rmnet_work
root      164   1     4480   176   ffffffff 0000829c S /sbin/adbd
root      191   1     872    252   c017b404 400929d4 S /system/bin/location-mq
radio     196   1     6032   560   ffffffff 400fd9d4 S /system/bin/qmuxd
root      197   1     14604  996   ffffffff 4005e810 S /system/bin/xtwifi-inet-agent
shell     203   1     792    296   c02c4448 4009e678 S /system/bin/sh
radio     204   1     6160   784   ffffffff 40052810 S /system/bin/netmgrd
root      205   1     12184  1524  ffffffff 40037810 S /system/bin/xtwifi-client
root      290   2     0      0     c018d644 00000000 S flush-31:5
root      360   2     0      0     c00e555c 00000000 S cfg80211
root      363   2     0      0     c00e555c 00000000 S ath6kl
root      367   2     0      0     c04325e0 00000000 S ksdioirqd/mmc1
system    403   1     1348   428   c017b404 4010967c S /system/bin/qosmgr
root      706   1     0      0     c02b7260 00000000 D b2g
root      745   706   0      0     c00cfac8 00000000 Z (Nuwa)
root      755   1     44152  9900  ffffffff 400d46f8 D /system/b2g/b2g
root      760   164   780    296   c0046254 4005e094 S /system/bin/sh
root      762   760   972    276   00000000 40065678 R ps
blocking-b2g: --- → 1.4?
Component: Gaia::System → General
Fabrice, might you be able to help us find the culprit, here?
Oops, I meant https://bugzilla.mozilla.org/show_bug.cgi?id=971605 sorry for the spam.
Clint, no it is not, it is different and unlikely to be related.

This can be replicated outside of automation.

I spent some time trying to work around it, changing the durations between the stop and the start but it had no effect. I'm thinking Naoki might know something about this, if the b2g process cannot be started because there's a filelock in place or somesuch.
It looks like the previous b2g hasn't actually stopped yet.

Starting a new b2g before the previous one has stopped will almost certainly not work.

If I'm reading things right, "stop b2g" sends a kill -9 to the process. The only way that will not kill the process is if its hung on some uninterruptible system call.

It would be useful to try the following:

echo t > /proc/sysrq-trigger
dmesg

You should be able to find the b2g process (From the above ps there is 706 and 755. 706 is the one we're interested in)

What I'm interested in is this:

<6>[ 4052.836422] b2g             S c05b46c0  4776   875      1 0x00000000
<4>[ 4052.836591] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b58f0>] (schedule_hrtimeout_range_clock+0x4c/0x144)
<4>[ 4052.836702] [<c05b58f0>] (schedule_hrtimeout_range_clock+0x4c/0x144) from [<c01a1074>] (sys_epoll_wait+0x1b8/0x314)
<4>[ 4052.836802] [<c01a1074>] (sys_epoll_wait+0x1b8/0x314) from [<c0042d60>] (ret_fast_syscall+0x0/0x30)

In this example b2g was pid 875 (the output will be in PID order so that should help to find the right stuff). We'd probably need to look at all of the threads as well.

So getting ps -t output and the dmesg output attached might shed some light.

The information that's shown here is the callstack inside the kernel for each thread.
(In reply to Stephen Donner [:stephend] from comment #7)
> Fabrice, might you be able to help us find the culprit, here?

What Dave said!
Thanks very much Dave, I'll be able to debug this tomorrow. Just clarify, do you want me to get the debug information from the point after stopping b2g or after restarting when it has jammed?

It's possible (or likely) we are stopping b2g while some important bits are going on as as soon as our final test step finishes, if there's no debug to collate we immediately stop the process (in order to wipe data/profile and start it again).
?ni Dave just for the clarification.
Flags: needinfo?(dhylands)
(In reply to Zac C (:zac) from comment #13)
> It's possible (or likely) we are stopping b2g while some important bits are
> going on as as soon as our final test step finishes, if there's no debug to
> collate we immediately stop the process (in order to wipe data/profile and
> start it again).

Also Dave, this strategy has worked successfully for us for around 12 months. This is definitely something that changed in yesterday's late-PST build but I'm not clear whether it's a regression on b2g side or something we need to resolve in the test framework.
(In reply to Zac C (:zac) from comment #13)
> Thanks very much Dave, I'll be able to debug this tomorrow. Just clarify, do
> you want me to get the debug information from the point after stopping b2g
> or after restarting when it has jammed?
> 
> It's possible (or likely) we are stopping b2g while some important bits are
> going on as as soon as our final test step finishes, if there's no debug to
> collate we immediately stop the process (in order to wipe data/profile and
> start it again).

Hi Zac,

I suppose both would be interesting. Figuring out where its hanging on startup is less interesting than knowing why its hanging on shutdown.

Normally, when you do stop b2g, it should stop almost immediately. Although maybe theres a difference between eng and user (I seem to remember cjones mentioning something along those lines once upon a time). Since you're testing, I'm assuming that you're using eng or userdebug builds?

And what do I need to do to try and reproduce this locally?
ok - using a userdebug build (which is what I happened to have on my phone, I was able to reproduce by doing

start b2g
stop b2g (while the firefox with the flaming tail logo was still visible on the screen)

I actually did:

stop b2g; sleep 5; b2g-ps; start-b2g

and then waited variable amounts of time before hitting the up arrow and return.

If you see b2g still present in the b2g-ps output then you've most likely hit the situation.

Here's my dmesg output at hang time;

For the b2g that's hung trying to stop:

<6>[ 8229.892480] b2g             D c05b46c0  5128  1689      1 0x00000001
<4>[ 8229.892640] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8)
<4>[ 8229.892734] [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) from [<c05b3f30>] (wait_for_common+0x104/0x198)
<4>[ 8229.892837] [<c05b3f30>] (wait_for_common+0x104/0x198) from [<c02b7260>] (mipi_dsi_off+0x7c/0x25c)
<4>[ 8229.892932] [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) from [<c02b9fa4>] (panel_next_off+0x3c/0x48)
<4>[ 8229.893020] [<c02b9fa4>] (panel_next_off+0x3c/0x48) from [<c02a00c8>] (mdp_off+0x30/0x4c)
<4>[ 8229.893109] [<c02a00c8>] (mdp_off+0x30/0x4c) from [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120)
<4>[ 8229.893195] [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) from [<c029a7b4>] (msm_fb_release+0x60/0x98)
<4>[ 8229.893292] [<c029a7b4>] (msm_fb_release+0x60/0x98) from [<c0293820>] (fb_release+0x30/0x54)
<4>[ 8229.893384] [<c0293820>] (fb_release+0x30/0x54) from [<c016cec0>] (fput+0x10c/0x1d0)
<4>[ 8229.893465] [<c016cec0>] (fput+0x10c/0x1d0) from [<c01699f8>] (filp_close+0x78/0x84)
<4>[ 8229.893549] [<c01699f8>] (filp_close+0x78/0x84) from [<c00cdde0>] (put_files_struct+0x94/0xf0)
<4>[ 8229.893632] [<c00cdde0>] (put_files_struct+0x94/0xf0) from [<c00cf63c>] (do_exit+0x2bc/0x78c)
<4>[ 8229.893714] [<c00cf63c>] (do_exit+0x2bc/0x78c) from [<c00cfb98>] (do_group_exit+0x8c/0xb8)
<4>[ 8229.893802] [<c00cfb98>] (do_group_exit+0x8c/0xb8) from [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438)
<4>[ 8229.893900] [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) from [<c0045840>] (do_signal+0xb0/0x58c)
<4>[ 8229.893995] [<c0045840>] (do_signal+0xb0/0x58c) from [<c0045d34>] (do_notify_resume+0x18/0x38)
<4>[ 8229.894084] [<c0045d34>] (do_notify_resume+0x18/0x38) from [<c0042db8>] (work_pending+0x24/0x28)

For the b2g that's hanging while starting up:

<6>[ 8229.894135] b2g             D c05b46c0  6048  1715      1 0x00000000
<4>[ 8229.894302] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b5300>] (__mutex_lock_slowpath+0x144/0x1c0)
<4>[ 8229.894399] [<c05b5300>] (__mutex_lock_slowpath+0x144/0x1c0) from [<c05b539c>] (mutex_lock+0x20/0x40)
<4>[ 8229.894494] [<c05b539c>] (mutex_lock+0x20/0x40) from [<c029393c>] (fb_open+0x60/0x1c0)
<4>[ 8229.894587] [<c029393c>] (fb_open+0x60/0x1c0) from [<c016eab0>] (chrdev_open+0x204/0x22c)
<4>[ 8229.894675] [<c016eab0>] (chrdev_open+0x204/0x22c) from [<c016a020>] (__dentry_open+0x218/0x330)
<4>[ 8229.894760] [<c016a020>] (__dentry_open+0x218/0x330) from [<c016a1e8>] (nameidata_to_filp+0x50/0x5c)
<4>[ 8229.894847] [<c016a1e8>] (nameidata_to_filp+0x50/0x5c) from [<c01772a0>] (do_last+0x7c8/0x8d4)
<4>[ 8229.894934] [<c01772a0>] (do_last+0x7c8/0x8d4) from [<c0178db4>] (path_openat+0xc4/0x380)
<4>[ 8229.895020] [<c0178db4>] (path_openat+0xc4/0x380) from [<c0179150>] (do_filp_open+0x30/0x7c)
<4>[ 8229.895104] [<c0179150>] (do_filp_open+0x30/0x7c) from [<c0169c6c>] (do_sys_open+0xd8/0x174)
<4>[ 8229.895189] [<c0169c6c>] (do_sys_open+0xd8/0x174) from [<c0042d60>] (ret_fast_syscall+0x0/0x30)

I couldn't run gdb since I had a build going and I had wiped out my out tree. So I'll wait until my build is finished and gather more information then.
Flags: needinfo?(dhylands)
So it looks like the one that's hung stopping is trying to close the framebuffer and the one thats hanging on startup is trying to open the framebuffer
Yeah - so I tried "kill -9 1689" and its not going away.

So this tells me that we have a driver bug (fundamentally drivers shouldn't be hanging like this).

It's quite possible that our code paths have changed recently are exposing the driver bug, so we can still investigate along those lines.
(In reply to Zac C (:zac) from comment #19)
> I don't have the buildid to hand but this started at this build:
> https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/nightly/mozilla-
> central-hamachi-eng/2014/02/2014-02-11-16-02-09/

I can also reproduce in 2014-02-11-04-02-00 but not in 2014-02-10-16-02-01 (so that's a 12 hour range)
Hmm. So looking at the sources.xml files, I get

hash d2ef16927622a687998c32093cc80d5357620a14 has the problem
hash c8a8bf353a31f20c11161e0f0274ad7c56b1e716 doesn't

And git shows:
d2ef169 Ryan VanderMeulen 2 days ago Merge inbound to m-c.
c8a8bf3 Ryan VanderMeulen 2 days ago Merge b2g-inbound to m-c.

So it looks like a merge from inbound seems to be the culprit.

And this is the output of git show --stat HEAD (with HEAD set to d2ef16927622a687998c32093cc80d5357620a14)

1150 >git show --stat HEAD
commit d2ef16927622a687998c32093cc80d5357620a14
Merge: c8a8bf3 eea6ca3
Author: Ryan VanderMeulen <ryanvm@gmail.com>
Date:   Mon Feb 10 15:42:58 2014 -0500

    Merge inbound to m-c.

 browser/app/Makefile.in                            |    1 -
 content/media/TextTrack.cpp                        |   15 +-
 content/media/TextTrackCue.h                       |    1 +
 content/media/test/chrome.ini                      |    1 +
 content/media/test/test_texttrackcue.html          |    9 +
 content/media/test/test_texttrackcue_chrome.html   |   33 +
 db/sqlite3/src/Makefile.in                         |    2 -
 db/sqlite3/src/moz.build                           |    4 +
 dom/plugins/test/testplugin/testplugin.mk          |    2 -
 dom/plugins/test/testplugin/testplugin.mozbuild    |    4 +
 dom/workers/File.cpp                               |    8 +-
 .../printingui/src/win/nsPrintDialogUtil.cpp       |    2 +-
 embedding/tests/winEmbed/Makefile.in               |    2 -
 embedding/tests/winEmbed/moz.build                 |    2 +
 gfx/angle/src/libEGL/Makefile.in                   |    1 -
 gfx/angle/src/libEGL/moz.build                     |    2 +
 gfx/angle/src/libGLESv2/Makefile.in                |    1 -
 gfx/angle/src/libGLESv2/moz.build                  |    2 +
 gfx/ipc/GfxMessageUtils.h                          |    8 +-
 gfx/layers/d3d11/TextureD3D11.cpp                  |   16 +-
 gfx/layers/ipc/AsyncPanZoomController.cpp          |    4 +-
 gfx/thebes/gfxDWriteFontList.cpp                   |    2 +-
 js/src/Makefile.in                                 |    5 -
 js/src/ctypes/Library.cpp                          |    2 +-
 js/src/jit/Bailouts.cpp                            |    2 +-
 js/src/jit/Bailouts.h                              |    3 -
 js/src/jit/BaselineBailouts.cpp                    |    6 +-
 js/src/jit/IonCode.h                               |    8 -
 js/src/jit/IonFrameIterator.h                      |   41 +-
 js/src/jit/IonFrames.cpp                           |  162 +++--
 js/src/jit/IonTypes.h                              |   11 +
 js/src/jit/SnapshotReader.h                        |  250 --------
 js/src/jit/SnapshotWriter.h                        |   78 ---
 js/src/jit/Snapshots.cpp                           |  650 +++++++++++---------
 js/src/jit/Snapshots.h                             |  463 ++++++++++++++
 js/src/jit/arm/Simulator-arm.cpp                   |   51 ++
 js/src/jit/arm/Simulator-arm.h                     |    1 +
 js/src/jit/shared/CodeGenerator-shared.cpp         |   50 +-
 js/src/jit/shared/CodeGenerator-shared.h           |    4 +-
 js/src/jsapi-tests/moz.build                       |    1 +
 js/src/jsapi-tests/testJitRValueAlloc.cpp          |  236 +++++++
 js/src/jsapi.cpp                                   |   41 +-
 js/src/jsapi.h                                     |    7 +-
 js/src/moz.build                                   |    2 +
 js/src/perf/jsperf.cpp                             |   10 +-
 js/src/shell/js.cpp                                |   16 +-
 js/src/vm/Stack.cpp                                |   10 +-
 js/xpconnect/src/XPCShellImpl.cpp                  |    2 +-
 layout/base/nsDisplayList.h                        |    6 +
 layout/mathml/nsMathMLmoFrame.cpp                  |    4 +
 .../reftests/mathml/mo-invisibleoperators-ref.html |   12 +
 layout/reftests/mathml/mo-invisibleoperators.html  |   17 +
 layout/reftests/mathml/reftest.list                |    1 +
 layout/style/ImageLoader.cpp                       |    2 +-
 mobile/android/components/PaymentsUI.js            |    6 +-
 netwerk/base/src/ProxyAutoConfig.cpp               |   14 +-
 python/mozbuild/mozbuild/frontend/emitter.py       |    2 +
 .../mozbuild/mozbuild/frontend/sandbox_symbols.py  |   12 +
 .../test/backend/data/variable_passthru/moz.build  |    3 +
 .../mozbuild/test/backend/test_recursivemake.py    |    6 +
 .../test/frontend/data/variable-passthru/moz.build |    3 +
 .../mozbuild/test/frontend/test_emitter.py         |    2 +
 security/manager/ssl/src/nsSmartCardMonitor.cpp    |    8 +-
 testing/marionette/client/marionette/client.py     |   14 +-
 .../marionette/client/marionette/runner/base.py    |   47 +-
 .../client/marionette/runner/mixins/b2g.py         |   91 ++-
 .../client/marionette/runner/mixins/reporting.py   |    1 +
 toolkit/library/Makefile.in                        |    1 -
 toolkit/library/moz.build                          |   11 +
 webapprt/PaymentUIGlue.js                          |    6 +-
 widget/os2/Makefile.in                             |    2 -
 widget/os2/moz.build                               |    2 +
 widget/windows/Makefile.in                         |    2 -
 widget/windows/moz.build                           |    2 +
 xpcom/ds/nsWindowsRegKey.cpp                       |    4 +-
 xpcom/reflect/xptcall/src/md/unix/moz.build        |    2 +-
 xulrunner/app/Makefile.in                          |    1 -
 77 files changed, 1663 insertions(+), 855 deletions(-)

This appears to be this merge (on m-c) 167907:d812f80a0f1d
https://hg.mozilla.org/mozilla-central/rev/d812f80a0f1d
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d812f80a0f1d

which corresponds to 167915:d812f80a0f1d thru 167876:542f9093ec36 in mozilla-inbound.
(In reply to Dave Hylands [:dhylands] from comment #18)
> So it looks like the one that's hung stopping is trying to close the
> framebuffer and the one thats hanging on startup is trying to open the
> framebuffer

Could you be more specific about the details of this frame buffer? Is it potentially related to graphics?
(In reply to Jason Smith [:jsmith] from comment #23)
> (In reply to Dave Hylands [:dhylands] from comment #18)
> > So it looks like the one that's hung stopping is trying to close the
> > framebuffer and the one thats hanging on startup is trying to open the
> > framebuffer
> 
> Could you be more specific about the details of this frame buffer? Is it
> potentially related to graphics?

I'm still not sure. All I can say (from comment 17) is that the process that's hung wound up calling fb_release in the kernel, and it seems to be stuck there. The b2g process that's starting up called fb_open and it's waiting on a mutex which is presumably being held by the b2g process which called fb_release. So the problem is that fb_release is getting hung up.

I've done a bunch of builds and not been able to reprodcue the problem, when I realized that I was doing DEBUG builds. The nighlties that I was able to reproduce the problem with would have been non-DEBUG builds. So I'm just building stuff again with B2G_DEBUG=0 to see if that helps to reproduce the problem.

What exactly fb_release is hung up on, I'm not sure.

even with B2G_DEBUG=0, I haven't been able to reproduce the problem yet (at least not with the same gecko version that reproduces the problem from a nightly).

I'm going to go back to the latest and work my way backwards now.
I'm gettingit to fail more reliably by doing:

adb reboot && adb wait-for-device && adb shell

and then doing:

b2g-ps -t; stop b2g; sleep 2; b2g-ps -t; start b2g

So now builds that I thought were working are now failing (sometimes it takes 10 tries before I get a failure).

I'm currently on -r 167795 on b2g-inbound and observing a failure.
OK - I can reproduce the problem with just the base image.

i.e. I flashed V1.2-device.cfg using TeleWeb and trying to reproduce on that.

Settings reports: Boot2Gecko-1.2.0.0-prerelease

So this looks some type of race in the kernel driver, and the timings have changed just enough recently that they're being hit by the automation tests more frequently.

When it hangs, b2g hangs hard and is unkillable. The only way to recover is to reboot.

So I've recommended to Zac that we change the test code to do the following:

1. stop b2g
2. sleep for a couple of seconds, check to see if b2g is still runnung.
3. If b2g isn't running, then go ahead and start b2g (as per normal)
4. If b2g is still running, do a reboot followed by adb wait-for-device

I'm 99.9% sure that the fundamental problem is a bug in the kernel driver.
(In reply to Dave Hylands [:dhylands] from comment #26)
> OK - I can reproduce the problem with just the base image.
> 
> i.e. I flashed V1.2-device.cfg using TeleWeb and trying to reproduce on that.
> 
> Settings reports: Boot2Gecko-1.2.0.0-prerelease
> 
> So this looks some type of race in the kernel driver, and the timings have
> changed just enough recently that they're being hit by the automation tests
> more frequently.
> 
> When it hangs, b2g hangs hard and is unkillable. The only way to recover is
> to reboot.
> 
> So I've recommended to Zac that we change the test code to do the following:
> 
> 1. stop b2g
> 2. sleep for a couple of seconds, check to see if b2g is still runnung.
> 3. If b2g isn't running, then go ahead and start b2g (as per normal)
> 4. If b2g is still running, do a reboot followed by adb wait-for-device
> 
> I'm 99.9% sure that the fundamental problem is a bug in the kernel driver.

Zac, Dave, any idea what changed in the 1.4 tests or 1.4 gaia/gecko that is triggering this race condition?   We've been testing the V1.2-device.cfg base image since Jan 10th or so.   curious what caused this issue, but hopefully we can get unblocked from tests.

Thanks.
(In reply to Tony Chung [:tchung] from comment #27)
> Zac, Dave, any idea what changed in the 1.4 tests or 1.4 gaia/gecko that is
> triggering this race condition?   We've been testing the V1.2-device.cfg
> base image since Jan 10th or so.   curious what caused this issue, but
> hopefully we can get unblocked from tests.

I'm guessing that its a race. It could be as subtle as some added code A caused some other piece of code B to run from different cache lines which in turn caused B to run with slightly different timings.

Don't laugh - in a previous job I actually had exactly that happen.

Without the exact kernel code we'll probably never know the root cause. (TCL only releases kernel code for versions of FirefoxOS that they've released - which as I understand things is for 1.0 and kernels built from that code don't boot when used with 1.3 (or so I was told)).

The real solution is to get the buggy kernel driver fixed.

Based on experience, the amount of effort required to fix the real bug will almost certainly be significantly less than the amount of effort required to do an exact analysis and determine what triggered the race.

I couldn't reproduce this on my unagi.

It was suggested that it might be the panel driver (and I see that the panel driver is in the call stack of the hung b2g). So we need to get the vendor involved.
Moving to vendcom as we've confirmed this is a base image bug.

Vance - Can you followup with TCL about this?

Pulling qablocker as bug 972557 has a confirmed workaround. Removing regression as this problem has been present with the base image even back to 1.2.
blocking-b2g: 1.4? → ---
Component: General → Vendcom
Flags: needinfo?(vchen)
Whiteboard: [POVB]
Jason, bug 972557 has not been proven to fix the issue yet. My first attempt didn't work, but I have high hopes for the new version. However, until we have a working fix we shouldn't assume that automation is unblocked.
A big question I have is - why is this a vendor problem if there's a confirmed regression range? Do we what changeset actually triggered the problem to cause the driver hang?
Component: Vendcom → General
Flags: needinfo?(vchen) → needinfo?(dhylands)
Whiteboard: [POVB]
blocking-b2g: --- → 1.4?
(In reply to Jason Smith [:jsmith] from comment #32)
> A big question I have is - why is this a vendor problem if there's a
> confirmed regression range? Do we what changeset actually triggered the
> problem to cause the driver hang?

What I'm suggesting here btw is that we really should identify what changeset caused this bug to be triggered & understand why that's the case.
Something is not right here; prior to the build cited in comment #19 we stopped and started b2g hundreds of times a day (no exaggeration either) on various phones for a year, the last 3-4 months of that on Hamachi devices and never seen this problem. Thousands of successes doesn't really stack up with Dave saying that b2g could not stop properly at any time with any build.

While Dave has no doubt found a bug I'm beginning to wonder whether he's got something similar but not *precisely* the same as what the automation is getting tripped up on. Where we tried to reduce the replication steps to not use automation (with good intention) we may have led him to stumble on a different issue.

I'd like to take a step back and replicate this using the ui-test suite but with Dave's debug information from comment #11 and see if they match up.
(In reply to Jason Smith [:jsmith] from comment #33)
> (In reply to Jason Smith [:jsmith] from comment #32)
> > A big question I have is - why is this a vendor problem if there's a
> > confirmed regression range? Do we what changeset actually triggered the
> > problem to cause the driver hang?
> 
> What I'm suggesting here btw is that we really should identify what
> changeset caused this bug to be triggered & understand why that's the case.

See comment 22 for the list of files changed in the indicated regression range.

If it is in fact a race that's caused by adding code, then any changeset that moves the code could cause the problem to manifest more frequently.

Its also possible that there are multiple problems, and I've only discovered one of them.

What you're really asking is what caused the timing to change such that the bug is triggered. If I knew what exactly the bug in the kernel driver was, then I could probably find out what change we made in completely unrelated code triggered it. See comment 29.
Flags: needinfo?(dhylands)
(In reply to Zac C (:zac) from comment #34)
> Something is not right here; prior to the build cited in comment #19 we
> stopped and started b2g hundreds of times a day (no exaggeration either) on
> various phones for a year, the last 3-4 months of that on Hamachi devices
> and never seen this problem. Thousands of successes doesn't really stack up
> with Dave saying that b2g could not stop properly at any time with any build.
> 
> While Dave has no doubt found a bug I'm beginning to wonder whether he's got
> something similar but not *precisely* the same as what the automation is
> getting tripped up on. Where we tried to reduce the replication steps to not
> use automation (with good intention) we may have led him to stumble on a
> different issue.
> 
> I'd like to take a step back and replicate this using the ui-test suite but
> with Dave's debug information from comment #11 and see if they match up.

Perhaps. I'd need to to see the output of performing the steps in comment 11 when you've got a hung b2g process and compare it to what I reported comment 17.
Okay - thanks for clarification Dave.

Vance - Can you outreach to TCL to look into this?
blocking-b2g: 1.4? → ---
Component: General → Vendcom
Flags: needinfo?(vchen)
Attached file dmesg.txt
Oddly the automation has been fine on the last two builds.

However using an older build:
Device    Hamachi
Gaia      4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6
Gecko     https://hg.mozilla.org/mozilla-central/rev/802d87c77e76
BuildID   20140212040203
Version   30.0a1

I have picked out the dmesg from after the stop b2g after running test_camera_multiple_shots.py which took 5 loops to replicate.
(In reply to Zac C (:zac) from comment #38)
> Created attachment 8376360 [details]
> dmesg.txt
> 
> Oddly the automation has been fine on the last two builds.
> 
> However using an older build:
> Device    Hamachi
> Gaia      4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6
> Gecko     https://hg.mozilla.org/mozilla-central/rev/802d87c77e76
> BuildID   20140212040203
> Version   30.0a1
> 
> I have picked out the dmesg from after the stop b2g after running
> test_camera_multiple_shots.py which took 5 loops to replicate.

And here's the kernel backtrace of the hung thead:

<6>[  184.172774] mozStorage #1   D c05b46c0  5880  1309      1 0x00000001
<4>[  184.172818] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8)
<4>[  184.172841] [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) from [<c05b3f30>] (wait_for_common+0x104/0x198)
<4>[  184.172869] [<c05b3f30>] (wait_for_common+0x104/0x198) from [<c02b7260>] (mipi_dsi_off+0x7c/0x25c)
<4>[  184.172894] [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) from [<c02b9fa4>] (panel_next_off+0x3c/0x48)
<4>[  184.172916] [<c02b9fa4>] (panel_next_off+0x3c/0x48) from [<c02a00c8>] (mdp_off+0x30/0x4c)
<4>[  184.172939] [<c02a00c8>] (mdp_off+0x30/0x4c) from [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120)
<4>[  184.172961] [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) from [<c029a7b4>] (msm_fb_release+0x60/0x98)
<4>[  184.172988] [<c029a7b4>] (msm_fb_release+0x60/0x98) from [<c0293820>] (fb_release+0x30/0x54)
<4>[  184.173011] [<c0293820>] (fb_release+0x30/0x54) from [<c016cec0>] (fput+0x10c/0x1d0)
<4>[  184.173033] [<c016cec0>] (fput+0x10c/0x1d0) from [<c01699f8>] (filp_close+0x78/0x84)
<4>[  184.173053] [<c01699f8>] (filp_close+0x78/0x84) from [<c00cdde0>] (put_files_struct+0x94/0xf0)
<4>[  184.173073] [<c00cdde0>] (put_files_struct+0x94/0xf0) from [<c00cf63c>] (do_exit+0x2bc/0x78c)
<4>[  184.173093] [<c00cf63c>] (do_exit+0x2bc/0x78c) from [<c00cfb98>] (do_group_exit+0x8c/0xb8)
<4>[  184.173114] [<c00cfb98>] (do_group_exit+0x8c/0xb8) from [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438)
<4>[  184.173139] [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) from [<c0045840>] (do_signal+0xb0/0x58c)
<4>[  184.173161] [<c0045840>] (do_signal+0xb0/0x58c) from [<c0045d34>] (do_notify_resume+0x18/0x38)
<4>[  184.173183] [<c0045d34>] (do_notify_resume+0x18/0x38) from [<c0042db8>] (work_pending+0x24/0x28)

which appears identical to the one reported in comment 17
Blocks: 801898
Severity: normal → critical
Leaving this on the nom list as there's two possible resolution paths here:

1. Get the vendor to fix the problem
2. Backout the code triggering the regression here
blocking-b2g: --- → 1.4?
This hasn't affected the UI nodes since Thursday of last week. Not clear what made it work again but we'll hold off on any framework changes.

This bug is probably still valid, however.


(In reply to Stephen Donner [:stephend] from comment #40)
> This is hitting our performance tests, too:
> 
> http://qa-selenium.mv.mozilla.com:8080/job/b2g.hamachi.mozilla-central.
> master.perf.fps/1564/console
> http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.hamachi.
> mozilla-central.master.perf/1355/console

This isn't the bug; it just means that the device needs to be rebooted.
Removing qablocker as we're not affected right now, but let's keep an eye out if this happens again.
blocking-b2g: 1.4? → ---
Keywords: qablocker
Have we tried the 11/14 dated OEM firmware?  Maybe this is an issue with the v1.2-device.cfg OEM firmware?  (Sorry if I missed this in a previous comment... long thread!)
(In reply to Ben Kelly [:bkelly] from comment #44)
> Have we tried the 11/14 dated OEM firmware?  Maybe this is an issue with the
> v1.2-device.cfg OEM firmware?  (Sorry if I missed this in a previous
> comment... long thread!)

On your suggestion, I flashed US-V1.2-Dev-20131209.cfg and reproduced the problem with that build.

I see some images from Oct 20131011 (Germany and Hungary) and Brazil image 20131016, but I don't see any 11/14 images.

I think I'll try the Germany one, since it appears to be the oldest one available.
I tried the Germany image: TA-V1.1-17FFD-DT-Germany-20131011.cfg, but unfortunately, it's a production build, so "stop b2g" is a no-op, so I'm unable to determine if the problem exists in that load.
The latest builds don't seem to be coming up correctly on the inari after flashing... could this be related? See bug 975490.
(In reply to William Lachance (:wlach) from comment #47)
> The latest builds don't seem to be coming up correctly on the inari after
> flashing... could this be related? See bug 975490.

Since you're seeing some type of logo, this doesn't sound like the same bug (although it depends on whether you're seeing any animation, or just a stale image).

This bug is really about b2g stop hanging (which could happen after a flash - although with a full flash - which is what I believe that the inari does, you're essentially doing a cold boot, and this bug definitely doesn't apply in that case.
(In reply to Dave Hylands [:dhylands] from comment #48)
> (In reply to William Lachance (:wlach) from comment #47)
> > The latest builds don't seem to be coming up correctly on the inari after
> > flashing... could this be related? See bug 975490.
> 
> Since you're seeing some type of logo, this doesn't sound like the same bug
> (although it depends on whether you're seeing any animation, or just a stale
> image).

Yup, looks like that was a duplicate of bug 975451.
I have enhanced the flash script used by Jenkins so that if the stop b2g still has files open we perform an adb reboot and try again. It seems to have worked in a couple of places where we were otherwise failing.
Flags: needinfo?(vchen)
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: