Closed
Bug 971747
Opened 10 years ago
Closed 6 years ago
Phone freezes while starting and stopping the b2g process
Categories
(Firefox OS Graveyard :: Vendcom, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: viorela, Unassigned)
References
Details
(Keywords: regression)
Attachments
(2 files)
Phone is freezing with a black screen while running tests, after trying to restart. In order to reuse the phone, we need to do an adb reboot. I couldn't get a stacktrace as I waited about half an hour to see the test failing with an error, but it didn't. I attached a logcat of the issue. Gaia 4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6 Gecko https://hg.mozilla.org/mozilla-central/rev/802d87c77e76 BuildID 20140211160209 Version 30.0a1 ro.build.version.incremental=eng.archermind.20131114.105818 ro.build.date=Thu Nov 14 10:58:33 CST 2013
Comment 1•10 years ago
|
||
You need to be more descriptive about what 'restart' the phone is because I know you don't mean using the restart button on the top of the device.
Comment 2•10 years ago
|
||
Also what is the effect of this? It is freezing the device and stopping all following automation from running.
Component: Gaia::UI Tests → Gaia::System
Reporter | ||
Comment 3•10 years ago
|
||
I was able to reproduce this fail manually, by running adb shell stop b2g, adb shell start b2g repeatedly.
Reporter | ||
Updated•10 years ago
|
Summary: Phone freezes while trying to restart the phone → Phone freezes while starting and stopping the b2g process
Comment 4•10 years ago
|
||
On Buri device I am seeing a crash shortly after the device freezes: https://crash-stats.mozilla.com/report/index/be95802e-4f66-4c78-9449-06c822140212 I restarted the phone, and then it did not start. After a battery pull it restarted but the crash report came up. I will file a separate bug for the crash in a moment.
Comment 5•10 years ago
|
||
Just to clarify, Marcia found that bug by restarting using the power menu. I replicated as per comment #3 and got no crash report.
Comment 6•10 years ago
|
||
The process is still running after it has frozen, a'la comment #3. --- USER PID PPID VSIZE RSS WCHAN PC NAME root 1 0 428 288 c017b404 00008728 S /init root 2 0 0 0 c00e9618 00000000 S kthreadd root 3 2 0 0 c00d26fc 00000000 S ksoftirqd/0 root 4 2 0 0 c00e3d98 00000000 S kworker/0:0 root 5 2 0 0 c006a058 00000000 D kworker/u:0 root 6 2 0 0 c010b8ac 00000000 S migration/0 root 7 2 0 0 c00e555c 00000000 S khelper root 8 2 0 0 c00e555c 00000000 S suspend_sys_syn root 9 2 0 0 c00e555c 00000000 S suspend root 10 2 0 0 c01498d4 00000000 S sync_supers root 11 2 0 0 c014a8e4 00000000 S bdi-default root 12 2 0 0 c00e555c 00000000 S kblockd root 13 2 0 0 c034fde4 00000000 S khubd root 14 2 0 0 c00e555c 00000000 S l2cap root 15 2 0 0 c00e555c 00000000 S a2mp root 16 2 0 0 c00e555c 00000000 S modem_notifier root 17 2 0 0 c00e555c 00000000 S smd_channel_clo root 18 2 0 0 c00e555c 00000000 S smsm_cb_wq root 19 2 0 0 c00e3d98 00000000 S kworker/0:1 root 20 2 0 0 c00e3d98 00000000 S kworker/u:1 root 21 2 0 0 c00e555c 00000000 S rpcrouter root 22 2 0 0 c00e555c 00000000 S rpcrotuer_smd_x root 23 2 0 0 c006c6e8 00000000 S krpcserversd root 25 2 0 0 c00e555c 00000000 S msm_adsp_probe root 26 2 0 0 c006b4f0 00000000 D voicememo_rpc root 27 2 0 0 c006b6e8 00000000 S kadspd root 28 2 0 0 c00e555c 00000000 S rmt_storage root 29 2 0 0 c01422cc 00000000 S kswapd0 root 30 2 0 0 c019d8dc 00000000 S fsnotify_mark root 31 2 0 0 c00e555c 00000000 S crypto root 43 2 0 0 c00e555c 00000000 S mdp_dma_wq root 44 2 0 0 c00e555c 00000000 S mdp_vsync_wq root 45 2 0 0 c00e555c 00000000 S mdp_pipe_ctrl_w root 46 2 0 0 c00e555c 00000000 S k_hsuart root 48 2 0 0 c00e555c 00000000 S kgsl-3d0 root 49 2 0 0 c010cd9c 00000000 S irq/273-1-0039 root 50 2 0 0 c0333098 00000000 S mtdblock0 root 51 2 0 0 c0333098 00000000 S mtdblock1 root 52 2 0 0 c0333098 00000000 S mtdblock2 root 53 2 0 0 c0333098 00000000 S mtdblock3 root 54 2 0 0 c0333098 00000000 S mtdblock4 root 55 2 0 0 c0333098 00000000 S mtdblock5 root 56 2 0 0 c0333098 00000000 S mtdblock6 root 57 2 0 0 c0333098 00000000 S mtdblock7 root 58 2 0 0 c0333098 00000000 S mtdblock8 root 59 2 0 0 c0333098 00000000 S mtdblock9 root 60 2 0 0 c0333098 00000000 S mtdblock10 root 61 2 0 0 c0333098 00000000 S mtdblock11 root 62 2 0 0 c0333098 00000000 S mtdblock12 root 63 2 0 0 c0333098 00000000 S mtdblock13 root 64 2 0 0 c0333098 00000000 S mtdblock14 root 65 2 0 0 c0333098 00000000 S mtdblock15 root 66 2 0 0 c0333098 00000000 S mtdblock16 root 67 2 0 0 c0333098 00000000 S mtdblock17 root 68 2 0 0 c0333098 00000000 S mtdblock18 root 69 2 0 0 c0333098 00000000 S mtdblock19 root 70 2 0 0 c0333098 00000000 S mtdblock20 root 71 2 0 0 c0333098 00000000 S mtdblock21 root 72 2 0 0 c0333098 00000000 S mtdblock22 root 73 2 0 0 c0333098 00000000 S mtdblock23 root 74 2 0 0 c00e555c 00000000 S k_rmnet_mux_wor root 75 2 0 0 c00e555c 00000000 S f_mtp root 76 2 0 0 c03932b4 00000000 S file-storage root 77 2 0 0 c00e3d98 00000000 S kworker/0:2 root 78 2 0 0 c00e555c 00000000 S uether root 79 2 0 0 c00e555c 00000000 S diag_wq root 80 2 0 0 c00e555c 00000000 S diag_cntl_wq root 81 2 0 0 c010cd9c 00000000 S irq/304-ft5x06_ root 82 2 0 0 c006b4f0 00000000 D krtcclntd root 83 2 0 0 c006dcac 00000000 D krtcclntcbd root 84 2 0 0 c00e555c 00000000 S kfmradio root 85 2 0 0 c006b4f0 00000000 D kbatteryclntd root 86 2 0 0 c006dcac 00000000 D kbatteryclntcbd root 87 2 0 0 c00e555c 00000000 S battery_queue root 88 2 0 0 c00e555c 00000000 S msm_adc root 89 2 0 0 c00e555c 00000000 S dalrpc_rcv_DAL0 root 90 2 0 0 c00e555c 00000000 S iewq root 91 2 0 0 c00e9588 00000000 D kinteractiveup root 92 2 0 0 c00e555c 00000000 S binder root 93 2 0 0 c00e3d98 00000000 S kworker/u:3 root 94 2 0 0 c0598a80 00000000 S krfcommd root 95 2 0 0 c00e555c 00000000 S msm-cpufreq root 96 2 0 0 c006b4f0 00000000 D khsclntd root 97 2 0 0 c00e555c 00000000 S deferwq root 98 1 312 4 c017b404 00008728 S /sbin/ueventd root 99 2 0 0 c043650c 00000000 S mmcqd/0 root 100 2 0 0 c02394f4 00000000 S yaffs-bg-1 root 102 2 0 0 c02394f4 00000000 S yaffs-bg-1 root 103 2 0 0 c02394f4 00000000 S yaffs-bg-1 root 104 2 0 0 c02394f4 00000000 S yaffs-bg-1 system 129 1 852 224 c044d378 400f18b0 S /system/bin/servicemanager root 130 1 4036 556 ffffffff 40067eb4 S /system/bin/vold root 133 1 3336 504 ffffffff 400658b0 S /system/bin/fakeperm root 136 1 7832 1080 ffffffff 4006ceb4 S /system/bin/netd root 137 1 720 216 c0487240 400bc1b8 S /system/bin/debuggerd radio 138 1 20340 3240 ffffffff 40047eb4 S /system/bin/rild media 139 1 18660 5396 ffffffff 400928b0 S /system/bin/mediaserver bluetooth 140 1 1352 532 c017b404 400c567c S /system/bin/dbus-daemon keystore 141 1 1748 496 c0487240 4005b1b8 S /system/bin/keystore system 153 1 1924 272 ffffffff 4004e0e8 S /system/bin/mm-pp-daemon system 157 1 3052 688 c017b404 4003667c S /system/bin/mm-qcamera-daemon root 158 1 1188 428 c017b404 4006f67c S /system/bin/gpu_dcvsd root 159 2 0 0 c00e555c 00000000 S k_gserial root 160 2 0 0 c00e555c 00000000 S k_gsmd root 161 2 0 0 c00e555c 00000000 S k_rmnet_work root 164 1 4480 176 ffffffff 0000829c S /sbin/adbd root 191 1 872 252 c017b404 400929d4 S /system/bin/location-mq radio 196 1 6032 560 ffffffff 400fd9d4 S /system/bin/qmuxd root 197 1 14604 996 ffffffff 4005e810 S /system/bin/xtwifi-inet-agent shell 203 1 792 296 c02c4448 4009e678 S /system/bin/sh radio 204 1 6160 784 ffffffff 40052810 S /system/bin/netmgrd root 205 1 12184 1524 ffffffff 40037810 S /system/bin/xtwifi-client root 290 2 0 0 c018d644 00000000 S flush-31:5 root 360 2 0 0 c00e555c 00000000 S cfg80211 root 363 2 0 0 c00e555c 00000000 S ath6kl root 367 2 0 0 c04325e0 00000000 S ksdioirqd/mmc1 system 403 1 1348 428 c017b404 4010967c S /system/bin/qosmgr root 706 1 0 0 c02b7260 00000000 D b2g root 745 706 0 0 c00cfac8 00000000 Z (Nuwa) root 755 1 44152 9900 ffffffff 400d46f8 D /system/b2g/b2g root 760 164 780 296 c0046254 4005e094 S /system/bin/sh root 762 760 972 276 00000000 40065678 R ps
Updated•10 years ago
|
Comment 7•10 years ago
|
||
Fabrice, might you be able to help us find the culprit, here?
Is this the same as https://bugzilla.mozilla.org/show_bug.cgi?id=964367?
Oops, I meant https://bugzilla.mozilla.org/show_bug.cgi?id=971605 sorry for the spam.
Comment 10•10 years ago
|
||
Clint, no it is not, it is different and unlikely to be related. This can be replicated outside of automation. I spent some time trying to work around it, changing the durations between the stop and the start but it had no effect. I'm thinking Naoki might know something about this, if the b2g process cannot be started because there's a filelock in place or somesuch.
Comment 11•10 years ago
|
||
It looks like the previous b2g hasn't actually stopped yet. Starting a new b2g before the previous one has stopped will almost certainly not work. If I'm reading things right, "stop b2g" sends a kill -9 to the process. The only way that will not kill the process is if its hung on some uninterruptible system call. It would be useful to try the following: echo t > /proc/sysrq-trigger dmesg You should be able to find the b2g process (From the above ps there is 706 and 755. 706 is the one we're interested in) What I'm interested in is this: <6>[ 4052.836422] b2g S c05b46c0 4776 875 1 0x00000000 <4>[ 4052.836591] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b58f0>] (schedule_hrtimeout_range_clock+0x4c/0x144) <4>[ 4052.836702] [<c05b58f0>] (schedule_hrtimeout_range_clock+0x4c/0x144) from [<c01a1074>] (sys_epoll_wait+0x1b8/0x314) <4>[ 4052.836802] [<c01a1074>] (sys_epoll_wait+0x1b8/0x314) from [<c0042d60>] (ret_fast_syscall+0x0/0x30) In this example b2g was pid 875 (the output will be in PID order so that should help to find the right stuff). We'd probably need to look at all of the threads as well. So getting ps -t output and the dmesg output attached might shed some light. The information that's shown here is the callstack inside the kernel for each thread.
Comment 12•10 years ago
|
||
(In reply to Stephen Donner [:stephend] from comment #7) > Fabrice, might you be able to help us find the culprit, here? What Dave said!
Comment 13•10 years ago
|
||
Thanks very much Dave, I'll be able to debug this tomorrow. Just clarify, do you want me to get the debug information from the point after stopping b2g or after restarting when it has jammed? It's possible (or likely) we are stopping b2g while some important bits are going on as as soon as our final test step finishes, if there's no debug to collate we immediately stop the process (in order to wipe data/profile and start it again).
Comment 15•10 years ago
|
||
(In reply to Zac C (:zac) from comment #13) > It's possible (or likely) we are stopping b2g while some important bits are > going on as as soon as our final test step finishes, if there's no debug to > collate we immediately stop the process (in order to wipe data/profile and > start it again). Also Dave, this strategy has worked successfully for us for around 12 months. This is definitely something that changed in yesterday's late-PST build but I'm not clear whether it's a regression on b2g side or something we need to resolve in the test framework.
Comment 16•10 years ago
|
||
(In reply to Zac C (:zac) from comment #13) > Thanks very much Dave, I'll be able to debug this tomorrow. Just clarify, do > you want me to get the debug information from the point after stopping b2g > or after restarting when it has jammed? > > It's possible (or likely) we are stopping b2g while some important bits are > going on as as soon as our final test step finishes, if there's no debug to > collate we immediately stop the process (in order to wipe data/profile and > start it again). Hi Zac, I suppose both would be interesting. Figuring out where its hanging on startup is less interesting than knowing why its hanging on shutdown. Normally, when you do stop b2g, it should stop almost immediately. Although maybe theres a difference between eng and user (I seem to remember cjones mentioning something along those lines once upon a time). Since you're testing, I'm assuming that you're using eng or userdebug builds? And what do I need to do to try and reproduce this locally?
Comment 17•10 years ago
|
||
ok - using a userdebug build (which is what I happened to have on my phone, I was able to reproduce by doing start b2g stop b2g (while the firefox with the flaming tail logo was still visible on the screen) I actually did: stop b2g; sleep 5; b2g-ps; start-b2g and then waited variable amounts of time before hitting the up arrow and return. If you see b2g still present in the b2g-ps output then you've most likely hit the situation. Here's my dmesg output at hang time; For the b2g that's hung trying to stop: <6>[ 8229.892480] b2g D c05b46c0 5128 1689 1 0x00000001 <4>[ 8229.892640] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) <4>[ 8229.892734] [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) from [<c05b3f30>] (wait_for_common+0x104/0x198) <4>[ 8229.892837] [<c05b3f30>] (wait_for_common+0x104/0x198) from [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) <4>[ 8229.892932] [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) from [<c02b9fa4>] (panel_next_off+0x3c/0x48) <4>[ 8229.893020] [<c02b9fa4>] (panel_next_off+0x3c/0x48) from [<c02a00c8>] (mdp_off+0x30/0x4c) <4>[ 8229.893109] [<c02a00c8>] (mdp_off+0x30/0x4c) from [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) <4>[ 8229.893195] [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) from [<c029a7b4>] (msm_fb_release+0x60/0x98) <4>[ 8229.893292] [<c029a7b4>] (msm_fb_release+0x60/0x98) from [<c0293820>] (fb_release+0x30/0x54) <4>[ 8229.893384] [<c0293820>] (fb_release+0x30/0x54) from [<c016cec0>] (fput+0x10c/0x1d0) <4>[ 8229.893465] [<c016cec0>] (fput+0x10c/0x1d0) from [<c01699f8>] (filp_close+0x78/0x84) <4>[ 8229.893549] [<c01699f8>] (filp_close+0x78/0x84) from [<c00cdde0>] (put_files_struct+0x94/0xf0) <4>[ 8229.893632] [<c00cdde0>] (put_files_struct+0x94/0xf0) from [<c00cf63c>] (do_exit+0x2bc/0x78c) <4>[ 8229.893714] [<c00cf63c>] (do_exit+0x2bc/0x78c) from [<c00cfb98>] (do_group_exit+0x8c/0xb8) <4>[ 8229.893802] [<c00cfb98>] (do_group_exit+0x8c/0xb8) from [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) <4>[ 8229.893900] [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) from [<c0045840>] (do_signal+0xb0/0x58c) <4>[ 8229.893995] [<c0045840>] (do_signal+0xb0/0x58c) from [<c0045d34>] (do_notify_resume+0x18/0x38) <4>[ 8229.894084] [<c0045d34>] (do_notify_resume+0x18/0x38) from [<c0042db8>] (work_pending+0x24/0x28) For the b2g that's hanging while starting up: <6>[ 8229.894135] b2g D c05b46c0 6048 1715 1 0x00000000 <4>[ 8229.894302] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b5300>] (__mutex_lock_slowpath+0x144/0x1c0) <4>[ 8229.894399] [<c05b5300>] (__mutex_lock_slowpath+0x144/0x1c0) from [<c05b539c>] (mutex_lock+0x20/0x40) <4>[ 8229.894494] [<c05b539c>] (mutex_lock+0x20/0x40) from [<c029393c>] (fb_open+0x60/0x1c0) <4>[ 8229.894587] [<c029393c>] (fb_open+0x60/0x1c0) from [<c016eab0>] (chrdev_open+0x204/0x22c) <4>[ 8229.894675] [<c016eab0>] (chrdev_open+0x204/0x22c) from [<c016a020>] (__dentry_open+0x218/0x330) <4>[ 8229.894760] [<c016a020>] (__dentry_open+0x218/0x330) from [<c016a1e8>] (nameidata_to_filp+0x50/0x5c) <4>[ 8229.894847] [<c016a1e8>] (nameidata_to_filp+0x50/0x5c) from [<c01772a0>] (do_last+0x7c8/0x8d4) <4>[ 8229.894934] [<c01772a0>] (do_last+0x7c8/0x8d4) from [<c0178db4>] (path_openat+0xc4/0x380) <4>[ 8229.895020] [<c0178db4>] (path_openat+0xc4/0x380) from [<c0179150>] (do_filp_open+0x30/0x7c) <4>[ 8229.895104] [<c0179150>] (do_filp_open+0x30/0x7c) from [<c0169c6c>] (do_sys_open+0xd8/0x174) <4>[ 8229.895189] [<c0169c6c>] (do_sys_open+0xd8/0x174) from [<c0042d60>] (ret_fast_syscall+0x0/0x30) I couldn't run gdb since I had a build going and I had wiped out my out tree. So I'll wait until my build is finished and gather more information then.
Flags: needinfo?(dhylands)
Comment 18•10 years ago
|
||
So it looks like the one that's hung stopping is trying to close the framebuffer and the one thats hanging on startup is trying to open the framebuffer
Comment 19•10 years ago
|
||
I don't have the buildid to hand but this started at this build: https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/nightly/mozilla-central-hamachi-eng/2014/02/2014-02-11-16-02-09/
Comment 20•10 years ago
|
||
Yeah - so I tried "kill -9 1689" and its not going away. So this tells me that we have a driver bug (fundamentally drivers shouldn't be hanging like this). It's quite possible that our code paths have changed recently are exposing the driver bug, so we can still investigate along those lines.
Comment 21•10 years ago
|
||
(In reply to Zac C (:zac) from comment #19) > I don't have the buildid to hand but this started at this build: > https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/nightly/mozilla- > central-hamachi-eng/2014/02/2014-02-11-16-02-09/ I can also reproduce in 2014-02-11-04-02-00 but not in 2014-02-10-16-02-01 (so that's a 12 hour range)
Comment 22•10 years ago
|
||
Hmm. So looking at the sources.xml files, I get hash d2ef16927622a687998c32093cc80d5357620a14 has the problem hash c8a8bf353a31f20c11161e0f0274ad7c56b1e716 doesn't And git shows: d2ef169 Ryan VanderMeulen 2 days ago Merge inbound to m-c. c8a8bf3 Ryan VanderMeulen 2 days ago Merge b2g-inbound to m-c. So it looks like a merge from inbound seems to be the culprit. And this is the output of git show --stat HEAD (with HEAD set to d2ef16927622a687998c32093cc80d5357620a14) 1150 >git show --stat HEAD commit d2ef16927622a687998c32093cc80d5357620a14 Merge: c8a8bf3 eea6ca3 Author: Ryan VanderMeulen <ryanvm@gmail.com> Date: Mon Feb 10 15:42:58 2014 -0500 Merge inbound to m-c. browser/app/Makefile.in | 1 - content/media/TextTrack.cpp | 15 +- content/media/TextTrackCue.h | 1 + content/media/test/chrome.ini | 1 + content/media/test/test_texttrackcue.html | 9 + content/media/test/test_texttrackcue_chrome.html | 33 + db/sqlite3/src/Makefile.in | 2 - db/sqlite3/src/moz.build | 4 + dom/plugins/test/testplugin/testplugin.mk | 2 - dom/plugins/test/testplugin/testplugin.mozbuild | 4 + dom/workers/File.cpp | 8 +- .../printingui/src/win/nsPrintDialogUtil.cpp | 2 +- embedding/tests/winEmbed/Makefile.in | 2 - embedding/tests/winEmbed/moz.build | 2 + gfx/angle/src/libEGL/Makefile.in | 1 - gfx/angle/src/libEGL/moz.build | 2 + gfx/angle/src/libGLESv2/Makefile.in | 1 - gfx/angle/src/libGLESv2/moz.build | 2 + gfx/ipc/GfxMessageUtils.h | 8 +- gfx/layers/d3d11/TextureD3D11.cpp | 16 +- gfx/layers/ipc/AsyncPanZoomController.cpp | 4 +- gfx/thebes/gfxDWriteFontList.cpp | 2 +- js/src/Makefile.in | 5 - js/src/ctypes/Library.cpp | 2 +- js/src/jit/Bailouts.cpp | 2 +- js/src/jit/Bailouts.h | 3 - js/src/jit/BaselineBailouts.cpp | 6 +- js/src/jit/IonCode.h | 8 - js/src/jit/IonFrameIterator.h | 41 +- js/src/jit/IonFrames.cpp | 162 +++-- js/src/jit/IonTypes.h | 11 + js/src/jit/SnapshotReader.h | 250 -------- js/src/jit/SnapshotWriter.h | 78 --- js/src/jit/Snapshots.cpp | 650 +++++++++++--------- js/src/jit/Snapshots.h | 463 ++++++++++++++ js/src/jit/arm/Simulator-arm.cpp | 51 ++ js/src/jit/arm/Simulator-arm.h | 1 + js/src/jit/shared/CodeGenerator-shared.cpp | 50 +- js/src/jit/shared/CodeGenerator-shared.h | 4 +- js/src/jsapi-tests/moz.build | 1 + js/src/jsapi-tests/testJitRValueAlloc.cpp | 236 +++++++ js/src/jsapi.cpp | 41 +- js/src/jsapi.h | 7 +- js/src/moz.build | 2 + js/src/perf/jsperf.cpp | 10 +- js/src/shell/js.cpp | 16 +- js/src/vm/Stack.cpp | 10 +- js/xpconnect/src/XPCShellImpl.cpp | 2 +- layout/base/nsDisplayList.h | 6 + layout/mathml/nsMathMLmoFrame.cpp | 4 + .../reftests/mathml/mo-invisibleoperators-ref.html | 12 + layout/reftests/mathml/mo-invisibleoperators.html | 17 + layout/reftests/mathml/reftest.list | 1 + layout/style/ImageLoader.cpp | 2 +- mobile/android/components/PaymentsUI.js | 6 +- netwerk/base/src/ProxyAutoConfig.cpp | 14 +- python/mozbuild/mozbuild/frontend/emitter.py | 2 + .../mozbuild/mozbuild/frontend/sandbox_symbols.py | 12 + .../test/backend/data/variable_passthru/moz.build | 3 + .../mozbuild/test/backend/test_recursivemake.py | 6 + .../test/frontend/data/variable-passthru/moz.build | 3 + .../mozbuild/test/frontend/test_emitter.py | 2 + security/manager/ssl/src/nsSmartCardMonitor.cpp | 8 +- testing/marionette/client/marionette/client.py | 14 +- .../marionette/client/marionette/runner/base.py | 47 +- .../client/marionette/runner/mixins/b2g.py | 91 ++- .../client/marionette/runner/mixins/reporting.py | 1 + toolkit/library/Makefile.in | 1 - toolkit/library/moz.build | 11 + webapprt/PaymentUIGlue.js | 6 +- widget/os2/Makefile.in | 2 - widget/os2/moz.build | 2 + widget/windows/Makefile.in | 2 - widget/windows/moz.build | 2 + xpcom/ds/nsWindowsRegKey.cpp | 4 +- xpcom/reflect/xptcall/src/md/unix/moz.build | 2 +- xulrunner/app/Makefile.in | 1 - 77 files changed, 1663 insertions(+), 855 deletions(-) This appears to be this merge (on m-c) 167907:d812f80a0f1d https://hg.mozilla.org/mozilla-central/rev/d812f80a0f1d https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d812f80a0f1d which corresponds to 167915:d812f80a0f1d thru 167876:542f9093ec36 in mozilla-inbound.
Comment 23•10 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #18) > So it looks like the one that's hung stopping is trying to close the > framebuffer and the one thats hanging on startup is trying to open the > framebuffer Could you be more specific about the details of this frame buffer? Is it potentially related to graphics?
Comment 24•10 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #23) > (In reply to Dave Hylands [:dhylands] from comment #18) > > So it looks like the one that's hung stopping is trying to close the > > framebuffer and the one thats hanging on startup is trying to open the > > framebuffer > > Could you be more specific about the details of this frame buffer? Is it > potentially related to graphics? I'm still not sure. All I can say (from comment 17) is that the process that's hung wound up calling fb_release in the kernel, and it seems to be stuck there. The b2g process that's starting up called fb_open and it's waiting on a mutex which is presumably being held by the b2g process which called fb_release. So the problem is that fb_release is getting hung up. I've done a bunch of builds and not been able to reprodcue the problem, when I realized that I was doing DEBUG builds. The nighlties that I was able to reproduce the problem with would have been non-DEBUG builds. So I'm just building stuff again with B2G_DEBUG=0 to see if that helps to reproduce the problem. What exactly fb_release is hung up on, I'm not sure. even with B2G_DEBUG=0, I haven't been able to reproduce the problem yet (at least not with the same gecko version that reproduces the problem from a nightly). I'm going to go back to the latest and work my way backwards now.
Comment 25•10 years ago
|
||
I'm gettingit to fail more reliably by doing: adb reboot && adb wait-for-device && adb shell and then doing: b2g-ps -t; stop b2g; sleep 2; b2g-ps -t; start b2g So now builds that I thought were working are now failing (sometimes it takes 10 tries before I get a failure). I'm currently on -r 167795 on b2g-inbound and observing a failure.
Comment 26•10 years ago
|
||
OK - I can reproduce the problem with just the base image. i.e. I flashed V1.2-device.cfg using TeleWeb and trying to reproduce on that. Settings reports: Boot2Gecko-1.2.0.0-prerelease So this looks some type of race in the kernel driver, and the timings have changed just enough recently that they're being hit by the automation tests more frequently. When it hangs, b2g hangs hard and is unkillable. The only way to recover is to reboot. So I've recommended to Zac that we change the test code to do the following: 1. stop b2g 2. sleep for a couple of seconds, check to see if b2g is still runnung. 3. If b2g isn't running, then go ahead and start b2g (as per normal) 4. If b2g is still running, do a reboot followed by adb wait-for-device I'm 99.9% sure that the fundamental problem is a bug in the kernel driver.
Comment 27•10 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #26) > OK - I can reproduce the problem with just the base image. > > i.e. I flashed V1.2-device.cfg using TeleWeb and trying to reproduce on that. > > Settings reports: Boot2Gecko-1.2.0.0-prerelease > > So this looks some type of race in the kernel driver, and the timings have > changed just enough recently that they're being hit by the automation tests > more frequently. > > When it hangs, b2g hangs hard and is unkillable. The only way to recover is > to reboot. > > So I've recommended to Zac that we change the test code to do the following: > > 1. stop b2g > 2. sleep for a couple of seconds, check to see if b2g is still runnung. > 3. If b2g isn't running, then go ahead and start b2g (as per normal) > 4. If b2g is still running, do a reboot followed by adb wait-for-device > > I'm 99.9% sure that the fundamental problem is a bug in the kernel driver. Zac, Dave, any idea what changed in the 1.4 tests or 1.4 gaia/gecko that is triggering this race condition? We've been testing the V1.2-device.cfg base image since Jan 10th or so. curious what caused this issue, but hopefully we can get unblocked from tests. Thanks.
Comment 28•10 years ago
|
||
I opened bug #972557 which contains the fix suggested in https://bugzilla.mozilla.org/show_bug.cgi?id=971747#c26
Comment 29•10 years ago
|
||
(In reply to Tony Chung [:tchung] from comment #27) > Zac, Dave, any idea what changed in the 1.4 tests or 1.4 gaia/gecko that is > triggering this race condition? We've been testing the V1.2-device.cfg > base image since Jan 10th or so. curious what caused this issue, but > hopefully we can get unblocked from tests. I'm guessing that its a race. It could be as subtle as some added code A caused some other piece of code B to run from different cache lines which in turn caused B to run with slightly different timings. Don't laugh - in a previous job I actually had exactly that happen. Without the exact kernel code we'll probably never know the root cause. (TCL only releases kernel code for versions of FirefoxOS that they've released - which as I understand things is for 1.0 and kernels built from that code don't boot when used with 1.3 (or so I was told)). The real solution is to get the buggy kernel driver fixed. Based on experience, the amount of effort required to fix the real bug will almost certainly be significantly less than the amount of effort required to do an exact analysis and determine what triggered the race. I couldn't reproduce this on my unagi. It was suggested that it might be the panel driver (and I see that the panel driver is in the call stack of the hung b2g). So we need to get the vendor involved.
Comment 30•10 years ago
|
||
Moving to vendcom as we've confirmed this is a base image bug. Vance - Can you followup with TCL about this? Pulling qablocker as bug 972557 has a confirmed workaround. Removing regression as this problem has been present with the base image even back to 1.2.
blocking-b2g: 1.4? → ---
Component: General → Vendcom
Flags: needinfo?(vchen)
Keywords: qablocker,
regression
Whiteboard: [POVB]
Comment 31•10 years ago
|
||
Jason, bug 972557 has not been proven to fix the issue yet. My first attempt didn't work, but I have high hopes for the new version. However, until we have a working fix we shouldn't assume that automation is unblocked.
Comment 32•10 years ago
|
||
A big question I have is - why is this a vendor problem if there's a confirmed regression range? Do we what changeset actually triggered the problem to cause the driver hang?
Component: Vendcom → General
Flags: needinfo?(vchen) → needinfo?(dhylands)
Keywords: qablocker,
regression
Whiteboard: [POVB]
Updated•10 years ago
|
blocking-b2g: --- → 1.4?
Comment 33•10 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #32) > A big question I have is - why is this a vendor problem if there's a > confirmed regression range? Do we what changeset actually triggered the > problem to cause the driver hang? What I'm suggesting here btw is that we really should identify what changeset caused this bug to be triggered & understand why that's the case.
Comment 34•10 years ago
|
||
Something is not right here; prior to the build cited in comment #19 we stopped and started b2g hundreds of times a day (no exaggeration either) on various phones for a year, the last 3-4 months of that on Hamachi devices and never seen this problem. Thousands of successes doesn't really stack up with Dave saying that b2g could not stop properly at any time with any build. While Dave has no doubt found a bug I'm beginning to wonder whether he's got something similar but not *precisely* the same as what the automation is getting tripped up on. Where we tried to reduce the replication steps to not use automation (with good intention) we may have led him to stumble on a different issue. I'd like to take a step back and replicate this using the ui-test suite but with Dave's debug information from comment #11 and see if they match up.
Comment 35•10 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #33) > (In reply to Jason Smith [:jsmith] from comment #32) > > A big question I have is - why is this a vendor problem if there's a > > confirmed regression range? Do we what changeset actually triggered the > > problem to cause the driver hang? > > What I'm suggesting here btw is that we really should identify what > changeset caused this bug to be triggered & understand why that's the case. See comment 22 for the list of files changed in the indicated regression range. If it is in fact a race that's caused by adding code, then any changeset that moves the code could cause the problem to manifest more frequently. Its also possible that there are multiple problems, and I've only discovered one of them. What you're really asking is what caused the timing to change such that the bug is triggered. If I knew what exactly the bug in the kernel driver was, then I could probably find out what change we made in completely unrelated code triggered it. See comment 29.
Flags: needinfo?(dhylands)
Comment 36•10 years ago
|
||
(In reply to Zac C (:zac) from comment #34) > Something is not right here; prior to the build cited in comment #19 we > stopped and started b2g hundreds of times a day (no exaggeration either) on > various phones for a year, the last 3-4 months of that on Hamachi devices > and never seen this problem. Thousands of successes doesn't really stack up > with Dave saying that b2g could not stop properly at any time with any build. > > While Dave has no doubt found a bug I'm beginning to wonder whether he's got > something similar but not *precisely* the same as what the automation is > getting tripped up on. Where we tried to reduce the replication steps to not > use automation (with good intention) we may have led him to stumble on a > different issue. > > I'd like to take a step back and replicate this using the ui-test suite but > with Dave's debug information from comment #11 and see if they match up. Perhaps. I'd need to to see the output of performing the steps in comment 11 when you've got a hung b2g process and compare it to what I reported comment 17.
Comment 37•10 years ago
|
||
Okay - thanks for clarification Dave. Vance - Can you outreach to TCL to look into this?
blocking-b2g: 1.4? → ---
Component: General → Vendcom
Flags: needinfo?(vchen)
Comment 38•10 years ago
|
||
Oddly the automation has been fine on the last two builds. However using an older build: Device Hamachi Gaia 4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6 Gecko https://hg.mozilla.org/mozilla-central/rev/802d87c77e76 BuildID 20140212040203 Version 30.0a1 I have picked out the dmesg from after the stop b2g after running test_camera_multiple_shots.py which took 5 loops to replicate.
Comment 39•10 years ago
|
||
(In reply to Zac C (:zac) from comment #38) > Created attachment 8376360 [details] > dmesg.txt > > Oddly the automation has been fine on the last two builds. > > However using an older build: > Device Hamachi > Gaia 4c6b5142d3b716f1c4ea502eeb92d3119f2b01c6 > Gecko https://hg.mozilla.org/mozilla-central/rev/802d87c77e76 > BuildID 20140212040203 > Version 30.0a1 > > I have picked out the dmesg from after the stop b2g after running > test_camera_multiple_shots.py which took 5 loops to replicate. And here's the kernel backtrace of the hung thead: <6>[ 184.172774] mozStorage #1 D c05b46c0 5880 1309 1 0x00000001 <4>[ 184.172818] [<c05b46c0>] (__schedule+0x634/0x718) from [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) <4>[ 184.172841] [<c05b4ba8>] (schedule_timeout+0x1c/0x2f8) from [<c05b3f30>] (wait_for_common+0x104/0x198) <4>[ 184.172869] [<c05b3f30>] (wait_for_common+0x104/0x198) from [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) <4>[ 184.172894] [<c02b7260>] (mipi_dsi_off+0x7c/0x25c) from [<c02b9fa4>] (panel_next_off+0x3c/0x48) <4>[ 184.172916] [<c02b9fa4>] (panel_next_off+0x3c/0x48) from [<c02a00c8>] (mdp_off+0x30/0x4c) <4>[ 184.172939] [<c02a00c8>] (mdp_off+0x30/0x4c) from [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) <4>[ 184.172961] [<c0299b04>] (msm_fb_blank_sub+0xf0/0x120) from [<c029a7b4>] (msm_fb_release+0x60/0x98) <4>[ 184.172988] [<c029a7b4>] (msm_fb_release+0x60/0x98) from [<c0293820>] (fb_release+0x30/0x54) <4>[ 184.173011] [<c0293820>] (fb_release+0x30/0x54) from [<c016cec0>] (fput+0x10c/0x1d0) <4>[ 184.173033] [<c016cec0>] (fput+0x10c/0x1d0) from [<c01699f8>] (filp_close+0x78/0x84) <4>[ 184.173053] [<c01699f8>] (filp_close+0x78/0x84) from [<c00cdde0>] (put_files_struct+0x94/0xf0) <4>[ 184.173073] [<c00cdde0>] (put_files_struct+0x94/0xf0) from [<c00cf63c>] (do_exit+0x2bc/0x78c) <4>[ 184.173093] [<c00cf63c>] (do_exit+0x2bc/0x78c) from [<c00cfb98>] (do_group_exit+0x8c/0xb8) <4>[ 184.173114] [<c00cfb98>] (do_group_exit+0x8c/0xb8) from [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) <4>[ 184.173139] [<c00de9a4>] (get_signal_to_deliver+0x3f0/0x438) from [<c0045840>] (do_signal+0xb0/0x58c) <4>[ 184.173161] [<c0045840>] (do_signal+0xb0/0x58c) from [<c0045d34>] (do_notify_resume+0x18/0x38) <4>[ 184.173183] [<c0045d34>] (do_notify_resume+0x18/0x38) from [<c0042db8>] (work_pending+0x24/0x28) which appears identical to the one reported in comment 17
This is hitting our performance tests, too: http://qa-selenium.mv.mozilla.com:8080/job/b2g.hamachi.mozilla-central.master.perf.fps/1564/console http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.hamachi.mozilla-central.master.perf/1355/console
Comment 41•10 years ago
|
||
Leaving this on the nom list as there's two possible resolution paths here: 1. Get the vendor to fix the problem 2. Backout the code triggering the regression here
blocking-b2g: --- → 1.4?
Comment 42•10 years ago
|
||
This hasn't affected the UI nodes since Thursday of last week. Not clear what made it work again but we'll hold off on any framework changes. This bug is probably still valid, however. (In reply to Stephen Donner [:stephend] from comment #40) > This is hitting our performance tests, too: > > http://qa-selenium.mv.mozilla.com:8080/job/b2g.hamachi.mozilla-central. > master.perf.fps/1564/console > http://qa-selenium.mv.mozilla.com:8080/view/B2G%20Perf/job/b2g.hamachi. > mozilla-central.master.perf/1355/console This isn't the bug; it just means that the device needs to be rebooted.
Comment 43•10 years ago
|
||
Removing qablocker as we're not affected right now, but let's keep an eye out if this happens again.
blocking-b2g: 1.4? → ---
Keywords: qablocker
Comment 44•10 years ago
|
||
Have we tried the 11/14 dated OEM firmware? Maybe this is an issue with the v1.2-device.cfg OEM firmware? (Sorry if I missed this in a previous comment... long thread!)
Comment 45•10 years ago
|
||
(In reply to Ben Kelly [:bkelly] from comment #44) > Have we tried the 11/14 dated OEM firmware? Maybe this is an issue with the > v1.2-device.cfg OEM firmware? (Sorry if I missed this in a previous > comment... long thread!) On your suggestion, I flashed US-V1.2-Dev-20131209.cfg and reproduced the problem with that build. I see some images from Oct 20131011 (Germany and Hungary) and Brazil image 20131016, but I don't see any 11/14 images. I think I'll try the Germany one, since it appears to be the oldest one available.
Comment 46•10 years ago
|
||
I tried the Germany image: TA-V1.1-17FFD-DT-Germany-20131011.cfg, but unfortunately, it's a production build, so "stop b2g" is a no-op, so I'm unable to determine if the problem exists in that load.
Comment 47•10 years ago
|
||
The latest builds don't seem to be coming up correctly on the inari after flashing... could this be related? See bug 975490.
Comment 48•10 years ago
|
||
(In reply to William Lachance (:wlach) from comment #47) > The latest builds don't seem to be coming up correctly on the inari after > flashing... could this be related? See bug 975490. Since you're seeing some type of logo, this doesn't sound like the same bug (although it depends on whether you're seeing any animation, or just a stale image). This bug is really about b2g stop hanging (which could happen after a flash - although with a full flash - which is what I believe that the inari does, you're essentially doing a cold boot, and this bug definitely doesn't apply in that case.
Comment 49•10 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #48) > (In reply to William Lachance (:wlach) from comment #47) > > The latest builds don't seem to be coming up correctly on the inari after > > flashing... could this be related? See bug 975490. > > Since you're seeing some type of logo, this doesn't sound like the same bug > (although it depends on whether you're seeing any animation, or just a stale > image). Yup, looks like that was a duplicate of bug 975451.
Comment 50•10 years ago
|
||
I have enhanced the flash script used by Jenkins so that if the stop b2g still has files open we perform an adb reboot and try again. It seems to have worked in a couple of places where we were otherwise failing.
Updated•10 years ago
|
Flags: needinfo?(vchen)
Comment 51•9 years ago
|
||
Maybe related to https://bugzilla.mozilla.org/show_bug.cgi?id=1123037 ?
Comment 52•6 years ago
|
||
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•