[B2G][OTA] Ridiculously janky / unresponsive behavior after OTA update (making it hard to even unlock the phone)

VERIFIED FIXED in Firefox 21

Status

VERIFIED FIXED
6 years ago

People

(Reporter: nkot, Assigned: dhylands)

Tracking

({smoketest})

unspecified
B2G C4 (2jan on)
ARM
Gonk (Firefox OS)
smoketest

Firefox Tracking Flags

(blocking-b2g:tef+, firefox19 wontfix, firefox20 wontfix, firefox21 fixed, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

6 years ago
Created attachment 713453 [details]
log file

Description:
The homescreen does not display after an OTA update unless the device is restarted

Repro Steps:
1) Go to Settings => Device information => Software updates => Check now
2) Install new System Update 2013-02-13-071150
3) Wait for the homescreen to appear or press power button if display went off

Expected:
*Locked Homescreen displays, user is able to unlock and use device

Actual:
*Homescreen appears black - see screenshot
*Homescreen displays successfully after restart

Repro frequency:
100%
3/3 devices

*screenshot attached
*log file attached

Notes:
Updated from:
Gecko:70c8f2cf813626e8c7b0f89676e1a62fe4ddfcae
Gaia:ecca2ee860825547d5e1109436b50b74dfe9261e
Build ID:20130212070205
(Reporter)

Comment 1

6 years ago
Created attachment 713459 [details]
screenshot
That sounds bad. Can you consistently reproduce?
blocking-b2g: --- → tef?
Component: Gaia → Gaia::Homescreen
Something most definitely blew up here.

Weird logcat errors:

02-13 08:46:05.212: I/Gecko(108): ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv

02-13 08:46:11.508: E/GeckoConsole(20200): [JavaScript Error: "formatURLPref: Couldn't get pref: app.update.url.details" {file: "jar:file:///system/b2g/omni.ja!/components/nsURLFormatter.js" line: 126}]
Marshall - Any ideas?
Flags: needinfo?(marshall)

Updated

6 years ago
blocking-b2g: tef? → ---
I ran into this today.  :overholt did too.
Actually maybe I ran into a related but different issue.  In my case the screen stayed entirely blank, then I got the boot image for quite a while, then back to entirely blank.
I ran into this yesterday and today.  I have to pull the battery to get anything other than the "hardware" buttons to light up.
blocking-b2g: --- → tef?
Component: Gaia::Homescreen → General
Gah, should probably be shira?
blocking-b2g: tef? → shira?
(In reply to Andrew Overholt [:overholt] from comment #8)
> Gah, should probably be shira?

Why shira? Isn't this going to impact tef?
I've applied the update locally, and I also see "app.update.url.details" errors, but I'm able to get into the phone -- albeit _very_ slowly.

When the screen is off, pressing the power button to turn it back on takes roughly 15-20 seconds on my device, and then the screen almost immediately goes back off.

If you tap the screen some while after you press the power button, you can avoid the timeout, but then you have to patiently drag the lock drawer up, and press the unlock button while the device remains extremely unresponsive.

Doing a quick top, I noticed that the b2g process is eating between 97% and 98% of the CPU!
 2577  0  98% S    36 163016K  53780K  fg root     /system/b2g/b2g

Looking into why that might be..
Flags: needinfo?(marshall)
(In reply to Jason Smith [:jsmith] from comment #9)
> (In reply to Andrew Overholt [:overholt] from comment #8)
> > Gah, should probably be shira?
> 
> Why shira? Isn't this going to impact tef?

Agreed
Assignee: nobody → marshall
blocking-b2g: shira? → tef?
(See also bug 841517, which might be a dupe of this bug.)
So this has basically bricked the device for me.  Restarting doesn't help, I just get a black screen.  I can barely get the lockscreen to display.
(Assignee)

Comment 14

6 years ago
What does:

top -t -m 5

show? (from an adb shell)
adb doesn't even see the device, for me at least.  (adb shell says "error: device not found")

(I have udev rules set up, and I ran adb w/ sudo for good measure - so I'm not getting blocked from seeing the device due to lack-of-privileges)
It showed /system/b2g/b2g at 96%.  

I ended up killing that process, and after it restarted the phone is responsive again.  I'm not sure why yanking the battery out previously didn't fix it, unless I just gave it more time to finish whatever it was trying to do.
(In reply to Daniel Holbert [:dholbert] from comment #15)
> adb doesn't even see the device, for me at least.  (adb shell says "error:
> device not found")
> 
> (I have udev rules set up, and I ran adb w/ sudo for good measure - so I'm
> not getting blocked from seeing the device due to lack-of-privileges)

I ran into this at first; I think it was due to the device being so behind/bogged down that the daemon wasn't running (yet).
(Assignee)

Comment 18

6 years ago
(In reply to Lucas Adamski from comment #16)
> It showed /system/b2g/b2g at 96%.  

I was hoping to see the whole line, so we could tell which thread was consuming the CPU (top -t shows individual threads).
(Assignee)

Comment 19

6 years ago
(In reply to Daniel Holbert [:dholbert] from comment #15)
> adb doesn't even see the device, for me at least.  (adb shell says "error:
> device not found")
> 
> (I have udev rules set up, and I ran adb w/ sudo for good measure - so I'm
> not getting blocked from seeing the device due to lack-of-privileges)

adb might be disabled. It is by default for dogfooding. You can enable it by enabling:

Settings->Device Information->More Information->Developer->Remote Debugging
(In reply to Dave Hylands [:dhylands] from comment #18)
> 
> I was hoping to see the whole line, so we could tell which thread was
> consuming the CPU (top -t shows individual threads).

Sorry, I'd closed the window by the time I saw your question. :(
Spoke too soon.  Checked for updates, applied it, now stuck again:

User 83%, System 16%, IOW 0%, IRQ 0%
User 272 + Nice 0 + Sys 54 + Idle 0 + IOW 0 + IRQ 0 + SIRQ 0 = 326

  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
  526   550  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   549  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   555  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   546  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  617   617  0   1% R   1088K    444K  fg root     top             top
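
To see at a glance which thread is consuming the CPU (the question from comment 18), the per-thread rows of `top -t` can be totalled by PID and thread name. The following is only a minimal Python sketch, assuming the column layout shown in the output above; the regular expression, the `cpu_by_thread` name, and the embedded sample are illustrative and not part of any B2G tooling.

```python
import re
from collections import defaultdict

# Matches one per-thread row of `top -t` output in the layout shown above:
# PID TID PR CPU% S VSS RSS PCY UID Thread Proc.
# Thread names may contain a single space ("DOM Worker"), so the thread and
# process columns are split on a run of two or more spaces (an assumption
# about the layout, not a documented guarantee).
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<tid>\d+)\s+\d+\s+(?P<cpu>\d+)%\s+\S\s+"
    r"\d+K\s+\d+K\s+\S+\s+\S+\s+(?P<thread>.+?)\s{2,}(?P<proc>\S+)\s*$"
)

# Rows copied from the output in this comment.
SAMPLE = """\
  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
  526   550  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   549  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   555  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  526   546  0  24% R 157880K  52444K  fg root     DOM Worker      /system/b2g
  617   617  0   1% R   1088K    444K  fg root     top             top
"""


def cpu_by_thread(top_output):
    """Sum the CPU% column per (pid, thread name); non-matching lines are skipped."""
    totals = defaultdict(int)
    for line in top_output.splitlines():
        m = ROW.match(line)
        if m:
            totals[(int(m.group("pid")), m.group("thread"))] += int(m.group("cpu"))
    return dict(totals)


if __name__ == "__main__":
    # Print the heaviest (pid, thread) groups first.
    for key, cpu in sorted(cpu_by_thread(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(key, cpu)
```

Grouping by thread name rather than by TID makes it obvious when several threads of the same kind are jointly responsible for the load.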
(Assignee)

Comment 22

6 years ago
So yeah dholbert is seeing the same thing. 4 DOM Workers consuming most of the CPU.
And they're all in the main process. I don't know what the DOM Workers do.
(Assignee)

Comment 24

6 years ago
I tried flashing https://pvtbuilds.mozilla.org/pub/mozilla.org/b2g/nightly/mozilla-b2g18-unagi/2013/02/2013-02-12-07-02-05/ and then OTA updating to my local built version and it didn't reproduce (my locally built version is a v1-train)

I reflashed and OTA updated as in comment 23 and it reproduced (so not a one off).
(Assignee)

Comment 25

6 years ago
Created attachment 714262 [details]
gdb back trace of main thread and DOM Worker threads

I was able to get into gdb and get some back traces.

I'm not sure of the validity of the symbols since the image that was being wasn't from the tree I was in.
[clarifying summary]
Summary: [B2G][OTA] Homescreen fails to display after OTA update → [B2G][OTA] Ridiculously janky / unresponsive behavior after OTA update (making it hard to even unlock the phone)
Duplicate of this bug: 841517
(In reply to Dave Hylands [:dhylands] from comment #25)
> I'm not sure of the validity of the symbols since the image that was being
> wasn't from the tree I was in.

Yeah... The traces don't look right to me.
Do you guys need more information to debug this issue?  I just reproduced these symptoms myself when updating to:

Gecko  http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/d1288313218e
Gaia   6544fdb8dddc56f1aefe94482402488c89eeec49
BuildID 20130214070203
Version 18.0
For what it's worth, if I pull the battery and reboot, I can recover the device back into a usable state.  Just fodder for triage drivers under consideration.

Updated

6 years ago
blocking-b2g: tef? → tef+
(Assignee)

Comment 31

6 years ago
Created attachment 714481 [details]
gdb backtrace with good symbols

I was able to reproduce using a local build.

STR:
1 - Modify gecko/toolkit/content/UpdateChannel.jsm near the end to override the channel:

    channel = "foobar";
    return channel;

2 - build.

3 - Create an update
    ./build.sh gecko-update-full

4 - Setup the phone to use the update
    tools/update-tools/test-update.py ${GECKO_OBJDIR}/dist/b2g-update/b2g-gecko-update.mar

5 - Do a Check Now and then do the update
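
Steps 3 and 4 above can be scripted when the repro needs to be repeated. This is a rough Python sketch that only assembles the two command lines from those steps; the function name `update_test_commands` and the fallback object directory `objdir-gecko` are assumptions made for illustration, and the script deliberately prints the commands instead of running them.

```python
import os
import posixpath
import shlex


def update_test_commands(gecko_objdir):
    """Return the commands from steps 3 and 4 as argv lists.

    gecko_objdir stands in for ${GECKO_OBJDIR} from step 4.
    """
    mar = posixpath.join(gecko_objdir, "dist", "b2g-update", "b2g-gecko-update.mar")
    return [
        ["./build.sh", "gecko-update-full"],          # step 3: create the update
        ["tools/update-tools/test-update.py", mar],   # step 4: point the phone at it
    ]


if __name__ == "__main__":
    # "objdir-gecko" is only a placeholder default; use your real object directory.
    objdir = os.environ.get("GECKO_OBJDIR", "objdir-gecko")
    for argv in update_test_commands(objdir):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Steps 1, 2, and 5 (editing the channel, building, and triggering Check Now on the phone) still have to be done by hand.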

This message:

AUS:SVC UpdateManager:get activeUpdate - channel has changed, reloading default preferences to workaround bug 802022

from here:
https://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateService.js#2795

seems to be the key. I'm hypothesising that the reload-default-prefs is the actual trigger.
Attachment #714262 - Attachment is obsolete: true
(Assignee)

Comment 32

6 years ago
Created attachment 714485 [details]
gdb backtrace taken shortly after the previous one

I let the process run for a bit longer and grabbed another set of backtraces
(Assignee)

Comment 33

6 years ago
I picked one of the DOM Workers and just hit n over and over in gdb.

It wound up doing these 4 lines continuously:

3428	#endif
(gdb) 

3408	    WorkerRunnable* event;
(gdb) 
3410	      MutexAutoLock lock(mMutex);
(gdb) 
3412	      while (!mControlQueue.Pop(event) && !syncQueue->mQueue.Pop(event)) {
(Assignee)

Comment 34

6 years ago
(In reply to Tony Chung [:tchung] from comment #29)
> Do you guys need more information to debug this issue?

I can reproduce at will now, so now it's mostly just trying to figure out what's going on.
Created attachment 714586 [details] [diff] [review]
patch
Assignee: marshall → anygregor
Attachment #714586 - Flags: review?(bent.mozilla)
Blocks: 828887
Attachment #714586 - Flags: review?(bent.mozilla) → review+
thx jdm!
https://hg.mozilla.org/integration/mozilla-inbound/rev/424de8168602

dhylands mentions that the bug is not completely gone.
Assignee: anygregor → dhylands
Whiteboard: leave-open
(Assignee)

Updated

6 years ago
Blocks: 841962
(Assignee)

Comment 37

6 years ago
I filed bug 841962 to followup on this problem and removed the leave-open on this bug.

At least with the patch applied in this bug, the phone now just takes a lot longer to boot up after an update-channel change, but it seems to perform ok.
Whiteboard: leave-open
https://hg.mozilla.org/mozilla-central/rev/424de8168602
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-b2g18/rev/63211dc2a63e
https://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/af83e7e7f52a
status-b2g18: --- → fixed
status-b2g18-v1.0.0: --- → wontfix
status-b2g18-v1.0.1: --- → fixed
status-firefox19: --- → wontfix
status-firefox20: --- → wontfix
status-firefox21: --- → fixed
Target Milestone: --- → B2G C4 (2jan on)
Duplicate of this bug: 842222

Updated

6 years ago
Blocks: 847511
(Reporter)

Comment 41

6 years ago
Verifying fix on V1-train branch - OTA goes smoothly

tested with the following:
1. manually flashed to Unagi build 2013-03-20-070206
Gecko  http://hg.mozilla.org/releases/mozilla-b2g18/rev/778da49486f0
Gaia   6c3767c2dea43b5e9aff7d156d36d69649005621

2. revertNightly 

3. OTA to Build 2013-03-21-070203
Gecko  http://hg.mozilla.org/releases/mozilla-b2g18/rev/7508c5a1026b
Gaia   7af427d35c4d557c75b2060022815f07851acc28

The issue still seems to occur when updating OTA from v1.0.1 builds, but other bugs cover that; please refer to bugs 847511 and 842932 (note that a switch of update channels takes place in that scenario).
Status: RESOLVED → VERIFIED