Closed Bug 1110960 Opened 10 years ago Closed 10 years ago

Device slowing down after more than one day of uptime

Categories

(Firefox OS Graveyard :: Gaia::System::Lockscreen, defect)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(blocking-b2g:2.0+, b2g-v2.0 fixed, b2g-v2.0M fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
2.2 S2 (19dec)
blocking-b2g 2.0+
Tracking Status
b2g-v2.0 --- fixed
b2g-v2.0M --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: gerard-majax, Assigned: gweng)

References

Details

(Keywords: regression)

Attachments

(4 files)

This is reproduced consistently since a couple of weeks at least. I cannot provide any STR at all other than "use the device". Reproduced on v2.0 on a v188 base system (OTA from Mozilla). Symptoms: device getting sluggish. |adb shell top -m 5 -d 1| shows that B2G main process is around 50%.
Attached file profile_captured.sym
This profile shows that B2G is spending quite a lot of time in LockScreenWindow.lockOrientation(). I cannot tell for sure that it's the reason for the slow down, but I find it suspicious enough.
Flags: needinfo?(gweng)
I think if the only STR is using the phone, we should make a test that disable the orientation call, and put it one day long to see if the device become slow. By the way, how you used the phone? If I only left it alone and do not manipulate it, would it become slow, too? Or I must use it as a daily phone, like twittering, browsering and dialing, like what we would do on our phone?
Flags: needinfo?(gweng) → needinfo?(lissyx+mozillians)
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #2) > I think if the only STR is using the phone, we should make a test that > disable the orientation call, and put it one day long to see if the device > become slow. By the way, how you used the phone? If I only left it alone and > do not manipulate it, would it become slow, too? Or I must use it as a daily > phone, like twittering, browsering and dialing, like what we would do on our > phone? This is a dogfooding device, so its usage is quite straightforward: - phone calls - sms/mms - radio fm - taking pictures, browsing gallery Letting the device rot by itself is not an option. Did you had a look at the profile ? Why would lockOrientation makes calls to mozApps ?
Flags: needinfo?(lissyx+mozillians) → needinfo?(gweng)
Yes I've viewed the log with Cleopatra. I just want to make sure it's the root cause as you said, with a real device and real case. Since if I need to pin down the root cause of course I need to make sure how to reproduce it by myself. And another thing I notice is you mentioned this affects since 2.0, this means it's possible a regression of Bug 1089529, that Alive and I encountered a case we need to workaround with the current Orientation API. I can revert it to see if your symptom would happen again. If not, we need a QA help to find regression window, since the bug is the only one I suspect related. And I don't understand what you mean that "lockOrientation makes calls to mozApps"? The function only call lock orientation API to do things.
Flags: needinfo?(gweng)
OK. Now I've two Flames: one nullized the function and one does have the function, and I added some log to see what happen. I would try to perform some user actions later to see if I can reproduce it, and if I revert the patch the symptom would disappear.
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #4) > Yes I've viewed the log with Cleopatra. I just want to make sure it's the > root cause as you said, with a real device and real case. Since if I need to > pin down the root cause of course I need to make sure how to reproduce it by > myself. Of course, but sadly we have not yet been able to identify proper STR :( > > And another thing I notice is you mentioned this affects since 2.0, this > means it's possible a regression of Bug 1089529, that Alive and I > encountered a case we need to workaround with the current Orientation API. I > can revert it to see if your symptom would happen again. If not, we need a > QA help to find regression window, since the bug is the only one I suspect > related. Yes, this is the bug I have spotted also. FYI, the date this bug landed would roughly match the first time the slowdown was experienced by my girlfriend dogfooding this device. Nothing 100% sure, but the timeframe do match. > > And I don't understand what you mean that "lockOrientation makes calls to > mozApps"? The function only call lock orientation API to do things. Ok, I'll have a look at the Gecko-side of this to see why.
Ok, in the mozLockOrientation() implementation, there's a call to permission checking: http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/dom/base/nsScreen.cpp#269 This does a check of the app status (installed or not): http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/dom/base/nsScreen.cpp#216 Then at some point, there's probably a call made to http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/caps/src/nsScriptSecurityManager.cpp#275 whichs tries to get the local app id at http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/caps/src/nsScriptSecurityManager.cpp#293
So I think there are some possible issues with the codes: 1. If I handle the ID in a wrong way, the setInterval may not stop and therefore we have a performance trouble. However, according to my previous log & test, the interval callback was cleared as the design. So maybe this is not the issue, but I would check it again. 2. If the slowdown happen because the call of the API, the phone would not be slowdown without unlock & lock. And I can guess this can lead to a new (shorter) STR that try to unlock & lock multiple times, the phone should be slowdown significantly. However, I need to do some test about it.
Attached file Patch v2.0
The reason is the set interval is too fast, which would set a new one before we clear the old one. I didn't realize this is because I set a slower value in my WIP patch. From the test logs I believe this can solve the issue. I would submit other patches for other branches later.
Attachment #8536326 - Flags: review?(alive)
Attached file Patch
For master. The content is the same.
Attached file Patch v2.1
Patch v2.1. The content is the same.
Assignee: nobody → gweng
Comment on attachment 8536326 [details] [review] Patch v2.0 Test please (lockOrientation() two times should not setInterval twice) Also, I am curious if turn screen on -> off quickly, do we still try to lock the orientation? Do we need to clear the interval if screen is off and try lock again later when screen is on?
Attachment #8536326 - Flags: review?(alive) → feedback+
I've tested the case you mentioned: it would set and clear immediately, and there is no leaking. That is, every time the screen get off and on, the set and clear pair would do the lock work.
Comment on attachment 8536326 [details] [review] Patch v2.0 Updated. Would update other versions after this is r+.
Attachment #8536326 - Flags: review?(alive)
Have you been able to expose cases where things may go wrong? I can't help much more sadly, yet, what I know is that this device has been running fine and started to get crazy like this just a couple of weeks ago, with the symptoms documented in comment 0. And the profile I captured was when it was getting slow. There was just 5-10 secs at most between the time I started profiling and when I did the capture.
Flags: needinfo?(gweng)
?? As I said at Comment 9, I thought that I pin down the root cause and the patch should fix it. If you want to test it before the patch land, you can take the v2.0 patch that I just updated.
Flags: needinfo?(gweng)
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #16) > ?? As I said at Comment 9, I thought that I pin down the root cause and the > patch should fix it. If you want to test it before the patch land, you can > take the v2.0 patch that I just updated. Ok, that's fine and that answers my question: we should keep an eye on this to make sure this was the root cause :)
Comment on attachment 8536326 [details] [review] Patch v2.0 r=me
Attachment #8536326 - Flags: review?(alive) → review+
OK I've updated all patches. Now waiting CI results.
We've encountered CI troubles... need manually re-trigger it. I'll keep to try to make them with clear results.
Current result: v2.0: with one irrelevant Gij failure at Calendar test, re-triggering. master: all green. v2.1: TBPL hook didn't trigger it...I'm now closing & opening to see if the test would be triggered. As a conclusion, I'll land the master patch first, and to see when the other two patches can be ready for submitting approvals.
master: https://github.com/mozilla-b2g/gaia/commit/9471af2ff4e995a0d9785ced2073c8fa9c8c357c Since I landed the master patch, I now set it as RESOLVED FIXED.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
OK, now TBPL is testing the v2.1 patch.
I re-triggered the jobs failed at B2G Desktop OS X opt for v2.1 patch, and would wait the result. Although I wonder the failures are not from my patch.
After re-ran it the v2.0 patch is green at TBPL: https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=cffe717b08e7 So I would land it.
Ahhh sorry I forgot to submit the approval! I now revert it...
Comment on attachment 8536326 [details] [review] Patch v2.0 NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. [Approval Request Comment] [Bug caused by] (feature/regressing bug #): Bug 1089529 [User impact] if declined: Slowing down continues [Testing completed]: https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=cffe717b08e7 [Risk to taking this patch] (and alternatives if risky): No [String changes made]: No
Attachment #8536326 - Flags: approval-gaia-v2.0?(bbajaj)
Comment on attachment 8536403 [details] [review] Patch v2.1 [Approval Request Comment] [Bug caused by] (feature/regressing bug #): Bug 1089529 [User impact] if declined: Slowing down continues [Testing completed]: With some intermittent failures that are all green after re-runs: https://treeherder.mozilla.org/ui/#/jobs?repo=gaia-try&revision=7ceb1a1b7413 [Risk to taking this patch] (and alternatives if risky): No [String changes made]: No
Attachment #8536403 - Flags: approval-gaia-v2.1?(bbajaj)
I also noticed this issue on FxOS 2.0, both on a Peak and two ZTE Open C devices. I suspect it might the cause of many slowdowns I encountered. gerard-majax, maybe it would be worth waiting for this fix to land in the 2.0 branch before releasing a new OTA version for 2.0 in the ZTE Open C community builds?
Blocks: 1089529
blocking-b2g: --- → 2.0+
Comment on attachment 8536326 [details] [review] Patch v2.0 Approving, given this is a regression we introduced and a bad one for dogfooders. Alive/Greg, Thanks for the tests, if there is any additional testing that QA can do here to verify this please add "verifyme" and NI Brian Hunag for help.
Attachment #8536326 - Flags: approval-gaia-v2.0?(bbajaj) → approval-gaia-v2.0+
Attachment #8536403 - Flags: approval-gaia-v2.1?(bbajaj) → approval-gaia-v2.1+
Applied the update as soon as it was containing the fix. Device is well responsive after nearly three days: > $ adb shell uptime > up time: 2 days, 23:34:54, idle time: 05:38:31, sleep time: 2 days, 16:56:35 I'll keep an eye on this, but it looks good, thanks for fixing it!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: