Settings App perf test hangs due to Settings API

VERIFIED FIXED in Firefox 34

Status

()

defect
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: qdot, Assigned: qdot)

Tracking

unspecified
2.1 S5 (26sep)
x86_64
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.1+, firefox33 wontfix, firefox34 verified, firefox35 verified, b2g-v1.4 unaffected, b2g-v2.0 unaffected, b2g-v2.0M unaffected, b2g-v2.1 verified, b2g-v2.2 verified)

Details

(Whiteboard: [systemsfe])

Attachments

(3 attachments, 3 obsolete attachments)

This is a followup to bug 1065128. Running the settings app startup test is causing crashes/hangs in perf after the app is stopped and restarted a few times.
Version: 25 Branch → unspecified
Moving logcat from b2gperf failure over here
The problem looks like it stems from how GaiaApps.kill is taking down the settings app, and how the settings API reacts to that. As far as I'm aware, this /should/ be a clean shutdown of the app, so I'm not sure why cleanup is going wrong.

So the two bad things I see in these logs:

---
09-17 10:29:36.212 E/GeckoConsole( 1482): Content JS LOG at dummy file:225 in GaiaApps.kill: terminating app with origin 'app://settings.gaiamobile.org'
09-17 10:29:36.212 E/GeckoConsole( 1482): Content JS LOG at dummy file:217 in GaiaApps.kill/gt_onAppTerminated/<: app with origin 'app://settings.gaiamobile.org' has terminated
09-17 10:29:36.642 E/Settings( 2156): [JavaScript Error: "TypeError: this._settingsManager._window is null" {file: "jar:file:///system/b2g/omni.ja!/components/SettingsManager.js" line: 110}]
---

This is a settings lock that gets orphaned because its manager died in between the time it scheduled itself to send a run/finalize message and the time the event loop actually got around to doing it. We should be able to test for the validity of the SettingsManager and make this not happen.

---
09-17 10:29:48.422 I/Gecko   ( 1482): 1410946188433	Marionette	INFO	sendToClient: {"from":"0","value":true}, {680f6dc8-e218-40fd-af7d-a6a378bec367}, {680f6dc8-e218-40fd-af7d-a6a378bec367}
09-17 10:29:48.422 E/GeckoConsole( 1482): Content JS LOG at dummy file:225 in GaiaApps.kill: terminating app with origin 'app://settings.gaiamobile.org'
09-17 10:29:48.422 E/GeckoConsole( 1482): Content JS LOG at dummy file:217 in GaiaApps.kill/gt_onAppTerminated/<: app with origin 'app://settings.gaiamobile.org' has terminated
09-17 10:29:48.532 E/QCALOG  (  328): [MessageQ] ProcessNewMessage: [XT-CS] unknown deliver target [OS-Agent]
09-17 10:29:48.532 E/QCALOG  (  328): [MessageQ] ProcessNewMessage: [XTWWAN-PE] unknown deliver target [OS-Agent]
09-17 10:29:48.862 E/GeckoConsole( 1482): [JavaScript Error: "Settings lock trying to run more tasks after finalizing. Ignoring tasks, but this is bad. Lock: {e9aad7ff-e78e-44dd-8df3-46ea8b5e75ac}" {file: "resource://gre/modules/SettingsRequestManager.jsm" line: 583}]
09-17 10:29:48.862 E/GeckoConsole( 1482): [JavaScript Error: "Settings lock trying to run more tasks after finalizing. Ignoring tasks, but this is bad. Lock: {e9aad7ff-e78e-44dd-8df3-46ea8b5e75ac}" {file: "resource://gre/modules/SettingsRequestManager.jsm" line: 583}]
---

This is where the hang happens. We throw this error message, but don't actually have the system move on to the next task. So we just stall and print forever.

Putting in patches for both of these now, but I'm having problems getting the perf tests to run locally in order to repro this.
Finally figured out b2gperf and got it running, I see the error mentioned. Retrying with patches.
Ok, with patches I just ran a 100 iteration set of b2gperf with settings and it didn't error out once. ni'ing zac/wlach for confirmation on their side since I'm not sure how to run eideticker stuff.
Flags: needinfo?(zcampbell)
Flags: needinfo?(wlachance)
[Blocking Requested - why for this release]: This problem is mostly triggered by programmatic launching/killing of apps, but assuming the system was either moving really slow or the user was moving really fast, we could hit this. Also, if 1065128 lands to aurora, it will break this, so this needs to hit aurora too.
Moved window existence check up in scope, since multiple things can happen under closeHelper
Attachment #8491111 - Attachment is obsolete: true
Attachment #8491111 - Flags: review?(bent.mozilla)
Attachment #8491192 - Flags: review?(bent.mozilla)
This is the bug that started all of this. Array.splice(index, -1) does nothing, and does so silently. -1 should've been 1. Due to this, we were never removing locks from the SettingsManager ownership tracking array, which meant we had the possibility of calling Finalize twice, which caused the other errors to cascade. So, with all 4 of these patches, we should be in much better shape overall.
Attachment #8491194 - Flags: review?(bent.mozilla)
After patch 4, I get no settings errors whatsoever on 100 iterators of b2gperf of settings. 

So, patch 1-3 are mostly "complain but don't take the system with if something goes really wrong", and patch 4 fixes something that actually went wrong.

The culprit goes all the way back to one of the last patch updates in bug 900551, but this happened in the cleanup() function that wasn't getting called correctly. Fixed that in bug 1065128, which is why this popped up more once that landed.
Attachment #8491110 - Flags: review?(bent.mozilla) → review+
Attachment #8491192 - Flags: review?(bent.mozilla) → review+
Attachment #8491151 - Flags: review?(bent.mozilla) → review+
Comment on attachment 8491194 [details] [diff] [review]
Patch 4 (v1) - Fix settings lock ownership array splice

Review of attachment 8491194 [details] [diff] [review]:
-----------------------------------------------------------------

Ugh
Attachment #8491194 - Flags: review?(bent.mozilla) → review+
Well, while this fixed perf, it may've caused more issues with Mnw. No clue why though. Retriggering them on try and well as trying locally.

Comment 15

5 years ago
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #7)
> Ok, with patches I just ran a 100 iteration set of b2gperf with settings and
> it didn't error out once. ni'ing zac/wlach for confirmation on their side
> since I'm not sure how to run eideticker stuff.

To be honest the first time I ever ran b2gperf was yesterday too! 
But I can see it is perma-red on the device CI and that does a loop of 30.
To run 100 with on issues sounds solid.
Flags: needinfo?(zcampbell)
Will probably show up in endurance tests.
blocking-b2g: 2.1? → 2.1+
Target Milestone: --- → 2.1 S5 (26sep)
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #7)
> Ok, with patches I just ran a 100 iteration set of b2gperf with settings and
> it didn't error out once. ni'ing zac/wlach for confirmation on their side
> since I'm not sure how to run eideticker stuff.

I'm 99% certain that when this lands, Eideticker will be fixed too. The way it and b2gperf use marionette is quite similar. So I'm happy to just see what happens and reopen if necessary (but I don't think it'll be necessary).
Flags: needinfo?(wlachance)
Due to one or more of these patches causing massive issues on Mnw, I'm now trying to break the set down to figure out what I can land first to resolve test issues while in parallel figuring out what's wrong. The most important patch is patch 4, which I've now run in Mnw try on:

https://tbpl.mozilla.org/?tree=Try&rev=0cd50da1d943

The orange has nothing to do with us, so I've retriggered a few more times to see what happens. Due to Linux Debug Gip coming out blue/red on the first try also, I'm going to rerun a try with linux debug and ICS emulator just in case, then land that and start working on the others.

I ran a local b2g perf with just that change, things work fine but we still get _window not existing error messages but those are mostly benign and will be fixed once I make sure that patch can land without taking Mnw down.
https://hg.mozilla.org/integration/b2g-inbound/rev/0ad56fef7fb2
https://hg.mozilla.org/integration/b2g-inbound/rev/2d5de8b63f69

Landed patches 1 and 4, which I've proved out via try to pass Mnw. Forgot that Linux B2g Desktop Debug Gip is hidden everywhere else so not worry about that now.

Going to divide out the other two patches in this bug to a new bug, since figuring out why they kill Mnw may take a while. This set should get eideticker back up and running though.
Comment on attachment 8491151 [details] [diff] [review]
Patch 3 (v1) - Make sure settings lock exists before trying to remove it

Obsoleting patch here and moving to bug 1069605 for further work
Attachment #8491151 - Attachment is obsolete: true
Comment on attachment 8491192 [details] [diff] [review]
Patch 2 (v2) - Only send messages in SettingsLocks when window is live

Obsoleting patch here and moving to bug 1069605 for further work
Attachment #8491192 - Attachment is obsolete: true
Comment on attachment 8491110 [details] [diff] [review]
Patch 1 (v1) - Fix task cleanup on settings locks where finalize has already been called

Approval Request Comment
[Feature/regressing bug #]: Bug 900551
[User impact if declined]: If apps close in a weird order that causes messages to misfire, whole system will lock up. Usually only happens programmatically.
[Describe test coverage new/current, TBPL]: Mnw and b2gperf cover this, ran through try multiple times
[Risks and why]: None
[String/UUID change made/needed]: None
Attachment #8491110 - Flags: approval-mozilla-aurora?
Comment on attachment 8491194 [details] [diff] [review]
Patch 4 (v1) - Fix settings lock ownership array splice

Approval Request Comment
[Feature/regressing bug #]: Bug 900551
[User impact if declined]: If apps close in a weird order that causes messages to misfire, whole system will lock up. Usually only happens programmatically.
[Describe test coverage new/current, TBPL]: Mnw and b2gperf cover this, ran through try multiple times
[Risks and why]: None
[String/UUID change made/needed]: None
Attachment #8491194 - Flags: approval-mozilla-aurora?
Attachment #8491110 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8491194 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Verified fixed on Flame 2.2 (319mb/full flash) and Flame 2.1 (319mb/full flash)
using repro steps in bug 1065128 and bug 1058729 which depend on this bug as well as repeatedly opening and closing the Settings application. 

Actual result: Settings app does not hang/ works as expected.

Device: Flame 2.2
BuildID: 20141011040204
Gaia: 95f580a1522ffd0f09302372b78200dab9b6f322
Gecko: 3f6a51950eb5
Gonk: 52c909e821d107d414f851e267dedcd7aae2cebf
Version: 35.0a1 (2.2)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0

Device: Flame 2.1
BuildID: 20141011000201
Gaia: f5d4ff60ffed8961f7d0380ada9d0facfdfd56b1
Gecko: d813d79d3eae
Gonk: 52c909e821d107d414f851e267dedcd7aae2cebf
Version: 34.0a2 (2.1)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
You need to log in before you can comment on or make changes to this bug.