Closed
Bug 1110872
Opened 9 years ago
Closed 9 years ago
Apps that reload themselves in the same child process can jam Settings API
Categories
(Core :: DOM: Core & HTML, defect)
Tracking
()
People
(Reporter: gerard-majax, Assigned: qdot)
References
Details
(Whiteboard: [systemsfe])
Attachments
(6 files, 3 obsolete files)
+++ This bug was initially created as a clone of Bug #1105511 +++ Martin Shuman [:Marty] 2014-12-12 17:20:35 CET I hit this issue yesterday after testing Email for about 3 hours, with a little bit of time spent in Clock and Calendar before that. The main focus of my testing had been exploring Text Selection (Cut/Copy/Paste) throughout the app. After these hours of testing, I navigated to Settings, and saw that 'Operator Services' was present at the bottom of the menu, and I couldn't enable or access the Developer menu. Environmental Variables: Device: Flame 2.2 (319MB) BuildID: 20141211040208 (Full Flash) Gaia: 1a956437210d2b16988d2ddbf40c9a38d8474435 Gecko: d92db71d4e67 Gonk: 263b5f41f7733c5577fb101eb4dc8ac5c11cfa8d Version: 37.0a1 (2.2) Firmware Version: v188-1 User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
Reporter | ||
Comment 1•9 years ago
|
||
Can you try collecting about:memory report ?
Just run:
> $ python tools/get_about_memory.py
From your B2G/ directory. Then you can provide the memory report as an attachment.
Also, during your testing, did you locked/unlocked the device a lot ?
Flags: needinfo?(mshuman)
Reporter | ||
Comment 2•9 years ago
|
||
As far as I can say, your Gecko does contains fixes but your Gaia does not include the fix for bug 1105639
Comment 3•9 years ago
|
||
I asked this question in bug 1105639 already. I don't get how we can allow Gaia code to break the device that badly.
Reporter | ||
Comment 4•9 years ago
|
||
Kyle, I had a look at the logcat of my Nexus S, and there was still a bunch of errors. Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=c24633af8019588b78e79ad428d7430cded2dffc Gaia: https://github.com/mozilla-b2g/gaia/tree/4cdeee67b449db90aae9384337311547c280093c As far as I can say, Setting was not broken.
Flags: needinfo?(kyle)
Assignee | ||
Comment 5•9 years ago
|
||
There were still a bunch of errors because the commit you reference of gecko does not contain the patches from bug 1105511. Please update to a newer commit and retry.
Flags: needinfo?(kyle)
Comment 6•9 years ago
|
||
When testing, I was not locking my device much at all. I had my screen time-out set to 'Never,' and probably only manually locked the screen 3 or 4 times over the course of 3+ hours. The next time I see this issue, I will make sure to collect a memory report.
Flags: needinfo?(mshuman)
Assignee | ||
Comment 7•9 years ago
|
||
Assignee: lissyx+mozillians → kyle
Assignee | ||
Comment 8•9 years ago
|
||
Ok, hijacking this to make it more specific. We've found a problem triggered by phones with multiple sims. The CostControl app has a few logic branches where, if a phone is multisim, it will do some checks about which one we're transferring data through. These checks can cause the app to reload itself in the same child process, meaning we destroy and recreate the SettingsManager, but don't send a message to notify the SettingsRequestManager in the parent process to clean up the locks in the old SettingsManager. This means that if we create a settings request that triggers this reset, it gets stuck in the queue. This would explain why gwagner has been seeing lockups after sms and phone calls. I'm not sure how this relates to other reports in this bug, but I'm sure there are other apps that may reload themselves in the child process that could cause this.
Summary: Broken Settings API → Apps that reload themselves in the same child process can jam Settings API
Comment 9•9 years ago
|
||
I managed to grab a memory report after encountering Operator Services. Prior to this occurring, I was sending PNG images to another phone via accessing the Messages App from Homescreen: Wi-Fi was disabled, Data Connection (3G Enabled from SIM1, nothing in SIM2) and Automatic Downloads was off. Immediately after, I pushed a small music file onto the phone to continue my testing via the 'adb push' command from my terminal, in an attempt to access the Messages app from within the Music App: (Open Music App; locate music file; play, and access Share -> Messages.) Once in my share activity of Messages, I added a title to the message, added my contact's number and attempted to send to no avail, which made me suspect the Settings API was having issues. Prior to me noticing the message send failure, I had attempted to preview the small music file and reached a broken UI screen (observed in a different UX in bug 1111092.) This forced me to force close my music app through card view (thus terminating the messages share activity) and then relaunch the Music app. All of this business around accessing Messages from within Music occurred while the Messages App remained open. Hopefully something within the above details helps may help narrow this down. I managed to pull a python memory report from the command suggested in comment 1. I also pulled a logcat. Both will be attached immediately below this comment.
Comment 10•9 years ago
|
||
Comment 11•9 years ago
|
||
Assignee | ||
Comment 12•9 years ago
|
||
Cleaning up the dependencies. This bug is purely due to bug 1065128 and 1082001. In bug 1065128, we changed messages from inner-window-destroyed to dom-window-destroyed due to the way we were caching permissions and sending around window objects. This caused the issues that lead to bug 1082001, because dom-window-destroyed was being called at times we didn't expect and we cleaned up locks that weren't ready yet. In bug 1082001, we moved the checks for inner-window-destroyed to the parent process as the belief was that the only time a SettingsManager could die would be either child process death or app change in the parent process. This did not account for the fact that an app could reload itself in the child process, as Cost Control does. However, the bug initially present in bug 1065128 should now be fixed by moving the caching of principals to lock objects, which was done in bug 1105511. Therefore, we /should/ be able to call finalize on all live locks in inner-window-destroyed and just not expect a response back. I'm working on a patch to test this hypothesis, will post a WIP when done.
Assignee | ||
Comment 13•9 years ago
|
||
So, I /think/ we can do this, but I haven't tested it yet. Possible problem is double finalizing when an app is in the parent process and settingsrequestmanager tries to deal with it too. Also, I need to talk to :sicking or :bent about how to deal with security around this, because we've changed how/when we send principals, and I'm not real sure about dealing with that.
Attachment #8536929 -
Flags: feedback?(lissyx+mozillians)
Attachment #8536929 -
Flags: feedback?(anygregor)
Reporter | ||
Comment 14•9 years ago
|
||
It's not breaking Mulet at least :)
Reporter | ||
Comment 15•9 years ago
|
||
Comment on attachment 8536929 [details] [diff] [review] Patch 1 (v1) - Try finalizing in inner-window-destroyed Review of attachment 8536929 [details] [diff] [review]: ----------------------------------------------------------------- With your feedback on the root cause thanks to Gregor's logs, I could confirm with my Settings API stressing app that: - the lockup comes from the reload of the app while a lock is held - your changes does make the queue still running.
Attachment #8536929 -
Flags: feedback?(lissyx+mozillians) → feedback+
Reporter | ||
Comment 16•9 years ago
|
||
This app will trigger the issue.
Assignee | ||
Comment 17•9 years ago
|
||
Updated•9 years ago
|
blocking-b2g: --- → 2.2?
Assignee | ||
Comment 18•9 years ago
|
||
I've now made this dependent on bug 1112936, and have moved the server load script there. That means, assuming that passes, all I'll be adding here is the new test. Will have a new WIP up ASAP.
Assignee | ||
Comment 19•9 years ago
|
||
Attachment #8537640 -
Attachment is obsolete: true
Assignee | ||
Comment 20•9 years ago
|
||
Cleaned up tests and made very sure they test exactly what they're supposed to. Note that this tests in-process cleanup too, since both the SettingsRequestManager (since it can see the inner-window-destroyed message) and the SettingsManager will file finalize tasks for a lock.
Assignee | ||
Updated•9 years ago
|
Attachment #8538212 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8536929 -
Flags: feedback?(anygregor) → review?(lissyx+mozillians)
Assignee | ||
Updated•9 years ago
|
Attachment #8536929 -
Attachment description: Patch 1 (v1) - WIP: Try finalizing in inner-window-destroyed → Patch 1 (v1) - Try finalizing in inner-window-destroyed
Assignee | ||
Updated•9 years ago
|
Attachment #8538239 -
Flags: review?(lissyx+mozillians)
Assignee | ||
Comment 21•9 years ago
|
||
Try runs of r?'d patches. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=79aeb711c56a https://tbpl.mozilla.org/?tree=Try&rev=79aeb711c56a
Comment 22•9 years ago
|
||
[Blocking Requested - why for this release]: Like explained in comment 8, every user who activates Cost Control and have a multiple SIM phone will freeze some basic functionalities of their phone without any further explication. This bug needs to be fixed on 2.1.
Reporter | ||
Comment 23•9 years ago
|
||
Comment on attachment 8536929 [details] [diff] [review] Patch 1 (v1) - Try finalizing in inner-window-destroyed Review of attachment 8536929 [details] [diff] [review]: ----------------------------------------------------------------- r+ since it fixes the issue as documented earlier. ::: dom/settings/SettingsManager.js @@ +462,4 @@ > > cleanup: function() { > Services.obs.removeObserver(this, "inner-window-destroyed"); > + // At this point, the window is dying, so there's nothing left Nit: check whitespaces at the end of the line @@ +464,5 @@ > Services.obs.removeObserver(this, "inner-window-destroyed"); > + // At this point, the window is dying, so there's nothing left > + // that we could do with our lock. Go ahead and run finalize on > + // it to make sure changes are commited. > + for (let i = 0; i < this._locks.length; ++i) { Nit: I'd prefer the this._locks.forEach(lock => {}) syntax, it's easier to read
Attachment #8536929 -
Flags: review?(lissyx+mozillians) → review+
Reporter | ||
Comment 24•9 years ago
|
||
Comment on attachment 8538239 [details] [diff] [review] Patch 2 (v3) - WIP: Mochitest for testing Settings OOP Reload Error Review of attachment 8538239 [details] [diff] [review]: ----------------------------------------------------------------- The test does properly consistently fail when I do not have the fix applied, yet: - it passes without the fix when launching only this test: - |./mach mochitest-plain dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix, which makes me a bit wondering why ? I have not had a look at the behavior, but is it possible that in this case we really do not block the queue since we are the only user ? - |./mach mochitest-plain dom/settings/tests/| did fail properly without the patch - I saw some iterations where it would expose more than the expected failure: multiple finish() called We should also make sure it's not getting some intermittence, since I think I had a couple of "multiple finish() called" occurrences when running tests while building a Gecko. With the patch applied, it was green for 5-6 runs and then failing for 2-3. I'm fine with the test if it's properly green on tbpl and/or you check my comment on the locks. Once we have consistently green tbpl/proved to be non intermittent, ask me again a review :) ::: dom/settings/tests/file_bug1110872.html @@ +17,5 @@ > + var lock = navigator.mozSettings.createLock(); > + var req = lock.get("wallpaper.image"); > + // We don't actually care about success or failure here, we just > + // want to know the queue gets processed at all. > + req.onsuccess = function () { Maybe it would be less prone to intermittences if: - we create a first lock, then do a request. Maybe without any handlers - we force a reload - after we reloaded, we create another lock, and we do a get() - only if this second get()'s onsuccess handler is called, we PASS the test @@ +25,5 @@ > + parent.postMessage({name:"done" + msg.data.step}, "*"); > + }; > + // If this is our first time through, reload > + // before the SettingsManager has a chance to get a response > + // to our query. Maybe we should make it more obvious in this comment that this is indeed expecting to expose the issue. ::: dom/settings/tests/file_bug1110872.js @@ +18,5 @@ > +} > + > +function iframeBodyRecv(msg) { > + switch (loadedEvents) { > + case 1: nit: indentation ?
Attachment #8538239 -
Flags: review?(lissyx+mozillians)
Reporter | ||
Comment 25•9 years ago
|
||
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #21) > Try runs of r?'d patches. > > https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=79aeb711c56a > https://tbpl.mozilla.org/?tree=Try&rev=79aeb711c56a Those try are totally red, not even building :)
Flags: needinfo?(kyle)
Assignee | ||
Comment 26•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #24) > The test does properly consistently fail when I do not have the fix applied, > yet: > - it passes without the fix when launching only this test: > - |./mach mochitest-plain > dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix, > which makes me a bit wondering why ? I have not had a look at the > behavior, but is it possible that > in this case we really do not block the queue since we are the only > user ? > - |./mach mochitest-plain dom/settings/tests/| did fail properly without > the patch You didn't run either of these with --e10s, so it's running with tabs in-process, but this bug is specific to remote tab situations. Running in-process means that any inner-window-destroyed message exists within the parent process (since there is no child), so the handler for that case, which you added to SettingsRequestManager in bug 1082001. > - I saw some iterations where it would expose more than the expected > failure: multiple finish() called > > We should also make sure it's not getting some intermittence, since I think > I had a couple of "multiple finish() called" occurrences when running tests > while building a Gecko. > > With the patch applied, it was green for 5-6 runs and then failing for 2-3. Where was it failing? In the reload test? I can change it to actively look for messages. > Maybe it would be less prone to intermittences if: > - we create a first lock, then do a request. Maybe without any handlers > - we force a reload > - after we reloaded, we create another lock, and we do a get() > - only if this second get()'s onsuccess handler is called, we PASS the test That sounds fine, will add it.
Flags: needinfo?(kyle)
Reporter | ||
Comment 27•9 years ago
|
||
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #26) > (In reply to Alexandre LISSY :gerard-majax from comment #24) > > The test does properly consistently fail when I do not have the fix applied, > > yet: > > - it passes without the fix when launching only this test: > > - |./mach mochitest-plain > > dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix, > > which makes me a bit wondering why ? I have not had a look at the > > behavior, but is it possible that > > in this case we really do not block the queue since we are the only > > user ? > > - |./mach mochitest-plain dom/settings/tests/| did fail properly without > > the patch > > You didn't run either of these with --e10s, so it's running with tabs > in-process, but this bug is specific to remote tab situations. Running > in-process means that any inner-window-destroyed message exists within the > parent process (since there is no child), so the handler for that case, > which you added to SettingsRequestManager in bug 1082001. Makes sense, but then maybe it's worth only running on e10s, and thus your patch lacks a skip-if for non e10s builds :) > > > - I saw some iterations where it would expose more than the expected > > failure: multiple finish() called > > > > We should also make sure it's not getting some intermittence, since I think > > I had a couple of "multiple finish() called" occurrences when running tests > > while building a Gecko. > > > > With the patch applied, it was green for 5-6 runs and then failing for 2-3. > > Where was it failing? In the reload test? I can change it to actively look > for messages. I think so, with this: > 609 INFO TEST-UNEXPECTED-FAIL | dom/settings/tests/test_settings_bug1110872.html | got response before we should've! - expected PASS
Assignee | ||
Comment 28•9 years ago
|
||
Rerunning try, this time with /all/ of the files commited. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=58bc3d6483c5 https://tbpl.mozilla.org/?tree=Try&rev=58bc3d6483c5
Assignee | ||
Comment 29•9 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #27) > > 609 INFO TEST-UNEXPECTED-FAIL | dom/settings/tests/test_settings_bug1110872.html | got response before we should've! - expected PASS Ok, that actually means the test is broken. If we exhaust the queries before reloading, then we're not actually testing anything.
Assignee | ||
Comment 30•9 years ago
|
||
Landing fix due to dogfooding demands, breaking tests into a different patch since they'll require more work. https://hg.mozilla.org/integration/b2g-inbound/rev/236972484a61
Assignee | ||
Comment 31•9 years ago
|
||
Comment on attachment 8538239 [details] [diff] [review] Patch 2 (v3) - WIP: Mochitest for testing Settings OOP Reload Error Moving tests to separate bug.
Attachment #8538239 -
Attachment is obsolete: true
Assignee | ||
Comment 32•9 years ago
|
||
Comment on attachment 8536929 [details] [diff] [review] Patch 1 (v1) - Try finalizing in inner-window-destroyed NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings. [Approval Request Comment] Bug caused by (feature/regressing bug #): Bug 1082001 User impact if declined: Phone will lock up and require reboot at random times Testing completed: Tests exist, not checked in yet, see comments in this bug and bug 1113199. Has been dogfooded by a couple of engineers. Risk to taking this patch (and alternatives if risky): Settings API changes, which hit a lot, but also fixes a lot. String or UUID changes made by this patch: None
Attachment #8536929 -
Flags: approval-mozilla-b2g34?
Comment 33•9 years ago
|
||
waiting for central landing before considering branch approval.
Comment 34•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/236972484a61
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
Updated•9 years ago
|
Attachment #8536929 -
Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Comment 35•9 years ago
|
||
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #22) > [Blocking Requested - why for this release]: Like explained in comment 8, > every user who activates Cost Control and have a multiple SIM phone will > freeze some basic functionalities of their phone without any further > explication. This bug needs to be fixed on 2.1. JOhan/QA can you please verify the 2.1 branch post landing ? Thanks!
Comment 36•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/2570f8660a1a
Comment 37•9 years ago
|
||
Tests: https://hg.mozilla.org/mozilla-central/rev/ebab1280e934
Flags: in-testsuite+
Comment 39•9 years ago
|
||
My dogfooding phone has been at [1] for the last 6 days with Cost Control enabled (and 2 days of uptime currently). When this bug was around, it took me an average of 1 day to get it. I haven't hit the settings deadlock a single time since my last flash. This issue is verified to me on 2.1 then. If somebody has a different result, please call this out. [1] Build ID 20141225001203 Gaia Revision 17c7ad2e4919a994f0844239b483116090412dee Gaia Date 2014-12-22 21:46:43 Gecko Revision https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/39dfb662c82a Gecko Version 34.0 Device Name flame Firmware(Release) 4.4.2 Firmware(Incremental) eng.cltbld.20141120.194707 Firmware Date Thu Nov 20 19:47:17 EST 2014 Bootloader L1TC00011880
Comment 40•9 years ago
|
||
> up time: 6 days, 14:59:44, idle time: 05:58:53, sleep time: 6 days, 11:13:30 Still no issue on my dogfooding device with the same build as comment 39. I move to another build from now on.
Updated•9 years ago
|
Whiteboard: [systemsfe]
Updated•5 years ago
|
Component: DOM → DOM: Core & HTML
You need to log in
before you can comment on or make changes to this bug.
Description
•