Closed Bug 1110872 Opened 7 years ago Closed 7 years ago

Apps that reload themselves in the same child process can jam Settings API

Categories

(Core :: DOM: Core & HTML, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla37
blocking-b2g 2.1+
Tracking Status
firefox35 --- wontfix
firefox36 --- wontfix
firefox37 --- fixed
b2g-v2.1 --- verified
b2g-v2.2 --- fixed

People

(Reporter: gerard-majax, Assigned: qdot)

References

Details

(Whiteboard: [systemsfe])

Attachments

(6 files, 3 obsolete files)

+++ This bug was initially created as a clone of Bug #1105511 +++

Martin Shuman [:Marty] 2014-12-12 17:20:35 CET

I hit this issue yesterday after testing Email for about 3 hours, with a little bit of time spent in Clock and Calendar before that.  The main focus of my testing had been exploring Text Selection (Cut/Copy/Paste) throughout the app.  After these hours of testing, I navigated to Settings, and saw that 'Operator Services' was present at the bottom of the menu, and I couldn't enable or access the Developer menu.

Environmental Variables:
Device: Flame 2.2 (319MB)
BuildID: 20141211040208 (Full Flash)
Gaia: 1a956437210d2b16988d2ddbf40c9a38d8474435
Gecko: d92db71d4e67
Gonk: 263b5f41f7733c5577fb101eb4dc8ac5c11cfa8d
Version: 37.0a1 (2.2) 
Firmware Version: v188-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0
Can you try collecting about:memory report ?

Just run:
> $ python tools/get_about_memory.py

From your B2G/ directory. Then you can provide the memory report as an attachment.

Also, during your testing, did you locked/unlocked the device a lot ?
Flags: needinfo?(mshuman)
As far as I can say, your Gecko does contains fixes but your Gaia does not include the fix for bug 1105639
I asked this question in bug 1105639 already. I don't get how we can allow Gaia code to break the device that badly.
Kyle, I had a look at the logcat of my Nexus S, and there was still a bunch of errors.

Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=c24633af8019588b78e79ad428d7430cded2dffc
Gaia: https://github.com/mozilla-b2g/gaia/tree/4cdeee67b449db90aae9384337311547c280093c

As far as I can say, Setting was not broken.
Flags: needinfo?(kyle)
There were still a bunch of errors because the commit you reference of gecko does not contain the patches from bug 1105511. Please update to a newer commit and retry.
Flags: needinfo?(kyle)
When testing, I was not locking my device much at all.  I had my screen time-out set to 'Never,' and probably only manually locked the screen 3 or 4 times over the course of 3+ hours.

The next time I see this issue, I will make sure to collect a memory report.
Flags: needinfo?(mshuman)
Assignee: lissyx+mozillians → kyle
Ok, hijacking this to make it more specific.

We've found a problem triggered by phones with multiple sims. The CostControl app has a few logic branches where, if a phone is multisim, it will do some checks about which one we're transferring data through. These checks can cause the app to reload itself in the same child process, meaning we destroy and recreate the SettingsManager, but don't send a message to notify the SettingsRequestManager in the parent process to clean up the locks in the old SettingsManager. This means that if we create a settings request that triggers this reset, it gets stuck in the queue.

This would explain why gwagner has been seeing lockups after sms and phone calls. I'm not sure how this relates to other reports in this bug, but I'm sure there are other apps that may reload themselves in the child process that could cause this.
Summary: Broken Settings API → Apps that reload themselves in the same child process can jam Settings API
I managed to grab a memory report after encountering Operator Services.
Prior to this occurring, I was sending PNG images to another phone via accessing the Messages App from Homescreen: Wi-Fi was disabled, Data Connection (3G Enabled from SIM1, nothing in SIM2) and Automatic Downloads was off.

Immediately after, I pushed a small music file onto the phone to continue my testing via the 'adb push' command from my terminal, in an attempt to access the Messages app from within the Music App:
(Open Music App; locate music file; play, and access Share -> Messages.)

Once in my share activity of Messages, I added a title to the message, added my contact's number and attempted to send to no avail, which made me suspect the Settings API was having issues.
Prior to me noticing the message send failure, I had attempted to preview the small music file and reached a broken UI screen (observed in a different UX in bug 1111092.) This forced me to force close my music app through card view (thus terminating the messages share activity) and then relaunch the Music app. 
All of this business around accessing Messages from within Music occurred while the Messages App remained open.

Hopefully something within the above details helps may help narrow this down. I managed to pull a python memory report from the command suggested in comment 1. I also pulled a logcat. Both will be attached immediately below this comment.
Cleaning up the dependencies. This bug is purely due to bug 1065128 and 1082001. In bug 1065128, we changed messages from inner-window-destroyed to dom-window-destroyed due to the way we were caching permissions and sending around window objects. This caused the issues that lead to bug 1082001, because dom-window-destroyed was being called at times we didn't expect and we cleaned up locks that weren't ready yet. In bug 1082001, we moved the checks for inner-window-destroyed to the parent process as the belief was that the only time a SettingsManager could die would be either child process death or app change in the parent process. This did not account for the fact that an app could reload itself in the child process, as Cost Control does. However, the bug initially present in bug 1065128 should now be fixed by moving the caching of principals to lock objects, which was done in bug 1105511. Therefore, we /should/ be able to call finalize on all live locks in inner-window-destroyed and just not expect a response back. I'm working on a patch to test this hypothesis, will post a WIP when done.
Depends on: 1065128, 1082001
No longer depends on: 1107329, 1105639, 1106324, 1106896
So, I /think/ we can do this, but I haven't tested it yet. Possible problem is double finalizing when an app is in the parent process and settingsrequestmanager tries to deal with it too. Also, I need to talk to :sicking or :bent about how to deal with security around this, because we've changed how/when we send principals, and I'm not real sure about dealing with that.
Attachment #8536929 - Flags: feedback?(lissyx+mozillians)
Attachment #8536929 - Flags: feedback?(anygregor)
It's not breaking Mulet at least :)
Comment on attachment 8536929 [details] [diff] [review]
Patch 1 (v1) - Try finalizing in inner-window-destroyed

Review of attachment 8536929 [details] [diff] [review]:
-----------------------------------------------------------------

With your feedback on the root cause thanks to Gregor's logs, I could confirm with my Settings API stressing app that:
 - the lockup comes from the reload of the app while a lock is held
 - your changes does make the queue still running.
Attachment #8536929 - Flags: feedback?(lissyx+mozillians) → feedback+
Attached file settingstest.zip
This app will trigger the issue.
blocking-b2g: --- → 2.2?
I've now made this dependent on bug 1112936, and have moved the server load script there. That means, assuming that passes, all I'll be adding here is the new test. Will have a new WIP up ASAP.
Cleaned up tests and made very sure they test exactly what they're supposed to. Note that this tests in-process cleanup too, since both the SettingsRequestManager (since it can see the inner-window-destroyed message) and the SettingsManager will file finalize tasks for a lock.
Attachment #8538212 - Attachment is obsolete: true
Attachment #8536929 - Flags: feedback?(anygregor) → review?(lissyx+mozillians)
Attachment #8536929 - Attachment description: Patch 1 (v1) - WIP: Try finalizing in inner-window-destroyed → Patch 1 (v1) - Try finalizing in inner-window-destroyed
Attachment #8538239 - Flags: review?(lissyx+mozillians)
[Blocking Requested - why for this release]: Like explained in comment 8, every user who activates Cost Control and have a multiple SIM phone will freeze some basic functionalities of their phone without any further explication. This bug needs to be fixed on 2.1.
blocking-b2g: 2.2? → 2.1?
Comment on attachment 8536929 [details] [diff] [review]
Patch 1 (v1) - Try finalizing in inner-window-destroyed

Review of attachment 8536929 [details] [diff] [review]:
-----------------------------------------------------------------

r+ since it fixes the issue as documented earlier.

::: dom/settings/SettingsManager.js
@@ +462,4 @@
>  
>    cleanup: function() {
>      Services.obs.removeObserver(this, "inner-window-destroyed");
> +    // At this point, the window is dying, so there's nothing left	

Nit: check whitespaces at the end of the line

@@ +464,5 @@
>      Services.obs.removeObserver(this, "inner-window-destroyed");
> +    // At this point, the window is dying, so there's nothing left	
> +    // that we could do with our lock. Go ahead and run finalize on	
> +    // it to make sure changes are commited.	
> +    for (let i = 0; i < this._locks.length; ++i) {	

Nit: I'd prefer the this._locks.forEach(lock => {}) syntax, it's easier to read
Attachment #8536929 - Flags: review?(lissyx+mozillians) → review+
Comment on attachment 8538239 [details] [diff] [review]
Patch 2 (v3) - WIP: Mochitest for testing Settings OOP Reload Error

Review of attachment 8538239 [details] [diff] [review]:
-----------------------------------------------------------------

The test does properly consistently fail when I do not have the fix applied, yet:
 - it passes without the fix when launching only this test:
   - |./mach mochitest-plain dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix,
     which makes me a bit wondering why ? I have not had a look at the behavior, but is it possible that
     in this case we really do not block the queue since we are the only user ?
   - |./mach mochitest-plain dom/settings/tests/| did fail properly without the patch
 - I saw some iterations where it would expose more than the expected failure: multiple finish() called

We should also make sure it's not getting some intermittence, since I think I had a couple of "multiple finish() called" occurrences when running tests while building a Gecko.

With the patch applied, it was green for 5-6 runs and then failing for 2-3.

I'm fine with the test if it's properly green on tbpl and/or you check my comment on the locks.

Once we have consistently green tbpl/proved to be non intermittent, ask me again a review :)

::: dom/settings/tests/file_bug1110872.html
@@ +17,5 @@
> +                 var lock = navigator.mozSettings.createLock();
> +                 var req = lock.get("wallpaper.image");
> +                 // We don't actually care about success or failure here, we just
> +                 // want to know the queue gets processed at all.
> +                 req.onsuccess = function () {

Maybe it would be less prone to intermittences if:
 - we create a first lock, then do a request. Maybe without any handlers
 - we force a reload
 - after we reloaded, we create another lock, and we do a get()
 - only if this second get()'s onsuccess handler is called, we PASS the test

@@ +25,5 @@
> +                     parent.postMessage({name:"done" + msg.data.step}, "*");
> +                 };
> +                 // If this is our first time through, reload
> +                 // before the SettingsManager has a chance to get a response
> +                 // to our query.

Maybe we should make it more obvious in this comment that this is indeed expecting to expose the issue.

::: dom/settings/tests/file_bug1110872.js
@@ +18,5 @@
> +}
> +
> +function iframeBodyRecv(msg) {
> +  switch (loadedEvents) {
> +  case 1:

nit: indentation ?
Attachment #8538239 - Flags: review?(lissyx+mozillians)
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #21)
> Try runs of r?'d patches.
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=79aeb711c56a
> https://tbpl.mozilla.org/?tree=Try&rev=79aeb711c56a

Those try are totally red, not even building :)
Flags: needinfo?(kyle)
(In reply to Alexandre LISSY :gerard-majax from comment #24)
> The test does properly consistently fail when I do not have the fix applied,
> yet:
>  - it passes without the fix when launching only this test:
>    - |./mach mochitest-plain
> dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix,
>      which makes me a bit wondering why ? I have not had a look at the
> behavior, but is it possible that
>      in this case we really do not block the queue since we are the only
> user ?
>    - |./mach mochitest-plain dom/settings/tests/| did fail properly without
> the patch

You didn't run either of these with --e10s, so it's running with tabs in-process, but this bug is specific to remote tab situations. Running in-process means that any inner-window-destroyed message exists within the parent process (since there is no child), so the handler for that case, which you added to SettingsRequestManager in bug 1082001.

>  - I saw some iterations where it would expose more than the expected
> failure: multiple finish() called
> 
> We should also make sure it's not getting some intermittence, since I think
> I had a couple of "multiple finish() called" occurrences when running tests
> while building a Gecko.
> 
> With the patch applied, it was green for 5-6 runs and then failing for 2-3.

Where was it failing? In the reload test? I can change it to actively look for messages.

> Maybe it would be less prone to intermittences if:
>  - we create a first lock, then do a request. Maybe without any handlers
>  - we force a reload
>  - after we reloaded, we create another lock, and we do a get()
>  - only if this second get()'s onsuccess handler is called, we PASS the test

That sounds fine, will add it.
Flags: needinfo?(kyle)
(In reply to Kyle Machulis [:kmachulis] [:qdot] (USE NEEDINFO?) from comment #26)
> (In reply to Alexandre LISSY :gerard-majax from comment #24)
> > The test does properly consistently fail when I do not have the fix applied,
> > yet:
> >  - it passes without the fix when launching only this test:
> >    - |./mach mochitest-plain
> > dom/settings/tests/test_settings_bug1110872.html| is green WITHOUT the fix,
> >      which makes me a bit wondering why ? I have not had a look at the
> > behavior, but is it possible that
> >      in this case we really do not block the queue since we are the only
> > user ?
> >    - |./mach mochitest-plain dom/settings/tests/| did fail properly without
> > the patch
> 
> You didn't run either of these with --e10s, so it's running with tabs
> in-process, but this bug is specific to remote tab situations. Running
> in-process means that any inner-window-destroyed message exists within the
> parent process (since there is no child), so the handler for that case,
> which you added to SettingsRequestManager in bug 1082001.

Makes sense, but then maybe it's worth only running on e10s, and thus your patch lacks a skip-if for non e10s builds :)

> 
> >  - I saw some iterations where it would expose more than the expected
> > failure: multiple finish() called
> > 
> > We should also make sure it's not getting some intermittence, since I think
> > I had a couple of "multiple finish() called" occurrences when running tests
> > while building a Gecko.
> > 
> > With the patch applied, it was green for 5-6 runs and then failing for 2-3.
> 
> Where was it failing? In the reload test? I can change it to actively look
> for messages.

I think so, with this:

> 609 INFO TEST-UNEXPECTED-FAIL | dom/settings/tests/test_settings_bug1110872.html | got response before we should've! - expected PASS
(In reply to Alexandre LISSY :gerard-majax from comment #27)

> > 609 INFO TEST-UNEXPECTED-FAIL | dom/settings/tests/test_settings_bug1110872.html | got response before we should've! - expected PASS

Ok, that actually means the test is broken. If we exhaust the queries before reloading, then we're not actually testing anything.
Landing fix due to dogfooding demands, breaking tests into a different patch since they'll require more work.

https://hg.mozilla.org/integration/b2g-inbound/rev/236972484a61
Comment on attachment 8538239 [details] [diff] [review]
Patch 2 (v3) - WIP: Mochitest for testing Settings OOP Reload Error

Moving tests to separate bug.
Attachment #8538239 - Attachment is obsolete: true
Comment on attachment 8536929 [details] [diff] [review]
Patch 1 (v1) - Try finalizing in inner-window-destroyed

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 1082001
User impact if declined: Phone will lock up and require reboot at random times
Testing completed: Tests exist, not checked in yet, see comments in this bug and bug 1113199. Has been dogfooded by a couple of engineers.
Risk to taking this patch (and alternatives if risky): Settings API changes, which hit a lot, but also fixes a lot.
String or UUID changes made by this patch: None
Attachment #8536929 - Flags: approval-mozilla-b2g34?
waiting for central landing before considering branch approval.
https://hg.mozilla.org/mozilla-central/rev/236972484a61
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
blocking-b2g: 2.1? → 2.1+
Keywords: verifyme
Attachment #8536929 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #22)
> [Blocking Requested - why for this release]: Like explained in comment 8,
> every user who activates Cost Control and have a multiple SIM phone will
> freeze some basic functionalities of their phone without any further
> explication. This bug needs to be fixed on 2.1.

JOhan/QA can you please verify the 2.1 branch post landing ? Thanks!
So nobody checked ...
Flags: needinfo?(jlorenzo)
My dogfooding phone has been at [1] for the last 6 days with Cost Control enabled (and 2 days of uptime currently). When this bug was around, it took me an average of 1 day to get it. I haven't hit the settings deadlock a single time since my last flash.

This issue is verified to me on 2.1 then. If somebody has a different result, please call this out.

[1] Build ID               20141225001203
Gaia Revision          17c7ad2e4919a994f0844239b483116090412dee
Gaia Date              2014-12-22 21:46:43
Gecko Revision         https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/39dfb662c82a
Gecko Version          34.0
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20141120.194707
Firmware Date          Thu Nov 20 19:47:17 EST 2014
Bootloader             L1TC00011880
Status: RESOLVED → VERIFIED
Flags: needinfo?(jlorenzo)
> up time: 6 days, 14:59:44, idle time: 05:58:53, sleep time: 6 days, 11:13:30

Still no issue on my dogfooding device with the same build as comment 39. I move to another build from now on.
Whiteboard: [systemsfe]
Keywords: verifyme
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.