Closed Bug 945948 Opened 11 years ago Closed 11 years ago

[B2G][SMS] The user cannot scroll through their entire contact list when composing a SMS

Categories

(Core Graveyard :: DOM: Contacts, defect, P1)

26 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:koi+, firefox27 wontfix, firefox28 fixed, firefox29 fixed, b2g-v1.2 fixed, b2g-v1.3 fixed)

RESOLVED FIXED
mozilla29
blocking-b2g koi+
Tracking Status
firefox27 --- wontfix
firefox28 --- fixed
firefox29 --- fixed
b2g-v1.2 --- fixed
b2g-v1.3 --- fixed

People

(Reporter: KTucker, Assigned: bkelly)

References

Details

(Keywords: regression, Whiteboard: dogfood1.2 [c= p=4 s= u=1.2])

Attachments

(5 files, 7 obsolete files)

11.30 KB, text/plain
Details
3.42 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
3.54 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.24 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.62 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
Description:
If the user has a large contact list, they will not be able to scroll through all their contacts when composing a SMS. They will always get stuck at some point when scrolling through their list and cannot scroll any further. 

Repro Steps:
1)  Updated Buri to Build ID: 20131126004001
2)  Ensure the phone has 2000 contacts. Use the script ./makeReferenceWorkload.sh x-heavy if necessary.
3)  Tap the "Messages" icon.
4)  Tap on the "Compose new messsage" icon.
5)  Tap on the "+" icon to open up the user's contacts list.
6)  Keeping scrolling through the entire contact list.

Actual:
Eventually, the user will get stuck and cannot scroll any further down their contact list.

Expected:
The user can scroll down to see all of their contacts without issue.

Environmental Variables
Device: Buri v 1.2.0 COM RIL
Build ID: 20131126004001
Gecko: http://hg.mozilla.org/releases/mozilla-b2g26_v1_2/rev/21e2ad082d85
Gaia: 264c6044b941437ac3c4b28fe4ca392d2bc78445
Platform Version: 26.0
RIL Version: 01.02.00.019.102 

Notes:
Repro frequency: 100%
See attached: video clip, logcat
I tried to reproduce this issue on Leo v 1.1.0 but could not get the contacts loaded on the device using the script or by trying to import them using a SD card.
Attached file NewSMSlog.txt
The video of this issue has been uploaded to youtube:

http://www.youtube.com/watch?v=PetDwnQ_Rik&feature=youtu.be
ktucker - the youtube video here is private. Can you open this up?
Flags: needinfo?(ktucker)
Anyways - this sounds bad. We should never end up in a state where picking a contact isn't possible.
blocking-b2g: --- → koi?
Sorry about that. The video has been changed to unlisted. 

http://youtu.be/PetDwnQ_Rik
Flags: needinfo?(ktucker)
ktucker: you can try: "APP=communications/contacts make reference-workload-medium" from a gaia directory, this will load a 1.1-compatible database.
Component: Gaia::SMS → Gaia::Contacts
How does it work if you use the letters on the right? Does it stop at the same location?

(note to self: we should have a way to easily enable the debugging log to make useful logcats...)
Flags: needinfo?(ktucker)
After what number does the contact freeze?

David,

Please take a look here and reassign appropriately.
Flags: needinfo?(dscravaglieri)
(In reply to Preeti Raghunath(:Preeti) from comment #9)
> After what number does the contact freeze?
> 
> David,
> 
> Please take a look here and reassign appropriately.

Adding blocking+ per triage - we should never end up in a situation where adding a contact can fail all together.
blocking-b2g: koi? → koi+
(In reply to Julien Wajsberg [:julienw] from comment #8)
> How does it work if you use the letters on the right? Does it stop at the
> same location?
> 
> (note to self: we should have a way to easily enable the debugging log to
> make useful logcats...)

The alphabet bar will not take the user to the appropriate contacts until those contacts starting with that letter have been loaded which can take awhile. I have been able to jump down to the letter "Z" once to see all the contacts. Usually, i cannot jump down past whatever letter the user would get stuck on if they were manually scrolling through their contacts.
Flags: needinfo?(ktucker)
(In reply to Preeti Raghunath(:Preeti) from comment #9)
> After what number does the contact freeze?
> 
> David,
> 
> Please take a look here and reassign appropriately.

The user gets stuck on a different letter each time when scrolling through their contacts list. It is not the same letter each time.
hi Julien, will you be able to look at this? Thanks
Flags: needinfo?(felash)
Not at all, I know nothing about the Contacts app :(

Ben could be a better fit, redirecting the need info flag.
Flags: needinfo?(felash) → needinfo?(bkelly)
Keywords: regression
I could not reproduce this issue on Leo v 1.1.0 COM RIL

Environmental Variables
Device: Leo v 1.1.0 COM RIL
Build ID: 20131205041342
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/05117f42088f
Gaia: 6ff3a607f873320d00cb036fa76117f6fadd010f
Platform Version: 18.1
RIL Version: 01.01.00.019.281 

The user can scroll through all the contacts without issue when composing a new message.
as we need to close koi+ very soon, to bkelly as suggested in comment 14. 
if you are not the proper owner, please unassign. thanks
Assignee: nobody → bkelly
It's on my list.  I actually spent some time trying to reproduce today, but got sidetracked by some bustage on 1.3.
Flags: needinfo?(bkelly)
Whiteboard: dogfood1.2 → dogfood1.2 [c= p= s= u=1.2]
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: dogfood1.2 [c= p= s= u=1.2] → dogfood1.2 [c= p=3 s= u=1.2]
QA Contact: sarsenyev
Reproduced.  Some selected interesting log messages:

E/GeckoConsole( 9652): [JavaScript Error: "Return value is not an object."]
E/GeckoConsole( 9652): [JavaScript Error: "Return value is not an object."]
I/GeckoDump( 9652): XXX FIXME : Got a mozContentEvent: activity-choice
E/GeckoConsole( 9652): [JavaScript Error: "Return value is not an object."]

E/GeckoConsole(10028): Content JS LOG at app://sms.gaiamobile.org/js/thread_ui.js:546 in thui_requestContact/activity.onerror<: WebActivities unavailable? : [object Event]
I think that sms log message was from when the contacts app ultimately got closed without picking anything.  We can ignore it.

We're definitely under a lot of memory pressure here with reference-workload-x-heavy:

                           |     megabytes    |
           NAME   PID NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
            b2g 10276    0 49.7 52.7 62.1 185.2       0 root
     Homescreen 10334   18 13.9 16.5 25.4  69.4       8 app_10334
       Messages 10402    1 33.8 36.7 46.1 103.0       2 app_10402
 Communications 10874    1 22.6 27.8 39.7 123.3       2 app_10874
(Preallocated a 10899   18  4.6  8.1 15.0  50.1      10 root

System memory info:

            Total 179.7 MB
     Used - cache 146.9 MB
  B2G procs (PSS) 141.8 MB
    Non-B2G procs   5.2 MB
     Free + cache  32.8 MB
             Free   2.5 MB
            Cache  30.3 MB

The pre-allocated process gets killed which gets us to ~9MB free and we finally end up at ~11MB free when the problem happens.

I wonder if some piece of the IPC mechanism is being garbage collected before we can finish loading the contacts.  I checked and we do have the fix for bug 928389 in the b2g26.
So we are getting this:

I/Gecko   (12828): ###!!! [Child][AsyncChannel] Error: Channel closing: too late to send/recv, messages will be lost

But the client process is not being killed, which is when we normally see this message.  This suggests the IPC channel is closing due to another error.  I'm digging into that now.  It appears to be coming from one of the conditions in here:

http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#420

Ben, is there a known condition that can cause IPC channels to fail under memory pressure?  This is probably normal, but I just want to make sure I understand whats going on.
Flags: needinfo?(bent.mozilla)
Hmm, it appears the pipe went through an orderly shutdown.  We're hitting this case:

  http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#449

I'll have to look at the client code next week to understand how memory pressure could cause that without killing the process.
Hmm, after looking a bit further, I think this was just a red-herring.  Thats debug from the pre-allocated process being killed.  Sorry for the noise.  I'll keep investigating.
Flags: needinfo?(bent.mozilla)
I had a theory that perhaps the pre-allocated process launching was causing us to spike memory enough to hit an allocation failure during IPC deserialization, closing the pipe.

I commented out the pre-allocated process launch, though, and the problem still occurs.

We still might want to consider not spawning the pre-allocated process when we are already under memory pressure and we know it will be killed.  I wrote bug 947571 to track that idea.
Because 1.2 is going to be finished, should we mark this bug from koi+ to 1.3+? Thanks.
Attached patch ipc-lowmem-hack.patch (obsolete) — Splinter Review
So it looks like the linux kernel is returning 0 from recvmsg(), even though the pipe is still open.  The 0 return value is only supposed to be returned for an orderly shutdown.

This patch is a hack to essentially ignore the 0 return if the previous attempt returned EAGAIN.  This lets us get past this erroneous return value and then allows the IPC channel to clean up on the next recvmsg when it gets a broken pipe failure.

I don't really like this solution, but it lets me scroll through the entire contacts list with this workload.

I'm going to try to see why the kernel might be returning a zero here instead of returning EAGAIN.  I assume its hitting some corner case when a memory allocation fails.
this bug is WIP
Flags: needinfo?(dscravaglieri)
Ben, do you think this can be another SkiaGL issue?
(In reply to Julien Wajsberg [:julienw] from comment #27)
> Ben, do you think this can be another SkiaGL issue?

Well, anything that uses memory will contribute to the memory pressure.  What I am investigating now, though, is it seems the IPC channel breaks in an unexpected way under memory pressure.  Ideally I would like to fix that.
There is no a regression window for 1.2 build.
The issue reproduces on Buri 1.1

In the some point the scrolling is stuck and the user cannot scrolling further, however the behavior is a little different than on the video, in a few seconds the user is able to continue the scrolling after freezing when scrolling back and forth
(In reply to sarsenyev from comment #29)
> There is no a regression window for 1.2 build.
> The issue reproduces on Buri 1.1
> 
> In the some point the scrolling is stuck and the user cannot scrolling
> further, however the behavior is a little different than on the video, in a
> few seconds the user is able to continue the scrolling after freezing when
> scrolling back and forth

Are you absolutely sure you running the same STR here? That doesn't sound like the same exact bug as this. Can you check with ktucker about this?
Flags: needinfo?(sarsenyev)
Also I noticed, when the scrolling is stuck if the user pressing the "home" button and then returning to the "contacts" app he can see only a white screen
(In reply to sarsenyev from comment #31)
> Also I noticed, when the scrolling is stuck if the user pressing the "home"
> button and then returning to the "contacts" app he can see only a white
> screen

Did you talk to ktucker about this? He got different results you did.
For what its worth, I can very reliably reproduce this error on my buri using the original STR.
(In reply to Jason Smith [:jsmith] from comment #30)
> (In reply to sarsenyev from comment #29)
> > There is no a regression window for 1.2 build.
> > The issue reproduces on Buri 1.1
> > 
> > In the some point the scrolling is stuck and the user cannot scrolling
> > further, however the behavior is a little different than on the video, in a
> > few seconds the user is able to continue the scrolling after freezing when
> > scrolling back and forth
> 
> Are you absolutely sure you running the same STR here? That doesn't sound
> like the same exact bug as this. Can you check with ktucker about this?

The old 1.1 behavior is that the contact list was loading quite slowly, and was "blocking" at one point when scrolling down nutil th enext chunk was loaded and appended to the page.
(In reply to Julien Wajsberg [:julienw] from comment #34)
> (In reply to Jason Smith [:jsmith] from comment #30)
> > (In reply to sarsenyev from comment #29)
> > > There is no a regression window for 1.2 build.
> > > The issue reproduces on Buri 1.1
> > > 
> > > In the some point the scrolling is stuck and the user cannot scrolling
> > > further, however the behavior is a little different than on the video, in a
> > > few seconds the user is able to continue the scrolling after freezing when
> > > scrolling back and forth
> > 
> > Are you absolutely sure you running the same STR here? That doesn't sound
> > like the same exact bug as this. Can you check with ktucker about this?
> 
> The old 1.1 behavior is that the contact list was loading quite slowly, and
> was "blocking" at one point when scrolling down nutil th enext chunk was
> loaded and appended to the page.

Okay - that implies that the behavior being seen in comment 29 is different than 1.2's implementation.
Flags: needinfo?(sarsenyev)
More clearly: this bug could (possibly) happen also in 1.1 if we wait long enough that the full list is loaded. But the behavior seen in comment 29 is definitely a thing of the past.
Attachment #8344684 - Attachment is obsolete: true
Ok, I think I finally understand what is going on.

First, my comments about recvmsg() acting incorrectly in comment 25 were wrong.  When I added more debug I saw that it was actually a different file descriptor getting closed.  This is good because its nice when you can trust the kernel.

What is actually happening here is:

1) The CostControl app is OOM killed.
2) This triggers the IPC pipe for that app to close.
3) This then causes the ContentParent class to send a "child-process-shutdown" message.
4) This is propagated to the parent message manager.
5) The parent message manager notifies 8 listeners of the "child-process-shutdown" message.
6) One of those listeners is ContactService.jsm which unconditionally cleans up any outstanding cursors.
7) This halts the flow of data to the Contacts app.

I don't see any information provided in the "child-process-shutdown" message that ContactService could key off of.  Perhaps we could include the mChildID from ContentParent or the app URI in the message.  Not sure ContactService would know what to do with mChildID, though.

Gregor, what do you think?

Note, it looks like the issue of child-process-shutdown not having any distinguishing info was added in bug 777508.  ContactService.jsm began using child-process-shutdown in bug 836423.  So this problem probably existed in v1.0.0 on and we didn't notice because we hit other problems first.
Blocks: 777508, 836423
Flags: needinfo?(anygregor)
On second thought, the URI would not be helpful.  ContactService.jsm should work for other apps using the contacts API.  We really need some way to match the child ID of the closing app with the child ID of the app using the cursor.
I think I can at least work around this for our current apps by using a timer in ContactService to only shutdown cursors that have not been active for N seconds after a child-process-shutdown is received.

I'll work this up as a patch, but I think it would be preferable to actually match the event to the right cursor.  The timeout may not work for all 3rd party apps depending on how they use the cursors.
(In reply to Ben Kelly [:bkelly] from comment #39)
> I think I can at least work around this for our current apps by using a
> timer in ContactService to only shutdown cursors that have not been active
> for N seconds after a child-process-shutdown is received.
> 
> I'll work this up as a patch, but I think it would be preferable to actually
> match the event to the right cursor.  The timeout may not work for all 3rd
> party apps depending on how they use the cursors.

If this isn't a regression, should we continue to block on this? What do you think?
I was about to say I'm fine bumping this to 1.3 or 1.4, but then I remembered that I can't actually install reference-workload-x-heavy on m-c/m-a.  So from a purely practical point of view I can probably verify this is fixed on 1.2, but may have difficulty on the more recent releases.  (Probably not a good reason to keep it, I guess.)
Okay - I'm going to unblock this then, since this isn't a regression.
blocking-b2g: koi+ → -
Ack.  Actually this is a regression.  Sorry for my confusion here!

The ContactService does try to only clean up the appropriate cursors for a given nsFrameMessageManager, but due to a bug it treats all the message managers the same.  This is caused by trying to use the message manager object as a property in a javascript object.  This just coerces to a string; the same string for all managers.  This code was adding in bug 871872.

I have a patch I'm testing now to use a javascript Map instead.
Blocks: 871872
No longer blocks: 777508, 836423
blocking-b2g: - → koi?
Flags: needinfo?(anygregor)
Keywords: regression
Moving back to koi+ per comment 43.
blocking-b2g: koi? → koi+
This is a patch against b2g26_v1_2 that fixes ContactService.jsm to use a Map instead of just a Javascript object.  With this patch I can scroll completely through the contact list when opening from messages using reference-workload-x-heavy.

I'll make a separate patch for m-c since there have been enough changes that this doesn't cleanly apply there.
Attachment #8345406 - Flags: review?(reuben.bmo)
Attachment #8345406 - Attachment is patch: true
Here is the patch adapted for mozilla-central and mozilla-aurora.

Try build:

  https://tbpl.mozilla.org/?tree=Try&rev=913287b13f3c
Attachment #8345412 - Flags: review?(reuben.bmo)
Comment on attachment 8345406 [details] [diff] [review]
b2g26 patch:  Use a Map to associate cursors with message managers.

Review of attachment 8345406 [details] [diff] [review]:
-----------------------------------------------------------------

Oof, good thing we finally caught this :(
Looks good; should land with the tests. Thanks for diagnosing and fixing!
Attachment #8345406 - Flags: review?(reuben.bmo) → review+
Attachment #8345412 - Flags: review?(reuben.bmo) → review+
Update m-c/m-a patch:

1) Fix delete _cursors[mm] usage to _cursors.delete(mm) found by tests.
2) Update description to reflect two parts and r=reuben.
Attachment #8345412 - Attachment is obsolete: true
Attachment #8346097 - Flags: review+
Here is the test verifying the cleanup of the cursors on child shutdown.  This patch also doesn't apply cleanly on b2g26, so this is only for m-c/m-a.
Attachment #8346099 - Flags: review?(reuben.bmo)
Comment on attachment 8345406 [details] [diff] [review]
b2g26 patch:  Use a Map to associate cursors with message managers.

I need to re-roll the b2g26 patches, but won't get to it until later this evening.  Marking this obsolete for now.
Attachment #8345406 - Attachment is obsolete: true
Update b2g26 patch to match updated m-c patch.
Attachment #8346106 - Flags: review+
Here is the b2g26 patch for the tests.  Essentially the same as the other test patch, but marking this for review as well.
Attachment #8346108 - Flags: review?(reuben.bmo)
Comment on attachment 8346099 [details] [diff] [review]
m-c/m-a patch Part 2: Test contacts shutdown of cursors.

Review of attachment 8346099 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! r=me

::: dom/contacts/tests/chrome.ini
@@ +1,3 @@
>  [DEFAULT]
>  
> +[test_contacts_shutdown.xul]

We disabled Android tests for now, but you should annotate this to disable the test on Android since it won't ever work there anyway.
Attachment #8346099 - Flags: review?(reuben.bmo) → review+
Component: Gaia::Contacts → DOM: Contacts
Product: Firefox OS → Core
Version: unspecified → 26 Branch
(In reply to Reuben Morais [:reuben] from comment #53)
> We disabled Android tests for now, but you should annotate this to disable
> the test on Android since it won't ever work there anyway.

Hmm, I'm not quite sure what you're asking me to do here.  Do you want me to add test_contacts_shutdown.xul to testing/android.json exclude list?
Flags: needinfo?(reuben.bmo)
Oh, I see.  The test_contact_upgrade.xul explicitly checks the UA for android and does nothing in that case.  I can do the same here.
Flags: needinfo?(reuben.bmo)
Updated tests to ignore android.  I also improved the assert messages and added code to restore the real ContactDB after the test completes.

Very similar to the previous test patch, but asking for re-review just to make sure I understood the android request.
Attachment #8346099 - Attachment is obsolete: true
Attachment #8346540 - Flags: review?(reuben.bmo)
Attachment #8346540 - Attachment is patch: true
Sanity check try build for the mochitests:

  https://tbpl.mozilla.org/?tree=Try&rev=496cb001344f
Updated patch for the tests on b2g26.
Attachment #8346108 - Attachment is obsolete: true
Attachment #8346108 - Flags: review?(reuben.bmo)
Whiteboard: dogfood1.2 [c= p=3 s= u=1.2] → dogfood1.2 [c= p=4 s= u=1.2]
Comment on attachment 8346540 [details] [diff] [review]
m-c/m-a patch Part 2: Test contacts shutdown of cursors.

Review of attachment 8346540 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with that nit fixed.

::: dom/contacts/tests/chrome.ini
@@ +1,3 @@
>  [DEFAULT]
>  
> +[test_contacts_shutdown.xul]

Sorry, I should have been clearer. The mochitest .ini files allow you to do something like this:

[test_contacts_shutdown.xul]
skip-if = os == "android"

The plain mochitest do this manually because only a small subset of the tests fail there. I'll fix test_contacts_upgrade.xul.
Attachment #8346540 - Flags: review?(reuben.bmo) → review+
Updated test patch using ini annotation to avoid running on android.  I also updated the description to show r=reuben.

Another sanity check try:

  https://tbpl.mozilla.org/?tree=Try&rev=5273326c9c59
Attachment #8346540 - Attachment is obsolete: true
Attachment #8346599 - Flags: review+
Update description on b2g26 test patch to show r=reuben.  Note, I left the android check in the b2g26 patch since that branch does not use ini files to configure the tests.
Attachment #8346604 - Flags: review+
Attachment #8346547 - Attachment is obsolete: true
I see green for android and non-android in try, so adding checkin-needed.

Sheriffs, these two patches should apply to m-c/m-a:

  Attachment 8346097 [details] [diff]
  Attachment 8346599 [details] [diff]

These two patches should apply to b2g26_v1_2:

  Attachment 8346106 [details] [diff]
  Attachment 8346604 [details] [diff]
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/6dc8607d6dd4
https://hg.mozilla.org/mozilla-central/rev/8ce2200b4bf6
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: