Closed Bug 998145 Opened 6 years ago Closed 6 years ago

[Tarako] Panning up and down on facebook feed causes facebook app to crash

Categories

(Tech Evangelism Graveyard :: Preinstalled B2G Apps, defect, P1, blocker)

ARM
Gonk (Firefox OS)
defect

Tracking

(b2g-v1.3T fixed)

RESOLVED FIXED
Tracking Status
b2g-v1.3T --- fixed

People

(Reporter: jhammink, Assigned: colby)

References

Details

(Keywords: memory-footprint, perf, Whiteboard: [c=memory p= s= u=tarako] [tarako-bug-bash-1.3T], LMK [NPOTB])

Attachments

(6 files, 1 obsolete file)

Build Information

Gaia      a8d2d399f2939f4845abaa0df57abab241a2c782
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/d97dad54cb61
BuildID   20140417004002
Version   28.1
ro.build.version.incremental=eng.cltbld.20140417.115200
ro.build.date=Thu Apr 17 11:52:06 EDT 2014

Description:
Swiping up and down on the Facebook feed will cause the facebook app to crash


Steps to Reproduce:
1.) Log into the facebook app
2.) Swipe up and down the facebook newsfeed ~5 times

Expected:
Panning is velvety smooth and buttery

Actual:
Panning is janky and crashes after ~ four times

1. 
2. 
3. 

Expected Results


Actual Results


Other Notes

Reproduction Frequency:
Summary: [Tarako] Panning up and down on facebook feed causes facebook app to carash → [Tarako] Panning up and down on facebook feed causes facebook app to crash
Nominating - FB is supposed to be core functionality.
blocking-b2g: --- → 1.3T?
ni? Harald. is it something you can help looking into? thanks
Flags: needinfo?(hkirschner)
It would be great to have a gecko profile already recorded when filing these bugs.

(In reply to John Hammink from comment #1)
> Created attachment 8408688 [details]
> facebook crashes (toward end of log)

I/Gecko   (   83): [Parent 83] WARNING: waitpid failed pid:2927 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
I/Gecko   (   83): [Parent 83] WARNING: Failed to deliver SIGKILL to 2927!(3).: file ../../../gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc, line 118

That seems to be the only crash I see and it isn't even an OOM (which I could debug). What causes the crash for FB?
Flags: needinfo?(hkirschner) → needinfo?(jhammink)
Keywords: footprint, perf
Whiteboard: [tarako-bug-bash-1.3T] → [tarako-bug-bash-1.3T], LMK, [c=memory p= s= u=]
ni? ting on IPC
Flags: needinfo?(tchou)
(In reply to Joe Cheng [:jcheng] from comment #5)
> ni? ting on IPC

Not quite sure what information is needed from me, if it is about the log messages at comment 4, that's checking whether the process is alive and make sure it is dead. The errno 3 means the pid 2927 does not exist.
Flags: needinfo?(tchou)
waiting for John to provide more information here.
Priority: -- → P2
Whiteboard: [tarako-bug-bash-1.3T], LMK, [c=memory p= s= u=] → [c=memory p= s= u=tarako] [tarako-bug-bash-1.3T], LMK
Harald, I'm still waiting for your help with profiling, so that I can give you the information you need.

In the meantime I can still reliably produce the app crash.  This is what I'm seeing in logcat when that happens:

I/Gecko   (   84): [Parent 84] WARNING: waitpid failed pid:2054 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
E/GeckoConsole( 2347): Content JS ERROR at app://homescreen.gaiamobile.org/gaia_build_defer_index.js:390 in loadSVConfFileError: Failed parsing singleVariant configuration file [js/singlevariantconf.json]:  [Exception... "File error: Not found"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: app://homescreen.gaiamobile.org/gaia_build_defer_index.js :: loadFile :: line 382"  data: no]
W/WifiHW  (   84): wifi_command: SIGNAL_POLL
D/wpa_supplicant(  523): nl80211: survey data missing!
I/Gecko   (   84): [Parent 84] WARNING: waitpid failed pid:2054 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
I/Gecko   (   84): [Parent 84] WARNING: Failed to deliver SIGKILL to 2054!(3).: file ../../../gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc, line 118
I/GonkMemoryPressure(   84): Checking to see if memory pressure is over.
I/GonkMemoryPressure(   84): Memory pressure is over.

I'm currently on 

Gaia      3771067de006633df690a590a97b4d28c44ef8b2
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/51a4e4d35e30
BuildID   20140423164005
Version   28.1
Flags: needinfo?(jhammink)
Flags: needinfo?(hkirschner)
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Boot_to_Gecko_%28with_a_real_device%29

I still have to get it working on Tarako. I'll probably test on other devices to see the bottlenecks.
Flags: needinfo?(hkirschner)
any luck on getting things to work on tarako? thanks
Flags: needinfo?(hkirschner)
Attached file profile_tarako.tar
Tar'ed files of the profile.
Flags: needinfo?(nhirata.bugzilla)
About memory reports:http://people.mozilla.org/~nhirata/aboutmem/about_mem_reports.zip

I'm not sure if I captured them at the right moment.  I captured 11.
The 4th memory report, the app got killed in the middle of the report.  The last report, the app got killed between the gcc report and the first part of the report.
Flags: needinfo?(nhirata.bugzilla) → needinfo?(fabrice)
I'll take a closer look tomorrow, but there are a few things that stand out:
- you're running with devtools enabled. Please use the "ADB Only" setting instead, that will save a few hundreds kb.
- the browser app is running also in the parent. Looks like we waste 2MB there...

Can you try again with that corrected?

The facebook app itself uses a lot of ... everything. unused-gc-things is really high (1.6MB)
2.3MB of script sources, a bunch of strings

njn, is there anything that looks particularly suspicious to you?
Flags: needinfo?(fabrice) → needinfo?(n.nethercote)
Looking at about-memory-0, it looks like there is a reasonable amount of memory free:

> ├──33.32 MB (33.69%) ── other

In the main process, there are numerous resource://gre/modules/commonjs/sdk modules, like so:

> │  ├───1 (02.70%) ── [System Principal], resource://gre/modules/commonjs/sdk/console/traceback.js (from: resource://gre/modules/commonjs/toolkit/loader.js:35)
> │  ├───1 (02.70%) ── [System Principal], resource://gre/modules/commonjs/sdk/core/heritage.js (from: resource://gre/modules/commonjs/toolkit/loader.js:35)
> │  ├───1 (02.70%) ── [System Principal], resource://gre/modules/commonjs/sdk/core/namespace.js (from: resource://gre/modules/commonjs/toolkit/loader.js:35)

I don't know what they're for, nor why they aren't merged into the system compartment.

There's also a possibly-leaked window, though it's tiny:

> │   └──0.01 MB (00.02%) -- top(none)/detached/window([system])

As for the Facebook app, nothing leaps out. It's the usual death by a thousand cuts :(
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #14)
> Looking at about-memory-0, it looks like there is a reasonable amount of
> memory free:
> 
> > ├──33.32 MB (33.69%) ── other
> 
> In the main process, there are numerous resource://gre/modules/commonjs/sdk
> modules, like so:

These are from the devtools for remote debugging. I asked Noaki to turn that off.


> As for the Facebook app, nothing leaps out. It's the usual death by a
> thousand cuts :(

Yeah :(. Even in about-memory-10 or 11? These are the ones just before OOMing.
triage: facebook is a must have app for tarako. 1.3T+
blocking-b2g: 1.3T? → 1.3T+
Component: Gaia → Performance
Ah nuts.  Let me reprofile.
Flags: needinfo?(nhirata.bugzilla)
Severity: normal → blocker
Priority: P2 → P1
> Yeah :(. Even in about-memory-10 or 11? These are the ones just before OOMing.

In about-memory-11, we have this in the Main process:

│   ├──5.32 MB (11.82%) -- content
│   │  ├──5.32 MB (11.82%) -- used
│   │  │  ├──4.54 MB (10.08%) ── uncompressed-nonheap
│   │  │  ├──0.64 MB (01.41%) ── raw
│   │  │  └──0.14 MB (00.32%) ── uncompressed-heap

Why does the main process use images?

I don't see anything else notable for the Facebook app in 10 and 11.
(In reply to Ting-Yu Chou [:ting] from comment #6)
> (In reply to Joe Cheng [:jcheng] from comment #5)
> > ni? ting on IPC
> 
> Not quite sure what information is needed from me, if it is about the log
> messages at comment 4, that's checking whether the process is alive and make
> sure it is dead. The errno 3 means the pid 2927 does not exist.

I think the process killed, and might caused by LMK. I tried to reproduce this bug yesterday, and I met 4/4 crash caused by LMK.
Hi! Thinker,

Any idea about Naoki's finding? Thanks

--
Keven

(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #21)
> While I was doing the profiling, I was monitoring using firewatch (
> https://github.com/digitarald/firewatch )  I noticed b2g take a huge upward
> spike after an app gets killed.  Should that be reported as a separate issue?
Flags: needinfo?(tlee)
I am not sure.. I could be for relaunching homescreen by sysapp, and bring a lot data back from zram.  Is the VSZ also changed?  Or just change USS/RSS?
Flags: needinfo?(tlee)
Ben volunteered to investigate this.
Assignee: nobody → bkelly
Duplicate of this bug: 1003458
Plan to focus on the following to start:

1) Why do we have images in parent process per comment 18?
2) Is the memory increase related to these images or something else?
Status: NEW → ASSIGNED
Attached file memory-reports
Here's an about:memory dump of facebook running on a tarako. Notable items are the strings, especially the huge one starting with :

string(length=80706, copies=1, "<div class="mHomeNewsfeed"><div id="u_0_s"><div class="_5v5d _5v
With a GC log you could figure out what is holding that big string alive, which may or may not be useful.
Can we pull those off the phone or need to generate them on desktop FF?
Flags: needinfo?(hkirschner) → needinfo?(continuation)
It is generated as part of the standard about-memory log grabbing script that is run on the phones.

Fabrice got some GC logs, but for some reason the huge string isn't showing up in the JS heap, which is very odd.  I assumed the notable strings thing would only look at stuff allocated in the heap, but maybe it can pull them from somewhere else, too?  There are some kinds of JS strings (external strings?) that are basically adopted from the browser.  I don't know if those are counted.
Flags: needinfo?(continuation)
(GC logs are the files that have names starting with gc-edges)
I did a diff between a memory reports taken before and after the parent process memory increases.  Some interesting parts:

1.22 MB (100.0%) -- explicit
├──1.58 MB (129.14%) ── heap-unclassified
└──-0.36 MB (-29.14%) ++ (11 tiny)

 5.63 MB ── resident
 4.15 MB ── resident-unique

The fact that we increased RSS and USS so much greater than explicit size suggests to me the parent process is paging in some code that child processes don't use.  Since heap-unclassified increased as well, perhaps DMD could tell us what this is.
The huge strings don't look that suspicious to me. There aren't that many, they aren't *so* huge, and there's only 1 copy of each one.

mccr8: external strings do get considered by the notable string detector. However, the chars of external strings are not counted by the JS reporter, because it doesn't own them. Therefore, the likelihood of an external string showing up in the notable strings is low -- there would have to be enough copies of it that the GC things alone make it cross the threshold of notability. Since there's only one copy of these, it's clear they're not external. (Unless I've gotten something wrong in my reasononing, but that's my understanding of how all that works.)
Hmm... actually, maybe its the child process blowing up.  I see this in logcat:

  [Kill]: lowmem_shrink send sigkill to 10317 (Facebook), adj 2, size 14219

The size is given in pages which corresponds to ~56MB.  The last b2g-info output I got from my sleep(1) loop was:

                            |     megabytes     |
           NAME   PID  PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
            b2g 10090 10026   51.8    0 13.3 16.2 21.9 131.5       0 root
         (Nuwa) 10143 10090    1.1    0  0.0  1.1  4.4  50.3       0 root
       Facebook 10317 10143   53.0    1 27.2 30.4 36.5  99.7       2 app_10317
(Preallocated a 10406 10143    2.5   18  0.4  2.4  7.1  57.2       1 root

Its going to be hard to get a representative memory report with get_about_memory.py if the RSS of the app jumps by 20MB within 1 second.
So I put this in a while(1) loop while scrolling through facebook:

  adb shell cat /sys/block/zram0/orig_data_size

And it reached 63225856 right before the OOM.

Given that our max zram disksize is set to 67108864, it seems quite possible we are running out of swap space.
Attached file Memory report after point of OOM (obsolete) —
I modified /system/xbin/zram.sh to setup 128MB of swap instead of 64MB.  This let me scroll Facebook much longer and seemingly past the point it would have OOMed.

Here is a memory report I took when about 75MB to 80MB of swap was in use.  Note that explicit is much larger than resident size here because so much of the app is in swap.

It seems that perhaps FB is leaking something or just growing its DOM without ever shrinking it:

60.39 MB (100.0%) -- explicit
├──32.65 MB (54.07%) -- window-objects/top(https://m.facebook.com/, id=1)
│  ├──24.09 MB (39.90%) -- active
│  │  ├──22.79 MB (37.74%) -- window(https://m.facebook.com/)
│  │  │  ├──11.10 MB (18.39%) -- dom
│  │  │  │  ├───4.97 MB (08.23%) ── orphan-nodes
│  │  │  │  ├───3.39 MB (05.61%) ── event-targets [2]
│  │  │  │  ├───2.39 MB (03.95%) ── element-nodes
│  │  │  │  └───0.36 MB (00.60%) ++ (2 tiny)
│  │  │  ├───7.63 MB (12.63%) ++ js-compartment(https://m.facebook.com/)
Unfortunately FB got OOM'd while taking the memory report, so the GC/CC logs are missing for that process.
Here's a better memory report including GC/CC logs.

I see about 60 XMLHttpRequest objects in the gc log for facebook.  Using findroots, they all seem to be getting held onto through this bit of stack:

    --[define]--> 0x4305dfe0 [Function l]
    --[script]--> 0x43031538 [script https://fbstatic-a.akamaihd.net/rsrc.php/v2]
    --[enclosing]--> 0x4305df80 [Function]
    --[script]--> 0x43031358 [script https://fbstatic-a.akamaihd.net/rsrc.php/v2]
    --[objects[1]]--> 0x4305dfc0 [Function k]
    --[fun_callscope]--> 0x430610d0 [Call <no private>]
    --[c]--> 0x43062b20 [Object <no private>]
    --[InitMScrollArea]--> 0x45826ae0 [Object <no private>]
    --[exports]--> 0x44c29940 [Object <no private>]
    --[main]--> 0x44f175e0 [Function p]
    --[fun_callscope]--> 0x458210b0 [Call <no private>]
    --[m]--> 0x44cecb00 [Array <no private>]
    --[objectElements[0]]--> 0x464b27c0 [Object <no private>]
    --[scrollArea]--> 0x44c0c200 [Object <no private>]
    --[type]--> 0x44fe97a0 [type_object]
    --[type_proto]--> 0x44ff60a0 [Object <no private>]
    --[getScrollHeight]--> 0x44385be0 [Function t.prototype.getScrollHeight]
    --[fun_callscope]--> 0x44c16dc0 [Call <no private>]
    --[k]--> 0x43061a60 [Object <no private>]
    --[_handlers]--> 0x430d9380 [Array <no private>]
    --[objectElements[2818]]--> 0x4642c440 [Object <no private>]
    --[_callback]--> 0x4642c420 [Function]
    --[**UNKNOWN SLOT 0**]--> 0x4589a160 [Object <no private>]
    --[transport]--> 0x45876100 [XMLHttpRequest <no private>]
Attachment #8420225 - Attachment is obsolete: true
Unfortunately it seems this script URL is truncated:

  --[script]--> 0x43031538 [script https://fbstatic-a.akamaihd.net/rsrc.php/v2]

Trying to find the code that generates it so I can expand the string out.
I believe this script might be the cause for the XHR leak.  On this particular run it was loaded from the CDN at:

  https://fbstatic-a.akamaihd.net/rsrc.php/v2/y5/r/updszHuqpNP.js

It creates the XHR and stores it in 'transport' like this on line 4265:

    if (!this.transport) this.transport = this.getCORS() ? r()  : s();

However, the cleanup code uses '_transport' instead on line 4443:

    if (this._transport && this._transport.readyState != 4) {
      u.warn('abort', this.__id__, this.getURI());
      this._transport.abort();
    }

And clears the reference to '_transport' on line 4350:

    this._transport = null;
Jason, I'm moving this to Tech Evangelism for now since it appears to be an issue with the app.  Its possible there is still a platform bug somewhere, but we should fix this first if we can.  Right now FB is clearly just growing until zram swap is exhausted as you continue to scroll.

Is there anyone at FB we can contact about this?

If we still have an issue after getting this fixed, let me know and I will look at the problem again.  For now, though, I'm going to unassign myself.
Assignee: bkelly → nobody
Status: ASSIGNED → NEW
Component: Performance → Preinstalled B2G Apps
Flags: needinfo?(jsmith)
Product: Firefox OS → Tech Evangelism
Okay.

Unblocking this for being firmly proven to be an app issue.

Nicole - Can you send this over to the FB dev team?
Blocks: b2g-facebook
blocking-b2g: 1.3T+ → ---
Flags: needinfo?(jsmith) → needinfo?(nfong)
Whiteboard: [c=memory p= s= u=tarako] [tarako-bug-bash-1.3T], LMK → [c=memory p= s= u=tarako] [tarako-bug-bash-1.3T], LMK [NPOTB]
Colby - please review and provide a potential fix date.  Thank you!
Assignee: nobody → colby
Flags: needinfo?(colby)
Colby, Ben found some interesting piece in comment 40. Seems like a variable reference is not cleaned up, which could cause some memory leakage.
Flags: needinfo?(nfong)
Wow awesome catch. Thanks a lot! Fix will be out on Tuesday.
Flags: needinfo?(colby)
The fix should be live as of yesterday.
Need to verify
Flags: needinfo?(nhirata.bugzilla)
Keywords: qawanted
Panning seems to be a lot better.  It doesn't just get killed anymore.

Gaia      94b7ca3030b4b924f0ffe97910bb383fef9e160a
Gecko     https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/20ff18dbee48
BuildID   20140529014005
Version   28.1
ro.build.version.incremental=eng.cltbld.20140529.053252
ro.build.date=Thu May 29 05:33:02 EDT 2014
Tarako
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(nhirata.bugzilla)
Resolution: --- → FIXED
Product: Tech Evangelism → Tech Evangelism Graveyard
You need to log in before you can comment on or make changes to this bug.