[Tarako] memory usage of b2g main process increase after 24 hours monkey test

RESOLVED FIXED in Firefox 32

Status

()

defect
--
major
RESOLVED FIXED
5 years ago
5 months ago

People

(Reporter: seinlin, Assigned: ting)

Tracking

unspecified
mozilla33
ARM
Gonk (Firefox OS)
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:1.3T+, firefox31 wontfix, firefox32 fixed, firefox33 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed, b2g-v2.1 fixed)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(12 attachments, 20 obsolete attachments)

5.24 KB, patch
Details | Diff | Splinter Review
46 bytes, text/x-github-pull-request
ting
: review+
Details | Review
6.57 KB, patch
ting
: review+
Details | Diff | Splinter Review
5.87 KB, patch
Details | Diff | Splinter Review
3.70 MB, application/gzip
Details
4.76 MB, application/gzip
Details
4.42 MB, application/gzip
Details
5.07 MB, application/gzip
Details
2.57 KB, patch
ting
: review+
Details | Diff | Splinter Review
4.31 MB, application/gzip
Details
4.28 MB, application/gzip
Details
46 bytes, text/x-github-pull-request
ting
: review+
Details | Review
After running monkey test for 24 hours, it can be observed that memory used by b2g main process increased significantly.

Below is the diff of about memory before and after test.

Main Process (pid NNN)
Explicit Allocations

24.11 MB (100.0%) -- explicit
├───8.49 MB (35.23%) ── heap-unclassified
├───8.38 MB (34.75%) -- js-non-window
│   ├──8.14 MB (33.75%) -- zones/zone(0xNNN)
│   │  ├──4.04 MB (16.75%) -- compartment([System Principal], inProcessTabChildGlobal?ownedBy=chrome://browser/content/shell.html)
│   │  │  ├──2.53 MB (10.47%) -- objects
│   │  │  │  ├──2.00 MB (08.28%) -- gc-heap
│   │  │  │  │  ├──1.05 MB (04.37%) ── ordinary
│   │  │  │  │  ├──0.66 MB (02.75%) ── dense-array [+]
│   │  │  │  │  ├──0.27 MB (01.12%) ── function
│   │  │  │  │  └──0.01 MB (00.04%) ── cross-compartment-wrapper [+]
│   │  │  │  └──0.53 MB (02.19%) ── malloc-heap/slots
│   │  │  ├──1.52 MB (06.30%) -- shapes
│   │  │  │  ├──0.77 MB (03.21%) -- malloc-heap
│   │  │  │  │  ├──0.40 MB (01.65%) ── dict-tables [+]
│   │  │  │  │  └──0.38 MB (01.56%) ++ (3 tiny)
│   │  │  │  └──0.75 MB (03.09%) -- gc-heap
│   │  │  │     ├──0.34 MB (01.41%) ── base
│   │  │  │     ├──0.30 MB (01.25%) ── tree/global-parented
│   │  │  │     └──0.11 MB (00.44%) ── dict [+]
│   │  │  └──-0.01 MB (-0.03%) ++ (3 tiny)
│   │  ├──3.57 MB (14.80%) ── unused-gc-things [3]
│   │  ├──0.42 MB (01.75%) ++ compartment([System Principal])
│   │  └──0.11 MB (00.45%) ++ (14 tiny)
│   └──0.24 MB (01.01%) ++ (2 tiny)
├───2.18 MB (09.04%) -- window-objects
│   ├──2.11 MB (08.74%) -- top(app://system.gaiamobile.org/index.html, id=3)
│   │  ├──2.17 MB (09.00%) -- js-zone(0xNNN)
│   │  │  ├──2.12 MB (08.80%) ── unused-gc-things [3]
│   │  │  └──0.05 MB (00.20%) ++ (4 tiny)
│   │  └──-0.06 MB (-0.26%) -- active
│   │     ├──-1.19 MB (-4.93%) ++ (3 tiny)
│   │     └───1.13 MB (04.67%) -- window(app://system.gaiamobile.org/index.html)
│   │         ├──0.65 MB (02.69%) -- js-compartment(app://system.gaiamobile.org/index.html)
│   │         │  ├──0.40 MB (01.65%) ++ (6 tiny)
│   │         │  └──0.25 MB (01.04%) ++ shapes
│   │         ├──0.25 MB (01.02%) ++ dom
│   │         └──0.23 MB (00.96%) ++ (3 tiny)
│   └──0.07 MB (00.30%) ++ top(chrome://browser/content/shell.html, id=1)
├───2.11 MB (08.77%) -- heap-overhead
│   ├──2.12 MB (08.80%) ── waste
│   └──-0.01 MB (-0.03%) ── page-cache
├───1.67 MB (06.94%) ── freetype
├───0.87 MB (03.61%) ── xpconnect
└───0.40 MB (01.66%) ++ (8 tiny)

Other Measurements

0.36 MB (100.0%) -- decommitted
├──0.35 MB (97.80%) ── js-non-window/gc-heap/decommitted-arenas
└──0.01 MB (02.20%) ── workers/workers()/worker(resource://gre/modules/ril_worker.js, 0xNNN)/gc-heap/decommitted-arenas

1,359 (100.0%) -- event-counts
└──1,359 (100.0%) -- window-objects
   ├──1,081 (79.54%) -- top(chrome://browser/content/shell.html, id=1)/active/window(chrome://browser/content/shell.html)/dom
   │  ├──1,078 (79.32%) ── event-targets
   │  └──────3 (00.22%) ── event-listeners
   └────278 (20.46%) -- top(app://system.gaiamobile.org/index.html, id=3)/active
        ├──351 (25.83%) -- window(app://system.gaiamobile.org/index.html)/dom
        │  ├──322 (23.69%) ── event-listeners
        │  └───29 (02.13%) ── event-targets
        └──-73 (-5.37%) ++ (3 tiny)

10.77 MB (100.0%) -- js-main-runtime
├───5.97 MB (55.42%) -- zones
│   ├──5.69 MB (52.85%) ── unused-gc-things
│   ├──0.24 MB (02.21%) -- strings
│   │  ├──0.21 MB (01.96%) -- normal
│   │  │  ├──0.17 MB (01.59%) ── malloc-heap
│   │  │  └──0.04 MB (00.37%) ── gc-heap
│   │  └──0.03 MB (00.25%) ++ (2 tiny)
│   └──0.04 MB (00.36%) ++ (4 tiny)
├───4.56 MB (42.33%) -- compartments
│   ├──2.71 MB (25.18%) -- objects
│   │  ├──2.09 MB (19.38%) -- gc-heap
│   │  │  ├──1.16 MB (10.77%) ── ordinary
│   │  │  ├──0.66 MB (06.14%) ── dense-array
│   │  │  ├──0.24 MB (02.23%) ── function
│   │  │  └──0.03 MB (00.25%) ── cross-compartment-wrapper
│   │  └──0.62 MB (05.80%) -- malloc-heap
│   │     ├──0.57 MB (05.28%) ── slots
│   │     └──0.06 MB (00.53%) ── elements/non-asm.js
│   ├──1.60 MB (14.91%) -- shapes
│   │  ├──0.92 MB (08.52%) -- malloc-heap
│   │  │  ├──0.44 MB (04.09%) ── dict-tables
│   │  │  ├──0.25 MB (02.35%) ── compartment-tables
│   │  │  ├──0.12 MB (01.13%) ── tree-shape-kids
│   │  │  └──0.10 MB (00.96%) ── tree-tables
│   │  └──0.69 MB (06.38%) -- gc-heap
│   │     ├──0.34 MB (03.12%) ── base
│   │     ├──0.24 MB (02.19%) -- tree
│   │     │  ├──0.24 MB (02.22%) ── global-parented
│   │     │  └──-0.00 MB (-0.03%) ── non-global-parented
│   │     └──0.12 MB (01.08%) ── dict
│   ├──0.22 MB (02.08%) -- baseline
│   │  ├──0.16 MB (01.48%) -- stubs
│   │  │  ├──0.15 MB (01.37%) ── fallback [+]
│   │  │  └──0.01 MB (00.11%) ── optimized
│   │  └──0.06 MB (00.60%) ── data [+]
│   └──0.02 MB (00.16%) ++ (5 tiny)
├───0.14 MB (01.31%) ── gc-heap/chunk-admin
└───0.10 MB (00.94%) ── runtime

0 (100.0%) ++ js-main-runtime-compartments

8.65 MB (100.0%) -- js-main-runtime-gc-heap-committed
├──5.69 MB (65.75%) ── unused/gc-things
└──2.96 MB (34.25%) -- used
   ├──2.78 MB (32.16%) ── gc-things
   ├──0.14 MB (01.63%) ── chunk-admin
   └──0.04 MB (00.46%) ── arena-admin

18 (100.0%) -- message-manager
└──18 (100.0%) -- referent
   ├──17 (94.44%) ── child-process-manager/strong
   └───1 (05.56%) ── global-manager/strong

7,309 (100.0%) -- observer-service
└──7,309 (100.0%) -- referent
   ├──4,077 (55.78%) -- weak
   │  ├──4,012 (54.89%) ── dead
   │  └─────65 (00.89%) ── alive
   └──3,232 (44.22%) ── strong

7,326 (100.0%) -- observer-service-suspect
├──3,862 (52.72%) ── referent(topic=ask-children-to-exit-fullscreen) [+]
├──1,091 (14.89%) ── referent(topic=disk-space-watcher) [+]
├──1,091 (14.89%) ── referent(topic=file-watcher-update) [+]
├──1,091 (14.89%) ── referent(topic=volume-state-changed) [+]
└────191 (02.61%) ── referent(topic=formsubmit) [+]

37 (100.0%) -- preference-service
└──37 (100.0%) ── referent/strong

-0.20 MB (100.0%) -- window-objects
├──-0.22 MB (105.41%) ── style-sheets
└───0.01 MB (-5.41%) ++ (4 tiny)

   13.20 MB ── heap-allocated
   15.21 MB ── heap-committed
      0.66% ── heap-overhead-ratio
          1 ── host-object-urls
 22,701,688 ── page-faults-hard
193,995,328 ── page-faults-soft
    4.90 MB ── resident
    8.32 MB ── resident-unique
   43.38 MB ── vsize

End of Main Process (pid NNN)
System
Other Measurements

0.00 MB (100.0%) -- mem
├──-9.38 MB (100.0%) ++ (2 tiny)
└──9.38 MB (100.0%) ── other

-4.54 MB (100.0%) -- processes
├──-4.41 MB (96.93%) -- shared-libraries
│  ├──-3.45 MB (75.89%) ── read-executable
│  └──-0.96 MB (21.04%) ── read-write
├──-1.73 MB (38.12%) ── other-files
└───1.59 MB (-35.05%) ++ (2 tiny)

73,504,356 (100.0%) -- zram-accesses
└──73,504,356 (100.0%) -- zram0
   ├──37,316,136 (50.77%) ── reads
   └──36,188,220 (49.23%) ── writes

10.12 MB (100.0%) -- zram-compr-data-size
└──10.12 MB (100.0%) ── zram0

0.00 MB (100.0%) -- zram-disksize
└──0.00 MB (100.0%) ++ zram0

End of System
Homescreen (pid NNN)
Explicit Allocations

-0.92 MB (100.0%) -- explicit
├──-0.90 MB (97.94%) -- heap-overhead
│  ├──-0.88 MB (95.82%) ── waste
│  └──-0.02 MB (02.13%) ── page-cache
└──-0.02 MB (02.06%) ++ (9 tiny)

Other Measurements

-0.83 MB (100.0%) -- decommitted
└──-0.83 MB (100.0%) ── js-non-window/gc-heap/decommitted-arenas

0 (100.0%) -- event-counts
└──0 (100.0%) ++ window-objects

-0.16 MB (100.0%) -- js-main-runtime
├──-0.16 MB (99.44%) -- zones
│  ├──-0.17 MB (102.59%) ── unused-gc-things
│  └───0.01 MB (-3.15%) ++ (3 tiny)
├──-0.02 MB (09.54%) ── gc-heap/chunk-admin
└───0.01 MB (-8.98%) ++ (2 tiny)

0 (100.0%) -- js-main-runtime-compartments
└──0 (100.0%) ++ user

-0.17 MB (100.0%) -- js-main-runtime-gc-heap-committed
├──-0.17 MB (97.77%) ── unused/gc-things
└──-0.00 MB (02.23%) -- used
   ├──-0.02 MB (09.09%) ── chunk-admin
   └───0.01 MB (-6.86%) ++ (2 tiny)

9 (100.0%) -- observer-service
└──9 (100.0%) -- referent
   ├──8 (88.89%) ── strong
   └──1 (11.11%) ── weak/alive

18 (100.0%) -- preference-service
└──18 (100.0%) ── referent/strong

0.05 MB (100.0%) -- window-objects
├──0.05 MB (102.69%) -- layout
│  ├──0.05 MB (87.29%) ── pres-shell
│  ├──0.01 MB (11.20%) ── rule-nodes
│  ├──0.00 MB (04.39%) ── style-contexts
│  └──-0.00 MB (-0.19%) ++ (3 tiny)
└──-0.00 MB (-2.69%) ++ (2 tiny)

 0.15 MB ── heap-allocated
-0.75 MB ── heap-committed
 -12.77% ── heap-overhead-ratio
   6,952 ── page-faults-hard
  41,255 ── page-faults-soft
-7.24 MB ── resident
-1.56 MB ── resident-unique
-2.02 MB ── vsize

End of Homescreen (pid NNN)
Blocks: 1005799
Can you please attach the full memory report from after the test?  Including the gc/cc logs.

This part looks interesting:

7,326 (100.0%) -- observer-service-suspect
├──3,862 (52.72%) ── referent(topic=ask-children-to-exit-fullscreen) [+]
├──1,091 (14.89%) ── referent(topic=disk-space-watcher) [+]
├──1,091 (14.89%) ── referent(topic=file-watcher-update) [+]
├──1,091 (14.89%) ── referent(topic=volume-state-changed) [+]
└────191 (02.61%) ── referent(topic=formsubmit) [+]

Can you point me to where these monkey tests are defined?
Flags: needinfo?(kli)
Whiteboard: [MemShrink]
Posted file about-memory (obsolete) —
Ben, the memory report with gc/cc log is attached. I am also checking the observers, but haven't sorted it out.
├──1,091 (14.89%) ── referent(topic=disk-space-watcher) [+]
├──1,091 (14.89%) ── referent(topic=file-watcher-update) [+]

Seems nsDOMDeviceStorage is leaked from Navigator::GetDeviceStorage(), but not sure how yet.
Ting-Yu has attached the about memory after 24 monkey test. This is memory report and gc-cc log after boot to homescreen.
Flags: needinfo?(kli)
├──1,091 (14.89%) ── referent(topic=disk-space-watcher) [+]
├──1,091 (14.89%) ── referent(topic=file-watcher-update) [+]
├──1,091 (14.89%) ── referent(topic=volume-state-changed) [+]

Those are from b2g/chrome/content/settings.js (http://goo.gl/w7cTlx), which is for bug 915974. Navigator.getDeviceStorages('sdcard') is called once following event is triggered:

  lockscreen.locked,
  lockscreen.enabled,
  devtools.debugger.remote-enabled,
  debugger.remote-mode

Navigator.getDeviceStorages() creates some nsDOMDeviceStorage instances and stores them in its mDeviceStorageStores array. From cc-edges.86.log of attachment 8419397 [details], the Navigator 0x4598e8f0 has 1086 nsDOMDeviceStorage store in the array. Need to measure their size, I am figuring how.
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent(Clock, pid=-1, closed channel, 0x43b24000, refcnt=1)
├──0 (100.0%) ── content-parent(Communications, pid=-1, closed channel, 0x42d41c00, refcnt=3)
├──0 (100.0%) ── content-parent(Communications, pid=-1, closed channel, 0x479dd000, refcnt=3)
├──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x42d25c00, refcnt=2)
├──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel, 0x42d3ec00, refcnt=2)
├──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel, 0x4342ec00, refcnt=2)
├──0 (100.0%) ── content-parent(Music, pid=-1, closed channel, 0x42d1f400, refcnt=1)
├──0 (100.0%) ── content-parent(Music, pid=-1, closed channel, 0x44a64c00, refcnt=2)
└──0 (100.0%) ── content-parent(Video, pid=-1, closed channel, 0x477de400, refcnt=1)

I will also look into this.
BTW, I am not sure how to reproduce this, I tried to run Browser several times but it stays around 0.2x MB:

│  │  │  ├───4.26 MB (07.72%) -- compartment([System Principal], inProcessTabChildGlobal?ownedBy=chrome://browser/content/shell.html)
(In reply to Ting-Yu Chou [:ting] from comment #7)
> BTW, I am not sure how to reproduce this, I tried to run Browser several
> times but it stays around 0.2x MB:
> 
> │  │  │  ├───4.26 MB (07.72%) -- compartment([System Principal],
> inProcessTabChildGlobal?ownedBy=chrome://browser/content/shell.html)

I think it needs to launch and kill App a lot of times. During 24 hours Monkey test, App was launched and killed almost thousand times.
I tried to lock/unlock screen manually until:

  ├──1,014 (33.36%) ── referent(topic=file-watcher-update)
  ├──1,014 (33.36%) ── referent(topic=volume-state-changed)
  └──1,012 (33.29%) ── referent(topic=disk-space-watcher)

but it does not take significant memory.
In attachment 8419397 [details] gc log I see about 90 instances of:

  gc-edges.86.log:0x46711f10 B script resource://gre/modules/BrowserElementParent.jsm:14

It appears these are getting held on to through this reference chain:

    --[ActivityWindowFactory]--> 0x46a2e0b0 [Object <no private>]
    --[_activities]--> 0x46a2e100 [Array <no private>]
    --[objectElements[6]]--> 0x46aaf2e0 [Object <no private>]
    --[activityCaller]--> 0x46aea220 [Object <no private>]
    --[iframe, element]--> 0x46af6250 [HTMLIFrameElement <no private>]
    --[setInputMethodActive]--> 0x46aea190 [Proxy <no private>]
    --[private]--> 0x46757920 [Function]
    --[script]--> 0x467051f0 [script resource://gre/modules/BrowserElementParent]

It seems the |_activities| array entry should be cleared here:

  https://github.com/mozilla-b2g/gaia/blob/f5b9126fb1b14066baf49775ffa2f54eee106be7/apps/system/js/activity_window_manager.js#L84

But perhaps we are never seeing the 'activityterminated' event under some circumstances?
And I guess I'm wondering if comment 10 might be contributing to those leaked content-parents in comment 6.
I wrote a utility to count objects with the same node label in order to try to pick out possible problems:

  https://github.com/amccreight/heapgraph/pull/17

From the CC graph I see:

143 FragmentOrElement (xhtml) div class='screenshot-overlay' app://system.gaiamobile.org/index.html
145 FragmentOrElement (xhtml) div class='fade-overlay' app://system.gaiamobile.org/index.html
317 FragmentOrElement (xhtml) div class='picker-unit' app://system.gaiamobile.org/index.html

From the GC graph I see:

85 string <length 39> app://sms.gaiamobile.org/index.html#new
90 script resource://gre/modules/BrowserElementParent.jsm:14
139 string <length 51> app://communications.gaiamobile.org/manifest.webapp
158 string <length 40> app://sms.gaiamobile.org/manifest.webapp
286 Function sb_updateWifi

Does the monkey test perform a pick activity initiated from SMS?
(In reply to Ben Kelly [:bkelly] from comment #12)
> 143 FragmentOrElement (xhtml) div class='screenshot-overlay'
> app://system.gaiamobile.org/index.html
> 145 FragmentOrElement (xhtml) div class='fade-overlay'
> app://system.gaiamobile.org/index.html
> 317 FragmentOrElement (xhtml) div class='picker-unit'
> app://system.gaiamobile.org/index.html

As far as I can tell these are being held alive by the _activities array mentioned in comment 10.
The activity terminates when
1. home button is pressed
2. activity closes by postResult/postError
3. activity is killed by OOM
4. activity caller is killed

I have the same question as comment 12
(In reply to Ben Kelly [:bkelly] from comment #12)
> 85 string <length 39> app://sms.gaiamobile.org/index.html#new
> 139 string <length 51> app://communications.gaiamobile.org/manifest.webapp
> 158 string <length 40> app://sms.gaiamobile.org/manifest.webapp

These are all held alive via the ActivitiesService.jsm here:

  http://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/file/61e4f8a91380/dom/activities/src/ActivitiesService.jsm#l303

So apparently more evidence of activities being setup, but never cleaned up.
Whiteboard: [MemShrink] → [MemShrink:P1]
(In reply to Ben Kelly [:bkelly] from comment #12)
> 286 Function sb_updateWifi

This appears to be an unrelated, probably minor, leak.  I wrote bug 1009928 to track this.
Kai-Zhen, can you point us toward the source of the monkey tests being run here?
Flags: needinfo?(kli)
Ben, You can get it form google drive, share to mozilla people only because it is provided by partner.
http://goo.gl/hnnYDv

You can start the test with this command
$ ./_monkey.sh test-config/6821-hudson-config test log

To change the test duration, please modify MTCFG_TEST_TIME_LIMIT in test-config/6821-hudson-config and  unit is hour.
Flags: needinfo?(kli)
Posted file about-memory-1.tar.gz (obsolete) —
dannyliang@dannyliang-ubuntu-pc:~/work/codes/b2g_tarako_v1.3t_github/tools$ adb shell b2g-info
                           |      megabytes     |
           NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER     
            b2g    84    1 47909.2    0 39.4 42.4 47.7 171.9       0 root     
         (Nuwa)   335   84   236.3    0  0.0  0.2  0.6  46.6       0 root     
     Homescreen 25726  335     5.8   18  7.6 11.0 16.8  59.7       8 app_25726
(Preallocated a 26209  335     2.2   18  4.0  6.5 11.3  54.6       1 root     

System memory info:

            Total 98.0 MB
     Used - cache 76.1 MB
  B2G procs (PSS) 60.1 MB
    Non-B2G procs 16.0 MB
     Free + cache 22.0 MB
             Free  2.2 MB
            Cache 19.7 MB

Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  2048 KB
        1  3072 KB
        2  4096 KB
        6  7168 KB
        8 16384 KB
       10 18432 KB

dannyliang@dannyliang-ubuntu-pc:~/work/codes/b2g_tarako_v1.3t_github/tools$ adb shell cat /proc/84/status
Name:    b2g
State:    S (sleeping)
Tgid:    84
TracerPid:    0
Uid:    0    0    0    0
Gid:    0    0    0    0
FDSize:    256
Groups:    
VmPeak:      203076 kB
VmSize:      173696 kB
VmLck:           0 kB
VmHWM:       62436 kB
VmRSS:       48876 kB
VmData:      122176 kB
VmStk:         136 kB
VmExe:         104 kB
VmLib:       33936 kB
VmPTE:         196 kB
VmSwap:       27380 kB
Threads:    55
SigQ:    0/782
SigPnd:    0000000000000000
ShdPnd:    0000000000000000
SigBlk:    0000000000000000
SigIgn:    0000000000001000
SigCgt:    000000038000ceef
CapInh:    0000000000000000
CapPrm:    ffffffffffffffff
CapEff:    ffffffffffffffff
CapBnd:    ffffffffffffffff
Cpus_allowed:    1
Cpus_allowed_list:    0
voluntary_ctxt_switches:    5723745
nonvoluntary_ctxt_switches:    24336169
Comment on attachment 8422237 [details]
about-memory-1.tar.gz

Attachment is the get about memory w/ enable DMD
I tried to find the leak today, but haven't found it. The leak ContentParent can be reproduced by:

1. Launch Messages app
2. Write new message
3. Add an attachment and select from Gallery
4. Press Home key, and launch another app to let Messages get killed
5. $ MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py -m
6. └──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel, 0x45ebe400, refcnt=1)
I'll try to reproduce locally today, but if someone who currently runs these tests could try flipping this pref to false:

  dom.ipc.reuse_parent_app

It seems possible this might be related to our recent changes to run activities within the same process.  Perhaps one of the cleanup paths is not quite right in that case.
Danny, Per Ben requested in comment 22. Could you set this perf to false when you running test? Thanks!
Flags: needinfo?(dliang)
(In reply to Ting-Yu Chou [:ting] from comment #21)
> 1. Launch Messages app
> 2. Write new message
> 3. Add an attachment and select from Gallery
> 4. Press Home key, and launch another app to let Messages get killed
> 5. $ MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py -m
> 6. └──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel,
> 0x45ebe400, refcnt=1)

Hmm, I'm having trouble reproducing with these steps.  Thats too bad as I was hoping to avoid running the long test. :-(
(In reply to Ting-Yu Chou [:ting] from comment #21)
> I tried to find the leak today, but haven't found it. The leak ContentParent
> can be reproduced by:
> 
> 1. Launch Messages app
> 2. Write new message
> 3. Add an attachment and select from Gallery
> 4. Press Home key, and launch another app to let Messages get killed
> 5. $ MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py -m
> 6. └──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel,
> 0x45ebe400, refcnt=1)

So I can reproduce this if I do GAIA_INSTALL_PARENT=/data/local with make reset-gaia.  Ting, Danny, do your devices have GAIA_INSTALL_PARENT=/data/local?
(In reply to Kai-Zhen Li from comment #18)
> Ben, You can get it form google drive, share to mozilla people only because
> it is provided by partner.
> http://goo.gl/hnnYDv
> 
> You can start the test with this command
> $ ./_monkey.sh test-config/6821-hudson-config test log
> 
> To change the test duration, please modify MTCFG_TEST_TIME_LIMIT in
> test-config/6821-hudson-config and  unit is hour.

I've run this twice now and can't reproduce the leak.  This is with the latest gecko/gaia with the default GAIA_INSTALL_PARENT of /system/b2g.
(In reply to Ben Kelly [:bkelly] from comment #25)
> So I can reproduce this if I do GAIA_INSTALL_PARENT=/data/local with make
> reset-gaia.  Ting, Danny, do your devices have
> GAIA_INSTALL_PARENT=/data/local?

No, mine is default /system/b2g, but I downloaded userdata.img and installed reference-workload, not sure if it makes any differences.
(In reply to Ben Kelly [:bkelly] from comment #22)
> I'll try to reproduce locally today, but if someone who currently runs these
> tests could try flipping this pref to false:
> 
>   dom.ipc.reuse_parent_app
> 
> It seems possible this might be related to our recent changes to run
> activities within the same process.  Perhaps one of the cleanup paths is not
> quite right in that case.

I have setup 4 devices to run 24hrs monkey test with below settings:
1. enable DMD
2. set dom.ipc.reuse_parent_app as false in b2g.js
Let's check the result tomorrow.
Flags: needinfo?(dliang)
(In reply to Ben Kelly [:bkelly] from comment #25)
> (In reply to Ting-Yu Chou [:ting] from comment #21)
> > I tried to find the leak today, but haven't found it. The leak ContentParent
> > can be reproduced by:
> > 
> > 1. Launch Messages app
> > 2. Write new message
> > 3. Add an attachment and select from Gallery
> > 4. Press Home key, and launch another app to let Messages get killed
> > 5. $ MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py -m
> > 6. └──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel,
> > 0x45ebe400, refcnt=1)
> 
> So I can reproduce this if I do GAIA_INSTALL_PARENT=/data/local with make
> reset-gaia.  Ting, Danny, do your devices have
> GAIA_INSTALL_PARENT=/data/local?

No, I used default folder which is /system/b2g in tarako.
(In reply to Danny Liang [:dliang] from comment #28)
> (In reply to Ben Kelly [:bkelly] from comment #22)
> > I'll try to reproduce locally today, but if someone who currently runs these
> > tests could try flipping this pref to false:
> > 
> >   dom.ipc.reuse_parent_app
> > 
> > It seems possible this might be related to our recent changes to run
> > activities within the same process.  Perhaps one of the cleanup paths is not
> > quite right in that case.
> 
> I have setup 4 devices to run 24hrs monkey test with below settings:
> 1. enable DMD
> 2. set dom.ipc.reuse_parent_app as false in b2g.js
> Let's check the result tomorrow.

Update the test result of 24hrs monkey test.
1. 1 device met b2g was killed by OOM, it's weird due to b2g was killed at 462s after boot. Now is still under checking.
2. There other 3 devices have still alive and the USS+Swap are lower than previous one.
   Device 1: 57.4MB
   Device 2: 53.8MB
   Device 3: 53.8MB
   Previous : 66.7MB
So the b2g leak might relate to activities chains.
Posted file about-memory-device1.tar.gz (obsolete) —
about memory of device 1 on 5/16
Posted file about-memory-device2.tar.gz (obsolete) —
about memory of device 2 on 5/16
Posted file about-memory-device3.tar.gz (obsolete) —
about memory of device 3 on 5/16
Still haven't found the leak, here's what I have fo far:

- Pressing Home button kills activities, and the activities[] of
  ActivityWindowFactory will be empty.

- The refcnt of HTMLIFrameElement (from BrowserFrame of ActivityWindow) is also
  not 0. I tried to use NS_LOG_ADDREF()/NS_LOG_RELEASE() with stack trace, but
  there're too many to find the imbalance.

- ├──3,862 (52.72%) ── referent(topic=ask-children-to-exit-fullscreen) looks
  interesting, as there's another observer for "oop-frameloader-crashed" beside
  it, but only it is kept.

- The reprodce rate of STR at comment 21 seems inconsistent, probably not a
  correct one.
(In reply to Ting-Yu Chou [:ting] from comment #34)
> - The refcnt of HTMLIFrameElement (from BrowserFrame of ActivityWindow) is also
>   not 0. I tried to use NS_LOG_ADDREF()/NS_LOG_RELEASE() with stack trace, but
>   there're too many to find the imbalance.

Correction:

The refcnt of HTMLIFrameElement (from BrowserFrame of ActivityWindow) *is* 0
The refcnt of activities chain's head HTMLIFrameElement (from BrowserFrame of WindowManager.createFrame()) *is not* 0
(In reply to Danny Liang [:dliang] from comment #30)
> 2. There other 3 devices have still alive and the USS+Swap are lower than
> previous one.
>    Device 1: 57.4MB
>    Device 2: 53.8MB
>    Device 3: 53.8MB
>    Previous : 66.7MB
> So the b2g leak might relate to activities chains.

I looked at the memory reports and we are still leaking content-parents.  The cc/gc logs continue to contain activity related stuff.

For example:

bkelly@ubuntu:/srv/tmp/bug1007520/about-memory-device1$ grep "class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#" *
cc-edges.83.log:0x4657c6a0 [rc=83] FragmentOrElement (xhtml) div id='activity-window-27' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x47209290 [rc=3] FragmentOrElement (xhtml) div id='activity-window-13' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x472941a0 [rc=3] FragmentOrElement (xhtml) div id='activity-window-14' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x4334c790 [rc=3] FragmentOrElement (xhtml) div id='activity-window-29' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x43763ba0 [rc=3] FragmentOrElement (xhtml) div id='activity-window-30' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x4429d560 [rc=3] FragmentOrElement (xhtml) div id='activity-window-31' class='appWindow activityWindow inline-activity' app://system.gaiamobile.org/index.html#

Plus a lot more.

I also see a couple that having closing styles applied, but are still sticking around for some reason:

bkelly@ubuntu:/srv/tmp/bug1007520/about-memory-device1$ grep "class='appWindow activityWindow inline-activity active slidedown closing' app://system.gaiamobile.org/index.html#" *
cc-edges.83.log:0x441016a0 [rc=8] FragmentOrElement (xhtml) div id='activity-window-12' class='appWindow activityWindow inline-activity active slidedown closing' app://system.gaiamobile.org/index.html#
cc-edges.83.log:0x44665100 [rc=5] FragmentOrElement (xhtml) div id='activity-window-28' class='appWindow activityWindow inline-activity active slidedown closing' app://system.gaiamobile.org/index.html#

Again it seems the ActivityWindow is being held alive through the _activities property on ActivityWindowFactory.

So I guess this is unrelated to the dom.ipc.reuse_parent_app pref setting.
(In reply to Ben Kelly [:bkelly] from comment #36)
> (In reply to Danny Liang [:dliang] from comment #30)
> > 2. There other 3 devices have still alive and the USS+Swap are lower than
> > previous one.
> >    Device 1: 57.4MB
> >    Device 2: 53.8MB
> >    Device 3: 53.8MB
> >    Previous : 66.7MB
> > So the b2g leak might relate to activities chains.
> 
> I looked at the memory reports and we are still leaking content-parents. 
> The cc/gc logs continue to contain activity related stuff.
> 
> For example:
> 
> bkelly@ubuntu:/srv/tmp/bug1007520/about-memory-device1$ grep
> "class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#" *
> cc-edges.83.log:0x4657c6a0 [rc=83] FragmentOrElement (xhtml) div
> id='activity-window-27' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x47209290 [rc=3] FragmentOrElement (xhtml) div
> id='activity-window-13' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x472941a0 [rc=3] FragmentOrElement (xhtml) div
> id='activity-window-14' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x4334c790 [rc=3] FragmentOrElement (xhtml) div
> id='activity-window-29' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x43763ba0 [rc=3] FragmentOrElement (xhtml) div
> id='activity-window-30' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x4429d560 [rc=3] FragmentOrElement (xhtml) div
> id='activity-window-31' class='appWindow activityWindow inline-activity'
> app://system.gaiamobile.org/index.html#
> 
> Plus a lot more.
> 
> I also see a couple that having closing styles applied, but are still
> sticking around for some reason:
> 
> bkelly@ubuntu:/srv/tmp/bug1007520/about-memory-device1$ grep
> "class='appWindow activityWindow inline-activity active slidedown closing'
> app://system.gaiamobile.org/index.html#" *
> cc-edges.83.log:0x441016a0 [rc=8] FragmentOrElement (xhtml) div
> id='activity-window-12' class='appWindow activityWindow inline-activity
> active slidedown closing' app://system.gaiamobile.org/index.html#
> cc-edges.83.log:0x44665100 [rc=5] FragmentOrElement (xhtml) div
> id='activity-window-28' class='appWindow activityWindow inline-activity
> active slidedown closing' app://system.gaiamobile.org/index.html#

OK, this means: the activity window is closing but never closed since animationend event for some reason is not fired so it is not removed. https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/activity_window.js#L118

On master there's a timer to ensure the closed state is entered but for v1.3t there is not.

I am not sure why there's no animationend event though.

> 
> Again it seems the ActivityWindow is being held alive through the
> _activities property on ActivityWindowFactory.
> 
> So I guess this is unrelated to the dom.ipc.reuse_parent_app pref setting.
I can kind of reprodcue the issue now, note I am not sure is this the only case causes leak.

1. Launch Contacts, select a contact
2. Press message button to write a new message
3. Select Gallery to pick an attachment, there'll be two processes: Communications, and Gallery
4. LMK kills Communications
   a. mozbrowsererror:
      Contacts.AppWindow.kill()
        Contacts.activityCallee.kill() (Messages.ActivityWindow.kill())
          Messages.ActivityWindow.close()
            Messages.ActivityWindow.restoreCaller()
              Messages.activityCaller.setVisible(true) (Contacts.setVisible(true))
        WindowManager.setDisplayedApp(HomescreenLauncher.origin)
          Homescreen.willrender
        Contacts.publish('terminated')
   b. mozbrowsererror:
      Messages.ActivityWindow.kill()
   c. animationend (slidedown):
      Messages.ActivityWindow.onClose()
        activityterminated
          ActivityWindowFactory._activities.splice(Messages)
        Messages.activityCallee.kill() (Gallery.ActivityWindow.kill())
          Gallery.ActivityWindow.close() -------------------------------------------------- [1]
            activitywillclose
              ActivityWindowFactory._activeActivity = Gallery.activityCaller = Messages --- [2]
        Contacts.frame.removeChild(Messages.HTMLDivElement)
   d. homescreenopened
        WindowManager.removeFrame(Contacts)
5. $ MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py -m --no-gc-cc-log
   ├──0 (100.0%) ── content-parent(Communications, pid=-1, closed channel, 0x44fbec00, refcnt=2)
   ├──0 (100.0%) ── content-parent(Gallery, pid=-1, closed channel, 0x46391800, refcnt=1)  

The status of ActivityWindowFactory is incorrect:

[1] _lastActivity = Gallery, the corresponding animationend event is not received, and it will be kept in ActivityWindowFactory._activities[].
[2] _activeActivity = Messages, it is set even Messages._transitonstate is closed, and will be the container element of later created ActivityWindow.

I am still checking followings:

- Why animationend event is not passed to Gallery.ActivityWindow?
- Why mozbrowsererror event is not passed to Gallery.ActivityWindow?
So I tried looking at the gc/cc logs again for frame related objects.  I see some nsFrameMessageManager objects being held by Activities.callers.  This is coming from the gecko ActivitiesService.jsm.

I wonder if this conditional is failing to match:

  http://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/file/7296746ab908/dom/activities/src/ActivitiesService.jsm#l355

Perhaps the OOM sometimes occurs before we can set childMM and therefore we never delete the entry from the callers array?

Ting, I'm going to leave this to you as it appears your actively working it and seem to be on the right track.
Or does the code linked in comment 39 just not work correctly for nested activities?  Perhaps it should not break out of the loop after the first match...
So I instrumented ActivitiesService.receiveMessage() and I don't see us getting "child-process-shutdown".

Instead we are receiving "PostError" which then seems to throw from its sendAsyncMessage() call here:

         caller.mm.sendAsyncMessage("Activity:FireError", msg);
         delete this.callers[msg.id];

The error in question is:

E/GeckoConsole( 1812): [JavaScript Error: "NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIMessageSender.sendAsyncMessage]" {file: "resource://gre/modules/ActivitiesService.jsm" line: 15}]

So thats one way we leak callers.  I added a try/finally there, but it seems there is more going on, though.

I am also seeing these in the log when an activity is killed:

[JavaScript Error: "msg is null" {file: "resource://gre/modules/ActivitiesService.jsm" line: 15}]

Due to the minification of the file I'm not sure where that is coming from.  Any convenient way to turn that off?

I have to leave for the day, but it seems this would be a good part of the code to investigate further.

Also, if others are looking at this:  Its really easy to see the callers leak in the gc-edges file.  Just open it in vim and search for callers.  When you find the property named callers search by its address and you will see all the items contained in the hash.  You can match these up to the msg.id in ActivitiesService.receiveMessage().
(In reply to Ben Kelly [:bkelly] from comment #42)
> So I instrumented ActivitiesService.receiveMessage() and I don't see us
> getting "child-process-shutdown".
> 
> Instead we are receiving "PostError" which then seems to throw from its
> sendAsyncMessage() call here:
> 
>          caller.mm.sendAsyncMessage("Activity:FireError", msg);
>          delete this.callers[msg.id];
> 
> The error in question is:
> 
> E/GeckoConsole( 1812): [JavaScript Error: "NS_ERROR_NOT_INITIALIZED:
> Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED)
> [nsIMessageSender.sendAsyncMessage]" {file:
> "resource://gre/modules/ActivitiesService.jsm" line: 15}]
> 
> So thats one way we leak callers.  I added a try/finally there, but it seems
> there is more going on, though.
> 
> I am also seeing these in the log when an activity is killed:
> 
> [JavaScript Error: "msg is null" {file:
> "resource://gre/modules/ActivitiesService.jsm" line: 15}]

This was a bug in my debug code.  Ignore it.

After fixing that I could see that the child-process-shutdown case was also hitting the error when calling sendAsyncMessage.

I patched my code locally to use try/finally throughout to ensure callers was cleaned up.  The nsFrameMessageManager is still leaking, though.  It seems perhaps the problem is further up the stack in the nsFrameMessageManager and could be causing mCallback to be nulled out prematurely (and thus causing the uninitialized error above).

> 
> Due to the minification of the file I'm not sure where that is coming from. 
> Any convenient way to turn that off?
> 
> I have to leave for the day, but it seems this would be a good part of the
> code to investigate further.
> 
> Also, if others are looking at this:  Its really easy to see the callers
> leak in the gc-edges file.  Just open it in vim and search for callers. 
> When you find the property named callers search by its address and you will
> see all the items contained in the hash.  You can match these up to the
> msg.id in ActivitiesService.receiveMessage().
(In reply to Ben Kelly [:bkelly] from comment #42)
> Due to the minification of the file I'm not sure where that is coming from. 
> Any convenient way to turn that off?

My way:

apply the patch and rebuild/flash gecko

ffos@ffos2:~/yingxu/6821/device/sprd/sp6821a_gonk$ git diff .
diff --git a/sp6821a_gonk/vendorsetup.sh b/sp6821a_gonk/vendorsetup.sh
index 9c3a7dd..1bbe679 100755
--- a/sp6821a_gonk/vendorsetup.sh
+++ b/sp6821a_gonk/vendorsetup.sh
@@ -16,9 +16,3 @@
 
 add_lunch_combo sp6821a_gonk-userdebug
 
-if [ -z "$JS_BINARY" -a -d "$PWD/device/sprd/common/tools/jsshell/" ]; then
-  JSSHELL_DIR=$PWD/device/sprd/common/tools/jsshell/$(uname -s | tr "[[:upper:]]" "[[:lower:]]")-$(uname -p | tr "[[:u
-  if [ -d $JSSHELL_DIR ]; then
-    export JS_BINARY=$JSSHELL_DIR/js
-  fi
-fi
(In reply to Ting-Yu Chou [:ting] from comment #38)
> - Why animationend event is not passed to Gallery.ActivityWindow?
nsAnimationManager::BuildAnimations() is not even called for Gallery's "slidedown", I guess it's because the Messages' DIV is removed.
> - Why mozbrowsererror event is not passed to Gallery.ActivityWindow?
- Why mozbrowsererror event is not passed to Gallery.ActivityWindow?
mozbrowsererror is notified from TabParent::ActorDestroy() if it called with a non-null nsFrameLoader and destroy reason AbnormalShutdown. But when Messages is removed from its container element at 4c, TabParent::ActorDestroy() will then be called with null mFrameElement and destroy reason Deletion.

According to above and comment 38, I created a WIP to resolve incorrect ActivityWindowFactory status:

- Set ActivityWindowFactory._activityCaller to closing activity's caller only if it is active.
- Invoke ActivityWindow._transitionHandler() by timeout instead of animationend event if the closing activity's caller is not active.

I haven't looked into the leak Ben found in comment 42.
Attachment #8425327 - Flags: feedback?(alive)
I didn't bring the device home, I will test with both WIPs included tomorrow.
Ting, this was the patch I came up with to avoid the callers leak in ActivityService.jsm.  I ended up removing the manager before attempting the send since it seemed a bit cleaner.  I don't know which way is preferable, but thought I would post it here.
Attachment #8425465 - Flags: feedback?(tchou)
(In reply to Ben Kelly [:bkelly] from comment #47)
> Ting, this was the patch I came up with to avoid the callers leak in
> ActivityService.jsm.  I ended up removing the manager before attempting the
> send since it seemed a bit cleaner.  I don't know which way is preferable,
> but thought I would post it here.

Thanks Ben, guess we can leave the decision to reviewer, I'll make sure the leak goes away at first.
Attachment #8425465 - Flags: feedback?(tchou)
Comment on attachment 8425327 [details] [diff] [review]
WIP - part1: fix ActivityWindowFactory status

Review of attachment 8425327 [details] [diff] [review]:
-----------------------------------------------------------------

This is dangerous..why don't we figure out there's no animationend event?
I observed the same thing in bug 1009621. And definitely the frame of Gallery is not enveloped in Camera.
Attachment #8425327 - Flags: feedback?(alive)
(In reply to Ting-Yu Chou [:ting] from comment #45)
> Created attachment 8425327 [details] [diff] [review]
> WIP - part1: fix ActivityWindowFactory status
> 
> (In reply to Ting-Yu Chou [:ting] from comment #38)
> > - Why animationend event is not passed to Gallery.ActivityWindow?
> nsAnimationManager::BuildAnimations() is not even called for Gallery's
> "slidedown", I guess it's because the Messages' DIV is removed.

In v1.3 the container div of each activity is standalone, what makes you think so?
We need to find out what's going on in gecko otherwise we will have more and more 1.3t issues like this.
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #50)
> In v1.3 the container div of each activity is standalone, what makes you
> think so?
> We need to find out what's going on in gecko otherwise we will have more and
> more 1.3t issues like this.

I have explained to Alive f2f about this, he will comment again later.
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #50)
> (In reply to Ting-Yu Chou [:ting] from comment #45)
> > Created attachment 8425327 [details] [diff] [review]
> > WIP - part1: fix ActivityWindowFactory status
> > 
> > (In reply to Ting-Yu Chou [:ting] from comment #38)
> > > - Why animationend event is not passed to Gallery.ActivityWindow?
> > nsAnimationManager::BuildAnimations() is not even called for Gallery's
> > "slidedown", I guess it's because the Messages' DIV is removed.
> 
> In v1.3 the container div of each activity is standalone, what makes you
> think so?
> We need to find out what's going on in gecko otherwise we will have more and
> more 1.3t issues like this.

I am wrong that the mechanism is also in v1.3 too. See bug 977934. Sorry.
But I need to think about this again...
And it's worthy to see if this happens in master or not.
With the patches attachment 8425327 [details] [diff] [review], and attachment 8425456 [details] [diff] [review] included, I couldn't find callers and _activities leaks from gc/cc edges. But ContentParent is still leak, which seems not related to activity:

  ├──0 (100.0%) ── content-parent(Calendar, pid=-1, closed channel, 0x4688b400, refcnt=1)

find_roots.py on cc edges for its nsFrameLoader address shows:

  0x4472a160 [JS Object (XPCWrappedNative_NoHelper)]
      --[js::GetObjectPrivate(obj)]--> 0x46811fd0 [XPCWrappedNative]
      --[mIdentity]--> 0x477cfc40 [nsFrameLoader]

      Root 0x4472a160 is a marked GC object.

I am checking what does this mean.
Comment on attachment 8425327 [details] [diff] [review]
WIP - part1: fix ActivityWindowFactory status

Review of attachment 8425327 [details] [diff] [review]:
-----------------------------------------------------------------

After thinking again I couldn't figure out obvious problem with the timer since master has the same approach.

Thank you!
Attachment #8425327 - Flags: review+
Flags: needinfo?(james.zhang)
(In reply to Ting-Yu Chou [:ting] from comment #53)
>   ├──0 (100.0%) ── content-parent(Calendar, pid=-1, closed channel, 0x4688b400, refcnt=1)

I can't reproduce this and ContentParent does not leak after removing my debug code.

Danny, could you please include the patches and test again?
Flags: needinfo?(dliang)
blocking-b2g: --- → 1.3T?
Flags: needinfo?(james.zhang)
v1.3t+ please.
Fabrice, Ben and I both created a patch for ActivitiesService.jsm to make sure the callers will be cleaned up (attachment 8425456 [details] [diff] [review] and attachment 8425465 [details] [diff] [review]). We are not sure which way is preferable, how do you think?
Flags: needinfo?(fabrice)
(In reply to Ting-Yu Chou [:ting] from comment #55)
> (In reply to Ting-Yu Chou [:ting] from comment #53)
> >   ├──0 (100.0%) ── content-parent(Calendar, pid=-1, closed channel, 0x4688b400, refcnt=1)
> 
> I can't reproduce this and ContentParent does not leak after removing my
> debug code.
> 
> Danny, could you please include the patches and test again?

I have setup monkey test included your patches, let's check the result tomorrow.
Flags: needinfo?(dliang)
blocking-b2g: 1.3T? → 1.3T+
(In reply to Ting-Yu Chou [:ting] from comment #57)
> Fabrice, Ben and I both created a patch for ActivitiesService.jsm to make
> sure the callers will be cleaned up (attachment 8425456 [details] [diff] [review]
> [review] and attachment 8425465 [details] [diff] [review]). We are not sure
> which way is preferable, how do you think?

I like how terse the code is with Ben's version, but I don't like too much that we don't catch the exceptions... 

Maybe we could do something like:

maybeSendMessage: function(aId, aMessage, aPayload) {
  try {
    aPayload.id = aId;
    this.callers[id].mm.sendAsyncMessage(aMessage, aPayload);
  } finally {
    delete this.callers[id];
  }
}

and use that function.
Flags: needinfo?(fabrice)
Updated patch per comment 59.
Attachment #8425456 - Attachment is obsolete: true
Attachment #8426814 - Flags: review?(fabrice)
Posted file about-memory-0522.tar.gz (obsolete) —
Attach get about memory on device 2, b2g-info as below, uss+swap reached 60MB
           NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER     
            b2g    85    1 25889.7    0 40.5 43.6 48.5 21.0 158.7       0 root     
         (Nuwa)   341   85    26.3    0  0.1  0.3  1.0  8.9  46.6       0 root     
     Homescreen 19371  341    10.4    1 10.4 13.8 18.9  4.7  62.2       2 app_19371
(Preallocated a 19710  341     1.4   18  0.1  1.9  5.5 10.2  53.6       1 root  

0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=19710, open channel, 0x4260fc00, refcnt=16)
└──0 (100.0%) ── content-parent(Homescreen, pid=19371, open channel, 0x4260e800, refcnt=36)

Looks no leak on content-parent.

Hi ting-yu, could you help to check attach about-memory log, thanks.
Flags: needinfo?(tchou)
Posted file about-memory-0522-m.tar.gz (obsolete) —
about memory on same device w/ cmd "-m" 
        NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER    
            b2g    85    1 25974.4    0 45.5 48.8 54.2 15.0 156.0       0 root    
         (Nuwa)   341   85    26.5    0  0.1  0.3  0.9  8.9  46.6       0 root    
     Homescreen  7154  341    10.1   18  7.4 11.0 16.8  6.5  60.7       8 app_7154
(Preallocated a 13625  341     1.5   18  1.0  3.3  7.9  9.1  53.6       1 root
I'll check followings from attachment 8426863 [details]:

0x46ac86a0 [rc=86] FragmentOrElement (xhtml) div class='anonymous-div' app://browser.gaiamobile.org/index.html#
0x46586c88 [rc=83] nsXPCWrappedJS (nsIMessageListener)
0x435450b0 [rc=9] FragmentOrElement (xhtml) div id='appframe75' class='appWindow' app://system.gaiamobile.org/index.html#
0x446d7ab0 [rc=9] FragmentOrElement (xhtml) div id='appframe43' class='appWindow render' app://system.gaiamobile.org/index.html#

Please be noted I am still learning debugging leaks by gc/cc edges, and not confident that I am on the right direction.
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #63)
> 0x46ac86a0 [rc=86] FragmentOrElement (xhtml) div class='anonymous-div'
> app://browser.gaiamobile.org/index.html#
This is a child node of url-input from browser, and is referenced by EditTxn and nsRange. I haven't figured out how does it work, couldn't determine are there leaks or not.

> 0x46586c88 [rc=83] nsXPCWrappedJS (nsIMessageListener) 0x435450b0
Keyboard.jsm adds itself as a message listener when nsFrameLoader is created, looks ok.

> [rc=9] FragmentOrElement (xhtml) div id='appframe75'
> class='appWindow' app://system.gaiamobile.org/index.html# 0x446d7ab0
Bluetooth app, in process tab.

> [rc=9] FragmentOrElement (xhtml) div id='appframe43'
> class='appWindow render' app://system.gaiamobile.org/index.html#
Browser app, in process tab.

BTW, from memory-reports, those two are quite large:

  ├───7.72 MB (14.27%) ── unused-gc-things
  ├───7.16 MB (13.23%) -- compartment([System Principal], inProcessTabChildGlobal?ownedBy=chrome://browser/content/shell.html)

Danny told me he usually sees ~4xMB USS+SWAP for b2g process, but comment 62 shows ~60MB. So I am checking attachment 8426863 [details] to see if there're other noticeable leaks. However, I still don't have a good sense to guess what may be a leak and am afraid that I overlooked.

Ben, would you mind to also take a look at gc/cc logs of attachment 8426863 [details] see if there's anything suspicious? Thank you.

# I will keep checking around.
Comment on attachment 8426814 [details] [diff] [review]
part2: fix to avoid leaking Activities.callers in ActivitiesService.jsm

Review of attachment 8426814 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/activities/src/ActivitiesService.jsm
@@ +281,5 @@
> +  maybeSendMessage: function activities_maybeSendMessage(aId, aName, aPayload) {
> +    try {
> +      this.callers[aId].mm.sendAsyncMessage(aName, aPayload);
> +    } finally {
> +      delete this.callers[aId];

Can we name this something like "maybeSendFinalMessage()" or something?  It may not be obvious that it deletes the caller with the current name.
How about "trySendAndCleanup()"?
(In reply to Ting-Yu Chou [:ting] from comment #64)
> Ben, would you mind to also take a look at gc/cc logs of attachment 8426863 [details]
> [details] see if there's anything suspicious? Thank you.

I still see the DeviceStorage leak.  It seems those only get cleaned up when navigator.invalidate() is called.  That may never happen in the parent process.

I think we need to remove old DeviceStorage objects when we create a new one with the same type string.

Kyle, does that sound like the correct way to prevent these from growing without bound here?  I assume we cannot use a weak reference because we need to call Shutdown() on the storage objects.
Lost my ni due to mid-air collision.  Kyle, please see question in comment 67 about how best to cleanup DeviceStorage objects on navigator.  Thanks!
Flags: needinfo?(khuey)
(In reply to Ting-Yu Chou [:ting] from comment #66)
> How about "trySendAndCleanup()"?

Works for me.  Thanks!
The device storage leak is bug 985197.  I'm not sure that we can do anything there without changing the API :(
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #70)
> The device storage leak is bug 985197.  I'm not sure that we can do anything
> there without changing the API :(

I guess what I don't understand is why we have to create a new DeviceStorage every time.  If content asks for storage for 'sdcard' multiple times, can't we return the same object each time?  Or if the old one is invalid, can't we clean it up?
I don't know either, but changing that is a change in the API behavior, which might break things, we means we need to audit the code using it, etc.
I see a ton of DOMWindowUtils objects:

  bkelly@lenir:/srv/tmp/bug1007520/about-memory-5$ egrep ' [BG] DOMWindowUtils' * | wc -l
             7677

Which seem to mostly be held via SpecialPowers stuff.  It seems SpecialPowers() does this and never removes the listener?

  addMessageListener("SPPingService", this._messageListener);

Are we calling SpecialPowers over and over from the test in the parent system window or something?
I don't see anything in the monkey test that would use marionette or specialpowers.

Is it possible you are running against an engineering build and the script is clicking on test apps?  (I'm not sure how those would trigger specialpowers in the parent process, though.)

I guess it would be interesting to try this on a user variant build where marionette is disabled, etc.
Comment on attachment 8426814 [details] [diff] [review]
part2: fix to avoid leaking Activities.callers in ActivitiesService.jsm

Review of attachment 8426814 [details] [diff] [review]:
-----------------------------------------------------------------

r=me Feel free to rename to trySendAndCleanup()
Attachment #8426814 - Flags: review?(fabrice) → review+
Renamed maybeSendMessage() to trySendAndCleanup().
Attachment #8426814 - Attachment is obsolete: true
Attachment #8427481 - Flags: review+
Updated to pull request.
Attachment #8425327 - Attachment is obsolete: true
Attachment #8427542 - Flags: review+
Fabrice, do you have any ideas how we could be pulling in SpecialPowers with this test?  As far as I can tell its completely orangutan driven which operates at the android device level.  This is a bit of a mystery to me, but you can see a ton of SpecialPowers objects being leaked in the gc/cc logs for attachment 8426863 [details].  My only thoughts were:

1) Maybe a test app is being run accidentally that is triggering marionette somehow.  It seems unlikely that a content app could cause the parent process to enter a test scenario, though.

2) There is some kind of local change to this build.

Kai-Zhen, do you know if there are any local modifications to the build you are using that could explain this?
Flags: needinfo?(kli)
Flags: needinfo?(fabrice)
(In reply to Ben Kelly [:bkelly] from comment #74)
> I don't see anything in the monkey test that would use marionette or
> specialpowers.
> 
> Is it possible you are running against an engineering build and the script
> is clicking on test apps?  (I'm not sure how those would trigger
> specialpowers in the parent process, though.)
> 
> I guess it would be interesting to try this on a user variant build where
> marionette is disabled, etc.

I have setup the test w/ user build, let's check the result after 24 hrs.
The parent of:

  0x49269960 [gc] JS Object (Function - SpecialPowers/<.get)

is:

  0x4361f1f0 [gc.marked] JS Object (ContentFrameMessageManager)
  > 0x46f178a0 SpecialPowersAPI
  > 0x46f25a40 SpecialPowers

So I searched ContentFrameMessageManager on DXR, and suspect it relates to nsInProcessTabChildGlobal.cpp as there're bluetooth windows which I couldn't reproduce:

  │  └──0.40 MB (00.74%) ++ window(app://bluetooth.gaiamobile.org/message.html)
  └──0.59 MB (01.09%) ++ window(app://bluetooth.gaiamobile.org/onpair.html)

And I just found one way to reproduce the bluetooth windows, and there will be SpecialPowers in cc-edges as well:

1. Turn on bluetooth and enable visible to all.
2. Use any other device to pair the phone.
3. Doing nothing when the pin code entering window shows up until screen is off.

The childWindow of PairManager (http://goo.gl/zvsQ6i) is not closed, but I haven't figured out how does SpecialPowers come in at step 3.
Nice!  I believe all that's needed is for specialpowers.js to be sourced in order for the SpecialPowers objects to appear in the logs.
Try results of attachment 8427481 [details] [diff] [review]: https://tbpl.mozilla.org/?tree=Try&rev=b0f583676379

Please let me know if I should include other tests.
Ben, there is no local modification, only root privilege is enabled even in user build. Before test is starting, system will be remount and setting App will be removed by _test_prepare.sh, because running test with setting App presents will result more strange issue.

Below is how root is enabled for user build.
$ ENABLE_ADB_ROOT=1 ./build.sh
Flags: needinfo?(kli)
Thanks Kai-Zhen!  Looks like Ting has figured out the source of the SpecialPowers.
Flags: needinfo?(fabrice)
1. MarionetteComponent receives profile-after-change
2. MarionetteComponent observes final-ui-startup
3. MarionetteComponent receives final-ui-startup and loads marionette-server.js
4. MarionetteServer loads SpecialPowerObserver.js and calls init()
5. SpecialPowerObserver observes chrome-document-global-created
6. SpecialPowerObserver receives chrome-document-global-created and loads specialpowers.js
7. SpecialPowersManager adds listener for event DOMWindowCreated
8. SpecialPowersManager receives DOMWindowCreated and SpecialPowers.attachSpecialPowersToWindow() is applied.

I am trying to understand is chrome-document-global-created received at step6 a proper one or not.
Actually attachSpecialPowersToWindow() is called with not only bluetooth, but also system, homescreen, callscreen, and keyboard. I will check around the listener Ben mentioned in comment 73.

BTW, Marionette is disabled in user build.
Ben, please let me know if I should ask someone else for reviewing. Thanks.
Attachment #8428624 - Flags: review?(bkelly)
Comment on attachment 8428624 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

Ben, after I check git blame of specialpowers.js, I guess I should ask Ted for reviewing.
Attachment #8428624 - Flags: review?(bkelly) → review?(ted)
Attachment #8428624 - Attachment is patch: true
From Danny's monkey test results, there're still leaks in _activities[] even attachment 8427542 [details] [review] is applied. I closed the PR, and am checking all possibile paths now.
Attachment #8428624 - Flags: review?(ted) → review+
Double checked, there're two paths publish |activityterminated| event, which removes element from _activities[]:

 a. acw_kill() is called on an active instance and when acw__transitionHandler()
    calls onClose() callback (http://goo.gl/LSjWaE),
 b. acw_kill() is called on an inactive instance

I adjusted the patch a little bit to set timeout always to guarantee onClose() gets called for (a).

Danny, please help retest the patch. Thanks.
Attachment #8427542 - Attachment is obsolete: true
Flags: needinfo?(dliang)
Component: General → DOM
Product: Firefox OS → Core
(In reply to Ting-Yu Chou [:ting] from comment #91)
> Try result of attachment 8428624 [details] [diff] [review]:
> https://tbpl.mozilla.org/?tree=Try&rev=a25cc72da444

Confirmed the failed cases of "Ubuntu VM 12.04 try opt test mochitest-other" are caused by the patch. I used following command to test locally:

  $ ./mach mochitest-chrome docshell/test/chrome/test_bug112564.xul 

I haven't checked the other failed cases.
(In reply to Ting-Yu Chou [:ting] from comment #85)
> 7. SpecialPowersManager adds listener for event DOMWindowCreated
> 8. SpecialPowersManager receives DOMWindowCreated and
> SpecialPowers.attachSpecialPowersToWindow() is applied.

DOMWindowCreated is notified from nsGlobalWindow.setDocument(),
(In reply to Ting-Yu Chou [:ting] from comment #93)
> > 7. SpecialPowersManager adds listener for event DOMWindowCreated
> > 8. SpecialPowersManager receives DOMWindowCreated and
> > SpecialPowers.attachSpecialPowersToWindow() is applied.
> 
> DOMWindowCreated is notified from nsGlobalWindow.setDocument(),

Sorry, I pressed enter accidentally.

DOMWindowCreated is notified from nsGlobalWindow.setDocument() when a new inner window is instantiated. SpecialPowers listens to the event and attach a new SpecialPowers instance to the window. I saw following documents triggers DOMWindowCreated when device powers on:

  about:blank (more than one time),
  app://system.gaiamobile.org/index.html,
  app://callscreen.gaiamobile.org/index.html,
  app://keyboard.gaiamobile.org/index.html#en  

I haven't figured out how does attachment 8428624 [details] [diff] [review] fail the tests, but I am also not sure is a patch necessary as marionette is disabled in user build...
(In reply to Ting-Yu Chou [:ting] from comment #94)
> I haven't figured out how does attachment 8428624 [details] [diff] [review]
> fail the tests, but I am also not sure is a patch necessary as marionette is
> disabled in user build...

Not sure how this partner feels, but in the past we have blocked on "test only" problems because if we ignore them we can end up hiding real problems.  So we probably need to fix this somehow.

What I still don't understand, though, is what is different about this test compared with our other MTBF and monkey tests that we have not run into the SpecialPowers leak before?

As far as this patch, maybe you want something like this instead of using the 'unload' event?

   	Services.obs.addObserver(this, 'inner-window-destroyed', false);
Assignee: nobody → tchou
Differnet SpecialPowers instances are added to the same nsInProcessTabChildGlobal.mMessageManager, and the nsInProcessTabChildGlobal instance isn't destroyed.
(In reply to Ting-Yu Chou [:ting] from comment #96)
> Differnet SpecialPowers instances are added to the same
> nsInProcessTabChildGlobal.mMessageManager, and the nsInProcessTabChildGlobal
> instance isn't destroyed.

From attachment 8426863 [details] cc-edges.85.log, there're 7690 listeners here:

  0x46e5e060 [rc=1] nsFrameMessageManager
  > 0x46e8dc88 listeners[i] mStrongListener
  ...
  > 0x4031af08 listeners[i] mStrongListener

Follow it up I see systemapp:

  0x46e7c1c0 [rc=4] nsDOMEventTargetHelper 
  > 0x46e5e060 mMessageManager
  0x46e7c160 [rc=3] nsFrameLoader
  > 0x46e7c1c0 mChildMessageManager
  0x46e7c0a0 [rc=18] FragmentOrElement (xhtml) iframe id='systemapp' chrome://browser/content/shell.html
  > 0x46e7c160 mFrameLoader
4 devices w/ user build run monkey test for 24 hours, the result as following
- all devices met bug 1016746
- 1 device met b2g restart
- 1 device whose USS+swap exceed 60MB
           NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER
            b2g    86    1 35866.4    0 50.7 54.3 58.2 12.5 162.5       0 root
         (Nuwa)   315   86    83.0    0  0.3  0.5  1.1  8.7  46.5       0 root
(Preallocated a 15686  315     2.1   18  6.0  9.5 13.5  4.1  54.5       4 ro
Flags: needinfo?(dliang)
Posted file about-memory-0529.tar.gz (obsolete) —
get about memory w/ USS+swap exceed 60MB
There are still leaked ContentParent, and leaks in _activities[] and callers{} from attachment 8430576 [details]. :(

I can't reproduce it by STR at comment 38.
There are 15 activityWindow in cc-edges.86.log of attachment 8430576 [details], and all are still active:

0x45fd5880 [rc=9] FragmentOrElement (xhtml) div id='activity-window-23' class='appWindow activityWindow inline-activity active' app://system.gaiamobile.org/index.html
0x4462efb0 [rc=9] FragmentOrElement (xhtml) div id='activity-window-24' class='appWindow activityWindow inline-activity active' app://system.gaiamobile.org/index.html
...
homescreen_window.js listens to also mozbrowserclose and mozbrowsererror, and stop the event propagation:

  https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/homescreen_window.js#L75

I am checking are there chances that the events are consumed by homescreen_window only.
It looks like related to bug 1016746.
Updated to pull request.
Attachment #8429191 - Attachment is obsolete: true
Attachment #8431331 - Flags: review+
(In reply to Ben Kelly [:bkelly] from comment #95)
> As far as this patch, maybe you want something like this instead of using
> the 'unload' event?
> 
>    	Services.obs.addObserver(this, 'inner-window-destroyed', false);

Thanks Ben! I updated the patch and Try looks good this time: https://tbpl.mozilla.org/?tree=Try&rev=fbb9fe26ba86.

But there are some error messages from the test log:

02:49:35     INFO -  System JS : ERROR chrome://specialpowers/content/specialpowers.js:32 - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMessageListenerManager.removeMessageListener]
02:49:35     INFO -  System JS : ERROR chrome://specialpowers/content/specialpowers.js:32 - NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMessageListenerManager.removeMessageListener]

I am still checking what goes wrong.
Attachment #8428624 - Attachment is obsolete: true
Updated commit message.
Attachment #8427481 - Attachment is obsolete: true
Attachment #8431343 - Flags: review+
(In reply to Ting-Yu Chou [:ting] from comment #105)
> 02:49:35     INFO -  System JS : ERROR
> chrome://specialpowers/content/specialpowers.js:32 - NS_ERROR_ILLEGAL_VALUE:
> Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE)
> [nsIMessageListenerManager.removeMessageListener]

This is because |inner-window-destroyed| comes after nsInProcessTabChildGlobal::DelayedDisconnect(), which its mMessageManager has been set null. Later nsInProcessTabChildGlobal::RemoveMessageListener() invocations will raise exception for null mMessageManager.

So I rewrote the patch to use addWeakMessageListener() instead. I ran mochitest locally and it looks fine, also by using |./tools/get_about_memory.py -m| I don't see leak SpecialPowers instances.

Try is still running: https://tbpl.mozilla.org/?tree=Try&rev=589d585d5083

Ted, please help review again. Thank you.
Attachment #8431342 - Attachment is obsolete: true
Attachment #8431472 - Flags: review?(ted)
Rebase to latest version.
Attachment #8431343 - Attachment is obsolete: true
Attachment #8431488 - Flags: review+
Comment on attachment 8431472 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

Try is not ok, checking now.
Attachment #8431472 - Flags: review?(ted)
Short summary:

- There are three test results with patches part1 and part2 included:
    5/22 comment 61: OK, no leak content parent
    5/27 comment 89: NG, leak content parent (about:memory was not attached)
    5/28 comment 98: NG, leak content parent

- There's actually no STR for this issue, I tried myself and found there's one
  way to reproduce the leak (comment 38), which part1 was created mainly based
  on it. But probably it is not the only path.

- Patch part3 has three different versions:
    attachment 8428624 [details] [diff] [review]: remove listener when |unload|, Try: NG
    attachment 8431342 [details] [diff] [review]: remove listener when |inner-window-destroyed|, Try: OK, but there are error messages (comment 105)
    attachment 8431472 [details] [diff] [review]: add listener by addWeakMessageListener(), Try: NG

- bug 1016746 runs into activity leak as well, but no STR, too.

- The test is Android monkey test (comment 18).
Not sure if this is the root cause or not, but here is a leak:

Looking at attachment 8430576 [details] from comment 98 I see this in the b2g CC log:

grep "FragmentOrElement (xhtml) div class='picker-unit' app://system.gaiamobile.org/index.html" cc-edges.86.log | wc -l
317

This appears to be a leak in the system app ValuePicker.  When you new ValuePicker(element) it adds divs to the given element in addPickerUnit().  The ValuePicker.uninit() never removes these divs.

This would not be a problem if the given element was thrown away, but it appears SpinDatePicker always passes the result of element.querySelector('.value-picker-year') to new ValuePicker().  This probably accumulates divs on a single element over time.

Alive, what do you think?
Flags: needinfo?(alive)
(In reply to Ben Kelly [:bkelly] from comment #112)
> grep "FragmentOrElement (xhtml) div class='picker-unit'
> app://system.gaiamobile.org/index.html" cc-edges.86.log | wc -l
> 317

I had checked this earlier, there are 200 years (1900~2099), 12 months and 31 days from spin-date-picker.js, at least 12 hours, 60 mins and [AM, PM] from value-selector.js.

200 + 12 + 31 + 12 + 60 + 2 = 317
(In reply to Ting-Yu Chou [:ting] from comment #114)
> (In reply to Ben Kelly [:bkelly] from comment #112)
> > grep "FragmentOrElement (xhtml) div class='picker-unit'
> > app://system.gaiamobile.org/index.html" cc-edges.86.log | wc -l
> > 317
> 
> I had checked this earlier, there are 200 years (1900~2099), 12 months and
> 31 days from spin-date-picker.js, at least 12 hours, 60 mins and [AM, PM]
> from value-selector.js.
> 
> 200 + 12 + 31 + 12 + 60 + 2 = 317

Oh, good point!
Flags: needinfo?(alive)
Try results: https://tbpl.mozilla.org/?tree=Try&rev=2d210c52d77a

Ted, I switch to monitor 'inner-window-destroyed' instead per Ben's comment 95, and use a try/catch to ignore the exception which when message manager has been destroyed. Please help review again, thank you.
Attachment #8431472 - Attachment is obsolete: true
Attachment #8433091 - Flags: review?(ted)
With below commit and part1~3 of patches, we still found the leak from content-parent on monkey test. Please note DUT1 and DUT2 met b2g restart due to killed by LMK.

DUT1: No leak from content-parent
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=12337, open channel, 0x48e51400, refcnt=16)
└──0 (100.0%) ── content-parent(Video, pid=12298, open channel, 0x474ea000, refcnt=157)
DUT2:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x444b2800, refcnt=1)
└──0 (100.0%) ── content-parent(Usage, pid=-1, closed channel, 0x43e8ec00, refcnt=1)
DUT3:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=14574, open channel, 0x46979000, refcnt=16)
├──0 (100.0%) ── content-parent(Clock, pid=13802, open channel, 0x46e15c00, refcnt=17)
└──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x4386d800, refcnt=1)
DUT4: No leak from content-parent
0 (100.0%) -- queued-ipc-messages
└──0 (100.0%) ── content-parent((Preallocated), pid=15919, open channel, 0x40b72c00, refcnt=16)
Attachment #8419397 - Attachment is obsolete: true
Attachment #8419491 - Attachment is obsolete: true
Attachment #8422237 - Attachment is obsolete: true
Attachment #8423725 - Attachment is obsolete: true
Attachment #8423726 - Attachment is obsolete: true
Attachment #8423727 - Attachment is obsolete: true
Attachment #8426847 - Attachment is obsolete: true
Attachment #8426863 - Attachment is obsolete: true
Attachment #8430576 - Attachment is obsolete: true
So looking at the latest report ...

A good chunk of heap-unclassified is coming from font related things.  Are we still loading Chinese fonts?
(Also related, http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUserFontSet.cpp#383 is probably far too big for b2g).

Other than that, I can't really debug anything else from these memory reports.  We'd need to sit down in gdb to figure out what is going on.
From b2g cc-edges of attachment 8433988 [details] and attachment 8433989 [details], _activites[] are both empty and there's no div with class "activityWindow".

But there are still things in callers{}, also I noticed both have an AppWindow instance with activityCallee.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #122)
> A good chunk of heap-unclassified is coming from font related things.  Are
> we still loading Chinese fonts?

Should be removed already. Danny, could you please confirm that?
Flags: needinfo?(dliang)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #122)
> So looking at the latest report ...
> 
> A good chunk of heap-unclassified is coming from font related things.  Are
> we still loading Chinese fonts?
No, the build just include font of English, Bengali, Devanagari and Tamil

> (Also related,
> http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUserFontSet.
> cpp#383 is probably far too big for b2g).
> 
> Other than that, I can't really debug anything else from these memory
> reports.  We'd need to sit down in gdb to figure out what is going on.
(In reply to Ting-Yu Chou [:ting] from comment #124)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #122)
> > A good chunk of heap-unclassified is coming from font related things.  Are
> > we still loading Chinese fonts?
> 
> Should be removed already. Danny, could you please confirm that?

Chinese fonts has been removed on normal build.
Flags: needinfo?(dliang)
In the log I looked at there are 7 iframes.

One is the iframe in shell.html that contains the system app.
One is the iframe inside the "attention screen".
One is the iframe leaked via bug 1020726.
The remaining four are browser iframes.  I have a browser0, browser58, browser379, and browser391.  These four iframes are in the DOM, so whatever is supposed to remove them is failing to do so.

There are 7 frame loaders in the GC graph and 7 in the CC graph, but only 5 of those in the CC graph are still connected to their <iframe>.  Frameloaders can entrain the ContentParent, so I investigated the two that did not appear in the CC graph and one was associated with the bug 1020726 iframe and other was associated with one of the browser iframes.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #127)
> other was associated with one of the browser iframes.

This one seems related to the activityCallee 0x47d0bb50 I noticed. The callee has frame appframe257, and iframe/element browser379. And from the iframe browser379 0x43735820 I can connect to a frame loader 0x46a37820. BTW, the AppWindow 0x4706c910 of the callee is referenced by _stack[] of StackManager.

I noticed also two strange things from the activityCallee:

1. Its element is browser379 not activity-window-N,
2. Its frame is appframe257, not the same to the element

Both does not match the code below, do I misunderstand anything?

  ActivityWindow.prototype.view = function acw_view() {
    this.instanceID = _id;
    return '<div class="appWindow activityWindow inline-activity' +
            '" id="activity-window-' + _id++ + '">' +
            '<div class="screenshot-overlay"></div>' +
            '<div class="fade-overlay"></div>' +
            '</div>';
  };

  ActivityWindow.prototype.render = function acw_render() {
    this.containerElement.insertAdjacentHTML('beforeend', this.view());
    ...
    this.browser = new BrowserFrame({
      ...
    });
    this.element =
      document.getElementById('activity-window-' + this.instanceID);
    this.element.insertBefore(this.browser.element, this.element.childNodes[0]);
    this.frame = this.element;
    this.iframe = this.browser.element;
(In reply to Ting-Yu Chou [:ting] from comment #128)
> 1. Its element is browser379 not activity-window-N,
> 2. Its frame is appframe257, not the same to the element
> 
> Both does not match the code below, do I misunderstand anything?

So I was looking at wrong pieice of code. The places make the things above:

1. https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/window.js#L236
2. https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/window_manager.js#L1069
There remains only one direction from activity caller 0x4706c910 to callee 0x47d0bb50, which the link is half broken. I checked around, and couldn't find code breaks the link like that. :(

A minor bug here: https://github.com/mozilla-b2g/gaia/blob/v1.3t/apps/system/js/window.js#L194
Comment on attachment 8433091 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

Review of attachment 8433091 [details] [diff] [review]:
-----------------------------------------------------------------

Wow, this is all horrible. Thanks for tracking these leaks down!

::: testing/specialpowers/content/specialpowers.js
@@ +32,5 @@
> +        // The message manager may have been destroyed.
> +        try {
> +          removeMessageListener("SPPingService", self._messageListener);
> +        } catch (e if e.result == Components.results.NS_ERROR_ILLEGAL_VALUE) {
> +          ;

I'd rather have a comment here than a bare semicolon, honestly.
Attachment #8433091 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #131)
> Review of attachment 8433091 [details] [diff] [review]:
> > +        // The message manager may have been destroyed.
> > +        try {
> > +          removeMessageListener("SPPingService", self._messageListener);
> > +        } catch (e if e.result == Components.results.NS_ERROR_ILLEGAL_VALUE) {
> > +          ;
> 
> I'd rather have a comment here than a bare semicolon, honestly.

I moved the comment to above the semicolon and updated it a bit, thanks.
Attachment #8433091 - Attachment is obsolete: true
Attachment #8434984 - Flags: review+
Still can find the leak of content-parent by monkey test last weekend.
DUT2:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=27379, open channel, 0x40c06400, refcnt=16)
├──0 (100.0%) ── content-parent(Browser, pid=-1, closed channel, 0x40c9e400, refcnt=1)
└──0 (100.0%) ── content-parent(Browser, pid=-1, closed channel, 0x40ca4000, refcnt=1)

DUT4:
0 (100.0%) -- queued-ipc-messages
├──0 (100.0%) ── content-parent((Preallocated), pid=32688, open channel, 0x43923400, refcnt=16)
├──0 (100.0%) ── content-parent(E-Mail, pid=-1, closed channel, 0x4466cc00, refcnt=1)
└──0 (100.0%) ── content-parent(Messages, pid=-1, closed channel, 0x43f2e800, refcnt=1)
The leak of the two Browser ContentParents goes away if you close the browser app, so I'm not too worried about that one.

The Messages leak is the "runningApps" leak that Ting-Yu has seen at least once before.  That one goes away if you go to the card view so I'm not too worried about that either.

The email one is the leak from comment 130.
Based on comment 136 I think we should not block on the ContentParent leaks.  What else are we still concerned about here?  How does the memory usage in general look over time?
blocking-b2g: 1.3T+ → 1.3T?
Flags: needinfo?(dliang)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #137)
> Based on comment 136 I think we should not block on the ContentParent leaks.
> What else are we still concerned about here?  How does the memory usage in
> general look over time?

Kyle, I think leak could from different places, and some of them got fixed in this bug. And these fixes will help 1.3t too. Is it possible to keep 1.3T+ and land the above fixes and then open a new bug to follow up the ContentParent leaks?
Flags: needinfo?(khuey)
Sure, yes.  We should absolutely land the patches we already have on Tarako.

So maybe we should set blocking+ again, land the patches, and close this bug.  We can file new bugs for the remaining things, which I think should not block.  How does that sound?
Flags: needinfo?(khuey) → needinfo?(kli)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #137)
> Based on comment 136 I think we should not block on the ContentParent leaks.
> What else are we still concerned about here?  How does the memory usage in
> general look over time?


By previous tests, b2g will be killed after 80hrs monkey test (bug 1005799), the root cause is b2g became big and LMK to kill it. I think we can keep this bug as a meta bug and create other bugs to track different memory leak. In my experience, the general memory usage is below 50MB (USS+SWAP).
Flags: needinfo?(dliang)
Please land this patch, after monkey test, we found b2g RSS+swap is bigger than before monkey test.
Kyle, I agree with you. There are some r+ patches on this bug, but I am not sure if each of them can be landed it at this moment. If the answer is yes, let's land them and close this bug. And then open a meta ug and other bug as necessary to follow up the leak.

Ting, How do you think?
Flags: needinfo?(kli) → needinfo?(tchou)
Ok, let's land.
blocking-b2g: 1.3T? → 1.3T+
I don't see any reason not to land them, memory leak is especially bad for memory contrained device.
Flags: needinfo?(tchou)
Try results are in comment 82 and comment 116.
Do we need the Gaia PR on trunk or just on v1.3t?
Flags: needinfo?(tchou)
Should be just on v1.3t, trunk code is quite different.
Flags: needinfo?(tchou)
https://hg.mozilla.org/mozilla-central/rev/0522e007a3aa
https://hg.mozilla.org/mozilla-central/rev/b1d4ab072fc2
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

We'll want these on 2.0 too.
Attachment #8431488 - Flags: approval-mozilla-aurora+
Do we need this on b2g30 for v1.4?
Flags: needinfo?(khuey)
Ideally, but I'm still waiting for bug 1016746 to get approved so maybe it's too late.
Flags: needinfo?(khuey)
We (partner) also want this in 1.4, bug 1016746 seems to have risk analysis now etc. so would be good if we can land it.
Flags: needinfo?(khuey)
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Memory leaks
Testing completed: On tarako/m-c/etc
Risk to taking this patch (and alternatives if risky): Low risk
String or UUID changes made by this patch: N/A
Attachment #8431488 - Flags: approval-mozilla-b2g30?
Flags: needinfo?(khuey)
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

This needs to land on 1.4 Dolphin branch
Attachment #8431488 - Flags: approval-mozilla-b2g30?
Comment on attachment 8434984 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

Needs to land on 1.4 Dolphin branch
Attachment #8434984 - Flags: approval-mozilla-b2g30?
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: 
Testing completed: 
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch:
Attachment #8431488 - Flags: approval-mozilla-b2g30?
Comment on attachment 8434984 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: 
Testing completed: 
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch:
Attachment #8434984 - Flags: approval-mozilla-b2g30?
Ting-Yu,
Can you check if this is ready to be landed for 1.4 and request approval? We should now land this directly to 1.4.

I changed the approval flags to ? above but it's probably better for you to check first. Thanks!
Flags: needinfo?(tchou)
Comment on attachment 8434984 [details] [diff] [review]
part3: avoid leaking SpecialPowers to nsInProcessTabChildGlobal.mMessageManager

I'll let the patch author to request for the approval
Attachment #8434984 - Flags: approval-mozilla-b2g30?
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

I'll let the patch author to request for the approval
Attachment #8431488 - Flags: approval-mozilla-b2g30?
AIUI this is just waiting on a decision about whether or not to branch 1.4.
Hi Kyle,

According to the latest update from release management, there won't be new branch for dolphin. For the bugs that needs to be uplifted to v1.4, please request the approval again. Sorry for the inconvenience.
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

Copy from Kyle's comment 156 as the patch is not changed.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Memory leaks
Testing completed: On tarako/m-c/etc
Risk to taking this patch (and alternatives if risky): Low risk
String or UUID changes made by this patch: N/A
Attachment #8431488 - Flags: approval-mozilla-b2g30?
Flags: needinfo?(tchou)
Attachment #8434984 - Flags: approval-mozilla-b2g30?
Hi Lawrence,

Could you help to approve the bug for 1.4 uplifting. Thanks.
Flags: needinfo?(lmandel)
Comment on attachment 8431488 [details] [diff] [review]
part2: avoid leaking Activities.callers in ActivitiesService.jsm

Approved for 1.4.
Attachment #8431488 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g30+
Flags: needinfo?(lmandel)
Attachment #8434984 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g30+
https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/ae02c86181e6
https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/7b4798563882

I'm not sure if the PR that landed on v1.3t is needed for v1.4 or not, but if it is, it'll need rebasing.
Flags: needinfo?(tchou)
Posted file part1: for v1.4
Application state transition has been updated and moved to app_transition_controller.js in v1.4, it seems to me only activityCaller state checking before assiging it to _activeActivity is still needed.

Alive, could you please help confirm?
Flags: needinfo?(tchou) → needinfo?(alive)
(In reply to Ting-Yu Chou [:ting] from comment #170)
> Created attachment 8448496 [details] [review]
> part1: for v1.4
> 
> Application state transition has been updated and moved to
> app_transition_controller.js in v1.4, it seems to me only activityCaller
> state checking before assiging it to _activeActivity is still needed.
> 
> Alive, could you please help confirm?

Right, I think 1.4 and master will remove the activity frame correctly.
Flags: needinfo?(alive)
Comment on attachment 8448496 [details] [review]
part1: for v1.4

Carry r+ from comment 54.
Attachment #8448496 - Flags: review+
There are 2 failed cases on Travis for the PR (attachment 8448496 [details] [review]), am checking now.
(In reply to Ting-Yu Chou [:ting] from comment #173)
> There are 2 failed cases on Travis for the PR (attachment 8448496 [details] [review]
> [details]), am checking now.

1. unit-tests-in-firefox
   1) [system] system/HomescreenWindow homescreen window instance. homescreen is crashed Homescreen is crashed at foreground:rerender right away.:
   ...
   make: *** [test-agent-test] Error 1
=> Can't reproduce by running #test-agent with Firefox nightly on my PC.

2. gaia_ui_tests
   TEST-UNEXPECTED-FAIL | test_browser_lan.py test_browser_lan.TestBrowserLAN.test_browser_lan | TimeoutException: TimeoutException: Timed out after 30.0 seconds
=> It is timeout randomly even without the patch.

Ryan, I couldn't find "approval-gaia-v1.4" flag for the attachment, not sure why. Should I just write an approval comment?
Flags: needinfo?(ryanvm)
(In reply to Ting-Yu Chou [:ting] from comment #174)
> Ryan, I couldn't find "approval-gaia-v1.4" flag for the attachment, not sure
> why. Should I just write an approval comment?

Probably because this bug is filed under Core. I would just go with approval-mozilla-b2g30 instead. You can note in the request that the actual patch will be landing on Gaia, though.
Flags: needinfo?(ryanvm)
Comment on attachment 8448496 [details] [review]
part1: for v1.4

Note this patch will be landing on Gaia.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Memory leak
Testing completed: Yes, see comment 154 for failed cases checking.
Risk to taking this patch (and alternatives if risky): Low
String or UUID changes made by this patch: n/a
Attachment #8448496 - Flags: approval-mozilla-b2g30?
Comment on attachment 8448496 [details] [review]
part1: for v1.4

Approved for 1.4.
Attachment #8448496 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g30+
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.