Closed Bug 1506776 Opened 10 months ago Closed 10 months ago

17.78% Base Content Resident Unique Memory (osx-10-10) regression on push 22d335fc020fb0509c67c4a945db21a27b8332d4 (Sat Nov 10 2018)

Categories

(Core :: Security: Process Sandboxing, defect, P1)

65 Branch
Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- unaffected
firefox64 --- unaffected
firefox65 + fixed

People

(Reporter: igoldan, Assigned: haik)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression)

Attachments

(1 file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e51037702c9e8febad9b39f219fec3f22a6d7f26&tochange=22d335fc020fb0509c67c4a945db21a27b8332d4

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 18%  Base Content Resident Unique Memory osx-10-10 opt stylo     22,576,298.67 -> 26,590,549.33


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=17483

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests
Component: General → Security: Process Sandboxing
Product: Testing → Core
Flags: needinfo?(haftandilian)
This is a pretty huge regression from a Fission Memshrink point of view, I think we might need to backout.
I'll be looking into this. I first need to understand at what point in content process startup do we snapshot the memory size. Starting the sandbox earlier (which is enabled with bug 1501126) is going to use memory earlier in the content process startup than before.
The diff is pretty weird. 'explicit', 'resident', and 'vsize' went down but 'resident-unique' went up.

> Web Content (pid NNN)
> Explicit Allocations
> 
> -1.82 MB (100.0%) -- explicit
> ├──-1.77 MB (96.98%) ── heap-unclassified
> ├──-0.06 MB (03.38%) -- heap-overhead
> │  ├──-0.09 MB (04.94%) ── page-cache [2]
> │  ├───0.03 MB (-01.55%) ── bin-unused [2]
> │  └───0.00 MB (-00.01%) ── bookkeeping [2]
> └───0.01 MB (-00.36%) ++ (3 tiny)
> 
> Other Measurements
> 
> 0 (100.0%) ++ extensions
> 
> -1.82 MB (100.0%) -- heap-committed
> ├──-1.76 MB (96.62%) ── allocated [2]
> └──-0.06 MB (03.38%) ── overhead [2]
> 
> 0.01 MB (100.0%) -- js-main-runtime
> ├──0.01 MB (101.12%) -- zones
> │  ├──0.01 MB (114.29%) ── type-pool [2]
> │  ├──-0.00 MB (-15.29%) -- shapes
> │  │  ├──-0.00 MB (-14.29%) ── malloc-heap/tree-kids [2]
> │  │  └──-0.00 MB (-01.00%) ++ gc-heap
> │  └──0.00 MB (02.12%) ── unused-gc-things [2]
> └──-0.00 MB (-01.12%) ── realms/classes/objects/gc-heap [2]
> 
> 0.00 MB (100.0%) -- js-main-runtime-gc-heap-committed
> ├──0.00 MB (100.0%) -- unused/gc-things
> │  ├──0.00 MB (100.0%) ── objects [2]
> │  ├──0.00 MB (100.0%) ── shapes [2]
> │  └──0.00 MB (100.0%) ── base-shapes [2]
> └──-0.00 MB (100.0%) -- used/gc-things
>    ├──-0.00 MB (100.0%) ── objects [2]
>    ├──-0.00 MB (100.0%) ── shapes [2]
>    └──-0.00 MB (100.0%) ── base-shapes [2]
> 
>  -1.76 MB ── heap-allocated [2]
>     2,282 ── page-faults-soft [2]
>  -6.25 MB ── resident [2]
>  -4.48 MB ── resident-peak [2]
>   7.03 MB ── resident-unique [2]
> -25.96 MB ── vsize [2]
Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)
Priority: -- → P1
How much do we know about the services we're blocking access to with the early sandbox startup?  Could there be something that normally accesses shared memory but falls back to making its own copy when that fails?  If it's something like that, comparing the output of vmmap(1) before/after could help find what's different.
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #4)
> How much do we know about the services we're blocking access to with the
> early sandbox startup?  Could there be something that normally accesses
> shared memory but falls back to making its own copy when that fails?  If
> it's something like that, comparing the output of vmmap(1) before/after
> could help find what's different.

Yes, that's definitely possible and I think I've come across some instances of that before--where without access to a service, the process creates its own instances of things. Once I can reproduce this, vmmap is first on my list of things to check.
See Also: → 1507953
I've found that adding back access to the coreservicesd daemon via com.apple.CoreServices.coreservicesd improves the "Base Content Resident Unique Memory" measurement bringing it back in line with the measurement prior to enabling early sandbox startup.

This compares earlyinit=true (base) with earlyinit=true + coreservicesd:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3f0e50d050e7007f8b15fb1f6f2434a2f3b45f49&newProject=try&newRevision=5c7d54b858be6213be146661bd752fcc095a960b&framework=4

This compares earlyinit=false (base) with earlyinit=true + coreservicesd:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=28035db7d9a0&newProject=try&newRevision=5c7d54b858be6213be146661bd752fcc095a960b&framework=4

coreservicesd is sandboxed, but unfortunately its sandbox policy is highly permissive allowing global filesystem read access, filesystem write access to temp and cache dirs, outbound network access and more. If we added this back, it would still be a win to be dropping access to the unsandboxed dock daemon along with distnoted.

I'm going to continue debugging to try to get a better understanding of what's happening and to see if we can realize the same memory savings without granting access to coreservicesd.
I have not found a workaround for enabling CoreServices so I plan to add back CoreServices to the content policy and file a new bug to cover investigating this more. See below for some debugging details.

The memory impact of blocking CoreServices seems to be less on newer OS versions. To investigate this further, I think we should compare the memory impact and possibly turn on blocking just for newer OS versions. In addition, tracing all allocations and trying to see where/why we do the additional allocations when CoreServices is blocked may lead to some discovery.

On 10.10, comparing vmmap output between an about:blank content process without allowing CoreServices vs allowing Coreservices, the total writable regions is larger by ~4 MB when CoreServices is blocked which is consistent with the difference flagged by the awsy base test "Base Content Resident Unique Memory" (ab on try). I don't have an explanation for why just the awsy base test regressed, but not the "after tabs open" test. More debugging needed.

There are also some memory mapped files that I only see when CoreServices is allowed, but they are small and don't appear to be causing the difference.

Diff of vmmap output:

  *** vmmap-base.51799
  --- vmmap-with-core-services.51862
  ! ==== Summary for process 51799
    ReadOnly portion of Libraries: Total=345.5M resident=0K(0%) swapped_out_or_unallocated=345.5M(100%)
  ! Writable regions: Total=61.4M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=61.4M(100%)
  
                                    VIRTUAL   REGION 
    REGION TYPE                        SIZE    COUNT (non-coalesced) 
    ===========                     =======  ======= 
    CG image                             4K        2 
    CG shared images                    32K        2 
  ! CoreServices                       500K       14
  --
  ! ==== Summary for process 51862
    ReadOnly portion of Libraries: Total=345.5M resident=0K(0%) swapped_out_or_unallocated=345.5M(100%)
  ! Writable regions: Total=56.7M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=56.7M(100%)
  
                                    VIRTUAL   REGION 
    REGION TYPE                        SIZE    COUNT (non-coalesced) 
    ===========                     =======  ======= 
    CG image                             4K        2 
    CG shared images                    32K        2 
  ! CoreServices                       128K        4 

I looked into trying to avoid triggering the connection to CoreServices. We first connect in ContentChild::Init() when we call SetProcessName(). With that code removed, we then try to connect in nsURLHelperOSX.cpp calling ::FSGetVolumeInfo() for some HFS-specific path checks. And with that code removed, we try to connect when calling [NSApplication sharedApplication] from nsAppShell::Init().

  * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    * frame #0 libxpc.dylib`bootstrap_look_up3() 
      frame #1 libxpc.dylib`bootstrap_look_up2() 
      frame #2 CarbonCore`SCClientSession::checkinWithServer() 
      frame #3 CarbonCore`connectToCoreServicesD() 
      frame #4 CarbonCore`getStatus() 
      frame #5 CarbonCore`getSystemSession() 
      frame #6 CarbonCore`scCreateSystemCacheable() 
      frame #7 CarbonCore`___ZL17_CLV_GetCacheablev_block_invoke() 
      frame #8 libdispatch.dylib`_dispatch_client_callout() 
      frame #9 libdispatch.dylib`_dispatch_once_callout() 
      frame #10 CarbonCore`_CLV_GetCacheable() 
      frame #11 CarbonCore`_standalone_CopyCheckFixForBundleID() 
      frame #12 CarbonCore`GetBugsForOurBundleIDFromCoreservicesd() 
      frame #13 CarbonCore`_CSCheckFix() 
      frame #14 SkyLight`initDisplayState() 
      frame #15 SkyLight`initDisplayMappings() 
      frame #16 SkyLight`__SLSInitialize_block_invoke() 
      frame #17 libdispatch.dylib`_dispatch_client_callout() 
      frame #18 libdispatch.dylib`_dispatch_once_callout() 
      frame #19 SkyLight`CGS_CHECK_INIT() 
      frame #20 SkyLight`SLSMainConnectionID() 
      frame #21 CoreFoundation`_CFAppSleepSetupCoreGraphics() 
      frame #22 CoreFoundation`____CFRunLoopSetOptionsReason_block_invoke_2() 
      frame #23 libdispatch.dylib`_dispatch_client_callout() 
      frame #24 libdispatch.dylib`_dispatch_once_callout() 
      frame #25 CoreFoundation`__CFRunLoopSetOptionsReason() 
      frame #26 LaunchServices`_LSApplicationCheckIn() 
      frame #27 HIServices`_RegisterApplication() 
      frame #28 HIServices`GetCurrentProcess() 
      frame #29 HIToolbox`MenuBarInstance::GetAggregateUIMode() 
      frame #30 HIToolbox`MenuBarInstance::IsVisible() 
      frame #31 AppKit`_NSInitializeAppContext() 
      frame #32 AppKit`-[NSApplication init]() 
      frame #33 AppKit`+[NSApplication sharedApplication]() 
      frame #34 XUL`nsAppShell::Init() at nsAppShell.mm:337
      frame #35 XUL`nsAppShellInit() at nsAppShellSingleton.h:45
      frame #36 XUL`nsComponentManagerImpl::KnownModule::Load() at nsComponentManager.cpp:811
      frame #37 XUL`nsFactoryEntry::GetFactory() at nsComponentManager.cpp:1811
      frame #38 XUL`nsComponentManagerImpl::CreateInstance() at nsComponentManager.cpp:1048
      frame #39 XUL`nsComponentManagerImpl::GetService() at nsComponentManager.cpp:1293
      frame #40 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:56
      frame #41 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:253
      frame #42 XUL`nsCOMPtr_base::assign_from_gs_cid() at nsCOMPtr.cpp:73
      frame #43 XUL`XRE_RunAppShell() at nsCOMPtr.h:660
      frame #44 XUL`XRE_RunAppShell() at nsCOMPtr.h:657
      frame #45 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:911
      frame #46 XUL`MessageLoop::Run() at message_loop.cc:325
      frame #47 XUL`MessageLoop::Run() at message_loop.cc:318
      frame #48 XUL`MessageLoop::Run() at message_loop.cc:298
      frame #49 XUL`XRE_InitChildProcess() at nsEmbedFunctions.cpp:780
      frame #50 plugin-container`main() at plugin-container.cpp:50
      frame #51 plugin-container`main() at MozillaRuntimeMain.cpp:25
      frame #52 libdyld.dylib`start() 
      frame #53 libdyld.dylib`start()
Add back access to com.apple.CoreServices.coreservicesd to avoid extra allocations that are triggered when the service is blocked.
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/823acff9cd2f
17.78% Base Content Resident Unique Memory regression r=Alex_Gaynor
https://hg.mozilla.org/mozilla-central/rev/823acff9cd2f
Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
See Also: → 1505507
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.