The default bug view has changed. See this FAQ.

Measure mmapped/library io

NEW
Assigned to

Status

()

Core
General
8 years ago
7 years ago

People

(Reporter: (dormant account), Assigned: joelr)

Tracking

Trunk
x86
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

8 years ago
Currently mmap io is good at evading most measuring tools. It is confirmed to be the dominant source of io latency on Linux and is likely to be that on other platforms. We need a good way to measure mmap io.
(Assignee)

Updated

8 years ago
Assignee: nobody → joelr
Not build-config.
Component: Build Config → General
QA Contact: build-config → general
Summary: Measure library io → Measure mmapped/library io
(Assignee)

Comment 2

8 years ago
My DTrace repo lives here:

http://github.com/wagerlabs/firefox-startup

Grab d, startup.d, sigcont.d and pagefault.d then run 'd' in one terminal window, e.g.

./d firefox-bin page-fault.d 

Then run something like this in another window

purge && /Applications/Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2

You should see output like this in the "dtrace" window. Note that I tested on Snow Leopard and kernel structures in Leopard may be different.

dtrace -Zqw -x dynvarsize=64m -x evaltime=exec -p 18743 -s start.d -s page-fault.d -s sigcont.d
vnode_pagein: XUL, offset: 20402176, size: 8192
vnode_pagein: XUL, offset: 19722240, size: 16384
vnode_pagein: XUL, offset: 19738624, size: 16384
vnode_pagein: XUL, offset: 19755008, size: 16384
vnode_pagein: XUL, offset: 19771392, size: 16384
vnode_pagein: XUL, offset: 19804160, size: 16384
vnode_pagein: XUL, offset: 19820544, size: 16384
vnode_pagein: XUL, offset: 19836928, size: 16384
vnode_pagein: XUL, offset: 19857408, size: 16384
vnode_pagein: XUL, offset: 19873792, size: 16384
vnode_pagein: XUL, offset: 19890176, size: 16384
vnode_pagein: XUL, offset: 19906560, size: 16384
vnode_pagein: XUL, offset: 19922944, size: 16384
vnode_pagein: XUL, offset: 19939328, size: 16384
vnode_pagein: XUL, offset: 19955712, size: 16384
vnode_pagein: XUL, offset: 20410368, size: 16384
vnode_pagein: XUL, offset: 19972096, size: 20480
vnode_pagein: XUL, offset: 19992576, size: 20480
vnode_pagein: XUL, offset: 20013056, size: 20480
vnode_pagein: XUL, offset: 20033536, size: 20480
vnode_pagein: XUL, offset: 20054016, size: 20480
vnode_pagein: XUL, offset: 20094976, size: 20480
vnode_pagein: XUL, offset: 20115456, size: 20480
vnode_pagein: XUL, offset: 20135936, size: 20480
vnode_pagein: XUL, offset: 20156416, size: 24576
vnode_pagein: XUL, offset: 20180992, size: 24576
vnode_pagein: XUL, offset: 20205568, size: 24576
vnode_pagein: XUL, offset: 20230144, size: 24576
vnode_pagein: XUL, offset: 20254720, size: 24576
vnode_pagein: XUL, offset: 20303872, size: 24576
vnode_pagein: XUL, offset: 20328448, size: 28672
vnode_pagein: XUL, offset: 20357120, size: 28672
vnode_pagein: XUL, offset: 20393984, size: 8192
vnode_pagein: XUL, offset: 19787776, size: 16384
vnode_pagein: XUL, offset: 20074496, size: 20480
vnode_pagein: XUL, offset: 20279296, size: 24576
vnode_pagein: XUL, offset: 20496384, size: 434176
vnode_pagein: XUL, offset: 20439040, size: 8192
vnode_pagein: XUL, offset: 14548992, size: 1048576
vnode_pagein: XUL, offset: 20385792, size: 8192
vnode_pagein: XUL, offset: 8192, size: 1040384
vnode_pagein: XUL, offset: 10944512, size: 999424
vnode_pagein: XUL, offset: 8921088, size: 1048576
vnode_pagein: XUL, offset: 20426752, size: 4096
vnode_pagein: XUL, offset: 20430848, size: 4096
vnode_pagein: XUL, offset: 20434944, size: 4096
vnode_pagein: XUL, offset: 19853312, size: 4096
vnode_pagein: XUL, offset: 20447232, size: 4096
vnode_pagein: XUL, offset: 20451328, size: 8192
vnode_pagein: XUL, offset: 12963840, size: 1048576
vnode_pagein: XUL, offset: 20459520, size: 36864
vnode_pagein: XUL, offset: 14012416, size: 536576
vnode_pagein: XUL, offset: 11943936, size: 1019904
vnode_pagein: XUL, offset: 1048576, size: 827392
vnode_pagein: XUL, offset: 10285056, size: 659456
vnode_pagein: XUL, offset: 1875968, size: 651264
vnode_pagein: XUL, offset: 9969664, size: 315392
vnode_pagein: XUL, offset: 3645440, size: 1048576
vnode_pagein: XUL, offset: 4694016, size: 819200
vnode_pagein: XUL, offset: 2658304, size: 987136
vnode_pagein: XUL, offset: 5701632, size: 1048576
vnode_pagein: XUL, offset: 7528448, size: 1048576
vnode_pagein: XUL, offset: 5513216, size: 188416
vnode_pagein: XUL, offset: 6934528, size: 593920
vnode_pagein: XUL, offset: 6750208, size: 184320
vnode_pagein: XUL, offset: 2527232, size: 131072
total run time: 5638.591735ms
(Assignee)

Comment 3

8 years ago
I want to refine the bug description a bit, this should help understand what I'm doing...

Dynamic libraries are mapped onto a range of memory by mmap. This memory range is backed by a set of virtual memory pages that are loaded from disk upon access to their corresponding memory address. 

Mac OSX uses 4K pages and these are fetched from disk on demand by the virtual memory pager. We are talking dynamic library pages here so each page is filled with code for functions that live in the dynamic library. Pages are accessed (and fetched from disk) when a call is made to a function in the dynamic library.

Pages should be accessed sequentially for best performance. It does look from the above output that we are loading multiple pages at the same time, e.g. size 1019904 corresponds to 249 pages. Page access looks quite random, though, which is killing us.

Now that we know what the access pattern is, we should try to first identify the symbols that are being accessed in each set of pages and then make the linker rearrange the code [1] such that page access is more sequential and less random.

[1] http://developer.apple.com/mac/library/documentation/Performance/Conceptual/CodeFootprint/Articles/ImprovingLocality.html#//apple_ref/doc/uid/20001862
(Assignee)

Comment 4

8 years ago
Saving it here for posterity:

http://wagerlabs.com/post/230853261/tracking-io-patterns-in-memory-mapped-dynamic-libaries

Summary:

You are not likely to see page-ins when running this DTrace script unless you have just restarted your Mac and are running Firefox for the first time. This is because the dynamic libraries will be stored in the Unified Buffer Cache (UBC) after first access and there won’t be any subsequent disk access until they are evicted from the cache.

I wrote about hacking the Unified Buffer Cache before [1] but the same technique does not work with mmap-ed data since the virtual manager and the cache are the same thing. Evicting the libraries would involve allocating at at least twice as much virtual memory as there’s RAM and then touching each page to make sure it’s cached. This is unlikely to correspond to normal use of Firefox, though.

[1] http://tinyco.de/2008/03/04/hacking-the-unified-buffer-cache.html
(Assignee)

Comment 5

8 years ago
Normally, we would just use gprof to give us four different types of symbol orderings:

1) Ordering based on a “closest is best” analysis of the profiling call graph. Calls that call each other frequently are placed close together.

2) Routines sorted by the number of calls made to each routine, largest numbers first.

3) Routines sorted by the order in which they are called.

4) Routines sorted by the amount of CPU time spent, largest times first.

gprof doesn't work with dynamic libraries, though, so the above would require a static build of Firefox. There are some issues with that according to Bug 525013.

In the meantime, 'size -m -l XUL' gives us the size (first number) and offset of each segment and section in XUL, e.g.

Segment __TEXT: 19062784 (vmaddr 0x0 fileoff 0)
	Section __const: 1120204 (addr 0x1300 offset 4864)
	Section __cstring: 441251 (addr 0x112ad0 offset 1125072)
	Section __text: 13853392 (addr 0x17e680 offset 1566336)
	Section __symbol_stub1: 9282 (addr 0xeb4950 offset 15419728)
	Section __stub_helper: 15486 (addr 0xeb6d92 offset 15429010)
	Section __gcc_except_tab: 70540 (addr 0xebaa10 offset 15444496)
	Section __unwind_info: 224592 (addr 0xeccd90 offset 15519120)
	Section __eh_frame: 3319048 (addr 0xf03ae0 offset 15743712)
	total 19053795
Segment __DATA: 1822720 (vmaddr 0x122e000 fileoff 19062784)
	Section __nl_symbol_ptr: 1136 (addr 0x122e000 offset 19062784)
	Section __la_symbol_ptr: 12376 (addr 0x122e470 offset 19063920)
	Section __mod_init_func: 1536 (addr 0x12314c8 offset 19076296)
	Section __const: 1574096 (addr 0x1231ae0 offset 19077856)
	Section __cfstring: 3296 (addr 0x13b1fb0 offset 20651952)
	Section __objc_data: 1760 (addr 0x13b2c90 offset 20655248)
	Section __objc_msgrefs: 10544 (addr 0x13b3370 offset 20657008)
	Section __objc_selrefs: 432 (addr 0x13b5ca0 offset 20667552)
	Section __objc_classrefs: 1560 (addr 0x13b5e50 offset 20667984)
	Section __objc_superrefs: 160 (addr 0x13b6468 offset 20669544)
	Section __objc_const: 13040 (addr 0x13b6508 offset 20669704)
	Section __objc_classlist: 176 (addr 0x13b97f8 offset 20682744)
	Section __objc_catlist: 48 (addr 0x13b98a8 offset 20682920)
	Section __objc_protolist: 24 (addr 0x13b98d8 offset 20682968)
	Section __objc_imageinfo: 8 (addr 0x13b98f0 offset 20682992)
	Section __data: 107592 (addr 0x13b9900 offset 20683008)
	Section __bss: 61316 (addr 0x13d3d60 offset 0)
	Section __common: 29672 (addr 0x13e2d00 offset 0)
	total 1818772
Segment __LINKEDIT: 18173952 (vmaddr 0x13eb000 fileoff 20791296)
total 39059456

The LINKEDIT segment houses the symbol table [1], a huge one in our case (18Mb) and one that cannot be stripped, e.g.

strip XUL1
strip: symbols referenced by indirect symbol table entries that can't be stripped in: /Users/joelr/Work/mozilla/profile/toolkit/library/XUL1

What I'll try to do is convert the offsets and sizes in comment #2 to the segments and sections above. This will give us a bit more insight into how things are accessed. Once that is taken care of, I'll proceed to reorder the code pages for sequential access. I will then try to resolve Bug 525013 by getting a static build going so that we can time profile-guided (gprof) reordering.

[1] http://developer.apple.com/mac/library/documentation/DeveloperTools/Conceptual/MachORuntime/Reference/reference.html
(Assignee)

Comment 6

8 years ago
This corresponds to page-ins in comment #2. I'll need to reboot and re-run the page-ins DTrace script as I don't see __TEXT, __text being hit here.

__DATA, __cfstring = 2 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 7 pages
__DATA, __cfstring = 7 pages
__DATA, __cfstring = 2 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 106 pages
__DATA, __cfstring = 2 pages
__TEXT, __symbol_stub1 = 256 pages
__DATA, __cfstring = 2 pages
__TEXT, __cstring = 254 pages
__TEXT, __symbol_stub1 = 244 pages
__TEXT, __symbol_stub1 = 256 pages
__DATA, __cfstring = 1 pages
__DATA, __cfstring = 1 pages
__DATA, __cfstring = 1 pages
__DATA, __cfstring = 1 pages
__DATA, __cfstring = 1 pages
__DATA, __cfstring = 2 pages
__TEXT, __symbol_stub1 = 256 pages
__DATA, __cfstring = 9 pages
__TEXT, __symbol_stub1 = 131 pages
__TEXT, __symbol_stub1 = 249 pages
__TEXT, __cstring = 202 pages
__TEXT, __symbol_stub1 = 161 pages
__TEXT, __symbol_stub1 = 159 pages
__TEXT, __symbol_stub1 = 77 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 200 pages
__TEXT, __symbol_stub1 = 241 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 46 pages
__TEXT, __symbol_stub1 = 145 pages
__TEXT, __symbol_stub1 = 45 pages
__TEXT, __symbol_stub1 = 32 pages
(Assignee)

Comment 7

8 years ago
I'm now purging the caches using this Python script (64 bits):

#!/usr/bin/env python

import mmap

SIZE = 8589934592L # 8Gb

map = mmap.mmap(-1, SIZE)
map[::4096] = '\x01' * (SIZE // 4096)

I then run a small DTrace script that looks for page-ins in XUL, e.g.

dtrace -Zqws pf.d 

where the script looks like this:

fbt::vnode_pagein:entry
{
  self->v_name = stringof(((vnode_t)arg0)->v_name);
}

fbt::vnode_pagein:entry
/self->v_name == "XUL"/
{
  printf("vnode_pagein: %s, offset: %u, size: %u\n", 
    self->v_name, arg3, arg4);
}

I then analyze the output from the DTrace script and the list of segments, e.g.

./access-pattern.py xul.segments xul.pageins.2

where access-pattern.py lives at http://github.com/wagerlabs/firefox-startup/blob/master/mmap/access-pattern.py

What I get back is devoid of any code (__TEXT, __text) access (see below). I'm consulting the Darwin deities tonight and will post any suggestions as soon as I have them.

segment: __TEXT, size: 18673664, offset: 0
segment: __DATA, size: 1814528, offset: 18673664
segment: __LINKEDIT, size: 19382272, offset: 20393984
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__LINKEDIT, __symtab = 4 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__LINKEDIT, __symtab = 2 pages
__TEXT, __symbol_stub1 = 256 pages
__LINKEDIT, __symtab = 9 pages
__TEXT, __symbol_stub1 = 254 pages
__TEXT, __cstring = 131 pages
__TEXT, __symbol_stub1 = 249 pages
__LINKEDIT, __symtab = 2 pages
__LINKEDIT, __symtab = 2 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 4 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 5 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __cfstring = 6 pages
__DATA, __objc_classlist = 6 pages
__LINKEDIT, __symtab = 6 pages
__LINKEDIT, __symtab = 7 pages
__LINKEDIT, __symtab = 7 pages
__LINKEDIT, __symtab = 7 pages
__LINKEDIT, __symtab = 128 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 2 pages
__DATA, __cfstring = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 2 pages
__TEXT, __cstring = 256 pages
__TEXT, __symbol_stub1 = 149 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 199 pages
__TEXT, __symbol_stub1 = 200 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 134 pages
__TEXT, __symbol_stub1 = 130 pages
__TEXT, __symbol_stub1 = 241 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 256 pages
__TEXT, __symbol_stub1 = 46 pages
__TEXT, __symbol_stub1 = 145 pages
__TEXT, __symbol_stub1 = 45 pages
__TEXT, __symbol_stub1 = 4 pages
__TEXT, __symbol_stub1 = 3 pages
(Assignee)

Comment 8

8 years ago
Allright, so I'm an idiot and the output should look like this:

segment: __TEXT, size: 18673664, offset: 0
segment: __DATA, size: 1814528, offset: 18673664
segment: __LINKEDIT, size: 19382272, offset: 20393984
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__LINKEDIT, __symtab = 4 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __data = 2 pages
__TEXT, __text = 256 pages
__LINKEDIT, __symtab = 9 pages
__TEXT, __text = 254 pages
__TEXT, __const = 131 pages
__TEXT, __text = 249 pages
__LINKEDIT, __symtab = 2 pages
__LINKEDIT, __symtab = 2 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 4 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 5 pages
__DATA, __const = 6 pages
__DATA, __const = 6 pages
__DATA, __const = 6 pages
__DATA, __const = 6 pages
__DATA, __const = 6 pages
__DATA, __objc_const = 6 pages
__DATA, __data = 6 pages
__DATA, __data = 7 pages
__DATA, __data = 7 pages
__LINKEDIT, __symtab = 7 pages
__LINKEDIT, __symtab = 128 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 2 pages
__DATA, __const = 1 pages
__LINKEDIT, __symtab = 1 pages
__LINKEDIT, __symtab = 2 pages
__TEXT, __const = 256 pages
__TEXT, __text = 149 pages
__TEXT, __text = 256 pages
__TEXT, __text = 256 pages
__TEXT, __text = 199 pages
__TEXT, __text = 200 pages
__TEXT, __text = 256 pages
__TEXT, __text = 134 pages
__TEXT, __text = 130 pages
__TEXT, __text = 241 pages
__TEXT, __text = 256 pages
__TEXT, __text = 256 pages
__TEXT, __text = 46 pages
__TEXT, __text = 145 pages
__TEXT, __text = 45 pages
__TEXT, __text = 4 pages
__TEXT, __text = 3 pages
(Assignee)

Comment 9

8 years ago
And here's an even better output, with offsets

segment: __TEXT, size: 18673664, offset: 0
segment: __DATA, size: 1814528, offset: 18673664
segment: __LINKEDIT, size: 19382272, offset: 20393984
__DATA, __const = 4 pages, 4843 pages in
__DATA, __const = 4 pages, 4848 pages in
__DATA, __const = 4 pages, 4852 pages in
__DATA, __const = 4 pages, 4856 pages in
__DATA, __const = 4 pages, 4860 pages in
__DATA, __const = 4 pages, 4864 pages in
__DATA, __const = 4 pages, 4868 pages in
__DATA, __const = 4 pages, 4872 pages in
__LINKEDIT, __symtab = 4 pages, 4983 pages in
__DATA, __const = 5 pages, 4876 pages in
__DATA, __const = 5 pages, 4881 pages in
__DATA, __const = 5 pages, 4896 pages in
__DATA, __const = 5 pages, 4901 pages in
__DATA, __const = 5 pages, 4906 pages in
__DATA, __data = 2 pages, 4977 pages in
__TEXT, __text = 256 pages, 3165 pages in
__LINKEDIT, __symtab = 9 pages, 4995 pages in
__TEXT, __text = 254 pages, 2 pages in
__TEXT, __const = 131 pages, 3421 pages in
__TEXT, __text = 249 pages, 2916 pages in
__LINKEDIT, __symtab = 2 pages, 4979 pages in
__LINKEDIT, __symtab = 2 pages, 4981 pages in
__DATA, __const = 4 pages, 4815 pages in
__DATA, __const = 4 pages, 4819 pages in
__DATA, __const = 4 pages, 4823 pages in
__DATA, __const = 4 pages, 4827 pages in
__DATA, __const = 4 pages, 4831 pages in
__DATA, __const = 4 pages, 4835 pages in
__DATA, __const = 4 pages, 4839 pages in
__DATA, __const = 5 pages, 4886 pages in
__DATA, __const = 5 pages, 4891 pages in
__DATA, __const = 5 pages, 4911 pages in
__DATA, __const = 5 pages, 4916 pages in
__DATA, __const = 6 pages, 4921 pages in
__DATA, __const = 6 pages, 4927 pages in
__DATA, __const = 6 pages, 4933 pages in
__DATA, __const = 6 pages, 4939 pages in
__DATA, __const = 6 pages, 4945 pages in
__DATA, __objc_const = 6 pages, 4951 pages in
__DATA, __data = 6 pages, 4957 pages in
__DATA, __data = 7 pages, 4963 pages in
__DATA, __data = 7 pages, 4970 pages in
__LINKEDIT, __symtab = 7 pages, 5004 pages in
__LINKEDIT, __symtab = 128 pages, 5011 pages in
__LINKEDIT, __symtab = 1 pages, 4987 pages in
__LINKEDIT, __symtab = 1 pages, 4988 pages in
__LINKEDIT, __symtab = 1 pages, 4989 pages in
__LINKEDIT, __symtab = 2 pages, 4990 pages in
__DATA, __const = 1 pages, 4847 pages in
__LINKEDIT, __symtab = 1 pages, 4992 pages in
__LINKEDIT, __symtab = 2 pages, 4993 pages in
__TEXT, __const = 256 pages, 3552 pages in
__TEXT, __text = 149 pages, 2767 pages in
__TEXT, __text = 256 pages, 2312 pages in
__TEXT, __text = 256 pages, 259 pages in
__TEXT, __text = 199 pages, 2568 pages in
__TEXT, __text = 200 pages, 1146 pages in
__TEXT, __text = 256 pages, 890 pages in
__TEXT, __text = 134 pages, 2178 pages in
__TEXT, __text = 130 pages, 515 pages in
__TEXT, __text = 241 pages, 649 pages in
__TEXT, __text = 256 pages, 1392 pages in
__TEXT, __text = 256 pages, 1838 pages in
__TEXT, __text = 46 pages, 1346 pages in
__TEXT, __text = 145 pages, 1693 pages in
__TEXT, __text = 45 pages, 1648 pages in
__TEXT, __text = 4 pages, 645 pages in
__TEXT, __text = 3 pages, 256 pages in
(Assignee)

Comment 10

8 years ago
Here's the same but with a running commentary.

We start by loading 8 chunks of 4 pages, starting 4843 into the file.

  __DATA, __const = 4 pages, 4843 pages in

Hit the symbol table for 4 pages.

  __LINKEDIT, __symtab = 4 pages, 4983 pages in

Continue loading constants from where we left off, 5 chunks of 5 pages.

  __DATA, __const = 5 pages, 4876 pages in
  ...
  __DATA, __const = 5 pages, 4906 pages in

Hit initialized mutable variables, such as writable C strings and data arrays for 2 pages.

  __DATA, __data = 2 pages, 4977 pages in
  
Load 256 pages of code

  __TEXT, __text = 256 pages, 3165 pages in
  
Grab 9 pages of symbols from the symbol table.

  __LINKEDIT, __symtab = 9 pages, 4995 pages in
 
Grab some code.

  __TEXT, __text = 254 pages, 2 pages in

Grab some constants.

  __TEXT, __const = 131 pages, 3421 pages in

Grab some code. This chunk extends up to page 3165 in our last code read.

  __TEXT, __text = 249 pages, 2916 pages in

Sequential read of 4 pages from the symbol table.

  __LINKEDIT, __symtab = 2 pages, 4979 pages in
  __LINKEDIT, __symtab = 2 pages, 4981 pages in

Sequential read of 16 x 4/5/6 pages, last read 4945 pages in.

  __DATA, __const = 4 pages, 4815 pages in

ObjC constants.

  __DATA, __objc_const = 6 pages, 4951 pages in
  
More data, sequential reading.

  __DATA, __data = 6 pages, 4957 pages in
  __DATA, __data = 7 pages, 4963 pages in
  __DATA, __data = 7 pages, 4970 pages in

Sequential reading of the symbol table.

  __LINKEDIT, __symtab = 7 pages, 5004 pages in
  __LINKEDIT, __symtab = 128 pages, 5011 pages in

And a sudden jump back for more sequential reading.

  __LINKEDIT, __symtab = 1 pages, 4987 pages in
  __LINKEDIT, __symtab = 1 pages, 4988 pages in
  __LINKEDIT, __symtab = 1 pages, 4989 pages in
  __LINKEDIT, __symtab = 2 pages, 4990 pages in
  
This page must be feeling lonely!

  __DATA, __const = 1 pages, 4847 pages in
 
Read some mor symbols.

  __LINKEDIT, __symtab = 1 pages, 4992 pages in
  __LINKEDIT, __symtab = 2 pages, 4993 pages in

And issue a bunch of completly random code reads.

  __TEXT, __const = 256 pages, 3552 pages in
  __TEXT, __text = 149 pages, 2767 pages in
  __TEXT, __text = 256 pages, 2312 pages in
  __TEXT, __text = 256 pages, 259 pages in
  __TEXT, __text = 199 pages, 2568 pages in
  __TEXT, __text = 200 pages, 1146 pages in
  __TEXT, __text = 256 pages, 890 pages in
  __TEXT, __text = 134 pages, 2178 pages in
  __TEXT, __text = 130 pages, 515 pages in
  __TEXT, __text = 241 pages, 649 pages in
  __TEXT, __text = 256 pages, 1392 pages in
  __TEXT, __text = 256 pages, 1838 pages in
  __TEXT, __text = 46 pages, 1346 pages in
  __TEXT, __text = 145 pages, 1693 pages in
  __TEXT, __text = 45 pages, 1648 pages in
  __TEXT, __text = 4 pages, 645 pages in
  __TEXT, __text = 3 pages, 256 pages in
(Assignee)

Comment 11

8 years ago
I think that trying to control page-in order is a dead end, at least on (Snow) Leopard. 

I reboot to make sure there are no traces of Firefox in memory. I then run Firefox and trace the page-ins. 

One run may give me (head -10 ...)

__LINKEDIT, __symtab = 2 pages, 5016 pages in
__LINKEDIT, __symtab = 2 pages, 5018 pages in
__DATA, __const = 4 pages, 4850 pages in
__DATA, __const = 4 pages, 4854 pages in
__DATA, __const = 4 pages, 4858 pages in
__DATA, __const = 4 pages, 4862 pages in
__DATA, __const = 4 pages, 4866 pages in

whereas another run

__LINKEDIT, __symtab = 2 pages, 5016 pages in
__DATA, __const = 5 pages, 4921 pages in
__DATA, __const = 6 pages, 4956 pages in
__DATA, __data = 7 pages, 4998 pages in
__TEXT, __const = 40 pages, 256 pages in
__TEXT, __cstring = 44 pages, 307 pages in
__TEXT, __text = 73 pages, 2390 pages in

I dug into the source code for vnode_pagein [1] and it appears that the file system starts by trying to fetch 256 x 4096 (256 pages = 1Mb) from disk but the end number of pages is limited by various factors outside of our control. 

I wasn't able to figure out why access pattern is so different, apart from the first two pages of the __LINKEDIT segment always being requested first.

I understand that we are really being hit by IO performance on mobile devices. These are Linux based, though, so shouldn't we try to improve mmap IO performance under Linux instead? 

I volunteer to look into this.

Also, since Linux performance (mobile!) is so important to us, we may want to spend some time improving the Linux DTrace port. I volunteer for this as well.

[1] xnu-1456.1.26/bsd/vm/vnode_pager.c
(Assignee)

Comment 12

8 years ago
It may not be a dead end, I take it back. I think it will require careful studying of the Mac OSX dynamic linker, further reading of the Snow Leopard kernel, as well as writing Mach-O (executable format) tools to further advance this. It will also take a few weeks to reach positive results, in addition to the 2-3 weeks I spent on this already. 

For starters, I will need to prove that the page-in pattern does not change if the start-up path through the code is static. I think it may not be in our case and that's what's may be causing varying results in our case. I'm not quite sure about this (more below) but it's worth looking into.

I used DTrace to get a sequential list of function calls that happen during our startup (time order) and told the static linker (ld) to order the symbols accordingly. This did not significantly improve our startup performance, e.g. using Vlad's timing:

Unordered 1762, 1716, 1724, 1726, 1720
Ordered   1746, 1670, 918,  932,  1737

I got these results by rebooting, waiting until my apps came up and then starting Firefox. That's 10 reboots right there but each one takes about 40 seconds which beats allocating 8Gb of memory to clear the OS caches (2x physical, 4-5 minutes). 

I suspect it will be more efficient to whip up a Python library to parse the Mach-O file format (I know all about it [1]) and then try to shuffle code and data according to the page-in pattern. 

[1] http://tinyco.de/2009/01/26/creating-mac-binaries-on-any-platform.html

Last but not least, I'll attach the cause of every XUL page-in for your viewing pleasure, first the user space backtrace then kernel backtrace and, finally, the page-in entry itself.

We do get to XUL code towards the end but note that the first 16 or so page-ins are due to the dynamic linker (dyld). The difference in the page-in pattern during two identical runs is evident in the first 10 page-ins (see comment 11). 

It will take me a while just to figure out whether this difference in due to the dynamic linker, the virtual memory (VM) pager or the virtual file system (VFS) layer.
(Assignee)

Comment 13

8 years ago
Created attachment 411100 [details]
XUL page-ins that happen during a cold Firefox startup + stack trace for each

Note that the kernel stack trace is always the same and that the first 16 or so page-ins are due to the dynamic linker (dyld).
(Assignee)

Comment 14

8 years ago
Here I'm looking at the differences in the dyld stage, before library initializers run.

Following are 3 complete samples of dyld-related page-ins. I determined that these are the product of pre-initialization dyld work by looking at the user-space stack trace, e.g. this is before initializers

              dyld`ImageLoaderMachOCompressed::rebase(ImageLoader::LinkContext const&)+0x63
              dyld`ImageLoader::recursiveRebase(ImageLoader::LinkContext const&)+0x67
              dyld`ImageLoader::recursiveRebase(ImageLoader::LinkContext const&)+0x45
              dyld`ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, ImageLoader::RPathChain const&)+0x89
              dyld`dyld::link(ImageLoader*, bool, ImageLoader::RPathChain const&)+0x76
              dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**)+0xb06
              dyld`dyldbootstrap::start(macho_header const*, int, char const**, long)+0x31f
              dyld`_dyld_start+0x2a
              firefox-bin`0x100000000

Once I see a different stack trace, e.g

              libobjc.A.dylib`_mapStrHash+0xb
              libobjc.A.dylib`NXMapGet+0x2f
              libobjc.A.dylib`addNamedClass+0x26
              libobjc.A.dylib`_read_images+0x247
              libobjc.A.dylib`map_images_nolock+0x4f2
              libobjc.A.dylib`map_images+0x72
              dyld`dyld::notifyBatchPartial(dyld_image_states, bool, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*))+0x2fb
              dyld`dyld::registerImageStateBatchChangeHandler(dyld_image_states, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*))+0x1ba
              libSystem.B.dylib`dyld_register_image_state_change_handler+0x58
              libobjc.A.dylib`_objc_init+0x3a
              dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&)+0xe4
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0xec
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9d
              dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&)+0x3a

I know that initialization code in the libraries is being run. 

I would expect dyld to always be doing the same thing with the same library but either this is not the case or the kernel is reacting completely different each time.

Here's iteration #1 (73 pages)

__LINKEDIT, __symtab = 2 pages, 5016 pages in
__LINKEDIT, __symtab = 2 pages, 5018 pages in
__DATA, __const = 4 pages, 4850 pages in
__DATA, __const = 4 pages, 4854 pages in
__DATA, __const = 4 pages, 4858 pages in
__DATA, __const = 4 pages, 4862 pages in
__DATA, __const = 4 pages, 4866 pages in
__DATA, __const = 6 pages, 4956 pages in
__DATA, __const = 6 pages, 4962 pages in
__DATA, __const = 6 pages, 4968 pages in
__DATA, __const = 6 pages, 4974 pages in
__DATA, __const = 6 pages, 4980 pages in
__DATA, __objc_classrefs = 6 pages, 4986 pages in
__DATA, __data = 6 pages, 4992 pages in
__DATA, __data = 7 pages, 4998 pages in

iteration #2 (84 pages)

__LINKEDIT, __symtab = 2 pages, 5018 pages in
__DATA, __const = 6 pages, 4962 pages in
__DATA, __const = 6 pages, 4968 pages in
__DATA, __const = 6 pages, 4974 pages in
__DATA, __const = 6 pages, 4980 pages in
__DATA, __objc_classrefs = 6 pages, 4986 pages in
__DATA, __data = 6 pages, 4992 pages in
__DATA, __data = 7 pages, 4998 pages in
__DATA, __data = 7 pages, 5005 pages in
__LINKEDIT, __symtab = 1 pages, 5024 pages in
__LINKEDIT, __symtab = 2 pages, 5025 pages in
__LINKEDIT, __symtab = 4 pages, 5027 pages in
__LINKEDIT, __symtab = 25 pages, 5031 pages in

and iteration #3 (54 pages)

__LINKEDIT, __symtab = 2 pages, 5016 pages in
__LINKEDIT, __symtab = 2 pages, 5018 pages in
__DATA, __const = 4 pages, 4850 pages in
__DATA, __const = 4 pages, 4854 pages in
__DATA, __const = 4 pages, 4858 pages in
__DATA, __const = 4 pages, 4862 pages in
__DATA, __const = 4 pages, 4866 pages in
__DATA, __const = 4 pages, 4870 pages in
__DATA, __const = 4 pages, 4874 pages in
__DATA, __const = 4 pages, 4878 pages in
__DATA, __const = 4 pages, 4882 pages in
__DATA, __const = 4 pages, 4886 pages in
__DATA, __const = 5 pages, 4916 pages in
__DATA, __const = 5 pages, 4951 pages in
(Assignee)

Comment 15

8 years ago
We are faulting in an average of 70 pages or 280K before dyld runs library initializers. This is counted over 3 identical runs (see comment 14).

We are faulting in an average of 8361 pages or 32Mb during library initialization and beyond, i.e. after the above 70 pages.

I apologize for calling this a dead end, it looks like an optimization opportunity large enough to drive a truck through.
(Reporter)

Comment 16

8 years ago
(In reply to comment #11)

> I understand that we are really being hit by IO performance on mobile devices.
> These are Linux based, though, so shouldn't we try to improve mmap IO
> performance under Linux instead? 

We aren't hit by random io perf on Linux mobile devices. On linux flash random-io is fast enough. Random io is hurting us most on desktop devices with a harddrive and Windows CE(this isn't backed up by direct evidence as I don't have any tools that can measure library io).

> 
> I volunteer to look into this.
> 
> Also, since Linux performance (mobile!) is so important to us, we may want to
> spend some time improving the Linux DTrace port. I volunteer for this as well.

systemtap works well enough on linux.
(Assignee)

Comment 17

8 years ago
I'm digging further into this, if there are any doubts. 

I used DTrace to get a sequential list of function calls that happen during our
startup (time order) and told the static linker (ld) to order the symbols
accordingly. This did not significantly improve our startup performance, e.g.
using Vlad's timing:

Unordered 1762, 1716, 1724, 1726, 1720
Ordered   1746, 1670, 918,  932,  1737

The problem here is that time-ordering functions is the recommended approach but it's not improving startup performance.
(Assignee)

Comment 18

8 years ago
A few more data points running from a slow USB HD...

Unordered    5628, 5606, 5873, 4462, 6062 (after a reboot)
Ordered      2629, 5997, 6239, 6050, 5994 (right after unordered)
Ordered      4973, 5360, 4999             (after a reboot)
Ordered      6348, 4682, 5292             (after ejecting the HD)

As you can see, the numbers from the recommended approach of ordering symbols by use over time does not improve startup performance. Either that or my symbol ordering is wrong somehow. Need to keep digging.
(Assignee)

Comment 19

8 years ago
Taras asked for the duration of each page-in so here it is. 

What I want to do now is write a Python script that grabs the page-in log and a dynamic library (XUL for starters) and outputs a dynamic library re-arranged for optimal performance. It doesn't look like feeding an ordered list of symbols to the linker does the trick so I want to go deeper.

segment: __TEXT, size: 4594 pages, offset: 0 pages
segment: __DATA, size: 445 pages, offset: 4594 pages
segment: __LINKEDIT, size: 4433 pages, offset: 5016 pages
__LINKEDIT, __symtab = 2 pages in 0.057769ms, 5016 pages in
__LINKEDIT, __symtab = 2 pages in 0.075856ms, 5018 pages in
__DATA, __const = 4 pages in 0.030841ms, 4850 pages in
__DATA, __const = 4 pages in 0.025206ms, 4854 pages in
__DATA, __const = 4 pages in 0.032988ms, 4858 pages in
__DATA, __const = 4 pages in 0.060483ms, 4862 pages in
__DATA, __const = 4 pages in 0.048479ms, 4866 pages in
__DATA, __const = 4 pages in 0.039833ms, 4870 pages in
__DATA, __const = 4 pages in 0.038306ms, 4874 pages in
__DATA, __const = 4 pages in 0.047224ms, 4878 pages in
__DATA, __const = 4 pages in 0.045886ms, 4882 pages in
__DATA, __const = 4 pages in 0.039886ms, 4886 pages in
__DATA, __const = 4 pages in 0.045787ms, 4890 pages in
__DATA, __const = 4 pages in 0.032182ms, 4894 pages in
__DATA, __const = 4 pages in 0.05377ms, 4898 pages in
__DATA, __data = 7 pages in 0.069645ms, 4998 pages in
__TEXT, __const = 40 pages in 0.10155ms, 256 pages in
__TEXT, __text = 46 pages in 0.090554ms, 3673 pages in
__TEXT, __text = 51 pages in 0.099173ms, 2171 pages in
__TEXT, __text = 45 pages in 0.12043ms, 2845 pages in
__TEXT, __text = 36 pages in 0.14896ms, 1928 pages in
__TEXT, __text = 42 pages in 0.10131ms, 2549 pages in
__TEXT, __text = 47 pages in 0.10006ms, 2629 pages in
__TEXT, __text = 46 pages in 0.12338ms, 3285 pages in
__TEXT, __text = 43 pages in 0.12906ms, 2802 pages in
__TEXT, __text = 42 pages in 0.12084ms, 3508 pages in
__TEXT, __text = 41 pages in 0.13751ms, 2231 pages in
__TEXT, __text = 40 pages in 0.12185ms, 2463 pages in
__DATA, __const = 4 pages in 0.060871ms, 4902 pages in
__LINKEDIT, __symtab = 4 pages in 0.057369ms, 5020 pages in
__DATA, __const = 5 pages in 0.043033ms, 4906 pages in
__DATA, __const = 5 pages in 0.064485ms, 4911 pages in
__DATA, __const = 5 pages in 0.063558ms, 4916 pages in
__DATA, __const = 5 pages in 0.043632ms, 4921 pages in
__DATA, __const = 5 pages in 0.026748ms, 4926 pages in
__DATA, __const = 5 pages in 0.048189ms, 4931 pages in
__DATA, __const = 5 pages in 0.051392ms, 4936 pages in
__DATA, __const = 5 pages in 0.04664ms, 4941 pages in
__DATA, __const = 5 pages in 0.052771ms, 4946 pages in
__DATA, __const = 5 pages in 0.038657ms, 4951 pages in
__DATA, __const = 6 pages in 0.055948ms, 4956 pages in
__DATA, __const = 6 pages in 0.036347ms, 4962 pages in
__DATA, __const = 6 pages in 0.054377ms, 4968 pages in
__DATA, __const = 6 pages in 0.035665ms, 4974 pages in
__DATA, __const = 6 pages in 0.031481ms, 4980 pages in
__DATA, __objc_classrefs = 6 pages in 0.042742ms, 4986 pages in
__DATA, __data = 6 pages in 0.023917ms, 4992 pages in
__DATA, __data = 7 pages in 0.05238ms, 5005 pages in
__LINKEDIT, __symtab = 1 pages in 0.079198ms, 5024 pages in
__LINKEDIT, __symtab = 2 pages in 0.03906ms, 5025 pages in
__LINKEDIT, __symtab = 4 pages in 0.050371ms, 5027 pages in
__LINKEDIT, __symtab = 25 pages in 0.074566ms, 5031 pages in
__TEXT, __cstring = 37 pages in 0.093076ms, 351 pages in
__DATA, __data = 4 pages in 0.055401ms, 5012 pages in
__TEXT, __const = 44 pages in 0.15777ms, 2 pages in
__TEXT, __cstring = 44 pages in 0.082086ms, 307 pages in
__TEXT, __text = 34 pages in 0.041126ms, 388 pages in
__TEXT, __const = 50 pages in 0.048358ms, 176 pages in
__TEXT, __text = 48 pages in 0.14794ms, 3372 pages in
__TEXT, __text = 42 pages in 0.10351ms, 3049 pages in
__TEXT, __text = 44 pages in 0.1259ms, 1532 pages in
__TEXT, __text = 43 pages in 0.1081ms, 1964 pages in
__TEXT, __text = 47 pages in 0.09868ms, 3120 pages in
__TEXT, __text = 43 pages in 0.1234ms, 1834 pages in
__TEXT, __text = 40 pages in 0.1181ms, 3167 pages in
__TEXT, __text = 41 pages in 0.098898ms, 1650 pages in
__TEXT, __text = 39 pages in 0.10108ms, 3223 pages in
__TEXT, __cstring = 11 pages in 0.050122ms, 296 pages in
__TEXT, __const = 30 pages in 0.085229ms, 226 pages in
__TEXT, __text = 70 pages in 0.092412ms, 2007 pages in
__TEXT, __text = 81 pages in 0.21815ms, 672 pages in
__TEXT, __text = 43 pages in 0.1581ms, 2272 pages in
__TEXT, __text = 29 pages in 0.078536ms, 3091 pages in
__TEXT, __text = 76 pages in 0.11776ms, 2700 pages in
__TEXT, __text = 23 pages in 0.061873ms, 3262 pages in
__TEXT, __text = 42 pages in 0.11333ms, 422 pages in
__TEXT, __text = 73 pages in 0.17751ms, 2390 pages in
__TEXT, __text = 83 pages in 0.13849ms, 1698 pages in
__TEXT, __text = 63 pages in 0.14195ms, 2986 pages in
__TEXT, __text = 16 pages in 0.070791ms, 3207 pages in
__TEXT, __text = 79 pages in 0.18594ms, 2079 pages in
__TEXT, __text = 39 pages in 0.088643ms, 3420 pages in
__TEXT, __text = 46 pages in 0.10984ms, 2503 pages in
__TEXT, __text = 60 pages in 0.081019ms, 2926 pages in
__TEXT, __text = 41 pages in 0.11579ms, 3331 pages in
__TEXT, __text = 51 pages in 0.089781ms, 1877 pages in
__TEXT, __text = 49 pages in 0.14999ms, 464 pages in
__TEXT, __text = 12 pages in 0.076134ms, 519 pages in
__TEXT, __text = 147 pages in 0.22823ms, 901 pages in
__TEXT, __text = 49 pages in 0.078693ms, 3459 pages in
__TEXT, __text = 5 pages in 0.053016ms, 1781 pages in
__TEXT, __text = 170 pages in 0.28439ms, 1253 pages in
__TEXT, __text = 63 pages in 0.13733ms, 1587 pages in
__TEXT, __text = 26 pages in 0.088221ms, 2776 pages in
__TEXT, __text = 75 pages in 0.17244ms, 2315 pages in
__TEXT, __text = 48 pages in 0.090419ms, 1786 pages in
__TEXT, __text = 38 pages in 0.070018ms, 2591 pages in
__TEXT, __text = 107 pages in 0.17802ms, 553 pages in
__TEXT, __const = 111 pages in 0.13617ms, 60 pages in
__TEXT, __text = 6 pages in 0.061568ms, 513 pages in
__TEXT, __text = 22 pages in 0.10429ms, 531 pages in
__TEXT, __text = 87 pages in 0.17596ms, 753 pages in
__TEXT, __text = 61 pages in 0.11519ms, 840 pages in
__TEXT, __text = 12 pages in 0.073598ms, 660 pages in
__TEXT, __text = 9 pages in 0.06006ms, 2222 pages in
__TEXT, __text = 132 pages in 0.23864ms, 1048 pages in
__TEXT, __text = 13 pages in 0.073986ms, 2158 pages in
__TEXT, __text = 24 pages in 0.090026ms, 2676 pages in
__TEXT, __text = 36 pages in 0.078139ms, 2890 pages in
__TEXT, __text = 109 pages in 0.12254ms, 1423 pages in
__TEXT, __text = 73 pages in 0.073424ms, 1180 pages in
__TEXT, __unwind_info = 123 pages in 0.13173ms, 3719 pages in
__TEXT, __text = 123 pages in 0.16964ms, 3550 pages in
__TEXT, __text = 2 pages in 0.058368ms, 2077 pages in
__TEXT, __text = 7 pages in 0.051494ms, 1691 pages in
__TEXT, __text = 11 pages in 0.063852ms, 1576 pages in
(Assignee)

Comment 20

8 years ago
I just realized that "NNN pages in" may not be clear enough. I'll rename it to "offset = NNN pages" going forward.
(Assignee)

Comment 21

8 years ago
The page-in log in comment 19 is for an old and slow USB HD that I'll be using for testing until my Mozilla laptops arrive, same HD as in comment 18.
(Reporter)

Comment 22

8 years ago
(In reply to comment #21)
> The page-in log in comment 19 is for an old and slow USB HD that I'll be using
> for testing until my Mozilla laptops arrive, same HD as in comment 18.

There is something wrong with that log. It's impossible to have that much hard-drive io and have all of the individual file requests be <1ms. There have to be some seeks in there. Either we are missing some disk io, or these ms are actually seconds.
(Assignee)

Comment 23

8 years ago
Taras, the timings are originally reported in nanoseconds which I divide by 1,000,000 to get millis. There are no individual file requests here, these are all requests within the body of XUL.
(Reporter)

Comment 24

8 years ago
(In reply to comment #23)
> Taras, the timings are originally reported in nanoseconds which I divide by
> 1,000,000 to get millis. There are no individual file requests here, these are
> all requests within the body of XUL.

s/file requests/disk io/. Even fast hard drives spec their seek times at over 8ms, so I'm assuming something is missing here.
(Assignee)

Comment 25

8 years ago
Here are Shark profiles for a change. 

Domain:    https://jreymont-howe.strongspace.com
Login:     public
Password:  4ZD5CbOBbm

Click on the browse tab and download Firefox-Shark-Time-Profile.zip

Open the 'FF Unordered USB - Time Profile' in the archive. Set View in the bottom right corner to 'Heavy (Bottom-up)' then select the 'main' thread in the drop-down box to the left. 

mach_msg_trap should show at the top of the list with 67.4%. We don't care about that so pull up the menu (right mouse click) and select 'Remove callstacks with symbol mach_msg_trap'. 

You should be able to see what I pasted below, after expanding some items. 

That's 598 samples, 10ms each, for a total of ~6s. 9.7% is ours and 8% is on dyld reads, etc. 

I understand that libjar uses mmap but pt_Read (nsprpub/pr/src/pthreads/ptio.c) does not use mmap-ed IO and issues plain reads. It looks like we have a 10% optimization opportunity here.


6.4%	6.4%	XUL	nsZipArchive::GetData(nsZipItem*)
0.0%	6.4%	XUL	 nsJARInputStream::InitFile(nsJAR*, nsZipItem*)
0.0%	6.4%	XUL	  nsJAR::GetInputStreamWithSpec(nsACString_internal const&, char const*, nsIInputStream**)
0.0%	6.4%	XUL	   nsJARInputThunk::EnsureJarStream()
0.0%	6.4%	XUL	    nsJARChannel::Open(nsIInputStream**)
0.0%	4.5%	XUL	     nsXBLService::FetchBindingDocument(nsIContent*, nsIDocument*, nsIURI*, nsIURI*, int, nsIDocument**)
0.0%	4.5%	XUL	      nsXBLService::LoadBindingDocumentInfo(nsIContent*, nsIDocument*, nsIURI*, nsIPrincipal*, int, nsIXBLDocumentInfo**)
0.0%	4.5%	XUL	       nsXBLService::GetBinding(nsIContent*, nsIURI*, int, nsIPrincipal*, int*, nsXBLBinding**, nsTArray<nsIURI*>&)
0.0%	3.2%	XUL	        nsXBLService::GetBinding(nsIContent*, nsIURI*, int, nsIPrincipal*, int*, nsXBLBinding**)
0.0%	1.3%	XUL	        nsXBLService::GetBinding(nsIContent*, nsIURI*, int, nsIPrincipal*, int*, nsXBLBinding**, nsTArray<nsIURI*>&)
0.0%	0.8%	XUL	     CSSLoaderImpl::LoadSheet(SheetLoadData*, StyleSheetState)
0.0%	0.5%	XUL	     nsStringBundle::LoadProperties()
0.0%	0.3%	XUL	     mozJSSubScriptLoader::LoadSubScript(unsigned short const*)
0.0%	0.2%	XUL	     nsExpatDriver::OpenInputStreamFromExternalDTD(unsigned short const*, unsigned short const*, unsigned short const*, nsIInputStream**, nsAString_internal&)


5.0%	5.0%	dyld	pread
0.0%	5.0%	dyld	 dyld::loadPhase6(int, stat&, char const*, dyld::LoadContext const&)
0.0%	5.0%	dyld	  dyld::loadPhase5(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
0.0%	5.0%	dyld	   dyld::loadPhase4(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
0.0%	5.0%	dyld	    dyld::loadPhase3(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
0.0%	5.0%	dyld	     dyld::loadPhase1(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
0.0%	5.0%	dyld	      dyld::loadPhase0(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
0.0%	5.0%	dyld	       dyld::load(char const*, dyld::LoadContext const&)
0.0%	3.7%	dyld	        dyld::libraryLocator(char const*, bool, char const*, ImageLoader::RPathChain const*)
0.0%	3.7%	dyld	         ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&)
0.0%	2.3%	dyld	          ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, ImageLoader::RPathChain const&)
0.0%	1.3%	dyld	          ImageLoader::recursiveLoadLibraries(ImageLoader::LinkContext const&, bool, ImageLoader::RPathChain const&)
0.0%	1.3%	dyld	        dlopen


3.3%	3.3%	libSystem.B.dylib	read
0.0%	2.8%	libnspr4.dylib	 pt_Read
0.0%	1.3%	XUL	  nsFileInputStream::Read(char*, unsigned int, unsigned int*)
0.0%	1.0%	XUL	  nsChromeRegistry::ProcessManifest(nsILocalFile*, int)
0.0%	0.5%	XUL	  xptiInterfaceInfoManager::ReadXPTFile(nsILocalFile*, xptiWorkingSet*)
0.0%	0.3%	CoreFoundation	 _CFReadBytesFromFile
0.0%	0.2%	XUL	 IsLoadablePlugin(__CFURL const*)


3.0%	3.0%	dyld	ImageLoaderMachOCompressed::rebase(ImageLoader::LinkContext const&)
0.0%	3.0%	dyld	 ImageLoader::recursiveRebase(ImageLoader::LinkContext const&)
0.0%	1.8%	dyld	  ImageLoader::recursiveRebase(ImageLoader::LinkContext const&)
0.0%	1.2%	dyld	  ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, ImageLoader::RPathChain const&)

2.2%	2.2%	libSystem.B.dylib	__memcpy
0.0%	0.8%	XUL	 nsJARInputStream::Read(char*, unsigned int, unsigned int*)
0.0%	0.3%	ATS	 AnnexFileReadTableBytesByIndex
0.0%	0.2%	XUL	 nsCSubstringTuple::WriteTo(char*, unsigned int) const
0.0%	0.2%	libmozjs.dylib	 js_InternalGetOrSet
0.0%	0.2%	libSystem.B.dylib	 fread
0.0%	0.2%	CoreGraphics	 CGDataProviderCreateWithCopyOfData
0.0%	0.2%	libmozsqlite3.dylib	 balance
0.0%	0.2%	AppKit	 -[NSView frame]
(Assignee)

Comment 26

8 years ago
Here's another observation...

The stack below represents 30 samples and thus 300ms. The dynamic linker (dyld) ONLY uses pread to grab the Mach-O headers located in the first page of a library or executable. 

We should be able to cut this down drastically just by reducing the number of shared libraries and making Firefox more static (Bug 525013). It seems like the lowest-hanging fruit here, should I take care of it first?

	5.0%	5.0%	dyld	pread
	0.0%	5.0%	dyld	 dyld::loadPhase6(int, stat&, char const*, dyld::LoadContext const&)
	0.0%	5.0%	dyld	  dyld::loadPhase5(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
	0.0%	5.0%	dyld	   dyld::loadPhase4(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
	0.0%	5.0%	dyld	    dyld::loadPhase3(char const*, dyld::LoadContext const&, std::vector<char const*, std::allocator<char const*> >*)
(Reporter)

Updated

7 years ago
OS: Linux → Mac OS X
You need to log in before you can comment on or make changes to this bug.