Open Bug 531406 Opened 15 years ago Updated 1 year ago

Improve code locality in Firefox to speed up startup

Categories

(Firefox :: General, defect)

x86
macOS
defect

Tracking

()

People

(Reporter: joelr, Unassigned)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [ts])

Get gprof to spit out the following 4 symbol orderings and measure startup of each one:

1) Ordering based on a “closest is best” analysis of the profiling call graph. Calls that call each other frequently are placed close together.

2) Routines sorted by the number of calls made to each routine, largest numbers first.

3) Routines sorted by the order in which they are called.

4) Routines sorted by the amount of CPU time spent, largest times first.
A huge pain in the rear at the moment because 

1) there's some kind of an event processing issue when Firefox is compiled with '-pg', regardless of whether the build is static or dynamic, and

2) gprof complains that it cannot find the module of some symbol.

It's possible to get the 4 orderings above using DTrace but it's tricky. Using DTrace involves switching between kernel and user space for every function call and DTrace is prone to dropping samples when the load is high. 

I will try to identify and fix gprof first and if that doesn't work then go back to DTrace.
Now that you have a static build no reason to go with dtrace. gcc's -finstrument-functions will work.
Won't work. I specifically need the 4 orderings that gprof produces. I decided not to mess with it, though, and go with DTrace instead.
I used DTrace to create list of symbols ordered by time of invocation. Here are two sets of startup timings (ms) from running the dynamic, static and static ordered versions of Firefox.

4532, 3623, 3795, 4168, 3688 - static, ordered by time
5982, 4684, 5683, 5389, 4806 - static, unordered 
5437, 4043, 3873, 3733, 3875 - dynamic

5980, 3546, 3793, 3524, 3399 - static, ordered by time
6007, 4053, 3866, 3871, 3695 - static, unordered
5614, 3687, 3667, 3747, 3690 - dynamic

The static version of Firefox is clearly a loser. I think this is due to the sheer size of the binary that needs to be mapped into memory and later wired via page-ins. 

I will check the page-in pattern from the ordered version to see if access is more sequential than with the regular static version.
Here's the page-in pattern for unordered Firefox. 

All the numbers are in pages (4K), except for time which is nanoseconds. Mac OSX will load a maximum of 256 pages (1Mb) at a time, i.e. size of 256. 

+/- is the seek delta from last position. 0 represents sequential access when last offset plus last size equals current offset. 

Here we seek all over the place and have just 3 sequential page-ins. 

offset: 3323, size: 256, time: 414279, +/-: -1454
offset: 4777, size: 16, time: 73964, +/-: 4777
offset: 4369, size: 16, time: 112221, +/-: -424
offset: 4735, size: 32, time: 122770, +/-: 350
offset: 4424, size: 4, time: 102901, +/-: -343
offset: 4428, size: 4, time: 77085, +/-: 0
offset: 4413, size: 6, time: 105407, +/-: -19
offset: 4432, size: 22, time: 130101, +/-: 13
offset: 4472, size: 32, time: 132064, +/-: 18
offset: 4704, size: 31, time: 133458, +/-: 200
offset: 4385, size: 28, time: 114916, +/-: -350
offset: 4454, size: 18, time: 140073, +/-: 41
offset: 4504, size: 82, time: 198366, +/-: 32
offset: 4586, size: 96, time: 237939, +/-: 0
offset: 4682, size: 22, time: 122088, +/-: 0
offset: 4419, size: 5, time: 65103, +/-: -285
offset: 4793, size: 138, time: 191939, +/-: 369
offset: 3040, size: 256, time: 334308, +/-: -1891
offset: 4767, size: 10, time: 75411, +/-: 1471
offset: 4, size: 252, time: 450750, +/-: -3575
offset: 270, size: 256, time: 401204, +/-: -2408
offset: 256, size: 14, time: 94259, +/-: -536
offset: 2678, size: 256, time: 492825, +/-: 2422
offset: 2533, size: 145, time: 273003, +/-: -401
offset: 2934, size: 106, time: 226473, +/-: 2408
offset: 543, size: 249, time: 453945, +/-: -2497
offset: 2397, size: 136, time: 225833, +/-: 2127
offset: 1453, size: 130, time: 323284, +/-: 256
offset: 855, size: 256, time: 465968, +/-: -1678
offset: 1197, size: 256, time: 471632, +/-: 86
offset: 1613, size: 256, time: 407027, +/-: 160
offset: 1111, size: 86, time: 202304, +/-: -758
offset: 2062, size: 238, time: 475440, +/-: 479
offset: 792, size: 63, time: 175406, +/-: -1134
offset: 1583, size: 30, time: 121940, +/-: -717
offset: 1926, size: 136, time: 370778, +/-: 313
offset: 1869, size: 57, time: 162204, +/-: -193
offset: 526, size: 17, time: 53438, +/-: -329
This is the page-in pattern for Firefox that has been ordered by time. We have 8 sequential page-ins which I think is an improvement. 

offset: 5486, size: 12, time: 79564, +/-: 0
offset: 5498, size: 3, time: 82658, +/-: 0
offset: 4049, size: 256, time: 312504, +/-: -1452
offset: 3654, size: 256, time: 325010, +/-: -651
offset: 5533, size: 26, time: 98405, +/-: 1623
offset: 5591, size: 134, time: 328116, +/-: 32
offset: 4, size: 252, time: 479343, +/-: -5721
offset: 3301, size: 256, time: 283220, +/-: -748
offset: 3149, size: 152, time: 264897, +/-: -408
offset: 569, size: 117, time: 227914, +/-: -2580
offset: 5559, size: 16, time: 97267, +/-: 5559
offset: 5575, size: 16, time: 97930, +/-: 0
offset: 5119, size: 23, time: 119801, +/-: -472
offset: 5191, size: 32, time: 128020, +/-: 49
offset: 5501, size: 32, time: 125811, +/-: 278
offset: 5146, size: 5, time: 87448, +/-: -387
offset: 5151, size: 16, time: 48192, +/-: 0
offset: 5173, size: 7, time: 79466, +/-: 6
offset: 5180, size: 7, time: 88884, +/-: 0
offset: 5223, size: 7, time: 86280, +/-: 36
offset: 5230, size: 128, time: 284864, +/-: 0
offset: 5358, size: 128, time: 248585, +/-: 0
offset: 313, size: 256, time: 397044, +/-: 57
offset: 3910, size: 139, time: 254499, +/-: 3341
offset: 3013, size: 136, time: 266054, +/-: -288
offset: 686, size: 111, time: 219117, +/-: -1129
offset: 3557, size: 97, time: 282840, +/-: 2526
offset: 1326, size: 256, time: 439144, +/-: -1687
offset: 2083, size: 256, time: 374657, +/-: 501
offset: 1815, size: 143, time: 319270, +/-: -524
offset: 2526, size: 252, time: 526497, +/-: 568
offset: 2389, size: 137, time: 289802, +/-: 306
offset: 1585, size: 230, time: 317291, +/-: 899
offset: 797, size: 163, time: 278984, +/-: 0
offset: 1031, size: 215, time: 375700, +/-: 71
offset: 256, size: 57, time: 168328, +/-: -990
offset: 960, size: 71, time: 172067, +/-: 647
offset: 5142, size: 4, time: 63727, +/-: 1488
offset: 2868, size: 145, time: 300991, +/-: -2278
offset: 1958, size: 125, time: 250496, +/-: -820
offset: 5187, size: 4, time: 34477, +/-: 2798
offset: 1582, size: 3, time: 47434, +/-: 256
offset: 5167, size: 6, time: 39267, +/-: 3582
offset: 2339, size: 50, time: 200333, +/-: -187
offset: 1246, size: 80, time: 140012, +/-: -3945
The Mac OSX kernel will not page in more than 256 pages at a time so a small delta (+/-) may be due to read-ahead but a large delta is definitely something else.
Further improvement in the static unordered version due to -mdynamic-no-pic. Still running off of an old IDE hard drive over USB (USB2, I think).

4929, 3333, 3466, 3342, 3242

Note 15 sequential accesses in the page-in pattern. Is this the source of the ~300ms improvement? It sure looks like it to me!

offset: 5151, size: 16, time: 46986, +/-: 0
offset: 5173, size: 7, time: 81671, +/-: 6
offset: 5486, size: 12, time: 124383, +/-: 0
offset: 3654, size: 256, time: 265733, +/-: -651
offset: 5533, size: 26, time: 70254, +/-: 1623
offset: 5591, size: 134, time: 247655, +/-: 32
offset: 4, size: 252, time: 430322, +/-: -5721
offset: 5142, size: 4, time: 47272, +/-: 1093
offset: 569, size: 147, time: 149711, +/-: -4577
offset: 5559, size: 16, time: 77747, +/-: 5559
offset: 5575, size: 16, time: 69565, +/-: 0
offset: 5119, size: 23, time: 128996, +/-: -472
offset: 5191, size: 32, time: 76881, +/-: 49
offset: 5501, size: 32, time: 88982, +/-: 278
offset: 5146, size: 5, time: 82654, +/-: -387
offset: 5180, size: 7, time: 54475, +/-: 0
offset: 5223, size: 7, time: 61044, +/-: 36
offset: 5230, size: 128, time: 289040, +/-: 0
offset: 5358, size: 128, time: 255784, +/-: 0
offset: 5498, size: 3, time: 75185, +/-: 0
offset: 4049, size: 256, time: 319336, +/-: -1452
offset: 313, size: 256, time: 492806, +/-: 57
offset: 3910, size: 139, time: 170189, +/-: 3341
offset: 716, size: 143, time: 242053, +/-: 0
offset: 859, size: 129, time: 229672, +/-: 0
offset: 988, size: 164, time: 204180, +/-: 0
offset: 1152, size: 149, time: 204287, +/-: 0
offset: 2875, size: 242, time: 293529, +/-: 1574
offset: 3117, size: 116, time: 145850, +/-: 0
offset: 3233, size: 107, time: 238421, +/-: 0
offset: 3340, size: 106, time: 127002, +/-: 0
offset: 3446, size: 148, time: 253861, +/-: 1623
offset: 256, size: 57, time: 109993, +/-: -3190
offset: 1578, size: 245, time: 372633, +/-: 1265
offset: 3594, size: 60, time: 148379, +/-: 0
offset: 1823, size: 135, time: 192595, +/-: -516
offset: 1958, size: 125, time: 225817, +/-: -822
offset: 2379, size: 145, time: 294852, +/-: 296
offset: 5187, size: 4, time: 22489, +/-: 2663
offset: 1301, size: 25, time: 69267, +/-: -3890
offset: 1326, size: 252, time: 298040, +/-: -2328
offset: 2083, size: 256, time: 301756, +/-: 505
offset: 2524, size: 256, time: 401689, +/-: 566
This is the size of our __TEXT and __DATA segments and the biggest sections

__TEXT            5119
  __text          3771
  __cstring       117
  __const         279
  __unwind_info   57
  __eh_frame      865

__DATA            440
  __const         390
  __data          37
  
Following is the log from comment #6, this time showing what sections we are hitting. It looks like the time order of symbols is still far from optimal as we are not paging things in sequentially (few +/-: 0).

offset: 5486, size: 12,   +/-: 0,     __DATA,__const
offset: 5498, size: 3,    +/-: 0,     __DATA,__const
offset: 4049, size: 256,  +/-: -1452, __TEXT,__const
offset: 3654, size: 256,  +/-: -651,  __TEXT,__text
offset: 5533, size: 26,   +/-: 1623,  __DATA,__data
offset: 5591, size: 134,  +/-: 32     (LINKEDIT???)
offset: 4,    size: 252,  +/-: -5721, __TEXT,__text
offset: 3301, size: 256,  +/-: -748,  __TEXT,__text
offset: 3149, size: 152,  +/-: -408,  __TEXT,__text
offset: 569,  size: 117,  +/-: -2580, __TEXT,__text
offset: 5559, size: 16,   +/-: 5559,  __DATA,__bss
offset: 5575, size: 16,   +/-: 0      (???)
offset: 5119, size: 23,   +/-: -472,  __DATA,__nl_symbol_ptr (+__la...)
offset: 5191, size: 32,   +/-: 49,    __DATA,__const
offset: 5501, size: 32,   +/-: 278,   __DATA,__const
offset: 5146, size: 5,    +/-: -387,  __DATA,__const
offset: 5151, size: 16,   +/-: 0,     __DATA,__const
offset: 5173, size: 7,    +/-: 6,     __DATA,__const
offset: 5180, size: 7,    +/-: 0,     __DATA,__const
offset: 5223, size: 7,    +/-: 36,    __DATA,__const
offset: 5230, size: 128,  +/-: 0,     __DATA,__const
offset: 5358, size: 128,  +/-: 0,     __DATA,__const
offset: 313,  size: 256,  +/-: 57,    __TEXT,__text
offset: 3910, size: 139,  +/-: 3341,  __TEXT,__const
offset: 3013, size: 136,  +/-: -288,  __TEXT,__text
offset: 686,  size: 111,  +/-: -1129, __TEXT,__text
offset: 3557, size: 97,   +/-: 2526   __TEXT,__text
offset: 1326, size: 256,  +/-: -1687  __TEXT,__text
offset: 2083, size: 256,  +/-: 501    __TEXT,__text
offset: 1815, size: 143,  +/-: -524   __TEXT,__text
offset: 2526, size: 252,  +/-: 568    __TEXT,__text
offset: 2389, size: 137,  +/-: 306    __TEXT,__text
offset: 1585, size: 230,  +/-: 899    __TEXT,__text
offset: 797,  size: 163,  +/-: 0      __TEXT,__text
offset: 1031, size: 215,  +/-: 71     __TEXT,__text
offset: 256,  size: 57,   +/-: -990   __TEXT,__text
offset: 960,  size: 71,   +/-: 647    __TEXT,__text
offset: 5142, size: 4,    +/-: 1488,  __DATA,__const
offset: 2868, size: 145,  +/-: -2278  __TEXT,__text
offset: 1958, size: 125,  +/-: -820   __TEXT,__text
offset: 5187, size: 4,    +/-: 2798,  __DATA,__const
offset: 1582, size: 3,    +/-: 256    __TEXT,__text
offset: 5167, size: 6,    +/-: 3582,  __DATA,__const
offset: 2339, size: 50,   +/-: -187   __TEXT,__text
offset: 1246, size: 80,   +/-: -3945  __TEXT,__text
I'm gonna try to draw conclusions from all of the above...

1) A static build does seem to speed things up.

2) -mdynamic-no-pic further improves things in a static build.

3) Time-ordering of symbols is supposed to give the ideal code layout but page-ins are still far from sequential.

4) I have no idea how to make page-ins to be more sequential short of writing a custom linker to re-arrange data and code taking into account the page-in log.

Another thing that throws everything off is that our startup is not deterministic due to threading and Javascript, e.g. there's a difference between the function calls captured by DTrace on two runs of the same build:

diff time-order.log time-order.log.1 |more
631c631,632
< __ZN8nsThread8DispatchEP11nsIRunnablej
---
> __Z21NS_ProcessNextEvent_PP9nsIThreadi
> __ZN8nsThread16ProcessNextEventEiPi
633d633
< __ZN8nsThread8PutEventEP11nsIRunnable
634a635,636
> __ZN8nsThread8DispatchEP11nsIRunnablej
> __ZN8nsThread8PutEventEP11nsIRunnable
636,637d637
< __Z21NS_ProcessNextEvent_PP9nsIThreadi
< __ZN8nsThread16ProcessNextEventEiPi
643d642
< __ZN8nsThread11SetObserverEP17nsIThreadObserver
644a644
> __ZN8nsThread11SetObserverEP17nsIThreadObserver
654d653
< __ZN24nsSocketTransportService7ReleaseEv
655a655
> __ZN24nsSocketTransportService7ReleaseEv
1436d1435
< __ZL9NameMatchP12PLDHashTablePK15PLDHashEntryHdrPKv
1751a1751
> __Z13js_SetVersionP9JSContext9JSVersion
1965a1966
> __ZL9NameMatchP12PLDHashTablePK15PLDHashEntryHdrPKv
2161a2163
> JS_TriggerOperationCallback
I'm an idiot since I was running Firefox with 3 tabs open, 2 of which were SSL and so required NSS during startup. 

Here are the timings redone, starting with about:blank and Vlad's start.html page with the startup time.

5511, 3655, 3636, 3677, 3686 - dynamic
5632, 3478, 3445, 3859, 3892 - static, ordered

Ready for an amazing discovery?!!!

3374, 1527, 1413, 1424, 1422 - static, unordered, -mdynamic-no-pic

Just to prove that I'm not on crack... 

2952, 1353, 1432, 1433, 1478 - static, unordered, -mdynamic-no-pic
2979, 1406, 1448, 1407, 1435 - same as above

It appears that relocation of symbols (posiion-independent code, PIC) is the culprit on Mac OSX!

Wooot!
My apology, -mdynamic-no-pic has nothing to do with relocation [1].

"This option generates code that is not position-independent but has position-independent external references." and "This option may reduce code size and produce more efficient code." which is clearly happening to us.

[1] http://www.serc.iisc.ernet.in/ComputingFacilities/systems/Tesla_Doc/Intel_C%2B%2B_Compiler/main_cls/copts/common_options/option_mdynamic_no_pic.htm
This is why it's important to run all the tests together. 

6109, 3743, 3916, 3853, 3706 - just XUL is static
5364, 3817, 3668, 3646, 3589 - dynamic
6069, 4206, 4355, 4407, 4034 - static
5102, 3294, 3325, 3338, 3288 - static, no PIC

Yesterday's 50% improvement is nowhere to be found but PIC still shows a great improvement.
I wrote a blog post on optimizing binary layout on Linux.
http://blog.mozilla.com/tglek/2010/02/19/teaching-ld-to-optimize-binaries-for-startup/
Depends on: 549749
Blocks: 447581
Note to self: I think once .text is reordered, then I need to madvice the r-x mmaped areas of libxul.so up to the last symbol I ordered, and madvice the whole ro mmaped area of libxul.so.
Severity: normal → S3

The severity field for this bug is relatively low, S3. However, the bug has 11 votes.
:mossop, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dtownsend)

The last needinfo from me was triggered in error by recent activity on the bug. I'm clearing the needinfo since this is a very old bug and I don't know if it's still relevant.

Flags: needinfo?(dtownsend)
You need to log in before you can comment on or make changes to this bug.