Open
Bug 531406
Opened 15 years ago
Updated 1 year ago
Improve code locality in Firefox to speed up startup
Categories
(Firefox :: General, defect)
Tracking
()
NEW
People
(Reporter: joelr, Unassigned)
References
(Blocks 1 open bug, )
Details
(Whiteboard: [ts])
Get gprof to spit out the following 4 symbol orderings and measure startup of each one: 1) Ordering based on a “closest is best” analysis of the profiling call graph. Calls that call each other frequently are placed close together. 2) Routines sorted by the number of calls made to each routine, largest numbers first. 3) Routines sorted by the order in which they are called. 4) Routines sorted by the amount of CPU time spent, largest times first.
Reporter | ||
Comment 1•15 years ago
|
||
A huge pain in the rear at the moment because 1) there's some kind of an event processing issue when Firefox is compiled with '-pg', regardless of whether the build is static or dynamic, and 2) gprof complains that it cannot find the module of some symbol. It's possible to get the 4 orderings above using DTrace but it's tricky. Using DTrace involves switching between kernel and user space for every function call and DTrace is prone to dropping samples when the load is high. I will try to identify and fix gprof first and if that doesn't work then go back to DTrace.
Comment 2•15 years ago
|
||
Now that you have a static build no reason to go with dtrace. gcc's -finstrument-functions will work.
Reporter | ||
Comment 3•15 years ago
|
||
Won't work. I specifically need the 4 orderings that gprof produces. I decided not to mess with it, though, and go with DTrace instead.
Reporter | ||
Comment 4•15 years ago
|
||
I used DTrace to create list of symbols ordered by time of invocation. Here are two sets of startup timings (ms) from running the dynamic, static and static ordered versions of Firefox. 4532, 3623, 3795, 4168, 3688 - static, ordered by time 5982, 4684, 5683, 5389, 4806 - static, unordered 5437, 4043, 3873, 3733, 3875 - dynamic 5980, 3546, 3793, 3524, 3399 - static, ordered by time 6007, 4053, 3866, 3871, 3695 - static, unordered 5614, 3687, 3667, 3747, 3690 - dynamic The static version of Firefox is clearly a loser. I think this is due to the sheer size of the binary that needs to be mapped into memory and later wired via page-ins. I will check the page-in pattern from the ordered version to see if access is more sequential than with the regular static version.
Reporter | ||
Comment 5•15 years ago
|
||
Here's the page-in pattern for unordered Firefox. All the numbers are in pages (4K), except for time which is nanoseconds. Mac OSX will load a maximum of 256 pages (1Mb) at a time, i.e. size of 256. +/- is the seek delta from last position. 0 represents sequential access when last offset plus last size equals current offset. Here we seek all over the place and have just 3 sequential page-ins. offset: 3323, size: 256, time: 414279, +/-: -1454 offset: 4777, size: 16, time: 73964, +/-: 4777 offset: 4369, size: 16, time: 112221, +/-: -424 offset: 4735, size: 32, time: 122770, +/-: 350 offset: 4424, size: 4, time: 102901, +/-: -343 offset: 4428, size: 4, time: 77085, +/-: 0 offset: 4413, size: 6, time: 105407, +/-: -19 offset: 4432, size: 22, time: 130101, +/-: 13 offset: 4472, size: 32, time: 132064, +/-: 18 offset: 4704, size: 31, time: 133458, +/-: 200 offset: 4385, size: 28, time: 114916, +/-: -350 offset: 4454, size: 18, time: 140073, +/-: 41 offset: 4504, size: 82, time: 198366, +/-: 32 offset: 4586, size: 96, time: 237939, +/-: 0 offset: 4682, size: 22, time: 122088, +/-: 0 offset: 4419, size: 5, time: 65103, +/-: -285 offset: 4793, size: 138, time: 191939, +/-: 369 offset: 3040, size: 256, time: 334308, +/-: -1891 offset: 4767, size: 10, time: 75411, +/-: 1471 offset: 4, size: 252, time: 450750, +/-: -3575 offset: 270, size: 256, time: 401204, +/-: -2408 offset: 256, size: 14, time: 94259, +/-: -536 offset: 2678, size: 256, time: 492825, +/-: 2422 offset: 2533, size: 145, time: 273003, +/-: -401 offset: 2934, size: 106, time: 226473, +/-: 2408 offset: 543, size: 249, time: 453945, +/-: -2497 offset: 2397, size: 136, time: 225833, +/-: 2127 offset: 1453, size: 130, time: 323284, +/-: 256 offset: 855, size: 256, time: 465968, +/-: -1678 offset: 1197, size: 256, time: 471632, +/-: 86 offset: 1613, size: 256, time: 407027, +/-: 160 offset: 1111, size: 86, time: 202304, +/-: -758 offset: 2062, size: 238, time: 475440, +/-: 479 offset: 792, size: 63, time: 175406, +/-: -1134 offset: 1583, size: 30, time: 121940, +/-: -717 offset: 1926, size: 136, time: 370778, +/-: 313 offset: 1869, size: 57, time: 162204, +/-: -193 offset: 526, size: 17, time: 53438, +/-: -329
Reporter | ||
Comment 6•15 years ago
|
||
This is the page-in pattern for Firefox that has been ordered by time. We have 8 sequential page-ins which I think is an improvement. offset: 5486, size: 12, time: 79564, +/-: 0 offset: 5498, size: 3, time: 82658, +/-: 0 offset: 4049, size: 256, time: 312504, +/-: -1452 offset: 3654, size: 256, time: 325010, +/-: -651 offset: 5533, size: 26, time: 98405, +/-: 1623 offset: 5591, size: 134, time: 328116, +/-: 32 offset: 4, size: 252, time: 479343, +/-: -5721 offset: 3301, size: 256, time: 283220, +/-: -748 offset: 3149, size: 152, time: 264897, +/-: -408 offset: 569, size: 117, time: 227914, +/-: -2580 offset: 5559, size: 16, time: 97267, +/-: 5559 offset: 5575, size: 16, time: 97930, +/-: 0 offset: 5119, size: 23, time: 119801, +/-: -472 offset: 5191, size: 32, time: 128020, +/-: 49 offset: 5501, size: 32, time: 125811, +/-: 278 offset: 5146, size: 5, time: 87448, +/-: -387 offset: 5151, size: 16, time: 48192, +/-: 0 offset: 5173, size: 7, time: 79466, +/-: 6 offset: 5180, size: 7, time: 88884, +/-: 0 offset: 5223, size: 7, time: 86280, +/-: 36 offset: 5230, size: 128, time: 284864, +/-: 0 offset: 5358, size: 128, time: 248585, +/-: 0 offset: 313, size: 256, time: 397044, +/-: 57 offset: 3910, size: 139, time: 254499, +/-: 3341 offset: 3013, size: 136, time: 266054, +/-: -288 offset: 686, size: 111, time: 219117, +/-: -1129 offset: 3557, size: 97, time: 282840, +/-: 2526 offset: 1326, size: 256, time: 439144, +/-: -1687 offset: 2083, size: 256, time: 374657, +/-: 501 offset: 1815, size: 143, time: 319270, +/-: -524 offset: 2526, size: 252, time: 526497, +/-: 568 offset: 2389, size: 137, time: 289802, +/-: 306 offset: 1585, size: 230, time: 317291, +/-: 899 offset: 797, size: 163, time: 278984, +/-: 0 offset: 1031, size: 215, time: 375700, +/-: 71 offset: 256, size: 57, time: 168328, +/-: -990 offset: 960, size: 71, time: 172067, +/-: 647 offset: 5142, size: 4, time: 63727, +/-: 1488 offset: 2868, size: 145, time: 300991, +/-: -2278 offset: 1958, size: 125, time: 250496, +/-: -820 offset: 5187, size: 4, time: 34477, +/-: 2798 offset: 1582, size: 3, time: 47434, +/-: 256 offset: 5167, size: 6, time: 39267, +/-: 3582 offset: 2339, size: 50, time: 200333, +/-: -187 offset: 1246, size: 80, time: 140012, +/-: -3945
Reporter | ||
Comment 7•15 years ago
|
||
The Mac OSX kernel will not page in more than 256 pages at a time so a small delta (+/-) may be due to read-ahead but a large delta is definitely something else.
Reporter | ||
Comment 8•15 years ago
|
||
Further improvement in the static unordered version due to -mdynamic-no-pic. Still running off of an old IDE hard drive over USB (USB2, I think). 4929, 3333, 3466, 3342, 3242 Note 15 sequential accesses in the page-in pattern. Is this the source of the ~300ms improvement? It sure looks like it to me! offset: 5151, size: 16, time: 46986, +/-: 0 offset: 5173, size: 7, time: 81671, +/-: 6 offset: 5486, size: 12, time: 124383, +/-: 0 offset: 3654, size: 256, time: 265733, +/-: -651 offset: 5533, size: 26, time: 70254, +/-: 1623 offset: 5591, size: 134, time: 247655, +/-: 32 offset: 4, size: 252, time: 430322, +/-: -5721 offset: 5142, size: 4, time: 47272, +/-: 1093 offset: 569, size: 147, time: 149711, +/-: -4577 offset: 5559, size: 16, time: 77747, +/-: 5559 offset: 5575, size: 16, time: 69565, +/-: 0 offset: 5119, size: 23, time: 128996, +/-: -472 offset: 5191, size: 32, time: 76881, +/-: 49 offset: 5501, size: 32, time: 88982, +/-: 278 offset: 5146, size: 5, time: 82654, +/-: -387 offset: 5180, size: 7, time: 54475, +/-: 0 offset: 5223, size: 7, time: 61044, +/-: 36 offset: 5230, size: 128, time: 289040, +/-: 0 offset: 5358, size: 128, time: 255784, +/-: 0 offset: 5498, size: 3, time: 75185, +/-: 0 offset: 4049, size: 256, time: 319336, +/-: -1452 offset: 313, size: 256, time: 492806, +/-: 57 offset: 3910, size: 139, time: 170189, +/-: 3341 offset: 716, size: 143, time: 242053, +/-: 0 offset: 859, size: 129, time: 229672, +/-: 0 offset: 988, size: 164, time: 204180, +/-: 0 offset: 1152, size: 149, time: 204287, +/-: 0 offset: 2875, size: 242, time: 293529, +/-: 1574 offset: 3117, size: 116, time: 145850, +/-: 0 offset: 3233, size: 107, time: 238421, +/-: 0 offset: 3340, size: 106, time: 127002, +/-: 0 offset: 3446, size: 148, time: 253861, +/-: 1623 offset: 256, size: 57, time: 109993, +/-: -3190 offset: 1578, size: 245, time: 372633, +/-: 1265 offset: 3594, size: 60, time: 148379, +/-: 0 offset: 1823, size: 135, time: 192595, +/-: -516 offset: 1958, size: 125, time: 225817, +/-: -822 offset: 2379, size: 145, time: 294852, +/-: 296 offset: 5187, size: 4, time: 22489, +/-: 2663 offset: 1301, size: 25, time: 69267, +/-: -3890 offset: 1326, size: 252, time: 298040, +/-: -2328 offset: 2083, size: 256, time: 301756, +/-: 505 offset: 2524, size: 256, time: 401689, +/-: 566
Reporter | ||
Comment 9•15 years ago
|
||
This is the size of our __TEXT and __DATA segments and the biggest sections __TEXT 5119 __text 3771 __cstring 117 __const 279 __unwind_info 57 __eh_frame 865 __DATA 440 __const 390 __data 37 Following is the log from comment #6, this time showing what sections we are hitting. It looks like the time order of symbols is still far from optimal as we are not paging things in sequentially (few +/-: 0). offset: 5486, size: 12, +/-: 0, __DATA,__const offset: 5498, size: 3, +/-: 0, __DATA,__const offset: 4049, size: 256, +/-: -1452, __TEXT,__const offset: 3654, size: 256, +/-: -651, __TEXT,__text offset: 5533, size: 26, +/-: 1623, __DATA,__data offset: 5591, size: 134, +/-: 32 (LINKEDIT???) offset: 4, size: 252, +/-: -5721, __TEXT,__text offset: 3301, size: 256, +/-: -748, __TEXT,__text offset: 3149, size: 152, +/-: -408, __TEXT,__text offset: 569, size: 117, +/-: -2580, __TEXT,__text offset: 5559, size: 16, +/-: 5559, __DATA,__bss offset: 5575, size: 16, +/-: 0 (???) offset: 5119, size: 23, +/-: -472, __DATA,__nl_symbol_ptr (+__la...) offset: 5191, size: 32, +/-: 49, __DATA,__const offset: 5501, size: 32, +/-: 278, __DATA,__const offset: 5146, size: 5, +/-: -387, __DATA,__const offset: 5151, size: 16, +/-: 0, __DATA,__const offset: 5173, size: 7, +/-: 6, __DATA,__const offset: 5180, size: 7, +/-: 0, __DATA,__const offset: 5223, size: 7, +/-: 36, __DATA,__const offset: 5230, size: 128, +/-: 0, __DATA,__const offset: 5358, size: 128, +/-: 0, __DATA,__const offset: 313, size: 256, +/-: 57, __TEXT,__text offset: 3910, size: 139, +/-: 3341, __TEXT,__const offset: 3013, size: 136, +/-: -288, __TEXT,__text offset: 686, size: 111, +/-: -1129, __TEXT,__text offset: 3557, size: 97, +/-: 2526 __TEXT,__text offset: 1326, size: 256, +/-: -1687 __TEXT,__text offset: 2083, size: 256, +/-: 501 __TEXT,__text offset: 1815, size: 143, +/-: -524 __TEXT,__text offset: 2526, size: 252, +/-: 568 __TEXT,__text offset: 2389, size: 137, +/-: 306 __TEXT,__text offset: 1585, size: 230, +/-: 899 __TEXT,__text offset: 797, size: 163, +/-: 0 __TEXT,__text offset: 1031, size: 215, +/-: 71 __TEXT,__text offset: 256, size: 57, +/-: -990 __TEXT,__text offset: 960, size: 71, +/-: 647 __TEXT,__text offset: 5142, size: 4, +/-: 1488, __DATA,__const offset: 2868, size: 145, +/-: -2278 __TEXT,__text offset: 1958, size: 125, +/-: -820 __TEXT,__text offset: 5187, size: 4, +/-: 2798, __DATA,__const offset: 1582, size: 3, +/-: 256 __TEXT,__text offset: 5167, size: 6, +/-: 3582, __DATA,__const offset: 2339, size: 50, +/-: -187 __TEXT,__text offset: 1246, size: 80, +/-: -3945 __TEXT,__text
Reporter | ||
Comment 10•15 years ago
|
||
I'm gonna try to draw conclusions from all of the above... 1) A static build does seem to speed things up. 2) -mdynamic-no-pic further improves things in a static build. 3) Time-ordering of symbols is supposed to give the ideal code layout but page-ins are still far from sequential. 4) I have no idea how to make page-ins to be more sequential short of writing a custom linker to re-arrange data and code taking into account the page-in log. Another thing that throws everything off is that our startup is not deterministic due to threading and Javascript, e.g. there's a difference between the function calls captured by DTrace on two runs of the same build: diff time-order.log time-order.log.1 |more 631c631,632 < __ZN8nsThread8DispatchEP11nsIRunnablej --- > __Z21NS_ProcessNextEvent_PP9nsIThreadi > __ZN8nsThread16ProcessNextEventEiPi 633d633 < __ZN8nsThread8PutEventEP11nsIRunnable 634a635,636 > __ZN8nsThread8DispatchEP11nsIRunnablej > __ZN8nsThread8PutEventEP11nsIRunnable 636,637d637 < __Z21NS_ProcessNextEvent_PP9nsIThreadi < __ZN8nsThread16ProcessNextEventEiPi 643d642 < __ZN8nsThread11SetObserverEP17nsIThreadObserver 644a644 > __ZN8nsThread11SetObserverEP17nsIThreadObserver 654d653 < __ZN24nsSocketTransportService7ReleaseEv 655a655 > __ZN24nsSocketTransportService7ReleaseEv 1436d1435 < __ZL9NameMatchP12PLDHashTablePK15PLDHashEntryHdrPKv 1751a1751 > __Z13js_SetVersionP9JSContext9JSVersion 1965a1966 > __ZL9NameMatchP12PLDHashTablePK15PLDHashEntryHdrPKv 2161a2163 > JS_TriggerOperationCallback
Reporter | ||
Comment 11•15 years ago
|
||
I'm an idiot since I was running Firefox with 3 tabs open, 2 of which were SSL and so required NSS during startup. Here are the timings redone, starting with about:blank and Vlad's start.html page with the startup time. 5511, 3655, 3636, 3677, 3686 - dynamic 5632, 3478, 3445, 3859, 3892 - static, ordered Ready for an amazing discovery?!!! 3374, 1527, 1413, 1424, 1422 - static, unordered, -mdynamic-no-pic Just to prove that I'm not on crack... 2952, 1353, 1432, 1433, 1478 - static, unordered, -mdynamic-no-pic 2979, 1406, 1448, 1407, 1435 - same as above It appears that relocation of symbols (posiion-independent code, PIC) is the culprit on Mac OSX! Wooot!
Reporter | ||
Comment 12•15 years ago
|
||
My apology, -mdynamic-no-pic has nothing to do with relocation [1]. "This option generates code that is not position-independent but has position-independent external references." and "This option may reduce code size and produce more efficient code." which is clearly happening to us. [1] http://www.serc.iisc.ernet.in/ComputingFacilities/systems/Tesla_Doc/Intel_C%2B%2B_Compiler/main_cls/copts/common_options/option_mdynamic_no_pic.htm
Reporter | ||
Comment 13•15 years ago
|
||
This is why it's important to run all the tests together. 6109, 3743, 3916, 3853, 3706 - just XUL is static 5364, 3817, 3668, 3646, 3589 - dynamic 6069, 4206, 4355, 4407, 4034 - static 5102, 3294, 3325, 3338, 3288 - static, no PIC Yesterday's 50% improvement is nowhere to be found but PIC still shows a great improvement.
Comment 14•14 years ago
|
||
I wrote a blog post on optimizing binary layout on Linux. http://blog.mozilla.com/tglek/2010/02/19/teaching-ld-to-optimize-binaries-for-startup/
Comment 15•14 years ago
|
||
Note to self: I think once .text is reordered, then I need to madvice the r-x mmaped areas of libxul.so up to the last symbol I ordered, and madvice the whole ro mmaped area of libxul.so.
Updated•2 years ago
|
Severity: normal → S3
Comment 16•1 year ago
|
||
The severity field for this bug is relatively low, S3. However, the bug has 11 votes.
:mossop, could you consider increasing the bug severity?
For more information, please visit auto_nag documentation.
Flags: needinfo?(dtownsend)
Comment 17•1 year ago
|
||
The last needinfo from me was triggered in error by recent activity on the bug. I'm clearing the needinfo since this is a very old bug and I don't know if it's still relevant.
Flags: needinfo?(dtownsend)
You need to log in
before you can comment on or make changes to this bug.
Description
•