Closed Bug 651889 Opened 9 years ago Closed 9 years ago
Add support for dynamic symbol lookup in the dynamic linker
The Android dynamic linker doesn't support dynamic symbol lookup, which means it does all the symbol lookup during startup. Avoiding that should improve our time before main by 5 to 10%, though part of that would be postponed to some time later during the startup process.
(Oops, I forgot the -e option to bzexport) This is work in progress. It has been tested to work with THUMB, though it theoretically should work with ARM. It currently has bad performance due to the way reloc_library works currently. Once reloc_library is improved, we'll be able to see how it really performs.
Target Milestone: --- → mozilla6
This is basically the same as the previous one, except it doesn't need to fix the runtime_reloc function address manually in the code. It also avoids reloc_library mmapping a page when it doesn't need to. On my Nexus S, here are the timings I get from GeckoLibLoad: normal: ~ 180ms on the main process, ~ 95ms on the content process patched: ~ 130ms on the main process, ~ 69ms on the content process (note I see more variance on the main process in the patched case,but not on the content process) These are, obviously, numbers with the cached libraries (i.e. no decompression during startup) While there is an nice improvement at startup time, there is a little price to pay, as I suggested in comment 0, at runtime. I timed the overall time taken by PLT relocations before (thus all of them) vs the overall time taken by all individual dynamic PLT relocations with the patch (thus only a part of them, and only those necessary during startup and a few seconds after). This is what they look like: normal: ~ 47ms patched: ~ 17ms as a reference, the previous patch, that didn't modify reloc_library, spent ~ 28ms, so it was already an actual win. The average time spent on individual PLT relocations is 26 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂµs, so I don't expect much overhead at runtime. They only ever happen once (obviously). So, all in all we're looking at a possible (actual) 30ms startup improvement on a Nexus S (compared to the 80ms suggested by the GeckoLibLoad timings). As a reference, GeckoLibLoad on the main process without cached libraries (i.e. with decompression) takes between 750 and 800ms on the same device. I'll do some more timings to see if we can improve the runtime relocation process further, but I think we can do that in a followup.
Attachment #537115 - Attachment is obsolete: true
(In reply to comment #3) > The average time spent on individual PLT relocations is 26 µs, so I don't > expect much overhead at runtime. They only ever happen once (obviously). Part of the overhead was actually the logging, so individual PLT relocations are actually faster than that. This also means my timings above are greatly affected by logging...
Comment on attachment 537739 [details] [diff] [review] Add support for dynamic symbol lookup in the Android dynamic linker I'd like a review of this. I'll get better performance numbers once I figure out what's wrong with logging.
So, I just completely got rid of my logging and only took the data we already write to logcat, with bug 664934 on top. Here are means over 20 attempts on my Nexus S, with 95% confidence interval, off cc18551d5cc3: with library cache, warm startup: GeckoLibLoad, main process: 147.90ms ± 5.73% total, 56.65ms ± 8.45% user, 27.00ms ± 12.32% system GeckoLibLoad, content process: 84.05ms ± 1.33% total, 26.80ms ± 8.91% user, 13.40ms ± 16.59% system same, with dynamic symbol resolution: GeckoLibLoad, main process: 111.35ms ± 5.86% total, 34.70ms ± 10.57% user, 25.60ms ± 14.48% system GeckoLibLoad, content process: 56.60ms ± 2.52% total, 12.05ms ± 19.58% user, 15.20ms ± 17.17% system This is about 36ms saved off main process, and 27ms off content process, which are cumulative, since both processes can't initialize in parallel. A pessimistic evaluation of the time spent on dynamic relocations during startup is 10ms. We're looking at more than 50ms improvement, which is a lot compared to these idealistic times (library cache + warm startup is best case). I however wonder, as said in bug 664934, what it's doing to explain the difference between total and user + system, considering this is warm startup and thus there is no i/o. with library cache, cold startup: GeckoLibLoad, main process: 899.40ms ± 3.51% total, 130.45ms ± 8.80% user, 65.00ms ± 7.41% system GeckoLibLoad, content process: 84.60ms ± 1.63% total, 29.20ms ± 10.70% user, 12.20ms ± 21.76% system same, with dynamic symbol resolution: GeckoLibLoad, main process: 915.10ms ± 3.16% total, 116.10ms ± 10.59% user, 66.95ms ± 7.62% system GeckoLibLoad, content process: 57.45ms ± 2.23% total, 14.00ms ± 17.28% user, 13.20ms ± 17.51% system Here, we see a slight regression on the total time in the main process, though we have an improvement on the user + system time. Here, we are I/O bound, and I think we hit the case where the new access pattern with dynamic symbol resolution does a lot more random access than before. I suspect we could improve the situation with preloading. Overall, it's however still a tiny win: 16ms lost on the main process, 27ms on the content process, making it somewhere between 1 and 5ms win (considering the 10ms estimate for the dynamic symbol resolution is really pessimistic). without library cache, warm startup: GeckoLibLoad, main process: 782.10ms ± 0.78% total, 516.90ms ± 1.63% user, 116.40ms ± 7.57% system GeckoLibLoad, content process: 83.95ms ± 1.46% total, 26.60ms ± 9.39% user, 14.20ms ± 18.93% system same, with dynamic symbol resolution: GeckoLibLoad, main process: 766.25ms ± 0.67% total, 513.50ms ± 1.39% user, 107.2ms ± 6.06% system GeckoLibLoad, content process: 58.85ms ± 2.85% total, 14.40ms ± 9.69% user, 14.80ms ± 11.94% system Here the user and system numbers on main process are somehow suspicious, but overall, it's still a win: around 16ms on main process, and 25ms on content process, overall around 30ms improvement, which, compared to 782.10 + 83.95 is around 3.5%. without library cache, cold startup: GeckoLibLoad, main process: 844.40ms ± 0.85% total, 536.05ms ± 1.56% user, 125.20ms ± 6.66% system GeckoLibLoad, content process: 85.20ms ± 1.85% total, 25.20ms ± 9.16% user, 16.00ms ± 15.59% system same, with dynamic symbol resolution: GeckoLibLoad, main process: 829.90ms ± 0.60% total, 525.80ms ± 1.51% user, 121.50ms ± 6.35% system GeckoLibLoad, content process: 60.40ms ± 1.64% total, 13.40ms ± 15.30% user, 14.60ms ± 12.41% system Here we are looking at around 14ms improvement on the main process and 25ms improvement on the content process, making it an overall 30ms improvement, which, compared to 844.40 + 85.20 is around 3.2%. There is something else interesting in these number, though, in that in cold startup cases, library cache is actually slower than no library cache, meaning on a Nexus S, decompressing is faster than reading the decompressed data from the internal flash. Maybe preloading the library cache would help improve this. Note that the numbers for the content process, since it's starting after the main process, are all with warm cache, and are more or less stable overall, as well as the improvement dynamic symbol resolution buys us on it.
Filed bug 664965 for library cache preloading.
Comment on attachment 544477 [details] [diff] [review] Add support for dynamic symbol lookup in the Android dynamic linker r=me with some reluctance, as I don't believe this is the right place to find startup wins. It would be nice if this could be pushed upstream however. It might be the only modification to the android dynamic linker that is widely applicable and easy to take advantage of.
Attachment #544477 - Flags: review?(mwu) → review+
(In reply to Michael Wu [:mwu] from comment #9) > Comment on attachment 544477 [details] [diff] [review] [diff] [details] [review] > Add support for dynamic symbol lookup in the Android dynamic linker > > r=me with some reluctance, as I don't believe this is the right place to > find startup wins. IMHO everything that lowers the time to XRE_main() is a good win to have, especially when considering the time spent before XRE_main() on Android. > It would be nice if this could be pushed upstream however. It might be the > only modification to the android dynamic linker that is widely applicable > and easy to take advantage of. The patch here, though, is partly specific to our copy, and ARM-only. I'll submit a stripped down version upstream.
Component: Widget: Android → Widget: BeOS
Whiteboard: [ts] → [ts] [inbound]
Assignee: mh+mozilla → nobody
Component: Widget: BeOS → Widget: Android
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [ts] [inbound] → [ts]
Target Milestone: --- → mozilla8
You need to log in before you can comment on or make changes to this bug.