Last Comment Bug 658074 - Android NDK5 perf regressions
: Android NDK5 perf regressions
Status: RESOLVED FIXED
[inbound]
: mobile, perf
Product: Core
Classification: Components
Component: IPC (show other bugs)
: Trunk
: ARM Android
: -- normal with 1 vote (vote)
: mozilla8
Assigned To: Jim Chen [:jchen] [:darchons]
:
Mentors:
Depends on: 617115
Blocks:
  Show dependency treegraph
 
Reported: 2011-05-18 14:43 PDT by Alon Zakai (:azakai)
Modified: 2011-07-13 02:40 PDT (History)
19 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Fennec with NDK5 showing unresponsive script warning. (1) (61.02 KB, image/png)
2011-06-13 06:50 PDT, Jacob Bramley [:jbramley]
no flags Details
Fennec with NDK5 showing unresponsive script warning. (2) (68.00 KB, image/png)
2011-06-13 06:51 PDT, Jacob Bramley [:jbramley]
no flags Details
Fennec with NDK4 showing much shorter fennec_unofficial bursts, for comparison. (66.31 KB, image/png)
2011-06-13 06:55 PDT, Jacob Bramley [:jbramley]
no flags Details
Annotated Profile (NDK5) (47.30 KB, image/png)
2011-06-22 07:22 PDT, Jacob Bramley [:jbramley]
no flags Details
Patch to optimize string usage in IPC (2.65 KB, patch)
2011-06-29 15:31 PDT, Jim Chen [:jchen] [:darchons]
no flags Details | Diff | Review
Patch to optimize string usage in IPC (v1.1) (2.40 KB, patch)
2011-07-05 11:22 PDT, Jim Chen [:jchen] [:darchons]
cjones.bugs: review+
Details | Diff | Review

Description Alon Zakai (:azakai) 2011-05-18 14:43:04 PDT
In bug 617115 we are preparing to move mobile from the Android NDK4 to NDK5. The newer NDK is supposed to generate faster code, supposedly 20% or so. We are seeing the following:

1. Canvas benchmarks are 25% faster (c5bench.com) (which sort of confirms the NDK5 claims.)
2. JS benchmarks with JIT *disabled*, like SunSpider and V8, are 12-15% faster.
3. JS benchmarks with JIT *enabled*: SunSpider and V8 are about the same speed, Kraken is slightly slower. Specifically the 'imaging' tests in Kraken are 10-30% slower, other parts in Kraken seem about the same as before.

So it looks like JIT stuff is slower while normal compiled code is faster. We are seeing this on a variety of Android devices, from an old 2.1 Galaxy S to a 2.3.3 HTC Desire.

The JIT slowdown is most noticeable in Kraken's imaging tests, but I suspect we are also slower in V8, since I think we spend some time in GC there - which is the kind of code I'd expect to be speeded up. So I suspect a GC speedup is being masked by a JIT slowdown there.

STR:

Build Fennec as described in https://wiki.mozilla.org/Mobile/Fennec/Android#Developing_Fennec_for_Android

Get both the NDK4 and 5 from http://developer.android.com/sdk/ndk/index.html

For NDK4, use this mozconfig (changing paths accordingly)

mk_add_options MOZ_OBJDIR=../android/fennec
ac_add_options --with-android-sdk="/home/alon/Dev/android/android-sdk-linux_x86/platforms/android-8"
ac_add_options --with-android-tools="/home/alon/Dev/android/android-sdk-linux_x86/tools"
 
# Android NDK4
ac_add_options --with-android-ndk="/home/alon/Dev/android/android-ndk-r4c"
ac_add_options --target=arm-android-eabi
 
ac_add_options --enable-application=mobile
ac_add_options --with-endian=little
ac_add_options --disable-tests
export MOZ_DEBUG_SYMBOLS=1


For NDK5 use this mozconfig

mk_add_options MOZ_OBJDIR=../android/fennec
ac_add_options --with-android-sdk="/home/alon/Dev/android/android-sdk-linux_x86/platforms/android-8"
ac_add_options --with-android-tools="/home/alon/Dev/android/android-sdk-linux_x86/tools"
 
# Android NDK5
ac_add_options --with-android-ndk="/home/alon/Dev/android/android-ndk-r5b"
ac_add_options --with-android-version=5
ac_add_options --with-android-toolchain="/home/alon/Dev/android/android-ndk-r5b/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86"
ac_add_options --with-android-platform="/home/alon/Dev/android/android-ndk-r5b/platforms/android-5/arch-arm"
ac_add_options --target=arm-linux-androideabi
ac_add_options --disable-elf-hack
 
ac_add_options --enable-application=mobile
ac_add_options --with-endian=little
ac_add_options --disable-tests
export MOZ_DEBUG_SYMBOLS=1

(until bug 617115 lands, also need the patches in that bug).
Comment 1 Matt Brubeck (:mbrubeck) 2011-05-18 15:09:55 PDT
(In reply to comment #0)
> Get both the NDK4 and 5 from http://developer.android.com/sdk/ndk/index.html

Actually, you will need to get NDK r4 from here:
http://ftp.mozilla.org/pub/mozilla.org/mobile/source/android-ndk-r4c-0moz3.tar.bz2

This is based on the official NDK r4 sources, but modified to include C++ standard library files and queue.h (bug 569760).  The tarball linked above is prebuilt for Linux x86, and is the version used on Mozilla's build infrastructure.
Comment 2 Alon Zakai (:azakai) 2011-05-18 15:28:15 PDT
Btw, I'd be happy to create builds for anyone that wants to investigate this issue. Feel free to ping me on irc.
Comment 3 Brad Lassey [:blassey] (use needinfo?) 2011-05-26 13:32:05 PDT
the patches needed to build with ndk5 have landed on trunk except for breakpad changes, (which aren't needed for js shell).
Comment 4 Alon Zakai (:azakai) 2011-05-27 10:31:34 PDT
The breakpad changes have landed as well, so everything needed is in m-c.
Comment 5 Chris Leary [:cdleary] (not checking bugmail) 2011-05-27 15:14:01 PDT
So, the generated JIT code is pretty much guaranteed not to change from NDK to NDK. The assembly that's generated will be the same before and after, so for each of these regressions it's either 1) related to the compiled code performance, which is more likely, or 2) related to the way that the new NDK handles OS-interfacing stuff, like marking allocated memory as executable and jumping to it.

To be honest I have no idea how to profile the JS shell on android. For the prior JIT work I did I could inspect the generated ARM assembly, which is easier than identifying slow bits of unknown parts of JS' C++ runtime. Maybe we could delegate this to somebody who has more C++-on-android profiling experience? Taking the delta of the prior NDK's profile and the current one should point out all the C++ code that has actually gotten slower.
Comment 6 Alon Zakai (:azakai) 2011-05-28 11:54:49 PDT
(In reply to comment #5)
> So, the generated JIT code is pretty much guaranteed not to change from NDK
> to NDK.

Hmm, does the JIT optionally use things like NEON or Thumb? (Maybe detection of those features changed in the new NDK?)

> To be honest I have no idea how to profile the JS shell on android.

Not clear to me either. Last I heard, we didn't have valgrind or oprofile running on Android which is very sad. I heard stechz was using pcwalton's piranha profiler.
Comment 7 Jacob Bramley [:jbramley] 2011-05-31 01:55:00 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > So, the generated JIT code is pretty much guaranteed not to change from NDK
> > to NDK.
> 
> Hmm, does the JIT optionally use things like NEON or Thumb? (Maybe detection
> of those features changed in the new NDK?)

The JIT doesn't use Thumb or NEON. It does make use of VFP, and it can use some ARMv7 instructions. However, I would expect the impact of a failed detection to be rather different than what we see. For example, some of the maths routines would surely be affected by VFP as much as the imaging tests, and the ARMv7 instructions should benefit everything approximately equally. This doesn't rule it out as a possibility, of course.

> > To be honest I have no idea how to profile the JS shell on android.
> 
> Not clear to me either. Last I heard, we didn't have valgrind or oprofile
> running on Android which is very sad. I heard stechz was using pcwalton's
> piranha profiler.

I've also been struggling to get a profiler to work on Android. ARM's Streamline profiler can do it (very effectively) but I had some unrelated trouble with the board I was using so got set back a bit. Streamline needs a kernel module, just like OProfile, so profiling on a real phone is not straightforward.
Comment 8 Chris Leary [:cdleary] (not checking bugmail) 2011-05-31 11:46:56 PDT
(In reply to comment #7)

Jacob, do you feel comfortable taking assignment, since I don't have the means necessary to see this bug closed?
Comment 9 Jacob Bramley [:jbramley] 2011-06-01 01:08:39 PDT
(In reply to comment #8)
> (In reply to comment #7)
> 
> Jacob, do you feel comfortable taking assignment, since I don't have the
> means necessary to see this bug closed?

Yep, I can take it, though if anyone knows anything potentially useful about the Android toolchains and libraries then I'd be quite happy to take advice!

I notice that NDK5 includes a C++ library, so we no longer need the NDK modification that we had for NDK4. I wonder if the C++ stuff for NDK4 is faster, somehow. A profile run should show that up, anyway, but it's at least a potential culprit here.
Comment 10 Jacob Bramley [:jbramley] 2011-06-09 07:14:37 PDT
Ok, after battling with my board to try to make it useful, I've got a rough profile run of Fennec built with NDK4 and NDK5, running the V8 and Kraken suites. The results aren't terribly enlightening because the interesting bits seem to be hidden behind areas without symbols. This is normally evidence of JavaScript code, but it's possible that I've just mis-loaded the symbols somehow.

The only meaningful observation I've been able to make so far is that in NDK5 (running Kraken's imaging tests), we're spending about 25% of our time in org.mozilla.fennec_unofficial and about 65% of the time in plugin-container, but in NDK4 about 90% of the time is spent in plugin-container. Basically, something in the fennec_unofficial process is getting much slower with NDK5, but I'm not sure exactly what it is (or what each process is responsible for, even).

Actually I was quite surprised to see plugin-container so high up the profile as there are no plugins on the page. Does everything but the browser chrome get run in the plugin-container?

Something interesting was that in NDK4, most of the time is spent in plugin-container, but with occasional bursts of action in fennec_unofficial. These bursts occur perhaps every 20-30 seconds and take about 2 seconds. The spike occurs in libc.so and fennec_unofficial itself (though for some reason there's no other symbol information). It doesn't seem to have much overall impact.

NDK5 is rather different: I see the same bursts, but they last much longer. The activity during these longer bursts is concentrated in the JavaScript engine. I see lots of js:: functions as well as a significant anonymous region, which I assume is JIT-compiled code.

Another observation is that the test uses _approximately_ 288MB under NDK4, but just 235MB under NDK5.

I thought it would be more useful to profile the JIT as a standalone shell application, but I get a weird "text busy" error when I try to run it. Investigation pointed to some unlikely-looking file system error, so I think something might be getting mixed up here.
Comment 11 Brian Crowder 2011-06-09 08:10:46 PDT
plugin_container is misnamed for the electrolysis subprocess in Fennec (commonly called the "content" process), which is precisely where these benchmarks would run.  Basically, your analysis makes perfect sense in that regard; it's precisely what has been reported here.
Comment 12 Mark Finkle (:mfinkle) (use needinfo?) 2011-06-09 09:18:24 PDT
(In reply to comment #10)

> Actually I was quite surprised to see plugin-container so high up the
> profile as there are no plugins on the page. Does everything but the browser
> chrome get run in the plugin-container?

plugin_container _is_ the web content process. As Brian notes, the process is misnamed.

> Another observation is that the test uses _approximately_ 288MB under NDK4,
> but just 235MB under NDK5.

This is great news. Less memory usage is great.
Comment 13 Boris Zbarsky [:bz] 2011-06-09 10:10:29 PDT
So it sounds like with NDK5 the chrome process is taking a lot more CPU...  Is it possible to breakpoint during one of those CPU usage points and examine the JS stack to see exactly what JS is running in the chrome process there?
Comment 14 Jacob Bramley [:jbramley] 2011-06-13 06:50:18 PDT
Created attachment 538877 [details]
Fennec with NDK5 showing unresponsive script warning. (1)

Here's another observation: the unresponsive script warning seems to pop up at the start of an activity burst in fennec_unofficial. The burst then gets interrupted, of course, but resumes again once the the warning is dismissed. Is there some kind of internal tick or event queue that is being serviced here?

I'll try to get a debugger to stop in a useful spot, though it might be tricky.
Comment 15 Jacob Bramley [:jbramley] 2011-06-13 06:51:32 PDT
Created attachment 538878 [details]
Fennec with NDK5 showing unresponsive script warning. (2)

Another example of the unresponsive script warning interrupting Kraken's imaging tests.
Comment 16 Jacob Bramley [:jbramley] 2011-06-13 06:52:56 PDT
(In reply to comment #12)
> > Another observation is that the test uses _approximately_ 288MB under NDK4,
> > but just 235MB under NDK5.
> 
> This is great news. Less memory usage is great.

Take that with a massive handful of salt. The memory usage statistic is system-wide, and whilst I didn't change anything other than Fennec itself, there are too many variables for it to be a reliable measurement.
Comment 17 Jacob Bramley [:jbramley] 2011-06-13 06:55:59 PDT
Created attachment 538880 [details]
Fennec with NDK4 showing much shorter fennec_unofficial bursts, for comparison.
Comment 18 Alon Zakai (:azakai) 2011-06-13 10:10:29 PDT
I think we expect to get unresponsive script warnings during benchmarks like kraken. Usually we need to set the script timeout (dom.max_script_run_time IIRC) in order to avoid that warning messing up the benchmark results.

In general we would expect to see little or no activity in the parent process (fennec_unofficial) during a benchmark (that doesn't render anything). Whereas the child process (the misleadingly named plugin-container) should max out the CPU. So any activity in the parent is suspicious.

If the script timeout warning is disabled (by setting dom.max_script_run_time to something very big), is there still any activity in the parent process? If so, that sounds very wrong.

It seems hard to explain why changing NDK would affect this though. Random guess: Perhaps we have an event loop that sleeps or waits on a condition var, and somehow in the new NDK that turns into a busy wait?
Comment 19 Jacob Bramley [:jbramley] 2011-06-14 03:24:41 PDT
(In reply to comment #18)
> If the script timeout warning is disabled (by setting
> dom.max_script_run_time to something very big), is there still any activity
> in the parent process? If so, that sounds very wrong.

Ah, I'd set it to 0 to disable it, but that had no effect. I don't know why I didn't think to set it really high. In any case, disabling the timeout warning just stops the interruptions. I still see the activity bursts in fennec_unofficial, but they're just not interrupted with periods of inactivity (whilst the warning was showing).

> It seems hard to explain why changing NDK would affect this though. Random
> guess: Perhaps we have an event loop that sleeps or waits on a condition
> var, and somehow in the new NDK that turns into a busy wait?

That sounds likely. A compiler change (with new optimizations) could introduce such behaviour.
Comment 20 Alon Zakai (:azakai) 2011-06-16 09:36:18 PDT
Did some profiling investigation with dougt. Turns out there is some expected activity in the parent process:

1. Kraken loads each individual benchmark as an html document. So we get IPC messages to change the title from child to parent.
2. The benchmarks do document.write or innerhtml = to write out data. This uses a wyciwyg channel - which is remoted from the child to the parent. There is also caching on this channel, which also triggers IPC messages.

However, none of these should take a substantial amount of processing, and they occur before or after the actual benchmarks run, so they should not impact the scores. So these cannot explain the speed regression we are seeing - it must be something else.

Aside from this stuff, nothing else appears to be running in the parent.
Comment 21 Jacob Bramley [:jbramley] 2011-06-22 07:22:32 PDT
Created attachment 541039 [details]
Annotated Profile (NDK5)

This annotated profile shows that the extra bursts don't coincide with the JavaScript tests (perhaps indicating that there are two separate issues here). However, the pattern is quite different to what I saw before so something else must have changed along the way.

I modified the benchmark to send and cancel the annotations (outside the timing loop), but it is otherwise unmodified.
Comment 22 Alon Zakai (:azakai) 2011-06-22 09:52:57 PDT
Very interesting. If I read that correctly, then activity in the parent happens right as the benchmarks begin. Which makes sense, since we load an HTML document at that point which involves the parent (so we should see both child and parent activity at that point). In theory that shouldn't slow down benchmarks however, unless the benchmark timing includes the HTML parse time etc., in which case a slowdown in our processing in the parent could regress performance.

jchen, once you get piranha or oprofile set up, can you please compare NDK4 and NDK5 builds here? That might help us understand which specific functions are slower. (Note that building NDK5 is tricky right now, see bug 663214. There is a workaroung there though.)
Comment 23 Jacob Bramley [:jbramley] 2011-06-28 08:26:22 PDT
Hmm. I just double-checked, and I don't actually see much difference between NDK4 and NDK5 on the imaging test. Yes, I get the weird profile view, but the actual results vary very little. NDK4 got me about 40990ms for the imaging tests, and NDK5 got me about 41684ms.

The strange regions in fennec_unofficial occur in libc.so. I don't have symbols for my system libraries so I don't know what it's doing, but I'll try to work out how it's getting there.
Comment 24 Jim Chen [:jchen] [:darchons] 2011-06-28 11:36:03 PDT
(In reply to comment #23)
> The strange regions in fennec_unofficial occur in libc.so. I don't have
> symbols for my system libraries so I don't know what it's doing, but I'll
> try to work out how it's getting there.

I've been profiling NDK4 and NDK5 builds running Kraken too.

Here's a typical NDK4 profile:
samples  %        image name               app name                 symbol name
21591    30.5380  anon (tgid:2961 range:0x41a53000-0x41adb000) plugin-container         anon (tgid:2961 range:0x41a53000-0x41adb000)
9808     13.8723  anon (tgid:2961 range:0x41adb000-0x41b11000) plugin-container         anon (tgid:2961 range:0x41adb000-0x41b11000)
4411      6.2389  anon (tgid:2961 range:0x41a53000-0x41a91000) plugin-container         anon (tgid:2961 range:0x41a53000-0x41a91000)
3868      5.4708  anon (tgid:2961 range:0x41adb000-0x41af0000) plugin-container         anon (tgid:2961 range:0x41adb000-0x41af0000)
3731      5.2771  libm.so                  libm.so                  /system/lib/libm.so
2465      3.4865  anon (tgid:2961 range:0x42900000-0x42b00000) plugin-container         anon (tgid:2961 range:0x42900000-0x42b00000)
1880      2.6590  libc.so                  libc.so                  /system/lib/libc.so
1564      2.2121  anon (tgid:2961 range:0x41a53000-0x41a90000) plugin-container         anon (tgid:2961 range:0x41a53000-0x41a90000)
1324      1.8726  anon (tgid:2961 range:0x41a95000-0x41ad5000) plugin-container         anon (tgid:2961 range:0x41a95000-0x41ad5000)
876       1.2390  anon (tgid:2961 range:0x41a53000-0x41ad6000) plugin-container         anon (tgid:2961 range:0x41a53000-0x41ad6000)
835       1.1810  anon (tgid:2961 range:0x41b11000-0x41d00000) plugin-container         anon (tgid:2961 range:0x41b11000-0x41d00000)

Here's a typical NDK5 profile:
samples  %        image name               app name                 symbol name
14573    18.7673  anon (tgid:3184 range:0x41977000-0x419e4000) plugin-container         anon (tgid:3184 range:0x41977000-0x419e4000)
12488    16.0822  libc.so                  libc.so                  /system/lib/libc.so
6505      8.3772  anon (tgid:3184 range:0x41977000-0x419e0000) plugin-container         anon (tgid:3184 range:0x41977000-0x419e0000)
5476      7.0521  anon (tgid:3184 range:0x41977000-0x419e3000) plugin-container         anon (tgid:3184 range:0x41977000-0x419e3000)
4405      5.6728  anon (tgid:3184 range:0x41977000-0x419c3000) plugin-container         anon (tgid:3184 range:0x41977000-0x419c3000)
3703      4.7688  libm.so                  libm.so                  /system/lib/libm.so
3353      4.3180  anon (tgid:3184 range:0x42000000-0x42600000) plugin-container         anon (tgid:3184 range:0x42000000-0x42600000)
2070      2.6658  anon (tgid:3184 range:0x419b7000-0x419e4000) plugin-container         anon (tgid:3184 range:0x419b7000-0x419e4000)
1761      2.2678  anon (tgid:3184 range:0x42a00000-0x42c00000) plugin-container         anon (tgid:3184 range:0x42a00000-0x42c00000)
1177      1.5158  anon (tgid:3184 range:0x41977000-0x419d8000) plugin-container         anon (tgid:3184 range:0x41977000-0x419d8000)
1080      1.3908  anon (tgid:3184 range:0x41977000-0x419c5000) plugin-container         anon (tgid:3184 range:0x41977000-0x419c5000)
890       1.1462  anon (tgid:3184 range:0x419e4000-0x41af4000) plugin-container         anon (tgid:3184 range:0x419e4000-0x41af4000)
883       1.1371  libxul.so (deleted)      libxul.so (deleted)      js_DoubleToInt32(double)

I think this agrees with what you've found. The 31% JIT task in NDk4 is split into 19% JIT and 16% libc in NDK5. I looked at the addresses and most of the time in libc is spent in memcpy. However, I haven't been able to get a backtrace of who is calling memcpy.
Comment 25 Alon Zakai (:azakai) 2011-06-28 11:42:47 PDT
To summarize some more stuff we've discussed offline,

1. memcpy is being called to copy megabytes of ipc data, through a wciwyg channel. This happens in the image tests and not elsewhere. Which makes sense so far. This should take time in the parent, possibly also the child (not sure how we copy ipc data there).

2. It might be slower in NDK5 if it doesn't optimize native functions like memcpy etc., if NDK5 has -fno-builtin by default. Probably worth building without that and seeing the difference. Not having -fno-builting in NDK4 would explain why we don't see it in the traces.
Comment 26 Jim Chen [:jchen] [:darchons] 2011-06-29 15:31:23 PDT
Created attachment 542978 [details] [diff] [review]
Patch to optimize string usage in IPC

Basically the regression is from std:string usage in the Chromium IPC code. IPC keeps a buffer of incoming messages, and IPC assigns any unprocessed messages to this buffer. However, if we didn't do any processing in the first place, the code will still try to assign the buffer to the buffer itself. My guess is our old STL guarded against assigning string to itself. However, the new STL happily allocated and copied another buffer as part of the "assignment", which led to excessive memcpying for no reason.

The comprehensive Kraken results are below
NDK4 (unpatched): 88845ms / 0.4%
NDK4 (patched):   88808ms / 0.4%
NDK5 (unpatched): 95180ms / 0.3%
NDK5 (patched):   88326ms / 0.3%

In addition to not assigning the buffer to itself, the patch also tries to optimize memory allocation of the buffer, but the effect is negligible.
Comment 27 Alon Zakai (:azakai) 2011-06-29 15:53:24 PDT
Comment on attachment 542978 [details] [diff] [review]
Patch to optimize string usage in IPC

Nice find, jchen!

cjones, do you know if there is an upstream that can receive bugreports?
Comment 28 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 07:02:40 PDT
Comment on attachment 542978 [details] [diff] [review]
Patch to optimize string usage in IPC

>     // Process messages from input buffer.
>-    const char *p;
>+    const char *p, *origp;

Nit: to match surrounding style, please put this decl on a
separate line.  Please rename it |overflowp|.

>     const char *end;
>     if (input_overflow_buf_.empty()) {
>+      origp = NULL;
>       p = input_buf_;
>       end = p + bytes_read;
>     } else {
>       if (input_overflow_buf_.size() >
>          static_cast<size_t>(kMaximumMessageSize - bytes_read)) {
>         input_overflow_buf_.clear();
>         LOG(ERROR) << "IPC message is too big";
>         return false;
>       }
>+      size_t buf_size = input_overflow_buf_.size();
>+      if (buf_size + bytes_read > input_overflow_buf_.capacity()) {
>+        buf_size += bytes_read > buf_size / 4 ? bytes_read : buf_size / 4;
>+        input_overflow_buf_.reserve(buf_size);
>+      }

Why are you implementing a custom buffer-growth policy here?  I
think this may be what you described having a neglible effect.
If so, please remove it.

>@@ -566,17 +572,23 @@ bool Channel::ChannelImpl::ProcessIncomi
>           listener_->OnMessageReceived(m);
>         }
>         p = message_tail;
>       } else {
>         // Last message is partial.
>         break;
>       }
>     }
>-    input_overflow_buf_.assign(p, end - p);
>+    if (end == p) {
>+      input_overflow_buf_.clear();
>+    } else if (!origp) { // p is from input_buf_
>+      input_overflow_buf_.assign(p, end - p);
>+    } else if (p > origp) { // p is from input_overflow_buf_
>+      input_overflow_buf_.erase(0, p - origp);

Let's put these comments on their own lines too to match
surrounding style.

Nice fix!  Let's figure out the custom buffer growth then I'm happy to take this.
Comment 29 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 07:05:22 PDT
(In reply to comment #27)
> cjones, do you know if there is an upstream that can receive bugreports?

Our code is a de facto fork, but I took a look at chromium trunk and this fix would apply.  We could/should be good citizens and file an issue at http://code.google.com/p/chromium/issues.  I can take care of this after this patch lands, if jchen is busy with other stuff.
Comment 30 Jim Chen [:jchen] [:darchons] 2011-07-05 11:22:37 PDT
Created attachment 544001 [details] [diff] [review]
Patch to optimize string usage in IPC (v1.1)

Thanks for the comments!

(In reply to comment #28)
> Comment on attachment 542978 [details] [diff] [review] [review]
> Patch to optimize string usage in IPC
> 
> Nit: to match surrounding style, please put this decl on a
> separate line.  Please rename it |overflowp|.

Done

> Why are you implementing a custom buffer-growth policy here?  I
> think this may be what you described having a neglible effect.
> If so, please remove it.

Right this had no noticeable effect. Removed.

> Let's put these comments on their own lines too to match
> surrounding style.

Done.

It'd be great if you can follow this up with chromium. Thanks!
Comment 31 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 11:39:49 PDT
Comment on attachment 544001 [details] [diff] [review]
Patch to optimize string usage in IPC (v1.1)

Thanks.
Comment 32 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 12:08:45 PDT
Were these builds using STLPort built into NDK5 or gcc's STL impl?
Comment 33 Jim Chen [:jchen] [:darchons] 2011-07-05 13:40:31 PDT
(In reply to comment #32)
> Were these builds using STLPort built into NDK5 or gcc's STL impl?

NDK5 STLPort. That's one thing that's changed from NDK4 to NDK5.
Comment 34 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 14:05:43 PDT
That sounds a little scary.  Is there no way for us to keep using our own NDK with gcc's STL impl?  Otherwise I'm afraid we might have to keep shaking these kinds of regressions out, and we have fewer perf tests for android than any of our other Tier I-ish platforms.

I'm not against taking this patch even if we go back to gcc's STL, though.
Comment 35 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-05 16:17:30 PDT
Filed http://code.google.com/p/chromium/issues/detail?id=88484 .
Comment 36 Jim Chen [:jchen] [:darchons] 2011-07-06 10:40:38 PDT
(In reply to comment #34)
> That sounds a little scary.  Is there no way for us to keep using our own
> NDK with gcc's STL impl?  Otherwise I'm afraid we might have to keep shaking
> these kinds of regressions out, and we have fewer perf tests for android
> than any of our other Tier I-ish platforms.
> 
> I'm not against taking this patch even if we go back to gcc's STL, though.

I wonder in how many places are we using STL. Using STLPort and not requiring a custom NDK build means we can better keep up with new versions of NDK.
Comment 37 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-06 10:49:23 PDT
I understand the motivation for using the stock NDK5.

All the performance-critical uses of STL are in IPC code, AFAIK.  One way to get a rough estimate of before/after perf is to built the IPDL tests (--enable-ipdl-tests), and run the latency check ($dist/bin/ipdlunittest TestLatency) with NDK5/STLPort vs. NDK5/gcc (or, less scientifically, NDK4 vs NDK5/STLPort).  That test ought to be bound by OS and gecko event-loop stuff, but STL badness might show up.

If we can't measure a difference (not capable of measuring or no measured difference), then we don't have a reason to use gcc over STLPort.
Comment 38 Daniel Holbert [:dholbert] 2011-07-07 12:16:52 PDT
Looks like bz landed this on m-i:
  http://hg.mozilla.org/integration/mozilla-inbound/rev/4836dc515ac2

[updating from "other branch"-->"Trunk" after verifying that's correct w/ jchen]
Comment 39 Marco Bonardo [::mak] 2011-07-08 05:53:28 PDT
http://hg.mozilla.org/mozilla-central/rev/4836dc515ac2
Comment 40 Luke Wagner [:luke] 2011-07-08 23:17:14 PDT
For those of us following along at home: could you explain how an inefficiency in the ipc code was influencing our kraken score?  Looking at the kraken harness, it seems like only pure JS is run in the timed interval.  Is something happening in parallel that is stealing resources from the content process?  Thanks, and good job solving the problem!
Comment 41 Jim Chen [:jchen] [:darchons] 2011-07-09 11:32:25 PDT
Thanks! My understanding so far is cache data is being sent across IPC in parallel to the benchmark running.

Specifically, when Kraken uses document.write in a frame to start a test, document.write caches that content using WyciwygChannel. In Fennec, this means we send the cache data from child to parent asynchronously. This bug shows up when the parent is not able to receive all the cache data at once. Every time the parent reads an additional chunk of incoming data, it assigns the incoming buffer to itself resulting in unnecessary alloc and memcpy. This extra memcpy in the parent takes away CPU time from the child running the benchmark, making the benchmark slower.

The effect becomes noticeable when a lot of data is being sent, which is the case when running the Kraken image tests (megabytes of image data). The parent receives a chunk of cache data at a time, and everytime, it allocates a new buffer and copies over everything it has received so far. This results in a cumulative memcpy of hundreds of megabytes, which is too much even for Android's super optimized memcpy :)
Comment 42 Luke Wagner [:luke] 2011-07-09 11:41:49 PDT
Thanks for the detailed explanation!

Btw, was this measured on a single or dual core ARM platform?  I'm curious because I thought we are using Tegras and I thought Tegras were dual-core and, if so, this is a very interesting data point concerning (I assume) memory bandwidth.
Comment 43 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-09 16:50:19 PDT
(In reply to comment #41)
> Specifically, when Kraken uses document.write in a frame to start a test,
> document.write caches that content using WyciwygChannel. In Fennec, this
> means we send the cache data from child to parent asynchronously.

That sounds, ah, highly undesirable :/.  I suppose we cache this for view-source on desktop?  Does fennec support view-source?  I couldn't find the option.  If not, boy would it be nice to not uselessly cache hundreds of megabytes of document.write().
Comment 44 Brad Lassey [:blassey] (use needinfo?) 2011-07-09 17:23:18 PDT
Fennec has view source but no ui for it unless you install mfinkle's mobile tools add on
Comment 45 Boris Zbarsky [:bz] 2011-07-09 19:34:11 PDT
> I suppose we cache this for view-source on desktop?

No, we cache it so that reload will do the right thing.  It needs to reload the data that was written with document.write, which means we need that data available....
Comment 46 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-09 20:37:25 PDT
Huh, ok.  I guess that being the "right thing" is in the eye of the beholder, but that's a topic for elsewhere ;).
Comment 47 Boris Zbarsky [:bz] 2011-07-09 20:45:47 PDT
"right thing" in this case is determined by web compat... There are sites that call reload() and expect that behavior.
Comment 48 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-09 20:59:03 PDT
Yeah, figured.
Comment 49 Jim Chen [:jchen] [:darchons] 2011-07-09 21:27:18 PDT
(In reply to comment #42)
> Thanks for the detailed explanation!
> 
> Btw, was this measured on a single or dual core ARM platform?  I'm curious
> because I thought we are using Tegras and I thought Tegras were dual-core
> and, if so, this is a very interesting data point concerning (I assume)
> memory bandwidth.

I did my measurements on a single-core phone, although I should do it on a dual-core too. Might uncover some other interesting stuff :)
Comment 50 Brad Lassey [:blassey] (use needinfo?) 2011-07-10 11:24:09 PDT
I measured on a dual core atrix and saw similar results

Note You need to log in before you can comment on or make changes to this bug.