Investigate where time is spent in BananaBread

RESOLVED WORKSFORME

Status

Component: Core :: JavaScript Engine
Reported: 5 years ago
Modified: 4 years ago

People

(Reporter: azakai, Assigned: azakai)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [js:t])

Attachments

(1 attachment)

(Assignee)

Description

5 years ago
Profiling BananaBread,

http://dl.dropbox.com/u/80664946/main.html?low,low

using perf on Linux, I see the following at the top of the profile:

16.24% perf-21525.map      [.] 0x93bd72a5
12.18% fglrx_dri.so        [.] 0x11d78c9
 1.98% firefox             [.] 0x9483
 1.71% [kernel.kallsyms]   [k] __copy_from_user_ll.part.1
 1.56% [kernel.kallsyms]   [k] system_call
 1.41% libpthread-2.15.so  [.] __pthread_mutex_unlock_usercnt
 1.41% libpthread-2.15.so  [.] pthread_mutex_lock
 1.41% librt-2.15.so       [.] clock_gettime
 1.39% libxul.so           [.] nsBufferedAudioStream::DataCallback(void*, long)
 1.38% libxul.so           [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
 1.27% [fglrx]             [k] firegl_calc_user_handle
 0.90% [kernel.kallsyms]   [k] native_flush_tlb_single
 0.87% libxul.so           [.] js::mjit::stubs::GetElem(js::VMFrame&)
 0.82% libxul.so           [.] NS_QuickSort
 0.80% [kernel.kallsyms]   [k] page_fault
 0.74% libc-2.15.so        [.] 0x936ca
 0.68% libxul.so           [.] js::types::TypeMonitorResult(JSContext*, JSScript*, unsigned char*, JS::Value const&)

The highest entry looks like a perf artifact, then GL, which makes sense. Next is perhaps jitcode, then some kernel and pthread calls. Then audio, which also makes sense. Then the JS interpreter, still fairly high up - almost the same time as the suspected jitcode line. And a little lower but still significant are js::mjit::stubs::GetElem and js::types::TypeMonitorResult.

The profile was collected after playing for a while, by which point I would hope most of the code had been jitted, so I asked luke about this. Luke suspects this might be overreported by perf, and said he would check with the OS X profiler.
(Assignee)

Comment 1

5 years ago
I can't find good docs to substantiate this, but based on further profiling I believe the top line, |16.24% perf-21525.map|, might be jitcode that confuses perf. So that would support our spending most time in jit or in GL, which sounds ok. Still curious why a significant amount remains interpreted though after running for a long time.
(Assignee)

Comment 2

5 years ago
Created attachment 648527 [details]
More detailed, less noisy perf report

Attached is a longer run, less noisy, and with callgraph info too.

Fairly high up are

2.46% js::Interpret
1.33% js::types::TypeMonitorResult
0.61% js::mjit::stubs::Ursh
0.47% js::types::TypeSet::hasType
(Assignee)

Comment 3

5 years ago
Running on the highest-detail map,

http://dl.dropbox.com/u/80664946/main.html?high,high

I see at the top of the profile

27.28% perf-20124.map 0x964d5f6c
15.44% fglrx_dri.so   0xaed023
 7.89% libxul.so      js::mjit::stubs::GetElem(js::VMFrame&)
 2.75% libxul.so      js::baseops::GetProperty(JSContext*, ..)
 2.01% libxul.so      js::ObjectImpl::nativeLookup(JSContext*, int)

So it appears that well over 10% of time is spent in non-JIT code.

Comment 4

5 years ago
Looking with Shark, <.5% of the time is in Interpret (although it is a good question why we are in Interpret at all).

I dug into two Shark profiles (after all jit compilation jank was done):

For 1 bot (which seems to leave my cpu at ~70%, solid 30fps)
 - 55% outside JS (1/2 of that in webgl calls from JS, though)
 - 27% in jit code
 - 14% under various C++ JS VM stuff called from jit code
 - 4.5% under just stubs::GetElem

For 6 bots (which just gets the cpu to 100%, still mostly 30fps)
 - 46% outside JS (1/2 again in webgl calls from JS)
 - 38% in jit code
 - 11% under various C++ JS VM stuff called from jit code
 - 4% just under stubs::GetElem

Thoughts:
 1. Graphics is a pretty big chunk of time, how much of that do you think we can improve?  Are any gfx people looking into BananaBread?
 2. We should see if there is an easy way to avoid that big GetElem (I'll give you source line info if you give me an un-minified build :)
 3. The "various C++ JS VM" category doesn't seem to throw up any red flags for JM (it's spread across 20 or so stubs).  For IM it seems like it would be worth our time to dig in and fix the top 5-10 stub calls.
Summary: Lots of interpreter time in BananaBread → Investigate where time is spent in BananaBread
(Assignee)

Comment 5

5 years ago
(In reply to Luke Wagner [:luke] from comment #4)
> Looking with Shark, <.5% of the time is in Interpret (although it is a good
> question why we are in Interpret at all).
> 
> I dug into two Shark profiles (after all jit compilation jank was done):
> 
> For 1 bot (which seems to leave my cpu at ~70%, solid 30fps)
>  - 55% outside JS (1/2 of that in webgl calls from JS, though)
>  - 27% in jit code
>  - 14% under various C++ JS VM stuff called from jit code
>  - 4.5% under just stubs::GetElem

Maybe that 20% of VM stuff somehow gets partially clumped into the interpreter in my profiles (or vice versa)? In any case, seems like a lot of room for speedup, almost half (or a quarter from your other profile) of JS time is not in JITcode.

> 
> For 6 bots (which just gets the cpu to 100%, still mostly 30fps)
>  - 46% outside JS (1/2 again in webgl calls from JS)
>  - 38% in jit code
>  - 11% under various C++ JS VM stuff called from jit code
>  - 4% just under stubs::GetElem
> 
> Thoughts:
>  1. Graphics is a pretty big chunk of time, how much of that do you think we
> can improve?  Are any gfx people looking into BananaBread?

Yes, gfx people have looked at it. There was one promising optimization, but it doesn't look like it will be very big. Those shaders just take time to run on the GPU...

>  2. We should see if there is an easy way to avoid that big GetElem (I'll
> give you source line info if you give me an un-minified build :)

Cool, try this

http://dl.dropbox.com/u/80664946/main.html?low,low,debug

Note that the debug build there does not have closure optimizations on it, so performance might be a little different than the optimized build. But hopefully not significantly.

>  3. The "various C++ JS VM" category doesn't seem to throw up any red flags
> for JM (it's spread across 20 or so stubs).  For IM it seems like it would
> be worth our time to dig in and fix the top 5-10 stub calls.

But shouldn't JS VM time end up close to 0, and all execution be in JITcode? Or do I not understand what JS VM stuff is?

Comment 6

5 years ago
(In reply to Alon Zakai (:azakai) from comment #5)
> But shouldn't JS VM time end up close to 0, and all execution be in JITcode?
> Or do I not understand what JS VM stuff is?

It should.  With the exception of the big stubs::GetElem call, no single stub call sticks out so it'll take some time to track them all down (time we should spend, I said, but probably on IM, although maybe investigating it in JM would yield insights/fixes that transfer to IM).
(Assignee)

Comment 7

5 years ago
I investigated the GetElem stuff with guidance from luke.

One issue was that the FS.streams object (which maps file descriptors to file objects) was being used as a sparse map. I rewrote the library code to make it dense, and it stopped showing up in GetElem calls.
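The rewrite described above might look roughly like the sketch below. The names (streams, openStream, closeStream) are illustrative, not the actual Emscripten library code; the point is that file descriptors stay small consecutive integers indexing a dense array, so stream lookups are dense element accesses rather than sparse property lookups.

```javascript
// Dense-array style: file descriptors are small consecutive indices,
// so streams[fd] is a fast dense element access. The sparse alternative
// (an object keyed by arbitrary fd numbers) kept hitting the slow
// GetElem stub.
var streams = [null]; // index 0 reserved; keeps fds consecutive from 1

function openStream(obj) {
  var fd = streams.length; // always the next consecutive index
  streams[fd] = obj;
  return fd;
}

function closeStream(fd) {
  streams[fd] = null; // null out rather than delete, to stay dense
}
```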

Next is something puzzling: in _pread (which reads data from files), we call GetElem 7,341,921 times (once for each byte of file data we "read"). The code is

  for (var i = 0; i < size; i++) {
    HEAPU8[buf + i] = contents[offset + i];
  }

HEAPU8 and contents are both Uint8Arrays, though. Why is GetElem being called?
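For what it's worth, the per-byte loop can usually be replaced with a single bulk copy, which avoids issuing one element access per byte altogether. A minimal sketch, using the same names as the snippet above:

```javascript
// Copy `size` bytes from `contents` (starting at `offset`) into
// `HEAPU8` (starting at `buf`) with one TypedArray.prototype.set call
// instead of `size` individual element reads and writes.
function preadCopy(HEAPU8, contents, buf, offset, size) {
  HEAPU8.set(contents.subarray(offset, offset + size), buf);
}
```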
(Assignee)

Comment 8

5 years ago
|contents| arrives from

  var contents = stream.object.contents;

It always exists and is always a Uint8Array here in this function. However, stream.object in other functions might not have a .contents (filesystem nodes without contents do exist). I am not sure how TI analyses such things; if it does so globally, perhaps it is led to suspect that .contents could be undefined here. But if it only checks in this function, it will always see a Uint8Array.
(Assignee)

Comment 9

5 years ago
I moved on to GetElem during play as opposed to load. The worst offender is this

  var attributes = GL.immediate.liveClientAttributes;
  var cacheItem = GL.immediate.rendererCache;
  for (var i = 0; i < attributes.length; i++) {
    var attribute = attributes[i];
    if (!cacheItem[attribute.name]) cacheItem[attribute.name] = {};
    cacheItem = cacheItem[attribute.name];
    if (!cacheItem[attribute.size]) cacheItem[attribute.size] = {};
    cacheItem = cacheItem[attribute.size];
    if (!cacheItem[attribute.type]) cacheItem[attribute.type] = {};
    cacheItem = cacheItem[attribute.type];
  }

This is a pattern luke suggested to me. We need a cache for renderers, where the key is a combination of several parameters. So we make a cache through a hierarchy of objects. Accessing a cached item becomes an access through properties, something like cache.x.y.z.a.b.c.

I'm not sure how this could be optimized. In theory Map objects might make more sense here, but MDN says not to use them yet.
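If Maps were usable, a Map-of-Maps version of the same cache would be a direct analogue of the nested plain objects. A hedged sketch, with illustrative names rather than the actual BananaBread code:

```javascript
// Hypothetical Map-of-Maps variant of the nested-object renderer cache.
// Each level keys on one attribute field, mirroring
// cacheItem[attribute.name][attribute.size][attribute.type].
var rendererCache = new Map();

// Fetch (or lazily create) the next level of the cache hierarchy.
function descend(map, key) {
  var next = map.get(key);
  if (next === undefined) {
    next = new Map();
    map.set(key, next);
  }
  return next;
}

function lookupCacheItem(attributes) {
  var cacheItem = rendererCache;
  for (var i = 0; i < attributes.length; i++) {
    var a = attributes[i];
    cacheItem = descend(cacheItem, a.name);
    cacheItem = descend(cacheItem, a.size);
    cacheItem = descend(cacheItem, a.type);
  }
  return cacheItem;
}
```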
Blocks: 579390
Whiteboard: [js:t]

Comment 10

5 years ago
IIRC, the alternative was concatenating a string, which creates garbage.  With igc (and later ggc), perhaps this would be better than a sequence of worst-case property lookups.  It would also be good to compare MapObject vs. GetElem.
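The string-concatenation alternative mentioned above could look like the sketch below (illustrative names, not the shipped code). Each lookup builds one composite key string, which is exactly the garbage being traded for the chain of property lookups:

```javascript
// Hypothetical string-key variant of the renderer cache: one flat
// object keyed by a concatenated string. Each call allocates a fresh
// key string (the garbage noted above), but the lookup itself is a
// single property access.
var rendererCacheByKey = {};

function getCachedRenderer(attributes, create) {
  var key = '';
  for (var i = 0; i < attributes.length; i++) {
    var a = attributes[i];
    key += a.name + '|' + a.size + '|' + a.type + '|';
  }
  var renderer = rendererCacheByKey[key];
  if (renderer === undefined) {
    renderer = create();
    rendererCacheByKey[key] = renderer;
  }
  return renderer;
}
```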
(Assignee)

Comment 11

5 years ago
I optimized everything that came up here; by using only dense []s instead of sparse {}s, performance is noticeably better.

The one thing I can't find a workaround for is the pread issue mentioned in comment #7. It would be great if the JS people could check why that ends up calling GetElem even though there is no need.

The one clue I have is that in pread we copy from one typed array to another, as opposed to copying within a single typed array - that's the only difference from all the rest of the code that does end up running fast.
Summary: Investigate where time is spent in BananaBread → GetElem called in BananaBread in pread on two typed arrays
(Assignee)

Comment 12

5 years ago
An updated debug build is at

http://dl.dropbox.com/u/80664946/main.html?low,low,debug
(Assignee)

Comment 13

5 years ago
jandem and bhackett investigated this. Looks like the issue is that while contents[] is always a typed array, those typed arrays are created in a big variety of locations in the code (each preloaded file does it separately), and this makes TI treat them all separately, and once it sees enough that lead to contents[], it figures it can't assume anything about the type there (I hope I understood that).

Comment 14

5 years ago
(In reply to Alon Zakai (:azakai) from comment #13)
> jandem and bhackett investigated this. Looks like the issue is that while
> contents[] is always a typed array, those typed arrays are created in a big
> variety of locations in the code (each preloaded file does it separately),
> and this makes TI treat them all separately, and once it sees enough that
> lead to contents[], it figures it can't assume anything about the type there
> (I hope I understood that).

Yeah, to add a bit more detail, the problem is that TI doesn't like creating lots of type sets, all with the same large set of objects in them.  Eventually we hit a cutoff and start marking the sets as potentially containing any object, which inhibits fast paths for typed array accesses.  This can be avoided by making sure the arrays all have the same type, i.e. they are created at the same scripted location.
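The fix bhackett suggests (make sure all the typed arrays share one scripted allocation site) can be as simple as routing every creation through a single helper. A hedged sketch, with an illustrative name:

```javascript
// Every preloaded file's contents buffer is created here, so all the
// Uint8Arrays share a single `new Uint8Array` site. Under SpiderMonkey's
// type inference at the time, that gives them all one object type
// instead of one type per call site.
function makeFileContents(data) {
  return new Uint8Array(data);
}
```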
(Assignee)

Comment 15

5 years ago
I worked around this as suggested. Looks like the change saves 50ms during startup.

Returning this bug to the previous name. All the obvious GetElem stuff has been done away with; next it would be nice to know whether there are JIT code things that could be improved.
Summary: GetElem called in BananaBread in pread on two typed arrays → Investigate where time is spent in BananaBread

Comment 16

4 years ago
Alon, the demo is no longer available. Do you think there's anything left to do here with all the Emscripten / asm.js / Ion work that happened this year?
Flags: needinfo?(azakai)
(Assignee)

Comment 17

4 years ago
Probably nothing left to do here, yeah.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Flags: needinfo?(azakai)
Resolution: --- → FIXED

Updated

4 years ago
Assignee: general → azakai
Resolution: FIXED → WORKSFORME