Gameboy emulator uses at least 5% more CPU with tracejit+methodjit than with methodjit alone

RESOLVED WORKSFORME

Status

()

defect
RESOLVED WORKSFORME
9 years ago
9 years ago

People

(Reporter: bzbarsky, Unassigned)

Tracking

Trunk
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

()

This is probably not worth looking into in detail until bug 580468 is fixed.

Steps to reproduce:
1)  Load http://grantgalitz.org/gameboy/
2)  In the in-page gameboy thing, select File > Open As > URL Address
3)  In the prompt, put in http://grantgalitz.org/gameboy/romStorage/proxima.gb
    for the ROM.

Grant says that his CPU usage for just methodjit is about 93% while method+tracing is over 98%.  jsc/v8 end up at about 75%, so even methodjit has some headroom here (or maybe graphics does).

Comment 1

9 years ago
Yeah, overall firefox isn't as fast as chrome yet on my site, as of 4.0b7pre.
Many gameboy and gameboy color roms are averaging the same result, though the gameboy color games with "double clock speed mode" in the terminal will be taking up more processing power than the usual gameboy or gameboy color game.

Comment 2

9 years ago
Disable auto frameskip in the settings to get a better visual, and to max out the CPU as such.

Comment 3

9 years ago
People gotsa play Pokemon Yellow without stuttering. ;)

Comment 4

9 years ago
On the webkit nightlies the gameboy color CPU usage can sometimes drop to as low as 40%, while on FF4 it will get ~90%. Maybe webkit is optimizing how it handles switch statement execution?

Comment 5

9 years ago
(In reply to comment #4)
> On the webkit nightlies the gameboy color CPU usage can sometimes drop to as
> low as 40%, while on FF4 it will get ~90%. Maybe webkit is optimizing how it
> handles switch statement execution?

Nevermind, seems likely that the webkit nightlies use hardware acceleration, so to judge that against FF4 with no HW acceleration would be cruel.

Comment 6

9 years ago
You can load your own roms in locally now thanks to the File API, so it should be easier to test different cases with this emulator.
Aside from switch statements, another major issue for emulators is array accesses. Emulators can have a lot of code that looks like this;

  var MEMORY = []; // Defined once; this is the emulated RAM
  ...
  var b = MEMORY[a+10];
  MEMORY[b+98] = 17;
  ...
  for (var i = 0; i < 100; i++)
    MEMORY[c+i] = MEMORY[d+i];

and so forth. Typically the exact same array is used for all of these (each time the same statement is executed, but also the same across all of these different statements), and always with the same data type being stored, numbers (maybe mixing integers and floating points though).

Perhaps this type of code can be optimized, if it isn't already, using a PIC-like approach, or something else? Any speedup here would significantly affect emulator performance (and also code compiled into JS from C, LLVM, etc.).
Gal/Nick, was there some plan to have a COPYELEM opcode?

Comment 9

9 years ago
Alon: Yes, the emulator builds upon this. Also it uses multiple separate arrays for RAM banking. There are also ranges of each array that are left out (never initialized) because these separate arrays do these ranges' jobs (As required for banking). And yes, the opcodes are executed by switch, as well as by specifying the opcode as a property of the this object (and accessing it later in the array-like format). I'm wondering for curiosity what would be the fastest way for opcode execution (in an object-oriented kind of way (As in things use the this variable)).
1. You can in theory use separate ranges in a single big array, instead of separate arrays. I have no idea which would be faster though, or which would be easier to optimize the array implementation for.

2. Regarding executing via a switch, you can avoid that entirely by compiling the ROM as opposed to interpreting it. See for example this link:

http://code.google.com/p/emscripten/wiki/GeneratedCodeComparison

It shows C++, then how that looks in LLVM, then how that looks after being compiled into JavaScript (the JavaScript there, btw, is the sort of code that I meant earlier, that would be significantly sped up by optimizing array accesses). Something similar could be done with compiling a Gameboy ROM to JavaScript. (But I suspect this may be getting to be offtopic for this bug; if you're interested to continue talking about this, feel free to contact me via email or irc, azakai on irc.mozilla.org. I'd be happy to help out if I can).

Comment 11

9 years ago
I think there could be multiple problems trying to statically or dynamically JIT the code. The issues are that the code can be self-modifying, and that all the data, not just the OPs are in the ROM. The image data can be in the ROM data interlaced possibly with instruction code. Also, trying to dynarec JIT can also be a problem, since the interrupts are very sensitive and possibly skipping over them to a "final state" for dynarec'd instructions can possibly cause problems along the way.
Interesting. Ok, then I guess all that can be done is what was mentioned before (to optimize the JS engine implementation of array accesses and/or switches).

Comment 13

9 years ago
This problem is going to get worse though when the N64 emulator goes live. Though I'm trying to port the dynarec techniques from mupen64plus. Hopefully that will work.

Comment 14

9 years ago
The only game that currently works for the JS N64 emulator surprisingly is Diddy Kong Racing, though there are numerous dimension miscalculations from the ricevideolinux plugin port causing trouble with WebGL

Comment 15

9 years ago
Also, there's the matter of emulating 64 bit integers in 32 bit numerical typing.

Comment 16

9 years ago
Add Mario Kart to Comment 14 as well. :D
(In reply to comment #8)
> Gal/Nick, was there some plan to have a COPYELEM opcode?

I've heard the idea mentioned, but never more than that.  Let's think about what it could gain us.  A GETELEM/SETELEM pair on an initialized dense array, eg. a[j] = a[i]; involves these steps in trace-generated code (I imagine JM code must do similar things):

GETELEM
1. load pointer to 'a' from the stack
2. if 'a' is not a dense array, exit
3. if 'i' exceeds the capacity of 'a', exit
4. load the dslots pointer of 'a'
5. compute the address of the dslot for a[i]
6. load a[i]'s tag from the dslot
7. if the tag doesn't match the expected type, exit
8. load a[i]'s payload 
9. unbox the value
10. store the unboxed value on the stack

SETELEM
11. load pointer to 'a' from the stack
12. if 'a' is not a dense array, exit
13. if 'j' exceeds the capacity of 'a', call js_EnsureDenseArrayCapacity()
14. load the dslots pointer of 'a'
15. compute the address of the dslot for a[j]
16. load the a[j]'s existing tag from the dslot
17. if the existing tag is for a hole, call js_Array_dense_setelem_hole
18. box the new value
19. store the boxed value into a[j]'s dslot

Horrid, isn't it?  

Here's what COPYELEM would do:

COPYELEM
1. load pointer to 'a' from the stack
2. if 'a' is not a dense array, exit
3. if 'i' exceeds the capacity of 'a', exit
4. if 'j' exceeds the capacity of 'a', call js_EnsureDenseArrayCapacity()
5. load the dslots pointer of 'a'
6. compute the address of the dslot for a[i]
7. load a[i]'s tag and payload from the dslot
8. compute the address of the dslot for a[j]
9. load a[j]'s existing tag
10. if the existing tag is for a hole, call js_Array_dense_setelem_hole
11. store a[i]'s tag and payload into a[j]'s slot.

So this has the following benefits (for trace-generated code):
- No need to reload 'a' from the stack and test if it's a dense array.
  (The tracer can omit these steps thanks to CSE, but the methodjit
  presumably can't.)

- No need to reload the dslots pointer and recompute the dslots address.
  (The tracer can't CSE those in the GETELEM/SETELEM version because of the
  possibly intervening call to js_EnsureDenseArrayCapacity(), which can
  reallocate dslots.)

- No need to do anything with the gotten value -- type-test it, unbox it, or
  store it to the stack.

So that's quite a bit better, I would guess close to 2x faster.  This
assumes that COPYELEM does not put a copy of the element on the stack;
doing so would negate several of the potential benefits, and it doesn't make
much sense.

One issue would be whether the front-end fuses separate GETELEM/SETELEM
pairs, eg:

  var tmp = a[i];
  a[j] = tmp;

You'd hope so, but I'm not certain, I don't know much about the front-end.
Things can get more complicated if there is code between the GETELEM and the
SETELEM.

Comment 18

9 years ago
Don't test against Internet Explorer 9, since IE9 has broken CanvasPixelArray.data setting support (Error seems to indicate that this is read-only in IE9, while every other browser that supports this object allows for this to be set.).
(In reply to comment #17)
> 
> I've heard the idea mentioned, but never more than that.  Let's think about
> what it could gain us.  A GETELEM/SETELEM pair on an initialized dense array,
> eg. a[j] = a[i]; involves these steps in trace-generated code (I imagine JM
> code must do similar things):
> 
> GETELEM
> 1. load pointer to 'a' from the stack
> 2. if 'a' is not a dense array, exit
> 3. if 'i' exceeds the capacity of 'a', exit
> 4. load the dslots pointer of 'a'
> 5. compute the address of the dslot for a[i]
> 6. load a[i]'s tag from the dslot
> 7. if the tag doesn't match the expected type, exit
> 8. load a[i]'s payload 
> 9. unbox the value
> 10. store the unboxed value on the stack
> 

Hmm, I am reminded of an idea of pcwalton's - to use WebGL typed arrays for this kind of code. In theory perhaps this could work like so (for GETs):

 1. load pointer to 'a' from the stack
 2. if 'a' is not a WebGL typed array, exit
 3. if 'i' exceeds the capacity of 'a', exit
 4. load 'a's pointer to the actual memory buffer
 5. load a[i]'s value
 6. store that value on the stack

?

On the emulator side, the JS code would need to read&write only the proper type to the array, but that is feasible. (It might require separate arrays for integer and float values.)
(In reply to comment #19)
> 
> Hmm, I am reminded of an idea of pcwalton's - to use WebGL typed arrays for
> this kind of code.

Absolutely, typed arrays are less flexible and so will be faster.  If the array we are getting/setting is a typed array, the generated code will a bit different (and simpler) than that above (for the trace JIT at least;  it will see that the array is typed at record-time and so can specialize the code for that case).

(I suspect a lot of JS programmers think that arrays are fast, because they are in other languages.  But they're not in JS, even though a lot of effort has gone into making them fast, and more will be done in the future, esp. because Kraken uses them a lot.)

Comment 21

9 years ago
Nick: The emulator is now using Uint8Array for the main memory. Seems to be working in Chrome 7 and Firefox 4. I haven't measured the performance difference yet. Thanks though for bringing this stuff up, since I explicitly needed giant 8-bit memory arrays.
(FYI, typed arrays are not IC'd in JM yet. See bug 594247.)

Comment 23

9 years ago
Well, I guess I added typed array support in advance inside the GBC emu for when special code paths are made later on, so that it will make an improvement later on.

Comment 24

9 years ago
The GBC emu initializes most of its RAM now through the format: new Uint8Array(memory bank length in bytes);
Blocks: 601988
I opened bug 602366 on adding a COPYELEM bytecode.
No longer blocks: 601988

Comment 26

9 years ago
Memory is much more branchy than the only other gameboy emulator out there, since I support a lot ram banking types, as well as gameboy color specific ram banking. So trace jit especially fails on my emu.

Comment 27

9 years ago
Wow, I'm a bad typer, here's what I meant to say if you weren't able to follow the last comment:
GameBoy Online's memory is much more branchy than JSGB's memory (the other gameboy emu).

Comment 28

9 years ago
Also, sound support seems to be adding a bit of load, but https://bugzilla.mozilla.org/show_bug.cgi?id=578916 seems to have helped possibly.

Comment 29

9 years ago
Also, the original ROM posted in a much earlier comment won't work anymore unless you disable the BIOS ROM from booting on game startup in the settings, sine that ROM has a bad ROM header that the GameBoy Color BIOS ROM checks in order to execute Nintendo's lockout of unapproved ROMs.

Comment 30

9 years ago
*since

Comment 31

9 years ago
Also the measurements earlier for CPU usage are now invalid, since the way opcodes are executed and memory is read and written to has changed.

Comment 32

9 years ago
It would probably be helpful if you kept older versions archived somewhere so that Moz developers can test specific optimization problems -- that's difficult when there's a moving target.

Comment 33

9 years ago
Yeah, I need to archive my revisions. It's sad that I don't even have an SVN, a GIT, or a CVS repository (Que in the laziness explanation). But overall performance should be getting better for JM, though TM seems to suck a lot when enabled. If JM can easily access any of  over a thousand functions as properties of the this variable, then JM is set for speed, since the opcodes and memory functions were chewed down into functions that are now numeric properties of "this" that can be accessed like this[address + 0x10000] or this[op + 0x200].

Comment 34

9 years ago
Correction: Numerically addressed properties
Grant, can you _please_ stop the noise?  I don't need all the email that's not really relevant to getting this bug resolved....

Comment 36

9 years ago
Bug seems to have been fixed since hueristics have landed.
Firefox might not run it as fast as V8, but the specific topic of this bug has been resolved (trace jit slowing down method jit)
OK, excellent.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.