Mops performance improvement on x86

UNCONFIRMED
Assigned to

Status

Tamarin
Virtual Machine
P3
enhancement
UNCONFIRMED
7 years ago
7 years ago

People

(Reporter: sold2, Assigned: sold2)

Tracking

(Depends on: 1 bug)

unspecified
x86
All
Bug Flags:
flashplayer-qrb +

Details

Attachments

(1 attachment, 7 obsolete attachments)

(Assignee)

Description

7 years ago
User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6.6; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E)
Build Identifier: tamarin-redux-c72c9b1c20a9

As of the current implementation, mops (domainMemory load/store instructions, sXX/lXX) carry a significant overhead: the array's base address  and size are fetched from memory, then the offset is bounds checked. The JIT only makes a shallow attempt to eliminate redundancies, so these costs are more-or-less per-mop. On x86 it's possible to eliminate these steps entirely, reducing mops to a single MOV instruction, while still maintaining based addressing and bounds checking, using some segmentation tricks.

Reproducible: Always
(Assignee)

Comment 1

7 years ago
There needs to be a way to distinguish between a mop load/store, and any other form of load/strore. Would adding a new access set do? (e.g ACCSET_DOMAIN. mops currently use ACCSET_OTHER) As far as I can tell there shouldn't be any aliasing issues here (it can only improve aliasing analysis, if anything)

Comment 2

7 years ago
(In reply to comment #1)
> As far as I can tell there shouldn't be any
> aliasing issues here (it can only improve aliasing analysis, if anything)

True, this is a good first step and should provide the isolation you need.  You may want to open a new bug for each independent piece of work (e.g. access set, segmentation)  and then use this bug as a 'tracker' (i.e the 'depends on:' field of this bug is filled with the other bug numbers).

re: the segmentation tricks, not sure what you have in mind, but just a gentle reminder that one use of tamarin is that its embedded in a plug-in that relies on the hosting application for resources.

Comment 3

7 years ago
IMHO we should consider changing the behavior for future SWF versions: instead of range-checking, constrain the memory to a power-of-two and trim the offset using a bitwise and.

Comment 4

7 years ago
(In reply to comment #0)

> On x86 it's possible to eliminate these steps entirely, reducing mops
> to a single MOV instruction, while still maintaining based addressing and
> bounds checking, using some segmentation tricks.

Could you elaborate on what the segmentation tricks are?
(Assignee)

Comment 5

7 years ago
@Rick:
> True, this is a good first step and should provide the isolation you need.

Any reason not use this for non x86 builds too, then?

@Rick & Edwin:
"segmentation tricks" was a bad choice of words, obviously :-P
It's actually pretty basic. Each segment (descriptor) has a base address and a limit. Memory accesses to the segment are zero based, and automtically adjusted to the correct address, and any out of bounds access raises a segmentation fault exception. That's exactly the ingreediants of the current implementation.
Basically, a segment is set up that spans the domainMemory array, and any access violations to this segment are caught and redirected to the RangeError thunk.
(Assignee)

Comment 6

7 years ago
(In reply to comment #3)
> IMHO we should consider changing the behavior for future SWF versions: instead
> of range-checking, constrain the memory to a power-of-two and trim the offset
> using a bitwise and.

This would only spare you the possible branching, that's all. And since you get nasty "edge cases" too (reading 2 bytes, one byte from the end), it would probablly be even more expensive.
(Assignee)

Comment 7

7 years ago
(In reply to comment #6)
> And since you get nasty "edge cases" too
> (reading 2 bytes, one byte from the end), it would
> probablly be even more expensive.

Rereading my comment I figure that this is probably irrelevant, since the result of such a thing doesn't need to be well-defined anyway. You could just augment the array with a safety-buffer. Still, this only gets rid of the branching, and maybe some of the arithmetic, and we still need to get the array's base & size from memory. For how long can they be reused? Basic blocks, assuming no register pressure? I mean, right now, even this trivial code:

loop:
	pushbyte 0;
	dup;
	si32;
jump loop;

translates into  read base & size + bounds check, on every iteration.

Not to mention the fact that simply trimming the offset would turn what's now a visible bug in user code, into a quite bug.

Btw, I'm going to take Rick's advice and create new bugs for sub-issues. Hope this doesn't cause too much noise.

Comment 8

7 years ago
(In reply to comment #6)
> This would only spare you the possible branching, that's all. 

But that would still be a win on architectures without segmentation tricks (i.e., pretty much everything except x86).
(Assignee)

Comment 9

7 years ago
(In reply to comment #8)
> (In reply to comment #6)
> > This would only spare you the possible branching, that's all. 
> But that would still be a win on architectures without segmentation tricks
> (i.e., pretty much everything except x86).

Oh yeah, forgot about those :)
Anyway, just keep in mind that this still comes at a price: you open a door for hard-to-find user-code bugs. (an oob pointer quitely wraps around and corrupts some data structure, etc...)
(Assignee)

Updated

7 years ago
Depends on: 628776
(Assignee)

Comment 10

7 years ago
Where does VMThread belong? Is is it an integral part of AVM?
It's in the 'vmbase' directory right next to things like 'atom', but in the VS project tree it's under 'shell'.
I need a mutex. Can I use it in 'avmplus'?

Comment 11

7 years ago
things in vmbase are meant as infrastructure usable by MMgc and avmplus, so yeah, fair game.  they aren't shell-only things.
(Assignee)

Comment 12

7 years ago
Is there a way to retreive the current DomainEnv from within CodegenLIR at jit-time? In central Domain was accessible through pool->domain, but in redux most of what used to be in Domain has been moved to DomainEnv, which seems to be retreived at run-time.

Comment 13

7 years ago
The same piece of JIT'd code will be used for varying DomainEnv's, hence the compiled code should load the address to use at runtime.  This was a bug in central and its been a while since that branch was updated.
(Assignee)

Comment 14

7 years ago
Ok, I got some numbers, and they're not very exciting. I am able to cut down up to about 15% of the execution-time, on-off. (The initial tests on central gave way more dramatic results)
Anyway, the tests I'm running are very simplistic, and might not be such a good indication. What are you using to test mops performance? There doesn't seem to be anything relevant in test/performance.
Also, it might be worth testing this on other machines too. The produced code for non-segmented mops isn't that different between central and redux (but performance is, at least in my case), which suggests that numbers might be processor-sensitive (e.g. branch-prediction depth, etc') If anyone's willing to give this a try on their machine, let me know.
(Assignee)

Comment 15

7 years ago
I ran a few more tests, only slightly more challenging, and I'm getting over 40% cut-down, so the earlier ones were definitely not very indicative.
Anyway, where do I take it from here? I'm waiting on some sort of feedback.

Comment 16

7 years ago
Step one would probably be to attach a patch to this bug, and request feedback from someone (eg Edwin, me, anyone else commenting so far)

Comment 17

7 years ago
If you haven't already come across these pages, more details can be found here https://developer.mozilla.org/en/Tamarin and here https://developer.mozilla.org/en/Hacking_Mozilla#Life_Cycle_of_a_Patch
(Assignee)

Comment 18

7 years ago
I will upload what I got so far, but it's by no means complete. There are still some loose ends, and I'm going to need some design clarifications to make progress (the sub-issue i posted a while ago, some of it irrelevant by now, got no response. Don't want to be a nagger, but I only got limited time to invest in this).
If anyone else wants to take over the code, that's fine too.

Comment 19

7 years ago
(In reply to comment #18)
> I will upload what I got so far, but it's by no means complete. There are still
> some loose ends, and I'm going to need some design clarifications to make
> progress

That's ok, sketches are fine.

> (the sub-issue i posted a while ago, some of it irrelevant by now, got
> no response. Don't want to be a nagger, but I only got limited time to invest
> in this).

What was the sub-issue? I don't seem to see it.
(Assignee)

Comment 20

7 years ago
> What was the sub-issue? I don't seem to see it.

It's the 'depends on' bug (628776). Sorry if I was too quite about it, I thought those changes are being notified.
(Assignee)

Comment 21

7 years ago
Created attachment 509563 [details] [diff] [review]
Initial implementation of segmented mops
(Assignee)

Comment 22

7 years ago
If you wonder why the patch is so big, it's because it includes an assembler used to build the performance test. It's not like there's tons of code :)
(Assignee)

Updated

7 years ago
Attachment #509563 - Attachment is obsolete: true
(Assignee)

Comment 23

7 years ago
Created attachment 510624 [details] [diff] [review]
Initial implmentation of segmented mops

Synced code with revision bfbc6d260f40, and added an altenative backend for segmentation_interface.
(Assignee)

Comment 24

7 years ago
Created attachment 510625 [details] [diff] [review]
Initial implementation of segmented mops - diff
(Assignee)

Comment 25

7 years ago
I tried to run the mops acceptance tests on linux with the original code, straight from the repository, and it segfaults. Is this a known issue?

Comment 26

7 years ago
(In reply to comment #25)
> linux segfaults. Is this a known issue?

News to me; can you send me the details on platform and testcase.
(Assignee)

Comment 27

7 years ago
(In reply to comment #26)
> News to me; can you send me the details on platform and testcase.

I'm getting this on ubuntu 10.04, inside a VM. Simple release build of bfbc6d260f40. Running test/acceptance/mops/mops.abc_ without any special flags.

Stack trace:

#0  avmplus::ExceptionFrame::beginCatch (this=0xb7e4e020) at core/Exception.cpp:188
#1  0x080fd77e in avmplus::interpBoxed (env=0xb7ec0140, _argc=0, _atomv=0xbfffecb8) at core/Interpreter.cpp:3233
#2  0x080f6d9c in avmplus::BaseExecMgr::invokeInterpNoCoerce (env=0xb7ec0140, argc=0, args=0xbfffecb8) at core/exec.cpp:812
#3  avmplus::BaseExecMgr::initInvokeInterpNoCoerce (env=0xb7ec0140, argc=0, args=0xbfffecb8) at core/exec.cpp:191
#4  0x080b0e14 in avmplus::MethodEnv::coerceEnter (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/MethodEnv-inlines.h:137
#5  avmplus::AvmCore::callScriptEnvEntryPoint (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:720
#6  avmplus::AvmCore::handleActionPool (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:800
#7  avmplus::AvmCore::handleActionBlock (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:847
#8  0x080702af in avmshell::ShellCore::handleArbitraryExecutableContent (this=0xb7e03020, settings=..., code=..., filename=0xbffff52a "test/acceptance/mops/mops.abc_") at shell/ShellCore.cpp:542
#9  0x080705f5 in avmshell::ShellCore::evaluateFile (this=0xb7e03020, settings=..., filename=0xbffff52a "test/acceptance/mops/mops.abc_") at shell/ShellCore.cpp:519
#10 0x0806cf69 in avmshell::Shell::singleWorkerHelper (shell=0xb7e03020, settings=...) at shell/avmshell.cpp:219
#11 0x0806d498 in avmshell::Shell::singleWorker (settings=...) at shell/avmshell.cpp:178
#12 0x0806dc9a in avmshell::Shell::run (argc=2, argv=0xbffff354) at shell/avmshell.cpp:146
#13 0x0807dc55 in main (argc=2, argv=0xbffff354) at shell/avmshellUnix.cpp:112

Windows / Mac are doing fine.
(Assignee)

Comment 28

7 years ago
Created attachment 510733 [details] [diff] [review]
thread_local.h small fix

There was a small mistake in therad_local.h, this fixes it. Should be applayed on top of the main patch (or just editted manually)
(Assignee)

Updated

7 years ago
Attachment #510733 - Attachment is patch: true
(Assignee)

Comment 29

7 years ago
Created attachment 511088 [details] [diff] [review]
Initial implementation

Fixed formatting (tabs to spaces, crlfs), new segfault_handler, fast-mops features turned on in avmshell-features.h
Attachment #510624 - Attachment is obsolete: true
Attachment #510625 - Attachment is obsolete: true
Attachment #510733 - Attachment is obsolete: true
(Assignee)

Comment 30

7 years ago
Created attachment 511212 [details] [diff] [review]
Initial implementation

Small fix to segfault_handler to make Mac happy.
Attachment #511088 - Attachment is obsolete: true
(Assignee)

Comment 31

7 years ago
Created attachment 511744 [details] [diff] [review]
Initial implementation

A few fixes to thread_local
Attachment #511212 - Attachment is obsolete: true
(Assignee)

Comment 32

7 years ago
If anyone got to look into this yet, or is willing to, let me know so I can request feedback.

Comment 33

7 years ago
I don't think anyone has looked at it yet; I've been pretty tied up recently, but hopefully will be able to take a look in the next week or so.
(Assignee)

Updated

7 years ago
Attachment #511744 - Attachment is obsolete: true
(Assignee)

Comment 34

7 years ago
Obsolted the last patch. Anyway, it's useless to keep uploading updates if no one looks at it. When anybody gets time to give feedback, let me know and I'll upload the most recent patch, so we'll be in sync.

Comment 35

7 years ago
Yeah, sorry for the delayed response. Please keep pinging us if you don't get feedback.
(In reply to comment #34)
> Obsolted the last patch. Anyway, it's useless to keep uploading updates if no
> one looks at it. When anybody gets time to give feedback, let me know and I'll
> upload the most recent patch, so we'll be in sync.

Presumably at some point you'll stop updating the code or move on to a different task; I'd recommend you upload your patch at that point.

Updated

7 years ago
Assignee: nobody → sold2
Flags: flashplayer-qrb+
Priority: -- → P3
(Assignee)

Comment 37

7 years ago
(In reply to comment #36)
> Presumably at some point you'll stop updating the code or move on to a
> different task; I'd recommend you upload your patch at that point.

Sorry for the delay, haven't forgot about this. I'll try to upload it somewhen next week.
(Assignee)

Comment 38

7 years ago
Created attachment 523438 [details] [diff] [review]
Segmented mops sketch

This patch is pretty big. It should have really been divided into smaller patches, but just to get something going...
Overall, this is pretty stable, but probably needs some refinements.
Attachment #523438 - Flags: feedback?
You need to log in before you can comment on or make changes to this bug.