Closed Bug 628350 Opened 13 years ago Closed 6 years ago

Mops performance improvement on x86

Categories

(Tamarin Graveyard :: Virtual Machine, enhancement, P3)

x86
All
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sold2, Assigned: sold2)

References

Details

Attachments

(1 file, 7 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6.6; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E)
Build Identifier: tamarin-redux-c72c9b1c20a9

As of the current implementation, mops (domainMemory load/store instructions, sXX/lXX) carry a significant overhead: the array's base address  and size are fetched from memory, then the offset is bounds checked. The JIT only makes a shallow attempt to eliminate redundancies, so these costs are more-or-less per-mop. On x86 it's possible to eliminate these steps entirely, reducing mops to a single MOV instruction, while still maintaining based addressing and bounds checking, using some segmentation tricks.

Reproducible: Always
There needs to be a way to distinguish between a mop load/store, and any other form of load/strore. Would adding a new access set do? (e.g ACCSET_DOMAIN. mops currently use ACCSET_OTHER) As far as I can tell there shouldn't be any aliasing issues here (it can only improve aliasing analysis, if anything)
(In reply to comment #1)
> As far as I can tell there shouldn't be any
> aliasing issues here (it can only improve aliasing analysis, if anything)

True, this is a good first step and should provide the isolation you need.  You may want to open a new bug for each independent piece of work (e.g. access set, segmentation)  and then use this bug as a 'tracker' (i.e the 'depends on:' field of this bug is filled with the other bug numbers).

re: the segmentation tricks, not sure what you have in mind, but just a gentle reminder that one use of tamarin is that its embedded in a plug-in that relies on the hosting application for resources.
IMHO we should consider changing the behavior for future SWF versions: instead of range-checking, constrain the memory to a power-of-two and trim the offset using a bitwise and.
(In reply to comment #0)

> On x86 it's possible to eliminate these steps entirely, reducing mops
> to a single MOV instruction, while still maintaining based addressing and
> bounds checking, using some segmentation tricks.

Could you elaborate on what the segmentation tricks are?
@Rick:
> True, this is a good first step and should provide the isolation you need.

Any reason not use this for non x86 builds too, then?

@Rick & Edwin:
"segmentation tricks" was a bad choice of words, obviously :-P
It's actually pretty basic. Each segment (descriptor) has a base address and a limit. Memory accesses to the segment are zero based, and automtically adjusted to the correct address, and any out of bounds access raises a segmentation fault exception. That's exactly the ingreediants of the current implementation.
Basically, a segment is set up that spans the domainMemory array, and any access violations to this segment are caught and redirected to the RangeError thunk.
(In reply to comment #3)
> IMHO we should consider changing the behavior for future SWF versions: instead
> of range-checking, constrain the memory to a power-of-two and trim the offset
> using a bitwise and.

This would only spare you the possible branching, that's all. And since you get nasty "edge cases" too (reading 2 bytes, one byte from the end), it would probablly be even more expensive.
(In reply to comment #6)
> And since you get nasty "edge cases" too
> (reading 2 bytes, one byte from the end), it would
> probablly be even more expensive.

Rereading my comment I figure that this is probably irrelevant, since the result of such a thing doesn't need to be well-defined anyway. You could just augment the array with a safety-buffer. Still, this only gets rid of the branching, and maybe some of the arithmetic, and we still need to get the array's base & size from memory. For how long can they be reused? Basic blocks, assuming no register pressure? I mean, right now, even this trivial code:

loop:
	pushbyte 0;
	dup;
	si32;
jump loop;

translates into  read base & size + bounds check, on every iteration.

Not to mention the fact that simply trimming the offset would turn what's now a visible bug in user code, into a quite bug.

Btw, I'm going to take Rick's advice and create new bugs for sub-issues. Hope this doesn't cause too much noise.
(In reply to comment #6)
> This would only spare you the possible branching, that's all. 

But that would still be a win on architectures without segmentation tricks (i.e., pretty much everything except x86).
(In reply to comment #8)
> (In reply to comment #6)
> > This would only spare you the possible branching, that's all. 
> But that would still be a win on architectures without segmentation tricks
> (i.e., pretty much everything except x86).

Oh yeah, forgot about those :)
Anyway, just keep in mind that this still comes at a price: you open a door for hard-to-find user-code bugs. (an oob pointer quitely wraps around and corrupts some data structure, etc...)
Depends on: 628776
Where does VMThread belong? Is is it an integral part of AVM?
It's in the 'vmbase' directory right next to things like 'atom', but in the VS project tree it's under 'shell'.
I need a mutex. Can I use it in 'avmplus'?
things in vmbase are meant as infrastructure usable by MMgc and avmplus, so yeah, fair game.  they aren't shell-only things.
Is there a way to retreive the current DomainEnv from within CodegenLIR at jit-time? In central Domain was accessible through pool->domain, but in redux most of what used to be in Domain has been moved to DomainEnv, which seems to be retreived at run-time.
The same piece of JIT'd code will be used for varying DomainEnv's, hence the compiled code should load the address to use at runtime.  This was a bug in central and its been a while since that branch was updated.
Ok, I got some numbers, and they're not very exciting. I am able to cut down up to about 15% of the execution-time, on-off. (The initial tests on central gave way more dramatic results)
Anyway, the tests I'm running are very simplistic, and might not be such a good indication. What are you using to test mops performance? There doesn't seem to be anything relevant in test/performance.
Also, it might be worth testing this on other machines too. The produced code for non-segmented mops isn't that different between central and redux (but performance is, at least in my case), which suggests that numbers might be processor-sensitive (e.g. branch-prediction depth, etc') If anyone's willing to give this a try on their machine, let me know.
I ran a few more tests, only slightly more challenging, and I'm getting over 40% cut-down, so the earlier ones were definitely not very indicative.
Anyway, where do I take it from here? I'm waiting on some sort of feedback.
Step one would probably be to attach a patch to this bug, and request feedback from someone (eg Edwin, me, anyone else commenting so far)
If you haven't already come across these pages, more details can be found here https://developer.mozilla.org/en/Tamarin and here https://developer.mozilla.org/en/Hacking_Mozilla#Life_Cycle_of_a_Patch
I will upload what I got so far, but it's by no means complete. There are still some loose ends, and I'm going to need some design clarifications to make progress (the sub-issue i posted a while ago, some of it irrelevant by now, got no response. Don't want to be a nagger, but I only got limited time to invest in this).
If anyone else wants to take over the code, that's fine too.
(In reply to comment #18)
> I will upload what I got so far, but it's by no means complete. There are still
> some loose ends, and I'm going to need some design clarifications to make
> progress

That's ok, sketches are fine.

> (the sub-issue i posted a while ago, some of it irrelevant by now, got
> no response. Don't want to be a nagger, but I only got limited time to invest
> in this).

What was the sub-issue? I don't seem to see it.
> What was the sub-issue? I don't seem to see it.

It's the 'depends on' bug (628776). Sorry if I was too quite about it, I thought those changes are being notified.
If you wonder why the patch is so big, it's because it includes an assembler used to build the performance test. It's not like there's tons of code :)
Attachment #509563 - Attachment is obsolete: true
Synced code with revision bfbc6d260f40, and added an altenative backend for segmentation_interface.
I tried to run the mops acceptance tests on linux with the original code, straight from the repository, and it segfaults. Is this a known issue?
(In reply to comment #25)
> linux segfaults. Is this a known issue?

News to me; can you send me the details on platform and testcase.
(In reply to comment #26)
> News to me; can you send me the details on platform and testcase.

I'm getting this on ubuntu 10.04, inside a VM. Simple release build of bfbc6d260f40. Running test/acceptance/mops/mops.abc_ without any special flags.

Stack trace:

#0  avmplus::ExceptionFrame::beginCatch (this=0xb7e4e020) at core/Exception.cpp:188
#1  0x080fd77e in avmplus::interpBoxed (env=0xb7ec0140, _argc=0, _atomv=0xbfffecb8) at core/Interpreter.cpp:3233
#2  0x080f6d9c in avmplus::BaseExecMgr::invokeInterpNoCoerce (env=0xb7ec0140, argc=0, args=0xbfffecb8) at core/exec.cpp:812
#3  avmplus::BaseExecMgr::initInvokeInterpNoCoerce (env=0xb7ec0140, argc=0, args=0xbfffecb8) at core/exec.cpp:191
#4  0x080b0e14 in avmplus::MethodEnv::coerceEnter (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/MethodEnv-inlines.h:137
#5  avmplus::AvmCore::callScriptEnvEntryPoint (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:720
#6  avmplus::AvmCore::handleActionPool (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:800
#7  avmplus::AvmCore::handleActionBlock (this=0xb7e03020, code=..., start=0, toplevel=0xb7e6c040, ninit=0x0, codeContext=0xb7e060b8, apiVersion=avmplus::kApiVersion_SWF_13) at core/AvmCore.cpp:847
#8  0x080702af in avmshell::ShellCore::handleArbitraryExecutableContent (this=0xb7e03020, settings=..., code=..., filename=0xbffff52a "test/acceptance/mops/mops.abc_") at shell/ShellCore.cpp:542
#9  0x080705f5 in avmshell::ShellCore::evaluateFile (this=0xb7e03020, settings=..., filename=0xbffff52a "test/acceptance/mops/mops.abc_") at shell/ShellCore.cpp:519
#10 0x0806cf69 in avmshell::Shell::singleWorkerHelper (shell=0xb7e03020, settings=...) at shell/avmshell.cpp:219
#11 0x0806d498 in avmshell::Shell::singleWorker (settings=...) at shell/avmshell.cpp:178
#12 0x0806dc9a in avmshell::Shell::run (argc=2, argv=0xbffff354) at shell/avmshell.cpp:146
#13 0x0807dc55 in main (argc=2, argv=0xbffff354) at shell/avmshellUnix.cpp:112

Windows / Mac are doing fine.
Attached patch thread_local.h small fix (obsolete) — Splinter Review
There was a small mistake in therad_local.h, this fixes it. Should be applayed on top of the main patch (or just editted manually)
Attachment #510733 - Attachment is patch: true
Attached patch Initial implementation (obsolete) — Splinter Review
Fixed formatting (tabs to spaces, crlfs), new segfault_handler, fast-mops features turned on in avmshell-features.h
Attachment #510624 - Attachment is obsolete: true
Attachment #510625 - Attachment is obsolete: true
Attachment #510733 - Attachment is obsolete: true
Attached patch Initial implementation (obsolete) — Splinter Review
Small fix to segfault_handler to make Mac happy.
Attachment #511088 - Attachment is obsolete: true
Attached patch Initial implementation (obsolete) — Splinter Review
A few fixes to thread_local
Attachment #511212 - Attachment is obsolete: true
If anyone got to look into this yet, or is willing to, let me know so I can request feedback.
I don't think anyone has looked at it yet; I've been pretty tied up recently, but hopefully will be able to take a look in the next week or so.
Attachment #511744 - Attachment is obsolete: true
Obsolted the last patch. Anyway, it's useless to keep uploading updates if no one looks at it. When anybody gets time to give feedback, let me know and I'll upload the most recent patch, so we'll be in sync.
Yeah, sorry for the delayed response. Please keep pinging us if you don't get feedback.
(In reply to comment #34)
> Obsolted the last patch. Anyway, it's useless to keep uploading updates if no
> one looks at it. When anybody gets time to give feedback, let me know and I'll
> upload the most recent patch, so we'll be in sync.

Presumably at some point you'll stop updating the code or move on to a different task; I'd recommend you upload your patch at that point.
Assignee: nobody → sold2
Flags: flashplayer-qrb+
Priority: -- → P3
(In reply to comment #36)
> Presumably at some point you'll stop updating the code or move on to a
> different task; I'd recommend you upload your patch at that point.

Sorry for the delay, haven't forgot about this. I'll try to upload it somewhen next week.
This patch is pretty big. It should have really been divided into smaller patches, but just to get something going...
Overall, this is pretty stable, but probably needs some refinements.
Attachment #523438 - Flags: feedback?
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: