TM: tracing thresholds are incorrect

RESOLVED INCOMPLETE

Status

--
major
RESOLVED INCOMPLETE
10 years ago
7 years ago

People

(Reporter: mark, Assigned: gal)

Tracking

({perf})

Trunk
x86
Linux

(Reporter)

Description

10 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.4) Gecko/2008111318 Ubuntu/8.04 (hardy) Firefox/3.0.4
Build Identifier: 20081201 Firefox/3.1b2

Benchmark: ScheduleWorld sw2 web application (free) Tools -> Browser Test -> Speed. 15,000 iterations are done on a sampling of code that is often used when the app is running.
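
For readers without access to the application, the shape of such a test can be sketched as below. This is only an illustration of a timed 15,000-iteration loop: `sampleWork` and `timeLoop` are hypothetical names, and the real ScheduleWorld benchmark body is not shown in this report.

```javascript
// Hypothetical stand-in for the application code under test; the real
// ScheduleWorld benchmark body is not part of this bug report.
function sampleWork(i) {
  let acc = 0;
  for (let j = 0; j < 10; j++) {
    acc += (i * j) % 7;
  }
  return acc;
}

// Run `fn` `iterations` times and report the elapsed wall-clock time.
function timeLoop(fn, iterations) {
  const start = Date.now();
  let sink = 0; // accumulate results so the calls cannot be skipped as dead code
  for (let i = 0; i < iterations; i++) {
    sink += fn(i);
  }
  const elapsedMs = Date.now() - start;
  return { elapsedMs: elapsedMs, sink: sink };
}

const result = timeLoop(sampleWork, 15000);
console.log("15000 iterations took " + result.elapsedMs + " ms");
```

Running the same loop once with the JIT pref on and once with it off gives the pair of timings compared in this report.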

On a Windows XP machine (a virtual machine: host=linux, guest=XP) the benchmark runs 13x faster with tracemonkey enabled!
On a real XP machine (not virtual) tracemonkey is the same speed or slightly slower.

Summary: exact same benchmark code, exact same OS, but tracemonkey only has an effect while running in the virtual machine.

It seems the thresholds for when to trace are wrong. I tried on a slower machine but perhaps it wasn't slow enough. I also tried on a slow 700MHz laptop running Linux but tracemonkey didn't kick in there either. 

I'm guessing there are timing thresholds in tracemonkey that are preventing it from tracing code that really should be traced. I know it should be traced because it makes a huge difference in the VM case. An amazing difference...

I noted the same behaviour with 3.1b1 and minefield 3.1b2pre.



Reproducible: Always

Steps to Reproduce:
1. Log in to http://www.scheduleworld.com/sw2/
2. Run Tools -> Browser Test -> Speed
3. Make sure the JIT is enabled.
Actual Results:  
JIT does not increase speed on a real machine.
JIT dramatically increases speed on a virtual machine.

Expected Results:  
JIT increases speed on a real machine.

You can always be more memory efficient in a later release by compiling less. If it makes this problem easier to solve, please consider adjusting TM to compile more often / with decreased thresholds. People need to see how amazing this is.

Comment 1

10 years ago
I duplicated this test on a native Linux machine (Mandriva 2009), using last night's "nightly", and got the following:

javascript.options.jit.content=True , 1139-1156 milliseconds
javascript.options.jit.content=False, 1024-1036 milliseconds

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b3pre) Gecko/20090105 Shiretoko/3.1b3pre - Build ID: 20090105020425

Mandriva 2.6.27.5-desktop-2mnb #1 SMP Thu Nov 20 15:20:32 EST 2008 i686 AMD Athlon(tm) 64 X2 Dual Core Processor 5400+ GNU/Linux


I will proceed to download the same Nightly on native Windows XP (same machine, it's dual boot, and I'll try to use the same profile-- my Linux profile is readable from Windows). Results next post, after I log in from Windows...

Comment 2

10 years ago
OK, here I am on Windows XP-SP3 with all updates applied. Windows says: "AMD Athlon 64 X2 Dual core 5400+, 2.00 GB of RAM" (same as Linux, they both recognize and use the dual core.) Same Firefox profile, same extensions in use. (BTW, jit.chrome was left "false" for all tests.)

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008052912 Firefox/3.0 - Build ID: 2008052912 - Build ID: 20090105045505

javascript.options.jit.content=True , 889-956 milliseconds
javascript.options.jit.content=False, 813-866 milliseconds
- - - - -

Windows Native is faster than Linux Native, but the ratio of degradation with jit.content=True is nearly identical. Thus, I have (hopefully) isolated the "bug" to Tracemonkey within a purely VM versus native environment, NOT Tracemonkey in Linux versus Windows.
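
The "nearly identical" ratio can be checked with a few lines of arithmetic on the timings posted in comments 1 and 2. Taking the midpoint of each reported range is an assumption made here for illustration (the comments give only ranges), and the function names are hypothetical.

```javascript
// Timing ranges in milliseconds, copied from comments 1 and 2.
const timings = {
  linux:   { jitOn: [1139, 1156], jitOff: [1024, 1036] },
  windows: { jitOn: [889, 956],   jitOff: [813, 866] },
};

// Assumption: use the midpoint of each reported range as the representative time.
function midpoint(range) {
  return (range[0] + range[1]) / 2;
}

// Ratio above 1 means the run with jit.content=True took longer.
function jitSlowdown(t) {
  return midpoint(t.jitOn) / midpoint(t.jitOff);
}

for (const os of Object.keys(timings)) {
  console.log(os + ": JIT on / JIT off = " + jitSlowdown(timings[os]).toFixed(3));
}
// prints "linux: JIT on / JIT off = 1.114" and "windows: JIT on / JIT off = 1.099"
```

Under that midpoint assumption, enabling the JIT costs roughly 11% on Linux and 10% on Windows for this test.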

I'm not an expert, but I'll SWAG that the difference isn't caused by "something about TraceMonkey thresholds". More likely, I'll guess, the VM code does a really fantastic job of scheduling and swapping between "virtual" threads which are contending against each other under Windows-- but can't provide the same performance enhancement for a virtualized Linux OS.

At least, not with our application.

Oh, and BTW, I run Linux with EXTREME Compiz-Fusion "eye-candy". Although I didn't do any "wobbly windows" movements or "spinning cubes" or "ring switcher" operations during the tests, I'm configured with really fast mouse polling-- and that probably creates a lot of overhead versus my dumb, ugly Windows implementation. (I theoretically could test for this, but we've all got more important things to do....)
(Reporter)

Comment 3

10 years ago
No I/O is done in the test, and there should be no thread/task switching during the entire test except for what the OS timer tick asks for (negligible).

I think what you proved is that JavaScript runs faster in Windows than in Linux on the same hardware. I think this is still useful to know btw, and I'm thankful you took the time to investigate.

I will acknowledge that Windows XP running in a client QEMU/KVM Linux VM may simply be triggering a special case inside that won't happen in real machines.

The crux of this bug report was based on the fact that the new TraceMonkey isn't speeding up important parts of a large JavaScript application. From reading the blogs and white papers on tracing it seemed like TM has a great deal of potential. When I saw the large speedup while running XP in a VM I simply thought I could provide some evidence that TraceMonkey might just need some tweaking - it is beta software after all.

I still hold out hope that some TM setting that affects the JIT compilation is set wrong or the timer resolution that helps determine if code should be traced or not isn't handled correctly, or ...

Comment 4

10 years ago
Mark, I think that we're in violent agreement ;) on nearly everything you just posted. Here are some comments back, kinda long:

First, and I feel this is most important-- although I'm NOT a competent coding person, and cannot "take charge" of your bug: I agree that TM's failure to improve this particular loop test is interesting, VERY interesting, and worthy of a bug all by itself. But this title and bug description, "piling on" all kinds of baggage about Linux versus Windows, and about vastly better performance while running Windows within an unidentified hosting VM manager... I think the bug should be tightened down to focus only on the "degrades when JIT is off" issue. (Again, just my OPINION.)

Second: I know, absolutely and totally for sure, that you're under-estimating the number of context switches involved here while running native Threads. In my "too much eye-candy" desktop, with no disk I/O at all, and when I let it "quiet down" by stopping my typing of this post for several seconds, I'm STILL seeing no less than 900 context switches per second. 
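
The comment does not say how the figure was measured. One way to get a comparable system-wide number on Linux is to sample the cumulative `ctxt` counter in `/proc/stat` twice and divide by the interval. The sketch below assumes Node.js is available; `parseContextSwitches` and `switchesPerSecond` are hypothetical helper names.

```javascript
const fs = require("fs");

// Extract the cumulative context-switch count from the text of /proc/stat.
function parseContextSwitches(procStatText) {
  const match = /^ctxt\s+(\d+)/m.exec(procStatText);
  if (match === null) {
    throw new Error("no 'ctxt' line found");
  }
  return Number(match[1]);
}

// Convert two cumulative samples into a per-second rate.
function switchesPerSecond(before, after, intervalMs) {
  return ((after - before) * 1000) / intervalMs;
}

// Sample the live counter about one second apart (only where /proc/stat exists).
const STAT_PATH = "/proc/stat";
try {
  if (fs.existsSync(STAT_PATH)) {
    const startMs = Date.now();
    const before = parseContextSwitches(fs.readFileSync(STAT_PATH, "utf8"));
    setTimeout(function () {
      try {
        const after = parseContextSwitches(fs.readFileSync(STAT_PATH, "utf8"));
        const rate = switchesPerSecond(before, after, Date.now() - startMs);
        console.log(rate.toFixed(0) + " context switches/s (system-wide)");
      } catch (err) {
        console.log("could not take second sample: " + err.message);
      }
    }, 1000);
  }
} catch (err) {
  console.log("could not sample " + STAT_PATH + ": " + err.message);
}
```

Note that this counter covers every process on the machine, so it says nothing by itself about how many of those switches interrupt the benchmark.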

This leads to two sub-points. First subpoint, I'm willing to bet many virtual beers that I can make Linux "win" against Windows simply by stopping Compiz-Fusion and switching from KDE to a much lighter Window Manager (e.g., e17 with GTK+ support, or ICEWM with Gnome support, etc.). And doing absolutely nothing else. Even with all my desktop overhead, it's barely 10% disadvantage-- and heck, maybe dumping Compiz alone would be enough, still keeping one of the "fat" WMs in charge (KDE or GNOME).

Now that I've obtained hard numbers for Linux versus windows on identical hardware, I'd like to recommend that we also toss out Linux versus Windows as an "issue" of this bug. (again, it's YOUR bug, this is only my feeling.)

Second subpoint: now that I've come back with actual counts of context switches on Linux for my too-many-layers software stack, it might be appropriate to also toss out the Windows-within-VM-Manager-versus-Native part of the bug. With this new bit of data, I feel even more confident in guessing that the VM implementation does dramatically better by implementing a many-to-one mapping of Windows Threads on to VM Manager native Threads. But without access to performance analysis tools for that particular VM manager, Firefox developers might not be able to get a good handle on why it's so good.

(Still on second sub-point) However, I can easily imagine Firefox use of thread-like concurrency structures, both "native" and "internal/lightweight", to be implemented in a less-than-optimal manner. (Gecko is still at "Version One".) Tracemonkey, within Firefox, could be suffering from excessive overhead. But I'll SWAG that to be a kinda big research project-- and definitely a different bug ID, even if some TM or Firefox code expert DOES raise a hand to say "I know of some easy, low risk changes which could improve this a lot".

So what's left after we toss out "Windows versus Linux" completely, and move "Firefox Windows running jit.content=true inside VM manager xxx is amazingly faster than same Firefox running within Windows native" to another bug? Exactly what you just said is left-- running with jit.content=true degrades Firefox performance on this particular script, and we both wonder why this 15000x repetition loop didn't get *better*.
(Reporter)

Comment 5

10 years ago
<quote> focus only on the "degrades when JIT is off" issue</quote>
I think this is an oversimplification.

<quote>"piling on" all kinds of baggage about Linux versus Windows</quote>
I've never argued it was a Windows vs Linux issue.

<quote> you're under-estimating the number of context switches </quote>
My experience analyzing the CPU cycles lost to context switching tells me this is irrelevant for this particular test. If you have evidence that shows otherwise please post.

I don't believe this is a threading issue either (syscalls vs user mode futex to guard resources etc.) because I have no evidence that the single threaded test is facing massive contention for resources. I simply don't see how it could. I usually do this analysis with tools under Linux and don't know how to do this analysis under Windows so I can't provide data.

It would be fine to agree to disagree and leave this up to the TM folks.

<quote>and move "Firefox Windows running jit.content=true inside VM manager xxx is amazingly faster than same Firefox running within Windows native" to another bug</quote>
I've really tried to build a case for _this_ bug around this evidence:
1. in VM without JIT: speed = X
2. in VM with JIT: speed = 13X
3. no VM without JIT: speed = Y
4. no VM with JIT: speed = ~Y (why not ~13Y?)

I'm simply hoping the TM folks can use this data to make TM better. I am reluctant to speculate further. I trust the TM devs to take the data for what it's worth and do the right thing. I'm willing to leave it at that.

Cheers.

Comment 6

10 years ago
OK.
(Assignee)

Updated

10 years ago
Assignee: general → gal
From what I understand here, there should be a significant tm perf gain here, requesting wanted1.9.1?
Flags: wanted1.9.1?
Keywords: perf
Is this still valid?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: wanted1.9.1? → wanted1.9.2?
(Reporter)

Comment 9

9 years ago
The 13x speedup difference is no longer reproducible.
I wonder if whatever corner case was causing this has been 'fixed'?
I wonder if there are too many steps in the function() that is being tested and TM is no longer tracing it. It would be really handy to me to be able to tell when this occurs for given functions. One too many steps and TM is disabled; if there was a way to test for that it wouldn't be too hard to make minor code changes to get huge performance benefits. I digress...
Flags: wanted1.9.2?
Looks like scheduleworld.com no longer exists. Mark, is there an alternative site to test? Otherwise, this should be closed as INCOMPLETE.
Just realized that Mark has an email at scheduleworld.com as well. Seems unlikely we're going to hear back from him.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE