Closed Bug 509097 Opened 16 years ago Closed 14 years ago

TM: background pool allocator for a few frequent malloc size classes

Categories

(Core :: JavaScript Engine, defect)

Hardware: x86 macOS
Priority: Not set
Severity: normal
Tracking

Status: RESOLVED WONTFIX

People

(Reporter: gal, Assigned: gal)

Details

Attachments

(2 files)

No description provided.
Attached patch: patch
Assignee: general → gal
10-15ms speedup on SS in an MT shell.
I wonder if this is a speedup on Linux.
Comment 2 is for Mac. I am interested in the question in comment 3 as well. The speedup was less than what I was hoping for.
I think I understand why the speedup is not great compared to the nice results of the magazine allocator in Bonwick01 and as implemented in GNOME (very nice comments in http://git.gnome.org/cgit/glib/tree/glib/gslice.c). All of this should be prefixed with a big "if I read the patch right"...

The idea of the magazine allocator is that mallocs and frees from a single thread come out of and go into one of a couple of 'magazines' held by the thread. That way, if I do:

  for (int i = 0; i < 1000000; ++i) { cx->free(cx->malloc(4)); }

I will only need to use the global [locked] malloc exactly once. In this patch, frees either go back to global free or (during GC) back to the background-freeing thread. Hence, all allocs must ultimately come from global malloc, and the real effect of this patch is to avoid making this malloc a blocking call only some of the time. So, right after a GC, global free will be swamped, so cx->frees (all of them) and cx->mallocs (after JSMallocPool::freelist runs out) will be sure to contend.

Now, about the magazine allocator: a magazine is just a fixed-size array of pointers to free allocations. The thread gives magazines back to a global "depot" when it has too many full ones, and exchanges empty for full magazines with the depot when it runs out. For our purposes, especially if we go forward with the externally single-threaded model, this "giving back to the depot" could be replaced by "fire an asynchronous free task", and "pulling magazines from the depot" could be "fire an asynchronous malloc task" (perhaps even preemptively, when we reach a low-water mark). This would seriously reduce global malloc/free contention.

There is a whole separate aspect of the magazine allocator that could also yield a speedup: these small objects are allocated in big slabs, so there is no per-alloc bookkeeping. In this patch, each slab holds a single 4/8/16-byte allocation. The challenge is how to determine the size class when given a pointer in 'free'.
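The magazine scheme described above could be sketched roughly like this. All names are hypothetical (this is not the patch's code), it covers a single size class, and it omits empty-magazine recycling for brevity:

```cpp
#include <cstdlib>
#include <mutex>
#include <vector>

constexpr size_t kMagazineSize = 64;   // pointers per magazine
constexpr size_t kBlockSize    = 16;   // one size class, for brevity

struct Magazine {
    void*  slots[kMagazineSize];
    size_t count = 0;
};

// The shared "depot": full magazines, guarded by a single lock.
struct Depot {
    std::mutex lock;
    std::vector<Magazine*> full;
    Magazine* getFull() {                       // nullptr if the depot is dry
        std::lock_guard<std::mutex> g(lock);
        if (full.empty()) return nullptr;
        Magazine* m = full.back();
        full.pop_back();
        return m;
    }
    void putFull(Magazine* m) {
        std::lock_guard<std::mutex> g(lock);
        full.push_back(m);
    }
};

Depot depot;
thread_local Magazine* mag = nullptr;           // per-thread magazine

void* magAlloc() {
    if (mag && mag->count > 0)
        return mag->slots[--mag->count];        // lock-free fast path
    if (Magazine* m = depot.getFull()) {        // one lock per 64 allocs
        delete mag;
        mag = m;
        return mag->slots[--mag->count];
    }
    return malloc(kBlockSize);                  // fall back to global malloc
}

void magFree(void* p) {
    if (!mag) mag = new Magazine();
    if (mag->count == kMagazineSize) {          // magazine full: hand to depot
        depot.putFull(mag);
        mag = new Magazine();
    }
    mag->slots[mag->count++] = p;               // lock-free fast path
}
```

With this shape, the `cx->free(cx->malloc(4))` loop above bounces the same block between the thread's magazine and never touches the global lock after the first allocation.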
GNOME's g_slice_free requires size as a parameter. Another option is to use some page-level reserve/commit trickery so that we can classify allocations by address range, but that gets away from global malloc altogether. If nothing else, returning to the allocation-per-slab strategy, we could perhaps pull this information from the underlying malloc/free bookkeeping.
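One common way to do the address-range classification mentioned above is to carve allocations out of slabs that are aligned to their own size, so masking any interior pointer yields the slab header. A minimal sketch, with hypothetical names and no freelist bookkeeping:

```cpp
#include <cstdint>
#include <cstdlib>

// Slabs are 64 KiB and 64 KiB-aligned, so any pointer into a slab can be
// masked down to the slab base, where a small header records the size class.
constexpr size_t kSlabSize = 64 * 1024;

struct SlabHeader {
    uint32_t sizeClass;          // e.g. 4, 8 or 16 bytes
    // ... freelist bookkeeping would follow in a real allocator ...
};

// Allocate one slab aligned to its own size; the header lives at the base.
SlabHeader* newSlab(uint32_t sizeClass) {
    void* p = aligned_alloc(kSlabSize, kSlabSize);   // C++17
    SlabHeader* h = static_cast<SlabHeader*>(p);
    h->sizeClass = sizeClass;
    return h;
}

// free() can recover the size class from the pointer alone: no size
// parameter (unlike g_slice_free) and no per-alloc header needed.
uint32_t sizeClassOf(void* p) {
    uintptr_t base = reinterpret_cast<uintptr_t>(p) &
                     ~(uintptr_t(kSlabSize) - 1);
    return reinterpret_cast<SlabHeader*>(base)->sizeClass;
}
```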
It occurred to me that a good malloc would already be doing something thread-local like the magazine allocator. I wrote the attached 5-part test, which does a bunch of small-size malloc/free in 0 (main()), 1, 2, or 3 threads. Commenting in each individual part, with g++ (4.3.3, -O3 -pthread), on Linux:

1. in main(): 2.7s
2. in main(), after creating a noop thread: 4.8s
3. in a separate thread: 4.8s
4. in two threads: 9.7s
5. in three threads: 15.6s

So, glibc malloc would definitely benefit from a magazine allocator. I'd be interested to hear how jemalloc, OS X malloc, and MSVC malloc do.
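The attachment itself isn't shown here, but the test described above amounts to something like the following reconstruction (not the actual attachment; iteration counts and size mix are placeholders):

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>
#include <vector>

std::atomic<long> allocCount{0};   // just to observe the work done

// The hot loop: many small, short-lived allocations of a few size classes.
void hammer(int iters) {
    for (int i = 0; i < iters; ++i)
        for (size_t sz : {4, 8, 16, 32}) {
            free(malloc(sz));
            ++allocCount;
        }
}

// nThreads == 0 runs the loop on the calling thread (test 1); otherwise the
// same loop runs concurrently in nThreads workers (tests 3-5). Timing each
// variant exposes how much the allocator's locking costs under contention.
void runTest(int nThreads, int iters) {
    if (nThreads == 0) {
        hammer(iters);
        return;
    }
    std::vector<std::thread> ts;
    for (int t = 0; t < nThreads; ++t)
        ts.emplace_back(hammer, iters);
    for (auto& t : ts)
        t.join();
}
```

Test 2 (the interesting one) is `runTest(0, N)` after spawning and joining a thread that does nothing, which is what flips glibc out of its single-threaded fast path.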
Attachment #393999 - Attachment mime type: text/x-c++src → text/plain
So... my laptop is a single core :-P (It's late.) I'll re-run tomorrow.
Ok, this time on a Core 2 Duo:

1. in main(): 1.2s
2. in main(), after creating a noop thread: 2.5s
3. in a separate thread: 2.5s
4. in two threads: 2.6s
5. in three threads: 4.0s

So, it looks like glibc switches to a different/slower scheme once any thread has been created, but that scheme has low contention.
Re comment 8: I noticed (2) too. This is one of the major reasons we are slower in the browser than in the ST shell. Also, OS X has a bulk malloc that gives you N blocks of size S. I was about to add that, but maybe you want to play with it. Also, I don't think lock contention explains the meager SS speedup, since we never GC during SS and hence never free.
What is the "use jemalloc on Mac OS X" situation? /be
So, interesting results, and many thanks to dvander for getting the tests to compile and running them on Mac and Windows. All of this is w.r.t. tests 1-5 in comment 8:

- OS X: In agreement with gal's cries, 2 is twice as slow as 1 and the same speed as 3. 4 is about 2x slower than 3; thus, they clearly do no effective thread-local anything.
- Windows (MS CRT): same situation as OS X.
- Windows (jemalloc): same situation, except that 1 was as slow as 2.

So, basically, I think the mental model we should assume for anything but glibc is "malloc and free start and end with a big fat lock", since, if you look at the test case, this is pretty much the simplest possible scenario to make thread-local. Hence, I think a thread-local strategy is the way to go to reduce contention. If we can go ahead with the "JSRuntimes are not thread-safe; the caller must make multiple JSRuntimes or synchronize with a lock" plan, I think we can be even faster/simpler than the general magazine allocator.
JSRuntime not thread-safe is probably a year out. I would prefer a more short-term win. Some 90%+ of our allocations go through the pool allocator. It should offer near-ideal performance as long as we buffer enough to overcome contention periods. Why should we not go this route?
I talked to Luke in person, but there's no (as in zero) goal or likelihood of JSRuntime being thread-unsafe. We will always have MT embeddings. They will have a shared process analogue (magazine allocators have the "depot", Luke tells me). This is shared among threads. It needs synchronized access, whether via locks or lock-free techniques. In an MT embedding, if you want lock-free storage, use JSThreadData within JSThread, reached from your cx. In an ST embedding, there's no JSThread, just a JSThreadData singleton member of JSRuntime. /be
(In reply to comment #13) > I talked to Luke in person, but there's no (as in zero) goal or likelihood of > JSRuntime being thread-unsafe. We will always have MT embeddings. They will > have a shared process analogue (magazine allocators have the "depot", Luke > tells me). This is shared among threads. It needs synchronized access, whether > via locks or lock-free techniques. Forgot the punchline. This shared thingie is JSRuntime. There's no point in renaming it or re-inventing it. Hence my confusion about comment 11, which didn't get cleared up by comment 12 and its rogue agenda :-P. /be
(In reply to comment #12) > JSRuntime not thread-safe is probably a year out. I would prefer a more > short-term win. Some 90%+ of our allocations go through the pool allocator. It > should offer near-ideal performance as long we buffer enough to overcome > contention periods. Why should we not go this route? Which pool allocator are you talking about again? Currently, or in a patch in a bug somewhere?
Again, MT embedding means -DJS_THREADSAFE. ST embedding means -UJS_THREADSAFE. If we want to make Gecko be an ST embedding, great -- no argument. The isolation for worker threads will be enforced by the JS VM instead of the worker code and any DOM code it uses. But making Gecko an ST embedding of SpiderMonkey does not seem to be what was intended in comment 12, since there is no need to make "JSRuntime not thread-safe [in] a year [...]". We can compile SpiderMonkey -UJS_THREADSAFE today, no work on JSRuntime needed. The work would be in the workers code, in proxy autoconfig (IIRC), in anything else I'm forgetting that uses a shared runtime among threads today in Gecko, and in other XUL apps and plugins. /be
Re comment 16: My comment meant "If we want to make JSRuntime never shared, that would be a long, hard road." It didn't imply we want to do that. Either way, I want shorter-term wins, and I think we can easily get those.
(Continuing comment 15) I talked to Andreas in person, and he pointed out that frees happen almost never when we are not GC'ing, so that might be a reason why a magazine allocator isn't really needed. OTOH, he pointed me to bug 505849 which, same as the micro-benchmarks above, shows that if we have >1 thread, malloc/free get unconditionally slower. Thread-local magazine malloc/free may be worth it for no reason other than this.
Converting Gecko to be an ST embedding is a bug that could be filed. We would rely even more (we already rely *a lot*, given how asymmetric of an MT embedding Gecko is) on the ServerJS guys and others to keep our MT pretensions semi-real. /be
I have to apologize for confusing things by talking about changing MT embedding rules (where -DJS_THREADSAFE) to force single-threaded object creation by default. This does not mean, however, that JSRuntime will ever be an ST data structure in such embeddings -- as Luke and I discussed earlier today. /be
I suspect this bug isn't going anywhere, so I'll WONTFIX it. Feel free to resurrect it if you disagree. FWIW, jemalloc works on Mac now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX