Share atoms compartment/zone between multiple runtimes

RESOLVED FIXED in mozilla30

Status

()

defect
RESOLVED FIXED
6 years ago
5 months ago

People

(Reporter: bhackett, Assigned: bhackett)

Tracking

(Depends on 1 bug, Blocks 1 bug)

unspecified
mozilla30
x86_64
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:1.3T+, b2g-v1.3T fixed)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(6 attachments)

Per measurements in bug 941818, if we could eliminate the atoms zone in DOM worker runtimes it would save a fair amount of space, particularly in empty or near-empty workers but also in larger ones --- atomizing the same string in multiple workers, or workers plus the main runtime would not consume additional memory.  In combination with bug 964057, this would accomplish much the same memory reductions as bug 941818 (a much larger task).

Since atoms are marked and swept like other GC things, this is a good deal more difficult than bug 964057 (self hosted stuff can be lazily instantiated but afterwards is not destroyed).  Without considering incremental GCs, this can be done by associating a counter with each atom indicating the last GC in which it was marked, and having the GCs for each runtime atomically update the counter whenever marking the atom.  When doing full GCs the main runtime would sweep entries from the atoms table when their GC counter indicated that the atom was not marked in any of the last GCs made by each worker or the present full GC.

Considering incremental GCs makes this trickier.  Pre barriers in our VM are triggered based on whether the zone of the thing being overwritten needs a barrier.  In this case both the main runtime and worker runtimes will be testing whether the common atoms zone needs a barrier, which will be true whenever the main thread is doing an incremental full GC.  So workers could trigger the barrier tracer in the atoms zone, and this tracer needs to be threadsafe (it could just catch and ignore triggers from worker runtimes, since they aren't doing an incremental GC).  Workers can't do incremental GCs, but if they could (bug 941794) then this would get trickier and slower, as the atoms zone would need to have a barrier tracer whenever *any* runtime was doing an incremental GC.
Blocks: 921176
(In reply to Brian Hackett (:bhackett) (Vacation until early February) from comment #0)
> So workers could
> trigger the barrier tracer in the atoms zone, and this tracer needs to be
> threadsafe (it could just catch and ignore triggers from worker runtimes,
> since they aren't doing an incremental GC).  Workers can't do incremental
> GCs, but if they could (bug 941794) then this would get trickier and slower,
> as the atoms zone would need to have a barrier tracer whenever *any* runtime
> was doing an incremental GC.

Enabling incremental GCs in workers is very desirable, though. Shumway is moving to a setup where all (action)script execution happens in a worker, with rendering happening on the main thread. So to prevent long pauses, iGC would be required in workers. With WebGL coming to workers (bug 709490), I bet this requirement will become more and more common for other uses, too.
Well, there's (a) eat the performance/complexity cost of making atomsZone->needsBarrier() true if any runtime is doing an incremental GC of the atoms zone, (b) rework the incremental barriers to look at the zone associated with the source of the pointer (which is more intuitive to me, as the current barriers rely for correctness on the fact we only collect atoms during full GCs), or (c) make shared atoms zones a configuration or runtime option.

I like (b) the most, as that would also fix the above problem of triggering barriers from workers when the main thread is doing an incremental GC of the atoms zone, which also manifests as the existing problem of not being able to do off thread parses in this case.

Bill, thoughts?
Flags: needinfo?(wmccloskey)
Depends on: 964079
Whiteboard: [MemShrink]
If we do this, will that mean that jsid values will match across mainthread and workers, such that we can drop the atom cache complexity we added in bug 903772 and just go back to static ids (with reasonable locked or otherwis synchronized lazy init, of course)?
(In reply to Boris Zbarsky [:bz] from comment #3)
> If we do this, will that mean that jsid values will match across mainthread
> and workers, such that we can drop the atom cache complexity we added in bug
> 903772 and just go back to static ids (with reasonable locked or otherwis
> synchronized lazy init, of course)?

Yes.
I was interested in why there are so many atoms in an empty worker, so I dumped them out. I noticed that almost everything seems to be from self-hosted code. Most of it is from Intl.js and Array.js (which I think is for PJS?).

I'd like to think about this some more, since we've had a lot of issues with the atoms zone and threads. But one possible strategy would be to share only atoms that live "forever": static atoms, common atoms, and this self-hosted stuff. It seems like that would simplify a lot of the GC issues.
Flags: needinfo?(wmccloskey)
Thinking about this some more, I still feel it would be better for different runtimes to have different atoms zones. It doesn't seem too likely to me that workers will share a significant number of atoms in common with other workers or with the main runtime, since I kind of suspect b2g workers won't have many atoms at all (ignoring fixed atoms = static atoms, common atoms, and self-hosted atoms).

It seems like a strategy of sharing only fixed atoms between runtimes combined with doing bug 964079 would solve a lot of our problems. One problem I can foresee is that we might need one table of shared atoms and another table of runtime-local atoms. Having to check both tables during atomization would be a little slower. However, if we avoid computing the hash of the key twice, it might not be so bad.

We'll also have to figure out how we'll store these shared atoms. I guess we'd have a notion of a "main" runtime, which would own all the fixed atoms, and then the workers would just point to them? One issue is that it's possible to do atom->runtimeFromWhateverThread(). That wouldn't really work. We might just have to find all the callers that might operate on atoms and fix them somehow.

What do you think, Brian? I'm really leery of doing what you propose, but I guess I could be convinced. It just doesn't seem like the benefits justify the costs to me, based on what I know now.
Bill and I talked about this on IRC: it seems like we should first see what lazy parsing combined with not superfluously storing the source does for self-hosting memory usage. We both didn't know if syntax parsing atomizes. If it does, then this might not help enough.

In that case, it might actually be easier to not share atoms at all, but share the self-hosting compartment. If we can either make that fully read-only (hard) or make script execution in it be extremely rare and mainthread-only (easier, but not completely trivial), then we'd not need to take a lock when copying things over, so we could hopefully just access it simultaneously from multiple threads.
Just one clarification: even if we lazily parse the self-hosted stuff, I still think it would be nice to share static atoms. There's a fair number of them.
Atomization happens during tokenization so will happen during lazy parsing.  Bug 964079 could alleviate this but I've been working on that bug and don't think the approach is viable --- the parser is just too deeply entwined with the rest of the VM and I don't want to start stuffing parser-local atom pointers in traced heap structures.  There's a simpler way of reducing the cost of taking the lock (amortize it) and I'd like to pursue that but we'd still be atomizing off thread and run into all the issues we currently have.

It seems to me like sharing the fixed atoms between runtimes has many of the same complexities as sharing all atoms, we're just in a weirder state (two atom tables, and what is an atom anyways?  what is an atom which is pinned by self hosting state?) and potentially using more memory.  I don't know of any good way to test this but it seems to me like workers can at least potentially share atoms in their code (and elsewhere) with the main runtime or other workers, especially if they run overlapping code.  Anyways, I'd only like to do this bug if it keeps things relatively clean and fast.  Many of the parts already exist, like atomizing new strings off thread, the main new complexity is tracing the atoms on the main thread.
(In reply to Brian Hackett (:bhackett) (Vacation until early February) from comment #9)
> It seems to me like sharing the fixed atoms between runtimes has many of the
> same complexities as sharing all atoms

Well, not needing to worry about how to garbage collect them is a big complexity savings.

> we're just in a weirder state (two
> atom tables, and what is an atom anyways?  what is an atom which is pinned
> by self hosting state?) and potentially using more memory.

There wouldn't need to be a direct connection to self-hosting. We would just have a set of atoms that are shared between runtimes and always alive. If that sets happens to include most of the atoms needed by self-hosted code, then we would save memory. Anything missing would be duplicated across runtimes.

I agree that having two tables kinda sucks.

> I don't know of
> any good way to test this but it seems to me like workers can at least
> potentially share atoms in their code (and elsewhere) with the main runtime
> or other workers, especially if they run overlapping code.

I'm pretty sure it doesn't happen now; we assert against that in the GC during marking. I guess you're saying it would be possible once we share self-hosted stuff between runtimes? If so, it would be good to have a pinch points to control when and how the cloning happens so we can easily contain that kind of leakage.

> Anyways, I'd
> only like to do this bug if it keeps things relatively clean and fast.  Many
> of the parts already exist, like atomizing new strings off thread, the main
> new complexity is tracing the atoms on the main thread.

Do you mean the write barriers? That still seems tricky to me, but I don't fully understand what you're proposing. Keep in mind that besides pushing stuff onto a mark stack, the barriers also set the mark bit, which is a pain to do atomically.

Anyway, I don't mean to push too hard for just sharing the fixed stuff; it's not that great either. It sounds like you're still experimenting, so maybe it's best to wait and see what happens.
We talked about this a little at the DOM work week and had a few ideas to reduce the number of atoms in workers from self-hosted code.

First, it seems like maybe we could split self-hosted code into pieces. The most-used stuff would be in once piece and all the rest in another. Then only the most-used stuff would need to be loaded into workers, and we could avoid Intl.js atoms entirely.

Another idea was that perhaps we could change how we do tokenization during syntax-only parses to avoid atomizing much. We don't need to construct bindings or anything, so it sounds like we could skip most atomization. I don't completely understand the details.
(In reply to Bill McCloskey (:billm) from comment #11)
> First, it seems like maybe we could split self-hosted code into pieces. The
> most-used stuff would be in once piece and all the rest in another. Then
> only the most-used stuff would need to be loaded into workers, and we could
> avoid Intl.js atoms entirely.

We'd still want/need to have Intl.js available in workers, right? It'd just be lazily loaded?

This isn't trivial to implement, but I guess it could work. Theoretically, we could go even further by skipping the concatenation we currently do during build. That way, all the builtin/*.js files would be parsed upon first usage of one of the functions defined in them, respectively.

This might not be too hard to do, if we make sure we split functions by which builtin object they belong to. So if a self-hosted Array method is called, builtin/Array.js is evaluated. This split exists today, but there might be a few exceptions.

Additionally, we'd have to somehow encertain that there aren't any interdependencies, and all the js files are either self-contained, or only rely on Utilities.js, which would always be evaluated. One way to do this might be to introduce a way to check if any gnames are left unresolved after evaluation. Combined with a `make check` test iterating over the builtins and evaluating them individually, this could be quite robust.

Note that this is a non-trivial amount of work, and I don't know when I'd be able to work on it.
Blocks: 916021
No longer blocks: 941783
Whiteboard: [MemShrink] → [MemShrink:P2]
This patch follows the strategy that Bill outlined in comments 5 and 6.  Runtimes have a single atoms zone, but within some main runtime (pointed to by other runtimes via a new parentRuntime field) there is a new notion of permanent atoms, which includes some set of atoms fixed during the runtime's initialization and not modified later on.  Permanent atoms are not collected until the runtime is destroyed, so are similar to interned atoms but different in their operation --- a runtime doesn't get new permanent atoms after creation, and permanent atoms are not explicitly interned.  I think interning could be changed to use the same representation as permanent atoms (whether an atom is interned is represented using its entry in the atoms table, which isn't really suitable for tests during marking that need to be fast) though maybe that would be better done as followup.  In worker runtimes, the static strings, builtin names and table of permanent atoms all point to atoms stored in the main runtime, all of which are immutable.  When marking these from either the main or worker runtimes, permanent atoms are ignored.

I was working on a patch that used a single atoms zone for all runtimes, an approach which isn't that much more complicated than this patch --- instead of ignoring permanent atoms during marking, the worker uses another threadsafe mechanism for keeping track of the atoms it finds when marking.  The main problem with this is that keeping track of worker references on atoms consumes memory, in that patch one word per atom table entry, and this extra memory is used whether there are workers using the atoms or not.  In a browser session with a large atom table but few workers the extra memory consumed could be more than that saved by consolidating the atoms table with the workers.  The approach in this patch is purely beneficial for memory usage.

Below is about:memory info for an empty worker using this patch, for comparison see bug 941818 comment 1.

0.64 MB (00.64%) -- worker(script.js, 0x7f5721352000)
├──0.30 MB (00.30%) -- zone(0x7f5721356000)
│  ├──0.25 MB (00.25%) ++ compartment(web-worker)
│  ├──0.05 MB (00.05%) ── unused-gc-things
│  └──0.01 MB (00.01%) ── sundries/gc-heap
├──0.17 MB (00.17%) -- runtime
│  ├──0.10 MB (00.10%) ── script-data
│  ├──0.03 MB (00.03%) -- gc
│  │  ├──0.03 MB (00.03%) ── marker
│  │  ├──0.00 MB (00.00%) ── nursery
│  │  └──0.00 MB (00.00%) ++ store-buffer
│  ├──0.03 MB (00.03%) ── runtime-object
│  ├──0.00 MB (00.00%) ── atoms-table
│  ├──0.00 MB (00.00%) ── dtoa
│  ├──0.00 MB (00.00%) ── interpreter-stack
│  ├──0.00 MB (00.00%) ── temporary
│  ├──0.00 MB (00.00%) ── contexts
│  ├──0.00 MB (00.00%) ── script-sources
│  ├──0.00 MB (00.00%) ++ code
│  ├──0.00 MB (00.00%) ── math-cache
│  ├──0.00 MB (00.00%) ── regexp-data
│  └──0.00 MB (00.00%) ── source-data-cache
├──0.12 MB (00.12%) -- zone(0x7f572acce800)
│  ├──0.07 MB (00.07%) ++ compartment(web-worker)
│  ├──0.03 MB (00.03%) ── unused-gc-things
│  ├──0.01 MB (00.01%) ── type-pool
│  └──0.00 MB (00.00%) ── sundries/gc-heap
├──0.03 MB (00.03%) ++ gc-heap
└──0.01 MB (00.01%) -- zone(0x7f5721354800)
   ├──0.01 MB (00.01%) ++ sundries
   └──0.00 MB (00.00%) ++ compartment(web-worker-atoms)

The total memory usage for the worker has gone from 1.10 MB to .64 MB, a decrease of .46 MB.  Compared to the earlier numbers there is a .22 MB decrease in the GC marker size (???) and a .06 MB increase in the size of the self hosted compartment + script data.  The remaining .30 MB of decrease is due to this patch, which lines up pretty well with estimates in bug 941818 comment 1.
Attachment #8373074 - Flags: review?(wmccloskey)
Attachment #8373074 - Flags: review?(luke)
> Compared to the earlier numbers there is a .22 MB
> decrease in the GC marker size (???)

Bug 921224 reduced the GC marker size (on 64-bit builds) from 256 KiB to 32 KiB for runtimes that don't do incremental GC. It landed in November last year. So the measurement in bug 941818 comment 1 must have been done on a build that lacked that bug's patch, or in which the worker somehow had incremental GC enabled.
Patch which shares all atoms between runtimes, attaching for posterity.
Comment on attachment 8373074 [details] [diff] [review]
patch (af482ab5738c)

Review of attachment 8373074 [details] [diff] [review]:
-----------------------------------------------------------------

I think Bill's review is enough for this patch.  From a quick glance, though, it looks great.  Two small requests:

::: js/src/shell/js.cpp
@@ +2607,5 @@
>      return ok;
>  }
>  
> +static void
> +WorkerMain(void *arg)

Great that you added these functions for testing, but I don't see any unit tests.  Maybe you forgot to hg add them?  It'd be great to have some in jit-tests.

::: js/src/vm/String.h
@@ +215,3 @@
>      static const size_t ATOM_BIT              = JS_BIT(3);
>  
> +    static const size_t PERMANENT_ATOM_FLAGS  = JS_BIT(2) | JS_BIT(3);

Can you add this to the above string comment?
Attachment #8373074 - Flags: review?(luke) → review+
Comment on attachment 8373074 [details] [diff] [review]
patch (af482ab5738c)

Review of attachment 8373074 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! Can you ask bent to look over the worker part of this?

::: js/src/gc/Barrier.h
@@ +607,5 @@
>  
>  template <class T>
>  struct DefaultHasher< EncapsulatedPtr<T> > : EncapsulatedPtrHasher<T> { };
>  
> +bool StringIsPermanentAtom(JSString *str);

I think bool is supposed to go on a separate line.

::: js/src/gc/Marking.cpp
@@ +264,5 @@
> +        // Atoms do not refer to other GC things so don't need to go on the mark stack.
> +        // Additionally, PushMarkStack will ignore permanent atoms.
> +        atom->markIfUnmarked();
> +    } else {
> +        trc->callback(trc, (void **)&atom, JSTRACE_STRING);

Eventually tracers may attempt to change this value and we'll want to catch that. Please change to this:
void *addr = (void *)&atom;
trc->callback(trc, &addr, JSTRACE_STRING);
JS_ASSERT(addr == (void *)&atom);

::: js/src/gc/RootMarking.cpp
@@ +733,5 @@
>      if (!rt->isBeingDestroyed() && !trc->runtime->isHeapMinorCollecting()) {
>          if (!IS_GC_MARKING_TRACER(trc) || rt->atomsCompartment()->zone()->isCollecting()) {
> +            // Permanent atoms only need to be marked in the runtime
> +            // which owns them.
> +            if (!rt->parentRuntime)

I think it would be a little nicer if we always call MarkPermanentAtoms and if it returns early when rt->parentRuntime.

::: js/src/jsapi.cpp
@@ +638,5 @@
>  JS_FRIEND_API(bool) JS::isGCEnabled() { return true; }
>  #endif
>  
>  JS_PUBLIC_API(JSRuntime *)
> +JS_NewRuntime(JSRuntime *parentRuntime, uint32_t maxbytes, JSUseHelperThreads useHelperThreads)

Can you please move the parentRuntime param to the end and have it default to null?

::: js/src/shell/js.cpp
@@ +2665,5 @@
> +    JSString **pstring = cx->new_<JSString *>();
> +    if (!pstring)
> +        return false;
> +    *pstring = args[0].toString();
> +    if (!JS_AddStringRoot(cx, pstring))

I'm guessing we're going to have problems here because the root is never removed. Maybe we can remove the root when we PR_JoinThread or something?

::: js/src/vm/Runtime.h
@@ +679,5 @@
>       */
>      js::PerThreadData mainThread;
>  
>      /*
> +     * uIf non-null, another runtime guaranteed to outlive this one and whose

Extra 'u' at the beginning.

@@ +1620,3 @@
>  
> +    // Cached pointers to various permanent property names.
> +    JSAtomState *atomState;

I would love it if we could give this a more descriptive name. Can you please change this to something like "commonAtoms"?
Attachment #8373074 - Flags: review?(wmccloskey)
Attachment #8373074 - Flags: review?(luke)
Attachment #8373074 - Flags: review+
Comment on attachment 8373074 [details] [diff] [review]
patch (af482ab5738c)

Review of attachment 8373074 [details] [diff] [review]:
-----------------------------------------------------------------

Ben, can you look at the RuntimeService.cpp changes? We just want to make sure that every worker runtime gets the main runtime as an argument.
Attachment #8373074 - Flags: review?(luke) → review?(bent.mozilla)
Comment on attachment 8373074 [details] [diff] [review]
patch (af482ab5738c)

Review of attachment 8373074 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on RuntimeService.cpp changes, provided the following comments are addressed:

::: dom/workers/RuntimeService.cpp
@@ +842,5 @@
>  {
>  public:
>    // The heap size passed here doesn't matter, we will change it later in the
>    // call to JS_SetGCParameter inside CreateJSContextForWorker.
> +  WorkerJSRuntime(JSRuntime *parentRuntime, WorkerPrivate* aWorkerPrivate)

Nit: Here and several other places just make sure * goes on the left everywhere, and we prefix args with 'a'.

@@ +1532,5 @@
>      NS_WARNING("Could not set the thread's priority!");
>    }
>  
>    nsCOMPtr<nsIRunnable> runnable =
> +    new WorkerThreadPrimaryRunnable(aWorkerPrivate, thread, JS_GetRuntime(aCx));

I haven't looked at the rest of the patch here but just to be clear, this will be called on many different threads, and the parent runtime won't always be the main thread's runtime. Is it ok to pass another worker runtime here? Do the runtimes chain somehow?
Attachment #8373074 - Flags: review?(bent.mozilla) → review+
The runtimes do chain, though the parent runtime has to outlive the child runtime.  During testing this didn't seem to be the case when workers were spawning children, so I fixed this to use the worker's own parent runtime (the main runtime) when it spawns new workers.

https://hg.mozilla.org/integration/mozilla-inbound/rev/d2c4ae312b66
https://hg.mozilla.org/mozilla-central/rev/d2c4ae312b66
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
We'll definitely want this for 1.3T (Tarako), and quite possibly for 1.3 as well.
blocking-b2g: --- → 1.3T?
Depends on: 974739
:njn, it looks like this is quite a big change. do you feel the risk is acceptable for tarako? given we have until mid march to take major patches and another month or so to stablize? Thanks
blocking-b2g: 1.3T? → 1.3T+
Flags: needinfo?(n.nethercote)
> :njn, it looks like this is quite a big change. do you feel the risk is
> acceptable for tarako?

I think it's worth taking.

And if you're taking this one, you should take bug 964057 as well. But before doing so, it would be a good idea to take measurements on Tarako to make sure the memory savings for workers are as big as expected. (I expect they will be a bit smaller because bhackett's measurements have been on 64-bit platforms, but they should still be substantial. It won't hurt to check.)
Flags: needinfo?(n.nethercote)
Yes, we should definitely take measurements before we land this.
That needs to be rebased on 1.3t - Brian, can you do that?
Which branch is that?
(In reply to Andrew McCreight [:mccr8] from comment #29)
> probably this one: http://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/

Yep!
While rebasing on 1.3T I noticed that in the runup to landing this patch I accidentally disabled it and bug 964057 entirely (workers are never given a non-null parent runtime).  So we don't know at this point about the stability of the patch since it has never actually been active on trunk, though once it lands for real I don't expect many stability problems and it is at least easy to disable (just reverse this patch).
Attachment #8387666 - Flags: review?(wmccloskey)
Rebased patch on b2g28_v1_3t, including the above fix.
Attachment #8387666 - Flags: review?(wmccloskey) → review+
Brian, let me know if these memory reports match what you expect. Overall that looks like a win, so if risk is low I'm inclined to take that on 1.3t
(In reply to Fabrice Desré [:fabrice] from comment #34)
> Brian, let me know if these memory reports match what you expect. Overall
> that looks like a win, so if risk is low I'm inclined to take that on 1.3t

Yes, they do.  There is only a single worker but it shows memory improvements of about .15 MB between the size of the runtime object and the size of the atoms zone.  The size of the atoms table did not go down, which is strange; maybe the table has lots of unused entries.
Duplicate of this bug: 921176
You need to log in before you can comment on or make changes to this bug.