Don't use nsStringBuffer for dynamic atoms

RESOLVED FIXED in Firefox 62

Status

()

enhancement
RESOLVED FIXED
Last year
Last year

People

(Reporter: njn, Assigned: njn)

Tracking

(Blocks 1 bug)

Trunk
mozilla62
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox61 wontfix, firefox62 fixed)

Details

(Whiteboard: [overhead:500k])

Attachments

(2 attachments, 1 obsolete attachment)

Dynamic atoms point to the chars in an nsStringBuffer. This means that, in addition to the chars themselves, we use memory for:

- the pointer within the dynamic atom;
- mRefCount and mStorageSize within the nsStringBuffer.

On 64-bit that is 16 bytes of overhead. On 32-bit it is 12 bytes.

This overhead might be worth it if the nsStringBuffers were often shared. But they're not. I instrumented the atom table memory reporter and got these numbers with a bunch of tabs open:

- 1007 dynamic atoms with a shared buffer, taking up 70656 bytes
- 35040 dynamic atoms with an unshared buffer, taking up 2299392 bytes

(The "taking up" number here is the combination of the dynamic atoms and the nsStringBuffers.)

In other words, about 3% of the dynamic atoms have a shared nsStringBuffer. (This is at least partly because lots of atoms are created from UTF8 strings, in which case the nsStringBuffer is freshly allocated and certain to be unshared.)

That's a small enough number that we should consider not using nsStringBuffer. An alternative would be to make nsDynamicAtom a variable-sized struct with the chars tacked onto the end. That would halve the number of allocations, avoid the 16 (or 12) bytes of overhead, and also minimize slop.
I believe the only place where we would share the stringbuffer right now is when the dynamic atom is being returned to JS via DOMString::SetKnownLiveAtom.

In practice, the place where this will matter most is .id gets on elements.  And there it's a _performance_ optimization (not having to copy the string), not just a memory one.  At least in theory.  It might be worth changing DOMString::SetKnownLiveAtom to not share the stringbuffer for the dynamic atom case (which is what we would have to do anyway if we made this change) and do some performance measurement to see whether there might be possible impact there.
Whiteboard: [overhead:500k]
First I tried just making nsDynamicAtom own its chars outright, instead of using nsStringBuffer. It seemed like it should help a bit, because we're not storing the mRefCount and mStorageSize, but in practice it was no better and probably a little bit worse. Never fear, a better approach will be forthcoming.
This reduces memory usage because we only need one allocation instead of two
for the dynamic atom and its chars, and because we don't need to store a
refcount and a size. It precludes sharing of chars between dynamic atoms, but
we weren't benefiting much from that anyway.

This reduces per-process memory usage by up to several hundred KiB on my
Linux64 box.

One consequence of this change is that we need to allocate + copy in
DOMString::SetKnownLiveAtom(), which could make some things slower.
Comment on attachment 8986421 [details] [diff] [review]
Store nsDynamicAtom's chars after the end of the object

I did some measurements with cnn.com and Gmail:

BEFORE

> parent process
> 2,275,104 B (00.94%) -- atoms
> ├──2,002,720 B (00.83%) -- dynamic
e │  ├──1,808,672 B (00.75%) ── unshared-buffers
> │  └────194,048 B (00.08%) ── atom-objects
> └────272,384 B (00.11%) ── table

> cnn.com process
> 671,968 B (00.26%) -- atoms
> ├──431,328 B (00.17%) -- dynamic
> │  ├──291,104 B (00.11%) ── unshared-buffers
> │  └──140,224 B (00.05%) ── atom-objects
> └──240,640 B (00.09%) ── table

> gmail.com process
> 1,329,856 B (00.59%) -- atoms
> ├────813,760 B (00.36%) -- dynamic
> │    ├──444,736 B (00.20%) ── unshared-buffers
> │    └──369,024 B (00.16%) ── atom-objects
> └────516,096 B (00.23%) ── table

AFTER

> parent process
> 2,284,736 B (00.93%) -- atoms
> ├──2,010,304 B (00.82%) ── dynamic-atoms
> └────274,432 B (00.11%) ── table

> cnn.com process
> 511,888 B (00.19%) -- atoms
> ├──275,344 B (00.10%) ── dynamic-atoms
> └──236,544 B (00.09%) ── table

> gmail.com process
> 958,624 B (00.41%) -- atoms
> ├──514,048 B (00.22%) ── table
> └──444,576 B (00.19%) ── dynamic-atoms

Not much difference on the parent process, strangely, but a clear win on cnn.com (~180 KB) and Gmail (~370KB). The majority of this would be the slop avoidance by having a single allocation per dynamic atom instead of two allocations.
Attachment #8986421 - Flags: review?(nfroyd)
This reduces memory usage because we only need one allocation instead of two
for the dynamic atom and its chars, and because we don't need to store a
refcount and a size. It precludes sharing of chars between dynamic atoms, but
we weren't benefiting much from that anyway.

This reduces per-process memory usage by up to several hundred KiB on my
Linux64 box.

One consequence of this change is that we need to allocate + copy in
DOMString::SetKnownLiveAtom(), which could make some things slower.
Attachment #8986421 - Attachment is obsolete: true
Attachment #8986421 - Flags: review?(nfroyd)
Attachment #8986653 - Flags: review?(nfroyd)
Comment on attachment 8986653 [details] [diff] [review]
Store nsDynamicAtom's chars after the end of the object

Review of attachment 8986653 [details] [diff] [review]:
-----------------------------------------------------------------

This is a nice change; r=me.  I thought that you asked bz about a potential benchmark to see if this change had any effect; am I imagining things?  If there was a benchmark, did this change produce any interesting benchmark numbers?

::: dom/bindings/DOMString.h
@@ +224,5 @@
> +        // Dynamic atoms own their own chars, and never have 0 length because
> +        // nsGkAtoms::_empty is a static atom.
> +        MOZ_ASSERT(aAtom->GetLength() > 0);
> +        AsAString() = nsString(aAtom->AsDynamic()->String(),
> +                               aAtom->GetLength());

I think it'd be slightly better to:

AsAString().Assign(aAtom->AsDynamic()->String(), ...);

here.  In a perfect world, I think your version and the Assign version compile to the same code, but I don't think the compiler is quite that intelligent...
Attachment #8986653 - Flags: review?(nfroyd) → review+
> This is a nice change; r=me.  I thought that you asked bz about a potential
> benchmark to see if this change had any effect; am I imagining things?  If
> there was a benchmark, did this change produce any interesting benchmark
> numbers?

I wrote something like that in my original bzexport text, but then it got rejected because bz is on vacation and not accepting feedback requests. Perhaps you saw that somehow? Though I can't see any sign of it in the bug...

Anyway, I measured on Speedometer and there was no difference. I think it'll be ok.
Pushed by nnethercote@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b67973aeb2af
Store nsDynamicAtom's chars after the end of the object. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/b67973aeb2af
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Awesome work Nick!
Nick, did you run the performance test I suggested in comment 1?  Specifically, a "read the id" microbenchmark, with the [Pure] annotation removed from the "id" attribute in Element.webidl.

Apart from that, the DOMString changes look ok to me.  I don't have an obvious better idea to prevent the copy; it would need to happen one way or another once our DOMString got converted to a JSString.

I so wish we had a unified string representation.  :(
Flags: needinfo?(n.nethercote)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #11)
> Nick, did you run the performance test I suggested in comment 1? 
> Specifically, a "read the id" microbenchmark, with the [Pure] annotation
> removed from the "id" attribute in Element.webidl.

No. The only perf test I ran was Speedometer, where there was no difference.
Flags: needinfo?(n.nethercote)
Right, I saw the bit about Speedometer.  Speedometer is not the only performance thing we care about, though.

Do you plan to run such a test?  If not, please let me know and I will try to do it.
Flags: needinfo?(n.nethercote)
> If not, please let me know and I will try to do it.

That would be very helpful. You have a much clearer idea of what to measure than I do.
Flags: needinfo?(n.nethercote)
So I have done some measurements using this testcase:

  <script>
  var count = 20000000;
  var start = new Date();
  var el = document.createElement("span");
  el.id = "This is a test";
  for (var i = 0; i < count; ++i) {
    el.id;
  }
  var stop = new Date();
  document.write((stop - start)/count * 1e6);
  </script>

which reports ns/call for the id get.  For all tests I used m-c revision 6c29e4fc89c7 (right before this bug landed) with the [Pure] annotation on Element.id removed.  I used a "control" build which had no other changes and a "test" build also had this change in DOMString.h:

-        SetKnownLiveStringBuffer(
-          aAtom->AsDynamic()->GetStringBuffer(), aAtom->GetLength());
+        AsAString().Assign(aAtom->AsDynamic()->String(), aAtom->GetLength());

I tested several different values for the id: "foo", "This is a test", "aaa...aaa" repeated 15 times, "bbb...bbb" repeated 16 times, "ccc...ccc" repeated 63 times, "ddd...ddd" repeated 64 times, "eee...eee" repeated 100 times, "fff...fff" repeated 101 times.  On my (Mac) laptop, I get values like so:

  control, "foo": ~42
  control, "This is a test": ~50
  control, "aaa...aaa" (15x): ~52
  control, "bbb...bbb" (16x): ~23
  control, "ccc...ccc" (63x): ~23
  control, "ddd...ddd" (64x): ~23
  control, "eee...eee" (100x): ~23
  control, "fff...fff" (101x): ~23

  test, "foo": ~50
  test, "This is a test": ~67
  test, "aaa...aaa" (15x): ~68
  test, "bbb...bbb" (16x): ~107
  test, "ccc...ccc" (63x): ~250
  test, "ddd...ddd" (64x): ~136
  test, "eee...eee" (100x): ~160
  test, "fff...fff" (101x): ~380

The "control" results compared to each other surprised  me a bit.  That's happening because js::NewMaybeExternalString does the inline-latin1-string optimization before it does the external string thing (see bug 1038099).  We allow up to 15 chars in inline latin1 strings, so all three of "foo", "This is a test", and "aaaaaaaaaaaaaaa" end up doing an inline string plus copy here.  The difference in times between these cases is the extra work the copy and CanStoreCharsAsLatin1 have to do for the longer string, I expect.

For the strings longer than 15 chars, we do end up creating an external string in the "control" build and hence avoid the extra copying and scan bits.  Since we also hit the external string cache, and without even having to do the char scan in ExternalStringCache::lookup we avoid the GC pressure and allocation work too.  Hence the faster time for those case.

In the "test" build, the inline-string cases are a little slower than "control" because there's more work done in copying into the DOMString's internal buffer.  That's an autostring, so we don't even try the external string thing and just end up creating allocated inline strings.  So it's basically the same work as the "control" build except for the extra copy into the on-stack buffer, which is not too bad.

Once we're over 15 chars, the behaior is quite different, though.  In the "test" build we keep doing the "copy into stack buffer" thing up to 63 chars.  Those cases don't do an inline string, and don't do an external string (because we have no stringbuffer), so end up doing one copy to the stack in the bindings and then another copy to the heap in the JS engine when allocating the string.  And there's the string allocation, GCs, etc.

Once we hit 64 chars, we stop using the on-stack buffer in DOMString and just copy directly to a heap nsStringBuffer.  For the 64- and 100-char cases we end up hitting the external string cache (after doing the scan of the string, because we're not pointer-identical to what's in the cache) and hence times drop somewhat compared to the 63-char case: we're back to not doing a second copy, not doing a GCthing allocation, and not doing more GC.

At 101 chars we stop doing the linear scan in ExternalStringCache::lookup and hence never hit the external string cache.  Now we're always doing a copy to the heap in the DOM code and then always allocating a new external string.  There's still only one copy, but more GC pressure than there was in the cache-hit cases.  And the copy itself is more expensive, of course.

So the conclusion is that for short values this is a somewhat minor regression, but for values in the range [16,63] we're adding two copies and a bunch of GC pressure, for a significant slowdown.

It feels to me like the right solution here might be to have an external string type that hold on to an nsDynamicAtom so we can avoid all that and basically recover the performance we used to have...  Thoughts?
Flags: needinfo?(n.nethercote)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #15)
> It feels to me like the right solution here might be to have an external
> string type that hold on to an nsDynamicAtom so we can avoid all that and
> basically recover the performance we used to have...  Thoughts?

My initial, mostly unconsidered thought is that I am unexcited about adding more cases to the string code.
I sympathize, but I'm not sure I have any better ideas offhand, short of knowing what the length distribution of things coming through DOMString::SetKnownLiveAtom looks like....

I really wish we had a more-shared JS/C++ string representation.  But even that would not help here, because the whole point of this bug was to stop using the shared representation we _did_ have, for dynamic atoms.
So to be clear, I am not proposing we add more cases to nsAString, just to DOMString.
I filed bug 1473149 on the dynamic atom external string idea.
Flags: needinfo?(n.nethercote)
You need to log in before you can comment on or make changes to this bug.