Closed Bug 1353292 Opened 7 years ago Closed 7 years ago

Transferring ArrayBuffers to/from a WebWorker has very poor performance for large ArrayBuffer sizes.

Categories

(Core :: JavaScript: GC, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Performance Impact: medium
People

(Reporter: QuestsOfHeart, Unassigned)

Details

Attachments

(2 files)

Attached file Test 1.zip
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:39.0) Gecko/20100101 Firefox/39.0
Build ID: 20150618135210

Steps to reproduce:

Transferred ArrayBuffers of various sizes back and forth between the main thread and Web Workers. I attached one of the test scripts/webpages I used to investigate the issue.


Actual results:

Performance for large ArrayBuffer sizes is very poor, both compared to smaller ArrayBuffers and compared to Google Chrome. For example, here are some results for transferring 4 ArrayBuffers of the given size back and forth between the main thread and the Web Workers 1000 times:
16 byte ArrayBuffer:        115 ms
64 byte ArrayBuffer:        106 ms
256 byte ArrayBuffer:       100 ms
1024 byte ArrayBuffer:      106 ms
4096 byte ArrayBuffer:      106 ms
16384 byte ArrayBuffer:     140 ms
65536 byte ArrayBuffer:     196 ms
262144 byte ArrayBuffer:    855 ms
1048576 byte ArrayBuffer:   353 ms
4194304 byte ArrayBuffer:   516 ms
16777216 byte ArrayBuffer:  19143 ms
67108864 byte ArrayBuffer:  94886 ms
268435456 byte ArrayBuffer: 286547 ms
536870912 byte ArrayBuffer: 293053 ms


Expected results:

For comparison, here are results for the same test in Google Chrome, showing the kind of performance one would expect from transferring ownership of an array of data (roughly constant time regardless of size):
16 byte ArrayBuffer:        105 ms
64 byte ArrayBuffer:        116 ms
256 byte ArrayBuffer:       120 ms
1024 byte ArrayBuffer:      106 ms
4096 byte ArrayBuffer:      109 ms
16384 byte ArrayBuffer:     116 ms
65536 byte ArrayBuffer:     105 ms
262144 byte ArrayBuffer:    102 ms
1048576 byte ArrayBuffer:   113 ms
4194304 byte ArrayBuffer:   109 ms
16777216 byte ArrayBuffer:  112 ms
67108864 byte ArrayBuffer:  115 ms
268435456 byte ArrayBuffer: 116 ms
536870912 byte ArrayBuffer: 120 ms
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
This seems bad indeed. Steve, needinfo'ing you because you worked on structured cloning, which this is probably related to.
Flags: needinfo?(sphink)
Firefox 39?
Reproduced in Nightly:

Nightly 55 2017-04-03:
Test #1 running...
16 byte ArrayBuffer:        373 ms
64 byte ArrayBuffer:        236 ms
256 byte ArrayBuffer:       250 ms
1024 byte ArrayBuffer:      249 ms
4096 byte ArrayBuffer:      252 ms
16384 byte ArrayBuffer:     245 ms
65536 byte ArrayBuffer:     258 ms
262144 byte ArrayBuffer:    276 ms
1048576 byte ArrayBuffer:   335 ms
4194304 byte ArrayBuffer:   485 ms
16777216 byte ArrayBuffer:  995 ms
67108864 byte ArrayBuffer:  3422 ms
268435456 byte ArrayBuffer: 6898 ms
536870912 byte ArrayBuffer: 7141 ms
DONE!

Chrome Canary (~current):
Test #1 running...
16 byte ArrayBuffer:        366 ms
64 byte ArrayBuffer:        229 ms
256 byte ArrayBuffer:       221 ms
1024 byte ArrayBuffer:      228 ms
4096 byte ArrayBuffer:      305 ms
16384 byte ArrayBuffer:     218 ms
65536 byte ArrayBuffer:     223 ms
262144 byte ArrayBuffer:    220 ms
1048576 byte ArrayBuffer:   226 ms
4194304 byte ArrayBuffer:   216 ms
16777216 byte ArrayBuffer:  219 ms
67108864 byte ArrayBuffer:  218 ms
268435456 byte ArrayBuffer: 668 ms
536870912 byte ArrayBuffer: 1809 ms
DONE!


(Chrome also gets worse, but later and less so.)
Version: 39 Branch → Trunk
Status: UNCONFIRMED → NEW
Ever confirmed: true
I realized I could have given some more information. I used the 64-bit versions of both Firefox and Chrome. I was also using an older version of Chrome, version 46.
Whiteboard: [qf]
Is this used enough to be a p1 at these sizes? If not, I will make it a p3.
I'd be surprised if this is not mostly a GC issue (the cost of mmap/munmap, searching for available memory, and forcing GC under memory pressure), but I get no love from the Gecko Profiler when trying to verify that (bug 1364347).
Eventually the GC kicks in and runs for two 10-second intervals (if I understand the profile correctly). This time dominates the latter part of the benchmark run and probably accounts for the slowdown.
Component: JavaScript Engine → JavaScript: GC
Discussion in qf triage today centred around the fact that we diverge from Chrome around 4 or 16 MB, and that's not terribly huge for, say, media files (which one could imagine being processed in a worker). Please remove p1 if you'd like to re-evaluate the decision.
Whiteboard: [qf] → [qf:p1]
If I'm looking at this right, the GC occurs because whenever creating an ArrayBuffer, even when transferring it, we increment the malloc counter:
http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/js/src/vm/ArrayBufferObject.cpp#1071

Bug 1037358 introduced this as a fix for a virtual memory leak. Removing it gives flat timing all the way up to 256MB (512MB runs out of memory on my 32-bit build).

This doesn't seem like something we can safely remove.
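To make the mechanism concrete, here is a simplified sketch of the kind of counter being described (invented names, not the actual SpiderMonkey code): every ArrayBuffer allocation, including one that merely adopts transferred contents, debits a remaining-bytes budget, and exhausting the budget schedules a GC.

// Illustrative sketch only -- invented names, not the real SpiderMonkey types.
// Allocation debits a "remaining bytes" budget; exhausting it schedules a GC,
// and the budget is only restored once that GC has run.
#include <cstddef>

struct MallocBudgetSketch {
    ptrdiff_t bytesRemaining;  // debited on every ArrayBuffer allocation
    ptrdiff_t maxBytes;        // budget restored after each GC

    explicit MallocBudgetSketch(ptrdiff_t max)
      : bytesRemaining(max), maxBytes(max) {}

    // Returns true when the caller should trigger a GC.
    bool noteAllocation(size_t nbytes) {
        bytesRemaining -= ptrdiff_t(nbytes);
        return bytesRemaining <= 0;
    }

    void resetAfterGC() { bytesRemaining = maxBytes; }
};

Whatever the budget is, once individual transfers approach it nearly every message ends up scheduling a GC, which would explain the cliff above 16 MB in the numbers in comment 0.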
(In reply to Adam Gashlin [:agashlin] from comment #9)
> If I'm looking at this right, the GC occurs because whenever creating an
> ArrayBuffer, even when transferring it, we increment the malloc counter:
> http://searchfox.org/mozilla-central/rev/
> 4b79f3b23aebb4080ea85e94351fd4046116a957/js/src/vm/ArrayBufferObject.cpp#1071
> 
> Bug 1037358 introduced this as a fix for a virtual memory leak. Removing it
> gives flat timing all the way up to 256MB (512MB runs out of memory on my
> 32-bit build).
> 
> This doesn't seem like something we can safely remove.

Let me know if I'm understanding this correctly. When we transfer a buffer to a Worker, we need to update the Worker's JS::Zone's accounting to inform it that it now owns N more bytes, correct? Is there any reason we shouldn't also update the zone that we're transferring from, to inform it that it no longer owns those N bytes?
I considered crediting the stolen bytes back to the zone they were stolen from, and implemented it; this works for this bug (see attached). There is an issue when the counter has been reset since it was decremented: we will credit it back anyway, leading to the counter going too high.

But I was worried that I didn't fully understand the implications for bug 1037358: I don't know whether we need to keep counting to avoid VM fragmentation somehow. It's possible that I've misunderstood the issue there, though.
>+    void credit(T* owner, size_t bytes) {
>+        bytes_ += bytes;
>+        if (bytes_ > maxBytes_) {
>+            bytes_ = maxBytes_;
>+        }
>+    }

Since bytes_ is an atomic to make it thread-safe, you'll probably want to do an atomic compare-exchange. I believe something like this?

void credit(T* owner, size_t bytes) {
    ptrdiff_t localBytes;
    ptrdiff_t newBytes;
    do {
        localBytes = bytes_;
        newBytes = localBytes + bytes;
        if (newBytes > maxBytes_) {
            newBytes = maxBytes_;
        }
    } while (!bytes_.compareExchange(localBytes, newBytes));
}
So far, we've been avoiding ever decrementing the counter. Probably partly because bytes_ is really an allocation rate trigger, not a heap usage amount trigger. As in, we don't track frees, only allocations. Once we start decrementing it here and there, it skews the scheduling. Which is not necessarily a bad thing, but at some point you'll need to start lowering trigger thresholds because the ratio of tracked allocations vs actual retained memory will shift.

But more to the point, I'm not sure how well we deal with allocation counters being nonmonotonic between collections. Decrementing them introduces the possibility of hitting the higher (nonincremental) threshold, then dropping back below it and perhaps hitting the lower (incremental) threshold, whereas right now we can assume that we haven't hit the higher threshold yet. Or, relatedly, we have separate runtime-wide and per-zone triggers, and currently we do things like checking for a runtime-wide trigger first, and only if that doesn't hit, bumping the per-zone counter. (After looking at the code, it appears that the low/high thresholds only apply to GC bytes, not the malloc bytes we're talking about here. So if I am reading correctly, only the runtime vs zone thing matters here, which seems safer to mess with.)

I don't know that it would cause a problem, but I'd want to look at it a bit more closely. Jon, you had concerns with me doing something similar for accounting for nursery strings in bug 903519; do you want to weigh in here? It may be that it's the right thing to use for a spot fix here, but not in general. (I'm fine with reviewing this fix for real, but I'd like to know your opinion first.)

And is there any reason to not rename bytes_ to bytesRemaining_? The whole += vs -= thing never ceases to confuse me.

As for the patch, you only need to pass in T* owner if you have a chance of triggering onTooMuchMalloc(), which is not the case here.
Flags: needinfo?(sphink) → needinfo?(jcoppeard)
(In reply to Steve Fink [:sfink] [:s:] (PTO Aug 16-23) from comment #13)
You are correct in that there is currently only one threshold for malloc bytes, but there is a bug open to add a second.

I'm happy to take a spot fix for this if it makes this problem go away, but we need to make fixing our malloc accounting a priority.
Flags: needinfo?(jcoppeard)
(In reply to Doug Thayer [:dthayer] from comment #12)
> >+    void credit(T* owner, size_t bytes) {
> >+        bytes_ += bytes;
> >+        if (bytes_ > maxBytes_) {
> >+            bytes_ = maxBytes_;
> >+        }
> >+    }

If |newBytes > maxBytes_| then the counter must have been reset, so we should probably ignore the credit in this case.
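A minimal sketch of what that might look like, building on the loop from comment 12. The names and the std::atomic stand-in are assumptions for illustration (the thread above says bytes_ is an atomic with compareExchange(); std::atomic is used here only to keep the sketch self-contained), and T* owner is dropped per comment 13 since onTooMuchMalloc() can't trigger here:

// Sketch only, not the actual SpiderMonkey class: credit() drops the credit
// entirely when it would push the counter past maxBytes_, since that means
// the counter was reset after the original debit.
#include <atomic>
#include <cstddef>

struct MallocCounterSketch {
    std::atomic<ptrdiff_t> bytes_;  // bytes remaining before a GC is triggered
    ptrdiff_t maxBytes_;            // value the counter is reset to

    explicit MallocCounterSketch(ptrdiff_t max) : bytes_(max), maxBytes_(max) {}

    void credit(size_t bytes) {
        ptrdiff_t localBytes = bytes_.load();
        ptrdiff_t newBytes;
        do {
            newBytes = localBytes + ptrdiff_t(bytes);
            if (newBytes > maxBytes_) {
                return;  // counter was reset in the meantime; ignore the credit
            }
        } while (!bytes_.compare_exchange_weak(localBytes, newBytes));
    }
};

The clamp-vs-ignore distinction only matters in the race where the counter is reset between the debit and the credit; ignoring the credit keeps the counter from ending up above the limit.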
I have a workspace hanging around where I've added a second per-zone counter (I think for gcBytes, not malloc bytes) that is strictly monotonic (until the GC cycle has definitely ended). Maybe we should do this kind of thing more widely. I also need to fix up the code in my workspace and get it committed.
Perhaps Bug 1384049 is related?
We won't be able to address this bug in time for 57. I am changing it from qf:p1 to qf:p2 for post-57 work.
Whiteboard: [qf:p1] → [qf:p2]
Thanks for an easy-to-run test.

Current results in latest nightly:

Test #1 running...
16 byte ArrayBuffer:        205 ms
64 byte ArrayBuffer:        116 ms
256 byte ArrayBuffer:       117 ms
1024 byte ArrayBuffer:      170 ms
4096 byte ArrayBuffer:      122 ms
16384 byte ArrayBuffer:     126 ms
65536 byte ArrayBuffer:     136 ms
262144 byte ArrayBuffer:    153 ms
1048576 byte ArrayBuffer:   198 ms
4194304 byte ArrayBuffer:   189 ms
16777216 byte ArrayBuffer:  179 ms
67108864 byte ArrayBuffer:  213 ms
268435456 byte ArrayBuffer: 221 ms
536870912 byte ArrayBuffer: 172 ms
DONE!

So I think this is fixed now.  Making the malloc memory limit dynamic in bug 1405274 is probably responsible.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Performance Impact: --- → P2
Whiteboard: [qf:p2]