Closed
Bug 1353292
Opened 7 years ago
Closed 7 years ago
Transferring ArrayBuffers to/from a WebWorker has very poor performance for large ArrayBuffer sizes.
Categories
(Core :: JavaScript: GC, defect)
Core
JavaScript: GC
Tracking
Status: RESOLVED FIXED
Performance Impact: medium
People
(Reporter: QuestsOfHeart, Unassigned)
Attachments
(2 files)
- 3.62 KB, application/zip
- 3.16 KB, patch
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:39.0) Gecko/20100101 Firefox/39.0
Build ID: 20150618135210

Steps to reproduce:

Transferred ArrayBuffers of various sizes back and forth between the main thread and Web Workers. I attached one of the test scripts/webpages I used to investigate the issue.

Actual results:

Performance for large ArrayBuffer sizes is very poor compared to smaller ArrayBuffers and to Google Chrome. For example, here are some results for transferring 4 ArrayBuffers of the given size back and forth between the main thread and the Web Workers 1000 times:

16 byte ArrayBuffer: 115 ms
64 byte ArrayBuffer: 106 ms
256 byte ArrayBuffer: 100 ms
1024 byte ArrayBuffer: 106 ms
4096 byte ArrayBuffer: 106 ms
16384 byte ArrayBuffer: 140 ms
65536 byte ArrayBuffer: 196 ms
262144 byte ArrayBuffer: 855 ms
1048576 byte ArrayBuffer: 353 ms
4194304 byte ArrayBuffer: 516 ms
16777216 byte ArrayBuffer: 19143 ms
67108864 byte ArrayBuffer: 94886 ms
268435456 byte ArrayBuffer: 286547 ms
536870912 byte ArrayBuffer: 293053 ms

Expected results:

Here are results for the same test in Google Chrome, showing the kind of performance one would expect from transferring ownership of an array of data (roughly constant time regardless of size):

16 byte ArrayBuffer: 105 ms
64 byte ArrayBuffer: 116 ms
256 byte ArrayBuffer: 120 ms
1024 byte ArrayBuffer: 106 ms
4096 byte ArrayBuffer: 109 ms
16384 byte ArrayBuffer: 116 ms
65536 byte ArrayBuffer: 105 ms
262144 byte ArrayBuffer: 102 ms
1048576 byte ArrayBuffer: 113 ms
4194304 byte ArrayBuffer: 109 ms
16777216 byte ArrayBuffer: 112 ms
67108864 byte ArrayBuffer: 115 ms
268435456 byte ArrayBuffer: 116 ms
536870912 byte ArrayBuffer: 120 ms
Updated•7 years ago
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Comment 1•7 years ago
This seems bad indeed. Steve, needinfo'ing you because you worked on structured cloning, which this is probably related to.
Flags: needinfo?(sphink)
Comment 2•7 years ago
Firefox 39?
Comment 3•7 years ago
Reproduced in Nightly.

Nightly 55 2017-04-03:
Test #1 running...
16 byte ArrayBuffer: 373 ms
64 byte ArrayBuffer: 236 ms
256 byte ArrayBuffer: 250 ms
1024 byte ArrayBuffer: 249 ms
4096 byte ArrayBuffer: 252 ms
16384 byte ArrayBuffer: 245 ms
65536 byte ArrayBuffer: 258 ms
262144 byte ArrayBuffer: 276 ms
1048576 byte ArrayBuffer: 335 ms
4194304 byte ArrayBuffer: 485 ms
16777216 byte ArrayBuffer: 995 ms
67108864 byte ArrayBuffer: 3422 ms
268435456 byte ArrayBuffer: 6898 ms
536870912 byte ArrayBuffer: 7141 ms
DONE!

Chrome Canary (~current):
Test #1 running...
16 byte ArrayBuffer: 366 ms
64 byte ArrayBuffer: 229 ms
256 byte ArrayBuffer: 221 ms
1024 byte ArrayBuffer: 228 ms
4096 byte ArrayBuffer: 305 ms
16384 byte ArrayBuffer: 218 ms
65536 byte ArrayBuffer: 223 ms
262144 byte ArrayBuffer: 220 ms
1048576 byte ArrayBuffer: 226 ms
4194304 byte ArrayBuffer: 216 ms
16777216 byte ArrayBuffer: 219 ms
67108864 byte ArrayBuffer: 218 ms
268435456 byte ArrayBuffer: 668 ms
536870912 byte ArrayBuffer: 1809 ms
DONE!

(Chrome also gets worse, but later and less so.)
Version: 39 Branch → Trunk
Updated•7 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter
Comment 4•7 years ago
I realized I could have given some more information. I used the 64 bit versions of both Firefox and Chrome. I was also using an older version of Chrome, version 46.
Updated•7 years ago
Whiteboard: [qf]
Comment 5•7 years ago
Is this used enough at these sizes to be a P1? If not, I will make it a P3.
Comment 6•7 years ago
I'd be surprised if this is not mostly a GC issue (cost of mmap/munmap/searching for available memory/forcing GC under memory pressure) but I get no love from the Gecko Profiler when trying to verify that (bug 1364347).
Comment 7•7 years ago
Eventually the GC kicks in and runs for two 10-second intervals (if I understand the profile correctly). This time dominates the latter part of the benchmark run and probably accounts for the slowdown.
Component: JavaScript Engine → JavaScript: GC
Comment 8•7 years ago
Discussion in qf triage today centred around the fact that we diverge from Chrome around 4 or 16 MB and that's not terribly huge for, say, media files (which one could imagine are being processed in a worker). Please remove p1 if you'd like to re-evaluate the decision.
Whiteboard: [qf] → [qf:p1]
Comment 9•7 years ago
If I'm looking at this right, the GC occurs because whenever we create an ArrayBuffer, even when transferring it, we increment the malloc counter:
http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/js/src/vm/ArrayBufferObject.cpp#1071

Bug 1037358 introduced this as a fix for a virtual memory leak. Removing it gives flat timing all the way up to 256MB (512MB runs out of memory on my 32-bit build).

This doesn't seem like something we can safely remove.
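The accounting described above can be modeled in a few lines. This is an illustrative sketch, not SpiderMonkey code; makeMallocCounter is a hypothetical name, and onTooMuchMalloc borrows the callback name mentioned later in this bug.

```javascript
// Model: every ArrayBuffer allocation — including the one materialized
// on the receiving side of a transfer — counts against a malloc budget,
// and exhausting the budget triggers a (here simulated) GC.
function makeMallocCounter(maxBytes, onTooMuchMalloc) {
  let remaining = maxBytes;
  return {
    debit(bytes) {
      remaining -= bytes;
      if (remaining <= 0) {
        onTooMuchMalloc();     // in the engine this schedules a GC
        remaining = maxBytes;  // the counter resets after triggering
      }
    },
    remaining: () => remaining,
  };
}
```

With a fixed budget and no credit on transfer, moving a large buffer repeatedly forces a GC roughly every (budget / size) transfers — every transfer or two once buffers reach tens of megabytes, which matches the cliff in the measurements.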
Comment 10•7 years ago
(In reply to Adam Gashlin [:agashlin] from comment #9)
> If I'm looking at this right, the GC occurs because whenever creating an
> ArrayBuffer, even when transferring it, we increment the malloc counter:
> http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/js/src/vm/ArrayBufferObject.cpp#1071
>
> Bug 1037358 introduced this as a fix for a virtual memory leak. Removing it
> gives flat timing all the way up to 256MB (512MB runs out of memory on my
> 32-bit build).
>
> This doesn't seem like something we can safely remove.

Let me know if I'm understanding this correctly. When we transfer a buffer to a Worker, we need to update the Worker's JS::Zone's accounting to inform it that it now owns N more bytes, correct? Is there any reason we shouldn't also update the zone that we're transferring from to inform it that it no longer owns those N bytes?
Comment 11•7 years ago
I considered crediting the stolen bytes back to the zone that they were stolen from, and implemented it, which works for this bug (see attached). There is an issue when the counter has been reset since it was decremented: we will credit it back anyway, leading to the counter going too high. But I was worried that I didn't fully understand the implications for bug 1037358; I don't know if we need to keep counting to avoid VM fragmentation somehow. It's possible that I've misunderstood the issue there, though.
Comment 12•7 years ago
>+ void credit(T* owner, size_t bytes) {
>+ bytes_ += bytes;
>+ if (bytes > maxBytes_) {
>+ bytes_ = maxBytes_;
>+ }
>+ }
Since bytes_ is atomic to make it thread-safe, you'll probably want to do an atomic compare-exchange. Something like this, I believe:
void credit(T* owner, size_t bytes) {
    ptrdiff_t localBytes;
    ptrdiff_t newBytes;
    do {
        localBytes = bytes_;
        newBytes = localBytes + bytes;
        if (newBytes > maxBytes_) {
            newBytes = maxBytes_;
        }
    } while (!bytes_.compareExchange(localBytes, newBytes));
}
Comment 13•7 years ago
So far, we've been avoiding ever decrementing the counter. Probably partly because bytes_ is really an allocation rate trigger, not a heap usage amount trigger. As in, we don't track frees, only allocations. Once we start decrementing it here and there, it skews the scheduling. Which is not necessarily a bad thing, but at some point you'll need to start lowering trigger thresholds because the ratio of tracked allocations vs actual retained memory will shift.

But more to the point, I'm not sure how well we deal with allocation counters being nonmonotonic between collections. Decrementing them introduces the possibility of hitting the higher (nonincremental) threshold, then dropping back below and perhaps hitting the lower (incremental) threshold, whereas right now we can assume that you haven't hit the higher threshold yet. Or, relatedly, we have separate runtime-wide and per-zone triggers, and currently we do things like checking for a runtime-wide trigger first, and only if that doesn't hit, bumping the per-zone counter.

(After looking at the code, it appears that the low/high thresholds only apply to GC bytes, not the malloc bytes we're talking about here. So if I am reading correctly, only the runtime vs zone thing matters here, which seems safer to mess with.) I don't know that it would cause a problem, but I'd want to look at it a bit more closely.

Jon, you had concerns with me doing something similar for accounting for nursery strings in bug 903519; do you want to weigh in here? It may be that this is the right thing to use for a spot fix here, but not in general. (I'm fine with reviewing this fix for real, but I'd like to know your opinion first.)

And is there any reason not to rename bytes_ to bytesRemaining_? The whole += vs -= thing never ceases to confuse me.

As for the patch, you only need to pass in T* owner if you have a chance of triggering onTooMuchMalloc(), which is not the case here.
Flags: needinfo?(sphink) → needinfo?(jcoppeard)
Comment 14•7 years ago
(In reply to Steve Fink [:sfink] [:s:] (PTO Aug 16-23) from comment #13)

You are correct that there is currently only one threshold for malloc bytes, but there is a bug open to add a second. I'm happy to take a spot fix for this if it makes the problem go away, but we need to make fixing our malloc accounting a priority.
Flags: needinfo?(jcoppeard)
Comment 15•7 years ago
(In reply to Doug Thayer [:dthayer] from comment #12)
> >+ void credit(T* owner, size_t bytes) {
> >+ bytes_ += bytes;
> >+ if (bytes > maxBytes_) {
> >+ bytes_ = maxBytes_;
> >+ }
> >+ }

If |newBytes > maxBytes_| then the counter must have been reset, so we should probably ignore the credit in this case.
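Comments 12 and 15 together suggest a credit that retries on contention and drops the credit when the counter has been reset. A sketch of that combined logic, expressed with JS Atomics on a shared Int32Array purely for illustration — the real counter is C++, and MAX_BYTES and credit are illustrative names:

```javascript
// Shared counter standing in for bytes_; maxBytes_ is modeled as a cap.
const MAX_BYTES = 1 << 20;
const counter = new Int32Array(new SharedArrayBuffer(4));

function credit(bytes) {
  for (;;) {
    const local = Atomics.load(counter, 0);
    const next = local + bytes;
    if (next > MAX_BYTES) {
      // Exceeding the cap means the counter was reset since the debit,
      // so drop the credit instead of clamping (comment 15).
      return false;
    }
    // Retry if another thread changed the counter between load and CAS.
    if (Atomics.compareExchange(counter, 0, local, next) === local) {
      return true;
    }
  }
}
```

The compare-exchange loop guarantees the add-and-cap check is applied to a consistent snapshot of the counter even with concurrent updates.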
Comment 16•7 years ago
I have a workspace hanging around where I've added a second per-zone counter (I think for gcBytes, not malloc bytes) that is strictly monotonic (until the GC cycle is definitely ended). Maybe we should do this kind of thing more widely. Also, I need to fix up the code in my workspace and get it committed.
Comment 17•7 years ago
Perhaps Bug 1384049 is related?
Comment 18•7 years ago
We won't be able to address this bug in time for 57. I am changing it from qf:p1 to qf:p2 for post-57 work.
Whiteboard: [qf:p1] → [qf:p2]
Comment 19•7 years ago
Thanks for an easy-to-run test. Current results in the latest Nightly:

Test #1 running...
16 byte ArrayBuffer: 205 ms
64 byte ArrayBuffer: 116 ms
256 byte ArrayBuffer: 117 ms
1024 byte ArrayBuffer: 170 ms
4096 byte ArrayBuffer: 122 ms
16384 byte ArrayBuffer: 126 ms
65536 byte ArrayBuffer: 136 ms
262144 byte ArrayBuffer: 153 ms
1048576 byte ArrayBuffer: 198 ms
4194304 byte ArrayBuffer: 189 ms
16777216 byte ArrayBuffer: 179 ms
67108864 byte ArrayBuffer: 213 ms
268435456 byte ArrayBuffer: 221 ms
536870912 byte ArrayBuffer: 172 ms
DONE!

So I think this is fixed now. Making the malloc memory limit dynamic in bug 1405274 is probably responsible.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•2 years ago
Performance Impact: --- → P2
Whiteboard: [qf:p2]