Closed Bug 1065894 Opened 10 years ago Closed 4 years ago

Optimize native DataView or self host it.

Categories

(Core :: JavaScript Engine, enhancement)


Tracking


RESOLVED FIXED
mozilla78
Tracking Status
firefox78 --- fixed

People

(Reporter: mbx, Assigned: anba)

References

Details

(Keywords: perf)

Attachments

(9 files)

Attached file dataViewBenchmark.js
A simple JS implementation (see attached JS file) of DataViews is nearly 10x faster than the native implementation. Although I didn't take care of all the corner cases in the polyfill, it still feels that we ought to do better than 10x slower.

DataView All,                Iterations: 64, Elapsed: 9239.0000ms (144.3594ms / Iteration), Result: 0
DataViewPolyfill All,        Iterations: 64, Elapsed: 1183.0000ms (18.4844ms / Iteration), Result: 0
DataView.getInt8,            Iterations: 64, Elapsed: 1011.0000ms (15.7969ms / Iteration), Result: 0
DataViewPolyfill.getInt8,    Iterations: 64, Elapsed: 40.0000ms (0.6250ms / Iteration), Result: 0
DataView.getUint8,           Iterations: 64, Elapsed: 993.0000ms (15.5156ms / Iteration), Result: 0
DataViewPolyfill.getUint8,   Iterations: 64, Elapsed: 42.0000ms (0.6563ms / Iteration), Result: 0
DataView.getInt16,           Iterations: 64, Elapsed: 1076.0000ms (16.8125ms / Iteration), Result: 0
DataViewPolyfill.getInt16,   Iterations: 64, Elapsed: 60.0000ms (0.9375ms / Iteration), Result: 0
DataView.getUint16,          Iterations: 64, Elapsed: 1035.0000ms (16.1719ms / Iteration), Result: 0
DataViewPolyfill.getUint16,  Iterations: 64, Elapsed: 40.0000ms (0.6250ms / Iteration), Result: 0
DataView.getInt32,           Iterations: 64, Elapsed: 1042.0000ms (16.2813ms / Iteration), Result: 0
DataViewPolyfill.getInt32,   Iterations: 64, Elapsed: 107.0000ms (1.6719ms / Iteration), Result: 0
DataView.getUint32,          Iterations: 64, Elapsed: 1130.0000ms (17.6563ms / Iteration), Result: 0
DataViewPolyfill.getUint32,  Iterations: 64, Elapsed: 119.0000ms (1.8594ms / Iteration), Result: 0
DataView.getFloat32,         Iterations: 64, Elapsed: 1495.0000ms (23.3594ms / Iteration), Result: 0
DataViewPolyfill.getFloat32, Iterations: 64, Elapsed: 496.0000ms (7.7500ms / Iteration), Result: 0
DataView.getFloat64,         Iterations: 64, Elapsed: 1211.0000ms (18.9219ms / Iteration), Result: 0
DataViewPolyfill.getFloat64, Iterations: 64, Elapsed: 599.0000ms (9.3594ms / Iteration), Result: 0
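For context, here is a minimal sketch of the kind of timing loop such a benchmark typically uses; this is hypothetical and the attached file may well differ.

```javascript
// Hypothetical harness, not the attached file: time `iterations` calls of a
// workload and report per-iteration cost in the same shape as the numbers above.
function bench(name, fn, iterations = 64) {
  const start = Date.now();
  let result = 0;
  for (let i = 0; i < iterations; i++) {
    result |= fn(); // accumulate the result to defeat dead-code elimination
  }
  const elapsed = Date.now() - start;
  console.log(`${name}, Iterations: ${iterations}, ` +
              `Elapsed: ${elapsed}ms (${(elapsed / iterations).toFixed(4)}ms / Iteration), ` +
              `Result: ${result}`);
  return result;
}
```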
Blocks: shumway-m3
Brian, will DataViews be affected by your TypedArray work at all? If not, I can take a stab at self-hosting parts of them to speed things up.

FWIW, it seems like no engine[1] is really optimizing DataView performance all that much. V8 is about half as fast on all the tests, and JSC slightly slower.

[1]: Well, I don't know about Chakra
Flags: needinfo?(bhackett1024)
Note that DataView as currently implemented is almost designed to be slow. You get stubcalls from the JIT, you get the overhead of CallNonGenericMethod...
(In reply to Till Schneidereit [:till] from comment #1)
> Brian, will DataViews be affected by your TypedArray work at all? If not, I
> can take a stab at self-hosting parts of them to speed things up.
>
> FWIW, it seems like no engine[1] is really optimizing DataView performance
> all that much. V8 is about half as fast on all the tests, and JSC slightly
> slower.
>
> [1]: Well, I don't know about Chakra

No, the stuff I'm working on will only remove the next view slot in data views (bug 1061404), which will shrink the size of them but not change their behavior otherwise.

For at least some of these methods though, self hosting seems like a bad idea. E.g. the above file's implementation for getInt32:

  DataViewPolyfill.prototype.getInt32 = function(offset, littleEndian) {
    var u8 = this._u8;
    var a = u8[offset + 0];
    var b = u8[offset + 1];
    var c = u8[offset + 2];
    var d = u8[offset + 3];
    return littleEndian ?
      (d << 24) | (c << 16) | (b << 8) | a :
      (a << 24) | (b << 16) | (c << 8) | d;
  }

This is really inefficient compared to what we could do by properly handling getInt32 directly during Ion compilation. Most platforms support unaligned memory accesses, and a single 32-bit read will be much faster than what this stuff Ion compiles down to. If we self-host this method we'll just have to rip out that self-hosted version later when we want to make this truly fast.

If there's a need to make DataViews fast I don't think it would be very hard to do in Ion; it's just that, so far as I know, no one has asked for it before.
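Brian's point can be illustrated in plain JS. This is a conceptual sketch only — Ion would emit this as machine code, and real DataView offsets can be unaligned — but the shape is the same: one full-width load plus an optional byte swap, instead of four byte loads and shifts.

```javascript
// Detect the host's byte order (assumption: an aligned 4-byte read below).
const hostLittle = new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;

// Swap the four bytes of a 32-bit value (what x86's bswap computes).
function bswap32(x) {
  return (((x & 0xff) << 24) | ((x & 0xff00) << 8) |
          ((x >>> 8) & 0xff00) | (x >>> 24)) >>> 0;
}

// Conceptual fast getUint32: a single 32-bit load, then a swap only when the
// requested endianness differs from the host's.
function getUint32Fast(u32, index, littleEndian) {
  const raw = u32[index]; // one load instead of four
  return littleEndian === hostLittle ? raw : bswap32(raw);
}
```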
Flags: needinfo?(bhackett1024)
(In reply to Brian Hackett (:bhackett) from comment #3)
> No, the stuff I'm working on will only remove the next view slot in data
> views (bug 1061404), which will shrink the size of them but not change their
> behavior otherwise.

Ok.

> For at least some of these methods though, self hosting seems like a bad
> idea. E.g. the above file's implementation for getInt32:
>
>   DataViewPolyfill.prototype.getInt32 = function(offset, littleEndian) {
>     var u8 = this._u8;
>     var a = u8[offset + 0];
>     var b = u8[offset + 1];
>     var c = u8[offset + 2];
>     var d = u8[offset + 3];
>     return littleEndian ?
>       (d << 24) | (c << 16) | (b << 8) | a :
>       (a << 24) | (b << 16) | (c << 8) | d;
>   }
>
> This is really inefficient compared to what we could do by properly handling
> getInt32 directly during Ion compilation. Most platforms support unaligned
> memory accesses, and a single 32 bit read will be much faster than what this
> stuff Ion compiles down to. If we self host this method we'll just have to
> rip out that self hosted version later when we want to make this truly fast.

That makes sense, yes.

> If there's a need to make DataViews fast I don't think it would be very hard
> to do in Ion, it's just that so far as I know no one has asked for it before.

I think DataViews aren't really used all that much. We need to use it, or a custom equivalent, in Shumway to implement certain functionality, which is why we're looking into performance issues.

I'll look into inlining these, as that doesn't seem all that hard.
(In reply to Till Schneidereit [:till] from comment #4)
> I think DataViews aren't really used all that much.

They get used in other addons too. Optimizing them a bit would be welcome.
This optimization does not need to block Shumway's M3 milestone.
Blocks: shumway-1.0
No longer blocks: shumway-m3
Keywords: perf
Huh. I find myself searching Bugzilla to see if anyone has reported the fact that DataViews seem to be really slow in comparison to native-endian typed-arrays, even if manually swapping bytes to deal with endianness. Apparently I already read this... On the plus side, I should be able to optimise the thing I was trying to do that was too slow by implementing a faster version cribbed from the attachment. If properly inlining is not currently the plan, self-hosting in the meantime would be a very welcome improvement, as the performance is kinda awful.
OS: Mac OS X → All
Hardware: x86 → All
Also see bug 1246597. For reasons that are sometimes beyond me, people want to use DataView. Sometimes this is for reasons of confusion (they think a TypedArray copies the ArrayBuffer, and does not view it, maybe because the word "Int8Array" does not contain the word "View"), sometimes it just seems they feel it is significantly more convenient (parsing stream data).
I thought they were used primarily for unaligned reads/writes.
"Alignment" really only makes sense as an idea if you've done low-levelish programming on non-x86 systems, though. I get the sense that DataView is often seen as the default interface to ArrayBuffer, with TypedArrays being reserved for special cases. I can't really back that up with data, but given how flexible DataView is -- and how well it fits in with a dynamic-typing world view -- I can see the point.
DataView can read little or big endian; the type-specific arrays are native-endian only, and thus not usable for portable data that could be read on either. DataView is sadly the only way to not have JavaScript be hardware-dependent here, short of writing your own replacement or detecting endianness and reversing when needed. DataView cleanly deals with this botched part of JS typed array design; it's just really too slow to use.

Also, the type-specific arrays choke if the data isn't aligned, without any real indication as to the problem. That was a confusing one to encounter, as that's not actually properly specified or documented. DataView is really the only fully fleshed-out typed array; the others break in cases that nobody should have to worry about in a high-level scripting language.
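The detect-and-reverse workaround mentioned above can be sketched like this (helper names are hypothetical):

```javascript
// Is this host little endian? A typed-array view exposes native byte order.
const hostIsLittleEndian =
    new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;

// Portable little-endian read without DataView: assemble the value from
// individual bytes, so the host's byte order never matters.
function readUint32LE(u8, offset) {
  return (u8[offset] |
          (u8[offset + 1] << 8) |
          (u8[offset + 2] << 16) |
          (u8[offset + 3] << 24)) >>> 0;
}
```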
(In reply to Lars T Hansen [:lth] from comment #8) > Also see bug 1246597. > > For reasons that are sometimes beyond me, people want to use DataView. > Sometimes this is for reasons of confusion (they think a TypedArray copies > the ArrayBuffer, and does not view it, maybe because the word "Int8Array" > does not contain the word "View"), sometimes it just seems they feel it is > significantly more convenient (parsing stream data). let buffer = new ArrayBuffer(5); let view = new Float32Array(buffer, 1, 1); // Whoops! And while this will generally never happen in one's own code, it certainly does happen with some binary formats out there that weren't designed with byte alignment in mind. The only solution is DataView, whether native or not.
(In reply to spectralcanine from comment #12)
> (In reply to Lars T Hansen [:lth] from comment #8)
> > Also see bug 1246597.
> >
> > For reasons that are sometimes beyond me, people want to use DataView.
> > Sometimes this is for reasons of confusion (they think a TypedArray copies
> > the ArrayBuffer, and does not view it, maybe because the word "Int8Array"
> > does not contain the word "View"), sometimes it just seems they feel it is
> > significantly more convenient (parsing stream data).
>
> let buffer = new ArrayBuffer(5);
> let view = new Float32Array(buffer, 1, 1); // Whoops!
>
> And while this will generally never happen in one's own code, it certainly
> does happen with some binary formats out there that weren't designed with
> byte alignment in mind.
> The only solution is DataView, whether native or not.

Yes, I know and understand this use case. DataView is not the only solution, but of course it is by far the most convenient one, and very clearly legitimate and appropriate.

(Another solution is to copy the bytes individually into a separate ArrayBuffer using Uint8Array views and then read a value out of the second buffer using a Float32Array view. I'm not advocating that, though at present I would not be surprised if it were faster than DataView, especially if the auxiliary buffer and views on it could be reused.)

(Didn't mean to upset any users of DataView with my earlier remark.)
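A sketch of the workaround described above, under the assumption that the scratch buffer and its views are allocated once and reused (names are hypothetical):

```javascript
// One reusable, aligned scratch buffer with two views over the same bytes.
const scratch = new ArrayBuffer(8);
const scratchU8 = new Uint8Array(scratch);
const scratchF32 = new Float32Array(scratch);

// Read a float32 from an arbitrary (possibly unaligned) byte offset by
// copying the four bytes into the aligned scratch buffer first.
function readFloat32ViaScratch(srcU8, offset) {
  scratchU8.set(srcU8.subarray(offset, offset + 4));
  return scratchF32[0]; // native-endian interpretation of the copied bytes
}
```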
I did in fact do just what you described at some point, copying chunks into new arrays to get proper byte alignment. In the end, it was far slower than just using DataView (though to be upfront, I don't quite remember if I tested in Firefox too).

But disregarding performance for a second, it's just far more logical and intuitive to use DataView. It's a native object whose purpose is to quickly handle arbitrary binary data... except in reality :)
v8 has optimized their DataView performance and we should start to see it become more performance-sensitive on the web: https://v8project.blogspot.com/2018/09/dataview.html
Flags: needinfo?(jdemooij)
Two years later, and I still use weird in-house type casting functions to read/write [unaligned] binary data, because ever since this issue was reported (and in fact I started a similar one on Chromium's issue tracker), DataView has been an order of magnitude slower than my far more complex code. Thanks for looking at this again!
According to the comment log above, I've commented on this once every year since it was filed, 4 years ago, but I neglected to do so last year. Shucks.

Seriously, though: just throw together a self-hosting patch based on the already provided code and dump it in. This should be standard practice for absolutely everything whose performance it could improve, and not doing it for things that get an order of magnitude (at time of filing) speed up is just stupid. Worry about fancy, better-optimized versions later.

If Google doing it is what it takes for Mozilla to finally fix this, so be it. :sigh:

/end

+1
> and not doing it for things that get an order of magnitude (at time of
> filing) speed up

Do they? The proof of concept is not spec-compliant, as comment 0 notes. In the past, fixing that sort of problem has significantly reduced, and in some cases eliminated, the speedup.

That said, I agree that it's worth at least trying and seeing what the performance ends up looking like.
I looked into this a bit. We could probably inline DataView.prototype.get* with a single MIR instruction, and same for set*. The annoying part is the optional littleEndian argument. Who decided it's a good idea to have that default to big endian when almost all CPUs are little endian and typed arrays store their data as little endian? Sigh.
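For the record, the default in question: with the littleEndian argument omitted, DataView reads and writes big endian regardless of the host's byte order.

```javascript
const dvBE = new DataView(new ArrayBuffer(4));
dvBE.setUint32(0, 0x12345678);          // no littleEndian argument → big endian
const beBytes = new Uint8Array(dvBE.buffer);
// beBytes is [0x12, 0x34, 0x56, 0x78]: most significant byte first, so a
// little-endian reread sees the bytes reversed.
dvBE.getUint32(0, true);                // 0x78563412
```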
Surely that argument will be a constant in most cases and will just boil away in MCallOptimize?
(In reply to Lars T Hansen [:lth] from comment #20)
> Surely that argument will be a constant in most cases and will just boil
> away in MCallOptimize?

That's true, but if we want to inline both little endian and big endian we need the code to get/set both cases anyway, and ideally we'd also optimize the unknown-bool case...

Thinking a bit more about this, Ion could also optimize the known-little-endian case just like typed array get/set. When we have a detached buffer, the bounds check would always fail, so we would handle that case in Baseline.
Depends on: 1496378
Bug 1496378 landed, now in the JIT we should be able to rename s/TypedArray/ArrayBufferView/ in a number of places and reuse the code also for DataView objects. After discussion with Ted we decided to try to implement this in the CacheIR CallIC first (and maybe see if we can use that IC in Ion).
Flags: needinfo?(jdemooij)

A user reported[1] Firefox being 40x slower than Chrome on a DataView benchmark[2]. I collected a profile[3] while running DataView part of the test. Is this the right bug to use for addressing the performance of DataView?

  1. https://news.ycombinator.com/item?id=22943131
  2. https://jsperf.com/dataview-float-int
  3. https://perfht.ml/3cFtVY3

> Is this the right bug to use for addressing the performance of DataView?

Yes, it is.

Assignee: nobody → andrebargull
Status: NEW → ASSIGNED
Type: defect → enhancement

Following bug 1496378, update MIR nodes to mention ArrayBufferView instead of
TypedArray. Later patches in this stack will make use of them for DataView
objects.

x86/x64:

  • Implement the 2-byte swap instructions using rol, matching the output of
    Clang for the __builtin_bswap16 built-in.
  • 4-byte and 8-byte swaps are implemented through bswap.
  • The actual rol and bswap functions are derived from upstream.

arm32:

  • Implement 2-byte swap with sign extend using revsh.
  • Implement 2-byte swap with zero extend using rev16 followed by uxth.
  • 4-byte and 8-byte swaps are implemented through rev.

arm64:

  • Implement 2-byte swap using rev16 followed by either sxth or uxth.
  • 4-byte and 8-byte swaps are implemented through rev.
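The two 2-byte swap variants above (swap-and-sign-extend vs. swap-and-zero-extend) compute the following, sketched in JS as a model of the instruction semantics — not the actual assembler code:

```javascript
// Model of revsh (arm32) / rev16+sxth (arm64): swap the two low bytes, then
// sign-extend the 16-bit result to 32 bits.
function bswap16SignExtend(x) {
  return ((((x & 0xff) << 8) | ((x >> 8) & 0xff)) << 16) >> 16;
}

// Model of rev16+uxth: swap the two low bytes, then zero-extend.
function bswap16ZeroExtend(x) {
  return ((x & 0xff) << 8) | ((x >> 8) & 0xff);
}
```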

Depends on D74619

Change LInt64Definition to a proper class and add a BogusTemp method to it.

This will be used in the next patches.

Depends on D74621

The implementation of LoadDataViewElement follows LoadUnboxedScalar,
therefore most LoadDataViewElement functions were placed next to their
counterparts from LoadUnboxedScalar.

MCallOptimize:

  • The two single-byte getters getInt8 and getUint8 are directly inlined
    through LoadUnboxedScalar.
  • This lets us avoid duplicating the single-byte extra cases in Lowering.
  • And MLoadDataViewElement doesn't need to save the otherwise unused
    littleEndian parameter.

Lowering:

  • Optimises the case when littleEndian is a constant value, so we can avoid
    branching around the byte swap instructions in most cases.
  • As usual, x86 needs special-casing because there are too few registers
    available.

I didn't bother to port the following two LoadUnboxedScalar optimisations
to LoadDataViewElement:

  • AnalyzeLoadUnboxedScalar from EffectiveAddressAnalysis
  • Constant index optimisation in lowering/codegen

Both optimisations are currently unavailable due to Spectre index masking,
so it didn't seem worthwhile to spend time to support them for
LoadDataViewElement.

Depends on D74622

Similar to the previous part, inline the setter functions based on
the current StoreUnboxedScalar support.

Depends on D74623

Depends on D74625

Pushed by cbrindusan@mozilla.com:

https://hg.mozilla.org/integration/autoland/rev/420e72fb4f83
Part 1: Rename TypedArray mir nodes to mention ArrayBufferView. r=jandem
https://hg.mozilla.org/integration/autoland/rev/2b5bf053877a
Part 2: Add byte swap instructions to the assemblers. r=jandem
https://hg.mozilla.org/integration/autoland/rev/abb6d449fd6a
Part 3: Add unaligned load and store functions to the assemblers. r=jandem
https://hg.mozilla.org/integration/autoland/rev/0822e88eb009
Part 4: Add LInt64Definition::BogusTemp. r=jandem
https://hg.mozilla.org/integration/autoland/rev/d36328fd387d
Part 5: Inline DataView getter functions. r=jandem
https://hg.mozilla.org/integration/autoland/rev/5bd1d18fe1c2
Part 6: Inline DataView setter functions. r=jandem
https://hg.mozilla.org/integration/autoland/rev/e0d13e5b07a6
Part 7: Inline 'byteLength' and 'byteOffset' getters for DataViews. r=jandem
https://hg.mozilla.org/integration/autoland/rev/9ba7a0d69b96
Part 8: Add test cases. r=jandem

Thank you! I can definitely see the speed difference.

Unfortunately my typecasting code is still somehow faster (on Chrome as well), but if this had landed a couple of years ago I'd probably be using it :P

...and then I realized I was forgetting this will only be in 78, so I wasn't even testing it.
Currently on 76, reading with DataView seems to be about 4 times slower than Chrome, while writing is faster, not sure why.
Would be interesting to see how this optimization works.
Right now it is still faster on both browsers to explicitly call, for example, a function that takes 4 numbers, sets 4 indices of a Uint8Array, and returns the resulting number from a Float32Array that views the same buffer, all preallocated at script load time.
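For reference, the preallocated type-punning approach described above looks roughly like this (names are hypothetical; this reads in the host's native byte order):

```javascript
// One buffer allocated at script load, viewed as both bytes and a float.
const punBuf = new ArrayBuffer(4);
const punU8 = new Uint8Array(punBuf);
const punF32 = new Float32Array(punBuf);

// Take 4 byte values, store them, and reinterpret the result as a float32.
function bytesToFloat32(b0, b1, b2, b3) {
  punU8[0] = b0; punU8[1] = b1; punU8[2] = b2; punU8[3] = b3;
  return punF32[0];
}
```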

Any plans in backporting to the latest ESR?

(In reply to Marco Trevisan (Treviño) from comment #38)
> Any plans in backporting to the latest ESR?

That's unlikely to happen, since 78 ESR is close -- and also because this bug has been languishing for 6 years, not exactly making it a poster child for something that is backport-worthy.

Just updated this example [1] to DataView and tried it in Firefox 76, Firefox Nightly and Chrome 81. That worker now runs twice as fast as before, and performance is on par with Chrome.

Great update, thanks!

[1] https://github.com/potree/potree/blob/develop/src/workers/BinaryDecoderWorker.js

Regressions: 1642883
Blocks: 1642915
See Also: → 1653246
No longer blocks: 1642915
Depends on: 1642915
Blocks: 1642915
No longer depends on: 1642915