Open Bug 1027624 Opened 9 years ago Updated 4 months ago
Float denormal issue in JavaScript processor node in Web Audio API
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/6.1.4 Safari/537.76.4

Steps to reproduce:

We successfully compiled our C++ audio processing code to asm.js with Emscripten in order to deploy it on the web, running the resulting asm.js code in a ScriptProcessorNode of the Web Audio API.

Our C++ code uses the following protection against denormalized floats ("protection" is needed because computation on denormalized floats is awfully slow and has to be avoided):

#ifdef __SSE__
    #include <xmmintrin.h>
    #ifdef __SSE2__
        #define AVOIDDENORMALS _mm_setcsr(_mm_getcsr() | 0x8040)
    #else
        #define AVOIDDENORMALS _mm_setcsr(_mm_getcsr() | 0x8000)
    #endif
#else
    #define AVOIDDENORMALS
#endif

Basically, we call AVOIDDENORMALS before processing each audio block. It seems that AVOIDDENORMALS is simply removed by the Emscripten compiler, so the generated asm.js code appears to produce denormalized floats and the speed issue occurs.

The attached "piano.html" page contains a piano physical model that is compiled to asm.js and run as a Web Audio API ScriptProcessorNode. If you hit the "gate" button, a sound is played. After some seconds, the CPU use (as seen in Activity Monitor on OS X) rises to 100%.

Actual results:

The ScriptProcessorNode takes a lot of time to execute because the float denormal issue occurs.

Expected results:

A ScriptProcessorNode should probably be processed in a context where denormals are flushed to zero automatically.
Component: Untriaged → Web Audio
Product: Firefox → Core
Component: Web Audio → Projects
Product: Core → Audio/Visual Infrastructure
Version: 33 Branch → unspecified
Component: Projects → Web Audio
Product: Audio/Visual Infrastructure → Core
Version: unspecified → Trunk
Severity: major → normal
OS: Mac OS X → All
Hardware: x86_64 → x86
Is there a JavaScript API for that, one that could be called once before each audio block computation to put the processor in "denormals to zero" mode? (I could not find one…)
No, just iterate on the buffer, detect denormals and replace them by zeros. This is just normal IEEE754 stuff, it does not depend on the language.
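The suggestion above (iterate over the buffer, detect denormals, replace them with zeros) could be sketched as follows. This is only an illustration: `flushDenormals` and `FLT_MIN_NORMAL` are hypothetical names, and the sketch assumes single-precision samples stored in a `Float32Array`.

```javascript
// Smallest positive normal float32 value (2^-126). Any non-zero value with a
// smaller magnitude is a denormal once stored in a Float32Array.
const FLT_MIN_NORMAL = Math.pow(2, -126);

// Hypothetical helper: replaces denormal samples with 0 in place and
// returns how many samples were flushed.
function flushDenormals(buffer) {
  let flushed = 0;
  for (let i = 0; i < buffer.length; i++) {
    const v = buffer[i];
    if (v !== 0 && Math.abs(v) < FLT_MIN_NORMAL) {
      buffer[i] = 0;
      flushed++;
    }
  }
  return flushed;
}

// Example block containing two denormal samples (1e-40 and -1e-42).
const block = new Float32Array([0.5, 1e-40, -1e-42, 0, -0.25]);
const count = flushDenormals(block);
```

Note that a pass like this only cleans the samples it is given; denormals held in internal state (for example a filter's feedback value) are not touched by it.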
To elaborate on that: flushing denormals to zero is quite a common practice in audio code, because in practice it is what we want. So we need a cheap way to do that, especially if processors allow it. See:

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-flush-denormals-confidence/

"For instance, in audio processing applications, denormal values usually represent a signal so quiet that it is out of the human hearing range. Because of this, a common measure to avoid denormals on processors where there would be a performance penalty is to cut the signal to zero once it reaches denormal levels or mix in an extremely quiet noise signal."

http://www.juce.com/forum/topic/resolving-denormal-floats-once-and-all
https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz/
http://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/

"Signal processing, especially audio processing, is another area where recursive iterations usually lead to denormals. Disabling them is generally the best practical solution too, because it would be extremely expensive to check the signal and disable the feedback loops at every computation stage. Moreover, it’s usually a null signal that triggers the denormal slowness (think of this simple iteration: y[n] = x[n-1] * 0.1 + y[n-1] * 0.9), so curing the processing at one stage could worsen it at a subsequent stage! But if you can’t disable denormals, you can still inject noise at some points to ensure that the signal stays away from the denormal range…"
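To make the quoted iteration y[n] = x[n-1] * 0.1 + y[n-1] * 0.9 concrete, here is a minimal sketch that feeds it one impulse followed by silence. Single precision is emulated with `Math.fround`. The names (`runFilter`, `isDenormal`) and the block length of 5000 samples are assumptions made for illustration.

```javascript
const MIN_NORMAL_F32 = Math.pow(2, -126); // smallest positive normal float32
const a = Math.fround(0.1);
const b = Math.fround(0.9);

// True for non-zero values below the float32 normal range.
const isDenormal = (v) => v !== 0 && Math.abs(v) < MIN_NORMAL_F32;

// Runs y[n] = x[n-1] * a + y[n-1] * b, rounding each step to float32.
function runFilter(input) {
  const out = new Float32Array(input.length);
  let prevX = 0;
  let prevY = 0;
  for (let n = 0; n < input.length; n++) {
    const y = Math.fround(Math.fround(prevX * a) + Math.fround(prevY * b));
    out[n] = y;
    prevX = input[n];
    prevY = y;
  }
  return out;
}

const input = new Float32Array(5000); // all zeros ...
input[0] = 1;                         // ... except a single impulse
const output = runFilter(input);

const firstDenormal = output.findIndex(isDenormal); // index where the tail goes denormal
const last = output[output.length - 1];             // final sample of the block
```

In this run the decaying tail enters the denormal range and the last sample of the block is still a non-zero denormal, so every later iteration keeps operating on denormal values even though the input has long been silent.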
Adding noise at appropriate places is the alternative method, yes, but it is not very practical. The point is that in C/C++, when the processor supports it of course, this AVOIDDENORMALS macro is very easy to use. It would be a pity not to be able to use the same kind of technique in JS when it is supported by the underlying processor. Could a FlushDenormalToZero() method be added to the Math package, for instance? It would set the processor in the appropriate mode if possible and return true. If not, the programmer would have to use the "add noise" method, basically as he/she would have to do in C/C++.
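For reference, a simplified variant of the "add noise" workaround can be sketched like this. It swaps the noise signal for a tiny constant offset, which is easier to reason about. The offset value `1e-20`, the coefficient `0.9`, and the function name `onePoleWithOffset` are all assumptions chosen for illustration, not values taken from this bug.

```javascript
const MIN_NORMAL_F32 = Math.pow(2, -126); // smallest positive normal float32
const ANTI_DENORMAL = 1e-20;              // assumed offset, far below audible levels

// One-pole feedback filter y[n] = x[n] + offset + coeff * y[n-1],
// with the state rounded to float32 at every step.
function onePoleWithOffset(input, coeff, offset) {
  const out = new Float32Array(input.length);
  let y = 0;
  for (let n = 0; n < input.length; n++) {
    y = Math.fround(input[n] + offset + coeff * y);
    out[n] = y;
  }
  return out;
}

const impulse = new Float32Array(5000);
impulse[0] = 1; // one impulse, then silence

const withOffset = onePoleWithOffset(impulse, 0.9, ANTI_DENORMAL);
const withoutOffset = onePoleWithOffset(impulse, 0.9, 0);

// With the offset the state settles near offset / (1 - coeff), which is a
// normal number, so the feedback path never operates on denormals.
const allNormal = withOffset.every((v) => Math.abs(v) >= MIN_NORMAL_F32);
```

The cost is that the offset has to be injected by hand at every feedback point, which is why it is less convenient than a processor-level flush-to-zero mode.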
It's a JS spec issue, not a Firefox-specific issue or a WebAudio spec issue. The JS spec says JS arithmetic has to follow IEEE754. You want something different.
There are several ways this could be addressed in JS, e.g.
-- "use flush_denormals_to_zero"
-- Math.flushDenormalToZero(v) (returns 0 if v is denormal, otherwise v)
-- flush-to-zero versions of JS SIMD intrinsics (or just make the intrinsics flush-to-zero by default)

A global variable that changes the semantics of all JS arithmetic is probably not a good idea, even though it matches what the hardware provides.
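As an illustration of the semantics described for the second option, here is a user-land sketch for double-precision values. `Math.flushDenormalToZero` is only a proposal in this discussion and does not exist as a built-in; the standalone function and the constant name below are hypothetical.

```javascript
// Smallest positive normal double (2^-1022).
const DBL_MIN_NORMAL = 2.2250738585072014e-308;

// Stand-in for the proposed Math.flushDenormalToZero(v):
// returns 0 if v is denormal, otherwise v (including NaN and infinities).
function flushDenormalToZero(v) {
  return v !== 0 && Math.abs(v) < DBL_MIN_NORMAL ? 0 : v;
}
```

A built-in with these semantics could be compiled to a cheap hardware operation, whereas this version costs a comparison per value and still has to read the denormal operand.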
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #13)
> A global variable that changes the semantics of all JS arithmetic is
> probably not a good idea, even though it matches what the hardware provides.

It's also entirely unrealistic that it would ever be accepted into the language. TC39 has a very strict "no more modes" policy, which certainly won't be broken for a case like this.

The SIMD intrinsics seem most promising to me. CC'ing folks working on that, and Luke because this is probably most relevant to asm.js code.
The combination of the Web Audio API and asm.js is a terrific platform that can have a deep impact on the way we write, distribute and use audio software in the future. Unfortunately, having slow denormals without a real solution could potentially defeat the whole process. My proposition was to have an exception, limited to custom DSP effects written in JS in the framework of the Web Audio API. This would not affect general JS code outside this very specific context. JS arithmetic remains IEEE754 compatible.
To see the problem, here is a piano physical model in C++ compiled to asm.js using Emscripten. A single string is duplicated 16 times (for "bug" demonstration purposes), so 16 strings are computed all the time.

http://faust.grame.fr/www/piano.html

Chrome CPU here ==> 36 %

Hit the "gate" button to play the same note on all 16 strings and wait some seconds: the CPU rises to 100%. Same issue with Firefox and Safari WebKit.

Stéphane Letz
With SIMD we do technically have the opportunity to introduce new semantics for arith operations, so we could mandate FTZ. In fact, ARM Neon forces FTZ. The problem is that the FTZ/DAZ mxcsr flags affect both SIMD and scalar double arithmetic, so we'd potentially have to flip the flag on and off repeatedly. For this reason, I think everyone wanted to leave the denormal behavior undefined. However, looking at agner.org, stmxcsr doesn't seem terribly expensive, so perhaps we could do this. As long as no one is interleaving scalar arith in SIMD loops, in theory we could hoist the stmxcsr's to before/after the loop. Any thoughts on this Dan/Benjamin?
(In reply to YANN ORLAREY from comment #15)
> My proposition was to have an exception, limited to custom dsp effects
> written in JS in the framework of the web audio API. This would not affect
> general JS code outside this very specific context. JS arithmetic remains
> IEEE754 compatible.

Except for the JS arithmetic in your DSP effects written in JS. Which means JS arithmetic would not actually remain IEEE-754 compatible.

C/C++ can get away with compiler-specific AVOIDDENORMALS sorts of things to modify floating-point behavior, because floating point computations don't have specified arithmetic semantics. JS, in contrast, precisely defines floating point arithmetic. There's no leeway to select different ones. Different semantics require new operations defined to have those semantics.

(In reply to letz from comment #16)
> Chrome CPU here ==> 36 %

Are you asserting, based on reading of Blink/v8 code, profiling, and perhaps Blink/v8 patching, that Chrome is flushing denormals here, and that's the specific reason it's faster? Or is it at all possible that some other unrelated factor(s) is/are in play and might be the cause of the performance difference?
"Chrome CPU here ==> 36 % "

The point was to show that 16 simulated piano strings compiled to asm.js consume about 36 % on this machine when running "normally" (no denormal issue while the note is played), and that the CPU rises to 100% when the notes become silent and the denormal problem starts to happen.
(In reply to letz from comment #19)
> The point was to show that 16 piano simulated strings compiled in asm.js
> when running "normally" (no denormal issue when the note is played) consume
> like 36 % on this machine, and the CPU raises to 100% when the notes becomes
> silent and denormal problem starts to happen.

Oh! I misread your comment. I see what you meant now. Sorry I got confused here.
(In reply to Luke Wagner [:luke] from comment #17)
> With SIMD we do technically have the opportunity to introduce new semantics
> for arith operations, so we could mandate FTZ. In fact, ARM Neon forces
> FTZ. The problem is that the FTZ/DAZ mxcsr flags affect both SIMD and
> scalar double arithmetic so we'd potentially have to flip the flag on and
> off repeatedly. For this reason, I think everyone wanted to leave the
> denormal behavior undefined. However, looking at agner.org, stmxcsr doesn't
> seem terribly expensive, so perhaps we could do this. As long as noone is
> interleaving scalar arith in SIMD loops, in theory we could hoist the
> stmxcsr's to before/after the loop. Any thoughts on this Dan/Benjamin?

ARM NEON doesn't have double precision operations, so to emulate them on ARM you would have to change the flush-to-zero flag in the FPCR, issue the VFP instructions, and then change the flush-to-zero flag back. If you have interleaved SIMD and scalar floating-point operations, this could be quite a mess.

ARM64 has denormals in both scalar and SIMD arithmetic (and the SIMD has double precision operations), and the flush-to-zero flag affects both.
The initial JS SIMD API only has Float32x4 (and (U)Int32x4). Does audio processing need double precision floats?
It may in some cases.
Oops, looks like there is a Float64x2 in the works in https://github.com/johnmccutchan/ecmascript_simd.
(In reply to Norbert.Schnell from comment #25)
> > Except for the JS arithmetic in your DSP effects written in JS. Which means
> > JS arithmetic would not actually remain IEEE-754 compatible.
>
> Yes, that's right.
> Insisting on IEEE-754-compatibility basically means insisting on
> audio-processing-incompatibility.

I can't agree more. Realtime audio applications, and signal processing applications more generally, rely heavily on recursive IIR filters. These recursive filters (for example Y[n] = X[n] + 0.5*Y[n-1]) will inevitably produce a huge number of denormals every second, in particular when the input signal becomes 0. The problem is that denormals are generally an order of magnitude slower to process, with nearly no benefit for audio applications. This is why we usually prefer to relax IEEE compatibility and run with the FTZ and DAZ flags set.

In other words, strict IEEE compatibility is currently incompatible with realtime audio applications as long as denormal handling is so slow on our processors. This is why we request a pragmatic solution with a clearly delimited exception that allows IEEE compatibility to be relaxed in some well-defined cases. Otherwise JS, despite all the interesting developments around the Web Audio API and asm.js, will remain audio-processing-incompatible.
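The recursive filter mentioned above, Y[n] = X[n] + 0.5*Y[n-1], can be used to sketch what flush-to-zero would change. The code below emulates single precision with `Math.fround` and optionally flushes the feedback state in software. The names `ftz` and `onePole` are hypothetical, and a software check like this is only a stand-in for the hardware FTZ/DAZ flags, which cannot be set from JS.

```javascript
const MIN_NORMAL_F32 = Math.pow(2, -126); // smallest positive normal float32

// Software flush-to-zero: anything below the normal range becomes 0.
const ftz = (v) => (Math.abs(v) < MIN_NORMAL_F32 ? 0 : v);

// Y[n] = X[n] + 0.5 * Y[n-1] in emulated float32.
// Returns the output and the number of denormal state values produced.
function onePole(input, flushState) {
  const out = new Float32Array(input.length);
  let y = 0;
  let denormals = 0;
  for (let n = 0; n < input.length; n++) {
    y = Math.fround(input[n] + 0.5 * y);
    if (flushState) y = ftz(y);
    if (y !== 0 && Math.abs(y) < MIN_NORMAL_F32) denormals++;
    out[n] = y;
  }
  return { out, denormals };
}

const impulse = new Float32Array(256);
impulse[0] = 1; // the input becomes 0 right after the first sample

const plain = onePole(impulse, false);  // strict IEEE behaviour
const flushed = onePole(impulse, true); // state flushed to zero
```

With strict IEEE rounding the tail passes through a run of denormal values before reaching zero; with the flush, the state drops to exactly zero as soon as it would leave the normal range.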
This problem might also be addressed by the value types work that is in progress. "Denormalized floats" could just be a distinct type of number that follows different rules.
A debug version of the "piano" example compiled with emcc -profiling and -s LINKABLE=1 so that lines can be looked at: http://faust.grame.fr/www/piano-debug.html The real audio computation is in "__ZN5piano7computeEiPPfS1_"
We got the confirmation that in practice, (almost..) all audio applications developed with CoreAudio on OSX, will be "automagically" protected against denormals. Read the thread here http://lists.apple.com/archives/coreaudio-api/2014/Jun/index.html
Two kind of crazy ideas:

1. Does the denormal slowdown happen in specific parts of the code? I wonder if we could add a PGO-like option in emscripten where the code checks for denormals in practice, then we rebuild with the profile and emscripten would emit round-to-zero in the relevant places where it was actually seen. This would require no manual user action (except for running a build to profile).

2. Much crazier ;) - could audio processing be done by a shader? That is, upload the data to a WebGL buffer, use a compute shader, and retrieve the data? GPUs always round to zero, so the problem would go away. If this makes sense in theory, we could write a JS library that makes it convenient. Or is typical audio processing non-GPU-able?
(In reply to Alon Zakai (:azakai) from comment #34)
> 2. Much crazier ;) - could audio processing be done by a shader? That is,
> upload the data to a WebGL buffer, use a compute shader, and retrieve the
> data? GPUs always round to zero, so the problem would go away. If this makes
> sense in theory, we could write a JS library that makes it convenient. Or is
> typical audio processing non-GPU-able?

In practice, this induces a lot of latency, because you would need to work on big buffers. I imagine the goal here is to have real-time interaction with the software, so latency should be minimal (especially considering this is going to be used by musicians who are used to physical instruments and tend to be even pickier about latency).

Also, even with the latency issue solved, audio code tends to be quite hard to convert to shaders (with some exceptions: I've had some good results doing massive convolutions on a GPU, for example for reverbs).
Thanks Alon for your proposal with emscripten, but don't forget that asm.js code can also be generated directly. We can now do that with a new asm.js generation back-end we have recently added to the Faust compiler. The withFTZ(code…) idea seems like the best proposal so far.