Closed Bug 977538 Opened 10 years ago Closed 10 years ago

MSVC with PGO still miscompiles/nops CanonicalizeNaN

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

()

VERIFIED FIXED
mozilla30
Tracking Status
firefox28 + fixed
firefox29 + fixed
firefox30 + verified
firefox-esr24 28+ fixed
b2g18 --- unaffected
b2g-v1.1hd --- unaffected
b2g-v1.2 --- unaffected
b2g-v1.3 --- unaffected
b2g-v1.4 --- unaffected

People

(Reporter: jandem, Assigned: jandem)

Details

(Keywords: sec-critical, Whiteboard: [adv-main28+][adv-esr24.4+])

Attachments

(2 files)

Attached file Testcase
Remember bug 859892, MSVC miscompiling the CanonicalizeNaN call in DataView.getFloat32? I fixed that bug but MSVC is still miscompiling CanonicalizeNaN.

Bug 939562 enables the JITs for more chrome code and this Win32 PGO bug is causing Jetpack crashes, but the reduced testcase also crashes a normal Nightly.

MSVC turns the CanonicalizeNaN call for DataView.getFloat64 into a no-op, so JS code can create arbitrary Values and this is sec-critical.
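
To illustrate why a skipped canonicalization is dangerous: an engine that stores non-double values inside the NaN space of a 64-bit word depends on every double read from untrusted bytes being collapsed to one fixed NaN. The following is a minimal C++ sketch under stated assumptions. The tag boundary `kHypotheticalTagStart` and all helper names are invented for illustration and are not SpiderMonkey's actual Value encoding.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical layout, for illustration only: any 64-bit word at or above this
// pattern is read as a tagged non-double value.
constexpr uint64_t kHypotheticalTagStart = 0xFFF9000000000000ULL;

// Reinterpret raw bytes as a double, roughly what a getFloat64-style read does.
inline double bitsToDouble(uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

inline uint64_t doubleToBits(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// True if the word would be mistaken for a tagged value under the layout above.
inline bool looksTagged(uint64_t bits) {
    return bits >= kHypotheticalTagStart;
}

// What canonicalization must guarantee: every NaN collapses to one fixed NaN,
// so no caller-chosen payload survives into a boxed value.
inline double canonicalize(double d) {
    double result = std::isnan(d) ? std::numeric_limits<double>::quiet_NaN() : d;
    assert(!looksTagged(doubleToBits(result)));
    return result;
}
```

If `canonicalize` degenerates into a no-op, a word such as `kHypotheticalTagStart | payload` passes through unchanged and is later read as a tagged value.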

For CanonicalizeNaN, MSVC with PGO generates the following code, annotated:

  // Prologue.
  mozjs!JS::CanonicalizeNaN:
  680ef5f0 55              push    ebp
  680ef5f1 8bec            mov     ebp,esp
  680ef5f3 83ec0c          sub     esp,0Ch

  // Move the double argument to ebp-8. Also save esi and set it to 0.
  680ef5f6 dd4508          fld     qword ptr [ebp+8]
  680ef5f9 56              push    esi
  680ef5fa 33f6            xor     esi,esi
  680ef5fc dd5df8          fstp    qword ptr [ebp-8]

  // Compare esi with itself and jump to "Done." if the operands differ.
  // They are always equal, so this branch is never taken.
  680ef5ff 3bf6            cmp     esi,esi
  680ef601 7513            jne     mozjs!JS::CanonicalizeNaN+0x26 (680ef616)

  // mozilla::IsNaN does (bits & DoubleExponentBits) == DoubleExponentBits,
  // so this looks reasonable. If this test fails, we're done.
  680ef603 8b45fc          mov     eax,dword ptr [ebp-4]
  680ef606 250000f07f      and     eax,7FF00000h
  680ef60b 3d0000f07f      cmp     eax,7FF00000h
  680ef610 0f849b911900    je      mozjs!JS::CanonicalizeNaN+0x1991c1 (682887b1)

  // Done, return the double and restore esi.
  680ef616 dd4508          fld     qword ptr [ebp+8]
  680ef619 5e              pop     esi
  680ef61a c9              leave
  680ef61b c3              ret

The branch that's never taken is a bit weird for an opt build, but so far so good. Here's what happens when we have a NaN value and jump to 682887b1:

  // Load the high word in edx, low word in eax.
  682887b1 8b55fc          mov     edx,dword ptr [ebp-4]
  682887b4 8b45f8          mov     eax,dword ptr [ebp-8]

  // mozilla::IsNaN does: (bits & DoubleSignificandBits) != 0
  // DoubleSignificandBits == 0x000fffff ffffffff, so the and instruction below
  // makes some sense.
  682887b7 81e2ffff0f00    and     edx,0FFFFFh

  // The code below is totally bogus, we "or" both words, but whatever
  // happens we jump to "Done." and return the original input.
  682887bd 0bc2            or      eax,edx
  682887bf 0f84516ee6ff    je      mozjs!JS::CanonicalizeNaN+0x26 (680ef616)
  682887c5 e94c6ee6ff      jmp     mozjs!JS::CanonicalizeNaN+0x26 (680ef616)
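
For comparison with the disassembly, the intended logic can be sketched in portable C++. The mask values follow the annotations above (the 7FF00000h exponent test on the high word and the 0x000fffff ffffffff significand mask). The names `isNaNBits` and `canonicalizeNaNSketch` are illustrative and are not taken from the Mozilla sources.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

constexpr uint64_t kExponentBits    = 0x7FF0000000000000ULL;
constexpr uint64_t kSignificandBits = 0x000FFFFFFFFFFFFFULL;

inline bool isNaNBits(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    // First test (680ef603..680ef610): all exponent bits must be set.
    if ((bits & kExponentBits) != kExponentBits)
        return false;
    // Second test (682887b1..682887bd): a non-zero significand separates
    // NaN from +/-infinity.
    return (bits & kSignificandBits) != 0;
}

inline double canonicalizeNaNSketch(double d) {
    // Correct code returns a fixed NaN on this path. The miscompiled code
    // jumped to "Done." and returned d whatever the second test produced.
    if (isNaNBits(d))
        return std::numeric_limits<double>::quiet_NaN();
    return d;
}
```

In the generated code both the je and the jmp after the "or" target 680ef616, which is why the result of the second test has no effect.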
Summary: MSVC PGO builds still miscompiles/nops CanonicalizeNaN → MSVC with PGO still miscompiles/nops CanonicalizeNaN
Keywords: sec-critical
My current plan of attack is to disable PGO for JS::CanonicalizeNaN and see if that helps.
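
One way to exempt a single function from the optimizer in MSVC is `#pragma optimize`. The sketch below shows that approach on a stand-in function. It is not necessarily what the attached patch does, the name `canonicalizeNaNUnoptimized` is hypothetical, and whether the pragma also keeps the profile-guided pass away from the function is an assumption to verify in the generated code.

```cpp
#include <cmath>
#include <limits>

// Assumption: MSVC's `#pragma optimize` serves as the per-function switch.
// Other compilers skip the pragma and compile the function normally.
#if defined(_MSC_VER)
#  pragma optimize("", off)  // turn optimizations off for functions defined below
#endif

double canonicalizeNaNUnoptimized(double d) {
    if (std::isnan(d))
        return std::numeric_limits<double>::quiet_NaN();
    return d;
}

#if defined(_MSC_VER)
#  pragma optimize("", on)   // restore the optimization settings from the command line
#endif
```

The pragma must sit outside any function body and applies from the next function definition onward, so the guarded region should be kept as small as possible.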

But we should also consider disabling PGO completely for (big parts of) JS. The perf win from PGO should be a lot less than in the interpreter days, and even if we lose a few % on the benchmarks we can make up for that elsewhere. Somebody should measure.
As of two years ago PGO on Windows was still good for a 10% improvement on Sunspider:
https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/HzAIVijRXUE That was from bug 641325.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> As of two years ago PGO on Windows was still good for a 10% improvement on
> Sunspider:

With our new JITs we should spend less time in the interpreter/VM though, so I expect this to be less nowadays. I'll get some numbers.

How can I get a Try build with --disable-profiling --disable-js-diagnostics? I assume --enable-profiling affects both PGO and non-PGO builds, but ideally we'd compare without it.
I downloaded PGO and non-PGO inbound builds, created a new profile and ran some benchmarks.

On Sunspider, PGO still helps about 10%. Sunspider is kind of a best case for PGO though because it's short running so we spend more time in the interpreter/VM than other benchmarks. Most of this is on a few different tests; it would be interesting to see where PGO is helping us and if we can add JIT/C++ optimizations to get there without PGO.

On Kraken, PGO helps about 3-4%. Kraken spends more time in JIT code. Octane is a bit more noisy, but it looks like PGO is a ~5% win.

So PGO is still a measurable perf win. Question is if we really need PGO for all of JS or just a small number of files (Interpreter.cpp, jsobj.cpp, etc).

It's really unfortunate that our shell fuzzers are not testing the code we run in the browser. Is this something we can easily fix?
While I was stepping through the code, I noticed that MSVC with PGO was not inlining many trivial functions like Value::toObject(), CallArgs::rval(), etc. Also note that CanonicalizeNaN in comment 0 is not inlined.

With a non-PGO build, all these methods *are* inlined. So I created a silly micro-benchmark to see which one is faster:

function f() {
    var buffer = new Uint8Array(8);
    var view = new DataView(buffer.buffer);
    var t = new Date;
    for (var i=0; i<10000000; i++)
        view.getFloat64(0);
    alert(new Date - t);
}

And indeed, PGO builds are much slower (663 ms with PGO, 369 ms without PGO).

This suggests that PGO builds don't just optimize hot code, they also deoptimize cold code. If this is true, disabling PGO for code not exercised in our profile run could actually be a win...
In general, PGO not inlining cold code is one of its most important features: hot code is optimized for speed and cold code is optimized for size because overall that produces the fastest result (because of cache miss rates etc).

And I don't think this s-s bug is the right place to discuss our overall PGO strategy. Let's fix the bug at hand by removing this particular function from PGO in the simplest way possible (or figuring out why it's miscompiling and working around it, though that seems harder).
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> In general, PGO not inlining cold code is one of its most important
> features: hot code is optimized for speed and cold code is optimized for
> size because overall that produces the fastest result (because of cache miss
> rates etc).

I get that, but if you decide what's hot based on an outdated benchmark like Sunspider and the compiler deoptimizes the other 80% of the code you'll lose on (real-world) workloads.

> And I don't think this s-s bug is the right place to discuss our overall PGO
> strategy.

Agreed. Sorry, I'll take this elsewhere.
Attached patch Patch
Disable PGO for CanonicalizeNaN. Jetpack tests are green now (on top of bug 939562):

https://tbpl.mozilla.org/?tree=Try&rev=42153df0d9d1
Attachment #8383165 - Flags: review?(luke)
I tested Firefox 27 and 29 and they don't crash, so this seems to only affect Nightly. That makes it a lot less scary. I'll backport the patch because it's trivial and in case other callers have the same problem.
Comment on attachment 8383165 [details] [diff] [review]
Patch

Nice job tracking this down Jan!
Attachment #8383165 - Flags: review?(luke) → review+
Comment on attachment 8383165 [details] [diff] [review]
Patch

AFAIK this only affects m-c. Asking for sec-approval though because I don't know when this was introduced and it *may* affect older branches somehow, so I'd like to backport the patch.

[Security approval request comment]
> How easily could an exploit be constructed based on the patch?
Not very easy. There are multiple callers of this function and not all of them are affected.

> Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
No.

> Which older supported branches are affected by this flaw?
It only affects Nightly. However, this may cause similar problems in older versions so I'd like to backport it to be safe.

> If not all supported branches, which bug introduced the flaw?
Unknown.

> Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
Should apply.

> How likely is this patch to cause regressions; how much testing does it need?
Unlikely.
Attachment #8383165 - Flags: sec-approval?
FWIW, 27.0.1 has two copies of this function, one of which (called from js::ctypes::ConvertToJS, and possibly elsewhere) has exactly the same disassembly as comment 0. The other copy (called from js::DataViewObject::getFloat64Impl, and possibly elsewhere) looks OK at first glance.
Comment on attachment 8383165 [details] [diff] [review]
Patch

sec-approval+ for trunk.

We'll need discussion with Release Management about taking it on Beta but if you make an Aurora patch, I can approve that as well.
Attachment #8383165 - Flags: sec-approval? → sec-approval+
Comment on attachment 8383165 [details] [diff] [review]
Patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown.
User impact if declined: Possible crashes or security issues.
Testing completed (on m-c, etc.): On m-i.
Risk to taking this patch (and alternatives if risky): Low.
String or IDL/UUID changes made by this patch: None.
Attachment #8383165 - Flags: approval-mozilla-aurora?
Comment on attachment 8383165 [details] [diff] [review]
Patch

Patch also applies to beta.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown.
User impact if declined: Possible crashes or security issues.
Testing completed (on m-c, etc.): On m-i.
Risk to taking this patch (and alternatives if risky): Low.
String or IDL/UUID changes made by this patch: None.

[Approval Request Comment]
User impact if declined: Possible crashes and/or security issues.
Fix Landed on Version: m-c, but will be backported.
Risk to taking this patch (and alternatives if risky): Low.
String or UUID changes made by this patch: None.
Attachment #8383165 - Flags: approval-mozilla-esr24?
Attachment #8383165 - Flags: approval-mozilla-beta?
Attachment #8383165 - Flags: approval-mozilla-esr24? → approval-mozilla-esr24+
Attachment #8383165 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8383165 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Let's fix the bug at hand by removing this particular function
> from PGO in the simplest way possible (or figuring out why it's miscompiling
> and working around it, though that seems harder).

Isn't figuring out why it's being miscompiled a requirement to prevent this from happening again with other functions? If MSVC's PGO can cause such critical problems, maybe this is not the only case.
https://hg.mozilla.org/mozilla-central/rev/00f1d0e19c9b
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Flags: in-testsuite?
Whiteboard: [adv-main28+][adv-esr24.4+]
Confirmed crash in Fx30, 2014-02-14.
Verified fix in Fx30, 2014-03-12.

I never saw a crash in other branches, and based on comment 9, it appears to have been backported only for good measure. So no QA verification will be done on 24esr/28/29.
Status: RESOLVED → VERIFIED
Group: core-security
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2edc56eddf55
Land the attached testcase as a crashtest. r=me
Flags: in-testsuite? → in-testsuite+