Closed Bug 1155836 Opened 5 years ago Closed 5 years ago

Try to avoid the AMD bug in the DoGetStyle* functions

Categories

(Core :: CSS Parsing and Computation, defect)

x86_64
Windows 8.1
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox37 --- wontfix
firefox38 --- fixed
firefox38.0.5 --- fixed
firefox39 --- fixed
firefox40 --- fixed

People

(Reporter: dmajor, Assigned: dmajor)

References

Details

Attachments

(1 file, 1 obsolete file)

Lately, whenever bug 772330 hits, it's always in one of the nsStyleContext::DoGetStyle helpers.

We still don't know what actually triggers the bug. It's probably edge cases in the chip internals that aren't architecturally exposed. I know that it can't be *entirely* caused by the DoGetStyle code, because I've found the same disassembly in both crashy and non-crashy builds.

But maybe the code is *partly* responsible; maybe changing DoGetStyle might stop the crashes. These are performance-sensitive functions so we'll need to be very careful. I think the only way I can justify touching this code is if I can squeeze a tiny bit more perf out of it, and maybe as a side effect the AMD bug might go away.

As a starting point: I noticed that (at least on Windows) aComputeData gets pushed on the stack. Maybe we could try templating to avoid that.
Attached patch Template on aComputeData (obsolete) — Splinter Review
One possible approach
Comment on attachment 8594136 [details] [diff] [review]
Template on aComputeData

(Sorry for the flag, bz.)

What do you think of this patch? I hear you've put some effort into optimizing these functions so I want to be careful not to regress that. What measurements were you most looking at when you were optimizing? What testing do I need to do (and on which platforms) to see whether this is an acceptable change?

And if you have suggestions for different approaches, I'm all ears.
Attachment #8594136 - Flags: feedback?(bzbarsky)
Comment on attachment 8594136 [details] [diff] [review]
Template on aComputeData

This seems reasonable.  The one worry is that this is inlining a bit more than it used to, so might cause a bit more code bloat; we may be able to just put the template implementation out of line to avoid that, if we have to.  But I wouldn't worry about it as a first cut.

In terms of what I measured when optimizing this, I think I did things like take the WHATWG HTML single-page spec, set the root to display:none, flush layout, set the root to display:block, and measure how long a style flush (as measured by "getComputedStyle(document.documentElement).color" takes, then profile that operation and look at what's going on in the style system bits.

In terms of testing, I expect just measuring the performance of said style flush and making sure it doesn't noticeably regress is good enough.  I don't know about platforms... I tend to assume/hope that this stuff is reasonably cross-platform, which I realize is silly given our different compilers.  ;)

One thing we can definitely do, with this patch, is nix the MOZ_UNLIKELY() around "!aComputeData": Now that it's a template aComputeData is a compile-time constant, so the compiler will either drop the code after that if block or drop the "return nullptr", and either way drop the if test, right?
Attachment #8594136 - Flags: feedback?(bzbarsky) → feedback+
(In reply to Not doing reviews right now from comment #3)
> Comment on attachment 8594136 [details] [diff] [review]
> Template on aComputeData
> 
> This seems reasonable.  The one worry is that this is inlining a bit more
> than it used to, so might cause a bit more code bloat; we may be able to
> just put the template implementation out of line to avoid that, if we have
> to.  But I wouldn't worry about it as a first cut.

FWIW the RuleNode methods had already been getting inlined on MSVC. There might be a little bit of extra code from splitting the <true> and <false> versions, but there are very few PeekFoo callers so most of the <false> ones get dropped by the linker anyway.

> One thing we can definitely do, with this patch, is nix the MOZ_UNLIKELY()
> around "!aComputeData": Now that it's a template aComputeData is a
> compile-time constant, so the compiler will either drop the code after that
> if block or drop the "return nullptr", and either way drop the if test,
> right?

True, good catch.
MSVC seems to take this pattern way too literally (with or without the template change):

      const nsStyle##name_ * cachedData = mCachedResetData              \
        ? static_cast<nsStyle##name_*>(                                 \
            mCachedResetData->mStyleStructs[eStyleStruct_##name_])      \
        : nullptr;                                                      \
      if (cachedData) /* Have it cached already, yay */                 \
        return cachedData;                                              \

Note that if you fail the mCachedResetData test, you'll never pass the cachedData test. But the compiler wants to be extra sure:

xul!nsStyleContext::DoGetStylePosition:
1035d0f4 55              push    ebp
1035d0f5 8bec            mov     ebp,esp
1035d0f7 8b4120          mov     eax,dword ptr [ecx+20h]
1035d0fa 85c0            test    eax,eax   ; OK, standard null check
1035d0fc 741c            je      xul!nsStyleContext::DoGetStylePosition+0x26 (1035d11a)

; let's say eax is 0 and follow the 'je':
1035d11a 33c0            xor     eax,eax   ; seriously?
1035d11c ebe3            jmp     xul!nsStyleContext::DoGetStylePosition+0xd (1035d101)

; let's follow that jmp:
1035d101 85c0            test    eax,eax   ; again?

So I changed it to:

+      if (mCachedResetData) {                                           \
+        const nsStyle##name_ * cachedData =                             \
+          static_cast<nsStyle##name_*>(                                 \
+            mCachedResetData->mStyleStructs[eStyleStruct_##name_]);     \
+        if (cachedData) /* Have it cached already, yay */               \
+          return cachedData;                                            \
+      }                                                                 \

And the cruft fell out. PGO started inlining a lot deeper, too.
As far as I can tell this is a wash, perf-wise.

Win32 talos is all gray (as in neither red nor green) with an equal number of non-significant pluses and minuses. I tried bz's WHATWG test locally under xperf and didn't see a difference in the style system.
Attachment #8596229 - Flags: review?(dbaron)
Comment on attachment 8596229 [details] [diff] [review]
Template + rewrote null-check

r=dbaron, and sorry for the delay getting to this.

(Once I read it closely it was a lot simpler than I thought it was going to be from skimming it.)
Attachment #8596229 - Flags: review?(dbaron) → review+
Attachment #8594136 - Attachment is obsolete: true
I think 39a2 is affected by the "AMD bug", so it might be worth uplifting this
to see if it helps there. (bug 1160317)
Mats, all versions are randomly affected, including betas and sometimes release RCs (but we see to rebuild and finally release a version that is not affected). Be sure that we'll want to uplift to any feasible trains in any case.
https://hg.mozilla.org/mozilla-central/rev/98d3519ab098
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
jmaher helped get me some extra Talos runs on the inbound landing: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=4d85a02e9c28&newProject=mozilla-inbound&newRevision=98d3519ab098

Avi, we may consider uplifting this to beta or even release. Could you let me know if there's anything in that delta that looks 'real' and what are the next steps for investigating? Thanks!
Flags: needinfo?(avihpit)
The red and green highlights look reasonably real to me. They're not huge, but they're not negligible either. OTOH, besides the regressions, there are also improvements.

I'd say that the bug this patch tries to fix is important enough to be worth landing despite the apparent regressions, especially since we want to uplift it ASAP.

If it lands and uplifts though, I'd say the regressions are worth filing followup bugs to examine them more closely and hopefully fix them where possible.
Flags: needinfo?(avihpit)
pgo runs are done and retriggered.  The above link doesn't have the changes rolled out to support PGO :(  

The original compare talos should help us out here:
https://compare-talos.paas.mozilla.org/?oldBranch=mozilla-inbound&oldRevs=4d85a02e9c28&newBranch=mozilla-inbound&newRev=98d3519ab098&submit=true
(In reply to Joel Maher (:jmaher) from comment #15)
> https://compare-talos.paas.mozilla.org/?oldBranch=mozilla-
> inbound&oldRevs=4d85a02e9c28&newBranch=mozilla-
> inbound&newRev=98d3519ab098&submit=true

I don't get any results on that page. Am I doing something wrong?
Flags: needinfo?(jmaher)
Assignee: nobody → dmajor
Component: General → CSS Parsing and Computation
The PGO numbers look a lot better (and are the ones that really matter here). The only result that mildly stands out to me is dromaeo_css on linux32. From looking at the graph of various build results before and after the change, I don't think it's an issue though.

At this point I would feel comfortable requesting an uplift.
(In reply to David Major [:dmajor] from comment #17)
> Apparently you have to capitalize Mozilla-Inbound.
> 
> https://compare-talos.paas.mozilla.org/?oldBranch=Mozilla-Inbound&oldRevs=4d85a02e9c28&newBranch=Mozilla-Inbound&newRev=98d3519ab098&submit=true

(In reply to David Major [:dmajor] from comment #18)
> The PGO numbers look a lot better (and are the ones that really matter
> here). The only result that mildly stands out to me is dromaeo_css on
> linux32. From looking at the graph of various build results before and after
> the change, I don't think it's an issue though.
> 
> At this point I would feel comfortable requesting an uplift.

Agreed.
Comment on attachment 8596229 [details] [diff] [review]
Template + rewrote null-check

Approval Request Comment
[Feature/regressing bug #]: hardware!
[User impact if declined]: high volume crashes
[Describe test coverage new/current, TreeHerder]: CI is green (but that doesn't test the fix itself)
[Risks and why]: The main risk is that we don't know whether this will actually stop the crashes.
[String/UUID change made/needed]: none
Attachment #8596229 - Flags: approval-mozilla-beta?
Attachment #8596229 - Flags: approval-mozilla-aurora?
Comment on attachment 8596229 [details] [diff] [review]
Template + rewrote null-check

Approved for uplift to aurora in hopes this will fix a hardware crash.
Attachment #8596229 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Guys, we are considering to do 38.0 rc build2. We are afraid to do that because of the amd crash.
If we take this patch, do you think it will mitigate the risk to have the amd bug in a build2 ?
Thanks
Flags: needinfo?(dmajor)
Flags: needinfo?(dbaron)
Flags: needinfo?(bzbarsky)
Given how badly the 38 betas have been hit, there's a high likelihood that we'll see the bug again if we do nothing. If we take the patch, the likelihood is unknown. It may work or it may not. To me that sounds like a better chance than doing nothing. I would say let's take the patch, but monitor the build just as aggressively as if we hadn't.
Flags: needinfo?(dmajor)
Flags: needinfo?(dbaron)
Flags: needinfo?(bzbarsky)
Comment on attachment 8596229 [details] [diff] [review]
Template + rewrote null-check

[Triage Comment]
OK, let's take it in release & the 38.0 relbranch release ( GECKO380_2015050320_RELBRANCH )
Attachment #8596229 - Flags: approval-mozilla-beta? → approval-mozilla-release+
(thanks btw)
I presume we aren't fixing this for 37....
For what it's worth, so far we haven't seen these crashes during the 38.0.5 and 39.0 betas. (Although we did see one each in JS and the CSS parser)
You need to log in before you can comment on or make changes to this bug.