NS_SUCCEEDED/NS_FAILED could get better codegen with a sign check

NEW
Unassigned

Status

()

Core
XPCOM
5 years ago
5 years ago

People

(Reporter: dmajor (away and/or busy), Unassigned)

Tracking

Trunk
x86_64
Windows 7
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Created attachment 819216 [details] [diff] [review]
Check sign instead of bit mask

NS_SUCCEEDED/NS_FAILED are pretty common, and (at least on Windows) I've been seeing a ton of code that goes like this:

e8a5feffff      call    FunctionThatReturnsNSResult
8bc8            mov     ecx,eax
c1e91f          shr     ecx,1Fh
f6c101          test    cl,1
7515            jne     ErrorHandler

Really that code should just check the sign bit:

e8a5feffff      call    FunctionThatReturnsNSResult
85c0            test    eax,eax
7815            js      ErrorHandler

The reduced instruction count is probably not important on modern processors. But saving six bytes is nice, and freeing up a register is even nicer. 

Changing to a sign-check shrinks opt xul.dll by 147KB. However, that gets completely erased by PGO, and in fact under PGO the binary actually grows by 24KB. Maybe the optimizer is using the extra register freedom to do smart things?

I haven't taken the time to do perf measurements; I'd like to collect feedback first. Is there a good reason for keeping the existing bit mask? It has been unchanged for 15 years, and surely somebody must have proposed this change before.
Isn't the cast to int32_t technically undefined behavior?
(In reply to On vacation Oct 12 - Oct 27 from comment #1)
> Isn't the cast to int32_t technically undefined behavior?

It is technically implementation-defined ([conv.integral]p3); I'm pretty sure we rely on the expected behavior in a lot of places (even if we shouldn't).

FWIW, GCC (and I assume clang) are able to optimize the existing code to the sign bit test.  Maybe the PGO optimizer is just smarter (why?!).  Does anything interesting happen if you change the function to:

 inline bool NS_FAILED_impl(nsresult _nsresult) {
   return (static_cast<uint32_t>(_nsresult) & 0x80000000) != 0;
 }

?  (I assume you also checked the new xul.dll to see that the size reduction was because MSVC was now doing the sign-bit tests?)
(Reporter)

Comment 3

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #2)

> FWIW, GCC (and I assume clang) are able to optimize the existing code to the
> sign bit test.  

No surprise there :-)

> Does anything interesting happen if you change the function to:
> 
>  inline bool NS_FAILED_impl(nsresult _nsresult) {
>    return (static_cast<uint32_t>(_nsresult) & 0x80000000) != 0;
>  }

I just tried and that doesn't seem to help.

> ?  (I assume you also checked the new xul.dll to see that the size reduction
> was because MSVC was now doing the sign-bit tests?)

I spot-checked a few places where I had noticed right-shifts before, and those shifts were gone.

Comment 4

5 years ago
It would actually be nice if someone takes a look at what clang does to this code, as the clang optimizer is vastly different than the gcc optimizer.

Comment 5

5 years ago
My opinions are:

* we should probably not rely on undefined behavior unless there are big wins to be had
* Otherwise we should do the thing that gets us the best-optimized code in the release build for each platform

Given that, the only really interesting datapoint here is which is better in PGO builds in terms of speed and codesize (and I suspect codesize is the better metric here).

Comment 6

5 years ago
(In reply to comment #5)
> My opinions are:
> 
> * we should probably not rely on undefined behavior unless there are big wins
> to be had
> * Otherwise we should do the thing that gets us the best-optimized code in the
> release build for each platform
> 
> Given that, the only really interesting datapoint here is which is better in
> PGO builds in terms of speed and codesize (and I suspect codesize is the better
> metric here).

Note that we don't PGO on Mac, which is why I suggested investigation with clang.
(In reply to :Ehsan Akhgari (needinfo? me!) [Away 10/29-11/6] from comment #4)
> It would actually be nice if someone takes a look at what clang does to this
> code, as the clang optimizer is vastly different than the gcc optimizer.

clang tests the sign bit as expected.
(Reporter)

Comment 8

5 years ago
For my own notes: I did verify that current shipped PGO builds do use right-shifts. (In case there are concerns about red herrings in local developer builds)
(Reporter)

Comment 9

5 years ago
I can't measure any performance improvement from this. It's well within noise on both PGO and non-PGO. And given that PGO doesn't even get a size improvement, I don't have much reason to spend more time on this.

Oh well, at least I got an answer to my question of "why hasn't anyone done this before"
You need to log in before you can comment on or make changes to this bug.