Closed Bug 513422 Opened 15 years ago Closed 14 years ago

add common code for decisions about compiling and using SSE* code

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 3 obsolete files)

I started to look into making the optimization on bug 506430 work on Linux, and quickly came to the conclusion that we ought to both consolidate and improve our code for:
 * whether to compile things like SSE2 code
 * whether to run such code (based on CPUID checks, etc.)

This will allow us to:
 * add such optimizations more easily in the future
 * get them running across platforms more quickly
 * respect the -march gcc option that says to assume that the processor has certain capabilities
 * be able to use emmintrin.h, etc., on Linux, even when compiling without an -march option that says those instructions can be assumed to be available

I have some work in progress on this in http://hg.mozilla.org/users/dbaron_mozilla.com/patches/file/tip/sse-tests but it's still a rough sketch, and I haven't started testing it yet, and have a bunch more changes I intend to make (such as copying the cpuid code from gfx/qcms/transform.c and then using it).
(In reply to comment #0)
>  * be able to use emmintrin.h, etc., on Linux, even when compiling without an
> -march option that says those instructions can be assumed to be available

This seems like it actually won't be possible.  It's possible to work around the fact that emmintrin.h, etc., have a #error if __SSE2__ etc. aren't defined; however, after I do that I end up with errors like:

/tools/gcc-4.1.1/lib/gcc/i686-pc-linux-gnu/4.1.2/include/mmintrin.h: In function ‘void _mm_empty()’:
/tools/gcc-4.1.1/lib/gcc/i686-pc-linux-gnu/4.1.2/include/mmintrin.h:48: error: ‘__builtin_ia32_emms’ was not declared in this scope

and the others in this log:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1251576661.1251577221.11207.gz
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas
(see also http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007btarget_007d-function-attribute-2355 for more details) seems like it ought to help with the problem in comment 1 in gcc 4.4 and newer.  However, I ran into two gcc bugs that prevent this from working:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39787
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41201

So I've given up on the idea of CPUID-based stuff on gcc-based platforms for now.

For Mac this isn't a big deal since the Intel Mac's default -march includes support for MMX, SSE, and SSE2 (since I guess that's the minimum Intel CPU that they've shipped Macs with).

Hopefully those gcc bugs will get fixed not too far in the future and we'll be able to give this another shot.  (The other alternative is to hand-write the assembly rather than using the intrinsics, but that seems quite painful, and it looks like things are pretty close to being able to avoid that.)
(In reply to comment #2)
> However, I ran into two gcc bugs that prevent this from
> working:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39787
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41201

The second is relatively easy to work around, though, if the first happens to get fixed first.
(In reply to comment #2)
> For Mac this isn't a big deal since the Intel Mac's default -march includes
> support for MMX, SSE, and SSE2 (since I guess that's the minimum Intel CPU that
> they've shipped Macs with).

Yes, no Mac ever shipped with an Intel processor that didn't support at least SSE2 IIRC.  I think the first Intel Macs shipped with the Core microarchitecture, but its possible that some end model Pentiums slipped in there too.  SSE2 was first introduced in the Williamette P4s in 2001 IIRC, so any Intel Mac should be good.

Obviously we can't use SSE2 on PPC, I don't know anything about the PPC architecture but it may not be worth writing separate optimizations for an architecture which has basically been end-of-lifed by Apple.

With respect to GCC, there are a lot of long-standing problems using sse.  It looks like the first gcc bug you mentioned (39787) isn't getting too much traction so the assembly workaround may be what we need to do.  Writing an ASM function to determine support for MMX, SSE, SSE2 shouldn't be too difficult.
Attached patch patch for review (obsolete) — Splinter Review
Here's a patch for review with the pieces that are working for me now.

Relative to what the patch in bug 506430 does (once that code is changed to use this code, which I have in a separate patch), all this changes is that:

  1. we use a function using the cpuid intrinsic instead of __sse2_available on Windows

  2. we honor the -march or -msse* options on gcc-based platforms.  On Intel Mac, -march defaults to enabling MMX, SSE, SSE2, and SSE3, which means we're not changing anything from that patch on the default configuration on Mac.  It does, however, change what that patch does for x86_64 Linux, since the base x86_64 chips have MMX, SSE, and SSE2 (at least according to gcc's defaults for the architecture).

This also adds mechanisms for detection of a whole bunch of features other than SSE2.


I don't yet have a solution that allows CPUID-based use of *intrin.h on gcc; however, gcc 4.4 is relatively close to such a solution; see comment 2 for the bugs known to block this working (one of the two is easy to work around if it comes to that; see comment 3)

We could add additional macros to this header for whether it makes sense to use inline assembly-based solutions for gcc, but I haven't done that (yet, anyway).  It would be practical to do inline assembly with CPUID-based detection for gcc today, but the only platform it would help (given that what we're using is SSE2) is 32-bit Linux.


I'm requesting review from Benjamin, but I'd welcome comments from the rest of the cc: list as well.
Attachment #397885 - Flags: review?(benjamin)
Attached patch patch for review (obsolete) — Splinter Review
Also, the previous patch, plus the patch to use this code in nsUTF8ToUnicode, passed the try server (modulo usual random orange).

However, it also probably triggers warnings for unknown pragmas on older gccs, which this one should fix; it also improves a comment that I noticed I hadn't updated.
Attachment #397885 - Attachment is obsolete: true
Attachment #397887 - Flags: review?(benjamin)
This looks wrong -- it calls has_cpuid_bit for every has_* function.  It should really be initialized once in global xpcom init, and these should just be straight static variable returns.
(In reply to comment #7)
> This looks wrong -- it calls has_cpuid_bit for every has_* function.  It should
> really be initialized once in global xpcom init, and these should just be
> straight static variable returns.

I did consider that, but the work needed to load a global variable (dealing with PIC, etc.) didn't seem obviously less than the work needed to execute the CPUID instructions twice and check its results.  And doing the global variable thing is a good bit more work dealing with XPCOM glue, etc., so I did what was easier.
(In reply to comment #4)
> With respect to GCC, there are a lot of long-standing problems using sse.  It
> looks like the first gcc bug you mentioned (39787) isn't getting too much
> traction so the assembly workaround may be what we need to do.  Writing an ASM
> function to determine support for MMX, SSE, SSE2 shouldn't be too difficult.

So it's pretty obvious that I didn't read the upstream bug carefully enough before writing this.  Using inline assembly instead of the intrinsics for the actual algorithms probably isn't a good idea.

In your testcase in the GCC bugzilla what seems to blow it up is the push/pop_options pragmas.  Specifically, if you move pop_options to after the intrinsic call it compiles fine.  

Obviously GCC shouldn't ICE, but on the other hand it's not entirely clear to me why your testcase *should* work.  If you've moved back out of targeting sse2 architectures by pop_optioning why shouldn't SSE2 intrinsics fail to compile?

We would have to do something like 
#include "sse.h"

code code code
#ifdef MOZ_COMPILE_WITH_ABC
MOZ_ENTER_ABC_MODE();
if (mozilla::use_abc())
{
do_abc();
} else {
MOZ_EXIT_ABC_MODE();
#endif
do_fallback();
#ifdef MOZ_COMPILE_WITH_ABC
}
#endif
code code code


Where 
#ifdef GCC_ON_LINUX
#define MOZ_ENTER_ABC_MODE _Pragma ("GCC push_options") \
                           _Pragma ("GCC target(\"ABC\") \
#define MOZ_EXIT_ABC_MODE  _Pramga ("GCC pop_options)
#else // GCC_ON_MAC_THROUGH_AT_LEAST_SSE2, MSVC.
#define MOZ_ENTER_ABC_MODE      ; // Making it clear this is a NOOP
#define MOZ_EXIT_ABC_MODE       ; // Making it clear this is a NOOP
#endif

(you have to use _Pragma because macros can't expand to preprocessor directives)

I'm sure I'm missing something, but it appears to me that something along these lines would work.  This has some other less than desirable side effects, for instance GCC target pragmas work at a function level so the do_abc / do_fallback stuff needs to be wrapped in an inline function, but it will compile and generate code at least ;-)
Actually you only wrap do_abc in an inline function.  Doing something like this to your testcase makes it compile fine with the correctly emitted instructions.
What sort of log?  I just got an internal compiler error again.
Hmm it compiles correctly for me.  Which function does it ICE in?  Convert_ascii_run_sse2_part?
(In reply to comment #8)
> I did consider that, but the work needed to load a global variable (dealing
> with PIC, etc.) didn't seem obviously less than the work needed to execute the
> CPUID instructions twice and check its results.  And doing the global variable
> thing is a good bit more work dealing with XPCOM glue, etc., so I did what was
> easier.

Sure, but this means executing a cpuid at the start of pretty much every hot loop where we want maximum performance... it seems worth it to avoid that, especially since I very much doubt the cpuid instruction is optimized (calling it involves spilling some registers at the very least, might also include stalls/flushes/whatever).
(In reply to comment #0)
>  * be able to use emmintrin.h, etc., on Linux, even when compiling without an
> -march option that says those instructions can be assumed to be available

Is this really a requirement?  I thought that you had to use -march etc. to have the compiler generate those instructions, but that you could always include emmintrin.h and use the intrinsics.
(In reply to comment #18)
> (In reply to comment #0)
> >  * be able to use emmintrin.h, etc., on Linux, even when compiling without an
> > -march option that says those instructions can be assumed to be available
> 
> Is this really a requirement?

The alternative would be inline assembly, which gets painful when done for 32-bit and 64-bit in addition to the intrinsics (or another two pairs) for Windows.

> I thought that you had to use -march etc. to
> have the compiler generate those instructions, but that you could always
> include emmintrin.h and use the intrinsics.

You can't use emmintrin.h and the intrinsics without -march or the equivalent in pragmas.  I tried.
Comment on attachment 397887 [details] [diff] [review]
patch for review

I'm going to redirect this review because I've only barely understood what it's used for.

Does this typically mean that we'll be compiling two or more versions of various functions using makefiles, e.g.

nsStringOptimizationsSSE2.cpp

or something similar, and this support code lets us choose those versions dynamically?
Attachment #397887 - Flags: review?(benjamin) → review?(vladimir)
Comment on attachment 397887 [details] [diff] [review]
patch for review

We really, really don't want to call cpuid here.  This is already in xpcom glue -- why can't we just place the static symbols inside xpcom?  The getters would just return the static values, and we can call some init function either via a static constructor or as part of xpcom init.

JS may want to share this code, in which case it can pull in its own copy; other than that, everything else should already be touching xpcom at some level (or be entirely independent, like cairo).
Attachment #397887 - Flags: review?(vladimir) → review-
(In reply to comment #20)
> Does this typically mean that we'll be compiling two or more versions of
> various functions using makefiles, e.g.
> 
> nsStringOptimizationsSSE2.cpp
> 
> or something similar, and this support code lets us choose those versions
> dynamically?

No, it's just about using ifdefs within a single file for our own hand-written SSE2 code.  Example use is demonstrated in http://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/49b0628bde4f/utf8-use-sse-h

That said, we probably also could get benefits from cpuid-detected compiler-optimized code, but that's not what I was working on.

(In reply to comment #21)
> JS may want to share this code, in which case it can pull in its own copy;
> other than that, everything else should already be touching xpcom at some level
> (or be entirely independent, like cairo).

I think if we end up making it require symbols inside XPCOM, other callers like JS are far less likely to share the code.

I'm also not convinced it would be faster, but I'll hopefully have time to experiment with it in a few weeks.
I'm not too worried about JS sharing the code tbh; if it wants the same functionality, they can pull in a local copy.  There are very few places that don't touch xpcom, and they're generally "special" in one way or another.  We could put this in the new-nspr-like-thing, too.
Given that these decisions would be made within nanojit, which I believe is *extremely* unlikely to ever depend on anything in Mozilla's codebase (CC addends please chime in if you think otherwise), I wouldn't give much, if any, thought to what JS might want.
Indeed, don't worry about JS here.  We already have a solution to this problem, and our use of it is different anyway (check once and then generate appropriate code that's run many times, versus checking per hot-path run).  pixman likely also has its own mechanism, as will our image decoders, etc.  Make it easy for us to add new ones, but don't get hung up on JS sharing this stuff at the expense of making it cheap.

If/when we get some NSPR evolution, we could put the code there, since JS already depends on NSPR in some configuration.  For now, I recommend getting numbers on how much it costs to CPUID vs loading a symbol from the static xpcom glue -- in libxul builds I would hope that we could avoid inter-DSO operations here, really.
Attached patch patch (obsolete) — Splinter Review
Revised to initialize *_available booleans inside the xpcom glue library as |extern| (following the example of gTLSIsMainThread).
Attachment #397887 - Attachment is obsolete: true
Attachment #412370 - Flags: review?(vladimir)
This looks quite good, but one question... the suggested use:

+ *   #ifdef MOZILLA_COMPILE_WITH_ABC
+ *     if (mozilla::use_abc()) {
+ *       // abc-specific code here
+ *     } else {
+ *   #endif
+ *      // generic code here
+ *   #ifdef MOZILLA_COMPILE_WITH_ABC
+ *     }
+ *   #endif

feels pretty clunky.  What do you think about having use_abc() be guaranteed to exist, and just statically evaluate to false if not available (like your code does), and then have consumers do:

if (mozilla::use_abc()) {
#ifdef MOZILLA_COMPILE_WITH_ABC
  // abc
#endif
} else {
  // generic
}

That makes it much easier to do

if (mozilla::use_abc()) {
#ifdef ABC
  // abc
#endif
} else if (mozilla::use_def()) {
#ifdef DEF
  // def
#endif
} else {
  // generic
}

for when we have SSE2/MMX/generic code paths.  (It's really annoying that we need the ifdefs in there at all, but I don't know how to avoid them...)  If that sounds good to you, I think the only change needed is in the comment, as all the functions are already do the right thing.
It actually requires more than fixing the comment; it requires making this header define functions for the non-x86 non-x86_64 case.  That said, that's not hard to do, and probably worth doing.
And I'd need to make this:
+  // Note that this means there are cases where mozilla::use_abc() will
+  // return true even though we can't define MOZILLA_COMPILE_WITH_ABC.
not true anymore.
I think a good way of doing both would be to actually define two sets of functions:

  mozilla::supports_abc would be what is currently mozilla::use_abc

and then I'd define mozilla::use_abc as:

#ifdef MOZILLA_COMPILE_WITH_ABC
  bool use_abc { return supports_abc(); }
#else
  bool use_abc { return false; }
#endif

This would allow the supports_abc() functions to be used by people who are hand-coding assembly, and leave the use_abc() functions for people who are using the builtins and MOZILLA_COMPILE_WITH_ABC.

That said, I'm not sure I like the names...
I have a modified version of the patch, but I've modified enough that it needs a bit more testing again...
Attached patch patchSplinter Review
Attachment #412370 - Attachment is obsolete: true
Attachment #414859 - Flags: review?(vladimir)
Attachment #412370 - Flags: review?(vladimir)
Comment on attachment 414859 [details] [diff] [review]
patch

Looks good; I like this approach.  Though, why do you need the static false/true variants in the .cpp file if the .h file is going to hardcode the values? (Which I agree that it should.)


>+#if defined(MOZILLA_PRESUME_MMX)
>+  bool mmx_enabled = true;
>+#elif defined(MOZILLA_SSE_HAVE_CPUID_DETECTION)
>+  bool mmx_enabled
>+    = sse_private::has_cpuid_bit(1u, sse_private::edx, (1u<<23));
>+#else
>+  bool mmx_enabled = false;
>+#endif


That is, I think the only #if block you need here is the HAVE_CPUID_DETECTION one (and you could probably combine them all into one chunk for x86 -- if we can't build with, say, SSE2 even if the CPU supports it, the inline functions will return false from the .h file).  For other cases having the extern declarations without the actual variable is fine I think, because nothing should be referencing them.  But the way you have it is also fine, if you'd rather keep it like that.
Attachment #414859 - Flags: review?(vladimir) → review+
After the checkin, I'm not able to build again SeaMonkey (x64), not even after clobbering, on OpenSuSE 11.1 with "gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]".
c++ -o SSE.o -c -I../../dist/system_wrappers -include /home/prometeo/src/mozilla/config/gcc_hidden.h -DMOZ_SUITE=1 -DOSTYPE=\"Linux2.6.27.39-0\" -DOSARCH=Linux -DTARGET_XPCOM_ABI=\"x86_64-gcc3\" -I/home/prometeo/src/mozilla/xpcom/glue/../build  -I/home/prometeo/src/mozilla/xpcom/glue -I. -I../../dist/include -I../../dist/include/nsprpub  -I/home/prometeo/suite-obj/mozilla/dist/include/nspr -I/home/prometeo/suite-obj/mozilla/dist/include/nss       -fPIC  -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Wno-long-long -pedantic -fno-strict-aliasing -fshort-wchar -pthread -pipe  -DNDEBUG -DTRIMMED -Os -freorder-blocks -fno-reorder-functions    -DMOZILLA_CLIENT -include ../../mozilla-config.h -Wp,-MD,.deps/SSE.pp /home/prometeo/src/mozilla/xpcom/glue/SSE.cpp
In file included from /home/prometeo/src/mozilla/xpcom/glue/SSE.cpp:40:
/home/prometeo/src/mozilla/xpcom/glue/SSE.h:243:19: error: cpuid.h: No such file or directory
In file included from /home/prometeo/src/mozilla/xpcom/glue/SSE.cpp:40:
/home/prometeo/src/mozilla/xpcom/glue/SSE.h: In function ‘bool mozilla::sse_private::has_cpuid_bit(unsigned int, mozilla::sse_private::CPUIDRegister, unsigned int)’:
/home/prometeo/src/mozilla/xpcom/glue/SSE.h:256: error: ‘__get_cpuid’ was not declared in this scope
gmake[6]: *** [SSE.o] Error 1
gmake[6]: Leaving directory `/home/prometeo/suite-obj/mozilla/xpcom/glue'
gmake[5]: *** [libs] Error 2
gmake[5]: Leaving directory `/home/prometeo/suite-obj/mozilla/xpcom'
gmake[4]: *** [libs_tier_xpcom] Error 2
gmake[4]: Leaving directory `/home/prometeo/suite-obj/mozilla'
gmake[3]: *** [tier_xpcom] Error 2
gmake[3]: Leaving directory `/home/prometeo/suite-obj/mozilla'
gmake[2]: *** [default] Error 2
gmake[2]: Leaving directory `/home/prometeo/suite-obj/mozilla'
make[1]: *** [default] Error 2
make[1]: Leaving directory `/home/prometeo/suite-obj'
make: *** [build] Error 2
The problem Giacomo is having is because openSUSE guys forgot to package cpuid.h in their ggc 4.3 packages they shipped on openSUSE 11.1, and they only fixed it for the next version of their distro (11.2, which shipped only 3-4 weeks ago), see https://bugzilla.novell.com/show_bug.cgi?id=527433

That means everyone on openSUSE 11.1 breaks at this point :(
I've been meaning to make that code use inline assembly instead; I'm currently testing a patch, written with the help of http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html .  I tested that gcc does use the input/output register constraints to know that ebx (PIC register) is modified by the assembly, and therefore gcc does generate code to save/restore ebx around the assembly block, as it should.  So my little chunk of inline assembly should work for both i386 and x86_64 without any ifdefs between them.
(In reply to comment #38)
> that gcc does use the input/output register constraints to know that ebx (PIC
> register) is modified by the assembly, and therefore gcc does generate code to
> save/restore ebx around the assembly block, as it should.  So my little chunk
> of inline assembly should work for both i386 and x86_64 without any ifdefs
> between them.

Hmm.  Never mind that; it's fine on x86_64, but on i386 with -fPIC, gcc won't allow register constraints for ebx.
The jit was some code to work around the ebx issue.
This does cpuid "by hand".  WARNING: untested.  It un-breaks
the build on openSUSE 11.0 and the resulting Fx will at least
start.  It is slightly conservative in that it passes the %eax
argument to cpuid through memory, which is an unnecessary load/store,
but it means we don't have to get tied up writing and trying to
assess the correctness of complicated gcc register constraints to
do with earlyclobber and wotnot.  It also preserves ebx so as to
keep macos happy (again, untested).
Comment on attachment 417143 [details] [diff] [review]
use SSE.h in the code that led me to start this

(sorry for long review time!)
Attachment #417143 - Flags: review?(vladimir) → review+
Is there a reason that this isn't marked RESOLVED FIXED?  Are we holding out for a solution to these issues with GCC and the intrinsics?
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I filed bug 585738 as a followup to fix the gcc issues.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: