Closed Bug 856850 Opened 9 years ago Closed 9 years ago

Add a PodOperations.h including all the Pod* methods from jsutil.h

Category: Core :: MFBT (enhancement)
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla23
People: (Reporter: Waldo, Assigned: Waldo)
Attachments: (2 files)

I'm seeing way too much memset and memcmp in Gecko patches I'm looking at, and it makes me sad.  A quick overview of the operations, for people who haven't seen them before:

template <class T> void PodZero(T *t):
memset(t, 0, sizeof(T))

template <class T> void PodZero(T *t, size_t nelem):
memset(t, 0, sizeof(T) * nelem)

// Should PodZero(array) zero the first element, or the entire thing?
// Given the ambiguity, force users to use PodArrayZero to clear the
// entire thing, and &array[0] to clear the first element
template <class T, size_t N> static void PodZero(T (&)[N]);          /* undefined */
template <class T, size_t N> static void PodZero(T (&)[N], size_t);  /* undefined */

template <class T, size_t N> void PodArrayZero(T (&t)[N]):
memset(t, 0, sizeof(T) * N)

template <class T> void PodAssign(T *dst, const T *src):
memcpy(dst, src, sizeof(T))

template <class T> void PodCopy(T *dst, const T *src, size_t nelem):
memcpy(dst, src, nelem * sizeof(T))

template <class T> bool PodEqual(const T *one, const T *two, size_t len):
true iff the len * sizeof(T) bytes at |one| are the same as the bytes at |two|

Mostly mechanical changes, nothing difficult modulo typos, and forgetting to hg add the new header in multiple try-pushes.  :-\

https://tbpl.mozilla.org/?tree=Try&rev=e96c1e5741aa
https://tbpl.mozilla.org/?tree=Try&rev=8060ab8797b4
Attachment #732435 - Flags: review?(Ms2ger)
Comment on attachment 732435 [details] [diff] [review]
Move the methods, convert the users

Review of attachment 732435 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

::: js/src/ion/shared/Assembler-shared.h
@@ +148,4 @@
>      Address(Register base, int32_t offset) : base(base), offset(offset)
>      { }
>  
> +    Address() { mozilla::PodZero(this); }

I see there's no consistency at all in newlines for this pattern

::: js/src/jsiter.cpp
@@ +47,5 @@
>  using namespace js::gc;
>  
>  using mozilla::ArrayLength;
> +#ifdef JS_MORE_DETERMINISTIC
> +using mozilla::PodCopy;

Dunno if I'd ifdef this.

::: mfbt/PodOperations.h
@@ +72,5 @@
> +PodAssign(T* dst, const T* src)
> +{
> +  MOZ_ASSERT(dst != src);
> +  MOZ_ASSERT_IF(src < dst, PointerRangeSize(src, static_cast<const T*>(dst)) >= 1);
> +  MOZ_ASSERT_IF(dst < src, PointerRangeSize(static_cast<const T*>(dst), src) >= 1);

>0?

@@ +94,5 @@
> +     * Avoid using operator= in this loop, as it may have been
> +     * intentionally deleted by the POD type.
> +     */
> +    for (const T* srcend = src + nelem; src < srcend; src++, dst++)
> +      PodAssign(dst, src);

We don't do {}s? Boo.
Attachment #732435 - Flags: review?(Ms2ger) → review+
Comment on attachment 732435 [details] [diff] [review]
Move the methods, convert the users

Review of attachment 732435 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/PodOperations.h
@@ +24,5 @@
> +/** Set the contents of |t| to 0. */
> +template<typename T>
> +static void
> +PodZero(T* t)
> +{

If PodZero() for one element took a T& reference instead of a pointer, it might help identify mistakes where callers inadvertently call PodZero(buffer) instead of PodZero(buffer, bufferLength). Same suggestion for PodAssign().

@@ +32,5 @@
> +/** Set the contents of |nelem| elements starting at |t| to 0. */
> +template<typename T>
> +static void
> +PodZero(T* t, size_t nelem)
> +{

Do you want to handle the nelem == 0 case? Or can you assert nelem > 0? Same question for PodCopy() and PodEqual().
Attachment #732435 - Flags: review+ → review?(Ms2ger)
Ms2ger: sorry I accidentally erased your r+!
PodZero(T& t) would just work with T = int[5] and such, so I don't think it'd be much different from the array/length confusion already worked around here.

I'm pretty sure PodZero gets used for array contents of various sizes including 0, so it probably wants to stay as it is in that regard.  PodEqual gets used in similar places, I think for checking JS string characters' equality, so 0 is useful there too.  PodCopy also gets used in 0-taking places.  I'm not necessarily *opposed* to a method that would assert non-zero, but definitely the precedent here is for zero being a reasonable count to handle.
(In reply to :Ms2ger from comment #2)
> >  using mozilla::ArrayLength;
> > +#ifdef JS_MORE_DETERMINISTIC
> > +using mozilla::PodCopy;
> 
> Dunno if I'd ifdef this.

Vague paranoia on my part about compilers warning for unused |using| declarations.  Figured it couldn't hurt.

> >0?

No particular preference here, left as-is out of laziness.
So, PodOperations.h seems to mimic memcpy/memset, but not closely enough.  PodZero never seems to call memset for an array, and PodCopy doesn't use memcpy if n < 128.  Are these limitations expected?  I find it quite odd for this kind of abstraction to break memcpy/memset promises.
The observable effect is the same as with those functions: there's no breaking of memset/memcpy promises.  But the actual way that effect is implemented is based on observation of how those actual calls *do* perform when compiled, if we used them.  These aren't "limitations", they're performance optimizations based on how compilers behave when you write the naive memsets.

Calling directly into memset involves undesirable cross-library call overhead, which makes it not ideal for sufficiently small amounts of memory being set.

The memset function in libc also has to deal with any particular amount written -- 2, 4, 8, 16, but also 7, 33, 17, 42.  Giving the compiler enough information to know that the amount written will be a multiple of some size -- say, the size of a pointer or whatever -- allows it to emit architecture-specific instructions to write zeroes a short/int/long/long long at a time, rather than a byte at a time.

It's true memset can detect these cases, but it's extra overhead to do so -- to set leading unaligned bytes, then check for the largest size to write out each iteration, then handle trailing unaligned bytes.  (Compilers could recognize these cases where T* is being specified, and multiples of sizeof(T) bytes are being set, but it seems they don't do it now.)  Make the non-genericity evident in the code, through T* and memsetting by sizeof(T) each time through the loop, as in these Pod* operations, and the compiler can generate the optimal code without need for special/edge-case handling to satisfy spec requirements.

memcmp has similar issues regarding its being required to accept pointers of any alignment, and compare any number of bytes at each pointer.

To be fair, at least some of these optimization observations were made with older compilers -- I think some, at least, with OS X gcc.  It's possible some could be revisited wrt newer compilers.
https://hg.mozilla.org/mozilla-central/rev/de6afab8b383
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to comment #9)
> To be fair, at least some of these optimization observations were made with
> older compilers -- I think some, at least, with OS X gcc.  It's possible some
> could be revisited wrt newer compilers.

That's sort of my point.  Until we have evidence that these functions will not be slower than their libc versions, I don't think that exposing them in MFBT is a good idea, since I don't want people to have to choose between fast code and more beautiful code.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #9)
> It's true memset can detect these cases, but it's extra overhead to do so --
> to set leading unaligned bytes, then check for the largest size to write out
> each iteration, then handle trailing unaligned bytes.  (Compilers could
> recognize these cases where T* is being specified, and multiples of
> sizeof(T) bytes are being set, but it seems they don't do it now.)
>
> To be fair, at least some of these optimization observations were made with
> older compilers -- I think some, at least, with OS X gcc.  It's possible
> some could be revisited wrt newer compilers.

I was curious about this, so I wrote some tests.  I checked for optimizing away functions for up to 128 elements under different optimization levels.  Individual table cells are the number of times the function was optimized away for a particular datatype; a "perfect" compiler would have |128| in every table cell.

Results for clang 3.2 trunk, x86-64:

Os      char  short  int  long
memset  59    59     128  128
memcpy  27    27     128  128
memcmp  4     4      4    4
O2      char  short  int  long
memset  123   123    128  128
memcpy  59    59     128  128
memcmp  4     4      4    4
O3      char  short  int  long
memset  123   123    128  128
memcpy  59    59     128  128
memcmp  4     4      4    4

Results for ubuntu gcc 4.6.3, x86-64:

Os      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1
O2      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1
O3      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1

Results for debian gcc 4.4.5:

Os      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1
O2      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  128   128    128  128
O3      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  128   128    128  128

Results for android ndk6b gcc 4.4.3, arm:

Os      char  short  int  long
memset  2     4      7    7
memcpy  1     2      64   64
memcmp  1     1      1    1
O2      char  short  int  long
memset  14    28     55   55
memcpy  1     2      64   64
memcmp  1     1      1    1
O3      char  short  int  long
memset  14    28     55   55
memcpy  1     2      64   64
memcmp  1     1      1    1

Results for android ndk6b gcc 4.4.3, x86:

Os      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1
O2      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1
O3      char  short  int  long
memset  128   128    128  128
memcpy  128   128    128  128
memcmp  1     1      1    1

clang's numbers are a bit curious, especially because it "misses" some inlining; it will inline memsets of, say, 22 and 24 elements, but not 23 elements.  gcc 4.6 is very unwilling to optimize memcmp; x86-64 gcc 4.4 appears much more aggressive on that front.  Take the ARM numbers with a grain of salt; I didn't try tweaking the particular CPU and ARM CPUs can vary widely in their tuning in GCC.  So the ARM numbers might not be representative of what, say, our production compilers would actually see.

The benchmark is possibly too simple, what with only dealing with function parameters; things might change with local variables, etc.

Anyway, run the script with |bash check-memfoo.sh <compiler>|; I'd be interested to see what the old OS X gcc does.
The OS X gcc 4.2 isn't interesting, as it's explicitly not supported and can no longer build Firefox.  What's more interesting is MSVC.  :-)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #11)

Methinks you doth protest too much.  The overhead being optimized away here is small, and it may not matter in many to most cases.  It *did* matter to the JS engine, and it motivated wrappers for memset/memcmp/memcpy (cleaner ones, too).  Worries about this being slower are not well-founded, I think, unless you can demonstrate specific cases where it matters.  But really, most bad-perf cases aren't going to be gated on mem* overheads, so even if these weren't fully optimal, it shouldn't be a game-changing difference the vast majority of the time.  And when it is, people should be benchmarking the particular algorithm in question to determine that.
OK, fine, I guess.
Attachment #732435 - Flags: review?(Ms2ger) → review+
Depends on: 935370