Closed Bug 871596 Opened 6 years ago Closed 6 years ago

investigate serialization/deserialization IPC overhead

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: vlad, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 4 obsolete files)

Our IPC code is directly in the critical path, especially on B2G (e.g. time from finger down to the action being reflected on the screen has IPC right in the middle of it).  For some unrelated things, I was looking at the serialization/deserialization code.

The higher-level code (e.g. structs etc.) calls down to lower level primitives, which eventually get to code in Pickle (e.g. ReadInt32) and friends.  This code isn't in a header file, so likely doesn't get inlined, and each one looks something like:

bool Pickle::ReadInt32(void** iter, int32_t* result) const {
  DCHECK(iter);
  if (!*iter)
    *iter = const_cast<char*>(payload());

  if (!IteratorHasRoomFor(*iter, sizeof(*result)))
    return false;

  memcpy(result, *iter, sizeof(*result));

  UpdateIter(iter, sizeof(*result));
  return true;
}



IteratorHasRoomFor() does checks on the current iter pointer, the message header, and the remaining size to make sure that we can read sizeof(*result) from it.

This all seems like a large amount of overhead to me.  There's no reason why we shouldn't be able to generate code to directly [de]serialize all the primitive data types straight in the higher-level generated message/struct code for this, including doing one size check early on, instead of calling multiple layers to eventually get to a ReadInt32 above.
Word of warning, this code hasn't ever showed up on profiles, and my gut intuition is it's not hot enough to bother with.  There are other parts of IPC that are more expensive than this, like the work covered by bug 787363.  Would definitely want to start here with a testcase.
Yeah, I haven't seen it in profiles either, which is a good argument for not poking into it.. but at the same time, by inspection, it's not how you'd want to do it for performance.  Testcases would definitely be good (I may poke, and get a standalone ipc setup going), as would just force-inlining all the Pickle functions and looking at the generated code that the compiler spits out for the higher level message sending functions.
It might also be useful to make sure that we have some ipc stuff in our pgo automation script. Maybe then we could just let the compiler take care of this.
PGO won't help on b2g, but it's true that our current PGO profile gathering script doesn't do a whole lot of useful things (including examining the IPC code.)
The Pickle::Read* functions have a couple of issues.  Take a look at Pickle::ReadInt64:

bool Pickle::ReadInt64(void** iter, int64_t* result) const {
  DCHECK(iter);
  if (!*iter)
    *iter = const_cast<char*>(payload());

  if (!IteratorHasRoomFor(*iter, sizeof(*result)))
    return false;

  memcpy(result, *iter, sizeof(*result));

  UpdateIter(iter, sizeof(*result));
  return true;
}

which compiles to the following assembly on m-c:

  ca7744:       b570            push    {r4, r5, r6, lr}
  ca7746:       680b            ldr     r3, [r1, #0]
  ca7748:       460c            mov     r4, r1
  ca774a:       4616            mov     r6, r2
  ca774c:       b91b            cbnz    r3, ca7756 <NS_InvokeByIndex+0x23c5a>   (0)
  ca774e:       6802            ldr     r2, [r0, #0]
  ca7750:       6843            ldr     r3, [r0, #4]
  ca7752:       18d3            adds    r3, r2, r3
  ca7754:       600b            str     r3, [r1, #0]
  ca7756:       6825            ldr     r5, [r4, #0]
  ca7758:       2208            movs    r2, #8
  ca775a:       4629            mov     r1, r5
  ca775c:       f7ff fe84       bl      ca7468 <NS_InvokeByIndex+0x2396c>       (1)
  ca7760:       b160            cbz     r0, ca777c <NS_InvokeByIndex+0x23c80>
  ca7762:       4629            mov     r1, r5
  ca7764:       2208            movs    r2, #8
  ca7766:       4630            mov     r0, r6
  ca7768:       f632 cf76       blx     2da658 <JSD_JSContextInUse-0x51c0>      (2)
  ca776c:       2008            movs    r0, #8
  ca776e:       2104            movs    r1, #4
  ca7770:       f7ff fe90       bl      ca7494 <NS_InvokeByIndex+0x23998>       (3)
  ca7774:       6823            ldr     r3, [r4, #0]
  ca7776:       1818            adds    r0, r3, r0
  ca7778:       6020            str     r0, [r4, #0]
  ca777a:       2001            movs    r0, #1
  ca777c:       bd70            pop     {r4, r5, r6, pc}

(0) is a check for NULL *iter.  See below.
(1) is a call to IteratorHasRoomFor.  This is a complicated function, so can be
    out-of-line.  I don't see a good way to improve on this or (0) short of
    rejiggering all of the IPC code so that we're only checking bounds once, rather
    than on every read/write.
(2) is a call to memcpy that goes through the PLT.  This could be better optimized
    with newer versions of GCC, I think (since they are more aware of ARM's
    unaligned memory access instructions), but given that we're aligning all of
    Pickle's writes anyway, we ought to be able to do a better job here.  Still
    working out the patch for this.
(3) is a call to UpdateIter.  We shouldn't be calling this out-of-line, except that
    the sole function called in UpdateIter, AlignInt, is written in a too-general
    fashion:

  // Aligns 'i' by rounding it up to the next multiple of 'alignment'
  static uint32_t AlignInt(uint32_t i, int alignment) {
    return i + (alignment - (i % alignment)) % alignment;
  }

    which results in the following assembly:

  ca7494:       b570            push    {r4, r5, r6, lr}
  ca7496:       460c            mov     r4, r1
  ca7498:       4605            mov     r5, r0
  ca749a:       f633 ca9a       blx     2da9d0 <JSD_JSContextInUse-0x4e48>
  ca749e:       1a60            subs    r0, r4, r1
  ca74a0:       4621            mov     r1, r4
  ca74a2:       f633 ca96       blx     2da9d0 <JSD_JSContextInUse-0x4e48>
  ca74a6:       1948            adds    r0, r1, r5
  ca74a8:       bd70            pop     {r4, r5, r6, pc}

     Those blx instructions are calls--through the PLT (!)--to __aeabi_idivmod, I
     believe.  Which is stupid, because |alignment| here is only ever a power of 2,
     and we ought to have inlined the whole ball of wax.  (AlignInt as written works
     for any number, but I believe the inliner is reluctant to inline because the
     divisions are out-of-line function calls and/or because of compiling with -Os.) 

The posted patch reworks the alignment bits so that we can verify at compile-time
that we're aligning to a power of 2.  And then we can do an efficient alignment
which the compiler appears to be more amenable to inlining.

I'll double-check with a try push, but this seems to DTRT on opt+debug builds on
x86-64 Linux with GCC 4.4.  Like Chris said, doesn't show up on profiles, but a
couple fewer instructions plus less jumping out of libxul can't be a bad thing,
right?

Writes are another story.  Writes all go through a general mechanism
based on memcpy, so optimizing things down to single-instruction writes
or thereabouts will take a little work.
Attachment #750576 - Flags: review?(bent.mozilla)
(In reply to Nathan Froyd (:froydnj) from comment #5)
> The posted patch reworks the alignment bits so that we can verify at
> compile-time
> that we're aligning to a power of 2.  And then we can do an efficient
> alignment
> which the compiler appears to be more amenable to inlining.
> 
> I'll double-check with a try push, but this seems to DTRT on opt+debug
> builds on
> x86-64 Linux with GCC 4.4.  Like Chris said, doesn't show up on profiles,
> but a
> couple fewer instructions plus less jumping out of libxul can't be a bad
> thing,
> right?

Assembly from ParseInt64 on a try push:

  caa628:	b570      	push	{r4, r5, r6, lr}
  caa62a:	680b      	ldr	r3, [r1, #0]
  caa62c:	460c      	mov	r4, r1
  caa62e:	4616      	mov	r6, r2
  caa630:	b91b      	cbnz	r3, caa63a <NS_InvokeByIndex+0x23c96>
  caa632:	6802      	ldr	r2, [r0, #0]
  caa634:	6843      	ldr	r3, [r0, #4]
  caa636:	18d3      	adds	r3, r2, r3
  caa638:	600b      	str	r3, [r1, #0]
  caa63a:	6825      	ldr	r5, [r4, #0]
  caa63c:	2208      	movs	r2, #8
  caa63e:	4629      	mov	r1, r5
  caa640:	f7ff fe9e 	bl	caa380 <NS_InvokeByIndex+0x239dc>
  caa644:	b140      	cbz	r0, caa658 <NS_InvokeByIndex+0x23cb4>
  caa646:	4630      	mov	r0, r6
  caa648:	4629      	mov	r1, r5
  caa64a:	2208      	movs	r2, #8
  caa64c:	f630 cef8 	blx	2db440 <JSD_JSContextInUse-0x5218>
  caa650:	6823      	ldr	r3, [r4, #0]                                    (*)
  caa652:	2001      	movs	r0, #1
  caa654:	3308      	adds	r3, #8
  caa656:	6023      	str	r3, [r4, #0]
  caa658:	bd70      	pop	{r4, r5, r6, pc}

The code at (*) and following is the new hotness of UpdateIter with better alignment code.  So the optimization does work.
Comment on attachment 750576 [details] [diff] [review]
rewrite Pickle's alignment mechanism to be more obviously optimizable

Review of attachment 750576 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/chromium/src/base/pickle.h
@@ +247,5 @@
>    bool Resize(uint32_t new_capacity);
>  
> +  // Round 'bytes' up to the next multiple of 'alignment'.  'alignment' must be
> +  // a power of 2.
> +  template<uint32_t alignment> struct ConstantAligner {

So does the optimizer just do a better job with the addition of this struct, the static assert, or the reworked rounding operation? Are all three changes needed?

@@ +257,5 @@
> +  };
> +  // Round 'bytes' up to the next multiple of 'sizeof(T)'.  'sizeof(T)' must be
> +  // a power of 2.
> +  template<typename T> struct Aligner {
> +    static uint32_t align(int bytes) { return ConstantAligner<sizeof(T)>::align(bytes); }

Instead of this (which forces you to write |Aligner<uint32>::align(foo)| all over) how about:

  static uint32 AlignInt(int bytes) { return ConstantAligner<sizeof(uint32_t)>::align(bytes); }

There's no real reason for the extra struct I don't think.
(In reply to ben turner [:bent] from comment #7)
> > +  // Round 'bytes' up to the next multiple of 'alignment'.  'alignment' must be
> > +  // a power of 2.
> > +  template<uint32_t alignment> struct ConstantAligner {
> 
> So does the optimizer just do a better job with the addition of this struct,
> the static assert, or the reworked rounding operation? Are all three changes
> needed?

I believe the optimizer would do a better job with just the reworked rounding optimization.  The struct and static assert are so we don't accidentally try to align to non-powers-of-two.  Maybe that's overkill for this code, but it still seemed like a useful thing to do.

> @@ +257,5 @@
> > +  };
> > +  // Round 'bytes' up to the next multiple of 'sizeof(T)'.  'sizeof(T)' must be
> > +  // a power of 2.
> > +  template<typename T> struct Aligner {
> > +    static uint32_t align(int bytes) { return ConstantAligner<sizeof(T)>::align(bytes); }
> 
> Instead of this (which forces you to write |Aligner<uint32>::align(foo)| all
> over) how about:
> 
>   static uint32 AlignInt(int bytes) { return
> ConstantAligner<sizeof(uint32_t)>::align(bytes); }
> 
> There's no real reason for the extra struct I don't think.

Good point.  Will fix.
Comment on attachment 750576 [details] [diff] [review]
rewrite Pickle's alignment mechanism to be more obviously optimizable

Ok, I'll stamp that patch
Attachment #750576 - Flags: review?(bent.mozilla)
Awesome, thanks for looking into this!  Is there value in moving the Read* functions into the header?  Give the compiler more opportunity to inline/optimize across functions, especially without PGO.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #10)
> Awesome, thanks for looking into this!  Is there value in moving the Read*
> functions into the header?  Give the compiler more opportunity to
> inline/optimize across functions, especially without PGO.

I don't think it's worth trying to think about moving these functions into the header until the call to memcpy() goes away (not too hard, need to post patches when I'm on a different machine).  There also needs to be a better way of reading datatypes from the packet without checking for NULL *iter and IteratorHasRoomFor on every read.  Some sort of RawRead/RawWrite interface that doesn't require the NULL/IteratorHasRoomFor checks, for instance.

I think doing clever things with Tuple is the way to safely call RawRead/RawWrite under the hood (Read/Write of Tuple can just figure out how long the entire Tuple is going to be once, and then dispatch to their Raw* counterparts), but I haven't figured out how to do that in a transparent way (or if it can be done, or if it's worth trying to protect clients of Read/Write from themselves by not exposing RawRead/RawWrite).
Now with hiding everything behind AlignInt.
Attachment #756020 - Flags: review?(bent.mozilla)
Comment on attachment 756020 [details] [diff] [review]
part 1 - rewrite Pickle's alignment mechanism to be more obviously optimizable

Review of attachment 756020 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!
Attachment #756020 - Flags: review?(bent.mozilla) → review+
This is just a cleanup, but since we're adding a few things that are going to require
specific alignments, it seemed good to give the type we're aligning against a name.
Attachment #756880 - Flags: review?(bent.mozilla)
This patch is mostly for ARM (any machine without unaligned load support, really);
GCC won't optimize:

  memcpy(aligned_ptr, void_star_ptr, sizeof(*aligned_ptr));

to inline loads and stores--at least not with the GCC version B2G (and maybe even
Fennec) is currently compiling with, I think.  But we know something about the
alignment of void_star_ptr, so we can help the compiler realize what's going on.

Ideally the comment in front of Copier covers all this.  Please let me know if you
don't think it's clear enough.
Attachment #756883 - Flags: review?(bent.mozilla)
This patch is a minor optimization: ReadBytes, BeginWrite, and WriteBytes all take
an alignment parameter that defaults to sizeof(memberAlignmentType).  It's mostly
used in modulus operations and since the functions are too big to inline, we wind
up doing divisions because the alignment is technically unknown at runtime.  Since
we never use the alignment parameter, just delete it, say that it's always going
to be equal to sizeof(memberAlignmentType), and let the compiler turn those modulus
operations into something smaller and faster (particularly on ARM).
Attachment #756884 - Flags: review?(bent.mozilla)
Attachment #750576 - Attachment is obsolete: true
Comment on attachment 756884 [details] [diff] [review]
part 4 - un-generalize alignment required for Read*/Write* in Pickle

Bleh, I hadn't looked far enough; we rely on 64-bit alignment for SerializedStructuredCloneBuffer.
Attachment #756884 - Attachment is obsolete: true
Attachment #756884 - Flags: review?(bent.mozilla)
Comment on attachment 756880 [details] [diff] [review]
part 2 - add a new memberAlignmentType to replace the scattered uint32_t alignments

Looks good.
Attachment #756880 - Flags: review?(bent.mozilla) → review+
Comment on attachment 756883 [details] [diff] [review]
part 3 - replace memcpys of POD datatypes in Pickle with something smarter

Review of attachment 756883 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/chromium/src/base/pickle.cc
@@ +34,5 @@
> +
> +template<typename T, bool hasSufficientAlignment>
> +struct Copier
> +{
> +  static void copy(T* dest, void** iter) {

Nit: Functions names should begin with capital letters. There are two 'copy' functions here, and then 'copyFromIter' too.

@@ +39,5 @@
> +    memcpy(dest, *iter, sizeof(T));
> +  }
> +};
> +template<typename T>
> +struct Copier<T, true>

I wonder if we shouldn't #ifdef this to the platforms where we know we need it. Opting in to an optimization like this seems safer, doesn't it?

@@ +45,5 @@
> +  static void copy(T* dest, void** iter) {
> +    // The reinterpret_cast is only safe if two conditions hold:
> +    // (1) If the alignment of T* is the same as void*;
> +    // (2) The alignment of the data in *iter is at least as
> +    //     bit as MOZ_ALIGNOF(T).

s/bit/big/

@@ +55,5 @@
> +  }
> +};
> +
> +template<typename T>
> +void copyFromIter(T* dest, void** iter) {

Is it worth specializing this for 64-bit ints?

@@ +76,5 @@
>      : header_(NULL),
>        header_size_(AlignInt(header_size)),
>        capacity_(0),
>        variable_buffer_offset_(0) {
> +  DCHECK(static_cast<memberAlignmentType>(header_size) >= sizeof(Header));

This belongs in the part 2 patch fwiw.
Will fix the capitalization.  I'm not sure why I thought local style in this file was lowercaseCamelCase, but it's clearly not.

(In reply to ben turner [:bent] from comment #19)
> @@ +39,5 @@
> > +    memcpy(dest, *iter, sizeof(T));
> > +  }
> > +};
> > +template<typename T>
> > +struct Copier<T, true>
> 
> I wonder if we shouldn't #ifdef this to the platforms where we know we need
> it. Opting in to an optimization like this seems safer, doesn't it?

If this optimization is valid on $PLATFORM, then it's valid everywhere else, too.  Seems weird to just enable it for a particular platform, especially since none of the other code depends on details of the platform.

> @@ +55,5 @@
> > +  }
> > +};
> > +
> > +template<typename T>
> > +void copyFromIter(T* dest, void** iter) {
> 
> Is it worth specializing this for 64-bit ints?

What benefit would you get out of that?

> @@ +76,5 @@
> >      : header_(NULL),
> >        header_size_(AlignInt(header_size)),
> >        capacity_(0),
> >        variable_buffer_offset_(0) {
> > +  DCHECK(static_cast<memberAlignmentType>(header_size) >= sizeof(Header));
> 
> This belongs in the part 2 patch fwiw.

Whoops, sorry about that.  Will move.
(In reply to Nathan Froyd (:froydnj) from comment #20)
> If this optimization is valid on $PLATFORM, then it's valid everywhere else,
> too.  Seems weird to just enable it for a particular platform, especially
> since none of the other code depends on details of the platform.

We're basically telling all compilers to not use memcpy but instead use normal load/store for types that fit. Is it not possible that some compilers have their own specialized memcpy that could do something different here? I can't really come up with any idea about what they would do, but it just seems wrong to say we always know best. That's why I said opting in seems safer. What do you think?

> > Is it worth specializing this for 64-bit ints?
> 
> What benefit would you get out of that?

We have several calls below that copy 64-bit integers, and that will always use memcpy with your current patch. Is that faster or slower than doing two 32-bit stores?
(In reply to ben turner [:bent] from comment #21)
> (In reply to Nathan Froyd (:froydnj) from comment #20)
> > If this optimization is valid on $PLATFORM, then it's valid everywhere else,
> > too.  Seems weird to just enable it for a particular platform, especially
> > since none of the other code depends on details of the platform.
> 
> We're basically telling all compilers to not use memcpy but instead use
> normal load/store for types that fit. Is it not possible that some compilers
> have their own specialized memcpy that could do something different here? I
> can't really come up with any idea about what they would do, but it just
> seems wrong to say we always know best. That's why I said opting in seems
> safer. What do you think?

I know the ARM AEABI has specialized memcpy functions (with different names, obviously), but those are for copying when the pointer alignments are already known, so they don't apply here.

I have a hard time imagining how any out-of-line memcpy (especially one that doesn't live in libxul) is going to win over inline loads and stores (two, maybe four, instructions, which is about as many instructions as it's going to take just to get to the beginning of memcpy--and some of those instructions are going to be jumps).  I think selectively enabling this is being overly cautious.

> > > Is it worth specializing this for 64-bit ints?
> > 
> > What benefit would you get out of that?
> 
> We have several calls below that copy 64-bit integers, and that will always
> use memcpy with your current patch. Is that faster or slower than doing two
> 32-bit stores?

OK, bizarre: x86 only requires 4-byte alignment for 64-bit ints, but ARM and other 32-bit platforms appear to require 8-byte alignment (why?!).  I assumed 64-bit ints would only require 4-byte alignment on all 32-bit platforms.  My mistake.

I imagine two inline stores would be faster than calling out to memcpy--but the single inline store generated on a 64-bit platform would be faster than two 32-bit stores.  Can probably take care of that with some template magic.
(In reply to Nathan Froyd (:froydnj) from comment #22)
> I have a hard time imagining how any out-of-line memcpy

I didn't mean out-of-line, I would expect inlines and maybe some compiler intrinsics for some of this stuff. Can we at least look at the disassembly on Windows and Mac with and without this to make sure we're not shooting ourselves in the foot here? As long as the output isn't much different I am fine here.

> I assumed 64-bit ints would only require 4-byte alignment on all 32-bit
> platforms.  My mistake.

Ha! I assumed the opposite! :)

Looks like we're both sorta wrong though:

http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
(In reply to ben turner [:bent] from comment #23)
> http://en.wikipedia.org/wiki/
> Data_structure_alignment#Typical_alignment_of_C_structs_on_x86

Assuming that is accurate...
(In reply to ben turner [:bent] from comment #23)
> (In reply to Nathan Froyd (:froydnj) from comment #22)
> > I have a hard time imagining how any out-of-line memcpy
> 
> I didn't mean out-of-line, I would expect inlines and maybe some compiler
> intrinsics for some of this stuff. Can we at least look at the disassembly
> on Windows and Mac with and without this to make sure we're not shooting
> ourselves in the foot here? As long as the output isn't much different I am
> fine here.

For a 32-bit int, the best that a compiler can do with the original memcpy is to turn it into a load and a store.  The new optimization should do just that, too.  Similarly for other sizes (assuming they don't have to be aligned any more than a uint32_t).  I will confirm Mac and Windows disassembly, though.

> > I assumed 64-bit ints would only require 4-byte alignment on all 32-bit
> > platforms.  My mistake.
> 
> Ha! I assumed the opposite! :)
> 
> Looks like we're both sorta wrong though:
> 
> http://en.wikipedia.org/wiki/
> Data_structure_alignment#Typical_alignment_of_C_structs_on_x86

Well, I know long longs are 4-byte aligned on Linux and Mac:

struct s {
  char c;
  long long ll;
};                                                                                         

int foo()
{
  return __builtin_offsetof(struct s, ll);
}

-O2 -m32 -S -o - output:

	pushl	%ebp
	movl	%esp, %ebp
	movl	$4, %eax
	popl	%ebp
	ret

but I could believe Windows is different (again, why?!).
(In reply to ben turner [:bent] from comment #23)
> (In reply to Nathan Froyd (:froydnj) from comment #22)
> > I have a hard time imagining how any out-of-line memcpy
> 
> I didn't mean out-of-line, I would expect inlines and maybe some compiler
> intrinsics for some of this stuff. Can we at least look at the disassembly
> on Windows and Mac with and without this to make sure we're not shooting
> ourselves in the foot here? As long as the output isn't much different I am
> fine here.

On Mac (x86/x86-64), clang already turns the memcpys into inline assembly.  So these patches don't make things any worse or any better.

On Windows, MSVC does not inline the memcpys here.  So these patches improve codesize and performance.
OK, here's something that seems to generate sane assembly on clang and
gcc for x86{,-64} and ARM.
Attachment #756883 - Attachment is obsolete: true
Attachment #756883 - Flags: review?(bent.mozilla)
Attachment #762652 - Flags: review?(bent.mozilla)
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Created attachment 762652 [details] [diff] [review]
> part 3 - replace memcpys of POD datatypes in Pickle with something smarter
> 
> OK, here's something that seems to generate sane assembly on clang and
> gcc for x86{,-64} and ARM.

The HAVE_64BIT_OS check is not perfect (I think), but it's good enough for the platforms we care about.
Here's something that might actually compile on non-x86oid and avoids any language
lawyering around how unions work.
Attachment #762652 - Attachment is obsolete: true
Attachment #762652 - Flags: review?(bent.mozilla)
Attachment #763908 - Flags: review?(bent.mozilla)
Comment on attachment 763908 [details] [diff] [review]
part 3 - replace memcpys of POD datatypes in Pickle with something smarter

Review of attachment 763908 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with these addressed:

::: ipc/chromium/src/base/pickle.cc
@@ +48,5 @@
> +// if 64-bit types aren't sufficiently aligned; the alignment
> +// requirements for them vary between 32-bit platforms.
> +#ifndef HAVE_64BIT_OS
> +template<typename T>
> +struct Copier<T, 8, false>

Nit: s/8/sizeof(int64_t)/ is more readable

@@ +52,5 @@
> +struct Copier<T, 8, false>
> +{
> +  static void Copy(T* dest, void** iter) {
> +    MOZ_STATIC_ASSERT(MOZ_ALIGNOF(uint32_t) <= MOZ_ALIGNOF(Pickle::memberAlignmentType),
> +		      "Insufficient alignment");

Can you move this somewhere else (like, top of the file or something)? It should always be true given our current memberAlignmentType definition, not just if we're copying a 64 bit integer.

@@ +54,5 @@
> +  static void Copy(T* dest, void** iter) {
> +    MOZ_STATIC_ASSERT(MOZ_ALIGNOF(uint32_t) <= MOZ_ALIGNOF(Pickle::memberAlignmentType),
> +		      "Insufficient alignment");
> +    MOZ_STATIC_ASSERT(MOZ_ALIGNOF(T) >= MOZ_ALIGNOF(uint32_t),
> +		      "Insufficient alignment");

This looks insufficient to me... If you match this template then |MOZ_ALIGNOF(T) <= sizeof(Pickle::memberAlignmentType)| must be false, and since memberAlignmentType is just a uint32_t the only useful thing you could enforce here is |MOZ_ALIGNOF(T) > sizeof(Pickle::memberAlignmentType)|, right?

@@ +56,5 @@
> +		      "Insufficient alignment");
> +    MOZ_STATIC_ASSERT(MOZ_ALIGNOF(T) >= MOZ_ALIGNOF(uint32_t),
> +		      "Insufficient alignment");
> +#if MOZ_LITTLE_ENDIAN
> +    const int loIndex = 0, hiIndex = 1;

Nit: static here and below

@@ +61,5 @@
> +#else
> +    const int loIndex = 1, hiIndex = 0;
> +#endif
> +    // We're not using memcpy here because some compilers don't inline
> +    // properly.

Nit: This comment is redundant.

@@ +62,5 @@
> +    const int loIndex = 1, hiIndex = 0;
> +#endif
> +    // We're not using memcpy here because some compilers don't inline
> +    // properly.
> +    uint32_t* src = *reinterpret_cast<uint32_t**>(iter);

Do you think it's worth adding:

  MOZ_STATIC_ASSERT(MOZ_ALIGNOF(uint32_t*) == MOZ_ALIGNOF(void*))

here?

@@ +88,5 @@
> +};
> +
> +template<typename T>
> +void CopyFromIter(T* dest, void** iter) {
> +  MOZ_STATIC_ASSERT(mozilla::IsPod<T>::value, "copied type must be a pod type");

Nit: s/copied/Copied/ and s/pod/POD/
Attachment #763908 - Flags: review?(bent.mozilla) → review+
(In reply to Nathan Froyd (:froydnj) from comment #31)
> Marking as [leave open] for future enhancements.

Eh, let's please just file a followup if you have additional things to add? Once things land it gets complicated to remember which patches are in, etc.
Blocks: 884978
(In reply to ben turner [:bent] from comment #32)
> (In reply to Nathan Froyd (:froydnj) from comment #31)
> > Marking as [leave open] for future enhancements.
> 
> Eh, let's please just file a followup if you have additional things to add?
> Once things land it gets complicated to remember which patches are in, etc.

OK, new bug created then for trying to remove redundant checking.
Whiteboard: [leave open]
You need to log in before you can comment on or make changes to this bug.