Closed Bug 559508 Opened 14 years ago Closed 14 years ago

Bad scalability on 4-socket Nehalem caused by unused counters

Categories

(NSS :: Libraries, enhancement, P2)

3.12.5
x86_64
All
enhancement

Tracking

(Not tracked)

RESOLVED FIXED
3.12.10

People

(Reporter: sergey.l.ivashin, Assigned: sergey.l.ivashin)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: FIPS)

Attachments

(3 files, 1 obsolete file)

User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MS-RTC LM 8; InfoPath.2)
Build Identifier: 3.12.5

Very bad scalability is observed on 4-socket Nehalem-EX running SPECjvm2008.crypto.rsa with NSS (3.12.5) library.
Run using all 4 sockets shows score lower than run using only 3 sockets; 3-socket run shows score lower than 2-socket run. 

This performance degradation is partly caused by ++ operation on global counters (mp_copies, mp_allocs, mp_frees) in mozilla/security/nss/lib/freebl/mpi/mpi.c. These counters are absolutely unused in NSS and used only in some tests. Moreover, special wrappers around standard routines calloc/free are created which do nothing except reference counting.

According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per instruction) value 40-45 and contribute >7% of freebl module time. Similar behavior shows mp_copy routine (CPI ~25 and ~5% time) and it caused by ++mp_copies operation.

Attached patch removes unnecessary wrappers and unused counters from standard NSS build:
(1) MP_MACRO=1 definition eliminates wrappers and causes direct call to standard routines, so counters mp_allocs and mp_frees become unused;
(2) Increment operation on counter mp_copies moved under #ifdef MP_MACRO=0, so the counter become unused in standard build.

This simple patch delivers almost 60% performance boost for SPECjvm2008.crypto.rsa.

After the patch it would be sufficient to build NSS with MP_MACRO=0 to get the counters working (and get performance penalty on multi-sockets machines).


Reproducible: Always
Keywords: perf
Version: unspecified → 3.12.5
Companion bug#559510 also addresses scalability problem, but it is independent.
Comment on attachment 439185 [details] [diff] [review]
Simple patch to improve scalability of NSS

I will review this sometime.  I could review it sooner if it was a real cvs diff -pu
Attachment #439185 - Flags: review?(nelson)
Attachment #439185 - Attachment is obsolete: true
Attachment #439461 - Flags: review?(nelson)
Attachment #439185 - Flags: review?(nelson)
Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

Fully reviewing this patch will necessitate reviewing all the code that is affected by the MP_MACRO feature test macro.  
I'll try to get to that this weekend.  Even if the code gets r+, the change must wait for the 3.13 release, because the affected code is in the FIPS module.
> According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per
> instruction) value 40-45 and contribute >7% of freebl module time.

40 to 45 clocks PER INSTRUCTION?  Per single CPU instruction?   really?

While these particular statistics may well be expendable, NSS has many others
that aren't.  Does Nehalem offer any solution to reduce the CPU wait time for incrementing these counters?
Target Milestone: --- → 3.13
Sergey,
First, I'm delighted that you're interested in improving the performance 
of NSS on Nehalem CPUs.  I appreciate that you reported this problem.
I believe your patch does resolve the issue, but it entangles the global
counter issue with another issue, which is the calls-to-functions (such 
as memset) vs inline code (such as for loops) to try to accomplish the 
same job.  I think these issues should not be directly coupled.  

So, I propose to solve the counter problem is a way that does not force
a choice of the MP_MACRO issue.  Rather than tying the "fix" to one or
the other setting of MP_MACRO, I propose to wrap ALL the lines that 
increment those counters (wether in a macro or in a subroutine) in a 
macro, e.g. 
- ++mp_copies;
+ MP_COUNT(mp_copies);

and the conditionally define that macro as either
#define MP_COUNT(x) (x)++ 
or
#define MP_COUNT(x)

and have the default be the latter of those two definitions. 

Now, I could write that patch, but then I'd have to find a reviewer for it,
and reviewers are VERY difficult to find these days (I cannot review my own
patch).  But if YOU write that patch, then I can review it and it's a slam
dunk, so to speak.

While we're here.  Let's look at the rest of the issues related to performance
of MP_MACRO.  The original concern (back in year 2000) was that on some platforms, with some compilers, calls to memset would be less efficient than
direct loops to do zeroing of arrays inline.  That's still a concern.  But
I think the code fails to address it.  If should offer a choice between inline
for-loops and inline calls to memset, or memcpy (etc), but instead it offers a
choice between inline for-loops and inline calls to s_mp_* calls. :(.  

Maybe we need 3 choices: among
- inline for-loops
- calls to our own functions that do for-loops
- inline calls to memset, memcpy, etc.

The choice among those should be made via ifdefs and chosen in Makefiles.

In any case, the counters (or other statistical gathering stuff) should ONLY be build in builds specifically intented to instrument such stuff and not in general builds.  That's an error pointed out by this bug, to be corrected.

So, let me ask you to submit another patch that introduces the MP_COUNT
macro and ifdef's its definition according to something.  Merci'.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee: nobody → sergey.l.ivashin
Priority: -- → P2
Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

See previous review comments.  Also
:%s/Merci'/Спасибо/g
Attachment #439461 - Flags: review?(nelson) → review-
Nelson,
I am working on the patch you are waiting. Meanwhile, let us clear some questions I don't quite understand.
It seems that there are features in the sources to make the choices you have mentioned:

- inline for-loops
	Set MP_MACR0=1 and MP_MEMSET=0 to inline for-loop for zeroing (s_mp_setz macro).
	Set MP_MACR0=1 and MP_MEMCPY=0 to inline for-loop for copying (s_mp_copy macro).

- calls to our own functions that do for-loops
	Set MP_MACR0=0 to call own s_mp_setz and s_mp_copy routines
	(also s_mp_alloc and s_mp_free routines will used).

- inline calls to memset, memcpy, etc.
	Set MP_MACR0=1 and MP_MEMSET=1 to call standard memset routine.
	Set MP_MACR0=1 and MP_MEMCPY=1 to call standard memcpy routine.
	Set MP_MACR0=1 to call standard calloc and free routines.

All these flags can be set/cleared in makefiles. 
So do we really need additional features in the sources?

Thank very much you for help and forgive my English.
(In reply to comment #6)
> > According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per
> > instruction) value 40-45 and contribute >7% of freebl module time.
> 40 to 45 clocks PER INSTRUCTION?  Per single CPU instruction?   really?

Yes, so it is; It is quite normal for NUMA architecture. 
Just imagine how all 64 working threads (32 cores with hyperthreading enabled) are incrementing single memory cell simultaneously, invalidating all existing caches.

> While these particular statistics may well be expendable, NSS has many others
> that aren't.  

From SPECjvm2008 point of view on NSS the problems reported in this bug and in #559510 are the most serious. I don't see others (serious) problems so far.

> Does Nehalem offer any solution to reduce the CPU wait time for incrementing these counters?

I am afraid these cases of "true sharing" must be eliminated manually.
Sergey, You're right.  All we need is a way to conditionally compile the 
counting that's independent of those other variables.  

Your English is undoubtedly better than my По-русски язык. :)
Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

It appears to me that the ONLY reason for the s_mp_copy and s_mp_setz functions to exist is to contain those count instructions.  Of course, they don't need to be functions for that purpose.  They could be 
do { .. } while (0)
constructs for that, and IMO that would be an improvement.  
In any event, I agree that Sergey's patch is the quickest path to the goal, and will commit it shortly.
Attachment #439461 - Flags: review- → review+
Bug 559508: Bad scalability on 4-socket Nehalem caused by unused counters
Patch contributed by Sergey L Ivashin <sergey.l.ivashin@intel.com>, r=nelson

Checking in mpi/mpi.c; new revision: 1.45.56.3; previous revision: 1.45.56.2
Checking in mpi/mpi-config.h; new revision: 1.5.184.1; previous revision: 1.5

These commits are on the 3_13 branch and will be in the NSS 3.13 release
(if not sooner)
Blocks: FIPS2010
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: FIPS
Target Milestone: 3.13 → 3.12.10
Move the ++mp_copies statement after variable declarations.
If this code is compiled, it'll break under compilers that
don't support C99.

Patch checked in on the NSS trunk (NSS 3.13).

Checking in mpi.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi.c,v  <--  mpi.c
new revision: 1.50; previous revision: 1.49
done
Attachment #524511 - Flags: review?(rrelyea)
Combines the previous two patches.

Patch checked in on the NSS_3_12_BRANCH (NSS 3.12.10).

Checking in mpi-config.h;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi-config.h,v  <--  mpi-config.h
new revision: 1.5.198.1; previous revision: 1.5
done
Checking in mpi.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi.c,v  <--  mpi.c
new revision: 1.47.2.1; previous revision: 1.47
done
Comment on attachment 524511 [details] [diff] [review]
Supplemental patch

r+ rrelyea

Good point.
Attachment #524511 - Flags: review?(rrelyea) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: