Closed Bug 559508 Opened 14 years ago Closed 14 years ago

Bad scalability on 4-socket Nehalem caused by unused counters

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.12.10

People

(Reporter: sergey.l.ivashin, Assigned: sergey.l.ivashin)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: FIPS)

Attachments

(3 files, 1 obsolete file)

Simple patch to improve scalability of NSS 14 years ago Sergey 1.05 KB, patch		Details \| Diff \| Splinter Review
Sergey's patch as a real CVS diff 14 years ago Nelson Bolyard (seldom reads bugmail) 1.44 KB, patch	nelson : review+	Details \| Diff \| Splinter Review
Supplemental patch 13 years ago Wan-Teh Chang 928 bytes, patch	rrelyea : review+	Details \| Diff \| Splinter Review
Patch for NSS_3_12_BRANCH 13 years ago Wan-Teh Chang 2.11 KB, patch		Details \| Diff \| Splinter Review

Sergey

Assignee

Description

•

14 years ago

User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MS-RTC LM 8; InfoPath.2)
Build Identifier: 3.12.5

Very bad scalability is observed on 4-socket Nehalem-EX running SPECjvm2008.crypto.rsa with NSS (3.12.5) library.
Run using all 4 sockets shows score lower than run using only 3 sockets; 3-socket run shows score lower than 2-socket run. 

This performance degradation is partly caused by ++ operation on global counters (mp_copies, mp_allocs, mp_frees) in mozilla/security/nss/lib/freebl/mpi/mpi.c. These counters are absolutely unused in NSS and used only in some tests. Moreover, special wrappers around standard routines calloc/free are created which do nothing except reference counting.

According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per instruction) value 40-45 and contribute >7% of freebl module time. Similar behavior shows mp_copy routine (CPI ~25 and ~5% time) and it caused by ++mp_copies operation.

Attached patch removes unnecessary wrappers and unused counters from standard NSS build:
(1) MP_MACRO=1 definition eliminates wrappers and causes direct call to standard routines, so counters mp_allocs and mp_frees become unused;
(2) Increment operation on counter mp_copies moved under #ifdef MP_MACRO=0, so the counter become unused in standard build.

This simple patch delivers almost 60% performance boost for SPECjvm2008.crypto.rsa.

After the patch it would be sufficient to build NSS with MP_MACRO=0 to get the counters working (and get performance penalty on multi-sockets machines).


Reproducible: Always

Sergey

Assignee

Comment 1

•

14 years ago

Attached patch Simple patch to improve scalability of NSS (obsolete) — Details — Splinter Review

Sergey

Assignee

Updated

•

14 years ago

Keywords: perf

Version: unspecified → 3.12.5

Sergey

Assignee

Comment 2

•

14 years ago

Companion bug#559510 also addresses scalability problem, but it is independent.

Nelson Bolyard (seldom reads bugmail)

Comment 3

•

14 years ago

Comment on attachment 439185 [details] [diff] [review]
Simple patch to improve scalability of NSS

I will review this sometime.  I could review it sooner if it was a real cvs diff -pu

Attachment #439185 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Comment 4

•

14 years ago

Attached patch Sergey's patch as a real CVS diff — Details — Splinter Review

Attachment #439185 - Attachment is obsolete: true

Attachment #439461 - Flags: review?(nelson)

Attachment #439185 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Comment 5

•

14 years ago

Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

Fully reviewing this patch will necessitate reviewing all the code that is affected by the MP_MACRO feature test macro.  
I'll try to get to that this weekend.  Even if the code gets r+, the change must wait for the 3.13 release, because the affected code is in the FIPS module.

Nelson Bolyard (seldom reads bugmail)

Comment 6

•

14 years ago

> According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per
> instruction) value 40-45 and contribute >7% of freebl module time.

40 to 45 clocks PER INSTRUCTION?  Per single CPU instruction?   really?

While these particular statistics may well be expendable, NSS has many others
that aren't.  Does Nehalem offer any solution to reduce the CPU wait time for incrementing these counters?

Target Milestone: --- → 3.13

Nelson Bolyard (seldom reads bugmail)

Comment 7

•

14 years ago

Sergey,
First, I'm delighted that you're interested in improving the performance
of NSS on Nehalem CPUs. I appreciate that you reported this problem.
I believe your patch does resolve the issue, but it entangles the global
counter issue with another issue, which is the calls-to-functions (such
as memset) vs inline code (such as for loops) to try to accomplish the
same job. I think these issues should not be directly coupled.

So, I propose to solve the counter problem is a way that does not force
a choice of the MP_MACRO issue. Rather than tying the "fix" to one or
the other setting of MP_MACRO, I propose to wrap ALL the lines that
increment those counters (wether in a macro or in a subroutine) in a
macro, e.g.
- ++mp_copies;
+ MP_COUNT(mp_copies);

and the conditionally define that macro as either
#define MP_COUNT(x) (x)++
or
#define MP_COUNT(x)

and have the default be the latter of those two definitions.

Now, I could write that patch, but then I'd have to find a reviewer for it,
and reviewers are VERY difficult to find these days (I cannot review my own
patch). But if YOU write that patch, then I can review it and it's a slam
dunk, so to speak.

While we're here. Let's look at the rest of the issues related to performance
of MP_MACRO. The original concern (back in year 2000) was that on some platforms, with some compilers, calls to memset would be less efficient than
direct loops to do zeroing of arrays inline. That's still a concern. But
I think the code fails to address it. If should offer a choice between inline
for-loops and inline calls to memset, or memcpy (etc), but instead it offers a
choice between inline for-loops and inline calls to s_mp_* calls. :(.

Maybe we need 3 choices: among
- inline for-loops
- calls to our own functions that do for-loops
- inline calls to memset, memcpy, etc.

The choice among those should be made via ifdefs and chosen in Makefiles.

In any case, the counters (or other statistical gathering stuff) should ONLY be build in builds specifically intented to instrument such stuff and not in general builds. That's an error pointed out by this bug, to be corrected.

So, let me ask you to submit another patch that introduces the MP_COUNT
macro and ifdef's its definition according to something. Merci'.

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Nelson Bolyard (seldom reads bugmail)

Updated

•

14 years ago

Assignee: nobody → sergey.l.ivashin

Priority: -- → P2

Nelson Bolyard (seldom reads bugmail)

Comment 8

•

14 years ago

Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

See previous review comments.  Also
:%s/Merci'/Спасибо/g

Attachment #439461 - Flags: review?(nelson) → review-

Sergey

Assignee

Comment 9

•

14 years ago

Nelson,
I am working on the patch you are waiting. Meanwhile, let us clear some questions I don't quite understand.
It seems that there are features in the sources to make the choices you have mentioned:

- inline for-loops
	Set MP_MACR0=1 and MP_MEMSET=0 to inline for-loop for zeroing (s_mp_setz macro).
	Set MP_MACR0=1 and MP_MEMCPY=0 to inline for-loop for copying (s_mp_copy macro).

- calls to our own functions that do for-loops
	Set MP_MACR0=0 to call own s_mp_setz and s_mp_copy routines
	(also s_mp_alloc and s_mp_free routines will used).

- inline calls to memset, memcpy, etc.
	Set MP_MACR0=1 and MP_MEMSET=1 to call standard memset routine.
	Set MP_MACR0=1 and MP_MEMCPY=1 to call standard memcpy routine.
	Set MP_MACR0=1 to call standard calloc and free routines.

All these flags can be set/cleared in makefiles. 
So do we really need additional features in the sources?

Thank very much you for help and forgive my English.

Sergey

Assignee

Comment 10

•

14 years ago

(In reply to comment #6)
> > According to VTune, wrappers s_mp_alloc and s_mp_free have CPI (clocks per
> > instruction) value 40-45 and contribute >7% of freebl module time.
> 40 to 45 clocks PER INSTRUCTION?  Per single CPU instruction?   really?

Yes, so it is; It is quite normal for NUMA architecture. 
Just imagine how all 64 working threads (32 cores with hyperthreading enabled) are incrementing single memory cell simultaneously, invalidating all existing caches.

> While these particular statistics may well be expendable, NSS has many others
> that aren't.  

From SPECjvm2008 point of view on NSS the problems reported in this bug and in #559510 are the most serious. I don't see others (serious) problems so far.

> Does Nehalem offer any solution to reduce the CPU wait time for incrementing these counters?

I am afraid these cases of "true sharing" must be eliminated manually.

Nelson Bolyard (seldom reads bugmail)

Comment 11

•

14 years ago

Sergey, You're right.  All we need is a way to conditionally compile the 
counting that's independent of those other variables.  

Your English is undoubtedly better than my По-русски язык. :)

Nelson Bolyard (seldom reads bugmail)

Comment 12

•

14 years ago

Comment on attachment 439461 [details] [diff] [review]
Sergey's patch as a real CVS diff 

It appears to me that the ONLY reason for the s_mp_copy and s_mp_setz functions to exist is to contain those count instructions.  Of course, they don't need to be functions for that purpose.  They could be 
do { .. } while (0)
constructs for that, and IMO that would be an improvement.  
In any event, I agree that Sergey's patch is the quickest path to the goal, and will commit it shortly.

Attachment #439461 - Flags: review- → review+

Nelson Bolyard (seldom reads bugmail)

Comment 13

•

14 years ago

Bug 559508: Bad scalability on 4-socket Nehalem caused by unused counters
Patch contributed by Sergey L Ivashin <sergey.l.ivashin@intel.com>, r=nelson

Checking in mpi/mpi.c; new revision: 1.45.56.3; previous revision: 1.45.56.2
Checking in mpi/mpi-config.h; new revision: 1.5.184.1; previous revision: 1.5

These commits are on the 3_13 branch and will be in the NSS 3.13 release
(if not sooner)

Blocks: FIPS2010

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Whiteboard: FIPS

Robert Relyea

Updated

•

13 years ago

Target Milestone: 3.13 → 3.12.10

Wan-Teh Chang

Comment 14

•

13 years ago

Attached patch Supplemental patch — Details — Splinter Review

Move the ++mp_copies statement after variable declarations.
If this code is compiled, it'll break under compilers that
don't support C99.

Patch checked in on the NSS trunk (NSS 3.13).

Checking in mpi.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi.c,v  <--  mpi.c
new revision: 1.50; previous revision: 1.49
done

Attachment #524511 - Flags: review?(rrelyea)

Wan-Teh Chang

Comment 15

•

13 years ago

Attached patch Patch for NSS_3_12_BRANCH — Details — Splinter Review

Combines the previous two patches.

Patch checked in on the NSS_3_12_BRANCH (NSS 3.12.10).

Checking in mpi-config.h;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi-config.h,v  <--  mpi-config.h
new revision: 1.5.198.1; previous revision: 1.5
done
Checking in mpi.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi.c,v  <--  mpi.c
new revision: 1.47.2.1; previous revision: 1.47
done

Robert Relyea

Comment 16

•

13 years ago

Comment on attachment 524511 [details] [diff] [review]
Supplemental patch

r+ rrelyea

Good point.

Attachment #524511 - Flags: review?(rrelyea) → review+

You need to log in before you can comment on or make changes to this bug.