Closed Bug 1095272 Opened 10 years ago Closed 10 years ago

Reduce NSS's heap allocation rate

Categories: NSS :: Libraries, defect
Severity: normal
Status: RESOLVED FIXED
People: (Reporter: n.nethercote, Assigned: n.nethercote)
Whiteboard: [MemShrink:P2]
Attachments: (3 files)

I did some cumulative heap profiling with a modified version of DMD (see bug
1094552) on an e10s build of Firefox. I ran http://gregor-wagner.com/tmp/mem50,
which opens 50 popular websites, one per tab. The parent process cumulatively
allocated 12.9 GiB of heap blocks, and the child process cumulatively allocated
14.2 GiB.

I was surprised to see that more than 2/3 of the allocations done in the parent
process were from NSS. Typically NSS doesn't show up as significant in memory
snapshots (which measure *live* allocations rather than *cumulative*
allocations), so this indicates that NSS is doing lots of short-lived
allocations.

Here are the top six records when we use a stack trace depth of 1, which
gives us the most coarse-grained view. I've added comments to some of the
records. (Note that the word "cumulative" in these records means "this record
and all the ones before it"; for example, record 2's 30.92% cumulative is
record 1's 19.99% plus its own 10.93%. This is unrelated to the notion of
"cumulative heap profiling". Apologies for any confusion.)

> # These are almost all from PORT_ArenaAlloc_Util() and
> # nss_zalloc_arena_locked().
> Live {
>   ~630,282 blocks in heap block record 1 of 1,832
>   ~2,580,519,432 bytes (~2,580,420,334 requested / ~99,098 slop)
>   Individual block sizes: 8,192 x 189; 4,096 x 165; ~4,093 x 629,928
>   19.99% of the heap (19.99% cumulative)
>   Allocated at {
>     #01: PL_ArenaAllocate (/home/njn/moz/mi5/co64dmd/nsprpub/lib/ds/../../../../nsprpub/lib/ds/plarena.c:203)
>   }
> }
> 
> Live {
>   ~344,405 blocks in heap block record 2 of 1,832
>   ~1,410,476,390 bytes (~1,163,558,918 requested / ~246,917,472 slop)
>   Individual block sizes: 4,096 x 275,575; ~4,093 x 68,830
>   10.93% of the heap (30.92% cumulative)
>   Allocated at {
>     #01: nss_ZAlloc (/home/njn/moz/mi5/security/nss/lib/base/arena.c:892)
>   }
> }
> 
> Live {
>   ~212,422 blocks in heap block record 3 of 1,832
>   ~1,385,587,010 bytes (~1,039,694,442 requested / ~345,892,568 slop)
>   Individual block sizes: 16,384 x 387; 8,192 x 124,759; 4,096 x 2; ~4,093 x 87,274
>   10.73% of the heap (41.65% cumulative)
>   Allocated at {
>     #01: PORT_ZAlloc_Util (/home/njn/moz/mi5/security/nss/lib/util/secport.c:117)
>   }
> }
> 
> # These are almost all called from PORT_NewArena_Util(), nssArena_Create(),
> # sftk_NewSession(), pk11_CreateNewContextInSlot(), nssCKFWMutex_Create(),
> # sftk_NewObject(), nssPKIObject_NewLock(), PK11_NewSlotList().
> Live {
>   ~296,281 blocks in heap block record 4 of 1,832
>   ~1,212,678,133 bytes (~1,212,678,133 requested / ~0 slop)
>   Individual block sizes: ~4,093 x 296,281
>   9.39% of the heap (51.04% cumulative)
>   Allocated at {
>     #01: PR_NewLock (/home/njn/moz/mi5/co64dmd/nsprpub/pr/src/pthreads/../../../../../nsprpub/pr/src/pthreads/ptsynch.c:142)
>   }
> }
> 
> Live {
>   ~213,135 blocks in heap block record 5 of 1,832
>   ~910,281,079 bytes (~905,540,469 requested / ~4,740,610 slop)
>   Individual block sizes: 20,480 x 2,314; 4,096 x 2; ~4,093 x 210,819
>   7.05% of the heap (58.09% cumulative)
>   Allocated at {
>     #01: PORT_Alloc_Util (/home/njn/moz/mi5/security/nss/lib/util/secport.c:86)
>   }
> }
> 
> Live {
>   ~220,580 blocks in heap block record 6 of 1,832
>   ~902,833,940 bytes (~902,833,940 requested / ~0 slop)
>   Individual block sizes: ~4,093 x 220,580
>   6.99% of the heap (65.09% cumulative)
>   Allocated at {
>     #01: mp_init_size (/home/njn/moz/mi1/security/nss/lib/freebl/mpi/mpi.c:120)
>   }
> }

I've done some digging and found various causes.

- Bug 1094650 fixes some over-allocation in nss_builtins_FindObjectsInit().

- Arenas are used *extremely* heavily. I understand that they can simplify
  freeing, but the frequency of arena creation/destruction is incredibly high.
  Each arena costs a heap allocation for the pool itself, one for its first
  chunk, and one for a lock (which explains most of the high allocation rate
  from PR_NewLock()). See the sketch after this list.

- There appears to be a lot of copying of structs. I suspect this is related to
  the heavy use of arenas, e.g. often a struct needs to be copied from one
  arena into another so it can be destroyed at the right time.

- The number of arbitrary-precision integers created is enormous, hence the
  appearance of mp_init_size() in record 6 above. These integers appear to
  commonly be 512 or 1024 bytes each, which is rather large.
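
To make the arena cost concrete, here is a minimal sketch of the
create/use/destroy cycle, written against the standard NSS util arena API
(PORT_NewArena / PORT_ArenaAlloc / PORT_FreeArena). The function itself is
illustrative, not real NSS code. Creating an arena heap-allocates the pool
bookkeeping plus a lock (PR_NewLock, via PZ_NewLock); the first suballocation
then mallocs the first chunk. Repeat this cycle often enough and the
per-arena overhead dominates the useful allocations.

#include "nspr.h"
#include "secport.h"

static void short_lived_arena_use(void)
{
    /* Heap-allocates the pool bookkeeping and its lock. */
    PLArenaPool *arena = PORT_NewArena(2048);
    if (arena == NULL)
        return;

    /* The first suballocation mallocs the arena's first chunk;
     * subsequent small requests are carved out of the same chunk. */
    void *buf = PORT_ArenaAlloc(arena, 128);
    (void)buf;

    /* Frees the chunk(s), the lock, and the pool itself;
     * PR_TRUE zeroes the memory first. */
    PORT_FreeArena(arena, PR_TRUE);
}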

I certainly don't want to make major changes to NSS. I'm hoping that a small
number of minor changes will suffice to greatly reduce the allocation rates.
This should make things faster (fewer calls to malloc() and free(), and less
memory traffic) and may help with heap fragmentation.

> The number of arbitrary-precision integers created is enormous

More specifically, during another run of http://gregor-wagner.com/tmp/mem50,
934,262 of them were created. 617,087 of them were 512 bytes (64 digits),
and 317,175 of them were 1024 bytes (128 digits), for a total of ~640 MB of
allocations.
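
For scale, each of those integers is a short-lived malloc/free pair. A
minimal sketch against the lib/freebl/mpi API (the call site is illustrative;
the API calls are real):

#include "mpi.h"

static void bignum_roundtrip(void)
{
    mp_int a;
    /* Heap-allocates a 64-digit buffer: 512 bytes with 8-byte digits. */
    if (mp_init_size(&a, 64) != MP_OKAY)
        return;
    /* ... arithmetic on 'a' ... */
    mp_clear(&a);   /* frees the digit buffer */
}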
Depends on: 1095307
The attached file gives a more detailed breakdown of where the allocations are happening.

Most of the arbitrary-precision integers come about as a result of this stack trace:

> RSA_PublicKeyOp (/home/njn/moz/mi5/security/nss/lib/freebl/rsa.c:836)
> RSA_CheckSignRecover (/home/njn/moz/mi5/security/nss/lib/freebl/rsapkcs.c:1346)
> NSC_VerifyRecover (/home/njn/moz/mi5/security/nss/lib/softoken/pkcs11c.c:3349)
> PK11_VerifyRecover (/home/njn/moz/mi5/security/nss/lib/pk11wrap/pk11obj.c:648)
> recoverPKCS1DigestInfo (/home/njn/moz/mi5/security/nss/lib/cryptohi/secvfy.c:66)
> vfy_CreateContext (/home/njn/moz/mi5/security/nss/lib/cryptohi/secvfy.c:451)
> vfy_VerifyData (/home/njn/moz/mi5/security/nss/lib/cryptohi/secvfy.c:727)
> mozilla::pkix::VerifySignedData(mozilla::pkix::SignedDataWithSignature const&, mozilla::pkix::Input, unsigned int, void*) (/home/njn/moz/mi5/co64dmd/security/pkix/../../../security/pkix/lib/pkixnss.cpp:157)

In my mem50 workload, there were 39,108 calls to RSA_PublicKeyOp(). But there
were only 546 unique results, and the top 12 most frequent results covered
98.1% of the calls. I wonder if it's worth caching prior results.
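
To make the caching idea concrete, here is a purely hypothetical sketch
(none of these names exist in NSS): a tiny direct-mapped cache for RSA
public-key results, keyed on a hash of (modulus || input). A real patch
would also need locking for thread safety and careful security review.

#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64
#define KEY_MAX     512   /* modulus (<= 256 B) || input (<= 256 B) */
#define OUT_MAX     256

typedef struct {
    uint8_t key[KEY_MAX];
    size_t  key_len;
    uint8_t out[OUT_MAX];
    size_t  out_len;
    int     valid;
} CacheEntry;

static CacheEntry rsa_cache[CACHE_SLOTS];

static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;          /* FNV-1a 32-bit offset basis */
    while (n--)
        h = (h ^ *p++) * 16777619u;    /* FNV-1a 32-bit prime */
    return h;
}

/* Returns 1 and copies the cached result on a hit, 0 on a miss. */
static int cache_lookup(const uint8_t *key, size_t key_len,
                        uint8_t *out, size_t *out_len)
{
    CacheEntry *e = &rsa_cache[fnv1a(key, key_len) % CACHE_SLOTS];
    if (e->valid && e->key_len == key_len &&
        memcmp(e->key, key, key_len) == 0) {
        memcpy(out, e->out, e->out_len);
        *out_len = e->out_len;
        return 1;
    }
    return 0;
}

static void cache_store(const uint8_t *key, size_t key_len,
                        const uint8_t *out, size_t out_len)
{
    if (key_len > KEY_MAX || out_len > OUT_MAX)
        return;                        /* too big to cache */
    CacheEntry *e = &rsa_cache[fnv1a(key, key_len) % CACHE_SLOTS];
    memcpy(e->key, key, key_len);
    e->key_len = key_len;
    memcpy(e->out, out, out_len);
    e->out_len = out_len;
    e->valid = 1;
}

With only 546 unique results, and 12 of them covering 98.1% of calls, even
64 slots like this would hit almost every time. Whether a hit path that
skips the bignum work is acceptable from a timing standpoint is a separate
question.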
Depends on: 1096741
Before you spend too much time on this, would you mind sending something to mozilla.dev.tech.crypto discussing your plans, short and long-term?

I'm very nervous that you're touching very security-sensitive code that is designed NOT to leak timing information, and that may require consistently allocating certain buffers or performing certain operations. Because your hot spots are always going to be at the lowest layers, it would be good to discuss this with the broader NSS community, to understand your objectives, performance criteria, etc.

In particular, I'd like to understand what the objectives are and how success is or isn't measured. Simply reducing allocations for its own sake is not necessarily a reasonable goal (depending on the allocator used).
I've emailed the list. I also expect that my patches will get the appropriate level of reviewing required.
Whiteboard: [MemShrink] → [MemShrink:P2]
With all the dependent bugs closed, this bug can be closed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee: nobody → n.nethercote