Last Comment Bug 676189 - [meta] sqlite3.c botches rounding up allocation sizes to a power-of-two, wasting memory, and under-reporting its memory usage
: [meta] sqlite3.c botches rounding up allocation sizes to a power-of-two, wast...
Status: RESOLVED FIXED
[MemShrink:P1][clownshoes]
: meta
Product: Toolkit
Classification: Components
Component: Storage (show other bugs)
: Trunk
: All All
: -- normal with 3 votes (vote)
: ---
Assigned To: Nicholas Nethercote [:njn]
:
Mentors:
Depends on: 678977 SQLite3.7.8
Blocks:
  Show dependency treegraph
 
Reported: 2011-08-02 23:15 PDT by Nicholas Nethercote [:njn]
Modified: 2012-01-13 12:53 PST (History)
22 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
sqlite instrumentation patch (6.34 KB, patch)
2011-08-03 20:05 PDT, Nicholas Nethercote [:njn]
no flags Details | Diff | Splinter Review

Description Nicholas Nethercote [:njn] 2011-08-02 23:15:00 PDT
In various places sqlite3.c carefully allocates heap blocks that are power-of-two sized, to avoid the heap allocator rounding up the requested size.

One example (note the 1024):

  static int growOpArray(Vdbe *p){
    VdbeOp *pNew;
    int nNew = (p->nOpAlloc ? p->nOpAlloc*2 : (int)(1024/sizeof(Op)));
    pNew = sqlite3DbRealloc(p->db, p->aOp, nNew*sizeof(Op));

Another example:

  ** The size chosen is a little less than a power of two.  That way,
  ** the FileChunk object will have a size that almost exactly fills
  ** a power-of-two allocation.  This mimimizes wasted space in power-of-two
  ** memory allocators.
  */
  #define JOURNAL_CHUNKSIZE ((int)(1024-sizeof(FileChunk*)))
  
  struct FileChunk {
    FileChunk *pNext;               /* Next chunk in the journal */
    u8 zChunk[JOURNAL_CHUNKSIZE];   /* Content of this chunk */
  };

ROWSET_ENTRY_PER_CHUNK is another case like this.


Unfortunately, sqlite3.c adds 8 bytes to every allocation request before it gets sent to the heap allocator, in order to store the request size:

  /*
  ** Like malloc(), but remember the size of the allocation
  ** so that we can find it later using sqlite3MemSize().
  */  
  static void *sqlite3MemMalloc(int nByte){
    sqlite3_int64 *p;
    assert( nByte>0 );
    nByte = ROUND8(nByte);
    p = malloc( nByte+8 );
    if( p ){
      p[0] = nByte;
      p++;
    }else{
      testcase( sqlite3GlobalConfig.xLog!=0 );
      sqlite3_log(SQLITE_NOMEM, "failed to allocate %u bytes of memory", nByte);
    }
    return (void *)p;
  }

So those carefully computed sizes of 1024 become 1032 and subsequently get rounded up by jemalloc to 2048.  And ironically enough, this renders sqlite3MemSize (which uses the stored size) completely inaccurate -- it underestimates these botched allocations by almost 2x.  Argh!

This accounts for at least 1.3MB of waste in the cumulative allocations that occur when I start up Firefox and load Gmail.  I don't have numbers on how many of those allocations are live at once.

What would be ideal would be to change sqlite3MemSize to just use malloc_usable_size.  That would avoid this rounding botch, and also save 8 bytes per allocation.  Can we change sqlite3.c?
Comment 1 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-08-03 01:25:17 PDT
no, we can't touch sqlite3.c, we can contact sqlite team and suggest upstream fixes though.
Comment 2 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-08-03 05:45:19 PDT
I don't think we should be afraid to patch sqlite in our tree, but we should definitely make the changes upstream too and pick them up when they trickle back.
Comment 3 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-08-03 05:55:01 PDT
we are not afraid, we cannot afaict.
Comment 4 Justin Lebar (not reading bugmail) 2011-08-03 07:03:48 PDT
(In reply to comment #3)
> we are not afraid, we cannot afaict.

Why is this?  We patch almost every other library we import...
Comment 5 D. Richard Hipp 2011-08-03 07:25:02 PDT
Background information: SQLite uses a start-time pluggable memory allocator. (Details at http://www.sqlite.org/malloc.html)  The default memory allocator is standard library malloc() though SQLite also comes with a memory allocator that allocates from a large static buffer using a power-of-two first-fit algorithm.  Or, applications can plug in a different memory allocator of their own choosing.  (Note that the test harnesses for SQLite plug in their own custom memory allocators to facilitate testing of out-of-memory (OOM) errors.)

The memory allocator used by SQLite must have four routines.  There is the traditional malloc(), realloc(), and free().  And there is a fourth routine (call it size()) which returns the size of an outstanding allocation.

The standard library memory allocators only supports the first three routines.  So, when using the standard library memory allocator, SQLite has to wrap it inside a set of routines that store the size of each allocation in the first 8 bytes.  This causes the allocations to be larger than requested by 8 bytes, as you have observed.

One possible solution here (assuming jemalloc has a size() interface) is to register jemalloc as the memory allocator directly, without going through the wrapper that adds the 8-byte overhead at the beginning of each allocation.  This has the advantage of using less memory and being faster.  But it depends on jemalloc supporting a size() interface and it requires some extra code at start-time to register the new malloc implementation with SQLite.

If that approach is not going to work for you, we can add a cpp macro SQLITE_MALLOC_OVERHEAD that defaults to 0, but which you can set to 8 at compile-time, and arrange for those power-of-two memory allocations you identify to request SQLITE_MALLOC_OVERHEAD bytes fewer than a power of two.

The optimal solution in terms of performance and efficient memory utilization will be the first one.  But that solution is also more work for FF devs.  The SQLITE_MALLOC_OVERHEAD approach is simpler for FF devs, but it not quite as efficient. 

We (the SQLite devs) will go ahead and start working on the SQLITE_MALLOC_OVERHEAD approach, which you (the FF devs) can use or not use as you see fit.  Let us know if you want further advice on how to plug jemalloc directly into SQLite, bypassing the standard library malloc wrapper.
Comment 6 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-08-03 07:27:29 PDT
(In reply to comment #5)
> The standard library memory allocators only supports the first three
> routines.  So, when using the standard library memory allocator, SQLite has
> to wrap it inside a set of routines that store the size of each allocation
> in the first 8 bytes.  This causes the allocations to be larger than
> requested by 8 bytes, as you have observed.
> 
> One possible solution here (assuming jemalloc has a size() interface) is to
> register jemalloc as the memory allocator directly, without going through
> the wrapper that adds the 8-byte overhead at the beginning of each
> allocation.  This has the advantage of using less memory and being faster. 
> But it depends on jemalloc supporting a size() interface and it requires
> some extra code at start-time to register the new malloc implementation with
> SQLite.

jemalloc supports size() (as malloc_usable_size()), so we should do this.

Thanks for the explanation.
Comment 7 Justin Lebar (not reading bugmail) 2011-08-03 07:29:49 PDT
I'll see if I can get this to work.
Comment 8 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2011-08-03 07:33:44 PDT
But be aware that on Linux at least we can't assume that jemalloc is active even when it is built. This is because libxul is sometimes delay loaded, so it really depends on whether the initial binary is linked against jemalloc.
Comment 9 Nicholas Nethercote [:njn] 2011-08-03 16:25:31 PDT
I plan to add some instrumentation code to break down the memory allocated by SQLite in Firefox into three parts:

(1) space available for useful data
(2) space taken up by the 8-byte counters for every allocation
(3) space taken up by the waste caused by the rounding.

The SQLITE_MALLOC_OVERHEAD approach will avoid (3).  Registering jemalloc directly as an allocator will avoid (2).


Richard:  maybe you should default SQLITE_MALLOC_OVERHEAD to 8, since that seems to be the common case.  I wonder how many other SQLite users are wasting memory due to this same issue.
Comment 10 Nicholas Nethercote [:njn] 2011-08-03 20:05:30 PDT
Created attachment 550588 [details] [diff] [review]
sqlite instrumentation patch

> (1) space available for useful data
> (2) space taken up by the 8-byte counters for every allocation
> (3) space taken up by the waste caused by the rounding.
> 
> The SQLITE_MALLOC_OVERHEAD approach will avoid (3).  Registering jemalloc
> directly as an allocator will avoid (2).

The attached patch add instrumentation for this.  Note that the new memory reporters overlap with the old explicit/storage/sqlite one, but that doesn't matter for throw-away instrumentation code like this, and it allowed the new ones to be compared easily to the old ones.

Here's the relevant about:memory output when I start the browser with a moderately-used profile:

├───3,528,848 B (08.60%) -- my-sqlite
│   ├──3,014,152 B (07.34%) -- useful
│   ├────492,488 B (01.20%) -- excess
│   └─────22,208 B (00.05%) -- accounting
├───3,014,152 B (07.34%) -- storage
│   └──3,014,152 B (07.34%) -- sqlite

Here's after I open a bunch of tabs:

├────9,761,664 B (01.90%) -- my-sqlite
│    ├──8,494,136 B (01.65%) -- useful
│    ├──1,235,200 B (00.24%) -- excess
│    └─────32,328 B (00.01%) -- accounting
├────8,494,136 B (01.65%) -- storage
│    └──8,494,136 B (01.65%) -- sqlite

Things to note:

- The "my-sqlite/useful" numbers match the "storage/sqlite" numbers exactly, which is good evidence that my patch is correct.

- SQLite under-reports its memory usage by ~1.15--1.20x.  Not terrible, but definitely worth fixing.

- The "my-sqlite/accounting" number ((2) above) is tiny, which is good, as it means that the simpler SQLITE_MALLOC_OVERHEAD approach will get us almost as much benefit as registering jemalloc would.

- This will put a small dent (0.5--1.0%, maybe?) in heap-unclassified.
Comment 11 Nicholas Nethercote [:njn] 2011-08-03 20:10:15 PDT
Richard:  BTW, SQLite currently doesn't include the 8-bytes-per-allocation in its memory accounting.  IMO it should.
Comment 12 Nicholas Nethercote [:njn] 2011-08-05 00:48:49 PDT
Richard: do you have a link to the SQLite bug tracker?
Comment 13 Shawn Wilsher :sdwilsh 2011-08-07 15:53:24 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #2)
> I don't think we should be afraid to patch sqlite in our tree, but we should
> definitely make the changes upstream too and pick them up when they trickle
> back.
Yes, we should.  It makes upgrading hard.  We used to have changes, and it made updating incredibly difficult.  We will always be better off getting the changes upstreamed into SQLite proper and then taking an update.

We have a support contract with SQLite for a reason.  We should use it instead of going through the pain ourselves.
Comment 14 Nicholas Nethercote [:njn] 2011-08-09 17:43:11 PDT
Can anyone estimate the timelines involved here?  By that I mean:

- How long the SQLITE_MALLOC_OVERHEAD change will take to make it into an SQLite release.

- How long it will take to update the SQLite version in Firefox.
Comment 15 Nicholas Nethercote [:njn] 2011-08-10 23:08:57 PDT
I did some more investigation.  The 1024-sized power-of-two botches mentioned in comment 0 only account for a small fraction of the wasted memory in this case.

I found four cases that accounted for most of the wasted memory:

Request   Request+8   actual    slop    percentage
-------   ---------   ------    ----    ----------
 1024      1032        2048     1016     4.5%
 2048      2056        4096     2040    12.1%
32768     32776       36864     4088     9.7%
33016     33024       36864     3840    55.3%

The "percentage" columns says how much of the live slop each size accounted for when I had a few tabs open.  Clearly the 1024-sized ones aren't that important, it's the 33016-size ones that matter most.

Still, let's go through them all:

- 1024:  This is a power-of-two botch due to the extra 8-byte size field.  growOpArray and JOURNAL_CHUNKSIZE (both mentioned in comment 0) account for most (all?) of the cases.

- 2048:  Another power-of-two botch due to the 8-byte size field.  pcache1ResizeHash calls |sqlite3_malloc(sizeof(PgHdr1 *)*nNew);| and nNew is usually a power-of-two.

- 32768: Ditto.  These ones all go through sqlite3PageMalloc.  I think this is because our Makefile passes in -DSQLITE_DEFAULT_PAGE_SIZE=32768.

- 33016:  This is the most important one, and is a bit different.  It always goes through pcache1AllocPage which allocates |sizeof(PgHdr1) + pCache->szPage|.
|sizeof(PgHdr1)| is 5 words, so 40 bytes on 64-bit platforms.  pCache->szPage is 32976.  I found the following comment that seems to explain why it's this not-very-round number:

** SQLite will typically create one cache instance for each open database file,
** though this is not guaranteed. ^The
** first parameter, szPage, is the size in bytes of the pages that must
** be allocated by the cache.  ^szPage will not be a power of two.  ^szPage
** will the page size of the database file that is to be cached plus an
** increment (here called "R") of less than 250.  SQLite will use the
** extra R bytes on each page to store metadata about the underlying
** database page on disk.  The value of R depends
** on the SQLite version, the target platform, and how SQLite was compiled.

  Looks like the R value on my machine is 32976 - 32768 = 208.


So, conclusions:
- The SQLITE_MALLOC_OVERHEAD fix will have to include the 2048 and 32768 cases.  Once they're done, this will avoid ~25% of the slop.

- Can the 33016 case be changed so it's 32768 bytes instead?  That one looks harder to fix, but would avoid more than 50% of the slop.
Comment 16 Nicholas Nethercote [:njn] 2011-08-10 23:14:30 PDT
FWIW, I just tried a different, older profile, and saw this:

├───32,620,912 B (05.20%) -- my-sqlite
│   ├──28,938,232 B (04.61%) -- useful
│   ├───3,640,552 B (00.58%) -- excess
│   └──────42,128 B (00.01%) -- accounting

In my earlier measurements it was more like 7MB.  In this 32MB measurement the 33016 case is now accounting for 84% of the slop.
Comment 17 Nicholas Nethercote [:njn] 2011-08-10 23:24:25 PDT
I guess in the limit scenario the 33016 case will dominate.  With 33016 of useful data, 8 bytes of size and 3840 bytes of slop, SQLite's total slop percentage will be 10.5%.
Comment 18 Shawn Wilsher :sdwilsh 2011-08-13 16:32:55 PDT
(In reply to Nicholas Nethercote [:njn] from comment #14)
> - How long the SQLITE_MALLOC_OVERHEAD change will take to make it into an
> SQLite release.
We can request a special release of SQLite if needed.  I tend to avoid that because nothing we need fixed is ever super critical.

> - How long it will take to update the SQLite version in Firefox.
It's very easy to upgrade SQLite (mostly because we don't take changes to the library): http://mxr.mozilla.org/mozilla-central/source/db/sqlite3/README.MOZILLA
Comment 19 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-08-15 06:59:55 PDT
I filed Bug 678977 for comment 6.

The 33016-byte allocations will need to be fixed upstream, I think.
Comment 20 Nicholas Nethercote [:njn] 2011-08-18 21:21:40 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> But be aware that on Linux at least we can't assume that jemalloc is active
> even when it is built. This is because libxul is sometimes delay loaded, so
> it really depends on whether the initial binary is linked against jemalloc.

Can you explain more about this?  I thought that if MOZ_MEMORY was defined, we can assume jemalloc exists and is used.  I'm a little terrified by the thought that this might not be the case...
Comment 21 Nicholas Nethercote [:njn] 2011-08-18 21:25:07 PDT
Richard:  With bug 678977 progressing well, your proposed SQLITE_MALLOC_OVERHEAD workaround looks less pressing now.

The matter of the 200-odd bytes in the "R" increment is still the biggest problem.  Is it possible that the "R" increment could be stored separately from its page when in memory?
Comment 22 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-08-19 04:01:03 PDT
If sqlite will make use of the roundup space caused by rounding the 32K + R allocations up to 36 K I think we can resolve completely on our side.
Comment 23 Justin Lebar (not reading bugmail) 2011-08-22 10:14:33 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #22)
> If sqlite will make use of the roundup space caused by rounding the 32K + R
> allocations up to 36 K I think we can resolve completely on our side.

(In case it's not clear) ... because jemalloc rounds 32K + epsilon up to 36K.
Comment 24 Nicholas Nethercote [:njn] 2011-09-01 19:33:44 PDT
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #22)
> If sqlite will make use of the roundup space caused by rounding the 32K + R
> allocations up to 36 K I think we can resolve completely on our side.

I did some investigation to answer this.  With the patch from bug 678977 applied, for all 32K+R allocations I wrote some canary values in the final page when the memory was allocated, and then checked if they were modified when the allocation was freed.  They were *not* modified, so I conclude that the slop space is *not* used.

However, from looking at the memory accounting code, AFAICT the storage/sqlite reporter *is* counting the slop bytes for the 32K+R allocations (with the patch from bug 678977 applied).  So we're wasting memory but at least we're being honest in about:memory about it.
Comment 25 Justin Lebar (not reading bugmail) 2011-09-29 13:06:12 PDT
> So we're wasting memory but at least we're being honest in about:memory about it.

So this is no longer a darkmatter blocker, right?
Comment 26 Nicholas Nethercote [:njn] 2011-09-29 19:33:21 PDT
(In reply to Justin Lebar [:jlebar] from comment #25)
> > So we're wasting memory but at least we're being honest in about:memory about it.
> 
> So this is no longer a darkmatter blocker, right?

Once the patch from bug 678977 lands.
Comment 27 Nicholas Nethercote [:njn] 2011-11-03 21:21:44 PDT
The patch from bug 678977 just landed.  I've spun off bug 699708 for the remaining issue, and I'll close this bug.

Note You need to log in before you can comment on or make changes to this bug.