Closed Bug 699708 Opened 13 years ago Closed 12 years ago

10% or 50% slop for SQLite caches due to PgHdr1 and the "R" value being tacked onto 2^N-sized page allocations

Categories: Toolkit :: Storage (defect)
Priority: Not set; Severity: normal
Status: RESOLVED WONTFIX
Reporter: n.nethercote; Assignee: Unassigned
Keywords: memory-footprint; Whiteboard: [MemShrink:P2]

+++ This bug was initially created as a clone of Bug #676189 +++

The patch from bug 678977 just landed, fixing the dark matter problem and the slop bytes in the 1024, 2048 and 32768 byte cases from bug 676189 comment 15.  Let's recap what's left to do...

That still leaves the dominant 33016 byte case, which jemalloc rounds up to 36864, resulting in 10.5% slop for the page cache -- the biggest source of memory consumption in SQLite.

It arises through pcache1AllocPage, which allocates |sizeof(PgHdr1) + pCache->szPage| bytes.  |sizeof(PgHdr1)| is 5 words, so 40 bytes on 64-bit platforms.  pCache->szPage is 32976.  I found the following comment that seems to explain why it's this not-very-round number:

** SQLite will typically create one cache instance for each open database file,
** though this is not guaranteed. ^The
** first parameter, szPage, is the size in bytes of the pages that must
** be allocated by the cache.  ^szPage will not be a power of two.  ^szPage
** will be the page size of the database file that is to be cached plus an
** increment (here called "R") of less than 250.  SQLite will use the
** extra R bytes on each page to store metadata about the underlying
** database page on disk.  The value of R depends
** on the SQLite version, the target platform, and how SQLite was compiled.

So the R value on my machine is 32976 - 32768 = 208.
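To make the arithmetic concrete, here's a minimal standalone sketch.  The rounding rule -- large allocations rounded up to a 4 KiB multiple -- is an assumption inferred from the observed 33016 -> 36864 (and, per bug 700294 below, 4232 -> 8192) behaviour, not something I've verified against jemalloc's size classes:

 /* Sketch of the slop arithmetic described above.  The round-to-4KiB
 ** rule for large allocations is inferred from the observed numbers. */
 #include <stdio.h>

 static size_t roundUp(size_t n, size_t g) { return (n + g - 1) / g * g; }

 int main(void) {
   size_t hdr = 40;                     /* sizeof(PgHdr1) on 64-bit      */
   size_t req32k = hdr + 32768 + 208;   /* 33016: 32KB page + R + header */
   size_t req4k  = 4232;                /* 4KB page case, see bug 700294 */
   printf("32KB page: %zu -> %zu (%.1f%% slop)\n", req32k,
          roundUp(req32k, 4096),
          100.0*(roundUp(req32k,4096)-req32k)/roundUp(req32k,4096));
   printf(" 4KB page: %zu -> %zu (%.1f%% slop)\n", req4k,
          roundUp(req4k, 4096),
          100.0*(roundUp(req4k,4096)-req4k)/roundUp(req4k,4096));
   return 0;   /* prints ~10.4% and ~48.3% -- the 10.5% and ~50% cases */
 }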

In bug 681184 we discussed the provisional SQLITE_PAGECACHE_BLOCKALLOC option, which caused fifteen 32KB pages to be allocated at once.  This reduced the slop fraction to almost zero, but at the cost of increasing up-front memory usage to an unacceptable degree (see bug 678977 comment 22).
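A back-of-the-envelope view of both effects, reusing roundUp from the sketch above (same rounding assumption; the real allocator's behaviour at ~half-megabyte sizes may well differ):

 /* Why block allocation nearly eliminated the slop: the 2^N+epsilon
 ** overhead is paid once per 15-page block instead of once per page. */
 size_t block = 15 * 33016;             /* 495240 bytes requested       */
 size_t alloc = roundUp(block, 4096);   /* 495616 bytes, ~0.08% slop    */
 /* But the whole ~484KB block is committed as soon as one page in it
 ** is needed -- the unacceptable up-front cost noted above.            */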

Richard, do you have any other ideas for changing the 33016 byte allocations to be more allocator-friendly, e.g. storing the header and "R" part separately so we have a 32768 byte part and a ~250 byte part?
> Richard, do you have any other ideas for changing the 33016 byte allocations
> to be more allocator-friendly, e.g. storing the header and "R" part
> separately so we have a 32768 byte part and a ~250 byte part?

I read the code enough to understand how the header works.  Here is the declaration:

 /*
 ** Each cache entry is represented by an instance of the following 
 ** structure. A buffer of PgHdr1.pCache->szPage bytes is allocated 
 ** directly before this structure in memory (see the PGHDR1_TO_PAGE() 
 ** macro below).
 */
 struct PgHdr1 {
   unsigned int iKey;             /* Key value (page number) */
   PgHdr1 *pNext;                 /* Next in hash table chain */
   PCache1 *pCache;               /* Cache that currently owns this page */
   PgHdr1 *pLruNext;              /* Next in LRU list of unpinned pages */
   PgHdr1 *pLruPrev;              /* Previous in LRU list of unpinned pages */
 };

AFAICT, all movements between a header and its page are mediated by the following macros:

 /*
 ** When a PgHdr1 structure is allocated, the associated PCache1.szPage
 ** bytes of data are located directly before it in memory (i.e. the total
 ** size of the allocation is sizeof(PgHdr1)+PCache1.szPage bytes). The
 ** PGHDR1_TO_PAGE() macro takes a pointer to a PgHdr1 structure as
 ** an argument and returns a pointer to the associated block of szPage
 ** bytes. The PAGE_TO_PGHDR1() macro does the opposite: its argument is
 ** a pointer to a block of szPage bytes of data and the return value is
 ** a pointer to the associated PgHdr1 structure.
 **
 **   assert( PGHDR1_TO_PAGE(PAGE_TO_PGHDR1(pCache, X))==X );
 */
 #define PGHDR1_TO_PAGE(p)    (void*)(((char*)p) - p->pCache->szPage)
 #define PAGE_TO_PGHDR1(c, p) (PgHdr1*)(((char*)p) + c->szPage)
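To spell out the layout those macros assume, here's a minimal sketch (my own, not SQLite code) using the PgHdr1/PCache1 declarations above, with the real allocator path (pcache1Alloc) simplified to plain malloc:

 /* Layout sketch, using the numbers from my machine:
 **
 **   |<----------- szPage = 32976 ------------->|<- PgHdr1 ->|
 **   | page data (32768)   | "R" metadata (208) |  40 bytes  |
 */
 #include <assert.h>
 #include <stdlib.h>

 static PgHdr1 *allocPageSketch(PCache1 *pCache) {
   void *pPg = malloc(pCache->szPage + sizeof(PgHdr1));  /* 33016 total */
   if (pPg == 0) return 0;
   PgHdr1 *pHdr = PAGE_TO_PGHDR1(pCache, pPg);  /* header after buffer */
   pHdr->pCache = pCache;                       /* needed for the macro */
   assert(PGHDR1_TO_PAGE(pHdr) == pPg);         /* round-trips, as above */
   return pHdr;
 }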

The macros nicely encapsulate things.  If we stored the header separately, we could add a page pointer to the header and PGHDR1_TO_PAGE would be easily modified.  But implementing PAGE_TO_PGHDR1 would be difficult -- you naturally want a pointer back, but you can't store that in the page itself without exceeding 32768 bytes.  You could have a separate lookup table, but that would be annoying.  Hmm.
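To illustrate the asymmetry, here's a hypothetical split layout (entirely my invention, not SQLite code; hashLookup and pPageHash don't exist):

 /* Hypothetical split layout.  The easy direction gains a pointer: */
 struct PgHdr1Split {
   void *pPage;           /* NEW: the separately allocated 32768B buffer */
   unsigned int iKey;
   PCache1 *pCache;
   /* ...remaining PgHdr1 fields as above... */
 };
 #define PGHDR1_TO_PAGE_SPLIT(p) ((p)->pPage)

 /* The hard direction: the page has no room for a back-pointer (adding
 ** one pushes the allocation past 32768 bytes again), so this needs a
 ** side structure, e.g. a hash table keyed on the page address: */
 static struct PgHdr1Split *pageToHdrSplit(PCache1 *c, void *pPg) {
   return hashLookup(c->pPageHash, pPg);  /* both names hypothetical */
 }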

I haven't worked out the R part yet.  It looks to be another name for the |MemPage| struct, which has size |EXTRA_SIZE|.  If I had to guess I'd reckon that it gets tacked onto the end of the page for the same reason the header is tacked onto the front -- you can move between it and the page with constant offsets -- but I haven't seen any code that actually does that.

Anyway, splitting the header and R from the page doesn't look easy, unfortunately.
The "header" is placed immediately after the page cache buffer on purpose.  The reason is that when SQLite is reading a corrupt database file, for certain obscure kinds of corruption, it is possible that SQLite might try to read as many as 8 bytes past the end of the page cache buffer.  (NB: read, not write.)  By placing the header after the buffer (instead of before it, or in a separate allocation) we ensure that those reads do not cause a segfault.

The reads in question only occur when a very carefully corrupted database file is presented to SQLite.  It's very hard to get them to occur, and they never occur when accessing a well-formed database.  We could add a single comparison to eliminate those reads in all cases, but the reads occur in a particularly performance-sensitive part of the code (the variable-length integer decoder), and adding even that one comparison results in an easily measurable performance reduction.
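For illustration, a simplified sketch of that decoder pattern (not SQLite's actual, heavily optimized varint routine): a varint is 1-9 bytes, so if corruption places a varint's first byte at the very end of the buffer, the loop below reads up to 8 bytes past it.  Bounds-checking p on every iteration would prevent that, at the cost of the extra comparison mentioned above.

 /* Simplified varint decoder sketch.  Continuation is signalled by the
 ** high bit; the first 8 bytes contribute 7 bits each, the 9th byte 8.
 ** With a corrupt page, nothing stops p[i] from landing past the end
 ** of the buffer -- a read, never a write. */
 static int getVarintSketch(const unsigned char *p,
                            unsigned long long *pVal) {
   unsigned long long v = 0;
   int i;
   for (i = 0; i < 8; i++) {
     v = (v << 7) | (p[i] & 0x7f);
     if ((p[i] & 0x80) == 0) { *pVal = v; return i + 1; }
   }
   *pVal = (v << 8) | p[8];   /* 9th byte contributes all 8 bits */
   return 9;
 }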

The other problem is that we have published APIs that assume the header and buffer for a page cache entry are taken from the same allocation.  If we try to split them up, that API will break.  Not many applications use that API, but the ones that do use it are very high-profile so we need to be careful there.

So there are some dicey technical issues with splitting the header from the buffer in the page cache.  But we (the SQLite team) will be putting our heads together to try to come up with a solution for you.

Meanwhile, here's another idea:  If your goal is simply to reduce the memory usage in FF, have you considered simply using less page cache, rather than trying to make the page cache use memory more efficiently?  I'm entering this comment using a hacked version of Nightly in which I have hard-coded an upper bound on the size of SQLite's page cache to 100 pages (down from the default of 2000 pages).  It seems to be working just fine for me - database memory usage has dropped by half according to about:memory, and there is no noticeable performance degradation.  (That's a subjective evaluation, though - it would be interesting to see what your performance tests show.)
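For reference, the non-hacked way to impose the same bound on a single connection is the cache_size pragma (the 100-page figure is just the one from my experiment above):

 #include <sqlite3.h>

 /* Cap this connection's page cache at 100 pages, mirroring the
 ** hard-coded experiment described above.  Error handling omitted. */
 static int capPageCache(sqlite3 *db) {
   return sqlite3_exec(db, "PRAGMA cache_size=100;", 0, 0, 0);
 }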
(In reply to D. Richard Hipp from comment #2)
> Meanwhile, here's another idea:  If your goal is simply to reduce the memory
> usage in FF, have you considered simply using less page cache, rather than
> trying to make the page cache use memory more efficiently?

I'm doing that in bug 692487, and bug 700417 along with bug 658303 went in that direction.

The two roads are complementary: on one side we try to reduce memory usage, on the other we try to use memory more efficiently.
I'm not sure the former is enough, though, since for example on my profile I see slop bytes as high as used bytes -- 50% efficiency, which is pretty bad in this case.  I'm still investigating this issue.
(In reply to Marco Bonardo [:mak] from comment #3)
> I'm not sure the former is enough, though, since for example on my profile
> I see slop bytes as high as used bytes -- 50% efficiency, which is pretty
> bad in this case.  I'm still investigating this issue.

I did some investigation: the 4096 page_size hurts because its 4232-byte allocations get rounded up to 8192 bytes.  See bug 700294.
(In reply to D. Richard Hipp from comment #2)
> 
> Meanwhile, here's another idea:  If your goal is simply to reduce the memory
> usage in FF, have you considered simply using less page cache, rather than
> trying to make the page cache use memory more efficiently?

That's definitely a good idea, and I'm glad Marco is working on it.

This issue came up because I was looking at slop in general, in particular the "clownshoes" cases where we requested 2^N+epsilon bytes, which in many cases involve 50% slop and are relatively easy to fix, as in bug 678977.  (Plus half a dozen non-SQLite cases.)

But this 32KB+epsilon -> 36KB slop issue is atypical.  It's much harder to fix -- you've explained there are good technical reasons for it.  And the waste is much smaller, only about 10%.  (And I've downgraded it from a MemShrink:P1 to a MemShrink:P2.)

So I'd say:  spend some time thinking about whether it can be avoided in a simpler way.  (E.g. if Mozilla doesn't use those APIs, can we add an option to split the allocation in two, such that specifying the option makes those APIs unavailable?)  And we can discuss it on IRC, but I'm happy to WONTFIX this if the cost:benefit ratio is too high.
Whiteboard: [MemShrink:P1] → [MemShrink:P2]
Hmm, Marco just told me that telemetry says that half the Places databases out in the wild have 1KB or 4KB page sizes (because we used to use 1KB, then increased to 4KB, then increased to 32KB).  So they'll be getting 50% slop.  Which makes this bug more important again.
Summary: Avoid 33016 byte allocations in sqlite because they are rounded up to 36864 bytes → 10% or 50% slop for SQLite caches due to PgHdr1 and the "R" value being tacked onto 2^N-sized page allocations
(In reply to Nicholas Nethercote [:njn] from comment #6)
> Hmm, Marco just told me that telemetry says that half the places databases
> out in the wild have 1KB or 4KB page sizes

Yes, the reporting ones, which so far are mostly Nightly users, since Places telemetry is broken on anything older than FF10.  We may get better numbers once Aurora 10 is released.
Whiteboard: [MemShrink:P2] → [MemShrink:P2][clownshoes]
FWIW, I don't consider this a clownshoes bug, because the 2^N+epsilon calculation had significant technical constraints and reasons underlying it.  A true clownshoes bug is really easy to fix, and this one isn't.  (Though it is possible to fix and it is worth fixing.)
Whiteboard: [MemShrink:P2][clownshoes] → [MemShrink:P2]
drh sent a few of us a patch privately, which I evaluated.  

I did some measurements with a moderately old profile;
getting deterministic results isn't easy, but I saw enough to be
convinced that it is working as expected.

Opening a couple of pages, my total SQLite usage goes to around 5.4MB,
and the "other" part drops by ~400KB (from 1.4MB to 1.0MB) with the patch.

Opening 150 pages, without the patch, total goes to 7.2MB and "other"
goes to 1.4MB;  with the patch, total goes to 6.0MB and "other" goes
to 1.0MB.  (There are some other pragma/default cache size changes that
might be affecting the total size.)

Those are measurements for live pages.  I also logged all the
individual allocations done by SQLite and confirmed there are no
32KB+250-byte allocations.  Furthermore, the total amount of
*cumulative* slop has been reduced by a factor of 2--5x for the
workloads I tried.

And my profile has 32KB pages; on profiles with 4KB or 1KB pages the
improvements should be a lot bigger.
Keywords: footprint
Blocks: 746486
This seems hard to fix, because this memory layout is deeply embedded in SQLite.  If we could convert all the profiles that use 1KB/4KB pages to 32KB, we could reduce the waste from 50% to 10%.  I opened bug 746486 for this.
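For reference, the conversion itself would look roughly like this sketch (page_size only takes effect once VACUUM rebuilds the file, and it can't be changed this way while a database is in WAL mode; error handling omitted):

 #include <sqlite3.h>

 /* Sketch of the conversion bug 746486 asks for: bump an existing
 ** database to 32KB pages by setting the new size and rebuilding. */
 static int convertTo32KPages(sqlite3 *db) {
   int rc = sqlite3_exec(db, "PRAGMA page_size=32768;", 0, 0, 0);
   if (rc == SQLITE_OK) rc = sqlite3_exec(db, "VACUUM;", 0, 0, 0);
   return rc;
 }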
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
No longer blocks: 746486