Created attachment 575317 [details] [diff] [review]
patch for evaluation
We have a trivial patch from Richard Hipp that drastically reduces SQLite's memory consumption in Firefox. Richard said via private email:
> I modified the SQLite page cache so that it immediately discards (frees)
> pages that have not been changed and are not in active use. The default
> behavior is to keep those pages in memory with the hope that they might be
> reused later, thus saving a memcpy() from the kernel's filesystem cache.
> After this change, however, SQLite just throws those unused pages away and
> reclaims the memory.
> With this change, SQLite might still allocate large amounts of memory
> while running a large query or doing a large transaction. But as soon as
> that operation completes, the memory is freed. Clownfoot allocations are
> still occurring, but those allocations are now transient - lasting only
> for the duration of a single SQL statement or transaction - so any slop
> quickly gets returned to the free memory pool.
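To make the policy difference concrete, here is a toy model (not SQLite's actual pcache code; page numbers and sizes are made up) contrasting the default keep-clean-pages behavior with the patched discard-on-unpin behavior:

```python
# Toy model of the two page-cache policies described above.
# This is an illustration, not SQLite's real pager cache.

class PageCache:
    def __init__(self, keep_clean_pages):
        self.keep_clean_pages = keep_clean_pages
        self.cached = {}          # pgno -> page data
        self.disk_reads = 0

    def fetch(self, pgno):
        if pgno in self.cached:
            return self.cached[pgno]
        self.disk_reads += 1      # simulate a read from the OS
        page = b"\x00" * 1024
        self.cached[pgno] = page
        return page

    def unpin(self, pgno, dirty=False):
        # Default policy: keep clean pages cached for possible reuse.
        # Patched policy: discard clean, unpinned pages immediately.
        if not dirty and not self.keep_clean_pages:
            del self.cached[pgno]

default = PageCache(keep_clean_pages=True)
patched = PageCache(keep_clean_pages=False)
for cache in (default, patched):
    for _ in range(3):            # touch the same page three times
        cache.fetch(7)
        cache.unpin(7)
print(default.disk_reads, patched.disk_reads)  # 1 vs 3
```

The trade-off in the thread below is exactly this: the patched policy frees memory as soon as a page is unpinned, at the cost of re-reading pages that would otherwise have been cache hits.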
(The clownfoot allocations are covered by bug 699708, and Richard is doing some independent work to reduce them, but this patch would obviously make that less important.)
I've been running with this patch for a few days, and I haven't seen SQLite's total memory consumption in about:memory go above 3.2MB, when previously it would often be 2, 3, or 4x that. Furthermore, the profile I'm testing isn't that big, and it uses 32KB pages, where slop is a much smaller problem than it is on profiles with 1KB or 4KB pages. So the potential gains for some users might be significantly higher.
The crucial question is, will this cause slow-downs, and how can we tell? (I haven't noticed any slowdowns but I have a very fast desktop machine.) Marco said this via email:
> So it's a matter of measuring the increase in the number of reads during a
> navigation session (possibly during the Tp5 test) and seeing whether it's
> substantial or minimal. The former case may be hard to accept, but that
> clearly depends on our usage; it's possible we rarely hit the same pages,
> in which case it won't hurt. This is measurable, so it's an easy problem to
> solve. We likely reuse the same pages within a short span of time (changes
> and reads of the same data usually happen close together), so dropping the
> pages after a certain delay might both reduce memory usage and avoid extra
> reads, but it means more code complication.
Marco, are you able to do those measurements? They're not time-based measurements, unfortunately, but should give us a good general idea of the impact.
Another idea for evaluating is to use telemetry. Telemetry already records total SQLite memory usage and the time taken for lots of SQLite operations. So we could land this patch temporarily, wait however long it takes to get reasonable telemetry results (a week?) and then back it out and evaluate. Does that seem reasonable? We still have over a month left in the Fx11 development period.
(BTW, the attached patch obviously couldn't be landed permanently as-is, this behaviour needs to be given a proper option in SQLite. But Richard needs to know if we really want this feature before doing that.)
(In reply to Nicholas Nethercote [:njn] from comment #0)
> Marco, are you able to do those measurements? They're not time-based
> measurements, unfortunately, but should give us a good general idea of the
I'm able to (or at least I suppose so; I haven't used xperf for some time, so I'll need a refresher), but the question may be finding the time.
> Another idea for evaluating is to use telemetry. Telemetry already records
> total SQLite memory usage and the time taken for lots of SQLite operations.
> So we could land this patch temporarily, wait however long it takes to get
> reasonable telemetry results (a week?) and then back it out and evaluate.
> Does that seem reasonable?
This wouldn't give us good data: we need a well-known, fixed testcase to measure the variation in reads, and telemetry is a moving target that isn't really suited to this kind of micro-benchmarking. Tp5 is possibly a good target.
> (BTW, the attached patch obviously couldn't be landed permanently as-is,
> this behaviour needs to be given a proper option in SQLite. But Richard
> needs to know if we really want this feature before doing that.)
Right, my first thought is that if we could find a solution to the over-reads issue, we should try it (and someone around here may have ideas!). It might be solved by:
- micro-optimizing the boundaries check code in sqlite.c
- modifying jemalloc to be able to allocate 2 adjacent areas
- modifying jemalloc to be able to allocate an area with some padding at the end
- other crazy ideas?
Doing one of these would mean roughly a 0-3% loss and a 10-50% gain.
The approach in this bug has a larger gain, but also a possible 5-10% loss, and it will certainly increase I/O (the question is by how much, hence the need to measure carefully), which is something we are fighting against on the perf side.
That said, I'm not saying we should drop this idea; if the performance data is acceptable we should really do it, since it's a large gain. It just needs more careful testing than a solution to the over-reads issue would require.
Do we have the capability to do a telemetry A/B test? I saw some code floating around for this a while ago.
It seems to me that this would be a valid experiment. By testing both with and without this change at the same time, we avoid the problem that nightly is a moving target.
Maybe the results would be too noisy to derive useful conclusions from, but that shouldn't keep us from trying...
Unless we land nothing that affects a database, or a component using a database, for a week or so, I don't see how we would avoid noise. Also consider that different pages, or different searches in the location bar, for example, will produce completely different numbers of reads.
Anything we land will affect both the A and B set in the same way, right? So I don't understand how this is a problem.
There will obviously be random variation between users, but the idea is that this should even out via the law of large numbers.
Actually, there may be a simpler solution that anyone can run: we have a VFS shim (the one telemetry uses) that accumulates read counts, so it would just be a matter of dumping the value on shutdown. Tp5 can be run from standalone Talos.
Even without A/B testing I'm hoping that the memory reduction will be large enough to be conclusive. Evaluating the speed impact would be harder, yes.
> Right, my first thought is that, if we (and someone here around may have
> ideas!) could find a solution to the over-reads issue, we should try it.
> That may be solved either by:
> - micro-optimizing the boundaries check code in sqlite.c
> - modifying jemalloc to be able to alloc 2 adjacent areas
> - modified jemalloc to be able to alloc an area with some padding at the end
Modifying jemalloc doesn't sound doable to me. The whole reason that allocators round up to sizes like powers of two is that it makes them much faster and simpler. If jemalloc were able to allocate adjacent areas or padding, we wouldn't have this clownshoes issue in the first place -- we wouldn't need to split the page from the header, we'd just allocate 32KB+epsilon bytes.
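To make the rounding argument concrete, here is a small sketch assuming a simplified allocator that rounds every request up to the next power of two (real jemalloc uses finer-grained size classes, and the 40-byte per-page header is a made-up number):

```python
def next_pow2(n):
    """Smallest power of two >= n (toy model of allocator rounding)."""
    p = 1
    while p < n:
        p *= 2
    return p

# Allocating page + header in one chunk wastes nearly half the block,
# because page sizes are themselves powers of two.
for page in (1024, 4096, 32768):
    request = page + 40            # hypothetical per-page header
    granted = next_pow2(request)
    slop = granted - request
    print(page, request, granted, slop)
```

Under this model, a 32KB page plus a small header is rounded up to 64KB, so the slop is almost a full page. That is why the page and the header have to be allocated separately, and why "just add padding" would defeat the rounding that makes the allocator fast.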
(In reply to Nicholas Nethercote [:njn] from comment #7)
> Modifying jemalloc doesn't sound doable to me.
Alternative crazy idea: rather than allocating pages one by one, SQLite could allocate space for two at a time and check boundaries only when using the second one. The unused space would then be just one page per odd allocation, and the boundary-check cost would be halved.
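As rough arithmetic for this pair-allocation idea (purely illustrative; the page count is hypothetical):

```python
# Toy accounting for allocating cache pages in pairs instead of singly.
page = 32768
pages_needed = 101                       # hypothetical odd page count

one_by_one_allocs = pages_needed         # one malloc per page
paired_allocs = (pages_needed + 1) // 2  # one malloc per pair
unused_pages = paired_allocs * 2 - pages_needed

# 51 mallocs instead of 101, and at most one page of the pair unused.
print(one_by_one_allocs, paired_allocs, unused_pages)
```

The worst-case waste is bounded at one page regardless of cache size, and half as many allocations also means half as many boundary checks, which is the trade-off being proposed.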
BTW, if you want to try the telemetry thing, I won't object. I just think it won't be conclusive, and we'll still need to spend time on a test run against a fixed testcase.
We have xperf setup on talos, we should do a couple of runs with it on and then compare output.
I think A/B testing with telemetry could give us useful data. However, unless the win is very significant, we might not see it given our current telemetry rates (300-400 Nightly pings a day; bug 683340 should increase that 5-10x).
We shouldn't be trying to throw this in the tree until we understand what this costs us.
The suggestion was that one way we could understand how much this costs us is by throwing it in the tree.
Are you explicitly opposed to running a controlled experiment using telemetry on nightly or aurora?
(In reply to Justin Lebar [:jlebar] from comment #11)
> The suggestion was that one way we could understand how much this costs us
> is by throwing it in the tree.
...without doing any basic investigation first. It isn't hard to get these counts using xperf (we can even see how much more main-thread I/O happens). That's the low bar I'm asking for.
> Are you explicitly opposed to running a controlled experiment using
> telemetry on nightly or aurora?
I am if we aren't going to do some basic work first to verify that we aren't going to make the nightly experience crap for some people. Increasing the amount of disk I/O (especially on the main thread) is just going to add jankiness to the build. I want to understand what the change is going to be first.
> I've been running with this patch for a few days, I haven't seen SQLite
> total memory consumption in about:memory go above 3.2MB, when previously it
> would often be 2, 3, 4x that.
FWIW, I ran today without the patch and SQLite total usage was usually around 14 or 15MB. So that's a ~4.5x improvement on my unexceptional profile.
(In reply to Nicholas Nethercote [:njn] from comment #13)
> FWIW, I ran today without the patch and SQLite total usage was usually
> around 14 or 15MB. So that's a ~4.5x improvement on my unexceptional
I don't think there's any need to sell the gain; knowing how SQLite handles its cache, we are very aware of the possible gain, but we are not well aware of the possible loss. Even saying this will improve things by a million percent won't make the bug go any faster.
That said, I think we may decide to go this way on Mobile even with a perf regression, since the databases there are more limited and memory is a big concern. (Though Mobile is actually moving to its own history/bookmarks implementation using the system SQLite, which certainly won't get these improvements; its memory usage will likely be "hidden", though, similar to how IE hides its usage in explorer.exe on Windows.)
Marco, I'm assigning this to you because you're the one who knows how to measure the number of reads, which is a pre-req for further experimentation (such as temporarily landing the patch in Nightly builds). Or if the measurements are easy and you want to tell me how to do them, I could give it a shot.
(In reply to Shawn Wilsher :sdwilsh (Vacation Dec 17 - Jan 5) from comment #12)
> (In reply to Justin Lebar [:jlebar] from comment #11)
> > The suggestion was that one way we could understand how much this costs us
> > is by throwing it in the tree.
> ...without doing any basic investigation first. Using xperf isn't hard to
> get these counts (we can even see how much more main thread I/O happens).
> That's the low bar I'm asking for.
Shawn, can you describe exactly what you're recommending be measured with xperf here? E.g., what steps should be taken with the browser in order to properly exercise this change, and what xperf measurements should be taken during that time?
FWIW, I'm looking at this today, though I'm not sure how far I'll get since I have the flu. But I've set up xperf and Talos, so I can probably start the measurements. Talos tp_dist may be fine for testing changes in common browsing; some additional tests of location bar performance may be welcome. I'm not sure what else we might test.
Regarding the documentation sources:
I'm not sure how to distinguish I/O per thread; I don't remember whether xperf does that automatically.
I think the easiest way to do this is via telemetry. Since this bug was opened, we've landed slow-SQL telemetry.
Saptashi is doing a similar analysis right now, using SQL data, on the impact of running SQLite in async mode on mobile.
We were able to see clear differences in disk perf from enabling the compressed cache on mobile. I expect similar success with SQL.
I disagree with using telemetry, for the simple reason that these kinds of measurements need to be made on fixed data to allow a good comparison.
I have now finished the first round of testing, and unfortunately I'm negatively surprised. I tested using Talos tp_dist, which should be a decent evaluation of the impact on common browsing. After the first measurements I was puzzled, so I repeated the tests a couple more times.
Without the patch, xperf accounts for about 32 thousand read operations generated by firefox.exe. I've seen that many of those go to SQLite files.
With the patch, the count grows to about 290 thousand operations.
I had planned other tests, like importing/exporting a large bookmarks JSON or executing a fixed list of queries in the awesomebar, but at this point I'm not sure it's worth spending more time on them, considering how bad the result from simple browsing testing is.
As for the procedure, I used xperf with
I used standalone Talos 2.1, with all tests disabled except tp_dist.
In the xperf viewer I used the File I/O Types histogram, opened its Summary Table, and checked the larger firefox.exe entry (Tp starts the process twice; the first instance just gathers the screen size).
On a side note, while I had this set up, I measured the effect of having Places fall back to the default cache size, and I didn't notice any interesting variation in the numbers (only some tens of additional reads). So I'll file that as a separate bug; it will be another way to gain back some MB on high-end systems.
I think a better way to handle this memory usage would be to hook up the new memory-shrink pragma so that we run it on idle and at other hooks. (There was once a trimOnMinimize pref to reduce memory when the browser window was minimized; it may make sense to apply that concept to this case?)
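For reference, the pragma in question is SQLite's PRAGMA shrink_memory. Here is a minimal sketch using Python's sqlite3 module; the in-memory database, table, and the on_idle hook are hypothetical stand-ins for whatever idle notification the browser would actually use:

```python
import sqlite3

# Build up some cached pages by doing a little work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
conn.execute("SELECT count(*) FROM t").fetchone()

def on_idle(connection):
    # Ask SQLite to free as much heap memory as it can for this
    # connection; a no-op on SQLite versions without the pragma.
    connection.execute("PRAGMA shrink_memory")

on_idle(conn)
```

This gets the best of both worlds in principle: clean pages stay cached during active use, but the memory is handed back whenever the browser decides it is idle or under memory pressure.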
Filed bug 717575 on the Places cache reduction; I'll do further measurements related to that there.
Marco, thanks for running that through xperf.
> I disagree with using telemetry, for the simple fact these kind of measures require to be done on
> fixed data to be able to have good comparison.
I want to make the general point -- if only so I can point others at this comment in the future -- that this kind of measurement is something we absolutely can do with telemetry. I don't mean to say that the xperf testing Marco did was insufficient or suboptimal, but I want to refute the notion that this kind of question is unsuited for telemetry.
The methodology of A/B testing, where we
* Randomly assign each machine to group A or B;
* Let group A have the change, while group B doesn't;
* Observe what happens to groups A and B;
is scientifically valid *even if each person experiences something different*. Let me say that again: We do *not* need to ensure that groups A and B browse to the same sites, or have similar hardware, or anything like that.
We can get away with this because whether a person is in group A or B is independent of her hardware or browsing habits; we assigned the groups at random. We use the law of large numbers to establish that the mean of each group's measurements filters out all this noise and is a meaningful representation of the difference between groups A and B.
You may need a large number of people in order to get useful results, but in statistics, 1000 people is a fairly large group, and we can collect that many samples in a few days on the nightlies.
Again, it's not true that you need "fixed data in order to have a good comparison" for this kind of experiment. If we have the right telemetry probes, this experiment can be just as rigorous as any medical or otherwise scientific study.
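The point can be sketched with a toy Monte Carlo simulation. Each simulated user has a wildly different baseline read count (standing in for heterogeneous hardware and browsing habits), and the patch multiplies reads by an assumed factor; the group means still expose the effect. The 9x factor and the lognormal noise model are assumptions for illustration only:

```python
import random

random.seed(42)
EFFECT = 9.0                      # hypothetical read-count multiplier

def sample_user(patched):
    # Very noisy per-user baseline: users differ by orders of magnitude.
    baseline = random.lognormvariate(10, 1)
    return baseline * (EFFECT if patched else 1.0)

n = 1000
group_a = [sample_user(True) for _ in range(n)]   # with the patch
group_b = [sample_user(False) for _ in range(n)]  # without it
ratio = (sum(group_a) / n) / (sum(group_b) / n)
print(round(ratio, 1))            # lands near the true 9x effect
```

No user in either group saw the same workload as any other, yet the between-group comparison recovers the effect, which is the law-of-large-numbers argument above in miniature.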
Sure, it depends on the numbers; we would do this on Nightly, which may not have enough users.
Also, as noted above, doing this to a large number of users would increase their read counts by about 9x, making their browsing pretty awful.
I agree with the randomization bit: there will be no qualitative difference between the two groups. Sample sizes of a few hundred will do, too, and medians/means will be representative (for Nightly users, at least). But can one do case-control studies in Telemetry? My understanding was that giving p% of users a feature, and (1-p)% not, is not possible in current versions of telemetry.
I don't plan to investigate this further at the moment, so I'm unassigning myself. But if anyone wants to take measurements and needs help setting up the environment, feel free to ask; it shouldn't be hard at all using the links in comment 17 and https://wiki.mozilla.org/StandaloneTalos
> I think a better way to move with this memory usage would be too hook the
> new memory shrink pragma so that we run it on idle and other hooks (once
> there was the trimOnMinimize pref to reduce memory when the browser window
> was minimized, may make sense to apply that concept to this case?)
We have bug 411894 which is about doing this on the memory-pressure event, and bug 681525 which is about doing this when inactive.
Thanks for doing this measurement, Marco!
> Without the patch xperf accounts for about 32 thousands read operations
> generated from firefox.exe. I've seen many of those go to sqlite files.
> With the patch the count grows up to 290 thousands operations.
That does sound bad. Maybe we should WONTFIX this. jlebar, what do you think?
This bug is a MemShrink:P1 because I wanted to know if this big easy win was possible, and if the answer is "no" that's ok.
> jlebar, what do you think?
Is there no way to do this halfway? If not, wontfix sgtm.
Oh well. Thanks for the patch, Richard, and thanks for the testing, Marco.