Open Bug 484646 Opened 15 years ago Updated 2 years ago

some day gloda may want to VACUUM or auto-VACUUM

Categories

(MailNews Core :: Database, defect)

defect

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

References

(Blocks 4 open bugs)

Details

(Whiteboard: [no l10n impact])

per dmose, from the gloda code review:
'''
presumably, the database needs to have some sort of vacuuming strategy, whether that's simply using autovacuuming or periodically issuing vacuum commands.  This seems like it's probably a blocker for Tb3.  In a recent post, sdwilsh said:

"There is almost never a good time to VACUUM, except for maybe when the user is idle.  In 1.9.1, we've added an asynchronous API for statement execution, and if you use that all the time, it shouldn't be noticeable to the user that you are doing a VACUUM."
'''

As I see it, there are two reasons to vacuum:
1) Re-claim wasted disk space.
2) Make the database perform better.

uno)

Ignoring full-text indexing, the first case assumes that the amount of e-mail a user has will be strictly decreasing.  If the amount of mail they have remains constant or increases, SQLite will reuse the newly freed space at some point anyways.  The obvious exception would be the case where a user gets an insane amount of e-mail and then deletes it all (or otherwise causes it to appear deleted to gloda; for example, by tagging it as spam).  One might argue that the latter is an exceptional case and should ideally be mitigated by the spam detector hopefully working.

Unfortunately, the full-text indexing engine FTS3 uses a segmented structure for its indices and deletion is not really deletion.  When a document is deleted (or updated, which we avoid doing), what amounts to 'tombstones' are added to a new segment that will eventually knock out the original data when segments are merged.  The segment merging algorithm is tenuously analogous to generational garbage collection; recently inserted documents belong to segments that get merged more frequently than documents inserted a long time ago.  This means that a usage pattern where the user mainly deletes new e-mails will work out well.  The pathological case would be a user who expires their oldest messages; the 'carrying cost' of deleted messages would continually increase.  (Please read the nice comment at the top of fts3.c for more info.)  So, a vacuum would be a good thing in the face of many deleted messages.


dos)

The second case begs the question of whether vacuuming makes the database perform better.  Per https://bugzilla.mozilla.org/show_bug.cgi?id=385834#c20, auto-vacuuming can result in reduced performance because it will tend to increase disk fragmentation.  Ignoring FTS3, a straight-forward vacuum would likely decrease both disk fragmentation and internal fragmentation and be a net win even if a user never deleted any messages.

From an FTS3 perspective, I believe the segment merging logic is deterministic (read: would be indistinguishable from the original insertion pattern given that the vacuum is basically an INSERT INTO blah FROM SELECT *) and there is no way to force a merge to a single segment.  So in a use case where no messages were ever deleted, the VACUUM would have no benefits above the potential for a reduction in disk fragmentation.  With a meaningful amounts of deletion of messages with terms the user is likely to search on, the vacuuming would avoid processing doclists that will later be knocked out in a result merge for a win.


thoughts)

My vague strategy/plan has been "let's add a button that can kick off a VACUUM".  I feel like this is still not a bad plan.  The nature of the gloda database is such that it is not a small thing.  A VACUUM is going to be a lengthy affair.  While the VACUUM is running, gloda is not going to be able to do anything.  In a world where only gloda search exposes gloda, this isn't horrible, but for a user using exptoolbar or successors, that would suck.

Probably the biggest problem is that a VACUUM is going to run flat-out sucking up CPU and I/O as much as it can.  (As I read the code, the VDBE op is atomic as far as the call from mozStorage goes, so we cannot compel it to run slower other than cranking the mozStorage async thread priority all the way down.)  So if run while the user is around, things are likely to suck on slow machines.  We should be able to abort the VACUUM, so we could attempt to run a VACUUM and give up if the user comes back or tells us they want to use gloda, but this could result in some fairly pathological behaviour.

It's also worth noting that because VACUUM does all its work by creating a new database and copying things over, you need free space to get space back.  We could attempt to mitigate that by enabling autovacuum but not having it automatically run and running an incremental vacuum to first shrink the database and then do a full VACUUM to get the speed benefits.

So, in general, I feel like automatically trying to VACUUM is not going to work out.  It's hard to predict how long a VACUUM will take, and it's even harder to predict how long a user will be idle for.  And that's ignoring the serious problem of knowing when it's a good time to royally thrash the user's machine.

If some users end up wanting to VACUUM periodically, it should be short work for an extension to automatically push the VACUUM button for them on a schedule or what not.
Flags: blocking-thunderbird3?
From a user experience perspective, I think the button would just be confusing.  I'd almost say this is add-on territory for the few cases where it is, in fact, a problem.  We've been contemplating the same thing for places.
What's blocking is figuring out next steps for this bug.  And I suspect asuth can drive that better than anyone.
Assignee: nobody → bugmail
Flags: blocking-thunderbird3? → blocking-thunderbird3+
It's a unclear to me from reading comment 0 whether (and why) it would be a bad-strategy to simply enable autovacuum.  Enlightenment would be super-helpful.
(In reply to comment #3)
> It's a unclear to me from reading comment 0 whether (and why) it would be a
> bad-strategy to simply enable autovacuum.  Enlightenment would be
> super-helpful.

As I understand it, autovacuum enables the file to shrink.  As pages are emptied, pages with data from the end of the file are moved into the empty holes and the file is truncated.  (This is enabled by having a page know what pages point to it; then sqlite can look at the last page in the file, find out who references it, move the page, and re-write the references to the new location.)

An expected side effect of shrinking the file is that the next time the file has to grow, the filesystem may allocate new disk blocks that are not contiguous with the existing file.  While some filesystems may be awesome, the churn of autovacuum is more likely to increase fragmentation that not doing so.  (And even if normal growth patterns are fragmentary, a non-autovacuuming file will benefit from a disk defragmentation program whereas the autvacuumer would fight it.)

My assertion from comment 0 is that people are always getting new e-mail, so the file will tend to grow even if people periodically delete a lot of data.  Since it will continue to grow, it will eventually use the free-list that autovacuuming would turn into disk space.

Also, autovacuum does not help us with FTS3 cruft issues.
Thanks, that's very helpful.  I suspect this factors to some degree on the type of user in question (eg pack-rat who likes to keep stuff vs. person who doesn't).  

Presumably, large deletes are most likely to happen when a user bumps up against a disk-full condition, which case they might _really_ need every bit of space they can get.  That might suggest the button you mention, though I share Shawn's concerns about it.  This would seem to tie in closely with the disk-space-limitation UI bug.

All that said, the archiving model certainly seems likely to push many users in the direction of growth being the norm.
Whiteboard: [no l10n impact]
Target Milestone: --- → Thunderbird 3.0rc1
Removing blocking of Thunderbird 3.  I'm with sdwilsh that this is extension fodder for the time being.  Especially since Thunderbird 3 uses Mozilla 1.9.1 whose asynchronous storage is not quite as asynchronous as one would like, and a VACUUM on a gloda database of a size where it would have any use is pretty much guaranteed to lock up Thunderbird for a good deal of time.  We could engineer around that, but it would be a bad idea given that it would be very ugly and when we switch to 1.9.2 it will become moot.
Assignee: bugmail → nobody
Flags: blocking-thunderbird3+
Summary: gloda needs SQLite VACUUM strategy → some day gloda may want to VACUUM or auto-VACUUM
Target Milestone: Thunderbird 3.0rc1 → Thunderbird 3.1a1
I don't know how to enable same tokenizer used by Tb in Firefox+SQLite Manager or SQLite batch, although I could use standard tokenizer of FTS3 extension as dummy/faked tokenizer of global-messages-db.sqlite in order to stop "module not found error" in Query only.
If someone can add following function to GlodaQuilla, we can use it for Gloda related performance problem analysis which is reported by general users.
(1) Disable all activities in Tb which requires Gloda's *.sqlite or other *.sqlite files.
(2) Do VACUUM on global-messages-db.sqlite and other *.sqlite files.
(3) Report statistics of SQLite DBs, for "Before VACUUM" and "After VACUUM".
This can be thunderbird.exe -Do_VACUUM_Operation_Only "x.sqlite, y.sqlite, ..." like one.
Is someone tell me "how to use Gloda's tokenizer in Firefox+SQLite Manager or SQLite batch", I can do test of it, because I'm silighly much famliar with SQL and Relational Database than other people.
(In reply to WADA from comment #8)
> Is someone tell me "how to use Gloda's tokenizer in Firefox+SQLite Manager
> or SQLite batch", I can do test of it, because I'm silighly much famliar
> with SQL and Relational Database than other people.

No easy way to get the tokenizer in Firefox.  You should be able to run SQLite manager in Thunderbird though, probably?
(In reply to Andrew Sutherland (:asuth) from comment #9)
> No easy way to get the tokenizer in Firefox.
> You should be able to run SQLite manager in Thunderbird though, probably?
If so, I prefer "run Tb under Firefox, as done on MailNews & Browser by SeaMonkey" == new SeaMnokey == new "MonkeyBird" :-)
I'll try to issue SQL statement to Gloda DB in add-on for Tb. If SQL statement can be issued, putting Query result to Error Console is pretty easy.
FYI.
How to access following two fts3 VIRTUAL TABLEs with SQLite Manager+Firefox or SQLite3.exe.
  CREATE VIRTUAL TABLE conversationsText   USING fts3(tokenize mozporter, subject TEXT)
  CREATE VIRTUAL TABLE imConversationsText USING fts3(tokenize mozporter, content STRING)
(1) Connect to backup of global-messages-db.sqlite which is kept while Tb is not running.
(2) select hex(fts3_tokenizer('porter'));
    => 1C94C000 is returned (hexadecimal address depends on database)
(3) select fts3_tokenizer('mozporter',X'1C94C000');
By this, "mozporter not found" error is bypassed, so TABLEs are accessed without SQL error.
See http://code.google.com/p/sqlite-manager/issues/detail?id=444 (reporter is me)
Keep in mind that the stock porter tokenizer works differently than the modified porter tokenizer Thunderbird uses, so any numbers gathered that way are not going to be representative.

The way VACUUM works is fairly naive.  While it should result in less fragmentation, simply deleting a global-messsages-db.sqlite that has been in use for a while and letting gloda regenerate it should produce comparable output.  The exception is if gloda somehow was broken and failed to delete messages or failed to index messages, the effective contents of the databases may differ after the re-indexing.
See Also: → 632791
Blocks: 1023000
Target Milestone: Thunderbird 3.1a1 → ---
Blocks: tbbigfolder
Blocks: 1567764
Depends on: 1574093
Blocks: 1574093
No longer depends on: 1574093
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.