Closed Bug 870171 Opened 11 years ago Closed 11 years ago

Expire old data

Categories

(Firefox Health Report Graveyard :: Client: Android, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 27

People

(Reporter: rnewman, Assigned: mcomella)

References

Details

(Whiteboard: [qa+])

Attachments

(5 files, 5 obsolete files)

      No description provided.
It is my understanding that we generate documents from the past six months [1]. Would old data be considered anything older than six months?

[1]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/background/healthreport/upload/AndroidSubmissionClient.java#106 (Also, I think this should probably be a constant, rather than MILLISECONDS_PER_SIX_MONTHS).
Assignee: nobody → michael.l.comella
Status: NEW → ASSIGNED
Yup!
https://github.com/mozilla-services/android-sync/pull/335

f? - Specifically, I'd like to know if my deleteEvents call (part 3) is in an appropriate location and called with an appropriate frequency (i.e. should a query be made on every upload, as I have done?) before I start writing a new file to test AndroidSubmissionClient. The remaining parts can be looked over in the actual review if you'd like.
Flags: needinfo?(rnewman)
Don't forget to prune things other than events.

On every upload is probably too often. Consider that we might upload several times a day. I suggest doing it *after* upload, and not every time -- perhaps once a week? I leave it to you to figure out the best way to do periodic cleanup.
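To make "periodic" concrete, here is the sort of time-gated check I mean -- a sketch only; the preference name, interval, and cleanUp() helper are all made up, not real code:

    // Sketch only: gate cleanup on elapsed time, e.g. run from the upload path.
    private static final String PREF_LAST_CLEANUP = "healthreport.lastCleanup";
    private static final long CLEANUP_INTERVAL_MSEC = 7L * 24 * 60 * 60 * 1000; // One week.

    public void maybeCleanUp(final android.content.SharedPreferences prefs, final long nowMillis) {
        final long last = prefs.getLong(PREF_LAST_CLEANUP, 0L);
        if (nowMillis - last < CLEANUP_INTERVAL_MSEC) {
            return; // Cleaned up recently; skip this cycle.
        }
        cleanUp(); // Prune events, orphaned add-ons, old environments, etc. (hypothetical helper).
        prefs.edit().putLong(PREF_LAST_CLEANUP, nowMillis).commit();
    }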

Consider also that we have a few other cleanup things we might want to do:

* Rolling up or pruning session data. Bug 885650.
* Pruning and vacuuming the database when it gets too big. (If you're hitting thousands of events, we probably want to prune sooner than 6 months.)
* Periodic vacuuming -- expensive, but possibly necessary to defeat fragmentation.
Flags: needinfo?(rnewman)
Note to self: prune orphaned (and NULL?) addons too.
Closing old PR due to a large rewrite. Here is the new link:

https://github.com/mozilla-services/android-sync/pull/344
f? - There is more coming (see my GH comment for details) but I figured this would be easier to review in chunks. Note the PR link has changed.
Flags: needinfo?(rnewman)
f+, keep at it.
Flags: needinfo?(rnewman)
Here is my periodic cleanup and pruning plan. Richard, can you verify this is reasonable?

## Summary
For an estimate of 2.25 MB of FHR data, we should prune when data is older than
six months, or exceeds a count of 31,000 events.

Pruning should occur weekly and vacuuming should occur either monthly, or when
over 4200 events have been deleted since the previous vacuum.

## What to prune
My profile (which probably reflects just-under-average usage) is 44 days old
and has 976 events, an average of 22 events/day. These events are session records
and search counts. For the other tables, I have 2 measurements, 5 fields, 5
addons, and 10 environments.

Since fields and measurements can only be deleted when they are not in use,
they are not ideal to analyze for pruning. On mobile, addons and environments
will also be fairly static and are comparatively negligible to events entries
so we will only analyze events for our pruning.

Overall, the size of the data now and, naively assuming constant growth, after
a 1/2 year, and 1 year (for a total of 8030 events), is...

Data type            | 44 days  | 1/2 year | 1 year

Events               | 48 kB    | 199 kB   | 398 kB
Events & its indices | 72 kB    | 298 kB   | 597 kB
All                  | 120 kB   | 498 kB   | 995 kB

However, this is for 2 measurements (as opposed to the 12 currently available,
and others that will presumably be added). We will assume the usage per
measurement is average (though, granted, these measurements may not contain as
much text or may be taken less frequently and vice versa) and multiply the
event storage sizes by 6 for these additional measurements (and keep the size
of the remaining content constant), which is...

Data type            | 44 days  | 1/2 year | 1 year

Events               | 288 kB   | 1194 kB  | 2388 kB
Events & its indices | 432 kB   | 1788 kB  | 3582 kB
All                  | 480 kB   | 1988 kB  | 3971 kB

That's 132 events/day, 5856 events/44 days, 24090 events/(1/2) year, and 48180
events/year.

We really only upload data at most 6 months old, so we can start pruning data
after ~24090 events are captured, which is a DB size of 1988 kB (~1.94 MB),
which seems reasonable compared to the 51.6 MB of data my Fennec currently
takes up (not including the apk size).

However, I stated my usage is probably just under average so we'll likely want
to raise this limit for power users:

n = (c - 200) / k
where n = number of events
      c = cap in kB
      k ~= 0.074 kB/event (solved using the 1.94 MB estimate above)

Samples:
  2.25 MB ~= 31067 events
  2.50 MB ~= 34519 events
  2.75 MB ~= 37971 events
  3.00 MB ~= 41423 events

My choice would be a 2.25 MB cap, which means we prune the oldest events once
the count exceeds ~31,000 (or, of course, prune any events older than six
months).

## Frequency
If we pruned weekly, we would recover, on average (assuming the 6x multiplier
above), ~68.7 kB (432 kB / 44 days * 7 days), which is ~924 events (132 events *
7).

If we vacuum monthly, that's ~294.5 kB (432 kB / 44 * 30 days), which is
~3960 events (132 events * 30).

We should also have a condition to vacuum if a certain number of
events are deleted through weekly pruning, to prevent issues with excessive
usage. Since the expected average usage per month is ~3960 events, we can make
this value 4200 events.

## Unknowns
* Performance: I'm unsure how to account for the degradation of insert/query
performance as the database grows, and whether pruning to fewer events is
worthwhile as a performance boost.
* Indices: Do these actually grow at a constant rate? Probably worse, because
they are layered (date, env, field).
* Document upload size: 2.25 MB of event data seems like a lot to upload daily,
especially if the generated document is not aggregated (see bug 885650).
Flags: needinfo?(rnewman)
I'm generally in favor of very aggressive pruning.

Consider:

* At-least-linear space and cost of document generation due to number of rows. We walk a cursor to spit out document fragments; the fewer rows, the less work we do, and the fewer rows, the fewer days and environments we have to process. Killing an event row will kill more than one event's worth of space: perhaps it'll kill a whole day and a whole environment, which is a big win with little cost.

* At-least-linear cost of including old environments: they have their own section in the doc as well as contributing to each day's data.

* Database fragmentation, inserts, indices. You mention these, and yes, they're a concern, particularly on smaller phones.

* Compound wire size. That is, if you hang on to just a 1KB day's data for six months rather than three, you're spending 90KB on extra wire data (plus the extra environments you can't discard). This is compressed, but still, it's spending someone else's money.


So my feeling is that:

* Six months is too long on Android, at least for session data, just due to sheer volume. Bug 885650 again.

* And we get compounded gains the more aggressively we collect, so we should *really* aggressively consider cutting down on data -- perhaps by maintaining alternative aggregates locally (e.g., do we really need six months of daily version history, when we could have a simple JSON timeline encoding do the job in 1/50th the space?).

* Retaining more data uses more than linear space, but does not (IMO; jjensen might disagree!) linearly improve our ability to do analysis. This is particularly true when you consider FHR as a user-facing value prop: would we really spend 4MB of a user's disk space on a feature with this much benefit, when 2MB offers almost as much?


In other words, we should really look at these numbers and say "seriously, we're suggesting storing multiple MB and 25,000 database rows in order to show accurate search counts from March when you open about:healthreport?".

The pruning point should be set where this impact trades off against the maximum user value.


My broad-stroke feeling is to set hard limits like:

20000 events?
 ... a number that we should tweak by timing synthetic document generation on lowish-end phones for varying DB sizes. If we're hammering your shitty phone flash for 10 seconds, it's too much.

50 environments?
 ... just because of sheer payload overhead. If we can figure out a scheme to discard these intelligently...

3 months (or 2 months?)
 ... because data older than that is of less value, right?

When Fig merges it should dramatically reduce the number of sessions stored -- right now each Awesomebar access is a session split. But even so...


Regarding frequency: perhaps do the routine date-wise prune every week, and check for event/env count every day? Include a hysteresis curve (prune down to 18,000 when we hit 20,000) to avoid bouncing every day. 
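In code, the hysteresis might look something like this (sketch only; the storage accessors are hypothetical, and the constants are just the numbers above):

    // Illustrative only: high-water/low-water marks so a prune buys several days of headroom.
    private static final int MAX_EVENT_COUNT = 20000;   // High-water mark.
    private static final int PRUNE_EVENT_COUNT = 18000; // Low-water mark.

    public void maybePruneBySize(final HealthReportStorage storage) {
        final int count = storage.getEventCount(); // Hypothetical accessor.
        if (count <= MAX_EVENT_COUNT) {
            return;
        }
        // Delete the oldest rows until we are back at the low-water mark.
        storage.deleteOldestEvents(count - PRUNE_EVENT_COUNT); // Hypothetical helper.
    }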

We shouldn't limit pruning because we're afraid of an expensive delete: that row needs to be deleted at some point. It might as well be now, so we save N days' document generations!

Regarding vacuuming: vacuum is expensive. (It's a full DB copy, basically.)

As such, we should really measure to decide when to do it: see option 3.

http://lzone.de/How%20to%20Vacuum%20SQLite
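For reference, option 3 from that article boils down to a check like this (a sketch; the pragmas are standard SQLite, the surrounding helpers are made up):

    // Vacuum only when the free-page ratio crosses a threshold.
    private static long longForQuery(final android.database.sqlite.SQLiteDatabase db, final String sql) {
        final android.database.Cursor c = db.rawQuery(sql, null);
        try {
            return c.moveToFirst() ? c.getLong(0) : 0L;
        } finally {
            c.close();
        }
    }

    public static boolean shouldVacuum(final android.database.sqlite.SQLiteDatabase db, final float maxFreeRatio) {
        final long freePages = longForQuery(db, "PRAGMA freelist_count;");
        final long totalPages = longForQuery(db, "PRAGMA page_count;");
        return totalPages > 0 && ((float) freePages / totalPages) > maxFreeRatio;
    }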

jjensen: what are your feelings about aggressive pruning on Android? Would you consider 1 month, or 3 months, of retained data enough to deliver the user benefit of FHR? Does your answer change if we add some mechanism for long-term aggregation?

You still have an open needinfo on Bug 885650, so feel free to answer this one instead.
Flags: needinfo?(rnewman) → needinfo?(jjensen)
f? - Parts 6-11, which remove the extraneous Receivers and test as discussed on IRC.

https://github.com/mozilla-services/android-sync/pull/344

Concerns:
    1) I was unable to test AnnouncementsService. It shouldn't be making network requests, but the application does not change state without making one, so there is nothing observable to assert on if we want to verify whether a request would have been made (while still bailing out before any request actually goes out). There doesn't seem to be any way to insert a mock into the code, so without changing the main code there is no way to test AnnouncementsService. I'm not sure such changes would be worthwhile because they'll probably require additional testing.
    2) We store some intents in the AlarmManager and, while I checked that they exist, I was unable to check that these same intents will actually start the associated service - startService() only starts the Service that the current test case targets, and PendingIntent.send() does not seem to fire in a reasonable amount of time (I slept for 2 seconds, but that has its own problems).
    3) As far as I could find, there is no proper way to check that AlarmManager has a PendingIntent stored other than checking that the PendingIntent itself exists (ref: http://stackoverflow.com/questions/4713592/how-to-check-if-alarm-is-set); I've done this, but it seems like there should be a better way (see the sketch below). Additionally, to do this I had to add some code which is unnecessary in the main codebase, namely PendingIntent.cancel() calls (though this added code does not change the functionality), which also seems incorrect.
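For reference, the existence check in (3) amounts to something like the following (assumes android.app.PendingIntent and android.content.Intent; the target service class and request code are placeholders):

    // A matching PendingIntent already registered means the alarm is "set";
    // FLAG_NO_CREATE returns null instead of creating a new one.
    final Intent serviceIntent = new Intent(context, SomeBackgroundService.class);
    final PendingIntent pending = PendingIntent.getService(
            context, 0, serviceIntent, PendingIntent.FLAG_NO_CREATE);
    final boolean alarmIsSet = (pending != null);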
Flags: needinfo?(nalexander)
Hi all,

Sorry for missing the needinfo on this -- was on PTO for a while and am still going through bugmail.

Philosophically, I am not adamantly opposed to pruning before six months. However, I'd like to only act after a) a broader discussion with more stakeholders, i.e., the product owner, and b) a review of some of the existing FHR data we have to better see what the real-world situation is and what the effects could be. To that end, I've added Brendan Colloran to this bug.

What I suggest is that we a) come up with a better summary of what the likely growth rate of data is, similar to what Michael did in comment #9, b) come to a recommendation, and c) get buy-in from the product owner, Karen.

I'll send a note to the interested parties about this separately and update this ASAP.
Flags: needinfo?(jjensen)
Feedback provided at https://github.com/mozilla-services/android-sync/pull/335.
Flags: needinfo?(nalexander)
(In reply to Richard Newman [:rnewman] from comment #10)
> Regarding vacuuming: vacuum is expensive. (It's a full DB copy, basically.)
> 
> As such, we should really measure to decide when to do it: see option 3.

It seems the "auto_vacuum" pragma is enabled to full by default (on my Galaxy Nexus, at least in our tests). I would argue we don't want this behavior if we're going to be vacuuming regularly by hand anyway (over time auto_vacuum may increase fragmentation, reduce insert speed, etc. at the benefit of a smaller overall DB size).

Additionally, with the auto_vacuum pragma, "PRAGMA freelist_count" always returns 0 and thus we can't actually measure the free page ratio as per option 3 and we'd have to vacuum on time alone.

In order to disable it, we have to first vacuum the database and then set the pragma to 0 [1]. My solution: we can blindly do this after each vacuum so that existing DBs will be moved to auto_vacuum = 0 (off) after their first vacuum. With a DB version bump, onCreate can also run "vacuum; pragma auto_vacuum = 0;" before creating the tables, to disable this functionality for new databases.

Does this sound reasonable?

[1]: http://www.sqlite.org/pragma.html#pragma_auto_vacuum
Flags: needinfo?(rnewman)
Sounds reasonable to me!
Flags: needinfo?(rnewman)
It seems onCreate is called from within a transaction and so vacuum cannot be run on new databases without compensating. Options:
    1) Have a first run preference that is ticked true in onCreate(). onOpen will check for this preference and if true, set it to false, and change the pragma. This is gross because we need to give HealthReportDatabaseStorage a SharedPreferences instance and since we're overriding a method called from Android code, we can't just pass it in as a parameter.
    2) Treat newly created databases as existing databases and turn auto_vacuum off on the first vacuum since we're eventually going to vacuum anyway. The one downside is that the first vacuum has to be entirely time dependent (since we can't read the free page count) but we'd have to deal with that in (1) anyway for existing databases.

I prefer (2). (1) seems a bit hackish and more complex for what seem to be very minor gains for some portion of the users (although I haven't benchmarked it so I can't be sure).

What do you think?
Flags: needinfo?(rnewman)
2 for sure.
Flags: needinfo?(rnewman)
Attached patch 01: m-c Patch (obsolete) — Splinter Review
This patch aligns with Part 20 of the github PR: https://github.com/mozilla-services/android-sync/pull/344

Please also f? on Parts 13 - 20 and the 3 followups of the PR.

I have not yet completed the testing, as nalexander previously mentioned I was being a bit overzealous. I'm hoping to test:
    * HealthReportPruneService.isIntentValid
    * attemptPruneBy* and attemptVacuum work as intended (they prune when time has expired, do not when not expired)
    * BroadcastService correctly sets and starts the PruneService alarm.

Much of this I have already done for Upload and it shouldn't be too difficult to adapt here - I will do this after the feedback.

In the meantime, I have tried to test that the PruneService runs by hand but have been unable to fire the event reliably. I tried making the prune check interval very short and using PendingIntent.send() but neither seem to work properly. I once got an error (PruneService was not found) and another time it worked properly so I'm not sure what to think. Do you have any ideas?

Note also that the constants I added are not yet finalized - I still need to do performance testing for those values. The ones I included are just placeholders.
Attachment #795639 - Flags: feedback?(rnewman)
Note to self: test vacuuming on low-end devices to ensure we don't OOM every time it's called.
Comment on attachment 795639 [details] [diff] [review]
01: m-c Patch

Review of attachment 795639 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/GeckoApp.java
@@ +1565,3 @@
>                  GeckoPreferences.broadcastAnnouncementsPref(context);
>                  GeckoPreferences.broadcastHealthReportUploadPref(context);
> +                GeckoPreferences.broadcastHealthReportPrune(context);

We really have to unify these. At this point we're instantiating six service instances.

Also, let's not do this here. What do you think about doing this prune broadcast after a session ends in BrowserHealthRecorder?

::: mobile/android/base/GeckoPreferences.java
@@ +395,5 @@
>          // broadcast settings for all profiles).  See Bug 882182.
>          GeckoProfile profile = GeckoProfile.get(context);
>          if (profile != null) {
> +            intent.putExtra("profileName", profile.getName())
> +                    .putExtra("profilePath", profile.getDir().getAbsolutePath());

Nit: align the .putExtras.

@@ +442,5 @@
>  
> +    public static void broadcastHealthReportPrune(final Context context) {
> +        final String action = HealthReportConstants.ACTION_HEALTHREPORT_PRUNE;
> +        final Intent intent = new Intent(action)
> +                .putExtra("branch", GeckoApp.PREFS_NAME);

One line here is fine.

@@ +445,5 @@
> +        final Intent intent = new Intent(action)
> +                .putExtra("branch", GeckoApp.PREFS_NAME);
> +        fillIntentWithProfileInfo(context, intent);
> +        Log.d(LOGTAG, "Broadcast: " + action + ", " + GeckoApp.PREFS_NAME);
> +        context.sendBroadcast(intent, GlobalConstants.PER_ANDROID_PACKAGE_PERMISSION);

Code reuse between this fragment and your reworked broadcastPrefAction?
Attachment #795639 - Flags: feedback?(rnewman)
> (In reply to Richard Newman [:rnewman] from comment #20)
>
> We really have to unify these. At this point we're instantiating six service
> instances.

Unify the broadcasts or the background services or both?

> Also, let's not do this here. What do you think about doing this prune
> broadcast after a session ends in BrowserHealthRecorder?

Sounds reasonable to me - in theory, we shouldn't have such an epic browser session that we need to prune before the browser is backgrounded for the first time.
Attached patch 01a: m-c Patch (obsolete) — Splinter Review
Is this what you had in mind for the code reuse?
Attachment #795639 - Attachment is obsolete: true
Attachment #799876 - Flags: feedback?(rnewman)
Comment on attachment 799876 [details] [diff] [review]
01a: m-c Patch

Review of attachment 799876 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, pretty much, though I gently question the amount of logging work!
Attachment #799876 - Flags: feedback?(rnewman) → feedback+
(In reply to Michael Comella (:mcomella) from comment #21)

> > We really have to unify these. At this point we're instantiating six service
> > instances.
> 
> Unify the broadcasts or the background services or both?

Eventually, probably both. This is pretty much a channel for kicking off relevant background work, so we might start that by combining the broadcast receivers, then the broadcasts themselves, and then perhaps even the background services.

(Right now we issue three broadcasts, which starts three broadcast receiver services, then three background services.)


> Sounds reasonable to me - in theory, we shouldn't have such an epic browser
> session that we need to prune before the browser is backgrounded for the
> first time.

Exactly.
(In reply to Richard Newman [:rnewman] from comment #23)

> Yeah, pretty much, though I gently question the amount of logging work!

I was keeping what was there (without asking for info as arguments) - should I have reduced it at all?

(In reply to Richard Newman [:rnewman] from comment #24)

> Eventually, probably both. This is pretty much a channel for kicking off
> relevant background work, so we might start that by combining the broadcast
> receivers, then the broadcasts themselves, and then perhaps even the
> background services.

I feel it would be best to do this in a followup. Do you concur?
Attached patch 01b: m-c Patch (obsolete) — Splinter Review
Start prune service in onPause.
Attachment #799876 - Attachment is obsolete: true
(In reply to Michael Comella (:mcomella) from comment #25)

> I was keeping what was there (without asking for info as arguments) - should
> I have reduced it at all?

The mfinkle in me says "delete it".

> I feel it would be best to do this in a followup. Do you concur?

Oh hell yes.
Attached patch 01c: Patch (obsolete) — Splinter Review
Remove Logging statement.

Also, bug 913107 to unify BroadcastReceivers and Broadcasts.
Attachment #799891 - Attachment is obsolete: true
Attachment #800259 - Flags: review?(rnewman)
Comment on attachment 800259 [details] [diff] [review]
01c: Patch

Review of attachment 800259 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/GeckoApp.java
@@ +1980,5 @@
>  
> +        // In theory, the first browser session will not run long enough that we need to prune
> +        // during it and we'd rather run it when the browser is inactive so we wait until here to
> +        // register the prune service.
> +        GeckoPreferences.broadcastHealthReportPrune(this);

Let's move this right to the end of the method.
Attachment #800259 - Flags: review?(rnewman) → review+
Attached patch 01d: Patch (obsolete) — Splinter Review
Move r+; move PruneService start broadcast to end of the method.
Attachment #800259 - Attachment is obsolete: true
Attachment #800867 - Flags: review+
Comment on attachment 800867 [details] [diff] [review]
01d: Patch

And by "the end" I kinda meant "in the background runnable". Sorry for being unclear.

We should do as little work as possible in onPause; it blocks the next activity.
Attachment #800867 - Flags: review+
Attached patch 01e: PatchSplinter Review
Move Prune Service start broadcast to background runnable.
Attachment #800867 - Attachment is obsolete: true
Attachment #800949 - Flags: review?(rnewman)
Attachment #800949 - Flags: review?(rnewman) → review+
r? - github [1] parts 23-26. I abstracted the storage code away to a PrunePolicyStorage class as (generally) discussed on IRC. I am currently writing integration tests for the PrunePolicyStorage class and will follow those up with unit tests for attemptPruneBySize, attempt... etc.

[1]: https://github.com/mozilla-services/android-sync/pull/344
Add 27-29 to the review (they are the unit tests).
Flags: needinfo?(rnewman)
Proposed constants:

Time between...
    Prune by size - daily: We're not sure how much data we might collect in a day so it's good to ensure yesterday's collection was reasonable
    Expiration - weekly: This isn't urgent per se - this is not attempting to clean up excessive amounts of data, just unused data - so a week seems reasonable
    Vacuum - monthly: Honestly, no good reason, but I believe that's what desktop does [1]. Without writing an involved benchmarking suite, it may be hard to come up with good estimates. For a rough benchmarking test, I tried a series of inserts (10000) into a table with one integer column followed by a series of deletes (3000 random rows). I noticed (1) vacuuming took longer on a larger file size, despite being unfragmented (as a DB copy, this is to be expected) and (2) vacuuming 7000 events inserted directly was more or less the same as vacuuming 7000 events after 3000 were deleted (from 10000 events). Thus, with this naive usage pattern, it seems the vacuuming frequency doesn't matter too much. However, I imagine this would change with extensive fragmentation in a realistic usage pattern or with the real, more complex database.

Event expiration duration: Up to jjensen and how useful the information older than this value is. My database on Nightly is roughly 144K and three months old - I would cap this value at 3 months for now and discuss in bug 885650, as a followup, what this value should be (or whether older data should be aggregated instead).

freelist_count/page_count ratio limit - 0.1: Without spending considerable time writing a benchmarking suite, I'm not sure how to get this value, so I took the value from when vacuuming was first implemented on desktop [2]. However, after searching mxr, it seems like they don't use this methodology anymore [3] - would you happen to know why? Are we repeating a past mistake?

Max __ count / __ count after pruning...
    Events/Environments: I am going to benchmark how long document generation takes on both modern and crappy hardware with various numbers of these to determine the amount and report back.

Again, to save time, I neglected to benchmark the amount of time between vacuums and (with considerably more challenge) the freelist_count/page_count ratio limit. If this is worthwhile to be done before landing or as followup, let me know.

What do you think on the values I gave and their motivations?

[1]: https://mxr.mozilla.org/mozilla-central/source/storage/src/VacuumManager.cpp?rev=8c5869c1f670#39
[2]: http://hg.mozilla.org/mozilla-central/rev/18a6c37bb588#l1.34
[3]: https://mxr.mozilla.org/mozilla-central/search?string=freelist_count&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
Benchmarks for max event count on my debug build running on my Galaxy Nexus, 3 runs per event count, where event count is additional events over pre-populated storage (i.e. storage.insertTextualEvents(count)):

Event count | Avg. time (millis)
    0            48
 5000          5345
10000         10182
15000         13877

Ouch. I'm going to rebench on a non-debug build once the clobber runs.

Keep in mind this bench is without fragmentation, only with notable entries in the event_textual table, and with unrealistic usage on the other tables.

Testing source: http://pastebin.mozilla.org/3039477
(In reply to Michael Comella (:mcomella) from comment #36)

> Ouch. I'm going to rebench on a non-debug build once the clobber runs.

Assuming I'm reading your test code right, your findings are "document generation is linear in number of events, approx. 1msec per event".

That would be good… if we didn't have so many events!

Try it just grabbing the event cursor and walking it (no processing), or profiling these runs. I'm curious how much of this is JSON document construction and serialization versus DB walking.

But this does imply that right now we should keep no more than ~5000 events, which is ~100 days of session data. On a less capable (read: typical) device, with searches and other recorded data, that probably drops to one month. (Notably, that's less than our release cycle, which implies that we should treat high-volume data differently so as to avoid losing low-volume high-utility events: Bug 885650.)
Flags: needinfo?(rnewman)
(In reply to Michael Comella (:mcomella) from comment #35)
> Proposed constants:

These all seem sane, but see comments inline.


> Time between...
>     Prune by size - daily: We're not sure how much data we might collect in
> a day so it's good to ensure yesterday's collection was reasonable

We need to benchmark expiration itself. It *should* be on the order of number of items deleted, in which case we should prune as frequently as possible — we don't want to trash a user's experience by doing a big delete on their single-core phone.

>     Expiration - weekly: This isn't urgent persay - this is not attempting
> to clean up excessive amounts of data, just unused data - so a week seems
> reasonable

There's also the point that expiration might render size-based pruning unnecessary (and it definitely does for steady-state accrual). The frequency of doing this depends on how cheap it is to figure out if we should do any work (by computing counts).

>     Vacuum - monthly

Note that we shouldn't vacuum without having done a thorough prune first: copying is expensive in all kinds of ways, so we should copy as little as possible. As such, these have a sequence dependency, despite us not wanting to run them back-to-back for performance reasons.

My hope is that the free page ratio will pretty much indicate that you've just done a bunch of deletes, and thus we get this for free, but time will tell.

Bug 870170 should be addressed very soon after this, so that we can tune these numbers.


> freelist_count/page_count ratio limit - 0.1: Without spending considerable
> time writing a benchmarking suite, I'm not sure how to get this value, so I
> took the value from when vacuuming was first implemented on desktop [2].
> However, after searching mxr, it seems like they don't use this methodology
> anymore [3] - would you happen to know why? Are we repeating a past mistake?

mak, can you speak to this?


> Max __ count / __ count after pruning...
>     Events/Environments: I am going benchmark how long document generation
> takes on both modern and crappy hardware with various numbers of these to
> determine the amount and report back.

Get a reasonable hysteresis…
Flags: needinfo?(mak77)
Followup to comment 36, Galaxy Nexus, non-debug build [1]:

Event count | Avg. time (millis)
    0            36
 5000          4923
10000          9093
15000         13765

And it's more or less the same. Huh, I wonder if I set up my build environment correctly. In any case, that seems pretty bad - I'll try to get some benchmarks on an older device.

I guess it's worth noting that I'm running this in the context of a JUnit test (as it was easier to set up), which could potentially influence the device via extra memory use, garbage collection, and other background processes I'm unaware of.

[1]: http://pastebin.mozilla.org/3039853
> Try it just grabbing the event cursor and walking it (no processing), or profiling these runs.
> I'm curious how much of this is JSON document construction and serialization versus DB walking.

Galaxy Nexus, non-debug build:

Event count | Time - 1 run (millis)
    0           2
 5000          55
10000         133
15000         258

D:

Test source: http://pastebin.mozilla.org/3039976
(In reply to Michael Comella (:mcomella) from comment #39)
> Followup to comment 36, Galaxy Nexus, non-debug build [1]:
> And it's more or less the same. Huh, I wonder if I set up my build
> environment correctly. In any case, that seems pretty bad - I'll try to get
> some benchmarks on an older device.

I/O doesn't care about debug/opt :P


> I guess it's worth noting that I'm running this in the context of a JUnit
> test (as it was easier to set up), which could potentially influence the
> device via extra memory use, garbage collection, and other background
> processes I'm unaware of.

I doubt it.

(In reply to Michael Comella (:mcomella) from comment #40)

> D:

So raw table walks are cheap...

> Test source: http://pastebin.mozilla.org/3039976

... but make sure you're doing all the same work that the generator does (ordering, grouping, ranges, limits…).

Either some of that is turning this into a bad query, or serialization is really expensive.
(In reply to Richard Newman [:rnewman] from comment #38)
> > freelist_count/page_count ratio limit - 0.1: Without spending considerable
> > time writing a benchmarking suite, I'm not sure how to get this value, so I
> > took the value from when vacuuming was first implemented on desktop [2].
> > However, after searching mxr, it seems like they don't use this methodology
> > anymore [3] - would you happen to know why? Are we repeating a past mistake?
> 
> mak, can you speak to this?

With the very large page size we use on desktop (32k pages), the freelist count was basically never a good ratio to figure out fragmentation, because pages were hardly ever completely free. It *may* work if you use the default 1024-byte page size. Moreover, even if you have a large freelist, it doesn't mean the db is fragmented (the free pages may all be in the same area), and vice versa, having only a few doesn't mean it's not fragmented. For example, auto_vacuum moves all of the free pages to the end of the db and removes them automatically; nonetheless, auto-vacuumed databases are heavily fragmented (that's why we stopped using auto vacuum a long time ago).
Honestly there is no good ratio to figure out fragmentation; that's why we moved to a guessed time-based interval (1 month). It's a guess in both cases, but at least with a time-based vacuum we know it will happen.
Flags: needinfo?(mak77)
> with the very large page size we use on desktop (32k pages) the freelist
> count was basically never a good ratio to figure out fragmentation, cause
> pages were hardly being completely free.  It *may* work if you use the
> default 1024 bytes page size.

We can figure this out, mcomella:

http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#getPageSize%28%29

> Moreover, even if you have a large freelist,
> it doesn't mean the db is fragmented (they may all be in the same area), or
> viceversa, having a few doesn't mean it's not fragmented.

We're only partly interested in fragmentation; we also want compaction, because steady-state disk usage is a concern. From that perspective auto_vacuum is good, so this is a tradeoff between auto_vacuum's tendency to fragment and the unbounded growth when you turn it off.

> moved to a guessed time based ratio (1 month), it's a guess in both cases,
> but at least with time based vacuum we know it will happen.

Gotcha.
I should expand: for some users, events will arrive at approximately the same rate, so the free list will do its job and we'll have steady-state disk usage but increasing fragmentation. For those users, the vacuum aims to reduce fragmentation. For irregular users (or unpredictable usage) there will be spiky patterns, and we want to ensure that those peaks don't keep an inflated database size.
> ... but make sure you're doing all the same work that the generator does
> (ordering, grouping, ranges, limits…).

Event count | Time - 1 run (millis)
    0           3
 5000         271
10000         394
15000         626

Conclusion: serialization is expensive.

Source: http://pastebin.mozilla.org/3046893
Query taken from: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/background/healthreport/HealthReportDatabaseStorage.java?rev=e881258c3ac5#1224
> It *may* work if you use the default 1024 bytes page size.

SQLiteDatabase.getPageSize() returns 4096 bytes on my Galaxy Nexus.

Even if this were the desired 1024 byte page size, I'm not sure we can guarantee this across devices. There is SQLiteDatabase.setPageSize(), but that must be called on database creation (before any data is inserted).
(In reply to Michael Comella (:mcomella) from comment #46)

> Even if this were the desired 1024 byte page size, I'm not sure we can
> guarantee this across devices. There is SQLiteDatabase.setPageSize(), but
> that must be called on database creation (before any data is inserted).

Vacuum also applies page size changes.

(In reply to Michael Comella (:mcomella) from comment #45)

> Conclusion: serialization is expensive.

Sounds like we need some profiling and a follow-up…
Document generation profiling followup: bug 916297.
r? - gh parts 30-31 - Dealing with page sizes for vacuums.

Still TODO in vacuuming: only vacuum if we've pruned.
Flags: needinfo?(rnewman)
> Note that we shouldn't vacuum without having done a thorough prune first: 
> copying is expensive in all kinds of ways, so we should copy as little as
> possible. As such, these have a sequence dependency, despite us not wanting
> to run them back-to-back for performance reasons.

r? - gh part 32 - Don't vacuum unless we've pruned.
I realized that part 32 could be better tested with a TestPrunePolicyDatabaseStorage class - we could test to ensure we set the "needsCleanup" preference as expected (which is a good idea since we'll never vacuum if this code does not work correctly). I'll add the test.
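Roughly, the gate I want the test to pin down looks like this (illustrative only -- the helper names are made up, "prefs" is an assumed SharedPreferences field, and only "needsCleanup" matches the real preference):

    // Vacuum is gated on a flag that pruning/expiration sets.
    public void tick(final long nowMillis) {
        boolean prunedSomething = false;
        if (isPruneBySizeDue(nowMillis)) {
            prunedSomething |= pruneBySize();
        }
        if (isExpirationDue(nowMillis)) {
            prunedSomething |= expireOldData();
        }
        if (prunedSomething) {
            prefs.edit().putBoolean("needsCleanup", true).commit();
        }
        if (isVacuumDue(nowMillis) && prefs.getBoolean("needsCleanup", false)) {
            vacuum();
            prefs.edit().putBoolean("needsCleanup", false).commit();
        }
    }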
Commented on PR.
Flags: needinfo?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #47)
> Vacuum also applies page size changes.

it still may not be a good idea to change it, whatever page size is in use it has very likely been tested and found as the fastest on that storage (iirc it changed recently due to that)
r? - gh parts 30-31 and followup - Only vacuum if pruning has occurred.
Flags: needinfo?(rnewman)
r? - gh parts 30, followup, 31. Note that part 31 is not tested.

https://github.com/mozilla-services/android-sync/pull/344
(In reply to Richard Newman [:rnewman] from comment #38)
> We need to benchmark expiration itself. It *should* be on the order of
> number of items deleted, in which case we should prune as frequently as
> possible — we don't want to trash a user's experience by doing a big delete
> on their single-core phone.

Note that this benchmarking leaves out pruneEnvironments (just because it is harder to insert the appropriate dummy data), but I make the naive assumption that the duration of expiration will be at most double that of the event prune, and probably much less.

pruneEvents benchmarking, 1 run, Galaxy Nexus:

10,000 total events
Count deleted | Time (millis)
0               105
2500            459
5000            524
7500            601

20,000 total events
Count deleted | Time (millis)
0               182
2500            623
5000            804
10000           995

Test source code: https://pastebin.mozilla.org/3090121

Will follow up with an analysis and perhaps more benchmarks.
Some more realistic pruneEvents benchmarks, Galaxy Nexus, 1 run:

Total/deleted - time (millis)
20000/2000 - 638
25000/2000 - 783
30000/2000 - 838

And deleteDataBefore (expiration) benchmarks, Galaxy Nexus, 1 run:

Total/deleted - time (millis)
20000/2000 - 281
25000/2000 - 422
30000/2000 - 507

20000/5000 - 339
25000/5000 - 421
30000/5000 - 504

Huh. I wonder if we shouldn't just expire data daily (since it seems to run so quickly) and axe pruning based on size.

I'll try benching on a slower device.
(In reply to Michael Comella (:mcomella) from comment #57)

> Huh. I wonder if we shouldn't just expire data daily (since it seems to run
> so quickly) and axe pruning based on size.

If the numbers are *tiny* for, e.g., 6000 events, then yeah. Otherwise, weekly?
Flags: needinfo?(rnewman)
(And it seems a Nexus S out-performs my Galaxy Nexus - I think I'll need an older device)

> Huh. I wonder if we shouldn't just expire data daily (since it seems to run
> so quickly) and axe pruning based on size.

Oops, sorry, this didn't make much sense.
r? - gh Part 32: Remove fragmentation-based cleanup code (hope you enjoyed reviewing it the first time too! ;)

https://github.com/mozilla-services/android-sync/pull/344
Flags: needinfo?(rnewman)
I tried benchmarking on a Galaxy Chat, a Droid Bionic, and a Sony Ericsson Xperia, the Xperia being the only one that was slower than the Galaxy Nexus on 20000/2000 (330). However, it was faster on 30000/5000 (433 at its worst).

Thus, it seems we can use the Galaxy Nexus as a benchmark, at least with these numbers and usage patterns. Notably, as the disk fills up (as it likely will on older devices with smaller disks), access speeds may decrease with increased fragmentation so we should keep this in mind when creating these constants.
Comment 61 refers to the deleteDataBefore (expiration) benchmarks.
Re-evaluation of the constants:

Max events: 20,000 - ~1.64 MB (using my analysis from comment 9), and the computation speed seems negligible. The smaller the DB, the faster the vacuums are too.
Max environments: 50 - pretty arbitrary, but (via comment 9) I currently have 10, and they seem fairly cheap.

Prune-by-size frequency: Daily - probably about 1-2 seconds of IO time a day, doesn't seem too bad to me
Expiration frequency: Weekly - doesn't need to be urgent and pruning-by-size makes this less necessary
Vacuum: Monthly - it's what desktop does and it's pretty expensive

Event expiration duration: 6 months - really up to jjensen, depending on how useful the older information is (as per comment 35). Also bug 885650 to aggregate data. We can leave this at 6 months for now since we have prune-by-size in place.

Event/environment count after pruning: TBD - I need to figure out how to hysteresis. :)

rnewman, what say you?
(In reply to Michael Comella (:mcomella) from comment #63)
> Re-evaluation of the constants:
> 
> Max events: 20,000 - ~1.64 MB (using my analysis from comment 9), and the
> computation speed seems negligible. The smaller the DB, the faster the
> vacuums are too.

Let's halve this. 10k events is a lot.

> Max environments: 50 - pretty arbitrary, but (via comment 9) I currently
> have 10, and they seem fairly cheap.

They're the biggest thing we have, but yeah.

> Vacuum: Monthly - it's what desktop does and it's pretty expensive

Sure.

> Event/environment count after pruning: TBD - I need to figure out how to
> hysteresis. :)

Heh, just prune down to X once you hit Y, where Y >> X!

 
> rnewman, what say you?

WFM.
Flags: needinfo?(rnewman)
Follow-up concerns:
    * Provide testing instructions for QA (and ensure it is easily testable)
    * The added tests can take a while to run since they do a lot of DB operations; this may be a concern when the tests get onto tbpl (bug 709353) - we can look into this if it becomes an issue
    * Reassess constants (bug 918527)
https://hg.mozilla.org/mozilla-central/rev/66541e3d93b7
https://hg.mozilla.org/mozilla-central/rev/debb2fa19aec
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 27
# QA Test Plan

## Background info
The FHR prune service runs in the background to delete unneeded data and clean up the FHR database, in order to reduce on-disk size and upload size and to improve performance.

The prune service currently has three specific functions:
    1) Delete data from the database when the number of events/environments recorded grows too large ("prune by size")
    2) Delete data from the database when that data exceeds a certain age ("expiration")
    3) Vacuum the database ("vacuum" or, more generally, "cleanup")

The service runs after a certain fixed interval and each function will only trigger after a certain duration has passed (note that this duration is independent for each function). This means the service can potentially run without performing any action. Note also that (3) should only run after (1) or (2) occur (this state is reset after (3) runs) since we only need to vacuum the database if something has been deleted from it.

## Testing plan
The service is unit tested well enough that if it runs, it should (hopefully) run properly. However, it is difficult to unit test the interaction between Android services, and thus whether the service will actually be triggered correctly.

By default, the prune service should run once a day. Using the prune-test-addon [1], you can set this interval to a much lower value (by default, 60 seconds). Additionally, menu items are added that enable (1), (2), and (3) to run. When the service runs successfully, there should be some relevant log output from each function with log tag PrunePolicy. If there is no output from PrunePolicy, there may be error output from log tag HealthReportPruneService. If there is no output from either, the service has probably not run and is likely broken.

[1]: https://github.com/mcomella/healthreport-prune-test-addon
Whiteboard: [qa+]
The add-on does not create the buttons because a ")" is missing from the end of https://github.com/mcomella/healthreport-prune-test-addon/blob/master/bootstrap.js#L186. Make sure to add it when testing.
Attached file logs Nexus 4
Tested on Nightly 2013-09-24 using an LG Nexus 4 (Android 4.2.2) with a clean profile and Sync set up. We don't have any device with an older profile containing a lot of health report data to test this with.

We are seeing the broadcast messages being sent, "preferences committed" and the alarm set to 60000, but we don't see any messages with the tags "PrunePolicy" or "HealthReportPruneServices". Please see the attached logs.

Is there a way to create health report data on a clean profile?

Please let us know if we can help in any other way.
Flags: needinfo?(michael.l.comella)
> The add-on does not create the buttons...

My apologies - I pushed an update to github.

> Is there a way to create healthreportdata on a clean profile? 

Quickly, no. By hand, each search in the awesomebar will generate a "search" event. If the device or browser configuration changes, a new "environment" will be created. Notably, adding and removing addons will add entries to the "addons" table and create a new "environment". For specifics on what constitutes an environment, see [1].

If it would help, I can give you my FHR database from my personal Nightly, which you should be able to insert into the Fennec profile by hand on a rooted device.

> but we don't see any messages with the tags "PrunePolicy" or "HealthReportPruneServices".

I'll look into this.

[1]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/background/healthreport/Environment.java?rev=79088e422daf#42
Flags: needinfo?(michael.l.comella)
Can you please provide us your FHR database?
Flags: needinfo?(michael.l.comella)
This should go right into the base Mozilla profile directory as "health.db".
Flags: needinfo?(michael.l.comella)
> > but we don't see any messages with the tags "PrunePolicy" or "HealthReportPruneServices".

It turns out this was an issue with GeckoLogger's verbosity. By running,

`adb shell setprop log.tag.GeckoHealth VERBOSE` and restarting the Fennec process, the PrunePolicy and HealthReportPruneService output should be visible.

Note that the test addon has been updated again and you should `git pull` before testing.
Attached file logs samsung note
We tested on the Samsung Galaxy Note (Android 4.0.4), copied the database, and verified that the data was in place using the dump command. We tested all three options of the add-on and you can find the output in the log file.

We didn't see any error messages.

On the HTC Desire (Android 2.2) the add-on does not add the buttons to the menu list.
Product: Firefox Health Report → Firefox Health Report Graveyard