Increase frequency for StorageMaintenanceWorker
Categories
(Fenix :: Accounts and Sync, enhancement)
Tracking
(firefox120 wontfix, firefox121 fixed)
People
(Reporter: bdk, Assigned: bdk)
Details
Attachments
(1 file)
We introduced this last year as an attempt to keep places DB sizes bounded at 75 MB. I took a look at the metrics yesterday and noticed some things:
- DB size after maintenance seems to show:
- 99% of users are under 75 MB; somewhere between 0.1% and 1% are over.
- Based on the 99.9th percentile, there doesn't seem to be much effect from the maintenance for the users that are over. If it was working, I would expect that line to be trending slowly downwards, with occasional spikes in either direction.
- Maintenance time seems to show that the maintenance is not having an adverse effect on users. The 99.9th percentile is 3.9s for the entire operation. We believe the only way this would affect users is if it caused database lock errors, which would require one of the maintenance queries to take over 5 seconds while a sync or another operation making write queries was running at the same time.
Based on that, I think we should experiment with upping the frequency of the maintenance runs. Right now it runs every 24 hours; maybe we could move that down to 6 hours and monitor for changes?
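For illustration, a minimal sketch of what that scheduling change might look like, assuming a standard WorkManager periodic worker; the unique-work name and interval constant here are assumptions, not the actual Fenix code:

```kotlin
import android.content.Context
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import java.util.concurrent.TimeUnit

// Hypothetical constant: currently the worker runs every 24 hours;
// the proposal is to try 6 and watch the metrics.
private const val MAINTENANCE_INTERVAL_HOURS = 6L

fun scheduleStorageMaintenance(context: Context) {
    val request = PeriodicWorkRequestBuilder<StorageMaintenanceWorker>(
        MAINTENANCE_INTERVAL_HOURS, TimeUnit.HOURS,
    ).build()

    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "StorageMaintenanceWorker", // unique-work name is an assumption
        // UPDATE (WorkManager 2.8+) keeps the existing work but applies the new interval.
        ExistingPeriodicWorkPolicy.UPDATE,
        request,
    )
}
```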
This could be coupled with some new metrics:
- For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
- Metrics like read_query_count / write_query_count / read_error_query_count / write_error_query_count, but only when we are running maintenance. This would tell us if there are increased error rates during maintenance and also if changing the rate is affecting the error rates. I'm not sure if these need to be new metrics, or we could add a tag to the existing ones.
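One rough way the maintenance-scoped counters could be wired up, as a sketch only; these are not existing Fenix or Glean metrics, and the shared "maintenance in progress" flag is an assumption:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicLong

// Hypothetical counters; in practice these would likely be Glean metrics,
// either new ones or a tag/label on the existing query-count metrics.
object MaintenanceQueryMetrics {
    val maintenanceInProgress = AtomicBoolean(false)
    private val readQueries = AtomicLong(0)
    private val readErrors = AtomicLong(0)

    // Wrap read queries so they are only counted while maintenance is running.
    fun <T> countRead(block: () -> T): T {
        if (!maintenanceInProgress.get()) return block()
        readQueries.incrementAndGet()
        return try {
            block()
        } catch (e: Exception) {
            readErrors.incrementAndGet()
            throw e
        }
    }
}
```

Write queries would get the same treatment; comparing error rates with maintenance running and not running would show whether a higher cadence is causing lock errors.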
Comment 1•1 year ago
This makes sense to me.
> The 99.9th percentile is 3.9s for the entire operation.
The 99.9 part of that is great, the 3.9s part of that less so :) I note that maintenance is actually split into 4 stages, so if we cared about that, a possible optimization would be to try to set a target of (say) 3s for the operation, and as soon as we hit that we could stop calling later stages. This might mean really low-powered devices never complete all the stages, but OTOH it might mean devices temporarily overwhelmed by other things stay responsive. I guess we can leave that call to the Android team.
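A sketch of that time-budget idea, assuming the stages can be modelled as a list of callables and a roughly 3 second budget (both assumptions for illustration, not the real maintenance code):

```kotlin
import kotlin.system.measureTimeMillis

// Illustrative stand-ins for the maintenance stages.
typealias MaintenanceStage = () -> Unit

fun runMaintenanceWithBudget(stages: List<MaintenanceStage>, budgetMillis: Long = 3_000) {
    var elapsed = 0L
    for (stage in stages) {
        // Skip the remaining stages once the budget is spent. Low-powered
        // devices may never reach the later stages, but a device that is
        // temporarily overwhelmed by other work stays responsive.
        if (elapsed >= budgetMillis) break
        elapsed += measureTimeMillis { stage() }
    }
}
```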
> For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
This metric makes sense to me (although I guess it needs a better specification - recording this for every user might mean we miss the outliers, and here the outliers seem to be what we care most about?)
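For example, one possible specification would be to record the delta only for databases that are still over the target. A sketch, where the 75 MB target constant and the reporting callback are assumptions:

```kotlin
// Hypothetical: only record the size change for databases still over the
// target, since those outliers are the ones we care about here.
const val DB_SIZE_TARGET_BYTES = 75L * 1024 * 1024

fun maybeRecordMaintenanceDelta(
    sizeAtLastRunBytes: Long,
    sizeAtThisRunBytes: Long,
    record: (deltaBytes: Long) -> Unit,
) {
    if (sizeAtThisRunBytes > DB_SIZE_TARGET_BYTES) {
        // A negative delta means maintenance is removing entries faster
        // than the user is creating them.
        record(sizeAtThisRunBytes - sizeAtLastRunBytes)
    }
}
```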
NI Christian who helped us form the initial intentionally conservative policy and CC Jon for any thoughts.
Comment 2•1 year ago
It makes sense to me to start cleaning up more data, especially as we now know that the performance impact is minimal. However, I am not sure increasing the cadence fourfold is necessarily the best option.
Can we also increase the amount of data we prune? We currently only delete 6 records at a time.
Perhaps a compromise is an ideal next "experiment", e.g., delete 12 records at a time, verify, and increase the cadence to every 12 hours? I think we want to avoid running the worker too many times during the day.
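In terms of configuration, the compromise boils down to two knobs, sketched here with hypothetical constant names (the prune size presumably lives in the underlying places maintenance code rather than in Fenix itself):

```kotlin
// Hypothetical tuning for the compromise above: twice the visits per run,
// twice the runs per day, i.e. roughly 24 pruned visits/day instead of 6.
const val PRUNE_VISITS_PER_RUN = 12
const val MAINTENANCE_INTERVAL_HOURS = 12L
```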
I will add Kaya who implemented the worker.
Comment 3•1 year ago
> Perhaps a compromise is an ideal next "experiment", e.g., delete 12 records at a time, verify, and increase the cadence to every 12 hours?
And maybe a followup bug or other way of ensuring we aggressively track these metrics - while the above effectively quadruples the number of visits we prune, it still leaves us at 24 visits per day, which I'd expect fails to keep up with many of our users, and certainly not with heavy users.
Looking again at desktop, it seems to have quite a different strategy:
- When the browser becomes active, a timer is started which does the expiry every 3 minutes.
- When the browser becomes idle, it stops the expiration with the aim of reducing battery usage.
So in practice it seems like desktop will actually prune far more than 24 visits per day.
IOW, I agree with this conservative increase, but doubt it will get us close to actually keeping up with many of our users.
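For comparison, a rough sketch of what the desktop-style approach might look like on Android, assuming a coroutine scope and an expiry callback are available (both assumptions; this is not how Fenix currently works):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlin.time.Duration.Companion.minutes

// Hypothetical foreground-driven expiry, modelled on the desktop behaviour.
class ForegroundExpiry(
    private val scope: CoroutineScope,
    private val expireSomeVisits: suspend () -> Unit,
) {
    private var job: Job? = null

    // Call when the browser becomes active (e.g. from a lifecycle observer).
    fun start() {
        if (job?.isActive == true) return
        job = scope.launch {
            while (true) {
                delay(3.minutes)   // desktop runs expiry roughly every 3 minutes
                expireSomeVisits() // prune a small batch of visits
            }
        }
    }

    // Call when the browser becomes idle, to reduce battery usage.
    fun stop() {
        job?.cancel()
        job = null
    }
}
```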
Comment 4•1 year ago
Assignee
Comment 5•1 year ago
>> For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
> This metric makes sense to me (although I guess it needs a better specification - recording this for every user might mean we miss the outliers, and here the outliers seem to be what we care most about?)
That's a really good point. I think it would be quite complicated to get this metric right.
Maybe we should just start with reducing the waiting period, increasing the number of visits pruned, and watching the existing metrics. I made 2 PRs for that:
Comment 6•1 year ago
Authored by https://github.com/bendk
https://github.com/mozilla-mobile/firefox-android/commit/348d4b2f46c3e183eef033247341a41553d6fa92
[main] Bug 1854383 - Increase frequency for StorageMaintenanceWorker
Comment 7•1 year ago
Hello,
Is there any manual QA testing necessary here?
Thank you!
Assignee
Comment 8•1 year ago
None needed; the changes should not be user-visible.