Increase frequency for StorageMaintenanceWorker
Categories
(Fenix :: Accounts and Sync, enhancement)
Tracking
(firefox120 wontfix, firefox121 fixed)
People
(Reporter: bdk, Assigned: bdk)
Details
Attachments
(1 file)
We introduced this last year as an attempt to keep places DB sizes bounded at 75 MB. I took a look at the metrics yesterday and noticed some things:
- DB size after maintenance seems to show:
- 99% of users are under 75 MB; somewhere between 0.1% and 1% are over.
- Based on the 99.9th percentile, there doesn't seem to be much effect from the maintenance for the users that are over. If it was working, I would expect that line to be trending slowly downwards, with occasional spikes in either direction.
- Maintenance time seems to show that the maintenance is not having an adverse effect on users. The 99.9th percentile is 3.9s for the entire operation. We believe the only way this would affect users is if it caused database lock errors, which would require one of the maintenance queries to take over 5 seconds while a sync or another operation making write queries was running at the same time.
Based on that, I think we should experiment with upping the frequency of the maintenance runs. Right now it runs every 24 hours; maybe we could move that down to 6 hours and monitor for changes?
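For illustration, a minimal sketch of what that scheduling change might look like, assuming a standard WorkManager periodic worker; the unique-work name and interval constant here are assumptions, not the actual Fenix code:

```kotlin
import android.content.Context
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import java.util.concurrent.TimeUnit

// Hypothetical constant: currently the worker runs every 24 hours;
// the proposal is to try 6 and watch the metrics.
private const val MAINTENANCE_INTERVAL_HOURS = 6L

fun scheduleStorageMaintenance(context: Context) {
    val request = PeriodicWorkRequestBuilder<StorageMaintenanceWorker>(
        MAINTENANCE_INTERVAL_HOURS, TimeUnit.HOURS,
    ).build()

    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "StorageMaintenanceWorker", // unique-work name is an assumption
        // UPDATE (WorkManager 2.8+) keeps the existing work but applies the new interval.
        ExistingPeriodicWorkPolicy.UPDATE,
        request,
    )
}
```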
This could be coupled with some new metrics:
- For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
- Metrics like read_query_count / write_query_count / read_error_query_count / write_error_query_count, but only when we are running maintenance. This would tell us if there are increased error rates during maintenance and also if changing the rate is affecting the error rates. I'm not sure if these need to be new metrics, or we could add a tag to the existing ones.
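One rough way the maintenance-scoped counters could be wired up, as a sketch only; these are not existing Fenix or Glean metrics, and the shared "maintenance in progress" flag is an assumption:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicLong

// Hypothetical counters; in practice these would likely be Glean metrics,
// either new ones or a tag/label on the existing query-count metrics.
object MaintenanceQueryMetrics {
    val maintenanceInProgress = AtomicBoolean(false)
    private val readQueries = AtomicLong(0)
    private val readErrors = AtomicLong(0)

    // Wrap read queries so they are only counted while maintenance is running.
    fun <T> countRead(block: () -> T): T {
        if (!maintenanceInProgress.get()) return block()
        readQueries.incrementAndGet()
        return try {
            block()
        } catch (e: Exception) {
            readErrors.incrementAndGet()
            throw e
        }
    }
}
```

Write queries would get the same treatment; comparing error rates with maintenance running and not running would show whether a higher cadence is causing lock errors.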
Comment 1•1 year ago
This makes sense to me.
> The 99.9th percentile is 3.9s for the entire operation.
The 99.9 part of that is great, the 3.9s part of that less so :) I note that maintenance is actually split into 4 stages, so if we cared about that, a possible optimization would be to try to set a target of (say) 3s for the operation, and as soon as we hit that we could stop calling later stages. This might mean really low-powered devices never complete all the stages, but OTOH it might mean devices temporarily overwhelmed by other things stay responsive. I guess we can leave that call to the Android team.
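A sketch of that time-budget idea, assuming the stages can be modelled as a list of callables and a roughly 3 second budget (both assumptions for illustration, not the real maintenance code):

```kotlin
import kotlin.system.measureTimeMillis

// Illustrative stand-ins for the maintenance stages.
typealias MaintenanceStage = () -> Unit

fun runMaintenanceWithBudget(stages: List<MaintenanceStage>, budgetMillis: Long = 3_000) {
    var elapsed = 0L
    for (stage in stages) {
        // Skip the remaining stages once the budget is spent. Low-powered
        // devices may never reach the later stages, but a device that is
        // temporarily overwhelmed by other work stays responsive.
        if (elapsed >= budgetMillis) break
        elapsed += measureTimeMillis { stage() }
    }
}
```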
> For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
This metric makes sense to me (although I guess it needs a better specification - recording this for every user might mean we miss the outliers, and here the outliers seem to be what we care most about?)
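For example, one possible specification would be to record the delta only for databases that are still over the target. A sketch, where the 75 MB target constant and the reporting callback are assumptions:

```kotlin
// Hypothetical: only record the size change for databases still over the
// target, since those outliers are the ones we care about here.
const val DB_SIZE_TARGET_BYTES = 75L * 1024 * 1024

fun maybeRecordMaintenanceDelta(
    sizeAtLastRunBytes: Long,
    sizeAtThisRunBytes: Long,
    record: (deltaBytes: Long) -> Unit,
) {
    if (sizeAtThisRunBytes > DB_SIZE_TARGET_BYTES) {
        // A negative delta means maintenance is removing entries faster
        // than the user is creating them.
        record(sizeAtThisRunBytes - sizeAtLastRunBytes)
    }
}
```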
NI Christian who helped us form the initial intentionally conservative policy and CC Jon for any thoughts.
Comment 2•1 year ago
It makes sense to me to start cleaning up more data, especially as we now know that the performance impact is minimal. However, I am not sure increasing the cadence fourfold is necessarily the best option.
Can we also increase the amount of data we prune? We currently only delete 6 records at a time.
Perhaps a compromise is an ideal next "experiment", e.g., delete 12 records at a time, verify, and increase the cadence to every 12 hours? I think we want to avoid running the worker too many times during the day.
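In terms of configuration, the compromise boils down to two knobs, sketched here with hypothetical constant names (the prune size presumably lives in the underlying places maintenance code rather than in Fenix itself):

```kotlin
// Hypothetical tuning for the compromise above: twice the visits per run,
// twice the runs per day, i.e. roughly 24 pruned visits/day instead of 6.
const val PRUNE_VISITS_PER_RUN = 12
const val MAINTENANCE_INTERVAL_HOURS = 12L
```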
I will add Kaya who implemented the worker.
Comment 3•1 year ago
> Perhaps a compromise is an ideal next "experiment", e.g., delete 12 records at a time, verify, and increase the cadence to every 12 hours?
And maybe a followup bug or other way of ensuring we aggressively track these metrics - while the above effectively quadruples the number of visits we prune, it still leaves us at 24 visits per day, which I'd expect fails to keep up with many of our users, and certainly not with heavy users.
Looking again at desktop, it seems to have quite a different strategy:
- When the browser becomes active, a timer is started which does the expiry every 3 minutes.
- When the browser becomes idle, it stops the expiration with the aim of reducing battery usage.
So in practice it seems like desktop will actually prune far more than 24 visits per day.
IOW, I agree with this conservative increase, but doubt it will get us close to actually keeping up with many of our users.
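For comparison, a rough sketch of what the desktop-style approach might look like on Android, assuming a coroutine scope and an expiry callback are available (both assumptions; this is not how Fenix currently works):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlin.time.Duration.Companion.minutes

// Hypothetical foreground-driven expiry, modelled on the desktop behaviour.
class ForegroundExpiry(
    private val scope: CoroutineScope,
    private val expireSomeVisits: suspend () -> Unit,
) {
    private var job: Job? = null

    // Call when the browser becomes active (e.g. from a lifecycle observer).
    fun start() {
        if (job?.isActive == true) return
        job = scope.launch {
            while (true) {
                delay(3.minutes)   // desktop runs expiry roughly every 3 minutes
                expireSomeVisits() // prune a small batch of visits
            }
        }
    }

    // Call when the browser becomes idle, to reduce battery usage.
    fun stop() {
        job?.cancel()
        job = null
    }
}
```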
Comment 4•1 year ago
Assignee
Comment 5•1 year ago
>> For DB sizes over the target, what was the delta between the last maintenance run and this one? This could tell us if the maintenance runs are removing entries faster than the user is creating them.
> This metric makes sense to me (although I guess it needs a better specification - recording this for every user might mean we miss the outliers, and here the outliers seem to be what we care most about?)
That's a really good point. I think it would be quite complicated to get this metric right.
Maybe we should just start with reducing the waiting period, increasing the number of visits pruned, and watching the existing metrics. I made 2 PRs for that:
Comment 6•1 year ago
Authored by https://github.com/bendk
https://github.com/mozilla-mobile/firefox-android/commit/348d4b2f46c3e183eef033247341a41553d6fa92
[main] Bug 1854383 - Increase frequency for StorageMaintenanceWorker
Comment 7•1 year ago
Hello,
Is there any manual QA testing necessary here?
Thank you!
Assignee
Comment 8•1 year ago
None needed; the changes should not be user-visible.