Closed Bug 1397789 Opened 7 years ago Closed 7 years ago

Tune data expiration policy for Places as Firefox starts collecting page metadata

Categories

(Toolkit :: Places, enhancement)

57 Branch
enhancement
Not set
normal

Tracking

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: nanj, Assigned: nanj)

Attachments

(1 file)

This is a followup bug for https://bugzilla.mozilla.org/show_bug.cgi?id=1352502 and https://bugzilla.mozilla.org/show_bug.cgi?id=1393924

Once we start collecting the preview image url and description, we'll need to tweak the current data expiration policy in Places so that the extra storage doesn't hurt Places performance or capacity.
Depends on: 1352502, 1393924
Summary: Tune the data expiration policy for Places as Firefox starts collecting page metadata → Tune data expiration policy for Places as Firefox starts collecting page metadata
As mak suggested earlier, before adding any fancy stuff to the current Places expiration policy, simply increasing DATABASE_MAX_SIZE should be enough to begin with.

We do have a good estimate of the size of the description and preview image url (200 characters in total), but it's hard to tell what percentage of sites provide those two fields in their pages, so we have to take a guess. Ideally, we should have some telemetry on these to get a better picture.

Here is my proposal,

# Increase the db size by 10MB, which is an approx. 20% growth based on the current DATABASE_MAX_SIZE (60MB). As mentioned earlier, the size estimate for description and preview image url is 200 characters (could be more than 200 bytes), the current URIENTRY_AVG_SIZE is 600 bytes, which leads to a 33% disk space increase in the extreme case that it successfully collects the page metadata for every entry in Places. In practice, however, this metadata percentage will vary with the individual's browsing pattern, geo, and other factors. Let's take a guess that 50% of sites will have metadata attached in their pages, which suggests a roughly 20% increase for each URI entry in Places. Of course, we can revisit those estimates once we collect some metrics in the wild.

# Use telemetry, existing and new if necessary. We can watch existing metrics like "PLACES_DATABASE_FILESIZE", "PLACES_DATABASE_PAGESIZE", and "PLACES_DATABASE_SIZE_PER_PAGE" for changes. On the other hand, we can add new telemetry to ContentMetaHandler, such as "PAGE_METADATA_PARSING_ATTEMPT", "PAGE_METADATA_PARSING_SUCCESS", "PAGE_METADATA_SIZE" etc. Over time, we would learn more about all those factors.

Thoughts?
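
The arithmetic behind the proposal can be checked with a short sketch. The values are the estimates quoted in this comment (60MB cap, 600-byte average entry, 200 bytes of metadata, a guessed 50% coverage); the variable names are made up for illustration and are not taken from the Places source.

```python
# Back-of-the-envelope check of the proposal's numbers.
# All names are illustrative; values are the estimates quoted in the comment.
CURRENT_MAX_SIZE_MB = 60   # current DATABASE_MAX_SIZE
PROPOSED_GROWTH_MB = 10    # proposed increase
URIENTRY_AVG_SIZE = 600    # bytes per URI entry (current estimate)
METADATA_SIZE = 200        # bytes, treating 200 characters as ~200 bytes
METADATA_COVERAGE = 0.5    # guess: half of the sites expose metadata

# Extreme case: every entry carries metadata.
worst_case_growth = METADATA_SIZE / URIENTRY_AVG_SIZE
# Expected case under the 50% guess.
expected_growth = METADATA_COVERAGE * worst_case_growth
# Relative growth of the database cap itself.
db_growth = PROPOSED_GROWTH_MB / CURRENT_MAX_SIZE_MB

print(f"worst case per entry: {worst_case_growth:.0%}")
print(f"expected per entry:   {expected_growth:.0%}")
print(f"db size growth:       {db_growth:.0%}")
```

Both the expected per-entry growth and the 10MB bump come out near 17%, which the proposal rounds to about 20%.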
mak, thoughts on comment 1?

From my own profile, since we started saving metadata, I have 576 moz_places rows with some metadata out of 2594 rows in total (2018 without), so that's about 22% with metadata.

Can split that finer:

SELECT COUNT(1) FROM moz_places WHERE last_visit_date >= 1504859278547029 AND description NOTNULL AND preview_image_url NOTNULL;

15%  395 rows that have both description and image
 2%   41 rows with description and no image
 5%  140 rows with image and no description
78% 2018 rows with no description and no image

Looking at sizes:

SELECT AVG(LENGTH(description)), AVG(LENGTH(preview_image_url)), AVG(LENGTH(url)), AVG(LENGTH(title)) FROM moz_places WHERE last_visit_date >= 1504859278547029
(the average skips nulls)
description: 119
preview_image_url: 72
url: 113
title: 66
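
As a sketch only, the four-way split above could also be produced in a single grouped query. The example below runs against a toy in-memory table with made-up rows rather than a real places.sqlite, and it omits the last_visit_date filter; the aliases has_description and has_image are illustrative.

```python
import sqlite3

# Toy stand-in for moz_places; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moz_places (url TEXT, description TEXT, preview_image_url TEXT)"
)
conn.executemany(
    "INSERT INTO moz_places VALUES (?, ?, ?)",
    [
        ("https://a.example/", "A description", "https://a.example/a.png"),
        ("https://b.example/", "Only a description", None),
        ("https://c.example/", None, "https://c.example/c.png"),
        ("https://d.example/", None, None),
        ("https://e.example/", None, None),
    ],
)

# Group rows by which of the two metadata fields are present.
breakdown = conn.execute(
    """
    SELECT description IS NOT NULL AS has_description,
           preview_image_url IS NOT NULL AS has_image,
           COUNT(*) AS n
    FROM moz_places
    GROUP BY has_description, has_image
    ORDER BY has_description DESC, has_image DESC
    """
).fetchall()

total = sum(n for _, _, n in breakdown)
for has_description, has_image, n in breakdown:
    print(f"{n / total:4.0%} {n:5d} description={has_description} image={has_image}")
```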
Flags: needinfo?(mak77)
TL;DR
I'd suggest to set URIENTRY_AVG_SIZE to 700 and DATABASE_MAX_SIZE to 70MiB

(In reply to Nan Jiang [:nanj] from comment #1)
> bytes), the current URIENTRY_AVG_SIZE is 600 bytes

This is a fallback value, extracted mostly from telemetry. We try to calculate a more meaningful value at runtime; if that value ends up being "nonsense" (due to the chunked growth), then we fall back.
The telemetry probe is PLACES_DATABASE_SIZE_PER_PAGE_B, for which the current median value is 775.46. Since we haven't updated this in a while, it may be worth increasing URIENTRY_AVG_SIZE to 700. 800 could also work, but I prefer moving in small steps and observing how things evolve.
We can further tune this in the future once we have collected new telemetry.

> Thoughts?

Off-hand it may be useful to have a probe for the avg metadata size and the percentage of pages with metadata. At least at the beginning to tweak things. Though we can indeed already use the existing probes to evaluate the effect.
Another interesting probe to keep an eye on is PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS, that is, how old the most recent expired visit was. We try to keep that around 1 year atm. If it increases, the db has more free space and could probably be compacted; if it starts to decrease, we are expiring too much history and should enlarge the db.
It has clearly increased recently, likely thanks to the removal of favicons. That made some space for this new metadata.
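
The rule of thumb for PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS can be written down as a tiny sketch. The function name and the 30-day tolerance are hypothetical and only there to make the example concrete; nothing here reflects how Places actually reacts to the probe.

```python
TARGET_DAYS = 365     # "around 1 year", per the comment above
TOLERANCE_DAYS = 30   # hypothetical dead band, not from the bug


def suggest_action(most_recent_expired_visit_days: float) -> str:
    """Map the probe value to the action suggested in the comment."""
    if most_recent_expired_visit_days > TARGET_DAYS + TOLERANCE_DAYS:
        # Expired visits are old: the db has spare room and could be compacted.
        return "compact"
    if most_recent_expired_visit_days < TARGET_DAYS - TOLERANCE_DAYS:
        # Recent visits are being expired: too much history is lost.
        return "enlarge"
    return "keep"


print(suggest_action(500))
print(suggest_action(200))
print(suggest_action(360))
```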

(In reply to Ed Lee :Mardak from comment #2)
> mak, thoughts on comment 1?

The 95th percentile of PLACES_PAGES_COUNT is 100k. Guessing about half may have metadata, and considering 200 bytes per page, 10MB sounds like a good approximation. We may have to tweak it in the future.
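
Spelled out, that estimate is a single multiplication. The variable names are illustrative; the numbers are the ones from this comment.

```python
PAGES_P95 = 100_000    # 95th percentile of PLACES_PAGES_COUNT
METADATA_SHARE = 0.5   # guess: about half of the pages have metadata
METADATA_BYTES = 200   # estimated metadata size per page

extra_bytes = PAGES_P95 * METADATA_SHARE * METADATA_BYTES
extra_mb = extra_bytes / 1_000_000      # decimal megabytes
extra_mib = extra_bytes / (1024 * 1024)  # binary mebibytes

print(f"{extra_mb:.1f} MB, {extra_mib:.2f} MiB")
```

That is 10 MB in decimal units, or a bit under 10 MiB.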
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #3)
> TL;DR
> I'd suggest to set URIENTRY_AVG_SIZE to 700 and DATABASE_MAX_SIZE to 70MiB
> 

Sounds good! Patch submitted.

I'd like to do the new telemetry probes for metadata collection in a separate bug, as I suppose that needs to go through the process of data collection review.
Assignee: nobody → najiang
Comment on attachment 8908116 [details]
Bug 1397789 - Tune Places data expiration for page metadata collection.

https://reviewboard.mozilla.org/r/179792/#review185094

Thank you!
Attachment #8908116 - Flags: review?(mak77) → review+
Pushed by edilee@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a59dee0eecb9
Tune Places data expiration for page metadata collection. r=mak
https://hg.mozilla.org/mozilla-central/rev/a59dee0eecb9
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
I was poking around the metadata from my places after I noticed a "null" description being saved: https://github.com/mozilla/activity-stream/issues/3510

But more to this bug:

sqlite> SELECT LENGTH(description), description, url FROM moz_places WHERE LENGTH(description) > 500;

785|It is extremely important for our customers to be able to easily charge their cars. The most convenient way to charge is to plug in overnight at home, and for most people, this is all that is needed. However, for customers who use their car for long distance travel, there is a growing network of Superchargers located along highways on popular driving routes. We have also installed thousands of Destination Charging connectors at hotels, resorts and restaurants that replicate the home charging experience when you’re away from home. Now, as part of our commitment to make Tesla ownership easy for everyone, including those without immediate access to home or workplace charging, we are expanding our Supercharger network into city centers, starting with downtown Chicago and Boston.|https://www.tesla.com/blog/supercharging-cities

530|The preload value of the link element's rel attribute allows you to write declarative fetch requests in your HTML head, specifying resources that your pages will need very soon after loading, which you therefore want to start preloading early in the lifecycle of a page load, before the browser's main rendering machinery kicks in. This ensures that they are made available earlier and are less likely to block the page's first render, leading to performance improvements. This article provides a basic guide to how preload works.|https://developer.mozilla.org/en-US/docs/Web/HTML/Preloading_content

Even if we assume there's no title and 4 lines of description are shown, only the first 120 characters of the 530 are shown for the "Preloading content" MDN page.

I wonder if we should be capping at some point…
Note that PlacesUtils.History.update already caps the description at 1024 bytes.

Alternatively, we can also cap it in ContentMetaHandler with a more aggressive limit. I'd be curious to see how often this happens though.
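
For illustration, a byte-based cap could look like the sketch below. The helper name cap_utf8 and the 256-byte limit are hypothetical; this is not the ContentMetaHandler or PlacesUtils code, just one way to truncate without splitting a multi-byte character.

```python
def cap_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so that its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit, then drop any partial trailing character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical, more aggressive limit than the existing 1024 bytes.
HYPOTHETICAL_LIMIT = 256

long_description = "The preload value of the link element's rel attribute " * 10
capped = cap_utf8(long_description, HYPOTHETICAL_LIMIT)
print(len(capped.encode("utf-8")))
```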
Just a followup about this tuning: it looks like PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS has been decreasing significantly since Firefox began to collect the metadata in Nightly. Given that the sweet spot is to keep it around a year, we might want to increase the db size accordingly.

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2017-10-01&keys=!__none__!__none__&max_channel_version=nightly%252F58&measure=PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-09-28&trim=1&use_submission_date=0

In addition, PLACES_DATABASE_SIZE_PER_PAGE_B has shown a substantial increase recently: the median value jumped from 754 to 899 bytes.

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2017-10-01&keys=!__none__!__none__&max_channel_version=nightly%252F58&measure=PLACES_DATABASE_SIZE_PER_PAGE_B&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-09-28&trim=1&use_submission_date=0

Lastly, here is the evolution of PAGE_METADATA_SIZE, which pretty much explains the increase in PLACES_DATABASE_SIZE_PER_PAGE_B:

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=2017-10-01&keys=!__none__!__none__&max_channel_version=nightly%252F58&measure=PAGE_METADATA_SIZE&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-09-28&trim=1&use_submission_date=0
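
To put the PLACES_DATABASE_SIZE_PER_PAGE_B change in numbers, here is a quick sketch using only the two medians quoted above and the 700-byte fallback suggested earlier in this bug. The variable names are illustrative.

```python
median_before = 754            # bytes, median before metadata collection
median_after = 899             # bytes, median after
suggested_fallback = 700       # URIENTRY_AVG_SIZE suggested earlier in this bug

delta = median_after - median_before
relative = delta / median_before

print(f"+{delta} bytes per page ({relative:.1%})")
print(f"median exceeds fallback by {median_after - suggested_fallback} bytes")
```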

I'd recommend we keep a close eye on those metrics for a few more days, and take action once those changes become more stable and obvious.