Closed Bug 1656589 Opened 2 years ago Closed 2 years ago

Add a memory distribution metric to record the Glean database size

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: janerik)

References

(Blocks 1 open bug)

Attachments

(2 files)

It's possible a bug could be introduced that would make the Glean database grow without bound. Perhaps we should collect metadata on that (or only when the size grows above some service-level guarantee?).

Priority: P3 → P1
Whiteboard: [telemetry:glean-rs:m?]
Assignee: nobody → jrediger
Type: defect → enhancement

It's easy to get the file size. Additionally, Rkv can tell us the number of data entries if we want that.

However, I'm struggling to find the right place to record this data.
Should this be a memory distribution?

If so:

  • What exactly do we want to answer? The average size of the database we see across all users? Looking at outliers with large databases?
  • What lifetime for the metric?
  • When do we collect the data? Only on init, on each backgrounding? Before sending any ping? After sending a ping?

:Dexter, :mdroettboom, any input here?

Flags: needinfo?(mdroettboom)
Flags: needinfo?(alessio.placitelli)

(In reply to Jan-Erik Rediger [:janerik] from comment #1)

> Should this be a memory distribution?

Probably, if we can get a memory value out of the data entries you linked.

> If so:
>
>   • What exactly do we want to answer? The average size of the database we see across all users? Looking at outliers with large databases?

"Does a significant part of our users have a bloated database?" / "Do we need to find a solution to empty the database SOON?"

>   • What lifetime for the metric?

I believe "ping" lifetime on the "metrics" ping is fine.

>   • When do we collect the data? Only on init, on each backgrounding? Before sending any ping? After sending a ping?

When is the data available? Is there any cost involved in querying the data entries?

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Probably, if we can get a memory value out of the data entries you linked.

We can also get the file size of the database on disk.
The pure number of data entries won't help us.

> When is the data available? Is there any cost involved in querying the data entries?

When we ask for it. :)
File size is easy, we just need to look at the file system (that's I/O though).
I think the number of data entries (if we even want that) is tracked by the DB and thus cheap to get, but I'll verify that. Consider it cheap for now (though again, I don't think the number of entries alone is a helpful measure for us).
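
A minimal sketch of the "look at the file system" approach, assuming the database lives in a single directory whose regular files make up the on-disk size. The function name `database_size_on_disk` is hypothetical and not part of Glean or Rkv; only standard-library calls are used.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the sizes (in bytes) of all regular files directly inside `dir`.
///
/// Hypothetical helper: the directory layout is an assumption, and
/// subdirectories are deliberately not traversed.
pub fn database_size_on_disk(dir: &Path) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        // One metadata lookup per file: this is the I/O cost mentioned above.
        let metadata = entry?.metadata()?;
        if metadata.is_file() {
            total += metadata.len();
        }
    }
    Ok(total)
}
```

The result is a byte count, which is the kind of value a memory distribution expects. The number of entries is not touched at all here.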

(In reply to Jan-Erik Rediger [:janerik] from comment #3)

> When we ask for it. :)
> File size is easy, we just need to look at the file system (that's I/O though).

Then I'd lean more towards only at init

Yeah -- I think reading the file size on init (in a background thread) feels like the right cadence.

Flags: needinfo?(mdroettboom)

(In reply to Michael Droettboom [:mdroettboom] from comment #5)

> Yeah -- I think reading the file size on init (in a background thread) feels like the right cadence.

init already runs off the main thread. We also want to read the size before rkv touches the database again, because after that point we may already have read from or written to it.
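
The ordering described here can be sketched as follows. This is only an illustration under stated assumptions: `init_off_main_thread`, `dir_size`, and the `open_database` callback are hypothetical names standing in for the real initialization code, which this thread does not show.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::thread;

/// Size of the regular files in `dir`, or `None` if it cannot be read
/// (for example on first run, before any database exists).
fn dir_size(dir: &Path) -> Option<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(dir).ok()? {
        let metadata = entry.ok()?.metadata().ok()?;
        if metadata.is_file() {
            total += metadata.len();
        }
    }
    Some(total)
}

/// Hypothetical init routine: runs on a background thread and measures
/// the on-disk size strictly before the storage layer is opened.
pub fn init_off_main_thread<F, T>(
    db_dir: PathBuf,
    open_database: F,
) -> thread::JoinHandle<(Option<u64>, T)>
where
    F: FnOnce(&Path) -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || {
        // Step 1: read the size while nothing else has touched the files.
        let size_before_open = dir_size(&db_dir);
        // Step 2: only now hand the directory to the storage layer,
        // which may read from or write to it.
        let db = open_database(&db_dir);
        (size_before_open, db)
    })
}
```

The caller gets back both the measured size and whatever the open callback returns, so the size can be recorded into a metric once the rest of initialization has finished.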

> I believe "ping" lifetime on the "metrics" ping is fine.

Actually, that means it will only be sent on the next ping, then cleared. Is this a case for an "application" lifetime metric instead?
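
To make the lifetime question concrete, here is a toy model of the difference. It is not Glean's implementation: `ToyStore`, `Lifetime`, and the metric names are hypothetical, and each metric is reduced to a single `u64` rather than a distribution.

```rust
use std::collections::HashMap;

/// Simplified stand-in for metric lifetimes (assumption: only two kinds).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Lifetime {
    Ping,
    Application,
}

/// Toy in-memory metric store keyed by metric name.
#[derive(Default)]
pub struct ToyStore {
    values: HashMap<&'static str, (Lifetime, u64)>,
}

impl ToyStore {
    pub fn record(&mut self, name: &'static str, lifetime: Lifetime, value: u64) {
        self.values.insert(name, (lifetime, value));
    }

    /// Collect everything into a payload, then clear "ping" lifetime values.
    /// "application" lifetime values survive until the store is dropped.
    pub fn submit_ping(&mut self) -> Vec<(&'static str, u64)> {
        let mut payload: Vec<_> = self
            .values
            .iter()
            .map(|(name, (_, value))| (*name, *value))
            .collect();
        payload.sort();
        self.values
            .retain(|_, (lifetime, _)| *lifetime != Lifetime::Ping);
        payload
    }
}
```

In this model, a size recorded once at init with `Lifetime::Ping` appears in the first submitted ping only, while the same value recorded with `Lifetime::Application` appears in every ping sent during that run.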

Comment on attachment 9168710 [details]
data-review-request.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry, so it can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jan-Erik Rediger is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9168710 - Flags: data-review?(chutten) → data-review+
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED