Closed Bug 1162538 Opened 5 years ago Closed 4 years ago

Telemetry redesign: Implement archived ping storage quota

Categories

(Toolkit :: Telemetry, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox40 --- fixed
firefox41 --- fixed

People

(Reporter: gfritzsche, Assigned: Dexter)

References

(Blocks 1 open bug)

Details

(Whiteboard: [uplift])

Attachments

(5 files, 16 obsolete files)

13.59 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
11.47 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
8.90 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
7.13 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
14.49 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
This comes from the new ping expiry rules here:
https://docs.google.com/document/d/1K_N3Vkxdm-Ham_GhuXhMsC7-qz2ksDKzAtz5J0gBnag/edit

This is the implementation discussion:
https://docs.google.com/document/d/19tBB2SYfO3Q4wglq32CLyT4BVJZbA7aHjEshXCLV9_4/edit

Note that bug 1120380 handles removing old pings already, this is about extending this cleanup task to implement a storage quota for archived & pending pings.
Assignee: nobody → alessio.placitelli
This patch computes how much space the archive is eating on the disk and, if that's more than the quota we set, removes older pings.

Pending pings quota and archived ping count/evicted count will be handled in separate patches.
Attachment #8609273 - Flags: review?(gfritzsche)
This adds test coverage for the archive quota.
Attachment #8609274 - Flags: review?(gfritzsche)
Summary: Telemetry redesign: Implement ping storage quota → Telemetry redesign: Implement archived ping storage quota
Comment on attachment 8609273 [details] [diff] [review]
part 1 - Extend the cleanup task to enforce the archive quota

Review of attachment 8609273 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +559,5 @@
>  
>      // Make sure to clear |_archiveCleanTask| once done.
>      let clear = () => this._archiveCleanTask = null;
>      // Since there's no archive cleaning task running, start it.
> +    this._archiveCleanTask = this._cleanArchiveTask().then(clear, clear);

Let's please stay with a consistent naming order - either *archiveClean* or *cleanArchive*.
This should have been addressed in bug 1120380.

@@ +616,5 @@
> +    yield this.loadArchivedPingList();
> +
> +    // Refresh the cache, if needed. That's needed even after |loadArchivedPingList| as
> +    // this function could also be called long after startup.
> +    if (newestRemovedMonth) {

We could still skip this if we didn't scan the archive yet and no scan is in progress, right?

In this case it might be cheap enough to just do this instead of introducing task dependencies, but lets at least fix the comment.

@@ +653,5 @@
> +    let lastPingIndexToKeep = null;
> +    let archiveSizeInBytes = 0;
> +
> +    // Find the disk size of the archive.
> +    for (let i = 0; i < pingList.length; i++) {

This should bail out early on this._shutdown too.

@@ +678,5 @@
> +    }
> +
> +    // Check if we're eating too much space. If so, remove the oldest pings until we reach
> +    // a comfort disk quota.
> +    if (archiveSizeInBytes >= Policy.getArchiveQuota()) {

An early return here if (archiveSize < quota) would spare us the indentation below.

@@ +684,5 @@
> +                     + ", safety quota: " + SAFE_QUOTA + "bytes");
> +
> +      // Remove all the pings older than the last one which we are safe to keep
> +      // (|lastPingIndexToKeep|).
> +      for (let i = lastPingIndexToKeep + 1; i < pingList.length; i++) {

Instead of working with the index we could just e.g. do:
pingsToPurge = pingList.slice(lastIndexToKeep + 1);
... i think thats much better to read.

Also, _shutdown bailout.

@@ +762,5 @@
>      for (let dir of subdirs) {
> +      if (this._shutdown) {
> +        this._log.trace("_scanArchive - Terminating archive scanning task due to shutdown");
> +        return new Map();
> +      }

That might be an ok change, but is rather unrelated to the changes in this patch.
Let's at least put it in a separate patch.

@@ +1195,5 @@
> + * @return {Integer} The file size, in bytes, of the ping file or 0 on errors.
> + */
> +let getArchivedPingSize = Task.async(function*(aPingId, aDate, aType) {
> +  const filePath = getArchivedPingPath(aPingId, aDate, aType);
> +  const filePathCompressed = filePath + "lz4";

let filePaths = [ getArchivedPingPath(aPingId, aDate, aType) ];
filePath.push(... + "lz4");

for (let path of filePaths) {
 try {
  return (... stat(path)...
Attachment #8609273 - Flags: review?(gfritzsche) → feedback+
Attachment #8609273 - Attachment is obsolete: true
Attachment #8609406 - Flags: review?(gfritzsche)
Attachment #8609274 - Attachment is obsolete: true
Attachment #8609274 - Flags: review?(gfritzsche)
Attachment #8609407 - Flags: review?(gfritzsche)
Attachment #8609409 - Flags: review?(gfritzsche)
Comment on attachment 8609406 [details] [diff] [review]
part 1 - Extend the cleanup task to enforce the archive quota - v2

Review of attachment 8609406 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +559,5 @@
>  
>      // Make sure to clear |_archiveCleanTask| once done.
>      let clear = () => this._archiveCleanTask = null;
>      // Since there's no archive cleaning task running, start it.
> +    this._archiveCleanTask = this._archiveClean().then(clear, clear);

Nit: _archiveClean() is odd - _cleanArchive() and _cleanArchiveTask.

@@ +676,5 @@
> +        // We save the index of the last ping which is ok to keep in order to speed up ping
> +        // pruning.
> +        lastPingIndexToKeep = i;
> +      } else if (archiveSizeInBytes > Policy.getArchiveQuota()) {
> +        // Ouch, looks like our ping archive is too big. Bail out and start pruning!

Nit: it doesn't look like it's too big, it is too big.

@@ +681,5 @@
> +        break;
> +      }
> +    }
> +
> +    // Check if we're eating too much space. If not, bail out.

s/eating/using/

@@ +692,5 @@
> +
> +    let pingsToPurge = pingList.slice(lastPingIndexToKeep + 1);
> +
> +    // Remove all the pings older than the last one which we are safe to keep
> +    // (|lastPingIndexToKeep|).

Nit: We don't need to reference lastPingIndexToKeep anymore, we have a sufficiently descriptively named local variable now.

@@ +701,5 @@
> +      }
> +
> +      // This list is guaranteed to be in order, so remove the pings at its
> +      // beginning (oldest).
> +      yield this._removeArchivedPing(ping.id, ping.timestampCreated, ping.type);

Is it much work to remove full monthly directories here where possible (i.e. if all the pings in a monthly dir will be purged?
Could make it a follow-up.

@@ +716,5 @@
> +      return;
> +    }
> +
> +    // Remove pings older than 180 days.
> +    yield this._purgeOldPings();

If anything in _purgeOldPings throws, we get a rejection here (i.e. this throws here too).
We'll still want to run the quota enforcement below, so we should try/catch and log an error.

@@ +1201,5 @@
> + * @return {Integer} The file size, in bytes, of the ping file or 0 on errors.
> + */
> +let getArchivedPingSize = Task.async(function*(aPingId, aDate, aType) {
> +  let filePaths = [ getArchivedPingPath(aPingId, aDate, aType) ];
> +  filePaths.unshift(filePaths[0] + "lz4");

Lets make this more readable:
 let path = ...
 let filePaths = [ path + "lz4", path ];
Attachment #8609406 - Flags: review?(gfritzsche) → review+
Status: NEW → ASSIGNED
Comment on attachment 8609407 [details] [diff] [review]
part 2 - Make sure _scanArchive bails out on shutdown

Review of attachment 8609407 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +773,5 @@
>      for (let dir of subdirs) {
> +      if (this._shutdown) {
> +        this._log.trace("_scanArchive - Terminating archive scanning task due to shutdown");
> +        return new Map();
> +      }

I think this has a big risk with throwing off other tasks that use the scanned archive list.
E.g. if we don't watch out we could report wildly wrong archived ping counts.

I think we should start out with watching the histogram that measures how long this takes and think about this if it's actually expensive.
Attachment #8609407 - Flags: review?(gfritzsche)
Comment on attachment 8609409 [details] [diff] [review]
part 3 - Add test coverage for the archive quota

Review of attachment 8609409 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +251,5 @@
> +  fakeStorageQuota(TEST_ARCHIVE_QUOTA_BYTES);
> +
> +  // The storage prunes archived pings until we reach 90% of the requested storage quota.
> +  // Based on that, find how many pings should be kept and which one are pruned.
> +  const NUM_KEPT_PINGS = Math.floor((TEST_ARCHIVE_QUOTA_BYTES * 0.9) / PING_SIZE);

This is assuming that they are all the same size on disk and that there are no other files in the directory.
This sounds like oranges waiting to happen.

@@ +253,5 @@
> +  // The storage prunes archived pings until we reach 90% of the requested storage quota.
> +  // Based on that, find how many pings should be kept and which one are pruned.
> +  const NUM_KEPT_PINGS = Math.floor((TEST_ARCHIVE_QUOTA_BYTES * 0.9) / PING_SIZE);
> +  // Compile an array of ping ids that should have been removed. The following line also
> +  // modifies |archiveIds| so that it only contains the ids of the retained pings.

Instead of commenting this, lets use more descriptive variable names?
const expectedPrunedIds = ...
const expectedNotPrunedIds = ...

@@ +271,5 @@
> +  // .. and that the correct pings were kept.
> +  for (let pingId of archivedIds) {
> +    Assert.ok((yield TelemetryArchive.promiseArchivedPingById(pingId)),
> +              "Ping " + pingId + " should be kept in the archive");
> +  }

Now trigger it again and check that we are not pruning anything.
Attachment #8609409 - Flags: review?(gfritzsche)
Attachment #8609407 - Attachment is obsolete: true
Thanks Georg. This patch also takes care of the Find&Replace error (|_archiveCleanTaskArchiveLoadingTask|) from bug 1120380.
Attachment #8609406 - Attachment is obsolete: true
Attachment #8610086 - Flags: review+
Thanks Georg. As discussed over IRC, I've joined the cleanup tests.

This patch also changes the quota part so that the pings are generated in different folders. The sizes of the ping files are read from the disk and we also check that pings removed from the cache are no longer on the disk.

Moreover, a new archive check is triggered before leaving the test to make sure we're not touching anything if that's not needed.
Attachment #8609409 - Attachment is obsolete: true
Attachment #8610088 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #7)
> Comment on attachment 8609406 [details] [diff] [review]
> part 1 - Extend the cleanup task to enforce the archive quota - v2
> 
> Review of attachment 8609406 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +701,5 @@
> > +      }
> > +
> > +      // This list is guaranteed to be in order, so remove the pings at its
> > +      // beginning (oldest).
> > +      yield this._removeArchivedPing(ping.id, ping.timestampCreated, ping.type);
> 
> Is it much work to remove full monthly directories here where possible (i.e.
> if all the pings in a monthly dir will be purged?
> Could make it a follow-up.

I'm filing a followup for that.
Attached patch part 3 - Add archive probes (obsolete) — Splinter Review
Georg, do the types of those probes look ok to you?
Attachment #8610111 - Flags: feedback?(gfritzsche)
Comment on attachment 8610088 [details] [diff] [review]
part 2 - Add test coverage for the archive quota - v2

Review of attachment 8610088 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +58,5 @@
> +
> +    // Then get a list of the files for the current subdir and sum the file sizes
> +    // to get the archive size.
> +    for (let f of files) {
> +      if (!f.name.endsWith('jsonlz4') && !f.name.endsWith('json')) {

This will break with e.g. 'jsongz' suffixes.
Check against e.g. /^[a-z0-9]+\.json[a-z0-9]+/i ?

@@ +74,5 @@
> + * Count the number of pings it takes to reach a specific disk quota.
> + * @return {Integer} The number of pings, in the archive directory, it takes to reach the
> + *         quota.
> + */
> +let getPingCountToQuota = Task.async(function*(aQuota) {

Most lines are just duplicating the above function, can we make that shared?

Both getArchiveDirectorySize & getPingCountToQuota depend on the order you encounter the files when iterating over the directory contents.
You really want to operate over them ordered by time/date descending, a common helper function could return you that.

@@ +264,4 @@
>  
> +  // Create 20 other pings which are within the retention period, but would be affected
> +  // by the disk quota.
> +  for (let i = 0; i < 10; i++) {

for (let month of [3, 4]) {
  for (let minute = 0; minute < 10; minute++) {
    ...

@@ +290,2 @@
>  
>    // Move the current date 180 days ahead of the PING_ID2 ping.

PING_ID2?

@@ +291,5 @@
>    // Move the current date 180 days ahead of the PING_ID2 ping.
>    fakeNow(2010, 8, 1, 1, 0, 0);
>    // Reset TelemetryController but not the TelemetryArchive: the cleanup task will update
>    // the existing archive cache.
>    yield TelemetryController.reset();

Lets reset the archive too to "properly" fake a browser restart.

@@ +303,5 @@
> +  // Check that the archive is in the correct state.
> +  yield checkArchive();
> +
> +  // Find how much space the archive and a single archived ping take. All pings have the
> +  // same size.

Remove "and a single archived ping take", "All pings have the same size."

@@ +312,5 @@
> +
> +  // The storage prunes archived pings until we reach 90% of the requested storage quota.
> +  // Based on that, find how many pings should be kept.
> +  const keptPingsNum = yield getPingCountToQuota(testQuotaInBytes * 0.9);
> +  const pingsRemoveByQuota =

pingsRemovedByQuota
Attachment #8610088 - Flags: review?(gfritzsche)
Comment on attachment 8610111 [details] [diff] [review]
part 3 - Add archive probes

Review of attachment 8610111 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +4396,5 @@
>    },
> +  "TELEMETRY_ARCHIVE_DIRECTORIES_COUNT": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "count",

A count is used for values that are incremented by 1 (e.g. use counts).
You can't start them out with a fixed value & currently they only allow incrementing by one per call.

We can use a linear histograms here with appropriate bucketing & "high" set sufficiently high.
We shouldn't need to many buckets and too high values for this.

@@ +4402,5 @@
> +  },
> +  "TELEMETRY_ARCHIVE_PING_COUNT": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "count",

Linear histogram, submitted after the scan.
Bucket count and "high" will have to be higher here.

@@ +4408,5 @@
> +  },
> +  "TELEMETRY_ARCHIVE_EVICTED_OVER_QUOTA": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "count",

Linear, submitted once after the quota enforcement.

@@ +4414,5 @@
> +  },
> +  "TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "count",

Linear, submitted once after "old dir eviction"

@@ +4421,5 @@
> +  "TELEMETRY_ARCHIVE_EVICTING_DIRS_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "5000",

Lets maybe try 10000 to see more detailed values if it's higher then expected.
Dito the other durations below.
Attachment #8610111 - Flags: feedback?(gfritzsche)
Thanks Georg, I've addressed your concerns for the tests in this patch.
Attachment #8610088 - Attachment is obsolete: true
Attachment #8610443 - Flags: review?(gfritzsche)
Comment on attachment 8610443 [details] [diff] [review]
part 2 - Add test coverage for the archive quota - v3

Review of attachment 8610443 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the below addressed.

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +38,5 @@
> + * @return {String} The full path to the archived ping.
> + */
> +function getArchivedPingPath(aPingId, aDate, aType) {
> +  let storage = Cu.import("resource://gre/modules/TelemetryStorage.jsm");
> +  return storage.getArchivedPingPath(aPingId, aDate, aType);

Let's just export this and the one below properly as say TelemetryStorage._testGetArchivedPingPath() / _testGetArchivedPingDataFromFileName(), with comments that they are for testing only.

@@ +288,2 @@
>    fakeNow(2010, 7, 1, 1, 0, 0);
>    // Reset TelemetryArchive and TelemetryController to start the startup cleanup.

I'm missing the actual reset of the TelemetryArchive here.

@@ +319,5 @@
> +  fakeStorageQuota(testQuotaInBytes);
> +
> +  // The storage prunes archived pings until we reach 90% of the requested storage quota.
> +  // Based on that, find how many pings should be kept.
> +  const keptPingsNum = yield getPingCountToQuota(testQuotaInBytes * 0.9);

getPingCountToQuota() is only used in one place.
Instead of juggling around with count/indices we could just iterate through a list from getArchivedPingsInfo() and push to two sequences directly (say pingsWithinQuota & pingsOutsideQuota).
Attachment #8610443 - Flags: review?(gfritzsche) → review+
Thanks Georg.
Attachment #8610443 - Attachment is obsolete: true
Attachment #8610510 - Flags: review+
Attached patch part 3 - Add archive probes (obsolete) — Splinter Review
This patch adds the archive probes. As discussed over IRC, I've changed  the ping counts to "exponential", rather than "linear".
Attachment #8610111 - Attachment is obsolete: true
Attachment #8610598 - Flags: review?(gfritzsche)
Since I'm using |truncateToDays|, and the name is misleading, I've renamed and moved it to TelemetryUtils.
Attachment #8610599 - Flags: review?(gfritzsche)
Comment on attachment 8610598 [details] [diff] [review]
part 3 - Add archive probes

Review of attachment 8610598 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +4397,5 @@
> +  "TELEMETRY_ARCHIVE_DIRECTORIES_COUNT": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "100",

We don't expect more than 6, so 100 seems pretty high. 50?

@@ +4405,5 @@
> +  "TELEMETRY_ARCHIVE_OLDEST_DIRECTORY_AGE": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "360",

300? Then we have each bucket cover roughly a month.

@@ +4428,5 @@
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "300",
> +    "n_buckets": 20,

A bit more detail would be good here. 60 would get us 5MB resolution.

@@ +4444,5 @@
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "100",
> +    "n_buckets": 3,

Can we easily change the bucket count later without breaking the histogram history?
3 buckets sounds like pretty low resolution.

@@ +4451,5 @@
> +  "TELEMETRY_ARCHIVE_EVICTING_DIRS_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "10000",

This could stall a lot worse, 10sec is not that long.

@@ +4460,5 @@
> +  "TELEMETRY_ARCHIVE_CHECKING_OVER_QUOTA_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "10000",

Dito.

@@ +4469,5 @@
> +  "TELEMETRY_ARCHIVE_EVICTING_OVER_QUOTA_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "10000",

Dito.
Attachment #8610598 - Flags: review?(gfritzsche)
Comment on attachment 8610599 [details] [diff] [review]
part 4 - Add/move millisecondsToDays to TelemetryUtils

Review of attachment 8610599 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ +490,5 @@
>          scope: addon.scope,
>          type: addon.type,
>          foreignInstall: addon.foreignInstall,
>          hasBinaryComponents: addon.hasBinaryComponents,
> +        installDay: TelemetryUtils.millisecondsToDays(installDate.getTime()),

I think you'll want to globally alias `TelemetryUtils` here too, e.g. as `Utils`.

::: toolkit/components/telemetry/TelemetryUtils.jsm
@@ +15,5 @@
>  this.TelemetryUtils = {
>    /**
> +   * Turn a millisecond timestamp into a day timestamp.
> +   *
> +   * @param aMsec a number of milliseconds since epoch.

"Unix epoch", upper-case letter at the start of the sentence.

@@ +16,5 @@
>    /**
> +   * Turn a millisecond timestamp into a day timestamp.
> +   *
> +   * @param aMsec a number of milliseconds since epoch.
> +   * @return the number of whole days denoted by the input.

"The number of whole days since Unix epoch"
Attachment #8610599 - Flags: review?(gfritzsche) → review+
Attachment #8610599 - Attachment is obsolete: true
Attachment #8610629 - Flags: review+
Attached patch part 3 - Add archive probes - v2 (obsolete) — Splinter Review
Thanks Georg, I've updated the probes.
Attachment #8610598 - Attachment is obsolete: true
Attachment #8610638 - Flags: review?(gfritzsche)
This patch adds test coverage for the probes.
Attachment #8610639 - Flags: review?(gfritzsche)
Blocks: 1168728
Comment on attachment 8610638 [details] [diff] [review]
part 3 - Add archive probes - v2

Review of attachment 8610638 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +4397,5 @@
> +  "TELEMETRY_ARCHIVE_DIRECTORIES_COUNT": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "50",

Thinking again here, 12 would already be unusual.
high:13, buckets:12?

@@ +4407,5 @@
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "13",
> +    "n_buckets": 12,
> +    "description": "The age of the oldest Telemetry archive directory (months)"

"... in months"

@@ +4409,5 @@
> +    "high": "13",
> +    "n_buckets": 12,
> +    "description": "The age of the oldest Telemetry archive directory (months)"
> +  },
> +  "TELEMETRY_ARCHIVE_PING_COUNT": {

TELEMETRY_ARCHIVE_SCAN_PING_COUNT

@@ +4437,5 @@
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 100,
> +    "description": "Number of Telemetry pings evicted from the Archive, because over quota, during clean-up"

"... from the archive during cleanup, because they were over the quota"

@@ +4443,5 @@
> +  "TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "100",

If we remove 12 there is already a lot wrong (or someone didn't use their profile in a long time).
high:13, n_buckets:12?

@@ +4445,5 @@
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": "100",
> +    "n_buckets": 10,
> +    "description": "Number of Telemetry directories evicted from the Archive, because too old, during clean-up"

"... from the archive during cleanup, because they were too old"

@@ +4451,5 @@
> +  "TELEMETRY_ARCHIVE_EVICTING_DIRS_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": "120000",

This is exponential, so we shouldn't have a problem going a bit higher for this still.
5min to cover more extreme cases? Dito the two below.

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +629,3 @@
>  
>            // Update the newest removed month.
>            if (archiveDate > newestRemovedMonth) {

newestRemovedMonth = Math.max(archiveDate, newestRemovedMonth)?

@@ +636,5 @@
>          }
> +      } else {
> +        // We're not removing this directory, so record the age for the oldest directory.
> +        const dirAgeInMonths = (nowDate.getMonth() - archiveDate.getMonth())
> +                               + 12 * (nowDate.getFullYear() - archiveDate.getFullYear());

This looks like a helper function waiting to happen.

@@ +637,5 @@
> +      } else {
> +        // We're not removing this directory, so record the age for the oldest directory.
> +        const dirAgeInMonths = (nowDate.getMonth() - archiveDate.getMonth())
> +                               + 12 * (nowDate.getFullYear() - archiveDate.getFullYear());
> +        if (dirAgeInMonths > oldestDirectoryAgeMonths) {

maxDirAgeInMonths = Math.max(... ?

@@ +664,5 @@
> +    // Save the time it takes to evict old directories and the eviction count.
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS")
> +             .add(evictedDirsCount);
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTING_DIRS_MS")
> +             .add(Math.round(Policy.now().getTime() - now));

Math.ceil?
Not that it matters much, but wouldn't we want to take the end-time first, then start submitting measurements?

@@ +666,5 @@
> +             .add(evictedDirsCount);
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTING_DIRS_MS")
> +             .add(Math.round(Policy.now().getTime() - now));
> +    if (oldestDirectoryAgeMonths >= 0) {
> +      // Only record the age if we have at least a directory.

Why? Wouldn't we want to have a value then too so this shows up in the dashboard?
Also, if we only have a dir for the current month, we get 0 here.

@@ +734,3 @@
>      if (archiveSizeInBytes < Policy.getArchiveQuota()) {
> +      Telemetry.getHistogramById("TELEMETRY_ARCHIVE_SIZE_MB")
> +               .add(Math.round(archiveSizeInBytes / 1024 / 1024));

Also need to submit values for the other two probes here.
For avoiding repetition, maybe put the recording in a local helper function.

@@ +764,5 @@
> +             .add(pingsToPurge.length);
> +    // Save the time it takes to evict over-quota pings. We just do this if we're
> +    // over quota.
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTING_OVER_QUOTA_MS")
> +             .add(Math.round(Policy.now().getTime() - now));

Not that it matters much, but wouldn't we want to take the end-time first, then start submitting measurements?

@@ +877,5 @@
> +    // Update the ping and directories count histograms.
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_PING_COUNT")
> +             .add(this._archivedPings.size);
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_DIRECTORIES_COUNT")
> +             .add(subdirs.length);

You also should submit this on the early bail-out for the dir not existing.
Attachment #8610638 - Flags: review?(gfritzsche)
Comment on attachment 8610639 [details] [diff] [review]
part 5 - Add test coverage for the probes

Review of attachment 8610639 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok, but you'll want to add coverage for the open points in the probes patch.
Attachment #8610639 - Flags: review?(gfritzsche) → review+
Attached patch part 3 - Add archive probes - v3 (obsolete) — Splinter Review
Thanks Georg. This addresses your suggestions and adds a new helper function to TelemetryUtils.
Attachment #8610638 - Attachment is obsolete: true
Attachment #8611089 - Flags: review?(gfritzsche)
Attachment #8610639 - Attachment is obsolete: true
Attachment #8611098 - Flags: review+
Comment on attachment 8611089 [details] [diff] [review]
part 3 - Add archive probes - v3

Review of attachment 8611089 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +621,5 @@
>          continue;
>        }
>  
>        // If this archive directory is older than 180 days, remove it.
> +      if (!Utils.areTimesClose(archiveDate.getTime(), now,

Ok, looks like i missed that in a previous review, but the behavior of areTimesClose() is not really what we want here.
We can just do the simpler thing which does exactly what we want: ((now - archiveDate) > MAX_RETENTION)

@@ +727,5 @@
> +
> +    let submitProbes = (sizeInMB, evictedPings, elapsedMs) => {
> +      Telemetry.getHistogramById("TELEMETRY_ARCHIVE_SIZE_MB").add(sizeInMB);
> +      Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTED_OVER_QUOTA")
> +               .add(evictedPings);

Nit: Why is this on a new line and the next one not? Lets stay consistent.

@@ +877,5 @@
>  
>      // Mark the archive as scanned, so we no longer hit the disk.
>      this._scannedArchiveDirectory = true;
> +    // Update the ping and directories count histograms.
> +    submitProbes(this._archivedPings.size, subdirs.length);

subdirs includes the invalid dir names that we skip too.
You could just add a filter() for subdirs and move the validity check up there:
let subdirs = ...filter(... isDir).filter(... isValidArchiveDir)
Attachment #8611089 - Flags: review?(gfritzsche) → review+
Comment on attachment 8611089 [details] [diff] [review]
part 3 - Add archive probes - v3

Review of attachment 8611089 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +662,5 @@
> +    // Save the time it takes to evict old directories and the eviction count.
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTED_OLD_DIRS")
> +             .add(evictedDirsCount);
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_EVICTING_DIRS_MS")
> +             .add(Math.ceil(endTimestamp - now));

|endTime - now| is confusing.
s/now/startTimeStamp/?
Same for the other ones below.
Attached patch part 3 - Add archive probes - v4 (obsolete) — Splinter Review
Thanks Georg.

Here's the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4069d9dc88b7
Attachment #8611089 - Attachment is obsolete: true
Attachment #8611122 - Flags: review+
Blocks: 1168835
Keywords: checkin-needed
Keywords: checkin-needed
This patch adds a missing utility function which was not in the previous patches and renames |now| as suggested.
Attachment #8612833 - Flags: review?(gfritzsche)
Comment on attachment 8612833 [details] [diff] [review]
part 6 - Fixes test failures and renames |now|

Review of attachment 8612833 [details] [diff] [review]:
-----------------------------------------------------------------

That looks fine, but it seems like we should fix the patches that were missing this instead of adding another patch.
Attachment #8612833 - Flags: review?(gfritzsche) → review+
Adds the missing function from part 6.
Attachment #8612833 - Attachment is obsolete: true
Attachment #8612883 - Flags: review+
Renames |now| (was in part 6).
Attachment #8610629 - Attachment is obsolete: true
Attachment #8611122 - Attachment is obsolete: true
Attachment #8612885 - Flags: review+
Depends on: 1170520
Whiteboard: [uplift]
Comment on attachment 8610086 [details] [diff] [review]
part 1 - Extend the cleanup task to enforce the archive quota - v3

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8610086 - Flags: approval-mozilla-aurora?
Comment on attachment 8610510 [details] [diff] [review]
part 2 - Add test coverage for the archive quota - v4

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8610510 - Flags: approval-mozilla-aurora?
Comment on attachment 8611098 [details] [diff] [review]
part 5 - Add test coverage for the probes - v2

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8611098 - Flags: approval-mozilla-aurora?
Comment on attachment 8612883 [details] [diff] [review]
part 4 - Add/move millisecondsToDays to TelemetryUtils - v3

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8612883 - Flags: approval-mozilla-aurora?
Comment on attachment 8612885 [details] [diff] [review]
part 3 - Add archive probes - v5

Approval Request Comment
[Feature/regressing bug #]:
Unified Telemetry, https://wiki.mozilla.org/Unified_Telemetry
This is part of the first (main) batch of uplifts to 40 to enable shipping on that train, see bug 1120356, comment 2.
[User impact if declined]:
Data & measurement insight projects delayed or blocked with direct impact on projects depending on this.
[Describe test coverage new/current, TreeHerder]:
We have good automation coverage of the feature.
We also had manual tests of the main tasks as well as confirmation of correct behavior on Nightly for the patches here.
[Risks and why]:
Low-risk - these patches are rather isolated to Telemetry and have been on Nightly for a while with no bad reports.
We intend to track on-going data quality and other issues during the 40 aurora & beta and flip the new behavior off when it doesn't meet the requirements.
[String/UUID change made/needed]:
The only string changes were to the about:telemetry page.
We decided that we can live with missing translations on that page for a cycle as that page is not exactly user-facing.
Attachment #8612885 - Flags: approval-mozilla-aurora?
Comment on attachment 8610086 [details] [diff] [review]
part 1 - Extend the cleanup task to enforce the archive quota - v3

Unified telemetry is an important new feature. It is blocking some other projects. Taking it.
Attachment #8610086 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8610510 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8611098 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8612883 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8612885 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.