Closed Bug 1033860 Opened 5 years ago Closed 3 years ago

Add probes for time needed to load pending & archived pings from disk

Categories

(Toolkit :: Telemetry, defect, P4)

defect
Points:
1

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: vladan, Assigned: fionn_mac, Mentored)

Details

(Whiteboard: [measurement:client] [lang=js])

Attachments

(1 file, 8 obsolete files)

The telemetry upload mechanism is being used to upload non-Telemetry data such as ANR logs and Loop logs. This is achieved by dropping (uncompressed) payloads into the saved-telemetry-pings folder, which are then read in on Firefox startup (up to a max of 17 pings are read) and then Telemetry uploads the pings during idle-daily or more rarely, on startup.

Some of these non-Telemetry payloads may be quite large, so we should keep an eye out on the amount of time spent reading these payloads on startup.

We should add a Telemetry probe for this, and also uplift it so that we can do before & after comparisons with bug 998989.
Summary: Report amount of time spent reading Telemetry logs → Report amount of time spent reading saved Telemetry pings on startup
Summary: Report amount of time spent reading saved Telemetry pings on startup → Report time spent reading saved Telemetry pings on startup
Assignee: vladan.bugzilla → nobody
We refactored this quite a bit and don't load all the pings on startup anymore. We now keep track of on-disk pending pings and only load them on-demand when sending.

We do have probes in place for the pending ping directory scan times etc., but not specifically for load times, which would be good to keep track of.
We should track this separately for pending and archived pings:
https://dxr.mozilla.org/mozilla-central/rev/e5a10bc7dac4ee2453d8319165c1f6578203eac7/toolkit/components/telemetry/TelemetryStorage.jsm#1304
https://dxr.mozilla.org/mozilla-central/rev/e5a10bc7dac4ee2453d8319165c1f6578203eac7/toolkit/components/telemetry/TelemetryStorage.jsm#677

I'm suggesting to add "TELEMETRY_LOAD_PENDING_MS" & "TELEMETRY_LOAD_ARCHIVED_MS" as exponential histograms around here:
https://dxr.mozilla.org/mozilla-central/rev/e5a10bc7dac4ee2453d8319165c1f6578203eac7/toolkit/components/telemetry/Histograms.json#4900
Mentor: gfritzsche, alessio.placitelli
Points: --- → 1
Priority: -- → P3
Summary: Report time spent reading saved Telemetry pings on startup → Add probes for time needed to load pending & archived pings from disk
Whiteboard: [measurement:client] [lang=js]
(In reply to Georg Fritzsche [:gfritzsche] from comment #1)
> I'm suggesting to add "TELEMETRY_LOAD_PENDING_MS" &
> "TELEMETRY_LOAD_ARCHIVED_MS"

TELEMETRY_PENDING_LOAD_MS & TELEMETRY_ARCHIVE_LOAD_MS would be more consistent with existing histogram names though.
malayaleecoder, would you be interested in taking this?
Flags: needinfo?(malayaleecoder)
Sure Georg, I did not quite understand what the issue is! Will take this up anyways :)
Flags: needinfo?(malayaleecoder) → needinfo?(gfritzsche)
Sorry for the late follow-up :)

We already have other Telemetry probes in place that measure how long certain operations take.
E.g. with "TELEMETRY_PENDING_CHECKING_OVER_QUOTA_MS" we measure how long it takes to check whether we exceeded our storage quota for outgoing ping.
We register the probe here:
https://dxr.mozilla.org/mozilla-central/rev/16663eb3dcfa759f25b5e27b101bc79270c156f2/toolkit/components/telemetry/Histograms.json#5066
... and record the time needed for this operation here:
https://dxr.mozilla.org/mozilla-central/rev/16663eb3dcfa759f25b5e27b101bc79270c156f2/toolkit/components/telemetry/TelemetryStorage.jsm#1091
This e.g. shows up in our dashboard here:
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-05-02&keys=__none__!__none__!__none__&max_channel_version=beta%252F45&measure=TELEMETRY_PENDING_CHECKING_OVER_QUOTA_MS&min_channel_version=null&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2015-12-18&table=0&trim=1&use_submission_date=0
... and we can monitor it for regressions and distributions.

Now here we are planning to also measure how long it takes to load outgoing pings from disk:
https://dxr.mozilla.org/mozilla-central/rev/e5a10bc7dac4ee2453d8319165c1f6578203eac7/toolkit/components/telemetry/TelemetryStorage.jsm#1304
... and also archived ones:
https://dxr.mozilla.org/mozilla-central/rev/e5a10bc7dac4ee2453d8319165c1f6578203eac7/toolkit/components/telemetry/TelemetryStorage.jsm#677
Flags: needinfo?(gfritzsche)
Does that clear it up?
Flags: needinfo?(malayaleecoder)
Yup, I am on it! Will ping you if I get stuck somewhere :)
Flags: needinfo?(malayaleecoder)
Great, thanks.
Assignee: nobody → malayaleecoder
Hi, did you get stuck on anything?
Me or Alessio/Dexter are happy to help if anything is blocking you.
Flags: needinfo?(malayaleecoder)
Very sorry Georg for not progressing much into this bug, as I did not get time.
Will take it up when I find some time in between.
Assignee: malayaleecoder → nobody
Flags: needinfo?(malayaleecoder)
Priority: P3 → P4
Attached patch Initial patch for Bug 1033860 (obsolete) — Splinter Review
Hello!

I found this bug quite interesting, and since no one was working on it, I gave it a go!

This is an initial patch. The newly created histograms have the same values for "alert_emails", "high" and "n_buckets" as "TELEMETRY_PENDING_CHECKING_OVER_QUOTA_MS", as it was mentioned as an example.
Attachment #8846344 - Flags: review?(gfritzsche)
I mistakenly added a review flag instead of a needinfo flag. Sorry!

Currently, the build is failing with the error 'New histogram "TELEMETRY_PENDING_LOAD_MS" must have a bug_numbers field.'. What can be the possible reason behind this?
Attached patch Initial patch for Bug 1033860 (obsolete) — Splinter Review
Fixed the issue that was occurring earlier.

I was able to build successfully.
I also ran the commands |./mach eslint toolkit/components/telemetry/| and |./mach xpcshell-test toolkit/components/telemetry/|. The patch passed all tests.
Attachment #8846344 - Attachment is obsolete: true
Attachment #8846344 - Flags: review?(gfritzsche)
Attachment #8846350 - Flags: review?(gfritzsche)
Assignee: nobody → vedant.sareen
Comment on attachment 8846350 [details] [diff] [review]
Initial patch for Bug 1033860

I'm short on time, redirecting request.
Attachment #8846350 - Flags: review?(gfritzsche) → review?(alessio.placitelli)
Comment on attachment 8846350 [details] [diff] [review]
Initial patch for Bug 1033860

Review of attachment 8846350 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like a good start, thanks for the patch!

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +690,4 @@
>     * @return {promise<object>} Promise that is resolved with the ping data.
>     */
>    loadArchivedPing: Task.async(function*(id) {
> +    let startTimeStamp = Policy.now().getTime();

Let's make this a const.

@@ +729,5 @@
>      }
> +
> +    // Save the time it takes to load archived ping into histogram.
> +    Telemetry.getHistogramById("TELEMETRY_ARCHIVE_LOAD_MS")
> +         .add(Math.round(Policy.now().getTime() - startTimeStamp));

Please align |.add| to the |.getHistogramById| on the previous line.

@@ +1327,4 @@
>  
>    loadPendingPing: Task.async(function*(id) {
>      this._log.trace("loadPendingPing - id: " + id);
> +    let startTimeStamp = Policy.now().getTime();

Let's make this a const.

@@ +1354,5 @@
>        throw new Error("loadPendingPing - exceeded the maximum ping size: " + fileSize);
>      }
>  
> +    // Try to load the ping file.
> +    // Update the related histograms on the basis of success/failure.

Let's drop the change to this comment.

@@ +1373,5 @@
>      }
>  
> +    // Save the time it takes to load the pending pings into histogram.
> +    Telemetry.getHistogramById("TELEMETRY_PENDING_LOAD_MS")
> +         .add(Math.round(Policy.now().getTime() - startTimeStamp));

Please fix the alignment as for the previous case.

::: toolkit/components/telemetry/histogram-whitelists.json
@@ +1541,4 @@
>      "SYSTEM_FONT_FALLBACK",
>      "SYSTEM_FONT_FALLBACK_FIRST",
>      "SYSTEM_FONT_FALLBACK_SCRIPT",
> +    "TELEMETRY_ARCHIVE_LOAD_MS",

Let's not add the probe here. We're using this file for legacy compatibility with old probes.

@@ +1553,4 @@
>      "TELEMETRY_PENDING_EVICTING_OVER_QUOTA_MS",
>      "TELEMETRY_PENDING_LOAD_FAILURE_PARSE",
>      "TELEMETRY_PENDING_LOAD_FAILURE_READ",
> +    "TELEMETRY_PENDING_LOAD_MS",

Let's not add the probe here. We're using this file for legacy compatibility with old probes.
Attachment #8846350 - Flags: review?(alessio.placitelli)
Thank you for the feedback, Sir!

We have exams in our college from tomorrow onwards, so there might be a slight delay in uploading the rectified patch.

I'll try and upload one by Saturday though!
Attached patch Rectified Patch for Bug 1033860 (obsolete) — Splinter Review
Rectified the errors in previous patch and added a bug_number field to the new histograms.

I was able to build successfully. The patch passed xpcshell-test and eslint.
Attachment #8846350 - Attachment is obsolete: true
Attachment #8848858 - Flags: review?(alessio.placitelli)
Chris, would you mind picking up this one for review?
Flags: needinfo?(chutten)
Flags: needinfo?(chutten)
Attachment #8848858 - Flags: review?(alessio.placitelli) → review?(chutten)
Comment on attachment 8848858 [details] [diff] [review]
Rectified Patch for Bug 1033860

Review of attachment 8848858 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent work on the patch, I just made mention of a pattern that might simplify things and a change to the histogram parameters.

Once we get the parameters for these probes nailed down, we'll ask for Data Review. never-expiring probes likely will require tests, so consider looking into toolkit/components/telemetry/tests/unit/test_TelemetryStorage.jsm and see if you can add some tests there that ensure we're recording good values into these histograms.

::: toolkit/components/telemetry/Histograms.json
@@ +5935,5 @@
> +  "TELEMETRY_ARCHIVE_LOAD_MS": {
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "high": 300000,

In your manual testing, did these probes have values in them in your about:telemetry page? If so, what were typical values?

I don't expect that these operations will take too many seconds, so a lower "high" value might be appropriate. Maybe something around 10000.

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +690,4 @@
>     * @return {promise<object>} Promise that is resolved with the ping data.
>     */
>    loadArchivedPing: Task.async(function*(id) {
> +    const startTimeStamp = Policy.now().getTime();

Rather than keeping track of timestamps yourself, Telemetry provides the TelemetryStopwatch object for dealing with exactly these sorts of cases. 

You can see examples here: https://dxr.mozilla.org/mozilla-central/search?q=TelemetryStopwatch&redirect=true

But the tl;dr is - TelemetryStopwatch.start("TELEMETRY_ARCHIVE_LOAD_MS"); up here, and TelemetryStopwatch.stop("TELEMETRY_ARCHIVE_LOAD_MS"); down there. (and .cancel the stopwatch in any error cases.)
Attachment #8848858 - Flags: review?(chutten)
Attached patch Work in Progress (obsolete) — Splinter Review
This is the work that I've been able to do so far.
Currently, the new test that I've written fail as soon as it reaches the line

> TelemetryStorage.addPendingPing(ping).then(() => {

or the line

> TelemetryStorage.saveArchivedPing(ping).then(() => {

with the error message

> test_pendingLoadTime FAIL [test_pendingLoadTime : 224] A promise chain failed to handle a rejection: invalid path component
> stack: _do_main@/home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:211:5
> _execute_test@/home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:546:5
> @-e:1:1
> - false == true
> resource://testing-common/PromiseTestUtils.jsm:assertNoUncaughtRejections:224
> /home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:_run_next_test:1533
> /home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:run:703
> /home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:_do_main:211
> /home/vedant/Mozilla/mozilla-central/testing/xpcshell/head.js:_execute_test:546
> -e:null:1
(Similarly for the archived ping case)

I've been trying to debug this but I haven't been able to do so.

If I keep a throw state here -
https://dxr.mozilla.org/mozilla-central/rev/16663eb3dcfa759f25b5e27b101bc79270c156f2/toolkit/components/telemetry/TelemetryStorage.jsm#1299

> savePendingPing: function(ping) {
>   throw("testing");
>   let p = this.savePing(ping, true).then((path) => {

then the error is thrown successfully. However, if I keep it here -

> savePendingPing: function(ping) {
>   let p = this.savePing(ping, true).then((path) => {
>   throw("testing");
>     this._pendingPings.set(ping.id, {

Or here -
https://dxr.mozilla.org/mozilla-central/rev/16663eb3dcfa759f25b5e27b101bc79270c156f2/toolkit/components/telemetry/TelemetryStorage.jsm#1270

> savePing: Task.async(function*(ping, overwrite) {
>   throw("testing");
>   yield getPingDirectory();

Then this custom error isn't thrown and the test fails with the same error as mentioned above.
I haven't been able to understand why this may be happening, and any help would be really appreciated :)
Flags: needinfo?(chutten)
Promises can resolve or reject. If they resolve, then the first function in ".then(resolveFn, rejectFn)" is run. If they reject (which usually happens when something goes wrong) then it'll run the second function.

Since the ".then"s do not have rejection handlers, it errors out generically.

(more information on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise)
Flags: needinfo?(chutten)
I apologize for the delay in following up with the patch.
My pace of work has been really slow since I was earlier caught up with my final project submissions and now with preparing for my upcoming end-semester examinations.

I'll continue work on the patch from 26th onwards :)
Attached patch Patch with working tests. (obsolete) — Splinter Review
The tests which were written for |TelemetryStorage.jsm| have now been fixed.
I was able to build successfully.
The patch passed eslint and xpcshell tests.
Attachment #8848858 - Attachment is obsolete: true
Attachment #8855724 - Attachment is obsolete: true
Attachment #8863749 - Flags: review?(chutten)
Comment on attachment 8863749 [details] [diff] [review]
Patch with working tests.

Review of attachment 8863749 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with comments addressed.

I have raised f?bsmedberg so he can take a look at the new data collection we have proposed for Data Collection Review (see https://wiki.mozilla.org/Firefox/Data_Collection ). When that is successfully complete, add r=chutten and data-r=bsmedberg to the first line of your patch's commit message.

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +745,3 @@
>      }
> +
> +    // Save the time it takes to load archived ping into histogram.

You can omit this comment.

::: toolkit/components/telemetry/tests/unit/test_TelemetryStorage.js
@@ +12,5 @@
> +
> +function getPing() {
> +  TelemetrySession.earlyInit(true);
> +
> +  const type = PING_TYPE_MAIN;

You can use PING_TYPE_MAIN directly in the testAssemblePing call.
Attachment #8863749 - Flags: review?(chutten)
Attachment #8863749 - Flags: review+
Attachment #8863749 - Flags: feedback?(benjamin)
Comment on attachment 8863749 [details] [diff] [review]
Patch with working tests.

Talking to chutten on IRC, the existing alerting mechanism should be sufficient for these probes; these are a good candidate for future monitoring via mission control.

In addition to the list, please also add a real person's name to the alert_emails: not sure whether that would be chutten or gfritzsche.

data-r=me with that change
Attachment #8863749 - Flags: feedback?(benjamin) → feedback+
Go ahead and use my email address to alert_emails, fionn_mac
Attached patch Patch with comments addressed. (obsolete) — Splinter Review
Attachment #8863749 - Attachment is obsolete: true
Attachment #8867654 - Flags: review?(chutten)
Attachment #8867654 - Flags: review?(chutten) → review+
Keywords: checkin-needed
Needs rebasing.
Flags: needinfo?(vedant.sareen)
Keywords: checkin-needed
Attached patch Rebased patch. (obsolete) — Splinter Review
Also added the now required "record_in_processes" field to the newly added histograms.
Attachment #8867654 - Attachment is obsolete: true
Flags: needinfo?(vedant.sareen)
Attachment #8867908 - Flags: review?(chutten)
Comment on attachment 8867908 [details] [diff] [review]
Rebased patch.

Review of attachment 8867908 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +7111,5 @@
> +    "bug_numbers": [1033860],
> +    "description": "Time (ms) it takes for loading pending pings from disk"
> +  },
> +  "TELEMETRY_ARCHIVE_LOAD_MS": {
> +    "record_in_processes": ["main", "content"],

Actually, should just be ["main"]. TelemetryStorage should only load pings in the parent process.
Attachment #8867908 - Flags: review?(chutten) → review-
Attached patch Patch with requested changes. (obsolete) — Splinter Review
Attachment #8867908 - Attachment is obsolete: true
Attachment #8872319 - Flags: review?(chutten)
Attachment #8872319 - Attachment is obsolete: true
Attachment #8872319 - Flags: review?(chutten)
Attachment #8872320 - Flags: review?(chutten)
Attachment #8872320 - Flags: review?(chutten) → review+
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c45dd304a939
Add probes for time needed to load pending & archived pings from disk. r=chutten data-r=bsmedberg
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c45dd304a939
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.