Create method to disable a Telemetry probe

RESOLVED FIXED in mozilla29

Status

()

Toolkit
Telemetry
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: lmandel, Assigned: rvitillo)

Tracking

unspecified
mozilla29
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [Telemetry])

Attachments

(1 attachment, 9 obsolete attachments)

As a related discussion to that in bug 742496 about Telemetry data retention I think it's worthwhile to discuss the need for a probe retirement policy. Retiring probes may be useful:
1. to reduce required disk space
2. to reduce processing time for moving data into elastic search
3. to protect user data by only collecting what the dev team is actually using

(1 should be answered in bug 742496.)

Some questions:
- How can we monitor probe usage? (Instrument the dashboard? Track JSON data requests?) 
- Is there a regular period in which we should review the active probes and nominate probes for retirement?
- Related: If we are reviewing probes for retirement, should we also review probes before they are added to the product?
(In reply to Lawrence Mandel [:lmandel] from comment #0)
> - How can we monitor probe usage? (Instrument the dashboard? Track JSON data
> requests?) 

This seems reasonable, though I'd imagine many probes don't get looked at on a regular basis.

> - Is there a regular period in which we should review the active probes and
> nominate probes for retirement?

I propose that we review probes ~3 months after they go out in a release build (~6 months from the time they're added to the tree).  For a lot of the probes I've added, that's enough to determine whether they are providing value, and to address any problems the probes have highlighted.

> - Related: If we are reviewing probes for retirement, should we also review
> probes before they are added to the product?

I think this is already addressed by any module owner review before the probes are committed.

Comment 2

5 years ago
Daniel proposed tallying active/inactive probes on the serverside. It would be handy to have that to narrow down telemetry candidates for eviction

Comment 3

5 years ago
Marking: in group of > 33 asks for Telemetry that need PM priority before triage/scheduling.
Status: NEW → ASSIGNED
Whiteboard: [Telemetry] → Telemetry -- needs PM project priority

Comment 4

5 years ago
Triaged.
Target Milestone: Unreviewed → Backlogged - BZ
Whiteboard: Telemetry -- needs PM project priority → [Telemetry] -- needs PM project priority
Component: Data/Backend Reports → Telemetry
Product: Mozilla Metrics → Toolkit
Whiteboard: [Telemetry] -- needs PM project priority → [Telemetry]
Target Milestone: Backlogged - BZ → ---
Repurposing this bug as a request to add a required parameter to the Telemetry probe API that takes a release version after which the probe should be disabled. This parameter will require that every probe has an expiration version so that data is only collected for the period of time that it is required.
Summary: Telemetry probe retirement policy → Create method to disable a Telemetry probe
Some probes will always be useful and should never expire.  The MEMORY_ ones are examples.  Please allow that as an option!
Assignee: nobody → rvitillo
(Assignee)

Comment 7

4 years ago
Created attachment 832751 [details] [diff] [review]
Disable expired telemetry probes.
Attachment #832751 - Flags: review?(vdjeric)
Comment on attachment 832751 [details] [diff] [review]
Disable expired telemetry probes.

Review of attachment 832751 [details] [diff] [review]:
-----------------------------------------------------------------

Should the API require an explicit value to indicate that a probe should not expire vs the implicit no value assigned as currently shown in the patch?
(In reply to Nicholas Nethercote [:njn] from comment #6)
> Some probes will always be useful and should never expire.  The MEMORY_ ones
> are examples.  Please allow that as an option!

The GC_* ones, anything having to do with the cycle collector, and things having to do with I/O are others.
(In reply to Nathan Froyd (:froydnj) from comment #9)
> The GC_* ones, anything having to do with the cycle collector, and things
> having to do with I/O are others.

Could you define "things having to do with I/O"?
Created attachment 8333904 [details] [diff] [review]
Disable expired telemetry probes.
Attachment #832751 - Attachment is obsolete: true
Attachment #832751 - Flags: review?(vdjeric)
Attachment #8333904 - Flags: review?(vdjeric)
Created attachment 8333910 [details] [diff] [review]
742500.patch
Attachment #8333904 - Attachment is obsolete: true
Attachment #8333904 - Flags: review?(vdjeric)
Attachment #8333910 - Flags: review?(vdjeric)
(In reply to Lawrence Mandel [:lmandel] from comment #8)
> Should the API require an explicit value to indicate that a probe should not
> expire vs the implicit no value assigned as currently shown in the patch?

It makes more sense indeed to use an explicit value; I have updated the patch accordingly. It seems to me though that most of the API is used only for testing purposes. Do we have any external users for registerAddonHistogram et al.?
Comment on attachment 8333910 [details] [diff] [review]
742500.patch

Review of attachment 8333910 [details] [diff] [review]:
-----------------------------------------------------------------

- Setting the expiration date to Nightly 28 for current histograms will disable all histograms in current Nightlies. Let's make the existing histograms either expire in Nightly 30 (Feb 3rd) or make them all last forever in the initial patch. Either way, devs will need time to update the expiration dates.
- The string for "never expires" should be some human-readable string, e.g. "never". "0", "-1", and "" could be easily misinterpreted or confused

::: toolkit/components/telemetry/Histograms.json
@@ +1,4 @@
>  {
>    "A11Y_INSTANTIATED_FLAG": {
>      "kind": "flag",
> +    "description": "has accessibility support been instantiated",

Leave the "description" field as the last field in the histogram definition, it makes it a little bit easier to read

::: toolkit/components/telemetry/Telemetry.cpp
@@ +422,5 @@
> +  return &gHistogramStringTable[this->expiration_offset];
> +}
> +
> +bool
> +TelemetryHistogramIsExpired(Histogram *h)

Nitpick: rename to IsHistogramExpired

@@ +479,5 @@
> +    min = 1;
> +    max = 2;
> +    bucketCount = 3;
> +    histogramType = nsITelemetry::HISTOGRAM_LINEAR;
> +  }

Wouldn't we want to return the actual histogram or NS_ERROR_INVALID_ARG here instead of a different dummy histogram?

::: toolkit/components/telemetry/TelemetryPing.js
@@ +525,5 @@
>      let snapshots = Telemetry.histogramSnapshots;
>      for (let name in info) {
>        // Only duplicate histograms with actual data.
>        if (this.isInterestingStartupHistogram(name) && name in snapshots) {
> +        Telemetry.histogramFrom("STARTUP_" + name, "", name);

did something change on this line?

::: toolkit/components/telemetry/gen-histogram-data.py
@@ +51,5 @@
> +            e = explodeToCharArray(string)
> +            if e:
> +                f.write("  /* %5d */ %s, '\\0',\n"
> +                        % (offset, explodeToCharArray(string)))
> +            else:

When will this else-branch get hit?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +145,5 @@
>     * @param existing_name Existing histogram name
>     * The returned object has the same functions as a histogram returned from newHistogram.
>     */
>    [implicit_jscontext]
> +  jsval histogramFrom(in ACString name, in ACString expiration, in ACString existing_name);

Do we want really to change the expiration date when we clone a histogram?

@@ +187,1 @@
>                                in unsigned long histogram_type);

I don't think we need an expiration mechanism for addon histograms, since 1) they're always dynamically defined, and 2) I don't think any add-ons even use Telemetry today
We can discuss this outside bugzilla
Attachment #8333910 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #14)

> Review of attachment 8333910 [details] [diff] [review]:
> @@ +479,5 @@
> > +    min = 1;
> > +    max = 2;
> > +    bucketCount = 3;
> > +    histogramType = nsITelemetry::HISTOGRAM_LINEAR;
> > +  }
> 
> Wouldn't we want to return the actual histogram or NS_ERROR_INVALID_ARG here
> instead of a different dummy histogram?

It would be preferable to not waste memory for the expired histograms. Returning NS_ERROR_INVALID_ARG would be the cleanest option but then we might break some client code and we wanted to avoid that. On second thought though it looks like the only interface being used to access telemetry is the "Accumulate" method. Most calls to it, if not all of them, seem to not check at all the return code so we could avoid the dummy histogram completely.  

> ::: toolkit/components/telemetry/gen-histogram-data.py
> @@ +51,5 @@
> > +            e = explodeToCharArray(string)
> > +            if e:
> > +                f.write("  /* %5d */ %s, '\\0',\n"
> > +                        % (offset, explodeToCharArray(string)))
> > +            else:
> 
> When will this else-branch get hit?

When we have an empty string composed of only one character.
Created attachment 8337728 [details] [diff] [review]
742500.patch
Attachment #8333910 - Attachment is obsolete: true
Attachment #8337728 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #14)
> Comment on attachment 8333910 [details] [diff] [review]
> 742500.patch
> 
> Review of attachment 8333910 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> - Setting the expiration date to Nightly 28 for current histograms will
> disable all histograms in current Nightlies. Let's make the existing
> histograms either expire in Nightly 30 (Feb 3rd) or make them all last
> forever in the initial patch. Either way, devs will need time to update the
> expiration dates.

I think it's preferable to set an expiration date like Nightly 30 by default. Expiring probes by default means that people will need to take action to keep their probes alive and unused probes will expire by default. Any probes that are accidentally disabled should be quickly identified during the 30 cycle.
> Any probes that are accidentally disabled should be quickly identified
> during the 30 cycle.

I don't think that's true.  For example, I look occasionally at the memory stats in telemetry, but a month could easily pass between looks.  I'd hate to look again for the first time in a while and find that there's a big gap in the data.
(In reply to Nicholas Nethercote [:njn] from comment #18)
> > Any probes that are accidentally disabled should be quickly identified
> > during the 30 cycle.
> 
> I don't think that's true.  For example, I look occasionally at the memory
> stats in telemetry, but a month could easily pass between looks.  I'd hate
> to look again for the first time in a while and find that there's a big gap
> in the data.

That's a good point. I still think it's worth expiring by default as we're looking to remove unused probes. We can announce this change again during the engineering meeting, on dev-platform, on planet, and via e-mail to eng-team-all. Perhaps we can pull history and e-mail each person who has added a probe.
Comment on attachment 8337728 [details] [diff] [review]
742500.patch

Review of attachment 8337728 [details] [diff] [review]:
-----------------------------------------------------------------

- Lawrence: I think we should make existing histograms never expire, and just email the histogram authors (from the Histograms.json changelog) asking them to remove any unused histograms and set expiry dates on their histograms as appropriate. We can try this less accident-prone approach first
- Let's call the "expiration" field "expiresInVersion" instead? It might make it clearer what it means, e.g. "expires": "30" is a bit unclear
- Ditto for using "never" instead of "-1" to define histograms that never expire

(In reply to Roberto A. Vitillo (:rvitillo) from comment #15)
> Returning NS_ERROR_INVALID_ARG would be the cleanest option but then we
> might break some client code and we wanted to avoid that. On second thought
> though it looks like the only interface being used to access telemetry is
> the "Accumulate" method. Most calls to it, if not all of them, seem to not
> check at all the return code so we could avoid the dummy histogram
> completely.  

Returning error codes from IDL calls (e.g. getHistogramById) results in an exceptions being thrown in JS. The dummy histogram was the best choice. Sorry, I hadn't thought that through.

(In reply to Roberto A. Vitillo (:rvitillo) from comment #15)
> (In reply to Vladan Djeric (:vladan) from comment #14)
> > 
> > When will this else-branch get hit?
> 
> When we have an empty string composed of only one character.

Sorry, I'm still unclear, you mean when there's a histogram with a zero-length description or expiration field? Should we even allow that?

::: toolkit/components/telemetry/Histograms.json
@@ +2097,5 @@
> +    "high": "3000",
> +    "n_buckets": 10,
> +    "extended_statistics_ok": true,
> +    "description": "Amount of time a learn event waits in the queue (ms)"
> +  },

How did all these histogram definition lines change? Did you use tabs instead of spaces or did Bugzilla's diff tool get confused?

::: toolkit/components/telemetry/Telemetry.cpp
@@ +1125,5 @@
>    if (!success)
>      return NS_ERROR_INVALID_ARG;
>  
>    Histogram *clone;
> +  rv = HistogramGet(PromiseFlatCString(name).get(), "-1",

Wouldn't we want to carry over the existing histogram's expiry date?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +181,5 @@
>     */
>    void registerAddonHistogram(in ACString addon_id, in ACString name,
> +			      in uint32_t min, in uint32_t max,
> +			      in uint32_t bucket_count,
> +			      in unsigned long histogram_type);

did anything actually change on these 3 lines?

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +43,5 @@
>  var httpserver = new HttpServer();
>  var serverStarted = false;
>  var gFinished = false;
>  
> +function test_expired_histogram() {

There's no value in repeating the same tests in both test_nsITelemetry.js and test_TelemetryPing.js.
In test_TestTelemetryping.js file, make sure that Telemetry.histogramSnapshots and Telemetry.registeredHistograms don't return expired histograms

@@ +523,4 @@
>  
>    addWrappedObserver(nonexistentServerObserver, "telemetry-test-xhr-complete");
>    telemetry_ping();
> +  //spin the event loop

typo?

::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +10,5 @@
>  Cu.import("resource://gre/modules/Services.jsm");
>  
> +function test_expired_histogram() {
> +  var histogram_id = "FOOBAR";
> +  var addon_id = "test_addon";

addon_id is unused

@@ +12,5 @@
> +function test_expired_histogram() {
> +  var histogram_id = "FOOBAR";
> +  var addon_id = "test_addon";
> +
> +  expect_fail(function () Telemetry.newHistogram(histogram_id, "0", 1, 2, 3, Telemetry.HISTOGRAM_EXPONENTIAL));

- there's a do_check_throws() function
- Use curlies around the anonymous function's body
Attachment #8337728 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #20)
> - Lawrence: I think we should make existing histograms never expire, and
> just email the histogram authors (from the Histograms.json changelog) asking
> them to remove any unused histograms and set expiry dates on their
> histograms as appropriate. We can try this less accident-prone approach first

I still think that requiring action of probe authors who no longer use their probes will result in probes being left enabled that are not in use. However, I don't think this is yet a big enough issue to cause disruption. So, I'm OK with starting with your approach. Perhaps we can track the probes that are not viewed on telemetry.m.o for a period of 3+ months (or some other interval) and propose them for removal at a later date.
(In reply to Vladan Djeric (:vladan) from comment #20) 

> Sorry, I'm still unclear, you mean when there's a histogram with a
> zero-length description or expiration field? Should we even allow that?

We might want to allow it for future fields; the generator should emit valid C code for any string.

> ::: toolkit/components/telemetry/Histograms.json
> @@ +2097,5 @@
> > +    "high": "3000",
> > +    "n_buckets": 10,
> > +    "extended_statistics_ok": true,
> > +    "description": "Amount of time a learn event waits in the queue (ms)"
> > +  },
> 
> How did all these histogram definition lines change? Did you use tabs
> instead of spaces or did Bugzilla's diff tool get confused?

Some histogram definitions were hacked in by hand with an inconsistent spacing.
Created attachment 8343765 [details] [diff] [review]
742500.patch
Attachment #8337728 - Attachment is obsolete: true
Created attachment 8343766 [details] [diff] [review]
742500.patch
Attachment #8343765 - Attachment is obsolete: true
Attachment #8343766 - Flags: review?(vdjeric)
Comment on attachment 8343766 [details] [diff] [review]
742500.patch

Review of attachment 8343766 [details] [diff] [review]:
-----------------------------------------------------------------

Btw, you'll have to merge the changes from bug 932285 after it lands

::: toolkit/components/telemetry/Telemetry.cpp
@@ -412,3 @@
>  bool
> -TelemetryHistogramType(Histogram *h, uint32_t *result)
> -{

Did you accidentally delete this function?

@@ +424,5 @@
>  bool
> +IsExpired(const char *expiration){
> +  static Version current_version = Version(MOZ_APP_VERSION);
> +  MOZ_ASSERT(expiration);
> +  return strcmp(expiration, "never") && (mozilla::Version(expiration) <= current_version);

So I think there's a subtlety here that we're missing.

Services.vc.compare("29", "29.0a1")
1
Services.vc.compare("29", "29.0")
-1
Services.vc.compare("29", "29.0.1")
-1

This implies that if a histogram had an expiration version of "29", we would collect Telemetry for it in Nightly 29 (i.e. 29.0a1), and on the release channel (29.0), but if the release channel were to receive a minor update (29.0.1), we would no longer be reporting the histogram to Telemetry.

We could add a method to the version comparator that returns the parsed values, so we can do our own comparisons.

@@ +1302,2 @@
>      }
>    };

So will GetHistogramSnapshots return expired histograms?

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ -502,4 @@
>  
>    addWrappedObserver(nonexistentServerObserver, "telemetry-test-xhr-complete");
>    telemetry_ping();
> -  // spin the event loop

Deleted accidentally?
Attachment #8343766 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #25)
> Comment on attachment 8343766 [details] [diff] [review]
> 742500.patch
> 
> Review of attachment 8343766 [details] [diff] [review]:
> -----------------------------------------------------------------
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ -412,3 @@
> >  bool
> > -TelemetryHistogramType(Histogram *h, uint32_t *result)
> > -{
> 
> Did you accidentally delete this function?

No, the only caller of that function was TelemetryImpl::HistogramFrom. I changed that function slightly to carry over the expiration version from the cloned histogram and that particular call wasn't needed anymore.

> @@ +424,5 @@
> >  bool
> > +IsExpired(const char *expiration){
> > +  static Version current_version = Version(MOZ_APP_VERSION);
> > +  MOZ_ASSERT(expiration);
> > +  return strcmp(expiration, "never") && (mozilla::Version(expiration) <= current_version);
> 
> So I think there's a subtlety here that we're missing.
> 
> Services.vc.compare("29", "29.0a1")
> 1
> Services.vc.compare("29", "29.0")
> -1

Services.vc.compare("29", "29.0") should return 0

> Services.vc.compare("29", "29.0.1")
> -1
> 
> This implies that if a histogram had an expiration version of "29", we would
> collect Telemetry for it in Nightly 29 (i.e. 29.0a1), and on the release
> channel (29.0), 

Since the comparison of “29” with “29.0” is < 1, we wouldn’t collect Telemetry for the release channel.

> but if the release channel were to receive a minor update
> (29.0.1), we would no longer be reporting the histogram to Telemetry.
> 
> We could add a method to the version comparator that returns the parsed
> values, so we can do our own comparisons.

If we want to expire histograms from the nightly onwards, couldn’t we just explicitly specify that in Histograms.json, i.e. “expires_in_version”: “29.0a1”? That would allow us to keep a consistent versioning scheme with the rest of the system.

> 
> @@ +1302,2 @@
> >      }
> >    };
> 
> So will GetHistogramSnapshots return expired histograms?

No, it will not as you can see in the “test_expired_histogram” unit test. I should remove the dummy histogram though.
> Services.vc.compare("29", "29.0") should return 0

You're right, I typo'd

> If we want to expire histograms from the nightly onwards, couldn’t we just
> explicitly specify that in Histograms.json, i.e. “expires_in_version”:
> “29.0a1”? That would allow us to keep a consistent versioning scheme with
> the rest of the system.

That would work
Btw, modify the generator script to disallow expiry strings in the form of "XY" (e.g. "29", "30") since they would cause unexpected behavior.

"29.0" and "30.0" might be ok since we might want probes to exist only on development channels
Created attachment 8348819 [details] [diff] [review]
742500.patch
Attachment #8343766 - Attachment is obsolete: true
Attachment #8348819 - Flags: review?(vdjeric)
Created attachment 8348822 [details] [diff] [review]
742500.patch
Attachment #8348819 - Attachment is obsolete: true
Attachment #8348819 - Flags: review?(vdjeric)
Attachment #8348822 - Flags: review?(vdjeric)
(Assignee)

Updated

4 years ago
Blocks: 913070
Comment on attachment 8348822 [details] [diff] [review]
742500.patch

Review of attachment 8348822 [details] [diff] [review]:
-----------------------------------------------------------------

Document this new functionality in https://developer.mozilla.org/en-US/docs/Performance/Adding_a_new_Telemetry_probe

::: toolkit/components/telemetry/Histograms.json
@@ +5484,5 @@
>      "n_values": 8,
>      "description": "Status of OCSP stapling on this handshake (1=present, good; 2=none; 3=present, expired; 4=present, other error)"
> +  },
> +  "TELEMETRY_TEST_EXPIRED": {
> +    "expires_in_version": "1.0",

let's use 4.0a1 here as good practice.. it should be pretty rare that someone would want to actually use an "N.0" expiration date

::: toolkit/components/telemetry/histogram_tools.py
@@ +177,5 @@
> +        if not expiration:
> +            return
> +
> +        if not re.match(r'[1-9][0-9]*\..*|never', expiration):
> +            raise BaseException, '%s not permitted as an expiration version for %s; the complete version name is required' % (expiration, name)

Add a link to the "Adding a new Telemetry Probe" wiki page
Attachment #8348822 - Flags: review?(vdjeric) → review+
Created attachment 8355509 [details] [diff] [review]
742500.patch

try: https://tbpl.mozilla.org/?tree=Try&rev=97d294225376
Attachment #8348822 - Attachment is obsolete: true
Attachment #8355509 - Flags: review+
(Assignee)

Updated

4 years ago
Keywords: checkin-needed
https://hg.mozilla.org/integration/fx-team/rev/70fd824abff0
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: [Telemetry] → [Telemetry][fixed-in-fx-team]
Backed out for Linux64 ASAN and debug xpcshell orange.
https://hg.mozilla.org/integration/fx-team/rev/78dd63564719

https://tbpl.mozilla.org/php/getParsedLog.php?id=32517001&tree=Fx-Team
https://tbpl.mozilla.org/php/getParsedLog.php?id=32518386&tree=Fx-Team
And OSX 10.8 opt
Created attachment 8355573 [details] [diff] [review]
742500.patch

I generated a new UUID for nsITelemetry which might fix the issue.
Attachment #8355509 - Attachment is obsolete: true
Attachment #8355573 - Flags: review+
https://hg.mozilla.org/integration/fx-team/rev/a9dbfb86d99f
https://hg.mozilla.org/mozilla-central/rev/a9dbfb86d99f
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Whiteboard: [Telemetry][fixed-in-fx-team] → [Telemetry]
Target Milestone: --- → mozilla29
You need to log in before you can comment on or make changes to this bug.