Closed Bug 1438896 Opened 7 years ago Closed 6 years ago

Consider adding a probe to count Telemetry failures by ping type

Categories

(Toolkit :: Telemetry, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox60 --- wontfix
firefox68 --- fixed

People

(Reporter: mreid, Assigned: clement.allain, Mentored)

Details

(Whiteboard: [good second bug][lang=js])

Attachments

(4 files, 7 obsolete files)

We have the TELEMETRY_SEND_FAILURE_TYPE probe to count failures by failure type, it would potentially be useful for monitoring to have a similar probe to count failures by ping type. A keyed scalar seems like a good option here.
Would we want to key TELEMETRY_SEND_FAILURE_TYPE by ping type? That way we could see things like "sync" being mostly eTooLate and "main" being even between eChannelOpen and eUnreachable... A keyed count wouldn't give us that detail.
To help Mozilla out with this bug, here's the steps: 0) Comment here on the bug that you want to volunteer to help. I (or someone else) will assign it to you. 1) Download and build the Firefox source code: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions/Simple_Firefox_build - If you have any problems, please ask on IRC (https://wiki.mozilla.org/Irc) in the #introduction channel. They're there to help you get started. - You can also read the Developer Guide, which has answers to most development questions: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Introduction 2) Start working on this bug. You'll be working in the TelemetrySend.jsm and Histograms.json files, replacing TELEMETRY_SEND_FAILURE_TYPE with a new per-probe "keyed" histogram. This documentation will be very helpful in understanding what is involved here: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/start/adding-a-new-probe.html - If you have any problems with this bug, please comment on this bug and set the needinfo flag for me. Also, you can find me and my teammates on the #telemetry channel on IRC (https://wiki.mozilla.org/Irc) most hours of most days. 3) Build your change with `mach build` and test your change with `mach test toolkit/components/telemetry/tests/`. Also check your changes for adherence to our style guidelines by using `mach lint` 4) Submit the patch (including the change to the test_TelemetrySend.js automated tests) for review. Mark me as a reviewer so I'll get an email to come look at your code. - Here's the guide: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch 5) After a series of reviews and changes to your patch, I'll mark it for checkin or push it to autoland. Your code will soon be shipping to Firefox users worldwide! 6) ...now you get to think about what kind of bug you'd like to work on next. Let me know what you're interested in and I can help you find your next contribution.
Mentor: chutten
Priority: -- → P3
Whiteboard: [good second bug][lang=js]
(( Oh, and a note for :mreid, making this a keyed histogram means that cerberus won't tell us if there's a failure, since it doesn't support keyed histograms ))
Can i work on this bug?
Hi Chris, I lately came across this bug. From the above discussion i understand that the 'TELEMETRY_SEND_FAILURE_TYPE' probe must be replaced by a probe that counts failure by ping type. Accordingly i have submitted a patch.This is not the final patch . Please assign this bug to me and help me to come to the correct workaround for this. thanks:)
Comment on attachment 8964280 [details] bug 1438896 : Test patch https://reviewboard.mozilla.org/r/233004/#review239374 This is not quite what I was expecting. We were thinking of having one TELEMETRY_SEND_FAILURE_TYPE histogram for each ping type so that if, for instance, "sync" pings had eTimeout and "main" pings had a lot of eTerminated, we could look at them separately. The current histogram lumps them all together. The way to do this would be by introducing a new version of TELEMETRY_SEND_FAILURE_TYPE which is keyed on the ping type. Keyed histograms are documented over here: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/histograms.html#keyed-histograms Does this make sense?
Attachment #8964280 - Flags: review-
Assignee: nobody → akriti.v10
Status: NEW → ASSIGNED
Attachment #8964280 - Attachment is obsolete: true
Comment on attachment 8966579 [details] Bug 1438896 : Probe added to count Telemetry failures by ping type , https://reviewboard.mozilla.org/r/235310/#review241080 We're making good progress here, thank you! Let's leave the old probe in place so we can maintain history and add the second, new, keyed one after it. ::: toolkit/components/telemetry/Histograms.json:7097 (Diff revision 1) > "kind": "exponential", > "high": 120000, > "n_buckets": 20, > "description": "Time needed (in ms) for a failed send of a Telemetry ping to the servers and getting a reply back." > }, > "TELEMETRY_SEND_FAILURE_TYPE" : { We can't change histograms in-place. Server-side processes may depend on them being immutable. We'll need a new histogram with a new name. ::: toolkit/components/telemetry/Histograms.json:7115 (Diff revision 1) > + "health", > + "heartbeat", > + "modules", > + "sync", > + "update" > + ], Parts of Telemetry may send pings unbeknownst to use using the submitExternalPing method. Since we don't know the full list of acceptable ping types, it would probably be best to skip the "keys" portion and allow this keyed histogram to collect based on any key.
Attachment #8966579 - Flags: review?(chutten)
Attachment #8966579 - Attachment is obsolete: true
Comment on attachment 8967035 [details] Bug 1438896 : test patch v2 , https://reviewboard.mozilla.org/r/235710/#review241578 Looking good so far! Now I'm looking forward to seeing you accumulate to this new probe in TelemetrySend.jsm. ::: toolkit/components/telemetry/Histograms.json:7116 (Diff revision 1) > "eTooLate", > "eTerminated" > ], > "description": "Counts of the different ways sending a Telemetry ping can fail." > }, > + "TELEMETRY_COUNT_FAILURE_TYPE" : { The failures we're counting are send failures, so I'd like to keep "SEND_FAILURE" in the name. Perhaps TELEMETRY_SEND_FAILURE_TYPE_PER_PING ? ::: toolkit/components/telemetry/Histograms.json:7134 (Diff revision 1) > + "abort", > + "timeout", > + "eTooLate", > + "eTerminated" > + ], > + "description": "Counts the number of occurrences of a particular failure type." The description should be extended to mention what the keys are.
Attachment #8967035 - Flags: review?(chutten)
Attachment #8967035 - Attachment is obsolete: true
Hi Chris, Please review this new patch. I have made the changes we discussed. I tried the following commands:- ./mach build ./mach lint They gave no errors , but the 'mach test toolkit/components/telemetry/tests/unit/test_TelemetrySend.js' command does not work properly.It doesn't give errors, but hangs in between and i have to kill the process. I think its mainly because of this statement 'Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").add(failure, ping.type);' in TelemetrySend.jsm. What according to you can be a problem here?
Comment on attachment 8967767 [details] Bug 1438896 : New test Patch , https://reviewboard.mozilla.org/r/236470/#review242248 Good work so far! Once you get a feel for how TELEMETRY_SEND_FAILURE_TYPE_PER_PING works, we can add a test like we have for TELEMETRY_SEND_FAILURE_TYPE over here: https://searchfox.org/mozilla-central/rev/9f3da81290054c5b8955bb67ff98cae66676f745/toolkit/components/telemetry/tests/unit/test_TelemetrySend.js#457 ::: toolkit/components/telemetry/TelemetrySend.jsm:1111 (Diff revision 1) > > if (this._tooLateToSend) { > // Too late to send now. Reject so we pend the ping to send it next time. > this._log.trace("_doPing - Too late to send ping " + ping.id); > Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE").add("eTooLate"); > + Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").add("eTooLate", ping.type); This will need to be getKeyedHistogramById since it is a keyed histogram, and the .add takes the key (ping.type) first, label second. You can test how this works by opening a tab to about:telemetry#keyed-histograms-tab and trying it out for yourself in the devtools console.
Attachment #8967767 - Flags: review?(chutten)
To directly address your test question, likely something was waiting for the TelemetrySend call to complete successfully and didn't handle the case when things don't complete successfully (because it's a test). When you fix the call, it should run to completion.
Hey Chris, i am stuck at this error for a long time now ...can you please help me out . I have made the above changes but the 'mach test toolkit/components/telemetry/tests/unit/test_TelemetrySend.js' command still hangs. I feel its a small issue which i can't figure out. Thanks.
Attached image histogram.png
This image shows the result of running the './mach test test toolkit/components/telemetry/tests/unit/test_TelemetrySend.js' command. This is the point where the test gets stuck and results in a 'TIMEOUT'.
Flags: needinfo?(chutten)
The problem might be the line at the very top of the screenshot you attached. It says that getKeyedHistogramById returned a failure. This failure may not have been handled properly by the test framework and so it hung, waiting. Did you change code in or around test_TelemetrySend.js line 1167?
Flags: needinfo?(chutten)
Attachment #8967767 - Attachment is obsolete: true
This patch contains the "test" for "TELEMETRY_SEND_FAILURE_TYPE_PER_PING" added to 'test_TelemetrySend.js'. The following commands executed successfully with no errors. ./mach lint ./mach test test toolkit/components/telemetry/tests/unit/test_TelemetrySend.js ./mach build I have added two try-catch blocks in TelemetrySend.jsm and test_TelemetrySend.js to check for exceptions other than 'NS_ERROR_FAILURE'. I did this because i was facing issues with the execution of './mach test test toolkit/components/telemetry/tests/unit/test_TelemetrySend.js' command where the process used to hang (As we had discussed earlier). Please let me know if this needs more changes. Thanks.
Flags: needinfo?(chutten)
Comment on attachment 8968905 [details] Bug 1438896 : Add a probe to count Telemetry failures by ping type. , https://reviewboard.mozilla.org/r/237632/#review243402 I'm not sure I understand why these new try/catch blocks are necessary. Telemetry.get{Keyed}HistogramById("PROBE_NAME") should only ever fail if PROBE_NAME doesn't exist (or if you're using the Keyed function on a non-keyed histogram or vice versa). And in this case TELEMETRY_SEND_FAILURE_TYPE_PER_PING is definitely registered, and definitely has keyed set to true. ...ohh, you know what. Are you using an artefact build to build these changes? Changes to Histograms.json require a full build (at least until bug 1206117 is tackled). If you haven't performed a full build since adding the definition to Histograms.json, the C++ code that defines TELEMETRY_SEND_FAILURE_TYPE_PER_PING hasn't been generated or built or run. I'm sorry, I should have caught this sooner. Is my guess consistent with your setup? ::: toolkit/components/telemetry/TelemetrySend.jsm:1169 (Diff revision 1) > } > > TelemetryHealthPing.recordSendFailure(failure); > Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE").add(failure); > + try { > + Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").add(ping.type, failure); This should never fail, so the try-catch block should never be needed. In what cases were you seeing this fail?
Attachment #8968905 - Flags: review?(chutten) → review-
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #21) > Comment on attachment 8968905 [details] > Bug 1438896 : Add a probe to count Telemetry failures by ping type. , > > https://reviewboard.mozilla.org/r/237632/#review243402 > > I'm not sure I understand why these new try/catch blocks are necessary. > Telemetry.get{Keyed}HistogramById("PROBE_NAME") should only ever fail if > PROBE_NAME doesn't exist (or if you're using the Keyed function on a > non-keyed histogram or vice versa). > > And in this case TELEMETRY_SEND_FAILURE_TYPE_PER_PING is definitely > registered, and definitely has keyed set to true. > > ...ohh, you know what. Are you using an artefact build to build these > changes? Changes to Histograms.json require a full build (at least until bug > 1206117 is tackled). If you haven't performed a full build since adding the > definition to Histograms.json, the C++ code that defines > TELEMETRY_SEND_FAILURE_TYPE_PER_PING hasn't been generated or built or run. > > I'm sorry, I should have caught this sooner. Is my guess consistent with > your setup? > > ::: toolkit/components/telemetry/TelemetrySend.jsm:1169 > (Diff revision 1) > > } > > > > TelemetryHealthPing.recordSendFailure(failure); > > Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE").add(failure); > > + try { > > + Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").add(ping.type, failure); > > This should never fail, so the try-catch block should never be needed. > > In what cases were you seeing this fail? Yes Chris, i always do a full build, but what about that artefact build because during installation i did a 'Desktop Build for firefox'. Should i have chosen the artefact mode?? This 'Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").add(ping.type, failure);' doesn't fail but it seems, due to this statement the ./mach test... command will 'TIMEOUT' and other changes don't cause this. So that is why i came to the conclusion that the problem lies in this statement only.
No, you should not have chosen artefact mode. Full builds are necessary for these changes. You did the right thing. Telemetry.getKeyedHistogramById -is- failing, though. The log line is in your screenshot. The reason the test stalls instead of failing immediately is due to how the test is set up. Let's try this: run your build (./mach run) and go to about:telemetry. Open the developer tools and get to a JS console. Run Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING") and see if it works. If you get the NS_ERROR_FAILURE there, then your currently-build Firefox doesn't know about your new probe. You can then try Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING") to see if it, for some reason, thinks it isn't keyed. If it succeeds, then that means the "keyed": true in Histograms.json hasn't been picked up. If that errors out too, it has no idea about your probe. This means the build doesn't have your Histograms.json changes in it, and something has gone wrong. Check your mozconfig to se if --enable-artefact-builds snuck in there somehow. Run `./mach build` even if you before ran `./mach build toolkit/components/telemetry`. If your probe was built properly, you should be able to find it in the build-generated file <your obj dir>/toolkit/components/telemetry/TelemetryHistogramEnums.h (<your obj dir> will be something like /path/to/mozilla-central/obj-x86_64-pc-mingw32 )
Attachment #8968905 - Attachment is obsolete: true
I have uploaded a new patch. All the tests run successfully. I tried executing the 'Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING")' command in the JS console and got no error. It seems the currently-build firefox recognizes the new keyed probe. I am not facing the earlier issues any more. thanks.
Comment on attachment 8969365 [details] Bug 1438896 : Added a probe to count Telemetry failures by ping type. , =chutten https://reviewboard.mozilla.org/r/238110/#review243846 Where did the automated test go? We really should have at least one. Also, this patch will need to undergo Data Collection review in order to be merged (all new data collections need to be reviewed by a Data Steward). If it's okay with you, I'll arrange for that to happen while we get the testing back into place.
Attachment #8969365 - Attachment is obsolete: true
Hey Chris, what do you have to say about this test for "TELEMETRY_SEND_FAILURE_TYPE_PER_PING". Is it logical??. The good thing about this test is that it gives no errors!
Comment on attachment 8969609 [details] Bug 1438896 : test for the new probe. , https://reviewboard.mozilla.org/r/238386/#review244256 ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js:461 (Diff revision 1) > Assert.equal(pendingPings.length, 1, "Should have a pending ping in storage"); > Assert.equal(pendingPings[0].id, id, "Should have pended our test's ping"); > > Assert.equal(Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE").snapshot().counts[7], 1, > "Should have registered the failed attempt to send"); > + Assert.equal(histogramValueCount(Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").snapshot(pendingPings[0].id)), 0); It shouldn't be 0, though, as there was a failed attempt to send (specifically an eTooLate failure). That being said, it isn't the ping's id that we're keying off of, it's the ping's type. So try TEST_TYPE as the key and ensure that the count in counts[7] is 1
Attachment #8969609 - Flags: review?(chutten) → review-
(In reply to Chris H-C :chutten from comment #29) > Comment on attachment 8969609 [details] > Bug 1438896 : test for the new probe. , > > https://reviewboard.mozilla.org/r/238386/#review244256 > > ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js:461 > (Diff revision 1) > > Assert.equal(pendingPings.length, 1, "Should have a pending ping in storage"); > > Assert.equal(pendingPings[0].id, id, "Should have pended our test's ping"); > > > > Assert.equal(Telemetry.getHistogramById("TELEMETRY_SEND_FAILURE_TYPE").snapshot().counts[7], 1, > > "Should have registered the failed attempt to send"); > > + Assert.equal(histogramValueCount(Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").snapshot(pendingPings[0].id)), 0); > > It shouldn't be 0, though, as there was a failed attempt to send > (specifically an eTooLate failure). > > That being said, it isn't the ping's id that we're keying off of, it's the > ping's type. So try TEST_TYPE as the key and ensure that the count in > counts[7] is I have done this earlier- 'Assert.equal(Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").snapshot(pendingPings[0].type).counts[7], 1,should have registered the failed attempt to send");' but it gives a 'type error' and says its undefined. The moment i change to- 'Assert.equal(histogramValueCount(Telemetry.getKeyedHistogramById("TELEMETRY_SEND_FAILURE_TYPE_PER_PING").snapshot(pendingPings[0].id)), 0' The test runs smoothly without errors. I know its not what we need, but there are problems while executing the test.
Which part of it is undefined? getKeyedHistogramById should return a histogram. snapshot with a key that is present in the keyedhistogram should return a snapshot... Maybe the key isn't present? Maybe the test is identifying an actual problem with the code?
Did you manage to figure out what the problem was here?
Flags: needinfo?(akriti.v10)
Unassigning due to inactivity. I can reassign this later if you'd like to come back to it. Until then it is available.
Assignee: akriti.v10 → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(akriti.v10)

Hello, would you mind assigning it to me ?

Few questions :

  • Is there a file describing all ping types or should I use the documentation here ? If I got it right, its needed for categorical histograms in the labels field
  • Should I write a new test case or edit one ?

Thanks

Flags: needinfo?(chutten)

Sure, it's yours!

There is no specific registry describing all pings types, and so they might change in name and number at any time. However, the ping type is provided to TelemetrySend as a parameter, so we can use that as a "key" for a keyed histogram. For an example, the patch from Comment #26 was almost complete (though the automated test was missing).

Speaking of testing, whichever organization of testing makes most sense to you. If an existing test case catches almost everything and it seems reasonable to add your code there, then do so and we can chat about it in the code review if I have any concerns. If you're not sure it can cover everything, build out its own test case.

Does that help?

Assignee: nobody → clement.allain
Status: NEW → ASSIGNED
Flags: needinfo?(chutten)

Add a new categorical keyed histogram to count failures type by ping type

So I have edited the test_tooLateToSend, but to me it seems like a test that goes into the errorHandler function is missing (line 1146 of TelemetrySend.jsm), what do you think ?

This is correct. We haven't (yet) structured the code to make that part testable (we'd need to mock out ServiceRequest so we could call errorHandler directly with the error codes we want), so we won't be able to cover that part of your patch with tests.

Attachment #9054010 - Flags: data-review?(teon)
Comment on attachment 9054010 [details] data collection review request DATA COLLECTION REVIEW RESPONSE: Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate? Yes. This probe consists of histograms per ping type and its documentation will be listed in in the definitions file [Histograms.json](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Histograms.json) and in the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/). Is there a control mechanism that allows the user to turn the data collection on and off? Yes. This collection is through Telemetry so it can be controlled through Firefox's Preferences. If the request is for permanent data collection, is there someone who will monitor the data over time? Yes, this collection will be permanent, monitored by :chutten Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 1, Technical. Is the data collection request for default-on or default-off? Default on for all prelease channels. Does the instrumentation include the addition of any new identifiers? No. Is the data collection covered by the existing Firefox privacy notice? Yes. Does there need to be a check-in in the future to determine whether to renew the data? No, collection will be permanent for pre-release channels. --- Result: datareview+
Attachment #9054010 - Flags: data-review?(teon) → data-review+
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0f5833c5f5f7 Add a probe to count Telemetry failures by ping type. r=chutten
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Add a new categorical keyed histogram to count failures type by ping type

Bug 1437446 : Make probe process choice more visible in about:telemetry

Comment on attachment 9061731 [details]
Bug 1438896 - Add a probe to count Telemetry failures by ping type.

Revision D29401 was moved to bug 1437446. Setting attachment 9061731 [details] to obsolete.

Attachment #9061731 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: