Closed Bug 1323628 Opened 5 years ago Closed 5 years ago

test_TelemetrySend.js began failing around 5pm Pacific on 2016-12-14 on all trunk trees

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
2

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox51 --- unaffected
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: philor, Assigned: Dexter)

References

Details

(Whiteboard: [measurement:client])

Attachments

(2 files)

Attached file log fragment
I don't see a timebomb set for that time anywhere in https://hg.mozilla.org/mozilla-central/rev/edcd2fb87124 but there must be one, since it began failing on mozilla-inbound, mozilla-central, and autoland, but not on mozilla-aurora, and retriggering on a week-old mozilla-central push it now fails 100%.

https://treeherder.mozilla.org/logviewer.html#?job_id=7917409&repo=autoland
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/7652a58efa46
disable test_TelemetrySend.js for sudden apparently time-based failure across all trunk trees, a=surprise
Assignee: nobody → alessio.placitelli
Status: NEW → ASSIGNED
Points: --- → 2
Priority: -- → P1
Attached patch bug1323628.patchSplinter Review
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch

Review of attachment 8818918 [details] [diff] [review]:
-----------------------------------------------------------------

Needs a commit message explaining what was happening, and if this invalidates TELEMETRY_STRINGIFY data before/after this change.
(In reply to Chris H-C :chutten from comment #3)
> Comment on attachment 8818918 [details] [diff] [review]
> bug1323628.patch
> 
> Review of attachment 8818918 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Needs a commit message explaining what was happening, and if this
> invalidates TELEMETRY_STRINGIFY data before/after this change.

Well, in fact I have yet to mark people for review and was in the process of writing a lengthy comment about that.
:S Sorry about that. The bug was too interesting to not investigate
This was real fun to debug :-)

All the errors seemed to point at the TELEMETRY_SEND_FAILURE histogram which, for some reason, was not storing the correct send failure times and was failing at [1]. However, dumping the histogram before the recording and soon after near [2], revealed that something was being added at the 0 bucket, which was unexpected. We should not see stuff in that bucket unless we're seeing values outside of our low range.

> {"min":1,"max":120000,"histogram_type":0,"sum":0,"ranges":[0,1,2,4,8,15,29,55,104,197,374,710,1348,2560,4861,9230,17527,33281,63196,120000],"counts":[8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}

Following up the investigation, I dumped |startTime| and |monotonicNow()| at [2]:

>  0:10.34 PROCESS_OUTPUT: Thread-3 (pid:12473) "**** DEBUG Times - now: 5047.530732 start Thu Dec 15 2016 15:59:52 GMT+0100 (CET) elapsed -1481813987762.4692"

Bingo. While |monotonicNow()| reports the correct value, |startTime| seems to be a Date object. After checking that |monotonicNow()| wasn't failing, I noticed that we're using again |startTime| a few lines below, at [3]. Apparently, |_onPingRequestFinished| was using the new Date values instead of the one from monotonicNow(), producing a negative elapsed time that ended up in the 0 bucket.

This seems to be fallout from bug 1318284, but I honestly have no idea about why it's showing up only now.

I think both TELEMETRY_SEND_FAILURE and TELEMETRY_SEND_SUCCESS might be affected and their histograms [4] seem to only have values outside of the desired range. The telemetry histograms which are changed further down and rely on startDate seem to behave correctly [5]. This patch won't affect them.

[1] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
[2] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/TelemetrySend.jsm#843
[3] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/TelemetrySend.jsm#922,987,1006
[4] - https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-12-13&keys=__none__!__none__!__none__&max_channel_version=nightly%252F53&measure=TELEMETRY_SEND_SUCCESS&min_channel_version=nightly%252F50&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-11-21&table=1&trim=1&use_submission_date=0
[5] - https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-12-13&keys=__none__!__none__!__none__&max_channel_version=nightly%252F53&measure=TELEMETRY_STRINGIFY&min_channel_version=nightly%252F50&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-11-21&table=1&trim=1&use_submission_date=0
Blocks: 1318284
Attachment #8818918 - Flags: review?(chutten)
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch

Review of attachment 8818918 [details] [diff] [review]:
-----------------------------------------------------------------

I'd have preferred some comment in the commit msg, but a bugzilla comment probably does the trick.

Maybe an underflow that only happens when now() - monotonic now() is greater than a certain value? Anyhoo, r+
Attachment #8818918 - Flags: review?(chutten) → review+
(In reply to Chris H-C :chutten from comment #7)
> Maybe an underflow that only happens when now() - monotonic now() is greater
> than a certain value? Anyhoo, r+

Thanks for the r+.

I think that's even simpler than the underflow: we were computing the elapsed time between the number returned from |monotonicNow()| (which is the milliseconds since the process started) and |new Date()| (which is milliseconds since Unix epoch). This will always be negative.

Unfortunately I have no clue about this didn't blow up before... and this is driving me crazy :(
Keywords: leave-open
Whiteboard: [test disabled]
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e35ba42b42c
Fix test_TelemetrySend.js failing and enable it again. r=chutten
Whiteboard: [measurement:client]
https://hg.mozilla.org/mozilla-central/rev/2e35ba42b42c
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Whiteboard: [measurement:client] → [measurement:client] [measurement:client:uplift]
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch

Approval Request Comment
[Feature/Bug causing the regression]: Event Telemetry.
[User impact if declined]: Data of Event Telemetry behavior reaches us later, delaying our decision making for the project.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes, bug 1302671.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]:
The individual parts are:
* bug 1302663 - basic Telemetry implementation
* bug 1316810 - adjust event limits
* bug 1318284 - improve metrics that track effects on ping sending
* bug 1323628 - fix
* bug 1316281 - record search event in Telemetry
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It's validated on Nightly and the changes are selective and relatively well understood.
[String changes made/needed]: No.

This is part of an uplift need for the initial Event Telemetry [1][2] implementation to Firefox 52.
We want to move this to Beta a little faster to see somewhat realistic data of an example event (search) coming in and verify that it matches our expectations.

I validated that this behaves as expected on Nightly over the last two weeks in bug 1302671.

1: https://docs.google.com/document/d/1hNuS9lUJMvMqgntZXbFA6xZBU9zBpQgo7x73-sXKRpI/
2: https://wiki.mozilla.org/Event_Telemetry
Attachment #8818918 - Flags: approval-mozilla-aurora?
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch

add event telemetry for aurora52
Attachment #8818918 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Whiteboard: [measurement:client] [measurement:client:uplift] → [measurement:client]
setting flags for the uplift in comment #14
You need to log in before you can comment on or make changes to this bug.