Closed
Bug 1323628
Opened 8 years ago
Closed 8 years ago
test_TelemetrySend.js began failing around 5pm Pacific on 2016-12-14 on all trunk trees
Categories
(Toolkit :: Telemetry, defect, P1)
Toolkit
Telemetry
Tracking
()
RESOLVED
FIXED
mozilla53
Tracking | Status | |
---|---|---|
firefox51 | --- | unaffected |
firefox52 | --- | fixed |
firefox53 | --- | fixed |
People
(Reporter: philor, Assigned: Dexter)
References
Details
(Whiteboard: [measurement:client])
Attachments
(2 files)
151.79 KB,
text/plain
|
Details | |
3.36 KB,
patch
|
chutten
:
review+
jcristau
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
I don't see a timebomb set for that time anywhere in https://hg.mozilla.org/mozilla-central/rev/edcd2fb87124 but there must be one, since it began failing on mozilla-inbound, mozilla-central, and autoland, but not on mozilla-aurora, and retriggering on a week-old mozilla-central push it now fails 100%.
https://treeherder.mozilla.org/logviewer.html#?job_id=7917409&repo=autoland
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/7652a58efa46
disable test_TelemetrySend.js for sudden apparently time-based failure across all trunk trees, a=surprise
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → alessio.placitelli
Status: NEW → ASSIGNED
Points: --- → 2
Priority: -- → P1
Assignee | ||
Comment 2•8 years ago
|
||
Comment 3•8 years ago
|
||
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch
Review of attachment 8818918 [details] [diff] [review]:
-----------------------------------------------------------------
Needs a commit message explaining what was happening, and if this invalidates TELEMETRY_STRINGIFY data before/after this change.
Assignee | ||
Comment 4•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #3)
> Comment on attachment 8818918 [details] [diff] [review]
> bug1323628.patch
>
> Review of attachment 8818918 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Needs a commit message explaining what was happening, and if this
> invalidates TELEMETRY_STRINGIFY data before/after this change.
Well, in fact I have yet to mark people for review and was in the process of writing a lengthy comment about that.
Comment 5•8 years ago
|
||
:S Sorry about that. The bug was too interesting to not investigate
Assignee | ||
Comment 6•8 years ago
|
||
This was real fun to debug :-)
All the errors seemed to point at the TELEMETRY_SEND_FAILURE histogram which, for some reason, was not storing the correct send failure times and was failing at [1]. However, dumping the histogram before the recording and soon after near [2], revealed that something was being added at the 0 bucket, which was unexpected. We should not see stuff in that bucket unless we're seeing values outside of our low range.
> {"min":1,"max":120000,"histogram_type":0,"sum":0,"ranges":[0,1,2,4,8,15,29,55,104,197,374,710,1348,2560,4861,9230,17527,33281,63196,120000],"counts":[8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}
Following up the investigation, I dumped |startTime| and |monotonicNow()| at [2]:
> 0:10.34 PROCESS_OUTPUT: Thread-3 (pid:12473) "**** DEBUG Times - now: 5047.530732 start Thu Dec 15 2016 15:59:52 GMT+0100 (CET) elapsed -1481813987762.4692"
Bingo. While |monotonicNow()| reports the correct value, |startTime| seems to be a Date object. After checking that |monotonicNow()| wasn't failing, I noticed that we're using again |startTime| a few lines below, at [3]. Apparently, |_onPingRequestFinished| was using the new Date values instead of the one from monotonicNow(), producing a negative elapsed time that ended up in the 0 bucket.
This seems to be fallout from bug 1318284, but I honestly have no idea about why it's showing up only now.
I think both TELEMETRY_SEND_FAILURE and TELEMETRY_SEND_SUCCESS might be affected and their histograms [4] seem to only have values outside of the desired range. The telemetry histograms which are changed further down and rely on startDate seem to behave correctly [5]. This patch won't affect them.
[1] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
[2] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/TelemetrySend.jsm#843
[3] - https://dxr.mozilla.org/mozilla-central/rev/8103c612b79c2587ea4ca1b0a9f9f82db4b185b8/toolkit/components/telemetry/TelemetrySend.jsm#922,987,1006
[4] - https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-12-13&keys=__none__!__none__!__none__&max_channel_version=nightly%252F53&measure=TELEMETRY_SEND_SUCCESS&min_channel_version=nightly%252F50&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-11-21&table=1&trim=1&use_submission_date=0
[5] - https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-12-13&keys=__none__!__none__!__none__&max_channel_version=nightly%252F53&measure=TELEMETRY_STRINGIFY&min_channel_version=nightly%252F50&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-11-21&table=1&trim=1&use_submission_date=0
Assignee | ||
Updated•8 years ago
|
Attachment #8818918 -
Flags: review?(chutten)
Comment 7•8 years ago
|
||
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch
Review of attachment 8818918 [details] [diff] [review]:
-----------------------------------------------------------------
I'd have preferred some comment in the commit msg, but a bugzilla comment probably does the trick.
Maybe an underflow that only happens when now() - monotonic now() is greater than a certain value? Anyhoo, r+
Attachment #8818918 -
Flags: review?(chutten) → review+
Assignee | ||
Comment 8•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #7)
> Maybe an underflow that only happens when now() - monotonic now() is greater
> than a certain value? Anyhoo, r+
Thanks for the r+.
I think that's even simpler than the underflow: we were computing the elapsed time between the number returned from |monotonicNow()| (which is the milliseconds since the process started) and |new Date()| (which is milliseconds since Unix epoch). This will always be negative.
Unfortunately I have no clue about this didn't blow up before... and this is driving me crazy :(
Assignee | ||
Comment 9•8 years ago
|
||
Updated•8 years ago
|
status-firefox51:
--- → unaffected
status-firefox52:
--- → unaffected
Keywords: leave-open
Whiteboard: [test disabled]
Comment 10•8 years ago
|
||
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e35ba42b42c
Fix test_TelemetrySend.js failing and enable it again. r=chutten
Assignee | ||
Updated•8 years ago
|
Whiteboard: [measurement:client]
Comment 11•8 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Updated•8 years ago
|
Whiteboard: [measurement:client] → [measurement:client] [measurement:client:uplift]
Comment 12•8 years ago
|
||
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch
Approval Request Comment
[Feature/Bug causing the regression]: Event Telemetry.
[User impact if declined]: Data of Event Telemetry behavior reaches us later, delaying our decision making for the project.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes, bug 1302671.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]:
The individual parts are:
* bug 1302663 - basic Telemetry implementation
* bug 1316810 - adjust event limits
* bug 1318284 - improve metrics that track effects on ping sending
* bug 1323628 - fix
* bug 1316281 - record search event in Telemetry
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It's validated on Nightly and the changes are selective and relatively well understood.
[String changes made/needed]: No.
This is part of an uplift need for the initial Event Telemetry [1][2] implementation to Firefox 52.
We want to move this to Beta a little faster to see somewhat realistic data of an example event (search) coming in and verify that it matches our expectations.
I validated that this behaves as expected on Nightly over the last two weeks in bug 1302671.
1: https://docs.google.com/document/d/1hNuS9lUJMvMqgntZXbFA6xZBU9zBpQgo7x73-sXKRpI/
2: https://wiki.mozilla.org/Event_Telemetry
Attachment #8818918 -
Flags: approval-mozilla-aurora?
Comment 13•8 years ago
|
||
Comment on attachment 8818918 [details] [diff] [review]
bug1323628.patch
add event telemetry for aurora52
Attachment #8818918 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 14•8 years ago
|
||
bugherder uplift |
Updated•8 years ago
|
Whiteboard: [measurement:client] [measurement:client:uplift] → [measurement:client]
Comment 15•8 years ago
|
||
setting flags for the uplift in comment #14
You need to log in
before you can comment on or make changes to this bug.
Description
•