Closed Bug 1134669 Opened 9 years ago Closed 9 years ago

unified-FHR quality report: activity latency

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: rvitillo, Mentored)

Details

We want to know the latency between when client-side activity happens and when we receive the data in the pipeline. Right now our standards are pretty low: our useful latency from FHR v2 is close to ten days. But going into the future we're going to want to monitor and improve this number significantly.
Priority: -- → P2
Mentor: mtrinkala
Assignee: nobody → rvitillo
Hey Roberto, if you're going to be working on this question, I'd recommend that you review the thread from fhr-dev with subject "how long does it take to get all the FHR submissions we're gonna get?" (active May and June of 2014), and the documents attached to that thread, wherein I tried to address latency for FHR v2.

Not sure if that methodology will be applicable to v4, but it might be; even if it's not, the distinction between--
(a) the time it takes to capture X% of *activity*, and
(b) the time it takes to get a ping from X% of *users*
--will remain relevant.

Happy to discuss this if you'd like.
I have a PR pending for the metrics below [1]. 

Brendan's comment about those:
"I think these metrics are interesting and totally reasonable to
collect and monitor as quick and straightforward metrics that can be
found using Heka filters, but I'd be concerned that they aren't
exactly what we might want because e.g. looking back to the most
recent ping for *every* profile doesn't filter out profiles from users
that have started a new profile or stopped using FF entirely. This
would be similar to estimating reporting latency based on
"thisPingDate" from FHR v2, which I found to be quite sensitive to the
set of assumptions made about the background attrition rate.

I also think that once we have longitudinal data (when the session
stitching API is complete), it will make sense to look at more than
just the most recent submission -- things like inter-arrival times
will be available, which will give us more data to work with when
formulating estimates."

[1]
1) How many days do we need to look back for k% of active profiles to be up-to-date?
   Each active profile is associated with the date on which we last received a submission
   for it on our servers. Periodically, we compute for all profiles the difference
   between the current date and the date of reception and finally plot a histogram
   of the differences expressed in number of days.

2) How many days do we need to look back to observe k% of active profile activity?
   Each active profile is associated with its last activity date. Periodically, we compute
   for all active profiles the difference from the last activity date to the current date
   and finally plot a histogram of the differences expressed in number of days.

3) What’s the delay in hours between the start of a session activity for an active profile 
   and the time we receive the submission? When we receive a new submission for an active
   profile we compute the delay from the start of the submission activity to the time we
   received the submission on our servers. Periodically, we plot a histogram of the latencies
   expressed in hours.

Note that:
   - As timeseries of histograms or heatmaps are not supported by the Heka plotting
     facilities, only the median and some other percentiles are being output.

   - An active profile is one that has been used in the last six weeks (42 days).
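
For illustration only, metrics 1 and 2 could be sketched offline as below. This is not the Heka filter from the PR; the profile dictionary layout, the field names `last_submission` and `last_activity`, and the nearest-rank percentile rule are all assumptions made for the example.

```python
import math
from datetime import date

ACTIVE_WINDOW_DAYS = 42  # from the note above: active = used in the last six weeks


def active_profiles(profiles, today):
    """Keep only profiles whose last activity falls inside the active window."""
    return {
        pid: p
        for pid, p in profiles.items()
        if 0 <= (today - p["last_activity"]).days <= ACTIVE_WINDOW_DAYS
    }


def lookback_days(dates, today, k):
    """Smallest d such that at least k% of the dates are at most d days old."""
    ages = sorted((today - d).days for d in dates)
    if not ages:
        raise ValueError("no profiles to summarize")
    # Nearest-rank percentile (an assumed convention, not taken from the PR).
    rank = max(1, math.ceil(k * len(ages) / 100))
    return ages[rank - 1]


# Hypothetical sample data: profile id -> last activity / last submission dates.
today = date(2015, 3, 1)
profiles = {
    "a": {"last_activity": date(2015, 2, 28), "last_submission": date(2015, 3, 1)},
    "b": {"last_activity": date(2015, 2, 20), "last_submission": date(2015, 2, 22)},
    "c": {"last_activity": date(2015, 2, 1), "last_submission": date(2015, 2, 3)},
    "d": {"last_activity": date(2014, 12, 1), "last_submission": date(2014, 12, 2)},
}

active = active_profiles(profiles, today)  # "d" is dropped as inactive
metric_1 = lookback_days((p["last_submission"] for p in active.values()), today, 50)
metric_2 = lookback_days((p["last_activity"] for p in active.values()), today, 50)
print(metric_1, metric_2)
```

Reporting a few values of k (for example 50, 90, 99) would correspond to the "median and some other percentiles" mentioned in the note, since full histograms cannot be plotted over time.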
0) I was actually hoping to start out with something much simpler: for data received recently (past 24 hours, past 7 days), what is the distribution of latency between the client collecting the data and the server receiving it? I'm hoping that this number or distribution is something that we can constantly monitor broken down by release channel, to detect regressions in the client. Although perhaps this is what you meant by #3? The big difference appears to be that I care about the "end" of the session (when we save the ping) and not as much the beginning of the session.

Also granularity smaller than a day probably doesn't matter. So basically group the results into "the same day" and "N days late".
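
A minimal sketch of that day-granularity grouping, assuming each received ping can be reduced to a (channel, date the client saved the ping, date the server received it) tuple. The tuple layout and function names are hypothetical, and clamping negative differences (client clock skew) to "same day" is an assumption, not something decided in this bug.

```python
from collections import Counter, defaultdict
from datetime import date


def latency_bucket(saved, received):
    """Label a ping as "same day" or "N day(s) late" at whole-day granularity."""
    # Assumption: a client clock ahead of the server counts as zero delay.
    days = max(0, (received - saved).days)
    if days == 0:
        return "same day"
    return f"{days} day late" if days == 1 else f"{days} days late"


def latency_by_channel(pings):
    """Count latency buckets per release channel.

    `pings` is an iterable of (channel, saved_date, received_date) tuples.
    """
    counts = defaultdict(Counter)
    for channel, saved, received in pings:
        counts[channel][latency_bucket(saved, received)] += 1
    return counts


# Hypothetical pings received on 2015-03-01.
pings = [
    ("nightly", date(2015, 3, 1), date(2015, 3, 1)),
    ("nightly", date(2015, 2, 27), date(2015, 3, 1)),
    ("release", date(2015, 2, 28), date(2015, 3, 1)),
]
for channel, buckets in sorted(latency_by_channel(pings).items()):
    print(channel, dict(buckets))
```

Restricting `pings` to data received in the past 24 hours or past 7 days would give the two monitoring windows described in the comment above.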

#1/#2 can be computed periodically (once a month); we don't need to monitor them constantly.
This filter has been deployed to production, and after a message matcher update is receiving data: https://pipeline-prototype-cep.prod.mozaws.net/#plugins/filters/TelemetryLatency (though a 1% sample of current data is not very much data).
The filter is currently down but it should be redeployed shortly. Benjamin mentioned that he would have liked some documentation about writing Heka filters. A good tutorial can be found at https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Exploring+with+the+Mozilla+Data+Pipeline+Demo.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard