We want to know the latency between when client-side activity happens and when we receive the data in the pipeline. Right now our standards are pretty low: our useful latency from FHR v2 is close to ten days. But going into the future we're going to want to monitor and improve this number significantly.
Hey Roberto, if you're going to be working on this question, I'd recommend that you review the thread from fhr-dev with subject "how long does it take to get all the FHR submissions we're gonna get?" (active May and June of 2014), and the documents attached to that thread, wherein I tried to address latency for FHR v2. Not sure if that methodology will be applicable to v4, but it might be; even if it's not, the distinction between (a) the time it takes to capture X% of *activity* and (b) the time it takes to get a ping from X% of *users* will remain relevant. Happy to discuss this if you'd like.
I have a PR pending for the metrics below. Brendan's comment about those:

"I think these metrics are interesting and totally reasonable to collect and monitor as quick and straightforward metrics that can be found using Heka filters, but I'd be concerned that they aren't exactly what we might want because e.g. looking back to the most recent ping for *every* profile doesn't filter out profiles from users that have started a new profile or stopped using FF entirely. This would be similar to estimating reporting latency based on 'thisPingDate' from FHR v2, which I found to be quite sensitive to the set of assumptions made about the background attrition rate. I also think that once we have longitudinal data (when the session stitching API is complete), it will make sense to look at more than just the most recent submission -- things like inter-arrival times will be available, which will give us more data to work with when formulating estimates."

1) How many days do we need to look back for k% of active profiles to be up-to-date?
Each active profile is associated with the date on which we last received a submission for it on our servers. Periodically, we compute for all active profiles the difference between the current date and the date of reception, and plot a histogram of the differences expressed in days.

2) How many days do we need to look back to observe k% of active profile activity?
Each active profile is associated with its last activity date. Periodically, we compute for all active profiles the difference between the current date and the last activity date, and plot a histogram of the differences expressed in days.

3) What's the delay in hours between the start of a session activity for an active profile and the time we receive the submission?
When we receive a new submission for an active profile, we compute the delay from the start of the submission activity to the time we received the submission on our servers. Periodically, we plot a histogram of the latencies expressed in hours.

Note that:
- As timeseries of histograms or heatmaps are not supported by the Heka plotting facilities, only the median and some other percentiles are being output.
- An active profile is one that has used the browser in the last six weeks (42 days).
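To make metrics 1 and 2 concrete, here is a minimal Python sketch of the computation. It is not the Heka filter from the PR; the record layout (`last_received`, `last_activity`), the function names, and the nearest-rank percentile rule are all assumptions made for illustration.

```python
import math
from datetime import date

ACTIVE_WINDOW_DAYS = 42  # "active" = used the browser in the last six weeks


def active_profiles(profiles, today):
    """Keep only profiles whose last activity falls within the active window."""
    return {
        pid: p
        for pid, p in profiles.items()
        if (today - p["last_activity"]).days <= ACTIVE_WINDOW_DAYS
    }


def lookback_days(last_dates, today, k):
    """Smallest look-back d (in days) such that at least k% of the given
    dates are no more than d days before `today` (nearest-rank percentile)."""
    ages = sorted((today - d).days for d in last_dates)
    if not ages:
        raise ValueError("no profiles to measure")
    rank = max(1, math.ceil(k / 100 * len(ages)))
    return ages[rank - 1]


def metric_1(profiles, today, k):
    """Days to look back for k% of active profiles to be up-to-date."""
    active = active_profiles(profiles, today)
    return lookback_days([p["last_received"] for p in active.values()], today, k)


def metric_2(profiles, today, k):
    """Days to look back to observe k% of active profile activity."""
    active = active_profiles(profiles, today)
    return lookback_days([p["last_activity"] for p in active.values()], today, k)
```

Note that this sketch shares the weakness Brendan describes: a profile that was abandoned less than 42 days ago still counts as active and inflates the look-back estimate.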
0) I was actually hoping to start out with something much simpler: for data received recently (past 24 hours, past 7 days), what is the distribution of latency between the client collecting the data and the server receiving it? I'm hoping that this number or distribution is something that we can constantly monitor, broken down by release channel, to detect regressions in the client. Although perhaps this is what you meant by #3? The big difference appears to be that I care about the "end" of the session (when we save the ping) and not as much the beginning of the session. Also, granularity smaller than a day probably doesn't matter, so basically group the results into "the same day" and "N days late". #1/#2 can be computed periodically (once a month); we don't need to constantly monitor them.
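A rough Python sketch of this simpler metric, for discussion only: the field names (`channel`, `saved`, `received`) and the function name are hypothetical, both timestamps are assumed to be in UTC, and "days late" is taken as the difference in calendar dates. Negative differences, which could come from a skewed client clock, are clamped to zero, which is also an assumption.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def latency_by_channel(submissions, now, window=timedelta(days=7)):
    """For submissions received within `window` of `now`, count how many
    arrived the same day (0) or N days late, grouped by release channel."""
    buckets = defaultdict(Counter)
    for s in submissions:
        if not (now - window <= s["received"] <= now):
            continue  # only look at recently received data
        # Calendar-day difference between saving the ping and receiving it.
        days_late = (s["received"].date() - s["saved"].date()).days
        # Clamp negative values (client clock ahead of server) to "same day".
        buckets[s["channel"]][max(0, days_late)] += 1
    return buckets
```

Calling it with `window=timedelta(days=1)` gives the past-24-hours view. Because the buckets use calendar dates, a ping saved at 23:00 and received at 01:00 counts as one day late; flooring the elapsed time to whole days is the obvious alternative.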
This filter has been deployed to production, and after a message matcher update is receiving data: https://pipeline-prototype-cep.prod.mozaws.net/#plugins/filters/TelemetryLatency (though a 1% sample of current data is not very much data).
The filter is currently down but it should be redeployed shortly. Benjamin mentioned that he would have liked some documentation about writing Heka filters. A good tutorial can be found at https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Exploring+with+the+Mozilla+Data+Pipeline+Demo.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED