Closed Bug 1462433 Opened 7 years ago Closed 4 years ago

Streaming document statistics (count, clock-skew) by doctype

Categories

(Data Platform and Tools :: Monitoring & Alerting, enhancement)


RESOLVED WONTFIX

People

(Reporter: amiyaguchi, Unassigned)

References

Details

We should have document statistics about the data coming through the pipeline in short aggregates (1 hour). The primary measurement is document frequency over time. This can help monitor new and existing document types or tagged pings. I want to sample a fixed set of documents by type through `Spark.RDD.sampleByKey`. The document statistics are important for visibility and are directly applicable to monitoring. We should only look at the meta information tagged on the documents. This includes the HTTP header timestamp, used to calculate clock-skew, and the submission request URI.

==========

The questions I want to answer are:

* What is the frequency of each document type per day?
* What fraction of the documents do I need to sample to obtain exactly X documents, where X is much smaller than the total document count?
* What was the set of documents seen today, this week, etc.?
* When was the first date that this document type was seen in the dataset?
* What is the duplicate rate for the document type?
* What is the clock-skew of the documents in this period?

```
-- only the minimal set of features is uncommented
SELECT
  timestamp,  -- at a 5-30 minute resolution
  namespace,
  doc_type,
  -- doc_version,
  -- count(distinct submit_uri) as n_submit_uri,
  count(distinct document_id) as n_distinct_documents,
  count(document_id) as n_documents,
  approx_percentile(server_timestamp - client_timestamp, [0.25, 0.5, 0.75]) as qtl_clock_skew,
  approx_percentile(server_timestamp - client_timestamp, [0.9, 0.95, 0.99]) as tail_qtl_clock_skew
  -- t_digest(server_timestamp - client_timestamp) as clock_skew_digest
FROM kafka-topic
GROUP BY 1, 2, 3
```

* `namespace` and `doc_type` are both defined in the generic ingestion specification and are also shared by telemetry.
* `n_documents` is the number of documents seen during a period.
* `n_distinct_documents` is used to find the number of duplicated documents (residual = n_documents - n_distinct_documents).

I've included both quantiles and biased quantiles of the clock-skew as a secondary measurement. It is possible to use the t-digest sketch to aggregate percentiles, quantiles, and trimmed means of the clock-skew; `clock_skew_digest` is a potential t-digest column over the clock-skew. It may be useful to run the t-digest over the log of the clock-skew because the values are long-tailed.

I would also like to include a data structure for performing efficient time-series similarity joins. I'm interested in the matrix profile for measuring the transient and steady-state behavior of documents.

==========

I'm going to prototype the dataset using a Spark Streaming job in Databricks, since immediate information about document frequency is needed for sampling in landfill. This need may be better served by a system such as hindsight or statsd (or any other time-series database/processing system). See bug 1458736 and bug 1458734 for more details on sampling.

See the t-digest for approximate rank statistics: https://github.com/tdunning/t-digest
See the matrix profile for similarity joins: http://www.cs.ucr.edu/%7Eeamonn/MatrixProfile.html
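As a rough illustration of the sampling question above, here is a minimal PySpark sketch of the two-pass approach: count documents per doctype, derive per-key fractions, and pass them to `sampleByKey`. The `pings` RDD, the `target` size, and the helper name are assumptions for illustration, not existing pipeline code.

```
# Hypothetical sketch (not pipeline code): derive sampleByKey fractions from
# per-doctype counts so that each doctype yields roughly `target` documents.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def sample_fixed_per_doctype(pings, target, seed=42):
    """pings: RDD of (doc_type, document) pairs; returns ~target docs per doc_type."""
    # First pass: per-doctype counts, i.e. the document statistics described above.
    counts = pings.countByKey()
    # Fraction such that fraction * count ~= target, capped at 1.0 for rare doctypes.
    fractions = {k: min(1.0, float(target) / n) for k, n in counts.items() if n > 0}
    # Second pass: stratified sampling without replacement.
    return pings.sampleByKey(False, fractions, seed)

# Toy data standing in for landfill records.
pings = sc.parallelize([("main", {"id": i}) for i in range(1000)] +
                       [("crash", {"id": i}) for i in range(50)])
sample = sample_fixed_per_doctype(pings, target=10)
print(sample.countByKey())
```

Note that `sampleByKey` is probabilistic, so the per-doctype counts come out close to X rather than exactly X.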
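For the `clock_skew_digest` idea, here is a hedged sketch of how per-doctype clock-skew digests could be built and merged in Spark, using the third-party Python `tdigest` package as a stand-in for a `t_digest` aggregate. The `records` RDD and its field names are assumptions for illustration only.

```
# Hypothetical sketch (not pipeline code): merge clock-skew t-digests per
# (namespace, doc_type) with aggregateByKey, then read arbitrary quantiles.
from pyspark import SparkContext
from tdigest import TDigest  # third-party package, assumed available

sc = SparkContext.getOrCreate()

# Toy records standing in for parsed pings (assumed field names).
records = sc.parallelize([
    {"namespace": "telemetry", "doc_type": "main",
     "server_timestamp": 1000 + i, "client_timestamp": 1000 + i - (i % 7)}
    for i in range(500)
])

skews = records.map(lambda r: ((r["namespace"], r["doc_type"]),
                               r["server_timestamp"] - r["client_timestamp"]))

def add_observation(digest, skew):
    digest.update(skew)          # fold one clock-skew value into the digest
    return digest

def merge_digests(a, b):
    return a + b                 # combine two partial digests

skew_digests = skews.aggregateByKey(TDigest(), add_observation, merge_digests)

# Median and tail clock-skew per (namespace, doc_type) from the merged sketches.
for key, digest in skew_digests.collect():
    print(key, digest.percentile(50), digest.percentile(99))
```

The same pattern would work over the log of the clock-skew if the long tail makes the raw values hard to summarize.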
To clarify, the document statistics are meant to supplement batched sampling from landfill. If implemented, reservoir sampling would be a more efficient way to gather uniformly sampled data by document type, because it requires only a single pass over the data. However, there are limited options for obtaining a small sample of pre-validated data for use in CI in the `mozilla-pipeline-schemas` repository. See bug 1447851 for the currently available telemetry APIs for accessing raw ingestion data.

Having information about doctype frequency can improve the performance of sampling data from landfill (bug 1458736) through sampleByKey. This requires two passes over the data, but the first pass, which generates the document statistics, can be useful elsewhere.

A better first prototype would be a batch job through the moztelemetry API. I am still strongly interested in our Spark Streaming capabilities, though.
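Since reservoir sampling is mentioned above as the single-pass alternative, here is a minimal plain-Python sketch of per-doctype reservoir sampling (Algorithm R); the stream of (doc_type, document) pairs and the function name are hypothetical.

```
# Hypothetical sketch: uniform sample of at most k documents per doc_type in one pass.
import random
from collections import defaultdict

def reservoir_sample_by_doctype(stream, k, seed=42):
    rng = random.Random(seed)
    reservoirs = defaultdict(list)   # doc_type -> sampled documents
    seen = defaultdict(int)          # doc_type -> documents seen so far
    for doc_type, document in stream:
        seen[doc_type] += 1
        n = seen[doc_type]
        if len(reservoirs[doc_type]) < k:
            reservoirs[doc_type].append(document)
        else:
            # Replace a current sample member with probability k / n (Algorithm R).
            j = rng.randrange(n)
            if j < k:
                reservoirs[doc_type][j] = document
    return reservoirs

# Toy stream standing in for landfill records.
stream = [("main", i) for i in range(1000)] + [("crash", i) for i in range(30)]
samples = reservoir_sample_by_doctype(stream, k=10)
print({t: len(docs) for t, docs in samples.items()})
```

Unlike the two-pass count-then-sampleByKey approach, this yields exactly min(k, n) documents per doctype without needing the frequencies up front.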
See Also: → 1463249
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX