Bug 1462433 (Closed) — Streaming document statistics (count, clock-skew) by doctype
Opened 7 years ago · Closed 4 years ago
Category: Data Platform and Tools :: Monitoring & Alerting (enhancement)
Status: RESOLVED WONTFIX
Reporter: amiyaguchi · Assignee: Unassigned
We should have document statistics about the data coming through the pipeline, computed over short aggregation windows (e.g. 1 hour). The primary measurement is document frequency over time. This can help monitor new and existing document types or tagged pings.
I want to sample a fixed number of documents per type through `RDD.sampleByKey` in Spark.
The document statistics are important for visibility and are applicable to monitoring. We should only look at metadata tagged on the documents; this includes the HTTP header timestamp (to calculate clock-skew) and the submission request URI.
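As a rough sketch of the sampling step, assuming PySpark, a hypothetical RDD `docs` of parsed ping dicts, and an illustrative per-doctype target size `target_n` (none of these names come from the bug), the per-type fractions can be derived from a first counting pass:

```
# Minimal sketch: stratified sampling by document type with RDD.sampleByKey.
# `docs` is an assumed RDD of parsed ping dicts; field paths are assumptions.
target_n = 100

pairs = docs.map(lambda d: (d["meta"]["doc_type"], d))  # key by document type
counts = pairs.countByKey()                             # pass 1: frequency per doc_type

# fraction needed so that each doc_type yields roughly target_n documents
fractions = {k: min(1.0, target_n / n) for k, n in counts.items()}

# pass 2: stratified sample, approximately target_n documents per doc_type
sample = pairs.sampleByKey(withReplacement=False, fractions=fractions)
```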
==========
The questions I want to answer are:
* What is the frequency of each document type per day?
* What fraction of the documents do I need to sample to obtain roughly X documents, where X is much smaller than the total document count?
* What was the set of document types seen today, this week, etc.?
* When was the first date that this document type was seen in the dataset?
* What is the duplicate rate for the document type?
* What is the clock-skew distribution of the documents in this period?
```
-- only the minimal set of features is uncommented
SELECT
    timestamp,  -- at a 5-30 minute resolution
    namespace,
    doc_type,
    -- doc_version,
    -- count(distinct submit_uri) as n_submit_uri,
    count(distinct document_id) as n_distinct_documents,
    count(document_id) as n_documents,
    approx_percentile(server_timestamp - client_timestamp, array(0.25, 0.5, 0.75)) as qtl_clock_skew,
    approx_percentile(server_timestamp - client_timestamp, array(0.9, 0.95, 0.99)) as tail_qtl_clock_skew
    -- , t_digest(server_timestamp - client_timestamp) as clock_skew_digest
FROM `kafka-topic`
GROUP BY 1, 2, 3
```
* `namespace` and `doc_type` are both defined in the generic ingestion specification and also shared by telemetry.
* `n_documents` is the number of documents seen during a period
* `n_distinct_documents` is used to find the number of duplicated documents (duplicates = n_documents - n_distinct_documents)
I've included both central and tail quantiles of the clock-skew as a secondary measurement. It is possible to use a t-digest sketch to aggregate percentiles, quantiles, and trimmed means of the clock-skew; the commented-out `clock_skew_digest` column is where a t-digest of the clock-skew would fit.
It may be useful to run the t-digest over the log of the clock-skew because the values are long-tailed.
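As a simpler stand-in for that idea (PySpark's built-in approximate quantiles rather than an actual t-digest; the timestamp column names and the DataFrame `df` are assumptions), the log transform could be applied before computing the quantiles:

```
# Minimal sketch: approximate quantiles of log clock-skew, not a t-digest.
# `df`, `server_timestamp`, and `client_timestamp` are assumed names.
from pyspark.sql import functions as F

skew = df.select(
    (F.col("server_timestamp") - F.col("client_timestamp")).alias("clock_skew")
)

# log1p keeps zero skew finite; negative skew would need separate handling
log_skew = skew.select(
    F.log1p(F.greatest(F.col("clock_skew"), F.lit(0))).alias("log_skew")
)

quantiles = log_skew.approxQuantile(
    "log_skew", [0.25, 0.5, 0.75, 0.9, 0.95, 0.99], 0.01
)
```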
I would also like to include a data structure for performing efficient time-series similarity joins. I'm interested in the matrix profile for measuring the transient and steady-state behavior of document volumes.
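For concreteness, the sketch below shows the primitive the matrix profile is built from: a z-normalized distance profile between one query window and every window of a count series. This is a naive NumPy illustration only; the linked page describes the much faster STOMP/SCRIMP algorithms.

```
# Minimal sketch (naive, not STOMP/SCRIMP): z-normalized Euclidean distance
# from one query window to every window of a document-count time series.
import numpy as np

def znorm(x):
    """Z-normalize a window; guard against zero variance."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def distance_profile(series, query):
    """Distance from `query` to every length-m window of `series`."""
    m = len(query)
    q = znorm(np.asarray(query, dtype=float))
    s = np.asarray(series, dtype=float)
    dists = np.empty(len(s) - m + 1)
    for i in range(len(dists)):
        dists[i] = np.linalg.norm(znorm(s[i:i + m]) - q)
    return dists

# The matrix profile is the minimum of the distance profile computed for every
# window of the series against all other (non-trivially-matching) windows.
```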
==========
I'm going to prototype the dataset using a Spark Streaming job in Databricks, since up-to-date document frequency information is needed for sampling from landfill.
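A minimal sketch of what that streaming prototype might look like with Structured Streaming; the broker address, topic name, and JSON field paths are all assumptions, not the actual pipeline configuration:

```
# Minimal Structured Streaming sketch: windowed document counts per
# namespace/doc_type read from a Kafka topic. Broker, topic, and the
# "$.meta.*" JSON paths are assumptions about the message layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "telemetry-ingestion")
       .load())

parsed = raw.select(
    F.col("timestamp"),
    F.get_json_object(F.col("value").cast("string"), "$.meta.namespace").alias("namespace"),
    F.get_json_object(F.col("value").cast("string"), "$.meta.doc_type").alias("doc_type"),
)

counts = (parsed
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "namespace", "doc_type")
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
```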
This need may be better served by a system such as hindsight or statsd (or any other time-series database/processing system).
See bug 1458736 and bug 1458734 for more details on sampling.
See the t-digest for approximate rank statistics: https://github.com/tdunning/t-digest
See the matrix profile for similarity joins: http://www.cs.ucr.edu/%7Eeamonn/MatrixProfile.html
Comment 1 (Reporter) • 7 years ago
To clarify: the document statistics are meant to supplement batched sampling from landfill. If implemented, reservoir sampling would be a more efficient way to gather uniformly sampled data by document type, because it requires only a single pass over the data.
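For illustration, single-pass reservoir sampling (Algorithm R) per document type might look like the sketch below; the input stream of (doc_type, document) pairs is an assumption:

```
# Minimal sketch of per-doctype reservoir sampling (Algorithm R): one pass,
# a uniform sample of up to k documents per doc_type. The input stream of
# (doc_type, document) pairs is assumed.
import random
from collections import defaultdict

def reservoir_sample_by_type(stream, k, seed=None):
    rng = random.Random(seed)
    reservoirs = defaultdict(list)   # doc_type -> current sample
    seen = defaultdict(int)          # doc_type -> documents seen so far
    for doc_type, doc in stream:
        seen[doc_type] += 1
        n = seen[doc_type]
        if len(reservoirs[doc_type]) < k:
            reservoirs[doc_type].append(doc)
        else:
            j = rng.randrange(n)     # keep the new doc with probability k / n
            if j < k:
                reservoirs[doc_type][j] = doc
    return dict(reservoirs)
```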
However, there are limited options for obtaining a small sample of pre-validated documents for use in CI in the `mozilla-pipeline-schemas` repository. See bug 1447851 for the currently available telemetry APIs for accessing raw ingestion data. Having information about doctype frequency can improve the performance of sampling data from landfill (bug 1458736) through `sampleByKey`. This requires two passes over the data, but the first pass, which generates the document statistics, can be useful elsewhere.
A better first prototype would be a batch job using the moztelemetry API. I am still strongly interested in our Spark Streaming capabilities, though.
Updated • 4 years ago
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX