The main summary is fairly large, so it's useful to work with a smaller sample of the data on projects that rely heavily on the dataset. The sample_id field exists for this purpose, but resources on how to use it are sparse. The field is documented in the telemetry-batch-view repository. The Main Summary Tutorial should be updated to explain and show the nuances of obtaining a small representative sample of the data, covering parquet partitioning and small code snippets that take advantage of the sample_id field. The distribution of sample_ids within a month or day can help explain how it works. The report should be added on RTMO.
https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md
https://gist.github.com/mreid-moz/518f7515aac54cd246635c333683ecce
Assignee: nobody → amiyaguchi
Points: --- → 1
Created attachment 8824270: How to use main_summary.sample_id
I've written a gist that shows the properties of the sample_id and how to use it in pyspark to select a subset of the main summary.
Comment on attachment 8824270: How to use main_summary.sample_id
Link to gist on how to use main_summary.sample_id:
https://gist.github.com/acmiyaguchi/0b3772807f146575420a9e157b10fbb9
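For readers of this bug without the gist open, here is a minimal sketch of the idea in plain Python. It is not taken from the gist. It assumes that sample_id is derived as crc32(client_id) % 100, which is my reading of the MainSummary docs linked above and should be checked there. The helper names (`sample_id`, `take_sample`) and the generated client ids are made up for illustration.

```python
import random
import uuid
import zlib
from collections import Counter


def sample_id(client_id: str) -> int:
    """Map a client_id to a bucket in [0, 100).

    Assumption: this mirrors how main_summary derives sample_id
    (crc32 of the client_id, modulo 100).
    """
    return (zlib.crc32(client_id.encode("utf-8")) & 0xFFFFFFFF) % 100


def take_sample(client_ids, wanted_ids):
    """Keep only the clients whose sample_id falls in wanted_ids."""
    wanted = set(wanted_ids)
    return [c for c in client_ids if sample_id(c) in wanted]


# Hypothetical client ids: seeded random UUIDs so the run is repeatable.
rng = random.Random(0)
clients = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(100_000)]

# A single sample_id value selects roughly 1% of clients.
one_percent = take_sample(clients, [42])

# Distribution of clients across the 100 buckets.
bucket_counts = Counter(sample_id(c) for c in clients)

print(len(one_percent), "of", len(clients), "clients have sample_id 42")
print("buckets used:", len(bucket_counts))
```

Because the bucket depends only on the client_id, a given client always lands in the same sample, so a 1% sample stays consistent across days. In pyspark the same selection would be a filter on the sample_id column of the main summary, and since the dataset is partitioned on that field, such a filter should only need to read the matching partitions.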
Attachment #8824270 - Attachment is obsolete: true
Component: Metrics: Pipeline → Documentation and Knowledge Repo (RTMO)
Product: Cloud Services → Data Platform and Tools
It would be great if we could move this to a cookbook in the gitbook!