Closed Bug 1458736 Opened 6 years ago Closed 6 years ago

Create script for generating sampled datasets by docType from landfill

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

These datasets are meant for schema validation, so the data should be sourced from landfill. Requirements:
* A document limit per sample
* A folder structure mapping docTypes to datasets
* Deterministic sampling (based on a seed value)
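The requirements above could be sketched as a seed-based sampler. This is a minimal illustration, not the actual mozetl job; the function name and arguments are hypothetical. Hashing the seed together with each document id gives a stable ranking, so the same seed and inputs always yield the same sample.

```python
import hashlib

def deterministic_sample(doc_ids, limit, seed):
    """Pick up to `limit` documents, reproducibly for a given seed."""
    def rank(doc_id):
        # Hash seed + id so the same inputs always rank the same way.
        return hashlib.sha256(f"{seed}:{doc_id}".encode()).hexdigest()
    return sorted(doc_ids, key=rank)[:limit]

# The same seed produces the same sample on every run.
ids = [f"doc-{i}" for i in range(100)]
sample = deterministic_sample(ids, limit=5, seed=42)
```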
Assignee: nobody → amiyaguchi
Status: NEW → ASSIGNED
Priority: P2 → P1
Blocks: 1452166
The notebook for generating the cleaned dataset can be found at [1]. A script that converts the path to reflect the submission URI and MPS folder structure can be found at [2].

[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/12423/command/12450
[2] https://github.com/acmiyaguchi/edge-validator/blob/e533f9fd3d155b35aa3a57288eb34003118140d4/sync.sh
See Also: → 1462433
I'm going to run some quick experiments on the document frequencies to see what the best configuration is for 1000 documents. I'm planning on turning the notebook into a python_mozetl job as-is and iterating toward the best solution going forward.
See Also: → 1463249
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1464484
Example sizes for each of the pings.
```
$ ls -lh resources/data/telemetry/ | awk '{print $5 "\t" $9}'
72K	anonymous.batch.json
502K	core.batch.json
19M	crash.batch.json
891K	focus-event.batch.json
564K	health.batch.json
9.3M	heartbeat.batch.json
33M	main.batch.json
828K	mobile-event.batch.json
31M	modules.batch.json
8.1M	new-profile.batch.json
133M	saved-session.batch.json
7.6M	shield-study-addon.batch.json
9.2M	shield-study.batch.json
151K	testpilot.batch.json
6.9M	update.batch.json
```
Blocks: 1465242
Component: Datasets: General → General