Bug 1458736
Opened 6 years ago
Closed 6 years ago
Create script for generating sampled datasets by docType from landfill
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: amiyaguchi, Assigned: amiyaguchi)
These datasets are meant for schema validation, so the data should be sourced from landfill. The script should provide:
* A document limit per sample
* A folder structure that maps docTypes to datasets
* Deterministic sampling (based on a seed value)
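A rough sketch of how these three requirements could fit together, not the script that was actually written for this bug. It assumes each landfill document is a dict carrying a `docType` key, and that each sample is written as newline-delimited JSON to `<root>/<namespace>/<docType>.batch.json`; the function names, the `docType` field, and the file contents format are all hypothetical.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def sample_by_doctype(documents, limit, seed):
    """Return {docType: [doc, ...]} with at most `limit` documents per docType.

    Assumes every document is a dict with a "docType" key (hypothetical field).
    """
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["docType"]].append(doc)

    # A single seeded generator makes the result reproducible, provided the
    # input documents arrive in the same order on every run.
    rng = random.Random(seed)
    samples = {}
    for doc_type in sorted(groups):  # sorted so the RNG is consumed in a stable order
        docs = groups[doc_type]
        samples[doc_type] = rng.sample(docs, min(limit, len(docs)))
    return samples


def write_samples(samples, root, namespace="telemetry"):
    """Write one newline-delimited JSON file per docType under root/namespace."""
    out_dir = Path(root) / namespace
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for doc_type, docs in samples.items():
        path = out_dir / f"{doc_type}.batch.json"
        with path.open("w") as fh:
            for doc in docs:
                fh.write(json.dumps(doc, sort_keys=True) + "\n")
        paths.append(path)
    return paths
```

Calling `sample_by_doctype(docs, limit=1000, seed=42)` twice on the same input yields the same sample, which keeps the validation datasets stable between runs.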
Updated•6 years ago
Assignee: nobody → amiyaguchi
Status: NEW → ASSIGNED
Priority: P2 → P1
Comment 1•6 years ago
The notebook for generating the cleaned dataset can be found at [1]. A script that rewrites the paths to reflect the submission URI and mps folder structure can be found at [2].
[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/12423/command/12450
[2] https://github.com/acmiyaguchi/edge-validator/blob/e533f9fd3d155b35aa3a57288eb34003118140d4/sync.sh
Comment 2•6 years ago
I'm going to run some quick experiments on the document frequencies to see what the best configuration is for 1000 documents. I'm planning to turn the notebook into a python_mozetl job as-is and then iterate toward the best solution.
See Also: → 1463249
Comment 3•6 years ago
Here's the gist: https://gist.github.com/acmiyaguchi/ece7d850adcd7d641c8678f29d27894b
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 4•6 years ago
Example sizes for each of the pings.
```
$ ls -lh resources/data/telemetry/ | awk '{print $5 "\t" $9}'
72K anonymous.batch.json
502K core.batch.json
19M crash.batch.json
891K focus-event.batch.json
564K health.batch.json
9.3M heartbeat.batch.json
33M main.batch.json
828K mobile-event.batch.json
31M modules.batch.json
8.1M new-profile.batch.json
133M saved-session.batch.json
7.6M shield-study-addon.batch.json
9.2M shield-study.batch.json
151K testpilot.batch.json
6.9M update.batch.json
```
Updated•2 years ago
Component: Datasets: General → General