We have a lot of shared boilerplate in our dataset jobs that could use a good refactoring, both for maintainability and so we can create new datasets more quickly. In particular, the datasets that are generated from telemetry pings share a lot of underlying structure. Some examples of shared code: common CLI options, filtering pings, going from RDD -> Spark DataFrame, and writing the dataset back out. *Maybe* we could also define the schema and field generation in the same place in a higher-level DSL?
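To make the idea concrete, here is a rough sketch of what a shared base class might look like. Everything in it is hypothetical: the names `PingDatasetJob`, `MainSummaryLite`, `keep`, `rows`, the `fields` mapping, and the CLI flags are made up for illustration and do not come from any existing job. Spark is deliberately left out so the sketch stays self-contained: plain dicts stand in for pings, and a JSON-lines file stands in for the RDD -> DataFrame conversion and the real output writer.

```python
import argparse
import json
from abc import ABC, abstractmethod


class PingDatasetJob(ABC):
    """Hypothetical base class holding boilerplate shared by ping-derived jobs."""

    # Subclasses declare their output fields once:
    # field name -> function that extracts the value from a ping (a dict here).
    fields = {}

    @classmethod
    def parser(cls):
        # CLI options common to every job (flag names are illustrative).
        p = argparse.ArgumentParser(description=cls.__doc__)
        p.add_argument("--from-date", required=True)
        p.add_argument("--to-date", required=True)
        p.add_argument("--output", required=True)
        return p

    @abstractmethod
    def keep(self, ping):
        """Return True if the ping belongs in this dataset."""

    def rows(self, pings):
        # Filter the pings, then build one row per ping from the declared fields.
        for ping in pings:
            if self.keep(ping):
                yield {name: extract(ping) for name, extract in self.fields.items()}

    def write(self, rows, output):
        # Stand-in for writing the dataset back out; a real job would
        # build a Spark DataFrame and write that instead.
        with open(output, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def run(self, pings, argv=None):
        # The date options are parsed but unused in this sketch; a real job
        # would use them to select which pings to load.
        args = self.parser().parse_args(argv)
        self.write(self.rows(pings), args.output)


class MainSummaryLite(PingDatasetJob):
    """Toy example job: keeps main pings and extracts two fields."""

    fields = {
        "client_id": lambda p: p.get("clientId"),
        "channel": lambda p: p.get("channel"),
    }

    def keep(self, ping):
        return ping.get("docType") == "main"
```

With something along these lines, a new dataset would only need to supply `fields` and `keep`. Whether the field declarations should also generate the schema is the open DSL question above.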
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → INCOMPLETE