dzeber and I have been using the longitudinal dataset for a modeling project. I'd like to work with someone on generating a longitudinal dataset across all profiles with the following features:

* release channel
* US geo_country and en-US locale
* profile created between 2016-01-01 and 2016-03-31

Profiles that meet the above make up 0.8% of profiles in the current longitudinal dataset, implying the resulting data from this job likely won't be bigger than a traditional longitudinal dataset. We've seen positive results from our current model; however, we haven't been able to segment fully, in an effort to keep our data reasonably sized.
is this still valid Ben?
Yes, this is still valid; however, I'd like to make some adjustments. With our v1 results in and recent requests, it would be nice to have a script that allows these constraints to be passed by the user. For example, we could run something like:

spark-submit -- [...] \
  --channel release \
  --from 20160101 \
  --to 20160630 \
  --geo US \
  --locale en-US \
  --profile_creation_min 20160101 \
  --profile_creation_max 20160331

outputting a longitudinal dataset that meets these constraints. This functionality (in addition to a larger n) would allow us to iterate more quickly as new requests come in. I'm unsure how feasible this type of job is, since it would have to read 100% of the data; this is, however, the best-case scenario.
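To make the idea concrete, here is a minimal Python sketch of how such a script might parse those flags and turn them into filter predicates. This is purely illustrative: the flag names follow the example command above, but the column names (channel, geo, locale, profile_creation) and the Spark read itself are assumptions, not the actual longitudinal schema or job.

```python
import argparse


def parse_args(argv=None):
    # Flag names mirror the spark-submit example above; the driver logic
    # (reading the longitudinal data with Spark) is intentionally omitted.
    p = argparse.ArgumentParser(
        description="Build a constrained subset of the longitudinal dataset")
    p.add_argument("--channel")
    p.add_argument("--from", dest="from_date")
    p.add_argument("--to", dest="to_date")
    p.add_argument("--geo")
    p.add_argument("--locale")
    p.add_argument("--profile_creation_min")
    p.add_argument("--profile_creation_max")
    return p.parse_args(argv)


def build_filter_clauses(args):
    # Translate parsed args into SQL-style predicates that a Spark job
    # could pass to DataFrame.filter(). Column names here are hypothetical.
    clauses = []
    if args.channel:
        clauses.append("channel = '%s'" % args.channel)
    if args.geo:
        clauses.append("geo = '%s'" % args.geo)
    if args.locale:
        clauses.append("locale = '%s'" % args.locale)
    if args.profile_creation_min:
        clauses.append("profile_creation >= '%s'" % args.profile_creation_min)
    if args.profile_creation_max:
        clauses.append("profile_creation <= '%s'" % args.profile_creation_max)
    return clauses


if __name__ == "__main__":
    for clause in build_filter_clauses(parse_args()):
        print(clause)
```

Only profiles matching every clause would be kept, so omitting a flag simply skips that constraint; the expensive part (the full-data scan) is unchanged by how the constraints are expressed.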
Thanks Ben, is this something you are up for doing? I see the mentor field is filled out, but Mark Reid or Harter could help as well.
In a meeting last week with mreid and rvitillo, it was decided that something like this (in the current state of things) isn't feasible, since such a job would be very expensive.