Produce a dataset (i.e., a file) that is a dump of all the clients for each day of the year
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: ekr, Unassigned)
Details
Doing any work with MAU and DAU is quite difficult because you have to run complicated and slow queries. It would be much easier if we just had a dataset that one could download and then work with using commodity tools. Here's what I suggest:
One file per day, named by the day
Each line of the file is the SHA-1 of the client ID, truncated (128 bits is plenty), written in hex form, and terminated with a line feed
If we're concerned about mapping back to the client IDs, whoever does this could instead generate a random secret value R and then have the lines be SHA-1(R || Client Id). I don't need R, I just need uniqueness.
This should be about 3 TB once we're done.
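The line format described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline; the function name and example client ID are hypothetical, and the salted variant corresponds to the SHA-1(R || ClientId) scheme suggested for unlinkability.

```python
import hashlib
import secrets

def dump_line(client_id: str, secret: bytes = b"") -> str:
    """One dump-file line: SHA-1 of (optional secret R || client ID),
    truncated to 128 bits (32 hex chars), terminated with a line feed."""
    digest = hashlib.sha1(secret + client_id.encode("utf-8")).hexdigest()
    return digest[:32] + "\n"

# Plain variant: directly hash the client ID.
print(dump_line("abc"), end="")

# Salted variant: generate a random secret R once, use it for every line,
# then discard it; consumers only need uniqueness, not R itself.
R = secrets.token_bytes(32)
print(dump_line("abc", secret=R), end="")
```

At 33 bytes per line (32 hex characters plus a newline), a few hundred million clients per day over a year lands in the low terabytes, consistent with the rough 3 TB estimate.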
Comment 1•6 years ago
My initial concern here is understanding which source dataset would feed into this. Would we need this for desktop, mobile, etc. each separately? Would there be a need to filter on other dimensions (such as geo) before dumping the data?
Would better guidance or documentation on how to produce such a dataset be sufficient for the need? Or is it currently onerous for a data user to produce a temporary working dataset tailored for their needs?
Reporter
Comment 3•6 years ago
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)

> My initial concern here is understanding which source dataset would feed into this.

I want to process it locally with Python scripts and the like.

> Would we need this for desktop, mobile, etc. each separately?

Yes.

> Would there be a need to filter on other dimensions (such as geo) before dumping the data?

No.

> Would better guidance or documentation on how to produce such a dataset be sufficient for the need? Or is it currently onerous for a data user to produce a temporary working dataset tailored for their needs?

Well, when I just wanted to get MAU for the past two years, it took Mark something like a day to generate the dataset. That's not scalable if I want to do any more sophisticated analysis.
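To make the "process it locally" use case concrete: with one file per day as proposed, DAU and MAU become simple set operations over flat files. The sketch below is hypothetical (file naming, directory layout, and the 28-day MAU window are assumptions, not anything specified in this bug).

```python
from datetime import date, timedelta
from pathlib import Path

def dau(day: date, dump_dir: Path) -> int:
    """Daily active users: distinct hashed client IDs in one day's file.
    Assumes files are named by ISO date, one hashed ID per line."""
    path = dump_dir / day.isoformat()
    return len(set(path.read_text().splitlines()))

def mau(end_day: date, dump_dir: Path) -> int:
    """Monthly active users: union of hashed IDs across the 28 daily
    files ending at end_day (28 days is an assumed window)."""
    ids: set[str] = set()
    for offset in range(28):
        path = dump_dir / (end_day - timedelta(days=offset)).isoformat()
        ids.update(path.read_text().splitlines())
    return len(ids)
```

With data in this shape, a two-year MAU series is a loop over local files rather than a day-long query job.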
Updated•6 years ago
Assignee
Updated•3 years ago