Open Bug 1520561 Opened 6 years ago Updated 4 months ago

Produce a dataset (i.e., a file) which is a dump of all the clients for each day of the year

Categories

(Data Platform and Tools :: General, enhancement, P2)

Points: 2

Tracking

(Not tracked)

People

(Reporter: ekr, Unassigned)

Details

Doing any work with MAU and DAU is quite difficult because you have to do complicated and slow queries. It would be much easier if we just had a data set that one could download and then work with using commodity tools. Here's what I suggest:

One file per day, named by the day
Each line of the file is a truncated (128 bits is plenty) SHA-1 of the client ID in hex form, terminated by a line feed

If we're concerned about mapping back to the client IDs, whoever does this could instead generate a random secret value R and then have each line be SHA-1(R || client ID). I don't need R; I just need uniqueness.
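A minimal sketch of the proposed file format in Python, assuming client IDs are available as strings. The names `new_secret`, `hashed_line`, and `write_day_file`, the 32-byte secret length, the example file name, and the deduplicate-and-sort step are illustrative assumptions, not part of the proposal. Only the truncated SHA-1, the optional secret prefix R, and the line-feed terminator come from the text above.

```python
import hashlib
import secrets


def new_secret(num_bytes: int = 32) -> bytes:
    """Generate a random secret R (the length is an assumption)."""
    return secrets.token_bytes(num_bytes)


def hashed_line(client_id: str, secret: bytes = b"") -> str:
    """Return the first 128 bits of SHA-1(secret || client_id) as 32 hex characters."""
    digest = hashlib.sha1(secret + client_id.encode("utf-8")).digest()
    return digest[:16].hex()


def write_day_file(path, client_ids, secret: bytes = b"") -> int:
    """Write one hashed client per line, each terminated by a line feed.

    Deduplicating and sorting are assumptions made here so that each
    client appears once per day and the file is stable across runs.
    Returns the number of lines written.
    """
    lines = sorted({hashed_line(c, secret) for c in client_ids})
    with open(path, "w", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)
```

For example, `write_day_file("2019-01-15.txt", ids, secret=R)` would produce one day's file. The same R has to be reused for every day, otherwise a client's hash would differ between files and counts across days would no longer work.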

This should be about 3 TB once we're done.
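The bug does not state the inputs behind the 3 TB figure, so the following is only a back-of-the-envelope sketch of how such an estimate can be checked. It assumes 33 bytes per line (32 hex characters for 128 bits, plus one line feed) and no compression. The function names and the 365-day period are assumptions for illustration.

```python
BYTES_PER_LINE = 32 + 1  # 32 hex characters (128 bits) plus one line feed


def dataset_bytes(clients_per_day: int, days: int) -> int:
    """Total uncompressed size for a given daily client count and number of days."""
    return BYTES_PER_LINE * clients_per_day * days


def clients_per_day_for(total_bytes: int, days: int) -> int:
    """Daily client count implied by a total size, rounded down."""
    return total_bytes // (BYTES_PER_LINE * days)
```

Under these assumptions, `clients_per_day_for(3 * 10**12, 365)` comes out at roughly 250 million lines per day for a single year of files.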

My initial concern here is understanding which source dataset would feed in to this. Would we need this for desktop, mobile, etc. each separately? Would there be a need to filter on other dimensions (such as geo) before dumping the data?

Would better guidance or documentation on how to produce such a dataset be sufficient for the need? Or is it currently onerous for a data user to produce a temporary working dataset tailored for their needs?

Which commodity tools did you have in mind?

Flags: needinfo?(ekr)

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)

> My initial concern here is understanding which source dataset would feed in to this.

I want to process it locally with Python scripts and the like.

> Would we need this for desktop, mobile, etc. each separately?

Yes.

> Would there be a need to filter on other dimensions (such as geo) before dumping the data?

No.

> Would better guidance or documentation on how to produce such a dataset be sufficient for the need? Or is it currently onerous for a data user to produce a temporary working dataset tailored for their needs?

Well, when I just wanted to get MAU for the past two years, it took Mark something like a day to generate the data set. That's not scalable if I want to do any more sophisticated analysis.
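As an illustration of the kind of local processing this would enable, here is a rough Python sketch that computes DAU and MAU from per-day files. It assumes files named `YYYY-MM-DD.txt` in one directory and a 28-day window for MAU. Both are assumptions, as are the function names. It also holds whole days in memory as sets, which is fine for small samples but would likely need a streaming or sorted-merge approach at full size.

```python
from datetime import date, timedelta
from pathlib import Path


def load_day(directory, day: date) -> set:
    """Read one day's file into a set of hashed client IDs (empty if the file is missing)."""
    path = Path(directory) / f"{day.isoformat()}.txt"
    if not path.exists():
        return set()
    with path.open() as f:
        return {line.strip() for line in f if line.strip()}


def dau(directory, day: date) -> int:
    """Daily active users: distinct hashes in that day's file."""
    return len(load_day(directory, day))


def mau(directory, end_day: date, window: int = 28) -> int:
    """Distinct hashes seen on any of the `window` days ending at end_day."""
    seen = set()
    for offset in range(window):
        seen |= load_day(directory, end_day - timedelta(days=offset))
    return len(seen)
```

Since each file only needs to be read once per window, computing a MAU series over two years is a matter of looping `mau` over the dates of interest.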

Flags: needinfo?(ekr)
Points: --- → 2
Priority: -- → P2
Component: Datasets: General → General