Export a sample data set of Telemetry V4 data organized by client id.

Status

RESOLVED FIXED

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Opened: 3 years ago
Last modified: 3 years ago

People

(Reporter: mreid, Assigned: mreid)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

3 years ago
Included data:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)

Format:
compressed plain text (gzipped unless otherwise requested) of the form
client_id<\t>json_metadata<\t>json_payload<\n>

The json_metadata field will contain the other bits and pieces of the Heka message that are not stored in the payload (geoCountry, server timestamp, etc.).
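For illustration, the three-field layout above can be read line by line with a plain split-and-parse step. This is a minimal sketch, not part of the export tooling; `parse_line` and the sample row are hypothetical.

```python
import json

def parse_line(line):
    # One export row: client_id<TAB>json_metadata<TAB>json_payload<NEWLINE>.
    # Returns the client id plus the two decoded JSON documents.
    client_id, meta_json, payload_json = line.rstrip("\n").split("\t")
    return client_id, json.loads(meta_json), json.loads(payload_json)

# Hypothetical sample row in the described format.
row = 'abc-123\t{"geoCountry": "US"}\t{"docType": "main"}\n'
cid, meta, payload = parse_line(row)
```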

Brendan, will that suit your needs?
(Assignee)

Updated

3 years ago
Flags: needinfo?(bcolloran)

Comment 1

3 years ago
I think I need it as just
client_id<\t>json_payload<\n>
for it to play nicely with mrjob, which will be expecting rows of (text, text) key/value pairs. Is it possible to stuff the metadata into extra fields in the JSON?

I'm actually not sure how the loading will work...
tmary, mreid will be sending me a big text dump of data for the new FHR v4 data that I'll need to load into peach HDFS to compare to existing FHR v2 data. Can you help me with this? I assume it's just an 'hdfs dfs' command? Can you provide an example? And are there any requirements on the text file that will make the loading easier, or things that we'll need to avoid to keep it from failing?

Thanks!
Flags: needinfo?(bcolloran) → needinfo?(tmeyarivan)

Comment 2

3 years ago
I really don't think we need to do this on HDFS, and it will be unnecessarily complex to do so. Since this is nightly-only, the total data size will be small enough to do on a single local machine (or on peach-gw, but not in the cluster). We can pull nightly data from FHRv2 to compare against pretty easily.

Comment 3

3 years ago
Sure, that'll work for samples up to several thousand. Sounds fine to me.

Benjamin, are you on top of pulling the v2 records that match the clientIds in the sample? In that case, for both v4 and v2 we could just send zipped text files back and forth.

Comment 4

3 years ago
It seems like the size of data is small enough to use scp/e-mail/.. - please let me know if there is a need to copy large datasets from S3 -> HDFS

--
Flags: needinfo?(tmeyarivan)

Updated

3 years ago
Assignee: nobody → mreid
Priority: -- → P1
(Assignee)

Comment 5

3 years ago
(In reply to brendan c from comment #1)
> I think I need it as just
> client_id<\t>json_payload<\n>
> for it to play nicely with mrjob, which will be expecting rows of
> (text, text) key/value pairs. Is it possible to stuff the metadata into
> extra fields in the JSON?

How about
client_id<\t>[meta,payload]<\n>

So each "value" is an array of length 2?

Comment 6

3 years ago
If we're going to be processing small samples locally, then the exact format is less critical because it doesn't have to play nicely with MapReduce. So
client_id<\t>[meta,payload]<\n>
is fine, but if it's easier to produce
client_id<\t>json_metadata<\t>json_payload<\n>
then that should be fine too.

thanks Mark!
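Either of the two layouts discussed above is easy to consume; as a hedged sketch, the two-field variant (where the value is a JSON array of length 2) could be decoded like this. `parse_pair_line` is a hypothetical helper, not anything shipped with the export.

```python
import json

def parse_pair_line(line):
    # Two-field variant: client_id<TAB>[meta, payload]<NEWLINE>.
    # The value is a single JSON array holding metadata and payload.
    client_id, value = line.rstrip("\n").split("\t", 1)
    meta, payload = json.loads(value)
    return client_id, meta, payload

# Hypothetical sample row.
cid2, meta2, payload2 = parse_pair_line(
    'abc-123\t[{"geoCountry": "US"}, {"docType": "main"}]\n'
)
```

Keeping the value as one JSON field is what makes this row a clean (text, text) key/value pair for mrjob.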
(Assignee)

Comment 7

3 years ago
It's relatively easy to produce the output in whatever format, so might as well go with something MR-friendly.

A snapshot from yesterday is uploading to peach-gw now. What is the easiest way to make that available to you (and others that might be interested)?
(Assignee)

Comment 8

3 years ago
The data file is available on peach-gw at /home/mreid/telemetry_main_snapshot20150309.txt.gz

I believe it should be readable by other peach-gw users, please let me know if you can't access it.

Feel free to re-open this bug if the data does not suit your needs.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 9

3 years ago
Wow, am I reading this right? 13G compressed? How many clientIds are here? Is it all of nightly?

Comment 10

3 years ago
Hey Mark, can you supply a few more details about the extract? Is this just all the raw incoming payloads? It appears not to be grouped by clientId or anything else ("session stitching" or even some cruder form of linking by clientId are still some distance off I take it?).

Comment 11

3 years ago
Another question: are the rows sorted by clientId? If, for example, I "head -n 100" the file, are any contiguous sets of rows belonging to a clientId guaranteed to be *all* of the rows with that clientId from the whole data extract?
(Assignee)

Comment 12

3 years ago
Yes, this is all of nightly (comment 0 lists all the specific criteria - the only additional one is "contains a clientId"). There are 93528 unique clientIds, comprising 674634 submissions, in the data set.

It contains all the raw submissions, sorted by clientId. Yep, all of the submissions for a given client will be contiguous in the file.
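Because the file is sorted by clientId, all of a client's submissions can be collected with a single contiguous grouping pass. A minimal sketch, assuming the three-field layout; `iter_clients` is hypothetical, and `lines` could be e.g. `gzip.open(path, "rt")`.

```python
import itertools
import json

def iter_clients(lines):
    # Rows are sorted by client_id, so every row for a client is
    # adjacent and itertools.groupby sees each client exactly once.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for client_id, group in itertools.groupby(rows, key=lambda r: r[0]):
        # fields[-1] is the json_payload column.
        yield client_id, [json.loads(fields[-1]) for fields in group]

# Hypothetical sample rows, already sorted by client id.
sample = [
    'a\t{"m": 1}\t{"n": 1}\n',
    'a\t{"m": 2}\t{"n": 2}\n',
    'b\t{"m": 3}\t{"n": 3}\n',
]
grouped = dict(iter_clients(sample))
```

Note that `itertools.groupby` only groups adjacent items, which is exactly why the sorted-by-clientId guarantee matters here.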
(Assignee)

Updated

3 years ago
See Also: → bug 1142165
Comment 13

3 years ago
Hello,
Could I have a complete enumeration of Nightly v4 data, ideally organized by client id, that is, one JSON per client id, each JSON containing a list of session submissions?

All I care about is the first 100 submissions per day per client (to counter excessive submissions).

Can this be stored on the peach cluster without being gzipped?
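The "first 100 submissions per day per client" cap requested above could be applied with a simple counting filter. This is a hypothetical sketch, assuming submissions arrive as (client_id, day, payload) tuples; `cap_daily` is not part of any existing tooling.

```python
from collections import defaultdict

def cap_daily(submissions, limit=100):
    # Keep at most `limit` submissions per (client_id, day) pair,
    # preserving input order so the *first* submissions win.
    counts = defaultdict(int)
    kept = []
    for client_id, day, payload in submissions:
        if counts[(client_id, day)] < limit:
            counts[(client_id, day)] += 1
            kept.append((client_id, day, payload))
    return kept

# Hypothetical example: five submissions from one client on one day,
# capped at three.
subs = [("c1", "2015-03-09", i) for i in range(5)]
kept = cap_daily(subs, limit=3)
```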
(Assignee)

Comment 14

3 years ago
Please file a new bug for new data requests.