1140094 - Export a sample data set of Telemetry V4 data organized by client id.

Assignee

Description

•

10 years ago

Included data:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)

Format: compressed plain text (gzipped unless otherwise requested) of the form

client_id<\t>json_metadata<\t>json_payload<\n>

The json_metadata field will contain the other bits and pieces of the Heka message that are not stored in the payload (geoCountry, server timestamp, etc).

Brendan, will that suit your needs?
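For readers who want to consume an export in this three-field layout, here is a minimal Python sketch. It assumes the file is gzipped UTF-8 text with one record per line and that the metadata field contains no raw tab characters; the function name `read_rows` is made up for illustration and is not part of any pipeline tooling.

```python
import gzip
import json


def read_rows(path):
    """Yield (client_id, metadata, payload) tuples from a gzipped export.

    Assumes one record per line in the form
    client_id<TAB>json_metadata<TAB>json_payload.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Split on the first two tabs only. Tabs inside JSON strings are
            # escaped as \t, so this is safe as long as the metadata field
            # was serialized without raw tab whitespace (an assumption).
            client_id, meta_json, payload_json = line.rstrip("\n").split("\t", 2)
            yield client_id, json.loads(meta_json), json.loads(payload_json)
```

Because the function is a generator, it reads one line at a time and does not need to hold the whole export in memory.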

Mark Reid [:mreid]

Assignee

Updated

•

10 years ago

Flags: needinfo?(bcolloran)

brendan c

Comment 1

•

10 years ago

I think I need it as just

client_id<\t>json_payload<\n>

for it to play nice with mrjob, which will be expecting rows of (text,text) key/val pairs. Is it possible to stuff the metadata into extra fields in the JSON?

I'm actually not sure how the loading will work... tmary, mreid will be sending me a big text dump of data for the new FHR v4 data that I'll need to load into peach HDFS to compare to existing FHR v2 data. Can you help me with this? I assume it's just an 'hdfs dfs' command? Can you provide an example? And are there any requirements on the text file that will make the loading easier, or things that we'll need to avoid to keep it from failing? Thanks!

Flags: needinfo?(bcolloran) → needinfo?(tmeyarivan)

Benjamin Smedberg

Comment 2

•

10 years ago

I really don't think we need to do this on HDFS, and it will be unnecessarily complex to do so. Since this is nightly-only, the total data size will be small enough to do on a single local machine (or on peach-gw, but not in the cluster). We can pull nightly data from FHRv2 to compare against pretty easily.

brendan c

Comment 3

•

10 years ago

Sure, that'll work for samples up to several thousand. Sounds fine to me. Benjamin, are you on top of pulling the v2 records that match the clientIds in the sample? In that case, for both v4 and v2 we could just send zipped text files back and forth.

T [:tmary] Meyarivan

Comment 4

•

10 years ago

It seems like the data is small enough to transfer via scp, e-mail, etc. Please let me know if there is a need to copy large datasets from S3 to HDFS.

Flags: needinfo?(tmeyarivan)

Katie Parlante

Updated

•

10 years ago

Assignee: nobody → mreid

Priority: -- → P1

Mark Reid [:mreid]

Assignee

Comment 5

•

10 years ago

(In reply to brendan c from comment #1)
> i think i need it as just
> client_id<\t>json_payload<\n>
> for it to play nice with mrjob, which will be expecting rows of (text,text)
> key/val pairs. Is it possible to stuff the metadata into extra fields in the
> JSON?

How about

client_id<\t>[meta,payload]<\n>

so that each "value" is an array of length 2?
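To make the proposed two-field layout concrete, here is a small Python sketch of writing and reading one such row. The helper names `to_mr_row` and `from_mr_row` are hypothetical, and the compact JSON separators are a choice made here, not something the thread specifies.

```python
import json


def to_mr_row(client_id, metadata, payload):
    """Serialize one record as client_id<TAB>[meta,payload]<NEWLINE>."""
    # json.dumps escapes tabs and newlines inside strings, so the value
    # stays on one line and the row contains exactly one raw tab.
    value = json.dumps([metadata, payload], separators=(",", ":"))
    return f"{client_id}\t{value}\n"


def from_mr_row(line):
    """Parse a row written by to_mr_row back into its three parts."""
    client_id, value = line.rstrip("\n").split("\t", 1)
    metadata, payload = json.loads(value)
    return client_id, metadata, payload
```

Each row is then a plain (key, value) text pair, which is the shape comment 1 asks for.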

brendan c

Comment 6

•

10 years ago

If we're going to be processing small samples locally, then the exact format is less critical b/c it doesn't have to play nice with map reduce. So client_id<\t>[meta,payload]<\n> is fine, but if it's easier to produce client_id<\t>json_metadata<\t>json_payload<\n> then that should be fine too. thanks Mark!

Mark Reid [:mreid]

Assignee

Comment 7

•

10 years ago

It's relatively easy to produce the output in whatever format, so might as well go with something MR-friendly. A snapshot from yesterday is uploading to peach-gw now. What is the easiest way to make that available to you (and others that might be interested)?

Mark Reid [:mreid]

Assignee

Comment 8

•

10 years ago

The data file is available on peach-gw at /home/mreid/telemetry_main_snapshot20150309.txt.gz

I believe it should be readable by other peach-gw users; please let me know if you can't access it. Feel free to re-open this bug if the data does not suit your needs.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

brendan c

Comment 9

•

10 years ago

Wow, am I reading this right? 13G compressed? How many clientIds are here? Is it all of nightly?

brendan c

Comment 10

•

10 years ago

Hey Mark, can you supply a few more details about the extract? Is this just all the raw incoming payloads? It appears not to be grouped by clientId or anything else ("session stitching" or even some cruder form of linking by clientId are still some distance off I take it?).

brendan c

Comment 11

•

10 years ago

Another question: are the rows sorted by clientId? If, for example, I "head -n 100" the file, is any contiguous set of rows belonging to a clientId guaranteed to be *all* of the rows with that clientId from the whole data extract?

Mark Reid [:mreid]

Assignee

Comment 12

•

10 years ago

Yes, this is all of nightly (comment 0 lists all the specific criteria - the only additional one is "contains a clientId"). There are 93528 unique clientIds, comprising 674634 submissions, in the data set. It contains all the raw submissions, sorted by clientId. Yep, all of the submissions for a given client will be contiguous in the file.
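Since all rows for a client are contiguous, the counts above can be checked in a single streaming pass. The following Python sketch assumes each line starts with the client id followed by a tab; `count_clients` is an illustrative name. It relies on the contiguity guarantee: if the rows were not grouped by clientId, the client count would come out too high.

```python
from itertools import groupby


def count_clients(lines):
    """Return (unique_clients, total_submissions) for client-sorted rows."""
    clients = 0
    submissions = 0
    # groupby only merges *adjacent* rows with equal keys, which is exactly
    # what a file sorted by client id provides.
    for _client_id, rows in groupby(lines, key=lambda l: l.split("\t", 1)[0]):
        clients += 1
        submissions += sum(1 for _ in rows)
    return clients, submissions
```

Passing an open text handle for the decompressed file as `lines` keeps memory use constant regardless of file size.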

Mark Reid [:mreid]

Assignee

Updated

•

10 years ago

Comment 13

•

10 years ago

Hello,

Could I have a complete enumeration of Nightly v4 data, ideally organized by client id: that is, one JSON document per client id, each containing a list of that client's session submissions? All I care about is the first 100 submissions per day per client (to counter excessive submissions).

Can this be stored on the peach cluster without being gzipped?
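As a rough illustration of the shape of this request, the Python sketch below caps submissions per client per day and emits one JSON document per client. Everything named here is an assumption: the function `cap_per_day`, the output keys `clientId` and `submissions`, and the caller-supplied `day_of` function, since the thread does not say which field carries the submission date. "First" means first in file order, which is chronological only if the rows are also sorted by time within each client.

```python
import json
from collections import defaultdict
from itertools import groupby


def cap_per_day(rows, day_of, limit=100):
    """Yield one JSON string per client, keeping at most `limit` rows per day.

    rows:   iterable of (client_id, metadata, payload), sorted by client_id.
    day_of: function (metadata, payload) -> hashable day key (hypothetical;
            the caller decides which field holds the date).
    """
    for client_id, group in groupby(rows, key=lambda r: r[0]):
        seen = defaultdict(int)  # day key -> submissions kept so far
        kept = []
        for _cid, metadata, payload in group:
            day = day_of(metadata, payload)
            if seen[day] < limit:
                seen[day] += 1
                kept.append(payload)
        yield json.dumps({"clientId": client_id, "submissions": kept})
```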

Mark Reid [:mreid]

Assignee

Comment 14

•

10 years ago

Please file a new bug for new data requests.

BMO Automation

Updated

•

7 years ago

Product: Cloud Services → Cloud Services Graveyard


Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

People

(Reporter: mreid, Assigned: mreid)

