Included data:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)

Format: compressed plain text (gzipped unless otherwise requested) of the form
client_id<\t>json_metadata<\t>json_payload<\n>

The json_metadata field will contain the other bits and pieces of the Heka message that are not stored in the payload (geoCountry, server timestamp, etc).

Brendan, will that suit your needs?
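For anyone reading the extract locally, here is a minimal sketch of a reader for the three-field layout described above. It assumes the producer writes each JSON document on a single line with tabs and newlines escaped (as standard JSON encoders do); the function names are hypothetical and not part of any existing tool.

```python
import gzip
import json


def parse_record(line):
    """Split one 'client_id<TAB>json_metadata<TAB>json_payload' line."""
    # maxsplit=2 keeps everything after the second tab in the payload field.
    client_id, meta_json, payload_json = line.rstrip("\n").split("\t", 2)
    return client_id, json.loads(meta_json), json.loads(payload_json)


def read_records(path):
    """Yield (client_id, metadata, payload) tuples from a gzipped extract."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines defensively
                yield parse_record(line)
```

Streaming through a generator avoids decompressing the whole file into memory, which matters if the extract turns out to be large.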
I think I need it as just client_id<\t>json_payload<\n> for it to play nice with mrjob, which will be expecting rows of (text,text) key/val pairs. Is it possible to stuff the metadata into extra fields in the JSON?

I'm actually not sure how the loading will work... tmary, mreid will be sending me a big text dump of the new FHR v4 data that I'll need to load into peach HDFS to compare to existing FHR v2 data. Can you help me with this? I assume it's just an 'hdfs dfs' command? Can you provide an example? And are there any requirements on the text file that will make the loading easier, or things that we'll need to avoid to keep it from failing? Thanks!
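As a rough illustration of the kind of 'hdfs dfs' call asked about here, the sketch below builds the two standard Hadoop shell commands (`-mkdir -p` and `-put`) and either prints or runs them. The local file name and the HDFS directory are placeholders, not actual paths on peach, and the helper names are hypothetical.

```python
import subprocess


def hdfs_put_commands(local_path, hdfs_dir):
    """Build the 'hdfs dfs' invocations that copy a local file into HDFS."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],      # create target dir if missing
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],  # copy the local file in
    ]


def upload(local_path, hdfs_dir, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for command in hdfs_put_commands(local_path, hdfs_dir):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    upload("fhr_v4_nightly.txt", "/user/example/fhr_v4")
```

One thing to keep in mind on the file side: Hadoop can read gzipped text input, but a single .gz file is not splittable, so one mapper would process the whole file. Uploading uncompressed text, or several smaller gzipped parts, avoids that bottleneck.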
I really don't think we need to do this on HDFS, and it will be unnecessarily complex to do so. Since this is nightly-only, the total data size will be small enough to do on a single local machine (or on peach-gw, but not in the cluster). We can pull nightly data from FHRv2 to compare against pretty easily.
Sure, that'll work for samples up to several thousand. Sounds fine to me. Benjamin, are you on top of pulling the v2 records that match the clientIds in the sample? In that case, for both v4 and v2 we could just send zipped text files back and forth.
It seems like the size of the data is small enough to use scp/e-mail/etc. Please let me know if there is a need to copy large datasets from S3 -> HDFS.
(In reply to brendan c from comment #1)
> i think i need it as just
> client_id<\t>json_payload<\n>
> for it play nice with mrjob, which will be expecting rows of (text,text)
> key/val pairs. Is it possible to stuff the metadata into extra fields in the
> JSON?

How about
client_id<\t>[meta,payload]<\n>
So each "value" is an array of length 2?
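To make the proposed two-field layout concrete, here is a small sketch that writes and reads a line whose value is a JSON array of length 2. The helper names are hypothetical; the only assumption is that metadata and payload are each JSON-serializable objects.

```python
import json


def to_pair_line(client_id, metadata, payload):
    """Serialize one submission as 'client_id<TAB>[meta,payload]<NEWLINE>'."""
    # Compact separators keep the line short; json.dumps escapes any tabs or
    # newlines inside strings, so the value stays on a single line.
    value = json.dumps([metadata, payload], separators=(",", ":"))
    return f"{client_id}\t{value}\n"


def from_pair_line(line):
    """Inverse of to_pair_line: return (client_id, metadata, payload)."""
    client_id, value = line.rstrip("\n").split("\t", 1)
    metadata, payload = json.loads(value)
    return client_id, metadata, payload
```

Each row is then a plain (text, text) key/value pair, with the metadata carried inside the value rather than in a third tab-separated field.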
If we're going to be processing small samples locally, then the exact format is less critical b/c it doesn't have to play nice with MapReduce. So client_id<\t>[meta,payload]<\n> is fine, but if it's easier to produce client_id<\t>json_metadata<\t>json_payload<\n> then that should be fine too. Thanks, Mark!
It's relatively easy to produce the output in whatever format, so might as well go with something MR-friendly. A snapshot from yesterday is uploading to peach-gw now. What is the easiest way to make that available to you (and others that might be interested)?
The data file is available on peach-gw at
/home/mreid/telemetry_main_snapshot20150309.txt.gz

I believe it should be readable by other peach-gw users; please let me know if you can't access it. Feel free to re-open this bug if the data does not suit your needs.
Wow, am I reading this right? 13G compressed? How many clientIds are here? Is it all of nightly?
Hey Mark, can you supply a few more details about the extract? Is this just all the raw incoming payloads? It appears not to be grouped by clientId or anything else ("session stitching" or even some cruder form of linking by clientId are still some distance off I take it?).
Another question: are the rows sorted by clientId? If, for example, I "head -n 100" the file, are any contiguous sets of rows belonging to a clientId guaranteed to be *all* of the rows with that clientId from the whole data extract?
Yes, this is all of nightly (comment 0 lists all the specific criteria - the only additional one is "contains a clientId"). There are 93528 unique clientIds, comprising 674634 submissions, in the data set. It contains all the raw submissions, sorted by clientId. Yep, all of the submissions for a given client will be contiguous in the file.
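Because all rows for a client are contiguous, they can be grouped in a single streaming pass without holding the whole file in memory. The sketch below uses `itertools.groupby` for that; it assumes every non-blank line contains at least one tab, and the function names are hypothetical.

```python
from itertools import groupby


def group_by_client(lines):
    """Yield (client_id, [rest_of_line, ...]) for each contiguous run of rows."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent rows, so this relies on the file being
    # sorted (or at least clustered) by client id.
    for client_id, rows in groupby(keyed, key=lambda parts: parts[0]):
        yield client_id, [parts[1] for parts in rows]


def summarize(lines):
    """Return (number_of_clients, number_of_submissions)."""
    clients = submissions = 0
    for _, rows in group_by_client(lines):
        clients += 1
        submissions += len(rows)
    return clients, submissions
```

If the contiguity guarantee holds, running `summarize` over the decompressed file should give one group per clientId; if it did not hold, the client count would come out too high, which makes this a cheap sanity check against the figures quoted above.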
Hello,

Could I have a complete enumeration of Nightly v4 data, ideally organized by client id? That is, one JSON object per client id, each containing a list of session submissions. All I care about is the first 100 submissions per day per client (to counter excessive submissions).

Can this be stored on the peach cluster without being gzipped?
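For reference, the requested shape could be sketched roughly as follows. This is only an illustration: the thread does not say which field holds the submission date, so the caller supplies a `day_of` function, and the output keys `clientId` and `submissions` are hypothetical names. "First" here means first in input order, which matches time order only if the rows are already sorted by time within each client.

```python
import json
from collections import defaultdict


def cap_per_day(submissions, day_of, limit=100):
    """Keep at most `limit` submissions per day, preserving input order."""
    seen = defaultdict(int)  # day -> number of submissions kept so far
    kept = []
    for submission in submissions:
        day = day_of(submission)
        if seen[day] < limit:
            seen[day] += 1
            kept.append(submission)
    return kept


def client_document(client_id, submissions, day_of, limit=100):
    """Serialize one client as a single JSON object holding its submissions."""
    return json.dumps({
        "clientId": client_id,
        "submissions": cap_per_day(submissions, day_of, limit),
    })
```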
Please file a new bug for new data requests.