Closed Bug 1152107 Opened 9 years ago Closed 8 years ago

need merged FHR v2 and v4 data for continuity of longitudinal analysis across v2/v4 transition

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bcolloran, Assigned: spenrose)

References

Details

(Whiteboard: [unifiedTelemetry][40b9][loasis])

Attachments

(1 file)

As discussed with mreid over email, and in bug 1126958:

Metrics team believes that in order to keep important numbers (like MAU and longitudinal measures related to search) stable across the v2/v4 transition, we'll need a way of looking at the v2 and v4 packets from a clientId simultaneously -- i.e., we'll need to be able to get data shaped like:

{
  "clientId": "foo",
  "v2data": { ... },  // v2 data JSON
  "v4data": [ ... ]   // array of v4 subsessions
}

(the exact details about the most sensible way to structure this data are up for discussion)

In the nearer term, having data structured this way would also *greatly* facilitate the work in https://bugzilla.mozilla.org/show_bug.cgi?id=1134661
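
For concreteness, here is a minimal PySpark sketch of one way such per-client records could be assembled, assuming v2 payloads and v4 subsession pings are available as RDDs of (clientId, JSON string) pairs; all names and paths below are hypothetical, not the actual pipeline code:

    import json

    # v4 clients can have many subsession pings; gather them into a list per client
    v4_by_client = v4_pings.groupByKey().mapValues(list)

    # join on clientId and emit one JSON record per client, shaped as above
    merged = v2_records.join(v4_by_client).map(lambda kv: (
        kv[0],
        json.dumps({
            "clientId": kv[0],
            "v2data": json.loads(kv[1][0]),                # v2 data JSON
            "v4data": [json.loads(p) for p in kv[1][1]],   # array of v4 subsessions
        })))

    merged.saveAsSequenceFile("s3n://some-bucket/mergedDataPerClient/")  # placeholder path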
Depends on: 1126958
Priority: -- → P2
This seems to be on track, with Brendan and Roberto working to make both data sets available via Spark.  Brendan, is that right? Are you waiting on anything from our team?
Flags: needinfo?(bcolloran)
Hey guys, I now have something that gives me the kind of merged data I want, with the level of quality that I need... at least I hope so -- I haven't dug into it yet because it took me forever to get the script working, but I think it's the right thing.

Right now I have about 8k v2+v4 records I can look at (merged with 10% of nightly v2 data), which should be fine for the time being. But to hit the data quality targets we've talked about, I'll eventually need to look at at least 100k nightly records (which is basically all of nightly). At that point I'm going to need help (from Mark and Roberto, I'm guessing) scaling it up.

Here's the script: http://nbviewer.ipython.org/gist/bcolloran/ac508f1d141eacdf7098 (also attached to this bug).

Mark+Georg: please take a look to double check that this is sensible and that we're doing all of the obvious filtering steps needed to have a valid set of records to compare.

Mark+Roberto: please take a look and start thinking about whether there are Spark optimizations that will be needed to make this scale up to the 100k range. I ran this in a reasonable amount of time (an hour or so) on a 5-node cluster, but part of the reason it took so long to get this working was that I kept hitting memory errors and the like, so I don't know whether we can just run a 5-node cluster for 10x as long to match up with the full v2 data set, or whether more nodes or other optimizations are needed.

Thanks all.
Flags: needinfo?(rvitillo)
Flags: needinfo?(mreid)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(bcolloran)
Priority: P2 → P1
Oops, note that the cells below the heading "Did it work? Try to load a single file" are not correct. That should have been:

"Did it work? Try to load a few files"

In [37]:
pathToMergeTest = "s3n://"+outBucketName+pathToOutput+"part-0000*"
print pathToMergeTest
mergeTest = sc.sequenceFile(pathToMergeTest)

s3n://net-mozaws-prod-us-west-2-pipeline-analysis/bcolloran/mergedDataPerClient/nightly/2015-06-10/8428clients/part-0000*

In [38]:
mergeTest.count()

Out[38]:
431

In [39]:
mt = mergeTest.first()

In [40]:
len(mt)

Out[40]:
2

In [43]:
print len(mt), mt[0]
# print mt[1].keys()
print
print json.loads(mt[1]).keys()
print len(json.loads(mt[1])['v4'])
print json.loads(mt[1])['v4'][0].keys()
print
print json.loads(mt[1])['v2'].keys()

2 7a2de1cd-9bc0-40c0-9c94-3c5f2a11cad3

[u'v2', u'v4', u'clientId']
29
[u'payload/info', u'payload/simpleMeasurements/main', u'payload/simpleMeasurements/totalTime', u'payload/simpleMeasurements/sessionRestored', u'type', u'payload/keyedHistograms/SEARCH_DEFAULT_ENGINE', u'payload/simpleMeasurements/firstPaint', u'clientId', u'environment', u'application', u'payload/histograms/PLACES_PAGES_COUNT', u'version', u'payload/simpleMeasurements/activeTicks', u'meta/appUpdateChannel', u'creationDate', u'payload/histograms/PLACES_BOOKMARKS_COUNT', u'id', u'payload/keyedHistograms/SEARCH_COUNTS']

[u'thisPingDate', u'geckoAppInfo', u'lastPingDate', u'geoCountry', u'clientID', u'version', u'clientIDVersion', u'data', u'BAGHEERA_TS']
Also, apologies for the big JSON blobs; those were minimized in my local output, so I forgot to remove them.
(In reply to brendan c from comment #2)
> Mark+Roberto: please take a look and start thinking about whether there are
> Spark optimizations that will be needed to make this scale up to the 100k
> range. I ran this in a reasonable amount of time (an hour or so) on a 5-node
> cluster, but part of the reason it took so long to get this working was that
> I kept hitting memory errors and the like, so I don't know whether we can
> just run a 5-node cluster for 10x as long to match up with the full v2 data
> set, or whether more nodes or other optimizations are needed.

I gave it a quick glance and it seems you are caching a big dataset in memory. I would like an idea of how much memory it's consuming. If the dataset doesn't fit in memory it gets spilled to disk, making the caching layer rather useless and slowing down the job. Could you give me some statistics from the Spark web dashboard?
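
For reference, a minimal sketch of a spill-friendly alternative to a plain cache(), assuming the cached RDD is named mergedData (a hypothetical name):

    from pyspark import StorageLevel

    # cache() is shorthand for MEMORY_ONLY; MEMORY_AND_DISK keeps the
    # partitions that fit in memory and writes the rest to local disk
    mergedData.persist(StorageLevel.MEMORY_AND_DISK)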

I have some ideas on how to speed up larger jobs that use only scalar values using DataFrames, but this will require some investigative work that I'm unlikely to be able to do this quarter.

My question is: do you really need all the data? Can't we sample on the client-id for now, work with more manageable datasets, and once we iron out the analysis, run it on a larger cluster only once?
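
A minimal sketch of that client-id sampling, assuming merged is an RDD of (clientId, record) pairs; the helper is hypothetical:

    import hashlib

    def in_sample(client_id, percent=10):
        # stable sample: hash the clientId into 100 buckets and keep clients
        # that land in the first `percent` of them
        return int(hashlib.sha1(client_id).hexdigest(), 16) % 100 < percent

    dev_sample = merged.filter(lambda kv: in_sample(kv[0]))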

We could use a larger cluster just to merge a big chunk of v2 and v4 data and save that dataset somewhere, then iteratively update it with a daily batch job, and finally use that intermediate dataset for your analysis.
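
A hedged sketch of that batch flow, assuming per-client records are JSON strings shaped as in comment 0 and that the S3 paths and combiner are placeholders:

    import json

    def merge_client_records(a, b):
        # hypothetical combiner: concatenate the v4 subsession lists of two
        # records for the same clientId
        rec_a, rec_b = json.loads(a), json.loads(b)
        rec_a["v4data"] = rec_a.get("v4data", []) + rec_b.get("v4data", [])
        return json.dumps(rec_a)

    existing = sc.sequenceFile("s3n://some-bucket/merged/latest/")
    new_day = sc.sequenceFile("s3n://some-bucket/raw/2015-06-11/")
    updated = existing.union(new_day).reduceByKey(merge_client_records)
    updated.saveAsSequenceFile("s3n://some-bucket/merged/2015-06-11/")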
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #6)

> My question is, do you really need all the data?

I don't need it all *yet*-- for now the 8k sample should be sufficient to test jobs, but we will need it soon, which is why I'm looking to you and Mark for the best ways to make that happen.

> Can't we sample on the
> client-id for now, work with more manageable datasets, and once we iron out
> the analysis, run it on a larger cluster only once?

Absolutely, if that is a better approach we can do it that way. Just to make sure I understand: you are suggesting that instead of saving the data to s3 as an intermediate step, we should just do the data join as part of the data quality job and run the whole thing at once on one big cluster? That works for me if you think it'll be more efficient overall.

Thanks as always for the input!
(In reply to brendan c from comment #7)
> > Can't we sample on the
> > client-id for now, work with more manageable datasets, and once we iron out
> > the analysis, run it on a larger cluster only once?
> 
> Absolutely, if that is a better approach we can do it that way. Just to make
> sure I understand: you are suggesting that instead of saving the data to s3
> as an intermediate step, we should just do the data join as part of the data
> quality job and run the whole thing at once on one big cluster? That works
> for me if you think it'll be more efficient overall.

I was actually suggesting using an even smaller sample and doing everything at once, and only moving to a larger sample once you are happy with it. In other words, what I meant is that development and debugging should be done on a single-instance cluster whenever possible.

It looks like you have already passed that stage though, and are saving your intermediate dataset to S3, which is what I was suggesting in the last paragraph of comment 6. How often is that intermediate dataset generated?
> It looks like you have already passed that stage though, and are saving your
> intermediate dataset to S3, which is what I was suggesting in the last
> paragraph of comment 6. How often is that intermediate dataset generated?

It's not scripted, so I'll only run it again if I need new data (like after Georg informs me of a bug fix or something). The data that is saved so far should be enough to prototype.

Agreed that it makes sense to work on smaller samples for prototyping and then scale up to the full data set only when needed -- that is what I was intending to suggest above; sorry if that was unclear.
Whiteboard: [unifiedTelemetry][b5]
(In reply to brendan c from comment #2)
> Mark+Georg: please take a look to double check that this is sensible and
> that we're doing all of the obvious filtering steps needed to have a valid
> set of records to compare.
This looks sensible to me.

> Mark+Roberto: please take a look and start thinking about whether there are
> spark optimizations that will be needed to make this scale up to the 100k
> range.
I don't have enough experience with spark to offer any useful suggestions here, but I can potentially help write custom code for this if need be.
Flags: needinfo?(mreid)
Whiteboard: [unifiedTelemetry][b5] → [unifiedTelemetry][b5][dataValidation]
Note, I once again ran into a problem where having pre-existing per-clientId data would be a major boon (see https://bugzilla.mozilla.org/show_bug.cgi?id=1171265#c20 )

In this case, it's not exactly the v2+v4 data I need, but rather all of the pings ever submitted by *clients* that have submitted pings on nightly (as opposed to all of the *pings* submitted on nightly). There are corner cases involving channel switching that are implicated in a data anomaly, but the switching cannot be detected from just the pings submitted on nightly, and the existing per-client data has too small a sample of nightly records.

100% of FHR v2 is saved as sequence files on s3 at s3n://mozillametricsfhrsamples/nightly/ . Since nightly data remains the most relevant for the v2/v4 data quality work, running all of the clientIds in that data through the per-clientId API would be far more valuable than the 1% sample of all records on all channels that is available through the get_clients_history() function.


So I guess: consider this one more plea for a consolidated data set as described in comment 0, but containing 100% of the data for the clientIds in the FHR v2 nightly extract at s3n://mozillametricsfhrsamples/nightly/ . And to that plea I will add another: have the pings for each clientId de-duped (as I do in cell 11 of http://nbviewer.ipython.org/gist/bcolloran/ac508f1d141eacdf7098 ), so that I don't have to repeat that cleaning step over and over. This data would ideally be updated weekly, after Saptarshi pushes the new nightly data to s3.
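
For reference, a minimal sketch of that de-dup step, assuming per-client records are dicts whose 'v4' entry is the list of that client's subsession pings keyed by document 'id'; cell 11 of the notebook linked above is the authoritative version:

    def dedupe_pings(pings):
        # keep the first ping seen for each document id
        seen = {}
        for p in pings:
            seen.setdefault(p["id"], p)
        return list(seen.values())

    # merged is assumed to be an RDD of (clientId, record_dict) pairs
    deduped = merged.mapValues(lambda rec: dict(rec, v4=dedupe_pings(rec["v4"])))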

I have spent so much time on the data wrangling part of this project rather than on the actual analysis and QA, and quite a substantial part of the cluster time I use goes to waiting for these consolidation and cleaning steps to run (which also means I have to wait for those cleaning/merging steps instead of firing up a cluster and just loading the data I need).


And fwiw, since I am focused on just the parts of the new data that can be compared to FHR v2, I don't need all of the data in each ping, just the following paths (a small filtering sketch follows the list):

['clientId',
    'meta',
    'id',
    'environment',
    'application',
    'version',
    'creationDate',
    'type',
    'payload/info',
    'payload/simpleMeasurements/activeTicks',
    'payload/simpleMeasurements/totalTime',
    'payload/simpleMeasurements/main',
    'payload/simpleMeasurements/firstPaint',
    'payload/simpleMeasurements/sessionRestored',
    'payload/histograms/PLACES_PAGES_COUNT',
    'payload/histograms/PLACES_BOOKMARKS_COUNT',
    'payload/keyedHistograms/SEARCH_COUNTS',
    'payload/keyedHistograms/SEARCH_DEFAULT_ENGINE']
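
A sketch of the filtering this implies, assuming pings are dicts keyed by the flattened "a/b/c" paths shown in the In[43] output above:

    V4_WHITELIST = set([
        'clientId', 'meta', 'id', 'environment', 'application', 'version',
        'creationDate', 'type', 'payload/info',
        'payload/simpleMeasurements/activeTicks',
        'payload/simpleMeasurements/totalTime',
        'payload/simpleMeasurements/main',
        'payload/simpleMeasurements/firstPaint',
        'payload/simpleMeasurements/sessionRestored',
        'payload/histograms/PLACES_PAGES_COUNT',
        'payload/histograms/PLACES_BOOKMARKS_COUNT',
        'payload/keyedHistograms/SEARCH_COUNTS',
        'payload/keyedHistograms/SEARCH_DEFAULT_ENGINE',
    ])

    def filter_ping(ping):
        # drop everything except the whitelisted paths
        return dict((k, v) for k, v in ping.items() if k in V4_WHITELIST)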
Georg and Alessio: please verify that the following paths from v4 pings contain all of the data needed to compare FHR v2 and v4. This list is derived from the helpful Google doc you put together, but I want to confirm with you here.

['clientId',
    'meta',
    'id',
    'environment',
    'application',
    'version',
    'creationDate',
    'type',
    'payload/info',
    'payload/simpleMeasurements/activeTicks',
    'payload/simpleMeasurements/totalTime',
    'payload/simpleMeasurements/main',
    'payload/simpleMeasurements/firstPaint',
    'payload/simpleMeasurements/sessionRestored',
    'payload/histograms/PLACES_PAGES_COUNT',
    'payload/histograms/PLACES_BOOKMARKS_COUNT',
    'payload/keyedHistograms/SEARCH_COUNTS',
    'payload/keyedHistograms/SEARCH_DEFAULT_ENGINE']
Flags: needinfo?(alessio.placitelli)
Re 'payload/keyedHistograms/SEARCH_DEFAULT_ENGINE': after bug 1138503 this is gone; it is now environment.settings.defaultSearchEngine.
Otherwise this looks fine to me.

I quickly ran through the Google doc and made some updates to reflect the current state.
Flags: needinfo?(gfritzsche)
Looks good!
Flags: needinfo?(alessio.placitelli)
Great, thanks to you both.

So since the default search engine is now included in the environment (a sketch of reading it from there follows the list), the correct whitelist for v2/v4 comparison should be:

['clientId',
    'meta',
    'id',
    'environment',
    'application',
    'version',
    'creationDate',
    'type',
    'payload/info',
    'payload/simpleMeasurements/activeTicks',
    'payload/simpleMeasurements/totalTime',
    'payload/simpleMeasurements/main',
    'payload/simpleMeasurements/firstPaint',
    'payload/simpleMeasurements/sessionRestored',
    'payload/histograms/PLACES_PAGES_COUNT',
    'payload/histograms/PLACES_BOOKMARKS_COUNT',
    'payload/keyedHistograms/SEARCH_COUNTS']
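
And a hedged sketch of pulling the default engine from the environment instead; the exact nesting is taken from Georg's note above and not verified against real pings:

    def default_search_engine(ping):
        # after bug 1138503, the default engine is in the environment block
        env = ping.get('environment', {})
        return env.get('settings', {}).get('defaultSearchEngine')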
katie and sam: while we're waiting on automated merged samples of more of nightly, I have updated the small merged samples that I created a while back. Available on s3 at:
s3n://net-mozaws-prod-us-west-2-pipeline-analysis/bcolloran/mergedDataPerClient/nightly/2015-07-09/8937clients/part-*
 
see also:
http://nbviewer.ipython.org/gist/bcolloran/757e35f7990d62a49f83
Blocks: 1182684
Whiteboard: [unifiedTelemetry][b5][dataValidation] → [unifiedTelemetry][40b9][data-validation]
Sam, can you please take the data the Mark created that is described here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1171265#c24
and create a merged longitudinal data set of the format described above?

I think you should be able to adapt this notebook--
http://nbviewer.ipython.org/gist/bcolloran/757e35f7990d62a49f83
--to pull the v2 data and do the consolidation.

Please let me know if you need any details. Thanks!
Assignee: nobody → spenrose
Flags: needinfo?(spenrose)
Iteration: --- → 42.3 - Aug 10
My work on this can be seen in the "Notes" section of our "FHR v4 Acceptance Criteria" doc:

  https://docs.google.com/document/d/1KpcQy_QEfizd6Q4MFvOt5rMCL32Ef49_2T5yXNqWIWw/edit#

I was able to create groups of 35K clients / 1 week of pings before choking Spark. Unfortunately those groups are now out of date; Brendan's latest dataset is the best we have right now.
Flags: needinfo?(spenrose)
(moving this request for data here b/c it's orthogonal to the main point of bug 1171265; but see also https://bugzilla.mozilla.org/show_bug.cgi?id=1171265#c24 https://bugzilla.mozilla.org/show_bug.cgi?id=1171265#c31 )

Hey Mark, could you update the per-client data extracts from bug 1171265 comment 24 again? Thanks!
Flags: needinfo?(mreid)
Iteration: 42.3 - Aug 10 → 43.1 - Aug 24
Data has been updated to 20150812, available here:
s3://net-mozaws-prod-us-west-2-pipeline-analysis/mreid/bug1171265/merged_by_day20150814/

Please let me know if you have any permission trouble.
Flags: needinfo?(mreid)
no troubles, thanks mark
Whiteboard: [unifiedTelemetry][40b9][data-validation] → [unifiedTelemetry][40b9]
Moving to P2 to reflect ETA.
Priority: P1 → P2
Sam and I discussed; more info to come.
Whiteboard: [unifiedTelemetry][40b9] → [unifiedTelemetry][40b9][loasis]
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard