Closed Bug 1262609 Opened 8 years ago Closed 8 years ago

Validate Games Hardware Survey Spark Notebook

Categories

(Cloud Services :: Metrics: Product Metrics, defect, P2)

defect
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: avaughn, Assigned: Dexter)

References

Details

Attachments

(3 files, 2 obsolete files)

Hello!

This is a request for a code review of Alessio's Spark notebook for the Games Hardware Survey. It can be found at:

https://github.com/almossawi/firefox-hardware-survey/pull/1


The original Games Hardware Survey bug is here for reference:

https://bugzilla.mozilla.org/show_bug.cgi?id=1241696

Thank You,

Anthony
Assignee: nobody → rweiss
Summary: Validate Games Hardware Survey Notebook → Validate Games Hardware Survey Spark Notebook
Thank you for filing this bug, Anthony! Some notes about the Spark notebook:

- It doesn't produce CSVs; rather, it produces weekly aggregated JSON blobs that are appended to the previous state file fetched from S3, which allows for iterative updates and eases backfilling (a rough sketch of this flow appears below).
- The format of the produced JSON file is like [1].
- We only aggregate data coming from the current and the previous major release version of FF.
- We are generating the aggregates by considering only the newest ping for each client.

[1] - https://metrics.mozilla.com/protected/firefox-hardware-survey/data/hwsurvey-FF43.json
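For readers who can't open the notebook yet, here is a minimal sketch of the flow described above. All field names, keys, and the file handling are assumptions for illustration; this is not code from the PR.

import json


def newest_ping_per_client(pings):
    """Keep only the most recent ping for each client (assumed fields)."""
    latest = {}
    for ping in pings:
        cid = ping["client_id"]
        if cid not in latest or ping["creation_date"] > latest[cid]["creation_date"]:
            latest[cid] = ping
    return list(latest.values())


def aggregate_week(pings, allowed_versions):
    """Turn one week of pings into {config_key: fraction_of_clients}."""
    counts = {}
    for ping in newest_ping_per_client(pings):
        if ping["version"] not in allowed_versions:  # current + previous major only
            continue
        key = "display_{}x{}".format(ping["screen_width"], ping["screen_height"])
        counts[key] = counts.get(key, 0) + 1
    total = float(sum(counts.values())) or 1.0
    return {k: v / total for k, v in counts.items()}


def append_week_to_state(state_path, week_label, weekly_blob):
    """Append this week's aggregate to the state file fetched from S3."""
    with open(state_path) as f:
        state = json.load(f)  # assumed to be a list of weekly entries
    state.append({"date": week_label, "data": weekly_blob})
    with open(state_path, "w") as f:
        json.dump(state, f)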
Hey guys, the goal is to get this data eventually published (and automatically updating on some TBD frequency) on the mozilla.games.org website. My understanding is that the goal of this bug is for us to validate that the data looks right before taking the next step, which is engaging with our web agency to start building the actual presentation. What is the easiest/right way to do that, if we can't do it via CSV?
Thanks for the added info, Alessio!

Andre, we have answers for your questions which will come after this bug is complete. This bug is specifically for the Telemetry folks to make sure this is all good to go before we work with Games to do final tweaks (like frequency of updates, format etc) and involve the web agency.

FYI JSON should be just as easy for the web agency to use as CSV. I believe Ali already made a harness to read it, which we can provide to the agency.
All sounds awesome... really looking forward to this, exciting stuff! Thanks for all the great work!
Points: --- → 2
Priority: -- → P1
I don't appear to have read access to that github pull request.  Was this contributed under the Mozilla project or as a personal repo?
Flags: needinfo?(aalmossawi)
rweiss, done. Happy to move the repo to the mozilla org post-review.
Flags: needinfo?(aalmossawi)
Now I need access to :Dexter, since it seems like the PR originates from him:

I am trying to read this notebook:
https://github.com/Dexterp37/firefox-hardware-survey/blob/spark_job/report/summarize_json.ipynb
Flags: needinfo?(alessio.placitelli)
(In reply to Rebecca Weiss from comment #7)
> Now I need access to :Dexter, since it seems like the PR originates from him:
> 
> I am trying to read this notebook:
> https://github.com/Dexterp37/firefox-hardware-survey/blob/spark_job/report/
> summarize_json.ipynb

Done! It should be working now.
Flags: needinfo?(alessio.placitelli)
Hi, everyone! Thanks for coming together on this one. Rebecca, are you good to go with no more blockers preventing you access to the repo?

The web agency is ready to go once the review is done and we've vetted final needs with the games group internally.

Thanks again.
Flags: needinfo?(rweiss)
Yup, I've reviewed the notebook; it seems right to me.  I would like to see the resulting json file, if that's possible.

:Dexter, could you attach a sample json for a given week (can be any week, doesn't matter) to this bug thread?

:avaughn, please be sure to include me with the web agency discussion for ensuring that we accurately describe how the data was collected and computed on the mozgames site.
Flags: needinfo?(rweiss) → needinfo?(alessio.placitelli)
FYI, we have changed scope with the web agency, and are now looking to get them to prioritize finishing the Open Web Games site, and not the games.mozilla.org site. So, at this point, we imagine that we will have someone internal (likely Harald) add a tile to the games.mozilla.org site that will link *somewhere* to show the live data. I assume it would make sense for us to connect to what Ali has mocked up? Short form, the "web agency discussion" is really an internal one, where we figure out how to publish the data regularly somewhere and display it using Ali's work. Make sense?
(In reply to Rebecca Weiss from comment #10)
> Yup, I've reviewed the notebook; it seems right to me.  I would like to see
> the resulting json file, if that's possible.
> 
> :Dexter, could you attach a sample json for a given week (can be any week,
> doesn't matter) to this bug thread?

Thanks for taking time to review this. Here's a link to the file (7z compressed json, only accessible to Mozilla): https://drive.google.com/a/mozilla.com/file/d/0B1RACVTBwZ94YTZhdW9fUVlPd1E/view?usp=sharing 

I uploaded it to gdrive, as I was unsure about the policy for this kind of aggregated data: can it be publicly shared/attached to the bug?
Flags: needinfo?(alessio.placitelli)
:Dexter, I actually think this should be added to the bug given that it will be powering a page that will also be open.  almossawi, will we have a link that allows for people to download the data from the page?

:bsmedberg, do you agree or disagree that we should include a copy of an example raw json that will power the public-facing Games Hardware Survey to this bug?
Flags: needinfo?(benjamin)
Flags: needinfo?(aalmossawi)
Note: we made a point of allowing anyone who runs tests on the Open Web Games site to download both a representative JSON blob sample and the actual data we collect. If possible, we would like to do the same for the hardware telemetry data. We defer to Benjamin on this.
I am trying to understand what this data represents. The file from comment 12 doesn't parse as JSON. It appears to be multiple lines of JSON separated by commas?

 And I don't have access to the scripts, perhaps because of some security setting in github. If we're at the point of making the aggregates public, can we make the github repo public as well?

Do things like "display_1920x956":0.0000005969 mean that 0.00006% of Firefox users have a 1920x956 window size? Or is that combined with other data in the same line somehow? It seems likely that this is a single user?

In general, we're ok to publish any aggregate data that doesn't identify particular users. But there's a lot of data here I'm not certain about: is it possible for gpu_* or os_* lines to contain identifying strings? One way to avoid that is to make sure that any bucket you're publishing has at least N users (I suggest N=100) and clump everyone else into "other". 

Should we be bucketing these a little bit? So for example if I look at just the first line of this data, we have:

display_1366x768         0.291692284
display_1920x1080        0.2040645903
display_1600x900         0.0924423595
display_1280x1024        0.0866290654
display_1024x768         0.0685334788
display_1440x900         0.0567424422
display_1680x1050        0.0487181982
display_1280x800         0.0484170577
display_1360x768         0.0222858859
display_1920x1200        0.0161792111
.... thousands more
For users with less-common configurations, can we bucket these to the nearest hundred for export? So in this case the "display_1920x956" would end up grouped in a larger "display_~1900x1000" bucket. And if we still don't have 100 users, then group the rest into "other".
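
A rough sketch of that rounding (a hypothetical helper, not code from the notebook): round each dimension to the nearest 100 pixels, then collapse anything still under the minimum count into "other".

def bucket_resolutions(counts, min_count=100):
    """counts maps "display_<w>x<h>" to user counts; names and shape are assumed."""
    rounded = {}
    for key, n in counts.items():
        width, height = key.replace("display_", "").split("x")
        bucket = "display_~{}x{}".format(int(round(int(width), -2)),
                                         int(round(int(height), -2)))
        rounded[bucket] = rounded.get(bucket, 0) + n
    out = {"display_other": 0}
    for bucket, n in rounded.items():
        if n >= min_count:
            out[bucket] = n
        else:
            out["display_other"] += n
    return out

With this, "display_1920x956" lands in "display_~1900x1000", matching the example above.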

This would also reduce concerns about junk data showing up in the long tail of OS names. For example:

>>> oses.sort(key=lambda (k, v): v, reverse=True)
>>> oses[:10]
[(u'os_Windows_NT-6.1', 0.5061198067),
 (u'os_Windows_NT-10.0', 0.1934596114),
 (u'os_Windows_NT-6.3', 0.1121137851),
 (u'os_Windows_NT-5.1', 0.1070555806),
 (u'os_Windows_NT-6.0', 0.0296806896),
 (u'os_Windows_NT-6.2', 0.0167110568),
 (u'os_Darwin-14.5.0', 0.0064561648),
 (u'os_Darwin-15.3.0', 0.0061302526),
 (u'os_Darwin-10.8.0', 0.0051531131),
 (u'os_Darwin-13.4.0', 0.003410438)]

This is fine.

>>> oses[-10:]
[(u'os_Linux-4.2.0-1-586', 2.985e-07),
 (u'os_Linux-3.0.0-24-generic', 2.985e-07),
 (u'os_Linux-3.5.0-31-generic', 2.985e-07),
 (u'os_Linux-3.4.110-2vl6pae', 2.985e-07),
 (u'os_Linux-3.11.10-170.g1e76e80-desktop', 2.985e-07),
 (u'os_Linux-3.11.0-19-generic', 2.985e-07),
 (u'os_Linux-3.0.66', 2.985e-07),
 (u'os_Linux-2.6.43.8-1.fc15.x86_64', 2.985e-07),
 (u'os_Linux-4.3.3preempt-r2', 2.985e-07),
 (u'os_Linux-2.6.32-431.29.2.el6.i686', 2.985e-07)]

I believe this means there is only one active user in the world on each of these configs. With this particular data I don't think that's a privacy risk, but since we don't know what data might show up I'd prefer for us to be careful.
Flags: needinfo?(benjamin)
bsmedberg, to your last point: yes, pruning or consolidating many of the fields in the json file is essential, since fields like display resolutions and OSs contribute a lot to its size.

For clarity, the demo that uses Alessio's json file is here (behind ldap): 
https://metrics.mozilla.com/protected/firefox-hardware-survey

The choice of fields for each chart is based solely on frequency, which may or may not make product sense.

Also for clarity, and to alleviate any potential issues with access, the mockup is now in this repo: https://github.com/mozilla/firefox-hardware-survey
Flags: needinfo?(aalmossawi)
Agree with :bsmedberg, and I will go further and suggest that it is better to simply roll up every tiny group at the 1% level, meaning any group with a <1% share should just be categorized as "other".

Given that this hardware survey is primarily designed to give insight into the state of hardware among Firefox users for the purpose of helping game developers identify market segments, I posit that any configuration with a <1% share is unlikely to be of much interest anyway.
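For concreteness, the 1% roll-up could look like this minimal sketch (the input shape, a dict of label to fraction for one week, is an assumption):

def collapse_small_groups(fractions, threshold=0.01, other_key="other"):
    """Fold every group under the threshold into a single "other" bucket."""
    collapsed = {other_key: 0.0}
    for label, share in fractions.items():
        if share >= threshold:
            collapsed[label] = share
        else:
            collapsed[other_key] += share
    return collapsed

For example, {"display_1366x768": 0.29, "display_1920x956": 6e-07} keeps the first key and adds the second fraction to "other".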
Put differently, Alessio has all the fields there so we can quickly swap fields in the mockup in response to requests from stakeholders. The final json would presumably only have the fields that we’re displaying in the charts.
Well, we should not make the unfiltered data public without reviewing it "by hand". But I expect that shoving the <1% buckets into "other" is a trivial change to the notebook. I'd still like the notebook to be public.
Thanks for the feedback so far! I'll work on fixing the mentioned issues (JSON validity, per-resolution grouping, <1% group collapsing).
Assigned to Dexter and prioritized to allow for next week's scrub.
Assignee: rweiss → alessio.placitelli
Priority: P1 → --
The PR at [1] now contains the updated notebook with the following changes:

- We're now using the longitudinal dataset, which makes everything MUCH faster (from 10+ hrs down to 15 mins on a 10-node cluster); a rough sketch of pulling that dataset follows below.
- We're collapsing <1% groups to the "other" bucket (OS and Resolutions are collapsed in a slightly different way, as per comment 16).
- We're producing a valid JSON output.

The output can be found at [2].

@Benjamin, do you think it should be ok now to share the JSON file on the bug? Is there anything else privacy-sensitive that stands out to you?

@Rebecca, any suggestion on how to deal with partial data from clients? For example, we have clients with broken screen resolutions but correct CPU and GPU information. Should we drop just the broken fields and use the partial data we have, or leave out the whole client?

[1] - https://github.com/mozilla/firefox-hardware-survey/pull/1
[2] - Data for April 2016: https://drive.google.com/a/mozilla.com/file/d/0B1RACVTBwZ94MFdoeEJlWm9qZTQ/view?usp=sharing
Flags: needinfo?(rweiss)
Flags: needinfo?(benjamin)
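For reference, pulling the longitudinal dataset from a Spark notebook usually looks roughly like the sketch below; the column selection is an assumption for illustration and may differ from what the actual notebook does.

# sqlContext is provided by the notebook's Spark environment; the
# "longitudinal" table name is how the dataset is typically exposed there.
# The columns below are illustrative only.
frame = sqlContext.sql("""
    SELECT client_id, system, system_os, system_gfx, build
    FROM longitudinal
""")

# One row per client, with the per-client history already grouped,
# which is what makes this so much faster than scanning raw pings.
subset = frame.rdd.map(lambda row: (row.client_id, row))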
Yes, this is fine to share.
Flags: needinfo?(benjamin)
Attached file Sample HW survey JSON (obsolete) —
Status: NEW → ASSIGNED
Priority: -- → P2
This is mostly for public facing dashboard purposes, so I would try to keep as many client records in as possible.  If we have a broken record, use the partial data.
Flags: needinfo?(rweiss)
(In reply to Rebecca Weiss from comment #26)
> This is mostly for public facing dashboard purposes, so I would try to keep
> as many client records in as possible.  If we have a broken record, use the
> partial data.

I had a nice conversation with Rebecca over IRC, and we agreed to drop broken records and discard the partial data. Otherwise, we would end up with the issue of having different subgroups with distinct totals (e.g. the total count of cpu records being different from the total count of gfx records).

We'd rather throw away the entire record and mention on the public dashboard that we throw away records with "corrupted fields" (to make sure people know).
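For illustration, the agreed-upon policy amounts to something like the following sketch; the field names are made up, not the notebook's actual ones.

REQUIRED_FIELDS = ("resolution", "cpu_cores", "gpu_vendor", "gpu_device")


def is_complete(record):
    """Keep a client's record only if every required field is present."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def split_records(records):
    """Return (valid_records, discarded_count) for one week of data."""
    valid = [r for r in records if is_complete(r)]
    return valid, len(records) - len(valid)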
Attached file Sample HW survey JSON (obsolete) —
Updated the PR as per comment 27: this is the generated output.

@Rebecca: I flagged you for reviewing the PR (as discussed over IRC).
Attachment #8753239 - Attachment is obsolete: true
Attachment #8756431 - Flags: review?(rweiss)
I have also asked :spenrose to review this with me as the second reviewer.
Flags: needinfo?(spenrose)
1) The notebook discards an unknown number of records without, as far as I can tell, counting them or distinguishing among the different reasons for discarding them. I think we should add counts for each reason and display them at the end, to see how they compare with the totals (a sketch of this follows below).

2) The notebook does not display intermediate results and it does not come with unit tests the way code would, which means that reviewing it requires holding a lot of state in one's head. I think that "reviewing" this work should consist in part of showing input and intermediate results so we can have a sense of the state machine. I am a candidate for doing that work (though not currently scheduled to).
Flags: needinfo?(spenrose)
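A small sketch of what point 1) could look like inside the notebook, with example (not actual) reason labels and ping fields:

from collections import Counter

discard_reasons = Counter()


def keep_ping(ping):
    """Validate one ping, tallying a reason whenever it is discarded."""
    if "monitors" not in ping:
        discard_reasons["no monitor section"] += 1
        return False
    if not isinstance(ping.get("cpu_cores"), int):
        discard_reasons["invalid CPU core count"] += 1
        return False
    return True

# At the end of the job, print the breakdown next to the totals:
#   for reason, count in discard_reasons.most_common():
#       print(reason, count)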
(In reply to Sam Penrose from comment #30)
> 1) The notebook discards an unknown amount of records without, as far as I
> can tell, counting them or distinguishing among several different reasons
> for discarding them. I think we should add counts for each reason and
> display them at the end, to see how they compare with the totals.

That's a good point. I think we could report the reasons why things were discarded in the logs, or produce a separate, private report with this kind of information.

For what it's worth, I had a look at the reasons why data was being discarded. Here's a breakdown, in percentages of the discarded pings, for the clients that were active in the past weeks:

- 1.8% due to pings not containing a monitor section
- 2.9% due to pings not containing a valid number for the number of CPU cores
- 0.04% due to pings not containing a valid vendor id for the video card
- 1.8% due to pings not containing a valid device id for the video card

Other reasons pings were being discarded (each below 0.001%): no video adapters section in the ping, no CPU vendor in the ping.

Most of the clients are discarded because they are inactive (no submission) in the week we are extracting the data for. This is not something that has to do with data validation, though.

> 2) The notebook does not display intermediate results and it does not come
> with unit tests the way code would, which means that reviewing it requires
> holding a lot of state in one's head. I think that "reviewing" this work
> should consist in part of showing input and intermediate results so we can
> have a sense of the state machine. I am a candidate for doing that work
> (though not currently scheduled to).

The notebook is not intended to be run as a one-off; rather, it should run as a scheduled weekly job and easily allow backfilling if we find some issues in the data. That's the reason why not much care was put into displaying intermediate results (even though this collides with the nature of Jupyter notebooks, I know).

That said, I'm open to suggestions on how to make things clearer/easier/faster to review. Please let me know if you were thinking of any specific action or you have a particular example/style I can adhere to!
(In reply to Alessio Placitelli [:Dexter] from comment #31)
> (In reply to Sam Penrose from comment #30)
> > 1) The notebook discards an unknown amount of records without, as far as I
> > can tell, counting them or distinguishing among several different reasons
> > for discarding them. I think we should add counts for each reason and
> > display them at the end, to see how they compare with the totals.
> 
> That's a good point, I think we could report the reason why things were
> discarded in the logs or produce a separate, private report with this kind
> of information.
> 
> For what is worth, I had a look at the reasons why data was being discarded.
> Here's a breakdown in percentages of the discarded pings for the clients
> that were active in the past weeks:
> 
> - 1.8% due to pings not containing a monitor section
> - 2.9% due to pings not containing a valid number for the number of CPU cores
> - 0.04% due to pings not containing a valid vendor id for the video card
> - 1.8% due to pings not containing a valid device id for the video card
> 
> Other reasons pings were being discarded (below .001%): no video adapters
> section in the ping, no CPU vendor in the ping.

How about just including them in the report in a separate bucket?
Either putting them into "other" or into more specific buckets like "no cpu info", "no gfx info", etc.?
Attached file Sample HW survey JSON
A slightly bigger sample with backfilled data, from January 2016 to the most recent week.
Attachment #8756431 - Attachment is obsolete: true
Attachment #8756431 - Flags: review?(rweiss)
Attachment #8765874 - Flags: review?(rweiss)
Attached image OperatingSystemHW.png
As Ali pointed out, the data doesn't contain os_Darwin-15* entries on the four dates where we have troughs in the "Operating System" graph at [1] (screenshot in comment 35).

What's going on? Let's look at the data.

Data from 2016-02-21 (OSX 15 is there):

...
    "os_Darwin-14.5.0": 0.0135462513,
    "os_Darwin-15.3.0": 0.0129951447,
    "os_Darwin-Other": 0.0356674444,
...

Data from 2016-05-22 (OSX 15 isn't there):

...
    "os_Darwin-14.5.0": 0.0125374454,
    "os_Darwin-Other": 0.0509810781,
...

In the first case, Darwin-Other accounts for 3% of the users, while in the second case this generic bucket accounts for 5%. Since we're bucketing any configuration that accounts for less than 1% of the sample into a generic bucket, I think that OSX 15.* for that particular week is under that threshold and gets folded into the Darwin-Other bucket.

Question time!
Can we live with this kind of artifact?
Flags: needinfo?(rweiss)
Flags: needinfo?(waxmigs2902)
To close out the earlier comment: until we hear otherwise from stakeholders, we can live with this artifact.
Flags: needinfo?(rweiss)
Attached file HW Survey Notebook
Ryan, as mentioned over IRC, I'm flagging you for review (r?) on this notebook for a sanity check.
Attachment #8781881 - Flags: review?(rharter)
Flags: needinfo?(waxmigs2902)
Hey Alessio, 

Can you open a pull request so I can use reviewable for inline comments?

A few high level comments:
* Do you have an example of the generated report? Are these saved for future review by the automated job or are they discarded?
* The % of discarded observations does not appear to be reported/saved. I expect this measure will be very useful if anything should look strange in the data. Can we save this in the above-mentioned report or, better yet, include it in the payload as suggested by Georg?

[0] https://github.com/mozilla/firefox-hardware-survey/pull/1
Flags: needinfo?(alessio.placitelli)
(In reply to Ryan Harter [:harter] from comment #39)
> Hey Alessio, 
> 
> Can you open a pull request so I can use reviewable for inline comments?

Hey, first of all, thanks for taking a look at this! As we cleared up over IRC, you can drop your comments on the PR for bug 1293954, as the PR for the notebook was already merged.

> A few high level comments:
> * Do you have an example of the generated report? Are these saved for future
> review by the automated job or are they discarded?

An example of the generated data is attached to this bug. This job runs weekly
and the most recent generated report will be saved in the public analysis bucket
by bug 1293954.

> * The % of discarded observations does not appear to be reported/saved. I
> expect this measure will be very useful if anything should look strange in
> the data. Can we save this in the above mentioned report or, better yet,
> include it in the payload as suggested by Georg

This should be saved, unless there's a bug :-(
Look for |discarded_count| in the notebook.

> [0] https://github.com/mozilla/firefox-hardware-survey/pull/1
Flags: needinfo?(alessio.placitelli)
Attachment #8781881 - Flags: review?(rharter) → review+
As noted offline, the example ping is out of date but the current ping [0] does include the discarded ping rate. This LGTM. Thanks!

[0] https://analysis-output.telemetry.mozilla.org/game-hardware-survey/data/hwsurvey-weekly.json
Comment on attachment 8765874 [details]
Sample HW survey JSON

This looks good to me.  You should r? :ali on this to make sure this doesn't break the site.
Attachment #8765874 - Flags: review?(rweiss) → review+
(In reply to Rebecca Weiss from comment #42)
> Comment on attachment 8765874 [details]
> Sample HW survey JSON
> 
> This looks good to me.  You should r? :ali on this to make sure this doesn't
> break the site.

Cool. Ali is aware of the format. Some new fields are being added by bug 1293955, and he has already updated the website to handle them.

Closing this bug, feel free to reopen if needed.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED