Public dataset release; JESTr topN TLD+1 navigation
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: mlopatka, Unassigned)
Details
Attachments
(1 file)
1018 bytes, text/plain | nshadowen: data-review+
Description
This is a public dataset release request. It is hereby submitted for data steward review as part of our internal pre-release audit process.
Motivation for this dataset release is to emphasize transparency, reproducibility, and scientific rigour in our public-facing research. This dataset may be of use to other researchers studying a variety of fundamental topics pertaining to the Web.
The dataset is intended to be hosted here as part of supplementary materials to complement this publication.
This dataset has been aggregated over the entirety of the Pioneer cohort.
Fields to be included in the dataset
- TLD+1: the domain (pay level domain) name
- Rank: the relative rank at which pages at this domain were navigated to by all participants in the opt-in JESTr study throughout the study duration.
Privacy measures employed
To ensure the privacy of study participants, the ranked list is released subject to a differentially private (DP) release mechanism satisfying an epsilon value of 0.7. Our DP release approach for this list involves computing noisy versions of the JESTr observed frequency counts for every domain on a given whitelist. For this purpose, we use the top K entries of the Trexa list as the whitelist.
Domain-level navigation frequencies were obfuscated using the Laplace mechanism, yielding DP-protected item frequency counts on which the rank is based. Contributions from any single client were capped to avoid disproportionate information leakage as a consequence of high browsing volume.
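For readers without access to the notebook, the release procedure described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the list generation notebook. The function names, the default cap of 10, and the capping rule (keep each client's first `cap` whitelisted navigations) are assumptions made for illustration; only epsilon = 0.7, the Laplace mechanism, the whitelist, and the existence of a per-client cap come from this description. Under the assumed capping rule, adding or removing one client changes the counts by at most `cap` in total, so the Laplace noise scale is taken as `cap / epsilon`.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_ranked_domains(client_visits, whitelist, epsilon=0.7, cap=10, rng=None):
    """Return [(rank, domain)] for whitelisted domains, ranked by noisy counts.

    client_visits: dict mapping a client id to a list of navigated TLD+1 domains.
    whitelist:     list of unique domains eligible for release.
    cap:           hypothetical per-client limit on contributed navigations.
    """
    rng = rng or random.Random()
    allowed = set(whitelist)

    counts = Counter()
    for visits in client_visits.values():
        # Assumed capping rule: keep at most `cap` whitelisted navigations per client.
        counts.update([d for d in visits if d in allowed][:cap])

    # Every whitelisted domain gets noise, including those with a true count of zero.
    scale = cap / epsilon
    noisy = {d: counts[d] + laplace_noise(scale, rng) for d in whitelist}

    ranked = sorted(whitelist, key=lambda d: noisy[d], reverse=True)
    return list(enumerate(ranked, start=1))
```

Only the domain and its rank are returned, matching the two fields listed for release; the noisy counts themselves stay internal. Domains that are not on the whitelist never appear in the output, however often they were visited.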
List generation notebook here
Data Characteristics
Is the level of aggregation lower than 3 (i.e. does it include individual-level data)?
No.
Are there any Data Collection Category 3 or 4 dimensions?
Yes. The TLD+1 level domains pre-aggregation are Category 3 data. Raw data is NOT intended for public release; only aggregated and noise-obfuscated data is planned for release.
Do any of the dimensions or metrics include sensitive data?
No. The post-aggregation data does not pose a privacy risk to any individual client, cannot be used to identify an individual, and does not leak sensitive information about the Pioneer population.
@nshadowen would you care to own the data steward review process for this one?
@dzeber can you please verify that all the specifics pertaining to the DP-release parameters have been correctly articulated in the bug description?
Comment 2•5 years ago
Thanks for including me @mlopatka, happy to. I will await @dzeber's verification on the specifics of DP-release parameters.
In the meantime, would you mind answering the following: Are there any data included that do not have a corresponding data review for collection? Is the only relevant data review from the opt-in JESTr study? Also, I was not able to access the link to the list generation notebook.
Thanks.
All data contained in this release were aggregated from the collection in the referenced JESTr study.
Following up via Slack to establish access to the Pioneer analysis private repo.
Comment 4•5 years ago
The description is accurate, and I will add that the whitelist consists of the top 10K Trexa entries. I.e., the domains in the list we plan to release are a subset of the Trexa top 10K.
Comment 5•5 years ago
Just to confirm with :agray that we can now upload the list (as described here) to a publicly accessible location?
Comment 7•5 years ago
(In reply to mlopatka from comment #6)
Just to confirm with :agray that we can now upload the list (as described here) to a publicly accessible location?
Approved.
Updated•3 years ago