Closed Bug 1447851 (opened 8 years ago, closed 8 years ago)

Landfill data should be accessible by moztelemetry

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

Attachments

(1 file)

Link to GitHub pull-request: https://github.com/mozilla/python_moztelemetry/pull/209 (attached 8 years ago by GitHub Bugzilla PR Linker; 55 bytes, text/x-github-pull-request)

Anthony Miyaguchi [:amiyaguchi]

Assignee

Description

•

8 years ago

You should be able to use the moztelemetry API to access the landfill data. This data is the same as telemetry data, but has not had schema validation applied to it. It would be useful for generating diagnostic information from schema validation.

Running the following snippet:

```
(
    Dataset.from_source("landfill")
    .where(docType='main')
    .where(appUpdateChannel="nightly")
    .where(submissionDate=lambda x: start_date <= x <= end_date)
    .records(sc)
)
```

returns the following error:

```
KeyError: 'metadata_prefix'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<command-8667> in <module>()
----> 1 pings = get_pings("20180301", "20180301", sample=0.1)

<command-8666> in get_pings(start_date, end_date, sample)
     16
     17     return (
---> 18         Dataset.from_source("landfill")
     19         .where(docType='main')
     20         .where(appUpdateChannel="nightly")

/databricks/python/local/lib/python2.7/site-packages/moztelemetry/dataset.pyc in from_source(source_name)
    304             raise Exception('Unknown source {}'.format(source_name))
    305
--> 306         schema = store.get_key('{}/schema.json'.format(source['metadata_prefix'])).read()
    307         dimensions = [f['field_name'] for f in json.loads(schema)['dimensions']]
    308         return Dataset(source['bucket'], dimensions, prefix=source['prefix'])

KeyError: 'metadata_prefix'
```

This is because the `sources.json` entry for landfill is missing a `metadata_prefix` (because there isn't associated metadata).

> aws s3 cp s3://net-mozaws-prod-us-west-2-pipeline-metadata/sources.json - | grep -A2 "landfill"

I think it's reasonable to add extra logic for processing landfill data (since it is structured on disk very simply).
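For readers following along, here is a minimal sketch of the failure mode described above and one way a lookup could tolerate a missing `metadata_prefix`. It is not the moztelemetry implementation or the eventual fix: `describe_source` is a hypothetical helper, and the bucket and prefix values in `SOURCES_JSON` are placeholders rather than the real contents of `sources.json`.

```python
import json

# Placeholder stand-in for sources.json; the real file lives in S3 and its
# actual bucket/prefix values are not reproduced here.
SOURCES_JSON = json.dumps({
    "telemetry": {
        "bucket": "example-data-bucket",
        "prefix": "telemetry",
        "metadata_prefix": "telemetry-meta",
    },
    # Like the landfill entry in the report: no metadata_prefix key at all.
    "landfill": {
        "bucket": "example-landfill-bucket",
        "prefix": "landfill",
    },
})


def describe_source(sources, source_name):
    """Hypothetical helper: look up a source without assuming a schema exists."""
    try:
        source = sources[source_name]
    except KeyError:
        raise ValueError("Unknown source {}".format(source_name))
    return {
        "bucket": source["bucket"],
        "prefix": source["prefix"],
        # .get() returns None instead of raising KeyError when the key is absent.
        "metadata_prefix": source.get("metadata_prefix"),
        "has_schema": "metadata_prefix" in source,
    }


sources = json.loads(SOURCES_JSON)
landfill = describe_source(sources, "landfill")
telemetry = describe_source(sources, "telemetry")
```

A caller could branch on `has_schema` to decide whether to fetch `schema.json` or fall back to a fixed list of dimensions for sources that have no metadata.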

Wesley Dawson [:whd]

Comment 1

•

8 years ago

GetObject access to landfill is specifically denied by the current production bucket policy. The data contained in the current landfill (landfill-3) is unsanitized and contains the IP addresses of clients, which should not be made broadly available within Mozilla through ATMO. In order to use landfill data you currently need specific production credentials, which is why it is not listed as a source to moztelemetry.

What is listed in sources.json is the deprecated landfill-2 dataset, which applied GeoIP decoding before being written and was therefore accessible from dev. Accessing that dataset broke with [1], which we never noticed because we never used it via moztelemetry.

So simply adding the right information to sources.json and updating moztelemetry will be insufficient, but it looks like we could create an IAM role in the databricks IAM that has landfill access and allow only certain users to launch clusters with it [2]. From a security perspective this is something we don't have support for in the dev IAM/ATMO/WTMO, but it looks like it may be sufficient for your purposes to access this data from databricks only.

[1] https://github.com/mozilla/python_moztelemetry/pull/86

[2] https://docs.databricks.com/api/latest/instance-profiles.html

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 2

•

8 years ago

Accessing landfill from an isolated cluster in databricks would work well for my intended use-case. I would be sampling from the data, sanitizing it, and generating reports based on validation errors.

Wesley Dawson [:whd]

Comment 3

•

8 years ago

I've added an IAM role ec2-databricks-landfill that has the requisite access, given you access in databricks to launch clusters with this role, and updated sources.json to point at landfill-3. On a cluster with this role, you can run e.g.

```
(
    Dataset.from_source("landfill")
    .where(Host='incoming.telemetry.mozilla.org')
    .where(submissionDate="20180323")
    .records(sc)
)
```

and see output, but it will be missing the actual payloads, which are binary strings. Depending on how we want to expose this data, there are a few options.

(a) Pass the content field bytes string as-is

This requires the most manual processing but allows you to analyze a class of errors occurring before schema validation that otherwise require using the telemetry_errors_parquet table.

(b) Decode (possibly with gunzip) the content field to a string

This is an intermediate representation that provides a single string field. If you plan on mostly doing JSON schema validation this approach may make the most sense, with the caveat that some strings will not be valid JSON.

(c) Decode the content field as if it were Fields[submission] / Payload

This is essentially what the current API gives you. You can derive the representation from (b) by removing the meta field from the resultant object and re-encoding the result as JSON. Pings that are not valid JSON will be discarded.

I've implemented these for the python bindings at [1], and I can write tests for the variant we decide to use. We can also update the scala bindings if needed.

[1] https://github.com/mozilla/python_moztelemetry/compare/master...whd:landfill?expand=1
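To make the three options concrete, here is a rough sketch of what each decoding level could look like for a single content value. This is not the code in the linked branch: the function names are hypothetical, gzip detection by magic number is an assumption, and the sketch targets Python 3 (`gzip.decompress`) rather than the Python 2.7 environment shown in the traceback. Corrupt gzip streams are not handled.

```python
import gzip
import json

# First two bytes of any gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def content_raw(content):
    """Option (a): hand the bytes back untouched."""
    return content


def content_text(content):
    """Option (b): gunzip if the payload looks gzipped, then decode to a string."""
    if content[:2] == GZIP_MAGIC:
        content = gzip.decompress(content)
    # Undecodable bytes are replaced rather than raising, so every record
    # still yields a string (which may not be valid JSON).
    return content.decode("utf-8", errors="replace")


def content_json(content):
    """Option (c): parse the decoded text as JSON; invalid pings become None."""
    try:
        return json.loads(content_text(content))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


gzipped_ping = gzip.compress(json.dumps({"type": "main"}).encode("utf-8"))
plain_ping = b'{"type": "main"}'
broken_ping = b"not json"
```

In this sketch option (b) never drops a record, while option (c) returns `None` for anything that does not parse, which matches the trade-off described in the comment above.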

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

8 years ago

Assignee: nobody → amiyaguchi

Priority: -- → P1

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

8 years ago

Blocks: 1450289

Mark Reid [:mreid]

Updated

•

8 years ago

Priority: P1 → P2

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 4

•

8 years ago

The file-based validator is almost done and will be ready to test against real data. Option (b) sounds the closest to what I have in mind, dumping the data into a folder structure that mirrors mozilla-pipeline-schemas.
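As an illustration of the folder layout mentioned above, here is a small sketch of how a dumped document's path might be built. The namespace/docType/version nesting is only a guess at what "mirrors mozilla-pipeline-schemas" means here, and `document_path`, the root directory, and the document id are all hypothetical.

```python
from pathlib import PurePosixPath


def document_path(root, namespace, doc_type, doc_version, doc_id):
    """Hypothetical layout: <root>/<namespace>/<docType>/<version>/<id>.json"""
    return (
        PurePosixPath(root)
        / namespace
        / doc_type
        / str(doc_version)
        / "{}.json".format(doc_id)
    )


# Placeholder values, not taken from real landfill data.
path = document_path("landfill-sample", "telemetry", "main", 4, "example-doc-id")
```

Grouping documents this way would let a file-based validator pick the schema for a directory once and apply it to every file underneath.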

Priority: P2 → P1

Wesley Dawson [:whd]

Comment 5

•

8 years ago

I suggest you take over my POC PR from comment #3 and add me as a reviewer once the bindings work to your liking.

GitHub Bugzilla PR Linker

Comment 6

•

8 years ago

Attached file: Link to GitHub pull-request: https://github.com/mozilla/python_moztelemetry/pull/209

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 7

•

8 years ago

With the patch above, I'm able to extract the raw documents from landfill. Stripping most or all of the meta information should be adequate for my purposes.
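A minimal sketch of the stripping step, assuming each extracted document is a dict with the pipeline-added information under a top-level `meta` key (the "meta field" referred to in comment 3). `strip_meta` and the sample field names are hypothetical and not part of the patch.

```python
def strip_meta(document, keep=()):
    """Return a copy of `document` without `meta`, except for whitelisted keys."""
    cleaned = {key: value for key, value in document.items() if key != "meta"}
    meta = document.get("meta", {})
    kept = {key: meta[key] for key in keep if key in meta}
    if kept:
        cleaned["meta"] = kept
    return cleaned


# Made-up example record; real landfill documents will have different fields.
record = {
    "type": "main",
    "payload": {"info": {}},
    "meta": {"Host": "incoming.telemetry.mozilla.org", "submissionDate": "20180323"},
}

without_meta = strip_meta(record)
with_date = strip_meta(record, keep=("submissionDate",))
```

Passing an empty `keep` drops all meta information, while a whitelist keeps only the fields a report actually needs. The input record is never modified in place.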

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: Telemetry APIs for Analysis → General


Categories

(Data Platform and Tools :: General, defect, P1)
