Closed Bug 1447851 Opened 8 years ago Closed 8 years ago

Landfill data should be accessible by moztelemetry

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Attachments

(1 file)

You should be able to use the moztelemetry API to access the landfill data. This data is the same as telemetry data, but has not had schema validation applied to it. It would be useful for generating diagnostic information from schema validation.

Running the following snippet:

```
(
    Dataset.from_source("landfill")
    .where(docType='main')
    .where(appUpdateChannel="nightly")
    .where(submissionDate=lambda x: start_date <= x <= end_date)
    .records(sc)
)
```

fails with the following error:

```
KeyError: 'metadata_prefix'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<command-8667> in <module>()
----> 1 pings = get_pings("20180301", "20180301", sample=0.1)

<command-8666> in get_pings(start_date, end_date, sample)
     16
     17     return (
---> 18         Dataset.from_source("landfill")
     19         .where(docType='main')
     20         .where(appUpdateChannel="nightly")

/databricks/python/local/lib/python2.7/site-packages/moztelemetry/dataset.pyc in from_source(source_name)
    304             raise Exception('Unknown source {}'.format(source_name))
    305
--> 306         schema = store.get_key('{}/schema.json'.format(source['metadata_prefix'])).read()
    307         dimensions = [f['field_name'] for f in json.loads(schema)['dimensions']]
    308         return Dataset(source['bucket'], dimensions, prefix=source['prefix'])

KeyError: 'metadata_prefix'
```

This is because the landfill entry in `sources.json` has no `metadata_prefix`, since there is no associated metadata.

> aws s3 cp s3://net-mozaws-prod-us-west-2-pipeline-metadata/sources.json - | grep -A2 "landfill"

I think it's reasonable to add extra logic for processing landfill data, since it is structured on disk very simply.
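One way the "extra logic" could look is a lookup that tolerates a missing `metadata_prefix` instead of raising the `KeyError` shown in the traceback. This is only a sketch: `resolve_dimensions`, `read_schema`, and `FALLBACK_DIMENSIONS` are hypothetical names rather than part of moztelemetry, and the fallback dimension list (names and order) is an assumption, since the real landfill layout is not spelled out in this bug.

```python
import json

# Hypothetical fallback: the actual landfill dimensions and their order
# are not given in this bug, so treat this list as a placeholder.
FALLBACK_DIMENSIONS = ["submissionDate", "Host"]


def resolve_dimensions(source, read_schema, fallback=FALLBACK_DIMENSIONS):
    """Return the dimension names for one entry of sources.json.

    source      -- dict for a single source (one value from sources.json)
    read_schema -- callable taking a key such as "<prefix>/schema.json"
                   and returning the schema document as a JSON string
    fallback    -- dimensions to use when the source has no metadata
    """
    prefix = source.get("metadata_prefix")
    if prefix is None:
        # No schema.json exists for this source, so skip the lookup
        # that currently fails with KeyError: 'metadata_prefix'.
        return list(fallback)
    schema = json.loads(read_schema("{}/schema.json".format(prefix)))
    return [field["field_name"] for field in schema["dimensions"]]
```

With something like this, `from_source` would only need the bucket and prefix from `sources.json` for sources that carry no metadata.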
GetObject access to landfill is specifically denied by the current production bucket policy. The data contained in the current landfill (landfill-3) is unsanitized and contains the IP addresses of clients, which should not be made broadly available within Mozilla through ATMO. In order to use landfill data you currently need specific production credentials, which is why it is not listed as a source to moztelemetry.

What is listed in sources.json is the deprecated landfill-2 dataset, which applied GeoIP decoding before being written and was therefore accessible from dev. Accessing that dataset broke with [1], which we never noticed because we never used it via moztelemetry.

So simply adding the right information to sources.json and updating moztelemetry will be insufficient, but it looks like we could create an IAM role in the databricks IAM that has landfill access and allow only certain users to launch clusters with it [2]. From a security perspective this is something we don't have support for in the dev IAM/ATMO/WTMO, but it looks like it may be sufficient for your purposes to access this data from databricks only.

[1] https://github.com/mozilla/python_moztelemetry/pull/86
[2] https://docs.databricks.com/api/latest/instance-profiles.html
Accessing landfill from an isolated cluster in databricks would work well for my intended use-case. I would be sampling from the data, sanitizing it, and generating reports based on validation errors.
I've added an IAM role ec2-databricks-landfill that has the requisite access, given you access in databricks to launch clusters with this role, and updated sources.json to point at landfill-3. On a cluster with this role, you can run e.g.

```
(
    Dataset.from_source("landfill")
    .where(Host='incoming.telemetry.mozilla.org')
    .where(submissionDate="20180323")
    .records(sc)
)
```

and see output, but it will be missing the actual payloads, which are binary strings. Depending on how we want to expose this data, there are a few options.

(a) Pass the content field byte string as-is
This requires the most manual processing but allows you to analyze a class of pre-schema-validation errors that otherwise require using the telemetry_errors_parquet table.

(b) Decode (possibly with gunzip) the content field to a string
This is an intermediate representation that provides a single string field. If you plan on mostly doing JSON schema validation this approach may make the most sense, with the caveat that some strings will not be valid JSON.

(c) Decode the content field as if it were Fields[submission] / Payload
This is essentially what the current API gives you. You can derive the representation from (b) by removing the meta field from the resultant object and re-encoding the result as JSON. Pings that are not valid JSON will be discarded.

I've implemented these for the python bindings at [1], and I can write tests for the variant we decide to use. We can also update the scala bindings if needed.

[1] https://github.com/mozilla/python_moztelemetry/compare/master...whd:landfill?expand=1
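To make the difference between options (a), (b), and (c) concrete, here is a rough sketch of the three decodings applied to a single content byte string. It is not the code from the branch linked above: the function names are hypothetical, it assumes Python 3 (`gzip.decompress`), and it assumes gzip-compressed payloads can be recognised by the standard gzip magic bytes.

```python
import gzip
import json
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def content_as_bytes(content):
    """Option (a): hand the raw byte string back untouched."""
    return content


def content_as_text(content):
    """Option (b): gunzip when the payload looks compressed, then decode.

    Undecodable bytes are replaced rather than raising, so the result
    is always a string, though not necessarily valid JSON.
    """
    if content[:2] == GZIP_MAGIC:
        try:
            content = gzip.decompress(content)
        except (OSError, EOFError, zlib.error):
            pass  # corrupt gzip stream: fall back to the raw bytes
    return content.decode("utf-8", errors="replace")


def content_as_json(content):
    """Option (c): parse the decoded text, returning None when invalid.

    Callers would discard the None results, which mirrors the note
    that pings that are not valid JSON are dropped.
    """
    try:
        return json.loads(content_as_text(content))
    except ValueError:
        return None
```

A schema validator would most naturally consume the output of `content_as_text`, because it keeps the malformed documents that `content_as_json` throws away.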
Assignee: nobody → amiyaguchi
Priority: -- → P1
Blocks: 1450289
Priority: P1 → P2
The file-based validator is almost done and will be ready to test against real data. Option (b) sounds the closest to what I have in mind, dumping the data into a folder structure that mirrors mozilla-pipeline-schemas.
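A minimal sketch of such a dump, assuming mozilla-pipeline-schemas is organised by namespace and document type, might look like the following. The directory layout, the per-version subdirectory, and the names `document_path` and `dump_document` are all illustrative assumptions; the validator's actual expected layout is not described in this bug.

```python
import os


def document_path(root, namespace, doc_type, version, doc_id):
    """Build <root>/<namespace>/<doc_type>/<version>/<doc_id>.json.

    The namespace/doc_type nesting is an assumption about how the
    schemas repository is organised.
    """
    return os.path.join(root, namespace, doc_type, str(version), doc_id + ".json")


def dump_document(root, namespace, doc_type, version, doc_id, text):
    """Write one decoded document (option (b) string) to its path."""
    path = document_path(root, namespace, doc_type, version, doc_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

Keeping the raw string rather than a parsed object means documents that fail JSON parsing still land on disk, where the validator can report on them.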
Priority: P2 → P1
I suggest you take over my POC PR (comment #3) and add me as a reviewer when the bindings work to your liking.
With the patch above, I'm able to extract the raw documents from landfill. Stripping most or all of the meta information should be adequate for my purposes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Telemetry APIs for Analysis → General