Closed Bug 1450289 Opened 8 years ago Closed 8 years ago

Create a validation service for schemas

Categories

(Data Platform and Tools :: General, enhancement, P1)

enhancement
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

Attachments

(1 file, 2 obsolete files)

See this GitHub issue.[0] Currently, the pipeline-schemas repo has no integration testing against real data. I suggest applying schema validation to real sampled data from landfill. Using live data has security implications, so this service should only be available to CI and to users with SSO. CI should make testing available to outside contributors. In terms of usability, this service should be similar to the `try` servers for building Firefox. An initial sketch of the service can be found in this gist.[1]

[0] https://github.com/mozilla-services/mozilla-pipeline-schemas/issues/5
[1] https://gist.github.com/acmiyaguchi/2280ca1c5dddce404c2ef133fa6a9b6b
Assignee: nobody → amiyaguchi
Points: --- → 5
Depends on: 1447851
Priority: -- → P1
Points: 5 → 3
See Also: → 1450290
Even if the service is only available to CI, outside users who can submit PRs and run CI would still be able to access the data. I suggest pulling in a member of the security operations team for review.
I've been thinking through this problem with that in mind. I appreciate the feedback, and I think getting a review from security operations is a good idea. I'm keen on getting a working demo and a more fleshed-out design document beforehand, though.

There are a few things missing from comment #1 that I think would be useful. The first is the link to the notebook that will be used as the basis of the service.[1] The second is a rendered image of the service design, which I will attach to the bug.

I've tried to limit the scope of the initial design as much as I possibly could. The main idea is that the service only has access to sanitized data and only outputs summary information about validation. The service is only reachable through a proxy, which separates function from authorization: a client has to communicate through a proxy service that logs all requests and summaries. Anyone should be able to ask for a report, because no sensitive data should leak through it (the report contains only the percentage of validation errors and a rollup of errors based on a ranking function).

[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/8754
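To make the "summary only" idea concrete, here is a minimal sketch of what such a report could look like. It is not taken from the notebook or the design document: the function names, the report keys, the toy required-field validator, and the choice of ranking errors by frequency are all assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def summarize(documents: Iterable[dict],
              validate: Callable[[dict], List[str]],
              top_n: int = 5) -> Dict[str, object]:
    """Validate each document and return only aggregate information.

    `validate` returns a list of error messages for one document.
    The messages must describe the schema violation, not the document
    contents, so that no sampled data leaks into the report.
    """
    total = 0
    failed = 0
    errors = Counter()
    for doc in documents:
        total += 1
        messages = validate(doc)
        if messages:
            failed += 1
            errors.update(messages)
    percentage = 100.0 * failed / total if total else 0.0
    return {
        "total": total,
        "failed": failed,
        "error_percentage": round(percentage, 2),
        # Hypothetical ranking function: most frequent errors first.
        "top_errors": errors.most_common(top_n),
    }


def require_fields(required: List[str]) -> Callable[[dict], List[str]]:
    """Toy stand-in for a real schema validator (assumption)."""
    def validate(doc: dict) -> List[str]:
        return ["missing required field: " + name
                for name in required if name not in doc]
    return validate


# Hypothetical sample documents; the field names are made up.
docs = [
    {"clientId": "a", "type": "main"},
    {"type": "main"},
    {},
    {"clientId": "b", "type": "main"},
]
report = summarize(docs, require_fields(["clientId", "type"]))
```

Because the report holds only counts and schema-level messages, the proxy could log it and return it to any requester without exposing the underlying documents.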
Attached image: Initial sketch of the service diagram (obsolete)
Here's an initial design document with some requirements, a suggested data model, and an API. I'll be creating a new repository with tests for the service request and response schemas. Schema Validation Service - Design Document - https://docs.google.com/document/d/1xVgHXwvBtLZAusk-TdgU8WP9vh7ll0P0XpEeL6d0VRU/edit#
This is an intermediate build (v0.1) of the Spark validation application. Data can be tested manually by adding it to a specific directory and running an integration test. The application can currently be submitted over the network to a Spark standalone cluster. It should be possible to replace the standalone instance with a `spark-submit` compatible service (EMR, etc.). Some file clean-up still needs to be done. The Flask REST API is close to being done. The integration test (`run-compose-test.sh`) will be the basis of the Celery tasks for submitting validation requests.
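As a rough illustration of why the standalone cluster should be replaceable, the sketch below builds a `spark-submit` command line in which only the `--master` value depends on the cluster. The host name, application file, and application arguments are hypothetical and are not taken from the attached build.

```python
import shlex
from typing import List, Sequence


def build_spark_submit_command(master_url: str,
                               application: str,
                               app_args: Sequence[str] = ()) -> List[str]:
    """Return the argv for a spark-submit call; nothing is executed here."""
    return ["spark-submit", "--master", master_url, application, *app_args]


# Hypothetical values for a Spark standalone master on the default port.
command = build_spark_submit_command(
    "spark://spark-master:7077",
    "validator.py",
    ["--input", "/data/sample"],
)

# Shell-safe rendering, e.g. for logging the submitted request.
command_line = " ".join(shlex.quote(part) for part in command)
```

A task runner could pass `command` to `subprocess.run`. Pointing the same job at a different `spark-submit` compatible service would then mostly be a matter of changing `master_url`, for example to `yarn` on a YARN-based cluster.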
Attachment #8963994 - Attachment is obsolete: true
Attachment #8963995 - Attachment is obsolete: true
Blocks: 1454062
Summary: Create a validation service for schemas using santized and sampled landfill data → Create a validation service for schemas
The intermediate build here is a useful proof of concept to accompany the design document. A v1 is planned as a series of bugs tracked in bug 1454062, and should include a staging environment and self-hosted documentation.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General