Closed
Bug 1450289
Opened 8 years ago
Closed 8 years ago
Create a validation service for schemas
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: amiyaguchi, Assigned: amiyaguchi)
Attachments
(1 file, 2 obsolete files)
See this github issue.[0]
Currently, the pipeline-schemas repo is missing integration testing against real data. I suggest using real sampled data from landfill to apply schema validation. There are security implications for using live data, so this service should only be available to CI and users with SSO. CI should make testing available to outside contributors. In terms of usability, this service should be similar to the `try` servers for building Firefox.
An initial sketch of the service can be found in this gist.[1]
[0] https://github.com/mozilla-services/mozilla-pipeline-schemas/issues/5
[1] https://gist.github.com/acmiyaguchi/2280ca1c5dddce404c2ef133fa6a9b6b
Updated • 8 years ago (Assignee)
Updated • 8 years ago (Assignee)
Points: 5 → 3
Comment 1 • 8 years ago
Even if the service is available only to CI, outside users who can submit PRs and run CI can access the data. I suggest pulling in a member of the security operations team for review.
Comment 2 • 8 years ago (Assignee)
I've been thinking through this problem with that in mind. I appreciate the feedback, and I think getting a review from security operations is a good idea. I'd like to have a working demo and a better fleshed-out design document beforehand, though.
There are a few things that are missing from comment #1 that I think would be useful. First is the link to the notebook that will be used as the basis of the service.[1] The second is a rendered image of the service design, which I will attach to the bug.
I've tried to limit the scope of the initial design as much as possible. The main idea is that the service has access only to sanitized data and outputs only summary information about validation. The service is reachable only through a proxy, which separates function from authorization: a client has to communicate through a proxy service that logs all requests and summaries.
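To make the proxy idea concrete, here is a minimal sketch, not the actual service. The names `make_proxy`, `backend`, and `is_authorized` are hypothetical, and the authorization policy is left pluggable because it has not been decided yet. The point is only that the validation backend never handles authorization, and that every request and every summary passes through one logging choke point.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("validation_proxy")

# Hypothetical shape: the backend takes a request dict and returns a summary dict.
Backend = Callable[[Dict], Dict]


def make_proxy(
    backend: Backend, is_authorized: Callable[[str], bool]
) -> Callable[[str, Dict], Dict]:
    """Wrap a validation backend so every call is authorized and logged."""

    def proxy(client_id: str, request: Dict) -> Dict:
        # Authorization lives here, not in the backend.
        if not is_authorized(client_id):
            logger.warning("denied client=%s request=%s", client_id, request)
            raise PermissionError(f"client {client_id!r} is not authorized")
        logger.info("request client=%s request=%s", client_id, request)
        summary = backend(request)  # the backend only sees the request itself
        logger.info("summary client=%s summary=%s", client_id, summary)
        return summary

    return proxy
```

A policy that lets anyone request a report would simply pass `lambda client_id: True`; the logging still applies.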
Anyone should be able to ask for a report, because the report contains only summary information (percentage of validation errors, rollup of errors based on a ranking function) and shouldn't leak any sensitive data.
[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/8754
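As a rough illustration of the kind of report described in this comment, the sketch below computes a percentage of failing documents and a rollup of error messages. It makes several assumptions: the validator is a toy stand-in that checks only top-level `required` fields and `properties` types (a real service would use a full JSON Schema validator), the ranking function is taken to be plain frequency, and `find_errors` and `summarize` are hypothetical names. Error messages mention field names only, never field values, so the summary carries no document content.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Subset of JSON Schema type names mapped to Python types (illustration only).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def find_errors(doc: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Toy validator: top-level `required` and `properties.*.type` checks only."""
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in doc
    ]
    for name, spec in schema.get("properties", {}).items():
        expected = _TYPES.get(spec.get("type"))
        if name in doc and expected and not isinstance(doc[name], expected):
            errors.append(f"wrong type for field: {name}")
    return errors


def summarize(
    docs: Iterable[Dict[str, Any]], schema: Dict[str, Any], top_n: int = 3
) -> Dict[str, Any]:
    """Return only aggregate information about validation failures."""
    total = failed = 0
    rollup: Counter = Counter()
    for doc in docs:
        total += 1
        errors = find_errors(doc, schema)
        if errors:
            failed += 1
            rollup.update(errors)
    return {
        "total": total,
        "error_percentage": 100.0 * failed / total if total else 0.0,
        # Assumed ranking function: most frequent error messages first.
        "top_errors": rollup.most_common(top_n),
    }
```

Only the dictionary returned by `summarize` would ever leave the service; the sampled documents themselves stay behind it.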
Comment 3 • 8 years ago (Assignee)
Comment 4 • 8 years ago (Assignee)
Comment 5 • 8 years ago (Assignee)
Here's an initial design document with some requirements, a suggested data model, and an API. I'll be creating a new repository with tests for the service request and response schemas.
Schema Validation Service - Design Document - https://docs.google.com/document/d/1xVgHXwvBtLZAusk-TdgU8WP9vh7ll0P0XpEeL6d0VRU/edit#
Comment 6 • 8 years ago (Assignee)
This is an intermediate build (v0.1) of the Spark validation application. Data can be manually tested by adding data to a specific directory and running an integration test.
The application can currently be submitted over the network to a spark-standalone cluster. It should be possible to replace the standalone instance with a `spark-submit` compatible service (EMR, etc.).
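One way to picture why the backend should be swappable: with `spark-submit`, the cluster is selected by the `--master` argument, so the rest of the command can stay the same. The sketch below only builds the argument list; `spark_submit_command`, the application file name, the host name, and the application arguments are all hypothetical and not taken from the attached build.

```python
from typing import List, Sequence


def spark_submit_command(
    master: str, application: str, app_args: Sequence[str] = ()
) -> List[str]:
    """Build an argv list for `spark-submit`; only `master` is backend-specific."""
    return ["spark-submit", "--master", master, application, *app_args]


# Standalone cluster (hypothetical host name and application file).
standalone = spark_submit_command(
    "spark://spark-master:7077", "validate.py", ["--input", "/data/sample"]
)

# YARN-backed service such as EMR: same application, different master.
on_yarn = spark_submit_command("yarn", "validate.py", ["--input", "/data/sample"])
```

Either list could then be handed to `subprocess.run(cmd, check=True)` on a machine where `spark-submit` is installed.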
There is still some file clean-up that needs to be done. The Flask REST API is close to being done. The integration test (`run-compose-test.sh`) will be the basis of the Celery tasks for submitting validation requests.
Attachment #8963994 - Attachment is obsolete: true
Attachment #8963995 - Attachment is obsolete: true
Updated • 8 years ago (Assignee)
Summary: Create a validation service for schemas using santized and sampled landfill data → Create a validation service for schemas
Comment 7 • 8 years ago (Assignee)
The intermediate build here is a useful proof of concept to accompany the design document. A v1 is planned in a series of bugs in bug 1454062, which should include a staging environment and self-hosted documentation.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 3 years ago
Component: Pipeline Ingestion → General