URI-based scrubbing in gcp-ingestion
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: whd, Assigned: ascholtz)
Details
Attachments
(2 files)
This should mostly replace existing using cases where we drop submissions at the edge using nginx/openresty:
telemetry v2 / other old pings: https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-ingestion/k8s/charts/data-ingestion/templates/filter-configmap.yaml#L14-L41
sslreports: https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-ingestion/k8s/charts/data-ingestion/templates/nginx-configmap.yaml#L96
This would give us better visibility into the types of unwanted data we're collecting (at a marginal cost, assuming the number of pings we drop at the edge is relatively small). It would also be more maintainable than the existing filter logic, which can be confusing and relatively dangerous to update. The confusion comes from this logic initially being structured to selectively tee traffic to two two stacks in AWS (heka and hindsight), and the danger comes from an inability to reprocess data if we accidentally drop something we didn't mean to. https://github.com/mozilla/data-docs/pull/575 is an example of confusing behavior associated with this logic.
The main reasons I can think of to prefer edge-based URI filtering are:
- cost (unlikely to be a factor for deprecated/unwanted data)
- contractual obligation to never store specific data (currently no such contracts, but they'e been discussed)
In these cases the edge filter may be preferable as a secondary filtering/processing mechanism described here.
Updated•5 years ago
|
Comment 1•5 years ago
|
||
I think it would make sense here to reuse the existing MessageScrubber class, adding a scrubByUri(String uri) method that throws UnwantedDataException. We would add a call to MessageScrubber.scrubByUri in the ParseUri transform.
I think we should associate sslreports URIs with bug 1585144, and the telemetry v2 / other old pings bucket can be associated with this bug 1673433.
| Assignee | ||
Comment 2•5 years ago
|
||
| Assignee | ||
Comment 3•5 years ago
|
||
Implemented in https://github.com/mozilla/gcp-ingestion/pull/1486
| Reporter | ||
Comment 4•5 years ago
|
||
It looks like that PR addresses the /submit/sslreports case but not the telemetry v2 case / other old pings case implemented by
https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-ingestion/k8s/charts/data-ingestion/templates/filter-configmap.yaml#L14-L41, so I'm going to re-open this.
On that point, it may end up being somewhat more difficult to test the latter logic. We'd probably need to go to the ELB/GCLB logs since these are otherwise dropped at the edge. However, the new filter should be a noop on all URIs that made it to payload_bytes_raw so that'd probably be sufficient to determine no false positives.
Another option would be to disable the openresty filter and empirically identify the offenders in e.g. the standard weekly triage meeting, and build specific exclusions from that. This doesn't seem like an unreasonable thing to do to me, as we'd then be able to avoid propagating the arcane edge filtering logic forevermore. This option assumes that there is no data we're dropping at the edge that would otherwise be considered valid from a schemas perspective, which I guess is true.
Comment 5•5 years ago
|
||
| Assignee | ||
Comment 6•5 years ago
|
||
Option 2 sounds like a good way forward to me. The re-implemented validation logic turned out to be hard to verify which is probably also bad for future maintenance. Only scrubbing the pieces we actually need hopefully results in a much simpler logic.
| Reporter | ||
Comment 7•5 years ago
|
||
Ok, I'm going to re-close this as fixed and open a new bug for removing the current edge filter (probably next year). We can file separate bugs for the specific cases that arise from that change as per standard triage meeting procedure and possibly reuse logic developed in the draft PR for that.
Updated•3 years ago
|
Description
•