Perform extra validation on s3 output dimensions
Status: NEW · Unassigned
Priority: P3 · Severity: normal
Opened: 2 years ago · Updated: 5 months ago
Reporter: whd
Whiteboard: [DataOps]

(Reporter)
Description

2 years ago
We started receiving some "pretty weird" error messages today. Here's one such, in its decoded form:

:Uuid: b7d30d85-b5a4-4eac-a827-50e31bf97832
:Timestamp: 2017-02-13T08:00:08.910368768Z
:Type: telemetry.error
:Logger: telemetry
:Severity: 7
:Payload: <nil>
:EnvVersion: <nil>
:Pid: <nil>
:Hostname: ip-172-31-35-193
:Fields:
    | name: remote_addr type: 0 representation: <nil> value: X
    | name: uri type: 0 representation: <nil> value: /submit/telemetry/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/main/Firefox/${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}/release/20170125094131
    | name: args type: 0 representation: <nil> value: v=4
    | name: protocol type: 0 representation: <nil> value: HTTP/1.1
    | name: Content-Length type: 0 representation: <nil> value: 74
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: User-Agent type: 0 representation: <nil> value: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
    | name: content type: 1 representation: <nil> value: <?xml version="1.0" encoding="UTF-8" standalone="no"?>\x0d\x0a<root>"?"</root>\x0d\x0a
    | name: DecodeErrorType type: 0 representation: <nil> value: json
    | name: DecodeError type: 0 representation: <nil> value: invalid submission: failed to parse offset:0 Invalid value.
    | name: submissionDate type: 0 representation: <nil> value: 20170213
    | name: appVersion type: 0 representation: <nil> value: ${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: appUpdateChannel type: 0 representation: <nil> value: release
    | name: documentId type: 0 representation: <nil> value: X
    | name: docType type: 0 representation: <nil> value: main
    | name: appBuildId type: 0 representation: <nil> value: 20170125094131
    | name: geoCountry type: 0 representation: <nil> value: X
    | name: sourceName type: 0 representation: <nil> value: telemetry
    | name: geoCity type: 0 representation: <nil> value: X
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: appName type: 0 representation: <nil> value: Firefox


This was obviously a garbage submission, but the interesting thing is that it caused our s3 errors output to fail, in the following way:

Feb 13 08:00:09 ip-172-31-36-227 hindsight[3583]: 1486972809359602906 [error] output.telemetry_errors_s3 terminated: process_message() run/output/moz_telemetry_s3.lua:167: /media/ephemeral0/split_out/telemetry-errors-3/20170213+telemetry+UNKNOWN+main+Firefox+release+___new_java.io.BufferedReader_new_java.io.InputStreamReader___new_java.lang.ProcessBuilder_new_java.lang.

This error was generated by an assert when attempting to open a file. I'm not awake enough to debug this in full detail, but we should definitely perform some sort of validation/replacement to prevent it from recurring. If we see a dimension that doesn't match our expectations (similar to the file name sanitization logic in the s3 output), we should set that dimension to a fixed string like "UNKNOWN". As a workaround I've set the errors s3 output to match 'Fields[uri] !~ "java"', so for now we simply drop the most egregiously broken submissions from the s3 errors stream.
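The proposed replacement logic might look like the following sketch (in Python for illustration, though the actual output plugin is Lua; the allowlist pattern and length cap are assumptions, not the deployed sanitization rules):

```python
import re

# Assumed allowlist of characters safe in an s3 dimension value;
# the real sanitization lives in the Lua s3 output plugin.
SAFE_DIMENSION = re.compile(r"^[A-Za-z0-9_.-]{1,100}$")

def sanitize_dimension(value):
    """Return the dimension unchanged if it is well-behaved,
    otherwise collapse it to the fixed string "UNKNOWN"."""
    if value is not None and SAFE_DIMENSION.match(value):
        return value
    return "UNKNOWN"
```

With this in place, the injection payload in appVersion above would be written out under the UNKNOWN bucket instead of producing an unopenable filename.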

On a related note, we have an implicit limit of 256 characters on total s3 object name size, due to the filesystem limit on filename length. This again should not be an issue when dimensions are well-behaved, but it is possibly the root cause of the above failure. The previous system used actual directories instead of "replace + with / before uploading", but that required a separate cleanup cron job for all the spurious directories it created.
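A minimal sketch of guarding that implicit limit, assuming dimensions are joined with '+' into a single flat local filename and the uploader later rewrites '+' to '/' to form the s3 key (the helper name and truncation policy are hypothetical):

```python
# Implicit cap: the flat local split file name hits the filesystem's
# filename-length limit, which bounds the eventual s3 object name.
MAX_OBJECT_NAME = 256

def build_object_name(dimensions):
    """Join dimension values with '+' for the local split file.
    Truncate rather than assert when the joined name would exceed
    the filesystem limit (hypothetical policy for this sketch)."""
    name = "+".join(dimensions)
    if len(name) > MAX_OBJECT_NAME:
        name = name[:MAX_OBJECT_NAME]
    return name
```

Truncation is just one option; rejecting the message or hashing the overlong suffix would also keep the open() from asserting.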

Updated

2 years ago
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
(Reporter)

Comment 1

2 years ago
The problem is actually more general than just decoding errors, though it is even rarer for a ping to pass normal validation while having total garbage in appBuildId or appVersion. But (of course) this just happened. I have more workarounds in place, but if I don't get to implementing the proper fix this week I'll have to move it to the next sprint.

Updated

2 years ago
Whiteboard: [SvcOps]
(Reporter)

Updated

2 years ago
Status: NEW → ASSIGNED

Updated

2 years ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools

Comment 2

2 years ago
Since workarounds are in place, this is getting pushed to next quarter.
Priority: P1 → P3
(Reporter)

Comment 3

10 months ago
This bug was originally filed for the heka protobuf s3 output, but it extends to the parquet output as well. With the heka protobuf output we have a mechanism for specifying that unknown dimension values be bucketed into OTHER (which has its own problems), but we have no equivalent for parquet.

As such, using non-normalized dimensions such as appName for partitions can cause a couple of problems. A specific bad case is an appName containing single quotes, which does not produce a valid s3 object name.

I have workarounds in place (message matchers), but a general mechanism for enforcing a schema on URL fields beyond the basic structure we require may be a better approach than fixing/unifying the heka and parquet output file/object-name logic.
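The kind of URL-field schema enforcement suggested above could be sketched as per-field patterns keyed on the submission URL structure seen in the description (/submit/telemetry/&lt;documentId&gt;/&lt;docType&gt;/&lt;appName&gt;/&lt;appVersion&gt;/&lt;appUpdateChannel&gt;/&lt;appBuildId&gt;); the patterns here are illustrative assumptions, not a deployed schema:

```python
import re

# Assumed per-field patterns for the telemetry submission URL fields;
# illustrative only, not the schema actually enforced in production.
URI_FIELD_PATTERNS = {
    "documentId": re.compile(r"^[0-9a-fA-F-]{36}$"),
    "docType": re.compile(r"^[A-Za-z0-9-]{1,64}$"),
    "appName": re.compile(r"^[A-Za-z0-9_-]{1,64}$"),
    "appVersion": re.compile(r"^[A-Za-z0-9_.-]{1,32}$"),
    "appUpdateChannel": re.compile(r"^[A-Za-z0-9_-]{1,32}$"),
    "appBuildId": re.compile(r"^[0-9]{8,14}$"),
}

def validate_uri_fields(fields):
    """Return the names of fields whose values fail their pattern;
    an empty list means the submission's URL fields look sane."""
    return [name for name, pat in URI_FIELD_PATTERNS.items()
            if not pat.match(fields.get(name, ""))]
```

Rejecting (or bucketing) on the result of such a check up front would cover both the protobuf and parquet outputs without touching their separate file/object-name logic.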
Priority: P3 → P2
Summary: Perform extra validation on potential s3 dimensions when decoding errors → Perform extra validation on s3 output dimensions

Updated

10 months ago
Whiteboard: [SvcOps] → [DataOps]
(Reporter)

Updated

9 months ago
Priority: P2 → P3
(Reporter)

Updated

5 months ago
Assignee: whd → nobody
Status: ASSIGNED → NEW