Perform extra validation on potential s3 dimensions when decoding errors

ASSIGNED
Assigned to

Status

Data Platform and Tools
Pipeline Ingestion
P3
normal
ASSIGNED
8 months ago
5 months ago

People

(Reporter: whd, Assigned: whd)

Tracking

Details

(Whiteboard: [SvcOps])

(Assignee)

Description

8 months ago
We started receiving some "pretty weird" error messages today. Here's one such, in its decoded form:

:Uuid: b7d30d85-b5a4-4eac-a827-50e31bf97832
:Timestamp: 2017-02-13T08:00:08.910368768Z
:Type: telemetry.error
:Logger: telemetry
:Severity: 7
:Payload: <nil>
:EnvVersion: <nil>
:Pid: <nil>
:Hostname: ip-172-31-35-193
:Fields:
    | name: remote_addr type: 0 representation: <nil> value: X
    | name: uri type: 0 representation: <nil> value: /submit/telemetry/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/main/Firefox/${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}/release/20170125094131
    | name: args type: 0 representation: <nil> value: v=4
    | name: protocol type: 0 representation: <nil> value: HTTP/1.1
    | name: Content-Length type: 0 representation: <nil> value: 74
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: User-Agent type: 0 representation: <nil> value: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
    | name: content type: 1 representation: <nil> value: <?xml version="1.0" encoding="UTF-8" standalone="no"?>\x0d\x0a<root>"?"</root>\x0d\x0a
    | name: DecodeErrorType type: 0 representation: <nil> value: json
    | name: DecodeError type: 0 representation: <nil> value: invalid submission: failed to parse offset:0 Invalid value.
    | name: submissionDate type: 0 representation: <nil> value: 20170213
    | name: appVersion type: 0 representation: <nil> value: ${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: appUpdateChannel type: 0 representation: <nil> value: release
    | name: documentId type: 0 representation: <nil> value: X
    | name: docType type: 0 representation: <nil> value: main
    | name: appBuildId type: 0 representation: <nil> value: 20170125094131
    | name: geoCountry type: 0 representation: <nil> value: X
    | name: sourceName type: 0 representation: <nil> value: telemetry
    | name: geoCity type: 0 representation: <nil> value: X
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: appName type: 0 representation: <nil> value: Firefox


This was obviously a garbage submission, but the interesting thing is that it caused our s3 errors output to fail, in the following way:

Feb 13 08:00:09 ip-172-31-36-227 hindsight[3583]: 1486972809359602906 [error] output.telemetry_errors_s3 terminated: process_message() run/output/moz_telemetry_s3.lua:167: /media/ephemeral0/split_out/telemetry-errors-3/20170213+telemetry+UNKNOWN+main+Firefox+release+___new_java.io.BufferedReader_new_java.io.InputStreamReader___new_java.lang.ProcessBuilder_new_java.lang.

This error was generated due to an assert when attempting to open a file. I'm not awake enough to debug this in full detail, but we should definitely be performing some sort of validation / replacement to prevent this from occurring. I think if we see a dimension that doesn't match our expectations (similar to the file name sanitation logic in the s3 output) we should set that dimension to some fixed string like "UNKNOWN". As a workaround I've set the errors s3 output to match 'Fields[uri] !~ "java"', so we will simply drop the most egregiously broken submissions from the s3 errors stream for now.

On a related note, we have an implicit limit on total s3 object name size of 256, due to the filesystem limit on filename size. This should again not be an issue when dimensions are well-behaved, but is possibly the root cause of the above issue. In the previous system we used actual directories instead of "replace + with / before uploading" but that required a separate cleanup cron for all the spurious directories created.

Updated

8 months ago
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
(Assignee)

Comment 1

8 months ago
The problem is actually more general than just errors decoding, but it is even rarer that a ping will pass normal validation and have total garbage in appBuildId or appVersion. But (of course) this just happened. I have more workarounds in place, but if I don't get to implementing the proper fix this week I'll have to move it to next sprint.

Updated

8 months ago
Whiteboard: [SvcOps]
(Assignee)

Updated

8 months ago
Status: NEW → ASSIGNED

Updated

6 months ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools

Comment 2

5 months ago
Since workarounds are in place, this is getting pushed to next quarter.
Priority: P1 → P3
You need to log in before you can comment on or make changes to this bug.