Closed Bug 1339024 Opened 8 years ago Closed 6 years ago

Perform extra validation on s3 output dimensions

Categories

(Data Platform and Tools :: General, defect, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: trink)

Details

(Whiteboard: [DataOps])

We started receiving some "pretty weird" error messages today. Here's one such, in its decoded form:

:Uuid: b7d30d85-b5a4-4eac-a827-50e31bf97832
:Timestamp: 2017-02-13T08:00:08.910368768Z
:Type: telemetry.error
:Logger: telemetry
:Severity: 7
:Payload: <nil>
:EnvVersion: <nil>
:Pid: <nil>
:Hostname: ip-172-31-35-193
:Fields:
    | name: remote_addr type: 0 representation: <nil> value: X
    | name: uri type: 0 representation: <nil> value: /submit/telemetry/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/main/Firefox/${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}/release/20170125094131
    | name: args type: 0 representation: <nil> value: v=4
    | name: protocol type: 0 representation: <nil> value: HTTP/1.1
    | name: Content-Length type: 0 representation: <nil> value: 74
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: User-Agent type: 0 representation: <nil> value: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
    | name: content type: 1 representation: <nil> value: <?xml version="1.0" encoding="UTF-8" standalone="no"?>\x0d\x0a<root>"?"</root>\x0d\x0a
    | name: DecodeErrorType type: 0 representation: <nil> value: json
    | name: DecodeError type: 0 representation: <nil> value: invalid submission: failed to parse offset:0 Invalid value.
    | name: submissionDate type: 0 representation: <nil> value: 20170213
    | name: appVersion type: 0 representation: <nil> value: ${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"timeout","11"})).start()).getInputStream()))).readLine()}${(new java.io.BufferedReader(new java.io.InputStreamReader(((new java.lang.ProcessBuilder(new java.lang.String[]{"sleep","11"})).start()).getInputStream()))).readLine()}
    | name: Date type: 0 representation: <nil> value: Mon, 13 Feb 2017 07:19:24 GMT
    | name: appUpdateChannel type: 0 representation: <nil> value: release
    | name: documentId type: 0 representation: <nil> value: X
    | name: docType type: 0 representation: <nil> value: main
    | name: appBuildId type: 0 representation: <nil> value: 20170125094131
    | name: geoCountry type: 0 representation: <nil> value: X
    | name: sourceName type: 0 representation: <nil> value: telemetry
    | name: geoCity type: 0 representation: <nil> value: X
    | name: Host type: 0 representation: <nil> value: incoming.telemetry.mozilla.org
    | name: appName type: 0 representation: <nil> value: Firefox

This was obviously a garbage submission, but the interesting thing is that it caused our s3 errors output to fail, in the following way:

Feb 13 08:00:09 ip-172-31-36-227 hindsight[3583]: 1486972809359602906 [error] output.telemetry_errors_s3 terminated: process_message() run/output/moz_telemetry_s3.lua:167: /media/ephemeral0/split_out/telemetry-errors-3/20170213+telemetry+UNKNOWN+main+Firefox+release+___new_java.io.BufferedReader_new_java.io.InputStreamReader___new_java.lang.ProcessBuilder_new_java.lang.

This error was generated due to an assert when attempting to open a file. I'm not awake enough to debug this in full detail, but we should definitely be performing some sort of validation / replacement to prevent this from occurring.
I think if we see a dimension that doesn't match our expectations (similar to the file name sanitization logic in the s3 output), we should set that dimension to some fixed string like "UNKNOWN". As a workaround I've set the errors s3 output to match 'Fields[uri] !~ "java"', so for now we will simply drop the most egregiously broken submissions from the s3 errors stream.

On a related note, we have an implicit limit on total s3 object name size of 256, due to the filesystem limit on filename size. This should again not be an issue when dimensions are well-behaved, but it is possibly the root cause of the above issue. In the previous system we used actual directories instead of "replace + with / before uploading", but that required a separate cleanup cron for all the spurious directories created.
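A minimal sketch of what that per-dimension sanitization could look like in a Lua output plugin. The dimension list, allowed-character pattern, and length cap below are illustrative assumptions, not the actual moz_telemetry_s3.lua code:

    -- Hypothetical per-dimension sanitization; names, pattern, and limits
    -- are assumptions for illustration only.
    local dimensions = {"submissionDate", "sourceName", "docType", "appName",
                        "appUpdateChannel", "appVersion", "appBuildId"}

    local MAX_DIM_LEN = 32 -- keep the joined object name below the filesystem filename limit

    local function sanitize_dimension(v)
        if type(v) ~= "string" or v == "" then return "UNKNOWN" end
        -- allow only the characters the s3 file output already permits
        if #v <= MAX_DIM_LEN and v:match("^[%w._]+$") then
            return v
        end
        return "UNKNOWN"
    end

    local function object_name(fields)
        local parts = {}
        for _, d in ipairs(dimensions) do
            parts[#parts + 1] = sanitize_dimension(fields[d])
        end
        -- "+" separators are rewritten to "/" at upload time, as today
        return table.concat(parts, "+")
    end

With something like this in place, the injection strings and over-long values above would all collapse to "UNKNOWN" instead of producing an unopenable file name.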
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
The problem is actually more general than just decode errors, though it is even rarer for a ping to pass normal validation and still have total garbage in appBuildId or appVersion. But (of course) this just happened. I have more workarounds in place, but if I don't get to implementing the proper fix this week I'll have to move it to next sprint.
Whiteboard: [SvcOps]
Status: NEW → ASSIGNED
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Since workarounds are in place, this is getting pushed to next quarter.
Priority: P1 → P3
This bug was originally about the heka protobuf s3 output, but it extends to the parquet output as well. With the heka protobuf output we have a mechanism for specifying that unknown dimension values be bucketed into OTHER (which has its own problems), but we don't have this for parquet. As such, when using non-normalized dimensions such as appName for partitions, we can end up causing a couple of issues. A specific bad case is when there are single quotes in an appName, which won't result in a valid s3 object name. I have workarounds in place (message matchers), but a general mechanism to enforce a schema on the URL fields, beyond the basic structure we already require, may be a better way to go than fixing / unifying the heka and parquet output file/object name logic.
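One possible shape for that schema enforcement, applied to the URI-derived fields at decode time before they are ever used as dimensions. The field names and patterns here are illustrative assumptions, not the shipped decoder:

    -- Hypothetical per-field schema for URI-derived dimensions; the patterns
    -- are illustrative and stricter than the minimal structure we validate today.
    local uri_schema = {
        documentId       = "^[%x%-]+$",   -- UUID-ish
        docType          = "^[%w%-]+$",
        appName          = "^[%w%-]+$",   -- rejects quotes, "${...}" injections, etc.
        appVersion       = "^[%w.]+$",
        appUpdateChannel = "^[%w%-]+$",
        appBuildId       = "^%d+$",
    }

    local function validate_uri_fields(fields)
        for name, pattern in pairs(uri_schema) do
            local v = fields[name]
            if type(v) ~= "string" or not v:match(pattern) then
                return false, name
            end
        end
        return true
    end

    -- A failing message could be rejected outright, or have the offending
    -- field rewritten to "OTHER"/"UNKNOWN" before it reaches either output.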
Priority: P3 → P2
Summary: Perform extra validation on potential s3 dimensions when decoding errors → Perform extra validation on s3 output dimensions
Whiteboard: [SvcOps] → [DataOps]
Priority: P2 → P3
Assignee: whd → nobody
Status: ASSIGNED → NEW

Adding a proper fix, as the workaround still has some failure modes that we are hitting.

Assignee: nobody → mtrinkala
Priority: P3 → P1

Re Comment #3

The parquet output allows the full set of safe characters as described here (including the single quote): https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
whd: can you clarify what the specific issue is with the special characters, i.e. do we need to alter that set?

The s3 file output is more restrictive, allowing only a-z, A-Z, 0-9, period, and underscore.

Both sets are compatible with the Linux file system and S3. However, in the parquet case third-party tools/scripts would have to be careful with the special characters. I was considering standardizing the normalization, but we would have to relax the s3 file output, since otherwise we risk changing some of the parquet dimensions. That wouldn't buy us much, so I think we should leave it alone.
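For reference, the two policies described above could be expressed roughly as Lua patterns like the following. This is a sketch only; the exact sets (and the replacement character) live in the respective output plugins:

    -- s3 file output: a-z, A-Z, 0-9, period and underscore only;
    -- everything else is replaced (replacement character is an assumption).
    local function s3_file_safe(s)
        return (s:gsub("[^%w._]", "_"))
    end

    -- parquet output: the broader "safe characters" set from the S3 docs
    -- (adds hyphen, parentheses, single quote, exclamation mark, asterisk).
    local function parquet_safe(s)
        return (s:gsub("[^%w._%-()'!%*]", "_"))
    end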

Flags: needinfo?(whd)

(In reply to Mike Trinkala [:trink] from comment #5)

> Re Comment #3
>
> The parquet output allows the full set of safe characters as described here (including the single quote): https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
> whd: can you clarify what the specific issue is with the special characters, i.e. do we need to alter that set?

Nope, it turns out we're fine and the issue was only due to overly long file names before / after rename. We should be good with leaving the set as-is.

> I was considering standardizing the normalization, but we would have to relax the s3 file output, since otherwise we risk changing some of the parquet dimensions. That wouldn't buy us much, so I think we should leave it alone.

Agreed.

Flags: needinfo?(whd)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General