As discussed in bug 1339421, this is caused by a change in behavior of the new pipeline WRT invalid UTF8 strings, and I didn't catch it when testing jobs before the migration because it didn't happen for any of the days that I tested (including 4 months of longitudinal data). I am testing a fix that essentially disables utf8 checking in the scala bindings, which may fix the issue. In the worst case, if we continue to see more jobs failing, I can cut back to the heka stacks until this is fully sorted out.

Mike Trinkala [:trink]

Comment 4

•

8 years ago

The behavior of the current pipeline (no modification of the submission) is desired. Why would we revert back? 1) The proper fix is to prevent the client from sending crap in the first place 2) The infrastructure passes through what it gets with no modifications and it will end up in the good or bad pile as-is. 2a) If we add a new UTF-8 validation requirement that is acceptable and we will put those pings on the 'bad' pile 3 (this bug)) Consumers of the data have to be fault tolerant to anything that is not validated (and a majority of the data has no validation)

Wesley Dawson [:whd]

Assignee

Comment 5

•

8 years ago

(In reply to Mike Trinkala [:trink] from comment #4) > The behavior of the current pipeline (no modification of the submission) is > desired. Why would we revert back? We would revert back until every scheduled job is verified to work on *all* historical data from the new infra before cutting over in order to prevent further data disruption to downstream consumers. > 1) The proper fix is to prevent the client from sending crap in the first > place This is infeasible, unfortunately. > 2a) If we add a new UTF-8 validation requirement that is acceptable and we > will put those pings on the 'bad' pile I am leaning towards this being how we should proceed. > 3 (this bug)) Consumers of the data have to be fault tolerant to anything > that is not validated (and a majority of the data has no validation) While I agree with this, I did my best to ensure that migrating to the new pipeline would not introduce any class of additional fault tolerance requirements that would require code changes beyond my integration PRs. I didn't understand how the new pipeline behaved in this edge case and it never came up in my tests.

Wesley Dawson [:whd]

Assignee

Comment 6

•

8 years ago

I tested removing the UTF8 check and it appears to have done the trick. I filed https://github.com/mozilla/moztelemetry/pull/3/ with the fix. I didn't write tests for this, I only verified empirically that it caused crash aggregates to work for the day that had the failure (20170212). It presumably would also fix bug 1339421 but that already has a manual fix. I don't know what the actual coercion behavior is, only that it appears to work.

Mauro Doglio [:mdoglio]

Reporter

Comment 7

•

8 years ago

:whd I verified you patch fixed this bug. Thanks!

Assignee: nobody → whd

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Crash aggregates ETL is failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Tracking

(Not tracked)

People

(Reporter: mdoglio, Assigned: whd)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Updated

Attachment

General

Description

File Name

Content Type