Closed Bug 1339834 Opened 8 years ago Closed 8 years ago

Crash aggregates ETL is failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Assigned: whd)

Details

Attachments

(1 file)

There seems to be a problem with the deserialization of some json fields.
Attached file Error stacktrace
This is pretty much the same error as the one found in bug 1339421
As discussed in bug 1339421, this is caused by a change in behavior of the new pipeline WRT invalid UTF8 strings, and I didn't catch it when testing jobs before the migration because it didn't happen for any of the days that I tested (including 4 months of longitudinal data). I am testing a fix that essentially disables utf8 checking in the scala bindings, which may fix the issue. In the worst case, if we continue to see more jobs failing, I can cut back to the heka stacks until this is fully sorted out.
The behavior of the current pipeline (no modification of the submission) is desired. Why would we revert back? 1) The proper fix is to prevent the client from sending crap in the first place 2) The infrastructure passes through what it gets with no modifications and it will end up in the good or bad pile as-is. 2a) If we add a new UTF-8 validation requirement that is acceptable and we will put those pings on the 'bad' pile 3 (this bug)) Consumers of the data have to be fault tolerant to anything that is not validated (and a majority of the data has no validation)
(In reply to Mike Trinkala [:trink] from comment #4) > The behavior of the current pipeline (no modification of the submission) is > desired. Why would we revert back? We would revert back until every scheduled job is verified to work on *all* historical data from the new infra before cutting over in order to prevent further data disruption to downstream consumers. > 1) The proper fix is to prevent the client from sending crap in the first > place This is infeasible, unfortunately. > 2a) If we add a new UTF-8 validation requirement that is acceptable and we > will put those pings on the 'bad' pile I am leaning towards this being how we should proceed. > 3 (this bug)) Consumers of the data have to be fault tolerant to anything > that is not validated (and a majority of the data has no validation) While I agree with this, I did my best to ensure that migrating to the new pipeline would not introduce any class of additional fault tolerance requirements that would require code changes beyond my integration PRs. I didn't understand how the new pipeline behaved in this edge case and it never came up in my tests.
I tested removing the UTF8 check and it appears to have done the trick. I filed https://github.com/mozilla/moztelemetry/pull/3/ with the fix. I didn't write tests for this, I only verified empirically that it caused crash aggregates to work for the day that had the failure (20170212). It presumably would also fix bug 1339421 but that already has a manual fix. I don't know what the actual coercion behavior is, only that it appears to work.
:whd I verified you patch fixed this bug. Thanks!
Assignee: nobody → whd
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: