Closed
Bug 1339834
Opened 8 years ago
Closed 8 years ago
Crash aggregates ETL is failing
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mdoglio, Assigned: whd)
Details
Attachments
(1 file)
9.29 KB,
text/plain
There seems to be a problem with the deserialization of some JSON fields.
Reporter
Comment 1 • 8 years ago
Reporter
Comment 2 • 8 years ago
This is pretty much the same error as the one found in bug 1339421.
Assignee
Comment 3 • 8 years ago
As discussed in bug 1339421, this is caused by a change in how the new pipeline behaves with respect to invalid UTF-8 strings. I didn't catch it when testing jobs before the migration because it didn't occur on any of the days that I tested (including 4 months of longitudinal data).
I am testing a fix that essentially disables UTF-8 checking in the Scala bindings, which may fix the issue. In the worst case, if we continue to see more jobs failing, I can cut back to the Heka stacks until this is fully sorted out.
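A minimal sketch of the difference between strict and lenient UTF-8 handling, written in Python rather than the Scala bindings mentioned above. It is not the actual moztelemetry change; the payload and the function names `parse_strict` and `parse_lenient` are hypothetical and only illustrate why a validating decoder can fail a whole job on one bad byte.

```python
import json

# Hypothetical payload: a JSON document whose string field contains a byte
# (0xff) that can never appear in well-formed UTF-8.
RAW = b'{"reason": "crash\xff"}'


def parse_strict(raw: bytes) -> dict:
    """Decode with UTF-8 validation on; raises UnicodeDecodeError on bad bytes."""
    return json.loads(raw.decode("utf-8"))


def parse_lenient(raw: bytes) -> dict:
    """Decode without failing; each invalid byte is replaced with U+FFFD."""
    return json.loads(raw.decode("utf-8", errors="replace"))
```

With the strict variant a single malformed submission raises; with the lenient variant the record is kept, at the cost of a replacement character in the affected string.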
Comment 4 • 8 years ago
The behavior of the current pipeline (no modification of the submission) is desired. Why would we revert back?
1) The proper fix is to prevent the client from sending crap in the first place
2) The infrastructure passes through what it gets with no modifications and it will end up in the good or bad pile as-is.
2a) If we add a new UTF-8 validation requirement that is acceptable and we will put those pings on the 'bad' pile
3 (this bug)) Consumers of the data have to be fault tolerant to anything that is not validated (and a majority of the data has no validation)
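Point 3 can be sketched as a consumer that skips and counts records it cannot parse instead of failing the whole job. This is an illustration in Python under assumed inputs, not the crash aggregates code; `load_tolerantly` is a hypothetical name.

```python
import json
from typing import Iterable, List, Tuple


def load_tolerantly(raw_records: Iterable[bytes]) -> Tuple[List[dict], int]:
    """Parse what can be parsed; count and skip the rest."""
    parsed: List[dict] = []
    skipped = 0
    for raw in raw_records:
        try:
            parsed.append(json.loads(raw.decode("utf-8")))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Unvalidated input: invalid UTF-8 or malformed JSON.
            skipped += 1
    return parsed, skipped
```

Returning the skip count lets the job report how much data was dropped rather than hiding it.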
Assignee
Comment 5 • 8 years ago
(In reply to Mike Trinkala [:trink] from comment #4)
> The behavior of the current pipeline (no modification of the submission) is
> desired. Why would we revert back?
We would revert until every scheduled job is verified to work on *all* historical data from the new infra, and only then cut over, in order to prevent further data disruption to downstream consumers.
> 1) The proper fix is to prevent the client from sending crap in the first
> place
This is infeasible, unfortunately.
> 2a) If we add a new UTF-8 validation requirement that is acceptable and we
> will put those pings on the 'bad' pile
I am leaning towards this being how we should proceed.
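A rough sketch of what option 2a could look like: submissions are passed through unmodified, and only routed to a 'good' or 'bad' pile depending on whether they are valid UTF-8. This is Python for illustration only; `is_valid_utf8` and `split_piles` are hypothetical names, not part of the pipeline.

```python
from typing import Iterable, List, Tuple


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_piles(submissions: Iterable[bytes]) -> Tuple[List[bytes], List[bytes]]:
    """Route each submission, unmodified, to the good or bad pile."""
    good: List[bytes] = []
    bad: List[bytes] = []
    for raw in submissions:
        (good if is_valid_utf8(raw) else bad).append(raw)
    return good, bad
```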
> 3 (this bug)) Consumers of the data have to be fault tolerant to anything
> that is not validated (and a majority of the data has no validation)
While I agree with this, I did my best to ensure that migrating to the new pipeline would not introduce any class of additional fault tolerance requirements that would require code changes beyond my integration PRs. I didn't understand how the new pipeline behaved in this edge case and it never came up in my tests.
Assignee
Comment 6 • 8 years ago
I tested removing the UTF-8 check and it appears to have done the trick. I filed https://github.com/mozilla/moztelemetry/pull/3/ with the fix.
I didn't write tests for this; I only verified empirically that it made crash aggregates work for the day that had the failure (20170212). It would presumably also fix bug 1339421, but that already has a manual fix. I don't know what the actual coercion behavior is, only that it appears to work.
Reporter
Comment 7 • 8 years ago
:whd I verified your patch fixed this bug. Thanks!
Assignee: nobody → whd
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Cloud Services → Cloud Services Graveyard