Main summary ETL job is failing

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
RESOLVED FIXED
8 months ago
8 months ago

People

(Reporter: rvitillo, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

User Story

The ETL job for the main_summary dataset started failing yesterday with the following error:

7/02/14 05:39:20 ERROR TaskSetManager: Task 489 in stage 0.0 failed 4 times; aborting job
17/02/14 05:39:20 INFO YarnScheduler: Cancelling stage 0
17/02/14 05:39:20 INFO YarnScheduler: Stage 0 was cancelled
17/02/14 05:39:20 INFO DAGScheduler: ShuffleMapStage 0 (parquet at MainSummaryView.scala:128) failed in 15840.848 s
17/02/14 05:39:20 INFO DAGScheduler: Job 0 failed: parquet at MainSummaryView.scala:128, took 15841.164087 s
17/02/14 05:39:20 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 489 in stage 0.0 failed 4 times, most recent failure: Lost task 489.3 in stage 0.0 (TID 574, ip-172-31-42-181.us-west-2.compute.internal): com.fasterxml.jacksĀ»
 at [Source: ; line: 1, column: 1]
        at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
        at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3781)
        at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
        at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
        at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
        at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
        at com.mozilla.telemetry.views.MainSummaryView$.histograms$lzycompute$1(MainSummaryView.scala:278)
        at com.mozilla.telemetry.views.MainSummaryView$.histograms$1(MainSummaryView.scala:278)
        at com.mozilla.telemetry.views.MainSummaryView$.weaveConfigured$lzycompute$1(MainSummaryView.scala:281)
        at com.mozilla.telemetry.views.MainSummaryView$.weaveConfigured$1(MainSummaryView.scala:281)
        at com.mozilla.telemetry.views.MainSummaryView$.messageToRow(MainSummaryView.scala:382)
        at com.mozilla.telemetry.views.MainSummaryView$$anonfun$main$1$$anonfun$14.apply(MainSummaryView.scala:96)
        at com.mozilla.telemetry.views.MainSummaryView$$anonfun$main$1$$anonfun$14.apply(MainSummaryView.scala:95)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Attachments

(1 attachment)

PR
56 bytes, text/x-github-pull-request
sunahsuh
: review+
Details | Review | Splinter Review
Comment hidden (empty)
(Reporter)

Updated

8 months ago
Assignee: nobody → rvitillo
Priority: -- → P1
(Reporter)

Comment 1

8 months ago
Created attachment 8837130 [details] [review]
PR
Attachment #8837130 - Flags: review?(ssuh)
(Reporter)

Updated

8 months ago
Severity: normal → blocker
Attachment #8837130 - Flags: review?(ssuh) → review+
(Reporter)

Comment 2

8 months ago
Wesley, it appears that the ETL job failed to parse a JSON field that Hindsight emits (payload.histograms) because it contained non valid data. Do you have any ideas about what could have caused this?
Flags: needinfo?(whd)
I suspect an invalid UTF-8 character in the client submission, since we just pass strings through as-is. The old cjson encoder didn't validate UTF-8 either but it did do some escaping.

----
Lua CJSON will escape the following characters within each UTF-8 string:

    Control characters (ASCII 0 - 31)

    Double quote (ASCII 34)

    Forward slash (ASCII 47)

    Blackslash (ASCII 92)

    Delete (ASCII 127)

All other bytes are passed transparently.
----

Comment 4

8 months ago
I found the submission that is causing the failure, in:

s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-3/20170213/telemetry/4/main/Firefox/release/51.0.1/20170125094131/1487009062_0_ip-172-31-4-252.gz

Here's the lsb_heka_cat of payload.histograms for the particular failing submission (doc id "88fd3409-2896-41bf-8359-c0123e73130b"):

      | name: payload.histograms type: 1 representation: <nil> value: {"SSL_OBSERVED_END_ENTITY_CERTIFICATE_LIFETIME":{"range":[1,125],"bucket_count":126,"histogram_type":1,"values":{"10":0,"11":3,"36":2,"50":3,"52":81,"61":1,"65":9,"104":2,"105":24,"106":0},"sum":7841},"UPDATE_CHECK_NO_UPDATE_NOTIFY":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1},"UPDATE_LAST_NOTIFY_INTERVAL_DAYS_NOTIFY":{"range":[1,180],"bucket_count":30,"histogram_type":0,"values":{"0":0,"1":1,"2":0},"sum":1},"UPDATE_PING_COUNT_NOTIFY":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1},"UPDATE_SERVICE_INSTALLED_NOTIFY":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":0,"1":1,"2":0},"sum":1},"UPDATE_NOT_PREF_UPDATE_SERVICE_ENABLED_NOTIFY":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1},"FX_TAB_SWITCH_TOTAL_E10S_MS":{"range":[1,1000],"bucket_count":20,"histogram_type":0,"values":{"0":4,"1":1,"19":3,"27":5,"39":9,"56":2,"80":1,"115":0},"sum":894},"TELEMETRY_PING_SUBMISSION_WAITING_CLIENTID":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":2,"1":0},"sum":2},"FIPS_ENABLED":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":1,"1":0},"sum":0},"SEARCH_SERVICE_ENGINE_COUNT":{"range":[1,200],"bucket_count":50,"histogram_type":1,"values":{"1":0,"5":1,"9":0},"sum":8},"SEARCH_SERVICE_HAS_UPDATES":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":1,"1":0},"sum":0},"SEA\xf6CH_SERVICE_HAS_ICON_UPDATES":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":1,"1":0},"sum":0},"VIDEO_CAN_CREATE_AAC_DECODER":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":0,"1":1,"2":0},"sum":1},"VIDEO_CAN_CREATE_H264_DECODER":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":0,"1":1,"2":0},"sum":1},"BROWSER_IS_USER_DEFAULT":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":0,"1":1,"2":0},"sum":1},"BROWSER_IS_USER_DEFAULT_ERROR":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":1,"1":0},"sum":0},"BROWSER_SET_DEFAULT_DIALOG_PROMPT_RAWCOUNT":{"range":[1,250],"bucket_count":15,"histogram_type":0,"values":{"0":1,"1":0},"sum":0},"BROWSER_SET_DEFAULT_ALWAYS_CHECK":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":0,"1":1,"2":0},"sum":1},"COOKIE_SCHEME_SECURITY":{"range":[1,10],"bucket_count":11,"histogram_type":1,"values":{"0":19,"1":607,"3":2,"4":0},"sum":613},"WEAVE_CONFIGURED":{"range":[1,2],"bucket_count":3,"histogram_type":2,"values":{"0":1,"1":0},"sum":0},"E10S_STATUS":{"range":[1,12],"bucket_count":13,"histogram_type":1,"values":{"0":0,"1":1,"2":0},"sum":1},"GRAPHICS_DRIVER_STARTUP_TEST":{"range":[1,20],"bucket_count":21,"histogram_type":1,"values":{"0":1,"1":0},"sum":0},"FXA_CONFIGURED":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":1,"1":0},"sum":0},"YOUTUBE_REWRITABLE_EMBED_SEEN":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":1,"1":0},"sum":0},"YOUTUBE_NONREWRITABLE_EMBED_SEEN":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":1,"1":0},"sum":0},"FX_CONTENT_CRASH_PRESENTED":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1}}

If you look closely this indeed appears to be UTF8 corruption for the key "SEA\xf6CH_SERVICE_HAS_ICON_UPDATES" (lsb_heka_cat escapes the output).

I didn't look at the heka-generated version of this document but I assume it has the escaping as per :trink's above comment.

We should decide what we want to do in this case. In my head it makes sense to throw this data away because it doesn't conform to the data format we're expecting (UTF8), but perhaps doing what the old pipeline did and escaping it makes sense too.

:trink, what are the options available to us for handling this with rjson? I believe in IRC you mentioned there might be a mode to perform strict UTF8 checking (that is slower, but may be worth it for us).
Flags: needinfo?(whd)

Comment 5

8 months ago
One additional note: because of the way the I implemented support for bytes fields in the scala bindings [1] the invalid UTF8 is propagated as "", and so the actual error above is scala blowing up trying to parse "". It may well be that the JSON parser can parse the invalid UTF8, but I wasn't aware that this would ever occur.

[1] https://github.com/mozilla/moztelemetry/blob/master/src/main/scala/com/mozilla/telemetry/heka/package.scala#L20-L24
> :trink, what are the options available to us for handling this with rjson? I
> believe in IRC you mentioned there might be a mode to perform strict UTF8
> checking (that is slower, but may be worth it for us).

Strict UTF-8 validation or simple escaping are both possible; However, I would opt for simply passing it through without any alteration and having the down stream deal with it (if we fix the scala I am betting this will parse).  The only other option we should consider is throwing the submission into the error stream  (harsh but correct, cc'ing Sam into this discussion).

Comment 7

8 months ago
(In reply to Mike Trinkala [:trink] from comment #6)
> > :trink, what are the options available to us for handling this with rjson? I
> > believe in IRC you mentioned there might be a mode to perform strict UTF8
> > checking (that is slower, but may be worth it for us).
> 
> Strict UTF-8 validation or simple escaping are both possible; However, I
> would opt for simply passing it through without any alteration and having
> the down stream deal with it (if we fix the scala I am betting this will
> parse).  The only other option we should consider is throwing the submission
> into the error stream  (harsh but correct, cc'ing Sam into this discussion).

Let me see if I can frame this issue slightly differently. We are presented with a UTF8 decoding issue that breaks the pipeline, so we have to fix the pipeline. Fine as far as it goes. But what is the underlying cause? I believe that we have a single clearly corrupted data structure: "SEA\xf6CH_SERVICE_HAS_ICON_UPDATES" 1) is our string, not a user-supplied string 2) obviously should read "SEARCH_SERVICE_HAS_ICON_UPDATES" So, a byte that should have had the decimal value of 82 was garbled somehow to read 246. My inclination would be to say we need two responses:

1) Prevent the pipeline from breaking.

2) Apply the policy for handling pings with single invalid fields. In other words, treat it the same way we treat a negative counter.

Right now 2) is "send to the error stream" IIRC. Soon I hope it will be smarter than that, but we are developing that part.

I'm not sure I have all this right -- please correct / object.

Comment 8

8 months ago
(In reply to Mike Trinkala [:trink] from comment #6)
> Strict UTF-8 validation or simple escaping are both possible; However, I
> would opt for simply passing it through without any alteration and having
> the down stream deal with it (if we fix the scala I am betting this will
> parse).

:rvitillo, how do you feel about this? It's caused another job to fail in bug #1339834. I don't think it's reasonable to have to update all existing jobs to conditionally check that they are receiving JSON, as this is in part why we are doing validation when decoding the data.

I am testing whether simply removing the "valid utf8" check will fix things in the short term. This means we would send non-utf8 bytes fields (e.g. a hll) as stringified (garbage) unless we update the scala api to support an optional "don't coerce" or similar type option to field(). However, this (non-json bytes fields) is a use case we don't presently have or support anyway, and I would prefer having downstream jobs work.
(Reporter)

Comment 9

8 months ago
(In reply to Wesley Dawson [:whd] from comment #8)
> (In reply to Mike Trinkala [:trink] from comment #6)
> > Strict UTF-8 validation or simple escaping are both possible; However, I
> > would opt for simply passing it through without any alteration and having
> > the down stream deal with it (if we fix the scala I am betting this will
> > parse).
> 
> :rvitillo, how do you feel about this? It's caused another job to fail in
> bug #1339834. I don't think it's reasonable to have to update all existing
> jobs to conditionally check that they are receiving JSON, as this is in part
> why we are doing validation when decoding the data.
> 
> I am testing whether simply removing the "valid utf8" check will fix things
> in the short term. This means we would send non-utf8 bytes fields (e.g. a
> hll) as stringified (garbage) unless we update the scala api to support an
> optional "don't coerce" or similar type option to field(). However, this
> (non-json bytes fields) is a use case we don't presently have or support
> anyway, and I would prefer having downstream jobs work.

I agree with the general idea that downstream jobs should be resilient to dirty data. That being said, breaking an implicit contract we had with our downstream users without announcing it first isn't great. We don't own/control all downstream jobs and this change might have repercussions on other projects and teams. Because of that I am leaning towards guaranteeing backward compatibility with what we had before. That doesn't mean we can't break interfaces; it merely means such changes should be announced beforehand.
The old behavior is a bug like the conversion of arrays to objects caused by the same library. I don't believe the bug should be re-introduced as the 'fix'.  If people don't want to add basic error handling in their downstream consumers then the best option is to have the ingestion put all invalid UTF-8 pings in the error stream and monitor the error rates.  If this acceptable I will drop some sprint work to get it in today.
(Reporter)

Comment 11

8 months ago
(In reply to Mike Trinkala [:trink] from comment #10)
> The old behavior is a bug like the conversion of arrays to objects caused by
> the same library. I don't believe the bug should be re-introduced as the
> 'fix'.  
> If people don't want to add basic error handling in their downstream
> consumers 

Our users can add basic error handling if you give them time to do it before their jobs start failing :) 

> then the best option is to have the ingestion put all invalid
> UTF-8 pings in the error stream and monitor the error rates.  If this
> acceptable I will drop some sprint work to get it in today.

That's reasonable, thanks.
Also in this case the invalid character in question is not filtered by the old library so it is not a change in behavior as it would have always failed downstream.

Comment 13

8 months ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> Our users can add basic error handling if you give them time to do it before their jobs start failing :) 

Again, this stems from my assuming after a fair amount of manual testing that no new classes of errors would be introduced by the new pipeline. I spent a long time doing this, but obviously didn't catch everything. In the future with better end-to-end testing this validation will in theory be easier to perform and not necessarily my responsibility.

(In reply to Mike Trinkala [:trink] from comment #12)
> Also in this case the invalid character in question is not filtered by the
> old library so it is not a change in behavior as it would have always failed downstream.

This is because of the way bytes support was implemented, where I assumed that the new pipeline would produce valid UTF8 strings. Essentially the behind-the-scenes integration to make the new data look like the old data was new code that I wrote and it didn't handle some edge cases correctly.

> then the best option is to have the ingestion put all invalid
> UTF-8 pings in the error stream and monitor the error rates.  If this
> acceptable I will drop some sprint work to get it in today.

+1
Update: unfortunately I am unable to add this new behavior today as it doesn't appear to be supported for insitu parsing https://github.com/miloyip/rapidjson/issues/857.  Insitu parsing is used on all the gzipped pings which are the majority of our submissions.
(Reporter)

Updated

8 months ago
Assignee: rvitillo → nobody
Severity: blocker → normal
Priority: P1 → --

Comment 15

8 months ago
:trink was able to add UTF8 validation per https://github.com/mozilla-services/lua_sandbox_extensions/issues/75, and I've deployed it to the DWL nodes. It's a rare event but we should eventually see it show up in our errors monitoring. As a result of the stricter ingestion rules (above link), additional downstream error checking (https://github.com/mozilla/telemetry-batch-view/pull/176), and more relaxed downstream parsing (https://github.com/mozilla/moztelemetry/pull/3), main summary ETL is no longer failing and I'm calling this fixed.
Status: NEW → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.