Closed Bug 1465974 Opened 7 years ago Closed 7 years ago

Make HTTP Sink logs generally available and alertable

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: klukas)

Details

(Whiteboard: [DataPlatform])

Attachments

(2 files)

Currently the HTTP Sink logs failures of any kind, but we should really be alerting on those failures so we can keep an eye on the health of the Amplitude upload dependency. https://github.com/mozilla/telemetry-streaming/blob/master/src/main/scala/com/mozilla/telemetry/sinks/HTTPSink.scala
The easiest thing may be to send these events as statsd metrics and relay them to Datadog. There's an example config at https://github.com/akkomar/telemetry-streaming/blob/db_deploy_temp/databricks-deploy/update_error_aggregates_streaming_job.sh#L29-L31 (thanks Arkadiusz!). The Datadog agent is available on Databricks clusters, but it would need to be added to the EMR bootstrap if we want to run it there as well.
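As a rough sketch of what "send these events as statsd metrics" could look like from the Scala side: a tiny counter helper pointed at the local agent. Everything here (metric names, tags, the helper object) is hypothetical, and it assumes the java-dogstatsd-client 2.x API (com.datadoghq:java-dogstatsd-client) with a datadog-agent listening on localhost:8125; newer 3.x releases of that client use a builder instead of this constructor.

import com.timgroup.statsd.{NonBlockingStatsDClient, StatsDClient}

object HttpSinkMetrics {
  // prefix, agent host, agent port
  private val statsd: StatsDClient =
    new NonBlockingStatsDClient("telemetry_streaming.http_sink", "localhost", 8125)

  def recordSuccess(): Unit = statsd.incrementCounter("success")

  def recordRetry(): Unit = statsd.incrementCounter("retry")

  // Datadog-style tag appended so failures can be sliced by HTTP status.
  def recordError(status: Int): Unit =
    statsd.incrementCounter("error", s"status:$status")
}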
Points: --- → 2
Priority: -- → P2
I'm about to submit a PR today or tomorrow that includes a Datadog sink -- I didn't design it for sending custom job metrics, but we should be able to reuse/refactor it for this purpose.
Assignee: nobody → akomarzewski
Priority: P2 → P1
Status: NEW → ASSIGNED
Blocks: 1484276
Based on the discussion here, it sounds like the way forward is to get the Datadog agent running as part of emr-spark-bootstrap, then use the DogStatsDCounterSink to send metrics payloads to the agent where needed in the code.

Tutorial for adding EMR bootstrap steps to install Datadog: https://medium.com/lookout-engineering/publish-spark-streaming-system-and-application-metrics-from-aws-emr-to-datadog-part-2-6e9c60883d3a

One detail I need to figure out is how to provide the Datadog API key when bootstrapping clusters.
Assignee: akomarzewski → jklukas
I don't think DogStatsDCounterSink is the way to go here because it will force the introduction of another streaming query. Since I already spent some time on this, I'm going to continue with the approach based on Spark's metrics system in bug 1485583.
(In reply to akomarzewski from comment #4)
> I don't think DogStatsDCounterSink is the way to go here because it will
> force the introduction of another streaming query.

I was imagining we instantiate a DogStatsDCounterSink within the batch job and call its process method directly to send metrics, rather than setting up a streaming query. But integrating with Spark's metrics system is also fine and likely a more elegant long-term solution.
No longer blocks: 1484276
I'm picking up work again on https://github.com/mozilla/emr-bootstrap-spark/pull/473 We have a DataDog API key now that we can pull down and decrypt within telemetry.sh, so I'm starting to test sending metrics to DD from EMR clusters.
I'm realizing we have two options for sending data to statsd. In either case, we need to install the datadog-agent and have it running on the cluster. I have WIP code that accomplishes that already. Once the agent is installed, though, we can either configure Spark to periodically emit statsd metrics to the local agent, or we can configure the datadog-agent to collect metrics from Spark.

The advantage of configuring Spark to emit statsd metrics is that we have a clear migration path if we replace Datadog with some other metrics destination; we simply swap in a different statsd-compliant agent. The disadvantage is that Spark's StatsdSink doesn't know anything about Datadog's tagging extension to statsd, so we'll be sending metrics without tags. That means we would need a more complex metric prefix name in order to disambiguate between clusters; it's less user-friendly in the Datadog UI than tags and possibly consumes more of our Datadog metrics quota.

To be more specific, Spark metrics using Datadog's integration appear in Datadog with names like:

spark.jobs.count

and there are various tags like cluster_name, etc. that you can use to drill down to a given cluster or job of interest. Using Spark's StatsdSink, we'd likely need to set prefix=spark-emr.$cluster_name, meaning metrics will show up in Datadog with names like:

spark-emr.mycluster.jobs.count

It's unclear at the moment which of these two approaches is a better investment.
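For reference, the "Spark emits statsd" option comes down to metrics configuration along these lines. This is only a sketch: it assumes Spark 2.3+ (where org.apache.spark.metrics.sink.StatsdSink ships with Spark), that the spark.metrics.conf.* keys are set before the SparkContext starts, and the prefix value is just an example of the spark-emr.$cluster_name scheme mentioned above. The same keys could equally live in metrics.properties or be passed via spark-submit --conf.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("telemetry-streaming")
  // Route all metrics instances (*) to the built-in StatsdSink, pointed at the local agent.
  .config("spark.metrics.conf.*.sink.statsd.class",
          "org.apache.spark.metrics.sink.StatsdSink")
  .config("spark.metrics.conf.*.sink.statsd.host", "localhost")
  .config("spark.metrics.conf.*.sink.statsd.port", "8125")
  .config("spark.metrics.conf.*.sink.statsd.prefix", "spark-emr.mycluster")
  .getOrCreate()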
You're probably better off using the Datadog integration since, like you said, the tagging scheme it uses will make the data much more navigable throughout Datadog compared to what Spark would emit itself. If you create any custom metrics, don't be afraid to use Datadog's tagging extension to the statsd protocol. Tagging is commonly supported in monitoring tools now, so the worst case is that you might have to change the specific formatting of the tags in order to move off Datadog.
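For anyone unfamiliar with the extension being referenced: dogstatsd tags ride along on the plain statsd wire format as a trailing "|#tag:value,..." segment, which is why moving off Datadog would mostly be a tag-formatting change. A minimal illustration over UDP (metric name, tags, and agent address are all invented for this example):

import java.net.{DatagramPacket, DatagramSocket, InetAddress}

// A plain statsd counter is "name:1|c"; dogstatsd appends "|#tag:value,..." for tags.
val payload = "telemetry_streaming.http_sink.error:1|c|#cluster:amplitude-events,env:stage"
val bytes = payload.getBytes("UTF-8")
val socket = new DatagramSocket()
try {
  socket.send(new DatagramPacket(bytes, bytes.length, InetAddress.getByName("localhost"), 8125))
} finally {
  socket.close()
}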
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #8)
> (...) The disadvantage is that Spark's StatsdSink doesn't know anything
> about Datadog's tagging extension to statsd, so we'll be sending metrics
> without tags.

I think it's not a problem: although Spark doesn't understand Datadog tags, dogstatsd does. We can configure the agent to add job-specific tags (since we're running one job per cluster) to the metrics it's forwarding. IIRC we're doing something like this on Databricks; :whd will know where the script responsible for setting up tagging lives.
(In reply to akomarzewski from comment #10)
> I think it's not a problem: although Spark doesn't understand Datadog tags,
> dogstatsd does. We can configure the agent to add job-specific tags (since
> we're running one job per cluster) to the metrics it's forwarding.

I almost finished implementing the Datadog-specific way, and now I realize I should have read this more carefully on Friday. I missed that dogstatsd can be configured to send tags globally. That sounds great, and I will work on refactoring to use Spark's StatsdSink plus tags in statsd.yaml.
Now that I've implemented and tested both metrics styles, I have some commentary.

Datadog's Spark integration is a curated list of metrics that is smaller, but significantly easier to interpret and navigate. They have a CSV file listing all the metrics they publish [0]; they have nice names like:

spark.job.count
spark.stage.shuffle_write_bytes

In my first attempt at configuring Spark with StatsdSink, we got names like:

spark.application_1537369456804_0002.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime.mean_rate
spark.application_1537369456804_0002.1.executor.shuffleTotalBytesRead

In particular, ${spark.app.id} is the second part of the path there, which we definitely don't want. We can avoid that by setting spark.metrics.namespace to a static value, and the above become:

spark.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime.mean_rate
spark.1.executor.shuffleTotalBytesRead

So the names still aren't as friendly. The second piece of the name now indicates which node the metric is from, which would ideally be a tag instead of part of the path. That said, this now aligns with the StatsdSink metrics coming out of Databricks, which are named like:

telemetry_streaming.1.executor.shuffleTotalBytesRead

[0] https://github.com/DataDog/integrations-core/blob/master/spark/metadata.csv
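For the record, the namespace fix is a one-line addition to the StatsdSink configuration sketched earlier in this bug; spark.metrics.namespace defaults to ${spark.app.id}, so pinning it drops the per-run application id from every metric path. The value "telemetry_streaming" below mirrors the Databricks naming and is only an example.

val sparkStaticNs = SparkSession.builder()
  .appName("telemetry-streaming")
  // Replace the default ${spark.app.id} component of metric names with a static value.
  .config("spark.metrics.namespace", "telemetry_streaming")
  // ...plus the sink.statsd.* settings from the earlier sketch...
  .getOrCreate()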
Attachment #9003467 - Flags: review?(whd)
Attachment #9003467 - Flags: review?(hwoo)
Attachment #9003467 - Flags: review+
I'm testing some final fixups here in stage and will plan to deploy to prod on Monday morning.
The emr-bootstrap-spark change is deployed, so the `--metrics-provider` option is now available to use. Adding metrics to HttpSink in https://github.com/mozilla/telemetry-streaming/pull/182, which I will merge today. The final step should be a PR to telemetry-airflow adding an option to EMRSparkOperator to turn on metrics, and enabling that for Amplitude jobs. We should then have metrics flowing to Datadog for all nightly Amplitude jobs.
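Purely for illustration (this is not the code in PR 182), the outcomes reported later in this bug map naturally onto a send-with-retry loop along these lines, reusing the hypothetical HttpSinkMetrics helper sketched earlier; `send` stands in for the sink's HTTP POST and returns a status code.

def sendWithMetrics(payload: String, send: String => Int, maxRetries: Int = 3): Unit = {
  var attempt = 0
  var done = false
  while (!done) {
    val status = send(payload)
    if (status >= 200 && status < 300) {
      HttpSinkMetrics.recordSuccess()
      done = true
    } else if (attempt < maxRetries) {
      HttpSinkMetrics.recordRetry()
      attempt += 1
    } else {
      // Retries exhausted: the payload is dropped and the error counted.
      HttpSinkMetrics.recordError(status)
      done = true
    }
  }
}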
Attachment #9011836 - Flags: review?(fbertsch)
https://github.com/mozilla/telemetry-airflow/pull/352 is merged, so Amplitude jobs tonight should include metrics. I'll check tomorrow morning to ensure everything looks reasonable in Datadog and then look at configuring alerts.
Attachment #9011836 - Flags: review?(fbertsch) → review+
Metrics from last night's Amplitude jobs look reasonable. See the timeboard in Datadog: https://app.datadoghq.com/dash/929495/events-to-amplitude-httpsink-metrics?live=true&page=0&is_auto=false&from_ts=1537966593193&to_ts=1538052993193&tile_size=m

There's a high rate of successes, some retries, and zeros for dropped or errored messages.

I also created a Monitor and added myself and Frank as recipients: https://app.datadoghq.com/monitors#6466485/edit

That should send us emails if either of the Amplitude jobs shows a non-zero count of unrecovered errors or dropped messages. Calling this done!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General