Closed Bug 1622997 Opened 5 years ago Closed 4 years ago

Consider routing non-decoder beam errors to pubsub and bigquery

Categories

(Data Platform and Tools Graveyard :: Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: whd)

Details

Investigating bug #1622977 reminded me that there was a notion to publish Beam errors for non-decoder jobs to a single place for monitoring purposes. We may already have a bug or issue for this, but I couldn't find one on a cursory inspection.

Having this data routed to pubsub/bq would have enabled us to catch the aforementioned bug more quickly, since in theory this class of errors shouldn't exist (https://github.com/mozilla/gcp-ingestion/issues/667) and we should always be concerned when error counts here are nonzero.

Some considerations around failures when writing the errors themselves have been discussed, namely that writing errors to GCS was originally deemed more reliable than pubsub/bigquery because it has fewer inherent limits on e.g. payload size. However, in the course of our usage of GCP and validation for the prod cutover, we actually determined that GCS-based sinks were the least reliable, so that point is probably moot now.
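To make the payload size concern concrete, here is a minimal sketch of how an error could be packaged for a pubsub-style sink while guarding against oversized payloads. It is not taken from gcp-ingestion: the function name, the attribute names, and the size cap are all hypothetical, and the real limit depends on the sink being written to.

```python
# Hypothetical size cap in bytes; the actual limit depends on the sink.
MAX_PAYLOAD_BYTES = 1_000_000


def build_error_record(job_name, error_message, payload,
                       max_payload_bytes=MAX_PAYLOAD_BYTES):
    """Return (data, attributes) for one error, truncating oversized payloads.

    `payload` is the raw bytes of the message that failed. The attribute
    names below are illustrative assumptions, not an established schema.
    """
    truncated = len(payload) > max_payload_bytes
    data = payload[:max_payload_bytes]
    # Attributes are kept as string key/value pairs, as pubsub requires.
    attributes = {
        "job_name": job_name,
        "error_message": error_message,
        "payload_truncated": "true" if truncated else "false",
        "original_payload_bytes": str(len(payload)),
    }
    return data, attributes
```

Recording the original size alongside a truncation flag means a monitoring query can still count every error even when the full payload could not be written.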

In the upcoming post-beam-sink world, errors from only e.g. the republisher are not very interesting, but with new jobs like amplitude streaming we should perhaps still consider this work. As of now I would consider it low priority.

We provisioned a monitoring dataset for data like this, but we'd need to consider the access controls around where this data is ultimately published.

Assignee: nobody → whd

We ended up publishing errors for contextual services diagnostic output in a manner similar to what's described here, and we can continue to use the pattern going forward. I still plan to revisit e.g. existing republisher error output, but that will be more directly for deprecating the relevant GCS buckets than for a general policy on error reporting.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard