Closed Bug 1652789 Opened 4 years ago Closed 3 years ago

Days-old records written into telemetry_live on 2020-07-14

Categories: Data Platform and Tools Graveyard :: Operations (defect)

Tracking: Not tracked

Status: RESOLVED FIXED

Reporter: klukas
Assignee: Unassigned

:benwu saw that we missed a few doc_ids in each of 8 stable tables last night: https://datastudio.google.com/u/0/reporting/1pZ6tT1LuQTgqtQcyO_HnhuwT5UyNm9Pi/page/54pIB

The following query shows that partitions for the last several days in main_v4 were all modified at 2020-07-14 06:54:47.826 UTC:

#legacySQL
SELECT
  partition_id,
  last_modified_timestamp
FROM
  [moz-fx-data-shared-prod:telemetry_live.main_v4$__PARTITIONS_SUMMARY__]
ORDER BY
  partition_id DESC
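
For reference, a roughly equivalent standard SQL sketch, assuming the INFORMATION_SCHEMA.PARTITIONS view is available for this dataset:

-- Same partition metadata via standard SQL; last_modified_time corresponds
-- to last_modified_timestamp in the legacy partitions summary.
SELECT
  partition_id,
  last_modified_time
FROM
  `moz-fx-data-shared-prod.telemetry_live.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'main_v4'
ORDER BY
  partition_id DESC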

We generally expect that partitions in live tables will have their last modification well before 01:00 UTC of the following day, so this is strange behavior and we want to understand what happened.

Let's investigate sync_v4, which is a smaller table than main_v4 but had 2 missing documents. We can pinpoint the late-arriving documents via the following script, which does an EXCEPT between doc_id lists at two points in time: https://sql.telemetry.mozilla.org/queries/72804/source
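
The linked script isn't reproduced here, but a minimal sketch of the idea, assuming the standard document_id/submission_timestamp columns and BigQuery time travel, with dates and timestamps that are illustrative only:

-- document_ids present in the partition now but absent from a snapshot
-- of the table taken before the late writes landed.
SELECT
  document_id
FROM
  `moz-fx-data-shared-prod.telemetry_live.sync_v4`
WHERE
  DATE(submission_timestamp) = '2020-07-13'  -- example partition
EXCEPT DISTINCT
SELECT
  document_id
FROM
  `moz-fx-data-shared-prod.telemetry_live.sync_v4`
  FOR SYSTEM_TIME AS OF TIMESTAMP '2020-07-14 06:00:00 UTC'  -- snapshot before the late writes
WHERE
  DATE(submission_timestamp) = '2020-07-13'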

The late-arriving documents are:

  • c2d4f03b-dc2f-45b9-9592-5b2db05e8470
  • 5f91e311-1088-4a7c-b3db-a5838eae40c8

Looking further back to the 2020-07-10 partition, main_v4 has 15 documents that arrived late, presumably all this morning.

This query shows they all have submission_timestamp grouped within a second of each other near 20:34:49 UTC: https://sql.telemetry.mozilla.org/queries/72805/source
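
A sketch of that kind of check (not the linked query; it assumes the client Date header is surfaced at metadata.header.date and uses a placeholder for the late doc_id list):

SELECT
  document_id,
  submission_timestamp,
  metadata.header.date AS client_date_header  -- Date header sent by the client
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) = '2020-07-10'
  -- placeholder for the 15 late-arriving document_ids
  AND document_id IN ('<late doc_ids>')
ORDER BY
  submission_timestamp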

Their client Date headers generally agree, so I feel confident that these are pings that were legitimately submitted by clients and received by the pipeline edge around 2020-07-10 20:34, but somehow they did not hit the BQ sink until around 2020-07-14 06:54. Or, there was some other operation outside of the sink that added these to live tables, but I'm not aware of any backfilling efforts.

There was a deploy of the telemetry BQ live sink yesterday, but I don't see how that could induce this behavior.

We also deployed an ingestion-sink code update on July 7th: https://github.com/mozilla/gcp-ingestion/releases/tag/20200707. That change could plausibly produce the behavior we're seeing here, but it doesn't match the time frame: that deploy should have caused some latent messages to land in the live tables in the July 5-7 timeframe, but not after, and :relud likely determined exactly which messages those were.

We may want to check whether payload_bytes_{raw,decoded} saw similarly late delivery of the same messages; a partition-summary sketch for the raw table is below. Since we're not seeing pubsub latency triggers, my best guess is that these are messages queued on disk at the edge that are eventually being written to pubsub.
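
A sketch of that check for the raw table, mirroring the legacySQL partition-summary query above (assuming payload_bytes_raw lives in the same project as the live tables):

#legacySQL
SELECT
  partition_id,
  last_modified_timestamp
FROM
  [moz-fx-data-shared-prod:payload_bytes_raw.telemetry$__PARTITIONS_SUMMARY__]
ORDER BY
  partition_id DESC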

Recent changes to ingestion-edge come to mind, all on 2020/06/29:

We've disabled autoscaling twice before: https://bugzilla.mozilla.org/show_bug.cgi?id=1607349 and more recently due to https://bugzilla.mozilla.org/show_bug.cgi?id=1630096 (which caused the same behavior). The following scenario seems plausible to me:

  1. edge can't write to pubsub, queues some messages
  2. scale down initiated
  3. edge doesn't finish writing queued messages from its disk queue to pubsub before it is killed
  4. edge comes back online after a scale up event, and eventually persists messages in its disk queue to pubsub

However, we have been running with autoscaling for the majority of the time GCP telemetry has been in production and we've never noticed this issue before. An easy way to test this is to disable autoscaling again.

  • ROUTE_TABLE update to route messages to pioneer

I doubt this is related to submitting latent data.

We should still have live data for 30 days, so we should be able to tell whether the behavior began around the 29th or not (though I believe we check for this in weekly meetings). If it in fact only began around the 10th or so, then the above changes may be unrelated.

Looking at payload_bytes_raw.telemetry$__PARTITIONS_SUMMARY__, I see that last_modified_timestamp is generally around 01:00 of the following day, except for the following:

partition_id  last_modified_timestamp
20200703      2020-07-06 06:33:04.418 UTC
20200706      2020-07-08 04:58:32.722 UTC
20200712      2020-07-14 06:56:26.351 UTC
20200713      2020-07-14 07:40:00.621 UTC

The oldest partition is 20200628, so it's notable that modification times all look fine immediately after the 2020-06-29 deploy.

When I look at payload_bytes_decoded.telemetry_telemetry__main_v4$__PARTITIONS_SUMMARY__, I get 30 days of history, but see anomalies only on the same four days listed above.
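
For reference, the analogous legacySQL sketch for that decoded table (again assuming the same project):

#legacySQL
SELECT
  partition_id,
  last_modified_timestamp
FROM
  [moz-fx-data-shared-prod:payload_bytes_decoded.telemetry_telemetry__main_v4$__PARTITIONS_SUMMARY__]
ORDER BY
  partition_id DESC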

I'm surprised that we don't see anomalies on the 10th and 11th in these tables like we do for the live tables.

The fact that ~01:00 UTC is the norm itself needs explanation, as that number should be closer to 00:00 UTC than 01:00, especially for the pbr sinks, which have no latency other than the edge and use streaming inserts.

I will try to see if I can dig up docs to explicitly support this, but I'm willing to bet that when streaming inserts are involved, last_modified_time reflects the last time at which the streaming buffer was flushed to permanent storage, which I believe is documented as roughly a 30-minute delay from the time a record enters the streaming buffer and becomes available for querying.

From https://cloud.google.com/bigquery/streaming-data-into-bigquery: "Data can take up to 90 minutes to become available for copy operations."

The biggest edge code change I can see during the time frame is https://github.com/mozilla/gcp-ingestion/pull/355/files, which does change batching behavior. EDIT: this is actually from 2019, so it is not within the timeframe we're investigating. As far as I know, no major ingestion-edge changes have happened in the last year or so.

As of now, I think it would be appropriate to try one of the following:

  1. scale up the edge to max capacity and observe whether the issue goes away
  2. roll back the edge to the previous code version and observe whether the issue goes away

This is operating under the theory that the issue is with ingestion edge queueing, and wouldn't account for the discrepancies in comment #4.

I've done (1) above to see if we stop seeing latent messages when there are a static set of pods in the statefulset.

It also turns out that we've seen a few (5) pod shutdown errors that could cause the behavior we're seeing, but the occurrences of these are as follows (Pacific time):

7/14/20, 1:18 PM   Traceback (most recent call last):
7/6/20, 12:35 PM   Traceback (most recent call last):
7/6/20, 9:38 AM    Traceback (most recent call last):
7/6/20, 9:38 AM    Traceback (most recent call last):
7/4/20, 1:49 PM    Traceback (most recent call last):

These don't quite align with the pbr latencies, however, and in particular don't align with the live-table latencies (data from the 10th arriving on the 14th). :relud is investigating.

One could imagine a scenario where a pod queues data to disk on the 10th, continues to fail to write that data to pubsub until the pod is terminated by e.g. autoscaling, hard-fails on that termination on the 14th, and then restarts and is finally able to submit the data on startup. This seems unlikely, but we currently don't have a good understanding of what's happening here, and I believe it is technically possible.

With preliminary data (two days), it appears that scaling up the cluster to handle peak traffic does in fact resolve the issue. A few more days of data should establish pretty definitively that the issue is with queueing at the edge, most probably with shutdown queue-flushing behavior related to autoscaling.

There are a couple of things we could do here. We could try to establish when pods are being SIGKILLed on shutdown and why, and e.g. tweak graceful-shutdown parameters to accommodate. :relud is investigating implementing the disk queue cleanup service that we've been planning as part of the remediation for a previous outage in https://bugzilla.mozilla.org/show_bug.cgi?id=1630096.

The main issues with maintaining the status quo are (a) a sudden influx of traffic will not trigger an autoscale, and (b) the extra cost associated with being fully scaled all the time.

I'm not too worried about (a), since monitoring will detect if this occurs and page ops to manually scale up the cluster.

As a cost mitigation for (b), per :relud's suggestion we could re-enable autoscaling but manually scale the number of pods up to the maximum before midnight UTC every day (typically near the trough). I'll work on an estimate of how much this would save us, and get :melissa's opinion on whether it's worth doing.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
See Also: → 1723566
Product: Data Platform and Tools → Data Platform and Tools Graveyard