Closed Bug 1767545 Opened 3 years ago Closed 3 years ago

Increase in duplicates in live tables

Categories

(Data Platform and Tools :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

(Whiteboard: [dataplatform])

While doing some checking on copy_deduplicate performance, I noticed that a single partition of the main ping live table (moz-fx-data-shared-prod.telemetry_live.main_v4) was close to 60 TB in size. I was expecting this to be more like 15 TB.

Indeed, the stable table partitions are about 15 TB apiece. But I was assuming most documents would not be duplicated. What I'm observing is that most documents appear five or more times in the live tables.
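
For reference, a minimal sketch of the kind of check involved (not the exact query linked elsewhere in this bug; the partition date is an arbitrary recent day, and it assumes document_id is a top-level column of the live table, which is how the _live tables are laid out):

  -- Compare total rows to distinct document_ids in one day's partition of the
  -- live table; the date below is illustrative.
  SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT document_id) AS distinct_document_ids,
    ROUND(COUNT(*) / COUNT(DISTINCT document_id), 2) AS avg_copies_per_document
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-05-03'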

This could be a big part of why copy_deduplicate_main_ping has been taking longer and seeing more errors recently compared to years past. We should seek to understand more deeply what's going on here.

This appears to be main-specific.

The event_v4 table has only about 0.3% of document_ids appearing more than once: https://sql.telemetry.mozilla.org/queries/85635/source

For main, it appears every document_id appears at least five times: https://sql.telemetry.mozilla.org/queries/85636/source
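
A sketch of how the "at least five times" observation can be checked, as a distribution of copies per document_id (again not the linked query, and the date is illustrative):

  -- Count how many document_ids appear exactly N times in one day's partition.
  WITH copies AS (
    SELECT
      document_id,
      COUNT(*) AS n_copies
    FROM
      `moz-fx-data-shared-prod.telemetry_live.main_v4`
    WHERE
      DATE(submission_timestamp) = '2022-05-03'
    GROUP BY
      document_id
  )
  SELECT
    n_copies,
    COUNT(*) AS n_document_ids
  FROM
    copies
  GROUP BY
    n_copies
  ORDER BY
    n_copies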

The payload_bytes_decoded (PBD) table for main_v4, which is populated via streaming inserts, looks pretty reasonable as well: https://sql.telemetry.mozilla.org/queries/85638/source. So I think the source of the duplication is somewhere after the decoder, i.e. in ingestion-sink, during batch preparation or loading.

From a very cursory inspection, the number of "UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Deadline expired before operation could complete" errors may be increasing: https://console.cloud.google.com/errors/detail/CPmsjv24l4bYTw?organizationId=442341870013&project=moz-fx-data-ingesti-prod-d59c. Note that this isn't scoped to e.g. the telemetry pipeline family loaders.

So my initial guess would be that we're starting to frequently hit some kind of 60s deadline and retrying multiple times until all messages are published, and that it's taking a minimum of five tries to upload main pings.

Looking a bit more closely, the "Deadline expired before operation could complete" errors are probably unrelated; they appear to come from some custom Stackdriver monitoring that isn't working: https://github.com/mozilla/gcp-ingestion/blob/main/ingestion-sink/src/main/java/com/mozilla/telemetry/ingestion/sink/config/SinkConfig.java#L547. These deadline errors are also prevalent in stage: https://console.cloud.google.com/errors/detail/CPmsjv24l4bYTw?project=moz-fx-data-inge-nonprod-9cc2, and ironically those metrics might be useful for diagnosing the main dupes issue. I haven't worked much with them before, but :relud might be able to chime in.

Speaking of stage, moz-fx-data-shar-nonprod-efed.telemetry_live.main_v4 also has very low dupe rates, so the dupes likely only manifest at high volume.

One thing that we could try is bumping the container version in case some GCP library has been updated that might help resolve this (the last container update was in late January). This is very unlikely to fix the issue.

Whiteboard: [data-platform-infra-wg]

the last container update was in late January

Very interesting, because evidence points to Jan 19 as the day we saw the step change: https://sql.telemetry.mozilla.org/queries/85657/source#212102

I think we should look closely at what all we changed in the last container publish.

January 19th is in fact the day of the last docker -> ops infra mirror, so this is quite a promising lead.

Jan 19th (current):
mozilla/ingestion-sink:latest Digest: sha256:cb66d9a89dc7c5ab85a1f8dda0de55f381fbf5b246ab37ef63974c8334cea46f Build URL: https://circleci.com/gh/mozilla/gcp-ingestion/52377

Dec 10th (previous):
mozilla/ingestion-sink:latest Digest: sha256:cd9d286064a05c489d66970fdfc9353b7e49c0b7dc5d7fa31c00bbe3c51c140d Build URL: https://circleci.com/gh/mozilla/gcp-ingestion/50642

So presumably the behavior change happened somewhere between those two revisions.

Assignee: nobody → jklukas

There are no code changes to ingestion-sink within the scope of that diff, but several dependency changes. Nothing jumps out at me.

As a cross-check, I looked at the current telemetry-raw-decoder beam job. The telemetry/main_v4/valid_submission counter shows 124M submissions over the past ~11 hours, which is in line with expectations. Also, in spot checking duplicates, it looks like in most cases the submissions for a given document_id all have the same submission_timestamp.
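
The spot check was along these lines (a sketch, not the exact query I ran; the date is illustrative):

  -- For a sample of duplicated document_ids, check whether all copies share a
  -- single submission_timestamp (same_timestamp = TRUE suggests the same decoded
  -- message was loaded more than once, rather than the client resubmitting).
  SELECT
    document_id,
    COUNT(*) AS n_copies,
    COUNT(DISTINCT submission_timestamp) = 1 AS same_timestamp
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-05-03'
  GROUP BY
    document_id
  HAVING
    COUNT(*) > 1
  LIMIT
    100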

This is further evidence that the problem is coming from within our pipeline, so ingestion-sink seems like the right place to focus attention.

One thing that we could try is bumping the container version in case some GCP library has been updated that might help resolve this (the last container update was in late January). This is very unlikely to fix the issue.

I think this is worth pursuing. If we bumped the container version to capture the current state of main, we'd be bumping the GCP libraries BOM from 24.2.0 to 25.2.0. Given that we have no ingestion-sink code changes in the diff where we saw the regression, a change in library behavior is the only source I can think of right now. If that's the case, it's quite possible that the underlying behavior has since been fixed.

There's no guarantee a new dependency version will have fixed any such problem, but if it was due to the change in dependency versions then reverting back to the previously used dependency versions should give us the previous behavior. Is (temporarily) downgrading the dependencies feasible?

One thing that we could try is bumping the container version in case some GCP library has been updated that might help resolve this (the last container update was in late January).
I think this is worth pursuing.

I plan to do this presently, though note that today we had some extra latency in the very pipeline being investigated here: https://console.cloud.google.com/cloudpubsub/subscription/detail/telemetry-decoded-pending.bq-sink-loader?organizationId=442341870013&project=moz-fx-data-ingesti-prod-d59c. This is likely unrelated and has just resolved, probably due to some combination of the following:

  1. a manual config change I made in response to the alert (bumping loader replicas from 3 to 4)
  2. natural recovery (by the time I started investigating the unacked message count was recovering)
  3. the beam deploy (geoip update) pausing decoder output, giving the ingestion-sink subscription time to catch up

(In reply to Sean Rose from comment #9)

There's no guarantee a new dependency version will have fixed any such problem, but if it was due to the change in dependency versions then reverting back to the previously used dependency versions should give us the previous behavior. Is (temporarily) downgrading the dependencies feasible?

Certainly, there's no guarantee that updating dependencies will fix the problem. But I don't see value in temporarily downgrading dependencies, at least for now. We have been able to operate in this state for several months already, so I think the focus should be on investigating possibilities for rolling forward and understanding root cause rather than optimizing for eliminating the behavior quickly.

If we hit a dead end, it could be worth circling back and considering whether there's value in reverting in order to compare logs or other signals of behavior changes.

New version deployed:

mozilla/ingestion-sink:latest Digest: sha256:51a1ed1ba0bf6939070928b23185db4c0bb95a4593712896984a14d8b2d56226 Build URL: https://circleci.com/gh/mozilla/gcp-ingestion/57392

Seeing a bunch of the following now:

May 05, 2022 4:46:25 PM com.google.cloud.pubsub.v1.StreamingSubscriberConnection$2 onFailure
WARNING: failed to send operations
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes.
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes.
        at io.grpc.Status.asRuntimeException(Status.java:535)
        ... 17 more

So we may need to roll this back.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #4)

evidence points to Jan 19 as the day we saw the step change: https://sql.telemetry.mozilla.org/queries/85657/source#212102

From some additional investigation, the payload_bytes_decoded.telemetry_telemetry__main_v4 table does not show a similar spike in duplicates: https://sql.telemetry.mozilla.org/queries/85676/source#212147

This confirms that the issue is with the sink, not somewhere upstream in the pipeline, and also tells us the issue is specific to batch mode.
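
A sketch of that comparison, assuming the payload_bytes_decoded table lives in the same shared-prod project and exposes document_id as a top-level column (the date is illustrative):

  -- Fraction of document_ids that appear more than once in one day's partition
  -- of the decoded (streaming-inserted) table, for comparison with the live table.
  WITH copies AS (
    SELECT
      document_id,
      COUNT(*) AS n_copies
    FROM
      `moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4`
    WHERE
      DATE(submission_timestamp) = '2022-05-03'
    GROUP BY
      document_id
  )
  SELECT
    COUNTIF(n_copies > 1) / COUNT(*) AS fraction_duplicated
  FROM
    copies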

From some additional investigation, the payload_bytes_decoded.telemetry_telemetry__main_v4 table does not show a similar spike in duplicates: https://sql.telemetry.mozilla.org/queries/85676/source#212147

I think this is the same conclusion I made in comment #2.

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes.

I've rolled this back for now, specifically on the structured and telemetry loader instances but will redeploy the full stack today.

It looks like this is affecting the experiment monitoring dashboards, which pull in live data for the two most recent days: https://mozilla.slack.com/archives/C4D5ZA91B/p1651770782968919
Search metrics in particular (which are based on main) are inflated. So once this is fixed we should redeploy the materialized view telemetry_derived.experiment_search_aggregates_live_v1.

It looks like this is affecting the experiment monitoring dashboards, which pull in live data for the two most recent days:

Given that we've had these duplicates since January, it's not clear to me why this would start to affect a materialized view in the last few days (unless that view is brand new).

:relud has found the dependency update that likely caused the INVALID_ARGUMENT: Request payload size exceeds the limit errors and is investigating a fix.

Given that we've had these duplicates since January, it's not clear to me why this would start to affect a materialized view in the last few days (unless that view is brand new).

This is likely the first time someone noticed (or mentioned) this issue. The materialized view has data going back to the beginning of April 2022, and the values are all inflated.

Certainly, there's no guarantee that updating dependencies will fix the problem. But I don't see value in temporarily downgrading dependencies, at least for now. We have been able to operate in this state for several months already, so I think the focus should be on investigating possibilities for rolling forward and understanding root cause rather than optimizing for eliminating the behavior quickly.

I agree about focusing on understanding root cause, and my thought was rolling back to previous dependency versions would be the surest way to do that. If we downgrade and it fixes the problem, then we can selectively upgrade dependencies until we find the one that's the culprit. Conversely, if we downgrade and the problem persists then we've eliminated the dependency upgrades as the source of the problem.

(In reply to Wesley Dawson [:whd] from comment #17)

:relud has found the dependency update that likely caused the INVALID_ARGUMENT: Request payload size exceeds the limit errors and is investigating a fix.

mozilla/gcp-ingestion#2074 forcibly downgrades google-cloud-pubsub to <1.116.0 because 1.116.0+ produces the above error under load.

I deployed https://github.com/mozilla/gcp-ingestion/pull/2074 to stage and am deploying prod presently.

Prod has been deployed and appears to be working. I think the next step is to check main duplicate rates after (say) 20:00 UTC to see if they've dropped as a result of the container update.

Live data after the deploy so far seems to have much better duplicate rates: https://sql.telemetry.mozilla.org/queries/85750/source.
I suspect that when we check in on Monday we can, in theory, consider this resolved, but it would sure be interesting to better understand the SDK change (and subsequent fix) that caused this duplication to occur.
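
A sketch of that before/after check; the cutover timestamp below is a hypothetical placeholder for the actual deploy time:

  -- Compare duplicate rates for documents received before vs. after the deploy.
  -- '2022-05-06 18:00:00+00' stands in for the real cutover time.
  SELECT
    submission_timestamp >= TIMESTAMP '2022-05-06 18:00:00+00' AS after_deploy,
    COUNT(*) / COUNT(DISTINCT document_id) AS avg_copies_per_document
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-05-06'
  GROUP BY
    after_deploy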

Some potential followups:

  1. monitoring for live table dupe rates
    We would have caught this much sooner with dashboards such as the ones created for the investigation here; a sketch of such a check follows this list.

  2. consider more frequent code deploys
    While we probably wouldn't have caught this without monitoring, it would have been resolved automatically if we were updating ingestion-sink code more frequently. It might have been more difficult to determine whether/which code change was the culprit in this case, however. I would also have some reservations about doing this continuously (as is done with beam) since it's clear that theoretically innocuous dependency updates (such as the pubsub one we needed to roll back) can cause pipeline issues.
    The current policy for ingestion-sink updates is "whenever they are needed", which usually maps to some kind of code change (e.g. metadata header updates) or a security fix. In the latest case it looks like it was tied to url2: https://github.com/mozilla/gcp-ingestion/pull/1953, which was a code change to ingestion-core that is used by ingestion-sink.

  3. Figure out the pubsub dependency issue
    We may need to update our code or figure out why upstream behavior changed and if we need to ask for or implement a fix.
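
For followup 1, a minimal sketch of a scheduled check that could back an alert on live table dupe rates (the 1.1 threshold and one-day lookback are arbitrary placeholders):

  -- Flag yesterday's main_v4 live partition if the average number of copies per
  -- document_id exceeds a threshold.
  SELECT
    DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS submission_date,
    COUNT(*) / COUNT(DISTINCT document_id) AS avg_copies_per_document,
    COUNT(*) / COUNT(DISTINCT document_id) > 1.1 AS exceeds_threshold
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)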

See Also: → 1768507

From my previous comment:

  1. has been spun out to bug #1768507
  2. I think it's fine to leave code updates for ingestion-sink as they are now
  3. :relud filed https://console.cloud.google.com/support/go?caseId=29834657&accountId=gcp-sa-1256631965 which tracks the pubsub dependency issue

I'm assuming we don't plan to get to the bottom of what actually happened with the dependency, so I'm going to close this.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Whiteboard: [data-platform-infra-wg] → [dataplatform]