Closed Bug 1679949 Opened 4 years ago Closed 4 years ago

Timeout sending Glean telemetry with Glean SDK 33.4.0

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: raphael, Assigned: janerik)

References

Details

(Whiteboard: [telemetry:glean-rs:m?])

Attachments

(2 files)

For last night's burnham test run, all sensors failed. The tables do not contain any records for the test run:

SELECT
  *
FROM
  `moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
  submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-01T00:02:16.075941+00:00", INTERVAL 1 HOUR)
  AND TIMESTAMP_ADD("2020-12-01T00:02:16.075941+00:00", INTERVAL 3 HOUR)
  AND metrics.uuid.test_run = "6064eca0-2d63-45e5-a077-6c09587e4cdf"

The client1 log reads ERROR:glean._dispatcher:Timeout sending Glean telemetry.

https://workflow.telemetry.mozilla.org/log?task_id=client1&dag_id=burnham&execution_date=2020-11-30T00%3A00%3A00%2B00%3A00

I'm looking through monitoring for the pipeline, and I haven't found any evidence yet of abnormal latency or a failure condition, but I'm continuing to look. I've cleared the DAG run to see if the issue happens again on retry.

The client error ERROR:glean._dispatcher:Timeout sending Glean telemetry could conceivably indicate that the ingestion edge service was returning non-200 responses, so I'll focus on whether I can find an issue with the edge.

It does appear that there was a somewhat unusual sustained period of high request volume at the edge starting just about at midnight UTC and continuing for about 3 hours, though I don't see any indication of increased error rates due to the load.

After retrying the DAG, client1 is still showing ERROR:glean._dispatcher:Timeout sending Glean telemetry, so this does not look like a transient error. Perhaps something has actually changed in the Glean client behavior?

To note: Burnham upgraded Glean v33.0.4 -> v33.4.0 yesterday, PR: https://github.com/mozilla/burnham/pull/108/

There was a mention in #glean that the Glean debug view is showing high latency (5 to 10 minutes) this morning.

And on further investigation, I do see a sharp spike in 500 responses a few hours ago, around 12:00 UTC. So there does appear to be something going on with the pipeline, though it's unclear if that's related to the behavior here.

The message "ERROR:glean._dispatcher:Timeout sending Glean telemetry" is sent if it takes more than 30 seconds to flush the worker thread queue on shutdown of the process running Glean. Importantly, the ping uploading happens in a separate process (not a separate thread), and there is no timeout shutting down that process, so I'd be a bit surprised if the changes to the throttling timing would cause this to happen. But looking into what might be inadvertently blocking the thread queue...

I looked into this and determined that this is indeed caused by some subtle behavior inside glean-core.
It most likely only affects short-lived applications, such as these tests, which finish their work faster than Glean finishes initializing.
I'll do a writeup and see if we can fix that easily right now.

Thank you for investigating, folks!

Assignee: nobody → jrediger
Priority: P3 → P1
See Also: → 1680224

Release is out. Keeping it open until it lands in burnham.

The corresponding burnham pull request is ready for review at https://github.com/mozilla/burnham/pull/113

Last night's Airflow run failed again. The new Glean SDK release v33.6.0 resolved the timeout issue that we were seeing with v33.4.0, but Glean does not seem to upload pings to the data platform. For example:

SELECT
  *
FROM
  `moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
  submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-03T00:00:49.693927+00:00", INTERVAL 1 HOUR)
  AND TIMESTAMP_ADD("2020-12-03T00:00:49.693927+00:00", INTERVAL 3 HOUR)
  AND metrics.uuid.test_run = "b5e185a8-00ba-4fd5-9d59-d457572b2485"
  AND metrics.string.test_name = "test_burnham"

This can be reproduced locally using the burnham main branch and running docker-compose up --build. The fake-data-platform service is programmed to log incoming requests from the burnham clients. Running burnham with v33.6.0 doesn't produce these log entries even though the client logs report that pings were submitted and will be sent as soon as possible.

Posted in chat, posting here for posterity:

So here's how the upload process works in the default configuration:

  1. When you submit a ping, Glean launches a new process that takes care of uploading. If an uploader process already exists, it doesn't launch a new one (see the sketch after this list)
  2. When an app using glean-py exits while an upload process exists, that upload process keeps running until it is finished
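
A minimal sketch of the launch-if-not-already-running pattern from step 1 (hypothetical code, not the actual glean-py implementation; the upload body is just a placeholder):

# Illustrative sketch only, NOT the actual glean-py code.
import multiprocessing
import time
from typing import Optional

_uploader: Optional[multiprocessing.Process] = None

def _upload_pending_pings() -> None:
    # Placeholder: the real uploader drains the pending pings and POSTs each
    # one to the ingestion endpoint, waiting between attempts.
    time.sleep(1)

def ensure_uploader_running() -> None:
    # Launch an uploader process unless one is already running.
    global _uploader
    if _uploader is not None and _uploader.is_alive():
        return
    _uploader = multiprocessing.Process(target=_upload_pending_pings)
    _uploader.start()

if __name__ == "__main__":
    ensure_uploader_running()
    ensure_uploader_running()  # second call is a no-op: a process is alive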

Now this is where burnham comes in:

  1. It runs its job, submits a couple of pings, and Glean launches the upload process. burnham then exits.
  2. When burnham exits, the "main process" of that container exits -> Docker shuts down the container immediately, so the upload process is killed and never gets to upload

And this is why it previously worked:
Due to a "bug" that we have now fixed, we always waited for that process on exit. When burnham was done, Glean caused the burnham process to block until the upload process finished.
Previously we would wait just 1s at a time (at most 3 times in a row, so 3s total), then finish. In burnham's case that brief window was enough for the uploader to actually upload everything and exit, after which burnham exited.
With the 33.4.0 change we expanded the wait time to 60s, but there is a 30s timeout on exit. So Glean was blocked on exit just sleeping, while burnham wanted to exit and ultimately killed the uploader after 30s (of sleep).
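
To make the timing mismatch concrete, here is a small illustrative sketch (again, not the real glean-py code): the uploader side sleeps for up to 60s, while the exiting side only waits 30s before killing it, so nothing gets uploaded.

# Illustrative sketch of the 60s-wait vs. 30s-exit-timeout mismatch;
# NOT the actual glean-py implementation.
import multiprocessing
import time

def uploader() -> None:
    # With the 33.4.0 throttling change the uploader can sleep up to 60s
    # between upload attempts.
    time.sleep(60)
    print("upload attempted")  # never reached in this scenario

if __name__ == "__main__":
    proc = multiprocessing.Process(target=uploader)
    proc.start()
    # On exit, only 30s are granted for outstanding work...
    proc.join(30)
    if proc.is_alive():
        # ...so the still-sleeping uploader is terminated before it can
        # upload anything.
        proc.terminate()
        print("uploader killed after 30s of sleep, nothing was sent")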

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED