Timeout sending Glean telemetry with Glean SDK 33.4.0
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: raphael, Assigned: janerik)
References
Details
(Whiteboard: [telemetry:glean-rs:m?])
Attachments
(2 files)
For last night's burnham test run, all sensors failed. The tables do not contain any records for the test run:
SELECT
*
FROM
`moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-01T00:02:16.075941+00:00", INTERVAL 1 HOUR)
AND TIMESTAMP_ADD("2020-12-01T00:02:16.075941+00:00", INTERVAL 3 HOUR)
AND metrics.uuid.test_run = "6064eca0-2d63-45e5-a077-6c09587e4cdf"
The client1 log reads ERROR:glean._dispatcher:Timeout sending Glean telemetry.
Comment 1•4 years ago
I'm looking through monitoring for the pipeline, and I haven't found any evidence yet of abnormal latency or a failure condition, but I'm continuing to look. I've cleared the DAG run to see if the issue happens again on retry.
The client error ERROR:glean._dispatcher:Timeout sending Glean telemetry could conceivably indicate that the ingestion edge service was returning non-200 responses, so I'll focus on whether I can discover an issue with the edge.
Comment 2•4 years ago
It does appear that there was a somewhat unusual sustained period of high request volume at the edge, starting right around midnight UTC and continuing for about 3 hours, though I don't see any indication of increased error rates due to the load.
When I retried the DAG, client1 still showed ERROR:glean._dispatcher:Timeout sending Glean telemetry, so this does not look like a transient error. Perhaps something has actually changed in the Glean client behavior?
Assignee
Comment 3•4 years ago
To note: Burnham upgraded Glean v33.04 -> v33.4.0 yesterday, PR: https://github.com/mozilla/burnham/pull/108/
Comment 4•4 years ago
There was a mention in #glean that the Glean debug view is showing high latency (5 to 10 minutes) this morning.
And on further investigation, I do see a sharp spike in 500 responses a few hours ago, around 12:00 UTC. So there does appear to be something going on with the pipeline, though it's unclear if that's related to the behavior here.
Comment 5•4 years ago
The message "ERROR:glean._dispatcher:Timeout sending Glean telemetry" is emitted if it takes more than 30 seconds to flush the worker thread queue when the process running Glean shuts down. Importantly, ping uploading happens in a separate process (not a separate thread), and there is no timeout when shutting down that process, so I'd be a bit surprised if the changes to the throttling timing caused this. But I'm looking into what might be inadvertently blocking the thread queue...
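For illustration, here is a minimal sketch of that flush-with-timeout pattern. This is not glean-core's actual implementation; the queue, thread, and function names are hypothetical, and only the 30-second constant and the log message come from the behavior described above.

import queue
import threading

SHUTDOWN_FLUSH_TIMEOUT_SECONDS = 30  # the 30s budget mentioned above

# Hypothetical stand-in for the dispatcher: one worker thread draining a task queue.
_task_queue = queue.Queue()

def _worker() -> None:
    while True:
        task = _task_queue.get()
        if task is None:  # sentinel: stop the worker
            return
        task()  # run the queued work (metric recording, ping assembly, ...)

_worker_thread = threading.Thread(target=_worker, daemon=True)
_worker_thread.start()

def shutdown() -> None:
    # Ask the worker to stop, then wait at most 30 seconds for the queue to drain.
    _task_queue.put(None)
    _worker_thread.join(timeout=SHUTDOWN_FLUSH_TIMEOUT_SECONDS)
    if _worker_thread.is_alive():
        # Mirrors the log line seen in the burnham client output.
        print("ERROR:glean._dispatcher:Timeout sending Glean telemetry")

If the queue drains in time the error never appears; if something blocks the worker (for example, a task stuck waiting on initialization), the timeout fires and whatever is still queued is abandoned.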
Assignee
Comment 6•4 years ago
I looked into this and determined that some subtle behavior inside glean-core is indeed causing this.
It most likely only affects short-lived applications, such as these tests, which finish their work faster than Glean finishes initializing.
I'll do a writeup and see if we can fix that easily right now.
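For context, a short-lived burnham-style client is essentially the following shape. This is only a sketch using the glean-py API; the application id, version, registry file name, and ping name are assumptions, and the exact Glean.initialize arguments vary between SDK versions.

from pathlib import Path

from glean import Glean, load_pings

# Assumed registry file; burnham ships its own pings.yaml.
pings = load_pings(Path("pings.yaml"))

# Initialization kicks off work on Glean's worker thread in the background.
Glean.initialize(
    application_id="burnham",
    application_version="0.1.0",
    upload_enabled=True,
)

# The client does its work, submits its pings, and then simply falls off the
# end of the script -- possibly before Glean has even finished initializing,
# which is exactly the window described above.
pings.discovery.submit()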
Reporter
Comment 7•4 years ago
Thank you for investigating, folks!
Comment 8•4 years ago
Comment 9•4 years ago
Assignee
Updated•4 years ago
Assignee
Updated•4 years ago
Assignee
Comment 10•4 years ago
Release is out. Keeping it open until it lands in burnham.
Reporter
Comment 11•4 years ago
The corresponding burnham pull request is ready for review at https://github.com/mozilla/burnham/pull/113
Reporter
Comment 12•4 years ago
Last night's Airflow run failed again. The new Glean SDK release v33.6.0 resolved the timeout issue that we were seeing with v33.4.0, but Glean does not seem to upload pings to the data platform. For example:
SELECT
*
FROM
`moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-03T00:00:49.693927+00:00", INTERVAL 1 HOUR)
AND TIMESTAMP_ADD("2020-12-03T00:00:49.693927+00:00", INTERVAL 3 HOUR)
AND metrics.uuid.test_run = "b5e185a8-00ba-4fd5-9d59-d457572b2485"
AND metrics.string.test_name = "test_burnham"
This can be reproduced locally using the burnham main branch and running docker-compose up --build. The fake-data-platform service is programmed to log incoming requests from the burnham clients. Running burnham with v33.6.0 doesn't produce these log entries, even though the client logs report that pings were submitted and will be sent as soon as possible.
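For readers unfamiliar with that setup: the fake-data-platform is essentially an HTTP server that accepts Glean's submit requests and logs them. The snippet below is only an illustrative stand-in, not the actual service code; the Flask framework, route, and log format are assumptions.

import logging

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Glean clients POST pings to /submit/<namespace>/<doctype>/<version>/<document_id>.
@app.route("/submit/<path:ping_path>", methods=["POST"])
def submit(ping_path: str):
    # Log every incoming ping; a missing log line here means the ping never arrived.
    app.logger.info("Received ping: %s (%d bytes)", ping_path, len(request.data))
    return "", 200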
Assignee
Comment 13•4 years ago
Posted in chat, posting here for posterity:
So here's how the upload process works in the default configuration:
- When you submit a ping, Glean launches a new process that takes care of uploading. If an uploader process already exists, it doesn't launch a new one.
- When an app using glean-py exits while an upload process exists, that upload process keeps running until it is finished.
Now this is where burnham comes in:
- It runs its job and submits a couple of pings, and Glean launches the upload process. burnham then exits.
- When burnham exits, the "main process" of that container exits -> Docker shuts down the container immediately, so the upload process is killed and can never upload.
And this is why previously it worked:
Due to a "bug" that we have now fixed, we always waited for that process on exit. When burnham was done, Glean caused it to block until the upload process was finished.
Previously the uploader would only wait 1s at a time (at most 3 times in a row, so 3s total), then finish. In burnham's case that brief window was enough to actually upload everything and exit, after which burnham itself exited.
With the 33.4.0 change we expanded the wait time to 60s, but there is a 30s timeout on exit. So Glean was blocked on exit just sleeping, while burnham wanted to exit and ultimately killed the uploader after 30s (of sleep).
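To make the timing concrete, here is a rough simulation of that interaction. The 1s/3-attempt, 60s, and 30s numbers come from the explanation above; everything else (the multiprocessing stand-in, the function names) is illustrative and not Glean's actual code.

import multiprocessing
import time

OLD_WAIT_SECONDS = 1           # pre-33.4.0: wait 1s per attempt
OLD_MAX_ATTEMPTS = 3           # at most 3 attempts, so 3s total
NEW_WAIT_SECONDS = 60          # 33.4.0: the uploader now waits 60s
SHUTDOWN_TIMEOUT_SECONDS = 30  # but shutdown only waits 30s before giving up

def uploader(wait_seconds: float) -> None:
    # Stand-in for the separate uploader process: sleep, then "upload".
    time.sleep(wait_seconds)
    print("pings uploaded")

def run_burnham(wait_seconds: float) -> None:
    upload_process = multiprocessing.Process(target=uploader, args=(wait_seconds,))
    upload_process.start()
    # burnham's work is done; shutdown blocks on the uploader, but only for 30s.
    upload_process.join(timeout=SHUTDOWN_TIMEOUT_SECONDS)
    if upload_process.is_alive():
        # Mirrors the error seen in the burnham client logs.
        print("ERROR:glean._dispatcher:Timeout sending Glean telemetry")
        upload_process.terminate()  # the queued uploads are lost

if __name__ == "__main__":
    # Old behavior: 3s of waiting fits comfortably inside the 30s shutdown budget.
    run_burnham(OLD_WAIT_SECONDS * OLD_MAX_ATTEMPTS)
    # New behavior: the 60s sleep outlives the 30s timeout, so the uploader is killed.
    run_burnham(NEW_WAIT_SECONDS)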
Assignee
Comment 14•4 years ago
This has now been addressed in https://github.com/mozilla/burnham/pull/115