Closed Bug 1679949 Opened 4 years ago Closed 4 years ago

Timeout sending Glean telemetry with Glean SDK 33.4.0

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: raphael, Assigned: janerik)

References

Details

(Whiteboard: [telemetry:glean-rs:m?])

Attachments

(2 files)

For last night's burnham test run, all sensors failed. The tables do not contain any records for the test run:

SELECT
  *
FROM
  `moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
  submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-01T00:02:16.075941+00:00", INTERVAL 1 HOUR)
  AND TIMESTAMP_ADD("2020-12-01T00:02:16.075941+00:00", INTERVAL 3 HOUR)
  AND metrics.uuid.test_run = "6064eca0-2d63-45e5-a077-6c09587e4cdf"

The client1 log reads ERROR:glean._dispatcher:Timeout sending Glean telemetry.

https://workflow.telemetry.mozilla.org/log?task_id=client1&dag_id=burnham&execution_date=2020-11-30T00%3A00%3A00%2B00%3A00

I'm looking through monitoring for the pipeline, and I haven't found any evidence yet of abnormal latency or a failure condition, but I'm continuing to look. I've cleared the DAG run to see if the issue happens again on retry.

The client error ERROR:glean._dispatcher:Timeout sending Glean telemetry could conceivably indicate that the ingestion edge service was returning non-200 responses, so I'll focus on whether I can find an issue with the edge.

It does appear that there was a somewhat unusual sustained period of high request volume at the edge starting just about at midnight UTC and continuing for about 3 hours, though I don't see any indication of increased error rates due to the load.

After retrying the DAG, client1 is still showing ERROR:glean._dispatcher:Timeout sending Glean telemetry, so this does not look like a transient error. Perhaps something has actually changed in the Glean client behavior?

To note: Burnham upgraded Glean v33.0.4 -> v33.4.0 yesterday, PR: https://github.com/mozilla/burnham/pull/108/

There was a mention in #glean that the Glean debug view is showing high latency (5 to 10 minutes) this morning.

And on further investigation, I do see a sharp spike in 500 responses a few hours ago, around 12:00 UTC. So there does appear to be something going on with the pipeline, though it's unclear if that's related to the behavior here.

The message "ERROR:glean._dispatcher:Timeout sending Glean telemetry" is sent if it takes more than 30 seconds to flush the worker thread queue on shutdown of the process running Glean. Importantly, the ping uploading happens in a separate process (not a separate thread), and there is no timeout shutting down that process, so I'd be a bit surprised if the changes to the throttling timing would cause this to happen. But looking into what might be inadvertently blocking the thread queue...

I looked into this and determined that this is indeed caused by some subtle behavior inside glean-core.
It most likely only affects short-lived applications, such as these tests, which finish their work faster than Glean finishes initializing.
I'll do a writeup and see if we can fix that easily right now.

Thank you for investigating, folks!

Assignee: nobody → jrediger
Priority: P3 → P1
See Also: → 1680224

Release is out. Keeping it open until it lands in burnham.

The corresponding burnham pull request is ready for review at https://github.com/mozilla/burnham/pull/113

Last night's Airflow run failed again. The new Glean SDK release v33.6.0 resolved the timeout issue that we were seeing with v33.4.0, but Glean does not seem to upload pings to the data platform. For example:

SELECT
  *
FROM
  `moz-fx-data-shared-prod.burnham_live.discovery_v1`
WHERE
  submission_timestamp BETWEEN TIMESTAMP_SUB("2020-12-03T00:00:49.693927+00:00", INTERVAL 1 HOUR)
  AND TIMESTAMP_ADD("2020-12-03T00:00:49.693927+00:00", INTERVAL 3 HOUR)
  AND metrics.uuid.test_run = "b5e185a8-00ba-4fd5-9d59-d457572b2485"
  AND metrics.string.test_name = "test_burnham"

This can be reproduced locally using the burnham main branch and running docker-compose up --build. The fake-data-platform service is programmed to log incoming requests from the burnham clients. Running burnham with v33.6.0 doesn't produce these log entries even though the client logs report that pings were submitted and will be sent as soon as possible.

Posted in chat, posting here for posterity:

So here's how the upload process works in the default configuration:

  1. When you submit a ping, Glean launches a new process that takes care of uploading. If an uploader process already exists, it doesn't launch a new one (see the sketch after this list)
  2. When an app using glean-py exits while an upload process exists, that upload process keeps running until it is finished
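
A minimal sketch of the launch-if-not-already-running pattern from step 1 (hypothetical code, not the actual glean-py implementation; the upload body is just a placeholder):

# Illustrative sketch only, NOT the actual glean-py code.
import multiprocessing
import time
from typing import Optional

_uploader: Optional[multiprocessing.Process] = None

def _upload_pending_pings() -> None:
    # Placeholder: the real uploader drains the pending pings and POSTs each
    # one to the ingestion endpoint, waiting between attempts.
    time.sleep(1)

def ensure_uploader_running() -> None:
    # Launch an uploader process unless one is already running.
    global _uploader
    if _uploader is not None and _uploader.is_alive():
        return
    _uploader = multiprocessing.Process(target=_upload_pending_pings)
    _uploader.start()

if __name__ == "__main__":
    ensure_uploader_running()
    ensure_uploader_running()  # second call is a no-op: a process is alive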

Now this is where burnham comes in:

  1. It runs its job, submits a couple of pings, and Glean launches the upload process. burnham then exits.
  2. When burnham exits, the "main process" of that container exits -> Docker shuts down the container immediately, so the upload process is killed and never gets to upload

And this is why it previously worked:
Due to a "bug" that we have now fixed, we always waited for that process on exit. When burnham was done, Glean caused the burnham process to block until the upload process finished.
Previously we would wait just 1s at a time (at most 3 times in a row, so 3s total), then finish. In burnham's case that brief window was enough for the uploader to actually upload everything and exit, after which burnham exited.
With the 33.4.0 change we expanded the wait time to 60s, but there is a 30s timeout on exit. So Glean was blocked on exit just sleeping, while burnham wanted to exit and ultimately killed the uploader after 30s (of sleep).
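
To make the timing mismatch concrete, here is a small illustrative sketch (again, not the real glean-py code): the uploader side sleeps for up to 60s, while the exiting side only waits 30s before killing it, so nothing gets uploaded.

# Illustrative sketch of the 60s-wait vs. 30s-exit-timeout mismatch;
# NOT the actual glean-py implementation.
import multiprocessing
import time

def uploader() -> None:
    # With the 33.4.0 throttling change the uploader can sleep up to 60s
    # between upload attempts.
    time.sleep(60)
    print("upload attempted")  # never reached in this scenario

if __name__ == "__main__":
    proc = multiprocessing.Process(target=uploader)
    proc.start()
    # On exit, only 30s are granted for outstanding work...
    proc.join(30)
    if proc.is_alive():
        # ...so the still-sleeping uploader is terminated before it can
        # upload anything.
        proc.terminate()
        print("uploader killed after 30s of sleep, nothing was sent")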

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED