1815211 - Validate new upload timing metrics

send_success

GLAM graph.
95th percentile at 1.2s.
The similar measurement in Legacy Telemetry (GLAM graph) reports 1.3s (1300ms) at the 95th percentile.
The legacy graph is much flatter, or ... static. I'm not sure if this is a bug or real data.

In any case the Glean values seem to be reporting ok and are quite reasonable.

send_failure

GLAM graph
95th percentile at 9.4s, 75th percentile at 73ms.
Failing super quickly is likely a client-side issue (no network?).
Taking close to 10s before failing could be anything from a slow network to interrupted connections or server timeouts.
Note that we don't have more detailed info about what the failure was; we record the timing for any failure.

Legacy (GLAM graph) has a 95th percentile of 17s, nearly twice as high as for Glean.
But again the lines are so flat, I don't trust them.
Not much we can conclude from this comparison.

shutdown_wait

Glean only. GLAM graph

95th percentile of just 12ms.
This seems very low, given that we spawn a thread, wait for another thread and then report back.
This might require another closer look.

:travis, for a second pair of eyes on at least the first two metrics and my thinking there.
Maybe we can eventually add this to monitoring, so we learn about shifts in the data.
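
For the monitoring idea, here is a minimal sketch of what such a check could look like, assuming we already get one 95th-percentile value per day out of some query. The function name, the input shape and the 25% tolerance are all made up for illustration; nothing like this exists yet.

```python
def flag_percentile_shifts(baseline_ms, daily_p95_ms, tolerance=0.25):
    """Return the days whose p95 deviates from the baseline by more than `tolerance`.

    baseline_ms: reference 95th percentile in milliseconds (must be > 0).
    daily_p95_ms: dict mapping a day label to that day's p95 in milliseconds.
    tolerance: allowed relative deviation (hypothetical default of 25%).
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    flagged = []
    for day in sorted(daily_p95_ms):
        # Relative change against the baseline, in either direction.
        relative_change = abs(daily_p95_ms[day] - baseline_ms) / baseline_ms
        if relative_change > tolerance:
            flagged.append(day)
    return flagged
```

With the send_success baseline of 1200ms from above, a day at 1300ms would pass and a day at 2400ms would be flagged.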

Flags: needinfo?(tlong)

Chris H-C :chutten

Comment 2 (2 years ago)

Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.

Jan-Erik Rediger [:janerik]

Reporter

Comment 3 (2 years ago)

(In reply to Chris H-C :chutten from comment #2)
> Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.

Oh right. Guess I have to figure out how to properly query timing distributions in other places now.

Jan-Erik Rediger [:janerik]

Reporter

Comment 4 (2 years ago)

Got the query from some other query that was using timing distributions, and adapted it for the glean_validation.shutdown_wait metric:

https://sql.telemetry.mozilla.org/queries/91253/source?p_app%20id=firefox_desktop_background_update

Looks like the 95th percentile is 24ms. Still super quick and I'm not sure what to take from that.
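
For reference, a rough sketch of the kind of computation behind a percentile on a timing distribution. This is not the linked query. It assumes each ping carries a histogram that maps a bucket's lower bound to a sample count (keys may arrive as strings), and the function names are made up. The result is a bucket lower bound, so it is only as precise as the bucketing.

```python
from collections import Counter


def merge_histograms(histograms):
    """Sum per-bucket counts across many histograms (bucket keys may be strings)."""
    total = Counter()
    for hist in histograms:
        for bucket, count in hist.items():
            total[int(bucket)] += count
    return dict(total)


def bucket_percentile(hist, q):
    """Return the lower bound of the first bucket at which the cumulative
    count reaches the fraction `q` (0 < q <= 1) of all samples."""
    if not hist:
        raise ValueError("empty histogram")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    threshold = q * sum(hist.values())
    running = 0
    for bucket in sorted(hist):
        running += hist[bucket]
        if running >= threshold:
            return bucket
    return max(hist)  # only reachable through float rounding


# Made-up counts, purely to show the call shape.
merged = merge_histograms([{"1": 30, "2": 10}, {"1": 20, "2": 20, "4": 15, "8": 5}])
p95 = bucket_percentile(merged, 0.95)
```

Merging first and taking the percentile afterwards matters: averaging per-ping percentiles would give a different (and usually misleading) number.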

Travis Long [:travis_]

Comment 5 (2 years ago)

I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.

Jan-Erik Rediger [:janerik]

Reporter

Comment 6 (2 years ago)

(In reply to Travis Long [:travis_] from comment #5)
> I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.

Thanks! Yeah, dealing with timing distributions is a bit of a hassle, so I just copied an existing query and I'm not 100% sure it's correct yet.

Travis Long [:travis_]

Comment 7 (2 years ago)

Got a chance to take a look at this. Instead of using an APPROX_QUANTILES approach, I borrowed this query from the iOS folks (I think it was from Chris Peterson), since it was going to take a lot less adjusting to get it to look at a timing distribution. I think this confirms the 24ms timing for the 95th percentile, with around 90ms for the 99th percentile.

ref: https://sql.telemetry.mozilla.org/queries/91260/source?p_app_id=firefox_desktop&p_days=10#225945

Flags: needinfo?(tlong)

Chris H-C :chutten

Comment 8 (2 years ago)

Wow, that's really short and not at all what I expected. A pity we didn't have this earlier, since older clients were the ones having the most problems reporting issues. Thank you for rerunning on BUA.

Jan-Erik Rediger [:janerik]

Reporter

Comment 9 (2 years ago)

Yeah, given we all think that is hella short we might want to at least briefly look into this further to validate that this is actually true. To the BUA testing /o/

Jan-Erik Rediger [:janerik]

Reporter

Comment 10 (2 years ago)

Re send_failure: I realize the default timeout for viaduct is 10s. Thus having the 95th percentile as 9.4s makes kinda sense: that's pretty much the timeout.
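
One more data point for the timeout theory, as a sketch under assumptions: if timing distributions are bucketed the way the Glean docs describe (log base 2, 8 buckets per power of two, samples stored in nanoseconds), and if GLAM reports the lower bound of a bucket, then a sample that hits the 10s timeout lands in a bucket that starts at roughly 9.37s and ends at roughly 10.2s. The function name is made up and the exact rounding in the SDK may differ.

```python
import math

# Assumed bucketing parameters for timing distributions.
LOG_BASE = 2.0
BUCKETS_PER_MAGNITUDE = 8.0


def bucket_bounds_ns(sample_ns):
    """Return (lower, upper) bounds in nanoseconds of the bucket holding `sample_ns`."""
    exponent = LOG_BASE ** (1.0 / BUCKETS_PER_MAGNITUDE)
    # Index of the exponential bucket the sample falls into.
    index = math.floor(math.log(sample_ns + 1) / math.log(exponent))
    lower = math.floor(exponent ** index)
    upper = math.floor(exponent ** (index + 1))
    return lower, upper


# A sample that took exactly the 10s timeout.
lower, upper = bucket_bounds_ns(10 * 1_000_000_000)
print(lower / 1e9, upper / 1e9)
```

So a 95th percentile displayed as 9.4s would be consistent with the 95th-percentile sample sitting in the bucket that contains the 10s timeout.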

Jan-Erik Rediger [:janerik]

Reporter

Updated (2 years ago)

Assignee: jrediger → nobody

Priority: P1 → --

Travis Long [:travis_]

Updated (2 years ago)

Priority: -- → P4


Categories

(Data Platform and Tools :: Glean: SDK, task, P4)

Tracking

(Not tracked)

People

(Reporter: janerik, Unassigned)


Security

(public)
