Validate new upload timing metrics
Categories
(Data Platform and Tools :: Glean: SDK, task, P4)
Tracking
(Not tracked)
People
(Reporter: janerik, Unassigned)
Details
There are 3 new metrics:
glean_upload.send_success
glean_upload.send_failure
glean_validation.shutdown_wait
Some validation might be good:
- Do these values look correct?
- Are these values within expectations (at most some seconds)? Similar to what we know from Desktop already?
We can do this on Desktop Nightly; the upload.send_* ones we can also check on Fenix.
Reporter | ||
Comment 1•2 years ago
I took a brief look at the numbers in GLAM.
send_success
GLAM graph.
95th percentile at 1.2s
The similar measurement in Legacy Telemetry (GLAM graph) reports 1.3s (1300ms) at the 95th percentile
The legacy is much flatter, or ... static. I'm not sure if this is a bug or real data.
In any case the Glean values seem to be reporting ok and are quite reasonable.
send_failure
GLAM graph
95th percentile at 9.4s, 75th percentile at 73ms.
Failing super quickly is likely a client-side issue (no network?).
Taking close to 10s before failing could be anything from slow network, interrupted connections or server timeouts.
Note that we don't have more detailed info about what the failure was; we record the timing for any failure.
Legacy (GLAM graph) has a 95th percentile of 17s, nearly twice as high as Glean's.
But again the lines are so flat, I don't trust them.
Not much we can conclude from this comparison.
shutdown_wait
Glean only. GLAM graph
95th percentile of just 12ms.
This seems very low, given that we spawn a thread, wait for another thread and then report back.
This might require another closer look.
:travis, for a second pair of eyes on at least the first two metrics and my thinking there.
Maybe we can eventually add this to monitoring, so we learn about shifts in the data.
Comment 2•2 years ago
Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.
Reporter | ||
Comment 3•2 years ago
(In reply to Chris H-C :chutten from comment #2)
Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.
Oh right. Guess I have to figure out how to properly query timing distributions in other places now.
Reporter | ||
Comment 4•2 years ago
Got the query from some other query that was using timing distributions, and adapted it for the glean_validation.shutdown_wait metric:
https://sql.telemetry.mozilla.org/queries/91253/source?p_app%20id=firefox_desktop_background_update
Looks like the 95th percentile is 24ms. Still super quick and I'm not sure what to take from that.
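For anyone re-checking these numbers by hand, the percentile logic behind such a query can be sketched in Python. This is a minimal sketch and not the Redash query linked above: it assumes each timing distribution arrives as a mapping from bucket lower bound (assumed to be in nanoseconds) to sample count, and the function names `merge_histograms` and `percentile_bucket` are made up for illustration. It returns the lower bound of the bucket that contains the requested percentile, so the result is only as precise as the bucketing.

```python
from collections import Counter


def merge_histograms(histograms):
    """Sum per-client bucket counts into one histogram.

    Each input is assumed to map a bucket lower bound (nanoseconds,
    possibly given as a string key) to a sample count.
    """
    total = Counter()
    for histogram in histograms:
        for bucket, count in histogram.items():
            total[int(bucket)] += count
    return dict(total)


def percentile_bucket(values, q):
    """Return the lower bound of the bucket holding the q-th quantile.

    `q` is a fraction in (0, 1], e.g. 0.95 for the 95th percentile.
    Returns None for an empty histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(values.values())
    if total == 0:
        return None
    threshold = q * total
    running = 0
    # Walk buckets from fastest to slowest until enough samples are covered.
    for bucket in sorted(values):
        running += values[bucket]
        if running >= threshold:
            return bucket
    return max(values)


# Hypothetical sample data, not real telemetry.
merged = merge_histograms([
    {"1000000": 90, "24000000": 5},
    {"24000000": 3, "90000000": 2},
])
p95_ms = percentile_bucket(merged, 0.95) / 1_000_000
p99_ms = percentile_bucket(merged, 0.99) / 1_000_000
```

Dividing by 1,000,000 converts the assumed nanosecond bucket bounds to milliseconds, which makes the output directly comparable to the values quoted in this bug.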
Comment 5•2 years ago
I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.
Reporter | ||
Comment 6•2 years ago
(In reply to Travis Long [:travis_] from comment #5)
I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.
Thanks! Yeah, dealing with timing distributions is a bit of a hassle, so I just copied an existing query and I'm not 100% sure it's correct yet.
Comment 7•2 years ago
Got a chance to take a look at this, and instead of using an APPROX_QUANTILES approach, I borrowed this query from the iOS folks (I think Chris Peterson may be who I borrowed it from). It was going to take a lot less adjusting to get it to look at a timing distribution. I think that this confirms the 24ms timing for the 95th percentile and around 90ms for the 99th percentile.
ref: https://sql.telemetry.mozilla.org/queries/91260/source?p_app_id=firefox_desktop&p_days=10#225945
Comment 8•2 years ago
Wow, that's really short and not what I expected at all. A pity we didn't have this earlier, as older clients were the ones having the most problems reporting issues. Thank you for rerunning on BUA.
Reporter | ||
Comment 9•2 years ago
Yeah, given we all think that is hella short we might want to at least briefly look into this further to validate that this is actually true. To the BUA testing /o/
Reporter | ||
Comment 10•2 years ago
Re send_failure: I realize the default timeout for viaduct is 10s. Thus having the 95th percentile as 9.4s makes kinda sense: that's pretty much the timeout.
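One way to test the timeout theory would be to measure what share of failure samples land close to the 10s mark. The following is a rough sketch under assumptions: the histogram is a mapping from bucket lower bound (assumed nanoseconds) to count, the helper name `share_at_or_above` is hypothetical, and the 9s cut-off is an arbitrary choice for "close to the timeout".

```python
def share_at_or_above(values, threshold_ns):
    """Fraction of samples whose bucket lower bound is >= threshold_ns.

    `values` is assumed to map bucket lower bounds (nanoseconds) to counts.
    """
    total = sum(values.values())
    if total == 0:
        return 0.0
    slow = sum(count for bucket, count in values.items()
               if int(bucket) >= threshold_ns)
    return slow / total


# Hypothetical sample data, not real telemetry.
failures = {73_000_000: 75, 2_000_000_000: 15, 9_400_000_000: 10}
near_timeout = share_at_or_above(failures, 9_000_000_000)
```

If a noticeable share of failures clusters just below 10s while very few exceed it, that would support the reading that the tail is dominated by requests hitting the timeout.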