Open Bug 1815211 Opened 2 years ago Updated 2 years ago

Validate new upload timing metrics

Categories

(Data Platform and Tools :: Glean: SDK, task, P4)

task

Tracking

(Not tracked)

People

(Reporter: janerik, Unassigned)

References

Details

There are 3 new metrics:

  • glean_upload.send_success
  • glean_upload.send_failure
  • glean_validation.shutdown_wait

Some validation might be good:

  • Do these values look correct?
  • Are these values within expectations (at most a few seconds)? Are they similar to what we already know from Desktop?

We can do this on Desktop Nightly; the upload.send_* ones we can also check on Fenix.

Summary: Valid upload timing → Validate new upload timing metrics
Priority: -- → P2
Assignee: nobody → jrediger
Type: defect → task
Priority: P2 → P1

I took a brief look at the numbers in GLAM.

send_success

GLAM graph.
95th percentile at 1.2s
The similar measurement in Legacy Telemetry (GLAM graph) reports 1.3s (1300ms) at the 95th percentile
The legacy graph is much flatter, or rather static. I'm not sure if this is a bug or real data.

In any case the Glean values seem to be reporting ok and are quite reasonable.

send_failure

GLAM graph
95th percentile at 9.4s, 75th percentile at 73ms.
Failing super quickly is likely a client-side issue (no network?).
Taking close to 10s before failing could be anything from a slow network to interrupted connections or server timeouts.
Note that we don't have more detailed info about what the failure was; we record the timing for any failure.

Legacy (GLAM graph) has a 95th percentile of 17s, nearly twice as high as for Glean.
But again the lines are so flat, I don't trust them.
Not much we can conclude from this comparison.

shutdown_wait

Glean only. GLAM graph

95th percentile of just 12ms.
This seems very low, given that we spawn a thread, wait for another thread and then report back.
This might require a closer look.


:travis, for a second pair of eyes on at least the first two metrics and my thinking there.
Maybe we can eventually add this to monitoring, so we learn about shifts in the data.

Flags: needinfo?(tlong)

Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.

(In reply to Chris H-C :chutten from comment #2)

> Firefox Desktop might be expected to be particularly quick at shutdown since it tends to be long-running. You might want to look at BUA's shutdown_wait since it'll almost certainly be doing things during most shutdowns.

Oh right. Guess I have to figure out how to properly query timing distributions in other places now.

Got the query from some other query that was using timing distributions, and adapted it for the glean_validation.shutdown_wait metric:

https://sql.telemetry.mozilla.org/queries/91253/source?p_app%20id=firefox_desktop_background_update

Looks like the 95th percentile is 24ms. Still super quick and I'm not sure what to take from that.

I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.

(In reply to Travis Long [:travis_] from comment #5)

> I plan on taking a look at this and clearing my ni? later today. Sorry for the delay. I use a slightly different approach than that with APPROX_QUANTILES, I'll see if I can duplicate that 95th percentile of 24ms.

Thanks! Yeah, dealing with timing distributions is a bit of a hassle, so I just copied an existing query and I'm not 100% sure it's correct yet.

Got a chance to take a look at this, and instead of using an APPROX_QUANTILES approach, I borrowed this query from the iOS folks (I think Chris Peterson may be who I borrowed it from). It was going to take a lot less adjusting to get it to look at a timing distribution. I think that this confirms the 24ms timing for the 95th percentile and around 90ms for the 99th percentile.

ref: https://sql.telemetry.mozilla.org/queries/91260/source?p_app_id=firefox_desktop&p_days=10#225945
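For anyone re-checking these numbers outside of Redash, here is a minimal Python sketch of how a percentile can be read off a bucketed timing distribution. It assumes the distribution has already been aggregated into a mapping of bucket lower bound (taken to be nanoseconds here) to sample count. The function name and the sample counts are made up for illustration, and this is not the query linked above.

```python
def percentile_from_buckets(buckets, q):
    """Return the lower bound of the bucket that contains the q-th quantile.

    buckets: dict mapping bucket lower bound (assumed nanoseconds) -> sample count
    q: quantile as a fraction in (0, 1], e.g. 0.95 for the 95th percentile
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("no samples")
    threshold = q * total
    cumulative = 0
    # Walk the buckets in ascending order until enough samples are covered.
    for lower_bound in sorted(buckets):
        cumulative += buckets[lower_bound]
        if cumulative >= threshold:
            return lower_bound
    # Only reachable through floating-point rounding; fall back to the last bucket.
    return max(buckets)


# Made-up counts, only to show the shape of the input.
example = {1_000_000: 50, 10_000_000: 40, 24_000_000: 6, 90_000_000: 4}
p95_ms = percentile_from_buckets(example, 0.95) / 1_000_000
p99_ms = percentile_from_buckets(example, 0.99) / 1_000_000
```

The result is a bucket lower bound, so its resolution is limited by the bucket width. That is worth keeping in mind when comparing values as small as a few milliseconds.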

Flags: needinfo?(tlong)

Wow, that's really short and not what I expected at all. A pity we didn't have this earlier, as older clients were the ones having the most problems reporting issues. Thank you for rerunning on BUA.

Yeah, given we all think that is hella short we might want to at least briefly look into this further to validate that this is actually true. To the BUA testing /o/

Re send_failure: I realize the default timeout for viaduct is 10s. Thus having the 95th percentile as 9.4s makes kinda sense: that's pretty much the timeout.
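To make that reasoning concrete, here is a small Python sketch that sorts failure durations into rough groups relative to the 10s timeout. Only the 10s value comes from this bug; the two cut-offs and the group names are hypothetical and chosen purely for illustration.

```python
VIADUCT_TIMEOUT_MS = 10_000    # default viaduct timeout mentioned in this bug
FAST_FAILURE_MS = 100          # hypothetical cut-off for "failed right away"
NEAR_TIMEOUT_FRACTION = 0.9    # hypothetical: within 10% of the timeout


def classify_failure(duration_ms):
    """Label a send_failure duration (in milliseconds) with a rough group."""
    if duration_ms < FAST_FAILURE_MS:
        return "fast"          # likely client-side, e.g. no network
    if duration_ms >= NEAR_TIMEOUT_FRACTION * VIADUCT_TIMEOUT_MS:
        return "near-timeout"  # consistent with hitting the request timeout
    return "slow"              # slow network, interrupted connection, ...


# The two percentiles reported above: 75th at 73ms, 95th at 9.4s.
labels = [classify_failure(73), classify_failure(9_400)]
```

With these assumed cut-offs the 75th percentile lands in the "fast" group and the 95th in "near-timeout", which matches the reading that most failures are immediate and the long tail is the timeout.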

Assignee: jrediger → nobody
Priority: P1 → --
Priority: -- → P4