Closed Bug 1591153 Opened 6 years ago Closed 6 years ago

Errors in some GeckoView timing distributions

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Unassigned)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

As of today, we are getting errors on around 5% - 10% of clients for the following geckoview metrics:

gfx.checkerboard.duration
gfx.checkerboard.potential_duration
geckoview.page_load_time
geckoview.startup_run_time

These are all timing distributions. Given that they are Geckoview metrics and accumulate directly (rather than using Glean's timing code), there are two possible error cases:

  1. A sample is negative
  2. A sample is longer than 10 minutes

We should investigate which of these cases is happening, and where the invalid values are being introduced and triage accordingly.

Hey Chris, in the previous meeting you mentioned sign expansion as a potential cause for this. Where would that happen? On the GeckoView wrapping side?

Flags: needinfo?(chutten)

Samples coming into and out of Gecko Telemetry are uint32_t. They are upgraded to Java longs at some point and I remembered that type width promotions in the JVM are sign-extended, not 0-padded. Combined with a lack of unsigned types, I remember fighting a similar fight a decade ago.

But this might not be relevant as the code widening the type is C++, and it uses static_cast<int64_t> explicitly which ought to (and does, in my small tests) preserve value and not blindly treat the top bit as a sign bit.

So maybe this isn't the source of the errors after all. Maybe it's exceeding 10min.

(( should we maybe report the < 0 and > 10min cases separately in future? ))

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #2)

So maybe this isn't the source of the errors after all. Maybe it's exceeding 10min.

Thanks for looking into that. Yes, this seems unlikely.

(( should we maybe report the < 0 and > 10min cases separately in future? ))

Maybe, good point. I filed bug 1591912.


FWIW, here are the CDF/PDF for geckoview.page_load_time: query.

The longer than 10 minutes sounds plausible, given these distributions? There's a bunch of samples in the 163,443,381,347 bucket... :-D

Flags: needinfo?(mdroettboom)
Flags: needinfo?(chutten)

The 163,443,381,347 bucket is 2m43s, so it's possible that we're seeing something over 10min.

(( Side-note: When we error for < 0 and > 10 min, do we throw away the samples or put them in the underflow/overflow buckets? ))

Flags: needinfo?(chutten)

Negative values are dropped (and in fact, the entire sample window is dropped due to a single negative value). Long values are truncated and recorded.

Since that highest bucket is only 2m43s, and the next bucket is ~3m, I don't think we're seeing > 10m samples.

So signs point to negative values coming in (somehow).

Flags: needinfo?(mdroettboom)

I adjusted the mozilla-central-resident geckoview tests to check and see if a roundtrip through GV Streaming Telemetry was causing big or negative numbers to become negative or big.

It turns out that numbers above UINT32_MAX become UINT32_MAX, staying within the space representable by uint32_t. The sign remains positive.

(( And, for the record, a negative sample is coerced to a large uint32_t value. ))

This isn't conclusive evidence that the fault lies above geckoview (Maybe some normalization happens somewhere before the int64_t is cut down to uint32_t and so this is test-only behaviour), but it is consistent with that theory.

Sorry. I wish I could say it was GV Streaming Telemetry's fault, but I can't.

Thanks Chris.

As discussed over IRC, and seen in this query, geckoview_page_load_time has 59 samples in the 599,512,966,122 bucket, which is the bucket holding values for 10 minutes lengths. As documented, values greater than 10 minutes are cropped and reported in that bucket, in addition to generating an error.

So, for the specific page load time metric, we're seeing legitimate page loads longer than 10 minutes...

See Also: → 1592278
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m10]
Whiteboard: [telemetry:glean-rs:m10] → [telemetry:glean-rs:m11]
See Also: → 1591912

We recently added a new error type for timing distributions, invalid_overflow, which specifically counts the number of times a value bigger than 10 minutes was attempted to be recorded.

Turns the number of overflow errors for the metrics in comment 0 are not too big (see this query), the invalid_value (negative values in this context) are way bigger.

So yes, looks like we're receiving negative values?

(In reply to Michael Droettboom [:mdroettboom] from comment #0)

gfx.checkerboard.duration

All the errors are about this overflowing. I filed bug 1617179.

gfx.checkerboard.potential_duration

All the errors are about this overflowing. I filed bug 1617175 for this.

geckoview.page_load_time

Since after we landed the overflow error type, that seems to be the only reason why we're seeing erros. We have filed bug 1592278 for this.

geckoview.startup_run_time

There was a spike around december, with 822 errors reported, but now we're around 2-4. No longer a problem :)

Mike, can we go ahead and close this? Do you think we should do anything else here?

Flags: needinfo?(mdroettboom)
See Also: → 1617175, 1617179
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mdroettboom)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.