Errors in some GeckoView timing distributions
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Unassigned)
References
Details
(Whiteboard: [telemetry:glean-rs:m11])
As of today, we are getting errors on around 5% - 10% of clients for the following geckoview metrics:
gfx.checkerboard.duration
gfx.checkerboard.potential_duration
geckoview.page_load_time
geckoview.startup_run_time
These are all timing distributions. Given that they are Geckoview metrics and accumulate directly (rather than using Glean's timing code), there are two possible error cases:
- A sample is negative
- A sample is longer than 10 minutes
We should investigate which of these cases is happening, and where the invalid values are being introduced and triage accordingly.
Comment 1•6 years ago
|
||
Hey Chris, in the previous meeting you mentioned sign expansion as a potential cause for this. Where would that happen? On the GeckoView wrapping side?
Comment 2•6 years ago
|
||
Samples coming into and out of Gecko Telemetry are uint32_t. They are upgraded to Java longs at some point and I remembered that type width promotions in the JVM are sign-extended, not 0-padded. Combined with a lack of unsigned types, I remember fighting a similar fight a decade ago.
But this might not be relevant as the code widening the type is C++, and it uses static_cast<int64_t> explicitly which ought to (and does, in my small tests) preserve value and not blindly treat the top bit as a sign bit.
So maybe this isn't the source of the errors after all. Maybe it's exceeding 10min.
(( should we maybe report the < 0 and > 10min cases separately in future? ))
Comment 3•6 years ago
•
|
||
(In reply to Chris H-C :chutten from comment #2)
So maybe this isn't the source of the errors after all. Maybe it's exceeding 10min.
Thanks for looking into that. Yes, this seems unlikely.
(( should we maybe report the
< 0and> 10mincases separately in future? ))
Maybe, good point. I filed bug 1591912.
FWIW, here are the CDF/PDF for geckoview.page_load_time: query.
The longer than 10 minutes sounds plausible, given these distributions? There's a bunch of samples in the 163,443,381,347 bucket... :-D
Comment 4•6 years ago
|
||
The 163,443,381,347 bucket is 2m43s, so it's possible that we're seeing something over 10min.
(( Side-note: When we error for < 0 and > 10 min, do we throw away the samples or put them in the underflow/overflow buckets? ))
| Reporter | ||
Comment 5•6 years ago
|
||
Negative values are dropped (and in fact, the entire sample window is dropped due to a single negative value). Long values are truncated and recorded.
Since that highest bucket is only 2m43s, and the next bucket is ~3m, I don't think we're seeing > 10m samples.
So signs point to negative values coming in (somehow).
Comment 6•6 years ago
|
||
I adjusted the mozilla-central-resident geckoview tests to check and see if a roundtrip through GV Streaming Telemetry was causing big or negative numbers to become negative or big.
It turns out that numbers above UINT32_MAX become UINT32_MAX, staying within the space representable by uint32_t. The sign remains positive.
(( And, for the record, a negative sample is coerced to a large uint32_t value. ))
This isn't conclusive evidence that the fault lies above geckoview (Maybe some normalization happens somewhere before the int64_t is cut down to uint32_t and so this is test-only behaviour), but it is consistent with that theory.
Sorry. I wish I could say it was GV Streaming Telemetry's fault, but I can't.
Comment 7•6 years ago
|
||
Thanks Chris.
As discussed over IRC, and seen in this query, geckoview_page_load_time has 59 samples in the 599,512,966,122 bucket, which is the bucket holding values for 10 minutes lengths. As documented, values greater than 10 minutes are cropped and reported in that bucket, in addition to generating an error.
So, for the specific page load time metric, we're seeing legitimate page loads longer than 10 minutes...
Updated•6 years ago
|
Updated•6 years ago
|
Comment 8•6 years ago
|
||
We recently added a new error type for timing distributions, invalid_overflow, which specifically counts the number of times a value bigger than 10 minutes was attempted to be recorded.
Turns the number of overflow errors for the metrics in comment 0 are not too big (see this query), the invalid_value (negative values in this context) are way bigger.
So yes, looks like we're receiving negative values?
Comment 9•6 years ago
|
||
(In reply to Michael Droettboom [:mdroettboom] from comment #0)
gfx.checkerboard.duration
All the errors are about this overflowing. I filed bug 1617179.
gfx.checkerboard.potential_duration
All the errors are about this overflowing. I filed bug 1617175 for this.
geckoview.page_load_time
Since after we landed the overflow error type, that seems to be the only reason why we're seeing erros. We have filed bug 1592278 for this.
geckoview.startup_run_time
There was a spike around december, with 822 errors reported, but now we're around 2-4. No longer a problem :)
Mike, can we go ahead and close this? Do you think we should do anything else here?
| Reporter | ||
Updated•6 years ago
|
Description
•