Calls to TimingDistribution.accumulateSingleSample are not mirrored through GIFFT
Categories
(Toolkit :: Telemetry, task, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox136 | --- | fixed |
People
(Reporter: florian, Assigned: chutten)
References
Details
Attachments
(2 files)
My try run for histogram migration patches has lots of test failures.
Looking locally at a profile of toolkit/components/antitracking/test/xpcshell/test_purge_trackers_telemetry.js, I see a "TimingDistribution::accumulate - cookiePurging.trackersUserInteractionRemainingDays: 4" marker, but no equivalent "Histogram::Add" marker.
The relevant patch to apply to reproduce this test failure is https://phabricator.services.mozilla.com/D234068
Applying this change to the code makes the test pass:
diff --git a/toolkit/components/glean/api/src/private/timing_distribution.rs b/toolkit/components/glean/api/src/private/timing_distribution.rs
--- a/toolkit/components/glean/api/src/private/timing_distribution.rs
+++ b/toolkit/components/glean/api/src/private/timing_distribution.rs
@@ -560,6 +560,13 @@ impl TimingDistribution for TimingDistri
Some(TDMPayload::Sample(sample.clone())),
)
);
+ extern "C" {
+ fn GIFFT_TimingDistributionAccumulateRawMillis(metric_id: u32, sample: u32);
+ }
+ // SAFETY: using only primitives, no return value.
+ unsafe {
+ GIFFT_TimingDistributionAccumulateRawMillis(id.0, sample as u32);
+ }
inner.accumulate_single_sample(sample)
}
TimingDistributionMetric::Child(_c) => {
I have not submitted this change on phabricator as I think it would be good to review the entire file to verify if other calls are missing. I also wonder if GIFFT_TimingDistributionAccumulateRawMillis should be renamed to remove the "Millis" suffix as in this case I'm accumulating "days" rather than "millis".
Assignee | ||
Comment 1•1 month ago
|
||
This has uncovered a bit of a wrinkle in mirroring time-based histograms that don't use milliseconds.
By design, GIFFT will convert supplied instrumentation into milliseconds before mirroring from a timing_distribution
to a histogram of kind: exponential
or kind: linear
. This was because FOG has no idea what time unit a given timing_distribution
metric is, and needed some unit to supply samples to the Legacy Telemetry API. Since the Legacy Telemetry API defaults to milliseconds (on e.g. AccumulateTimeDelta
), and the vast majority of timing-based histograms are based around millis, that's what we went with.
But now that we're using GIFFT in bulk and would prefer to support all time-based Legacy Telemetry Histograms as they currently exist (whether they use finer or coarser units), we're going to need a different design.
Now, for this bug specifically, the sample-based API would be the easiest way to support these non-millis timing histograms. Samples are samples, and we can "just" assume that the sample being handed to sample-based APIs are in the correct unit for both the timing_distribution
(as is the contract of the API).
I think we desire, and it would be reasonable for users to expect, for the unit of the timing_distribution
and the mirrored-to histogram to match.
Wrinkles and unknowns:
- The sample-based APIs are available on JS and Rust, but are omitted in C++ in favour of
AccumulateRawDuration
- GIFFT on
timing_distribution
is roundabout due to supporting GIFFT from Rust (the only place that does so (bug 1751448)). - I'm tentatively optimistic that we can snag the
TimeUnit
of thetiming_distribution
on construction which should allow us to convert durations (from stop_and_accumulate, AccumulateRawDuration, etc) into samples. - It might be harder to get the correct
TimeUnit
whenLabeledTimingDistributionMetric
submetric instances are being created inSealed::from_glean_metric
- I haven't yet checked the existing mirrored
timing_distribution
+ histogram pairs to ensure none of them depend on having different units
Assignee | ||
Comment 2•1 month ago
|
||
Through the attached pythonic introspection of the metrics defined in-tree, I've found that the following mirrored timing_distribution
metrics have non-millisecond time_units:
'paint.build_displaylist_time PAINT_BUILD_DISPLAYLIST_TIME',
'wr.framebuild_time WR_FRAMEBUILD_TIME',
'wr.gpu_wait_time WR_GPU_WAIT_TIME',
'wr.rasterize_blobs_time WR_RASTERIZE_BLOBS_TIME',
'wr.rasterize_glyphs_time WR_RASTERIZE_GLYPHS_TIME',
'wr.renderer_time WR_RENDERER_TIME',
'wr.renderer_time_no_sc WR_RENDERER_TIME_NO_SC_MS',
'wr.scenebuild_time WR_SCENEBUILD_TIME',
'wr.sceneswap_time WR_SCENESWAP_TIME',
'wr.texture_cache_update_time WR_TEXTURE_CACHE_UPDATE_TIME',
'wr.time_to_frame_build WR_TIME_TO_FRAME_BUILD_MS',
'wr.time_to_render_start WR_TIME_TO_RENDER_START_MS',
'test_only.what_time_is_it TELEMETRY_TEST_MIRROR_FOR_TIMING',
'test_only.ipc.a_timing_dist TELEMETRY_TEST_EXPONENTIAL',
'search.service.startup_time SEARCH_SERVICE_INIT2_MS']
My next step is to audit these to see whether they're intentional or not.
Assignee | ||
Comment 3•1 month ago
|
||
Audit:
- 'paint.build_displaylist_time PAINT_BUILD_DISPLAYLIST_TIME', -- Uses the default
time_unit
(nanosecond
) and mirrors to long-standing millis-resolution hgram. Was migrated from GVST. - 'wr.framebuild_time WR_FRAMEBUILD_TIME', 'wr.gpu_wait_time WR_GPU_WAIT_TIME', 'wr.rasterize_blobs_time WR_RASTERIZE_BLOBS_TIME', 'wr.rasterize_glyphs_time WR_RASTERIZE_GLYPHS_TIME', 'wr.renderer_time WR_RENDERER_TIME', 'wr.renderer_time_no_sc WR_RENDERER_TIME_NO_SC_MS', 'wr.scenebuild_time WR_SCENEBUILD_TIME', 'wr.sceneswap_time WR_SCENESWAP_TIME', 'wr.texture_cache_update_time WR_TEXTURE_CACHE_UPDATE_TIME', 'wr.time_to_frame_build WR_TIME_TO_FRAME_BUILD_MS', 'wr.time_to_render_start WR_TIME_TO_RENDER_START_MS', -- Migrated as part of GVST removal.
time_unit: microsecond
. - 'test_only.what_time_is_it TELEMETRY_TEST_MIRROR_FOR_TIMING', 'test_only.ipc.a_timing_dist TELEMETRY_TEST_EXPONENTIAL', -- Uses default
time_unit
(nanosecond
) and doesn't really care about it. - 'search.service.startup_time SEARCH_SERVICE_INIT2_MS' -- Explicit
time_unit: nanosecond
added in bug 1832509.
All of these (except for the test-only ones which I will from now on ignore) would be sensitive to changing the resolution all of a sudden.
Possible resolutions, assuming we implement the change to GIFFT to perform time_unit
-matching mirroring:
- Stop mirroring to the legacy histograms. This is the ideal case, but depends on these histograms not being useful.
- Introduce duplicate
timing_distribution
metrics withtime_unit: millisecond
to mirror to the named histograms. A little messy (introduces duplicate metrics), but straightforward. - Add an API that does conversion mirroring like today, and migrate these users to it. Splits the API in a way that will require people to read documentation and is a layer violation.
- Put the "needs conversion" information into the metric definition (e.g.
metadata: telemetry_mirror_time_unit: millisecond
), codegen that information intoHistogramGIFFTMap.h
(e.g.HistogramSamplesNeedConversion(aId)
), then teachTimingDistribution.cpp
to read that info. Heaviest lift, but has the best ergonomics to my eye.
Let's start by asking about resolution 1:
ni?Standard8 for search.service.startup_time
and SEARCH_SERVICE_INIT2_MS
-- Is SEARCH_SERVICE_INIT2_MS
something we could remove in favour of the higher-resolution search.service.startup_time
?
ni?jnicol for paint.*
and wr.*
-- Are the legacy histograms something we could remove at this time now that it's been two years since we put higher-resolution data into the timing_distribution
metrics?
Assignee | ||
Comment 4•1 month ago
|
||
The Labeled Wrinkle: FOG labeled_timing_distribution
metrics don't know their time_unit
(because up until now only the Glean labeled_timing_distribution
metrics needed to know that), meaning we wouldn't be able to switch GIFFT over to match the specified time_unit
... except...
I audited the 80 keyed exponential and linear histograms (found via $ cat toolkit/components/telemetry/Histograms.json | jq 'to_entries | .[] | {key: .key, descr: .value.description, kind: .value.kind, keyed: .value.keyed} | select(.kind == "exponential" or .kind == "linear") | select(.keyed) | .key + " " + .descr'
) and all of them are either not recording times or are recording times in units of milliseconds.
So we're clear to have GIFFT convert samples to time_unit
units on mirror for timing_distribution
metrics, but converting to millis for labeled_timing_distribution
metrics if we want to go that route.
Comment 5•1 month ago
|
||
(In reply to Chris H-C :chutten from comment #3)
ni?Standard8 for
search.service.startup_time
andSEARCH_SERVICE_INIT2_MS
-- IsSEARCH_SERVICE_INIT2_MS
something we could remove in favour of the higher-resolutionsearch.service.startup_time
?
I'm fine with this. It is a health metric, and fine to be only in Glean.
Assignee | ||
Comment 6•1 month ago
|
||
Comment 7•26 days ago
|
||
I expect it is okay to stop mirroring the wr.* and paint.* timing probes to legacy telemetry given the data is available elsewhere. I'm not aware of my team actively using the legacy probes. Perhaps worth checking with the performance team too. Bas?
Assignee | ||
Comment 8•25 days ago
|
||
I think it likely enough that perf will agree with you that I'll file a bug and put up a patch for removal.
Comment 10•23 days ago
|
||
Comment 11•23 days ago
|
||
bugherder |
Description
•