Closed Bug 1941760 Opened 1 month ago Closed 23 days ago

Calls to TimingDistribution.accumulateSingleSample are not mirrored through GIFFT

Categories

(Toolkit :: Telemetry, task, P1)

task

Tracking

()

RESOLVED FIXED
136 Branch
Tracking Status
firefox136 --- fixed

People

(Reporter: florian, Assigned: chutten)

References

Details

Attachments

(2 files)

My try run for histogram migration patches has lots of test failures.

Looking locally at a profile of toolkit/components/antitracking/test/xpcshell/test_purge_trackers_telemetry.js, I see a "TimingDistribution::accumulate - cookiePurging.trackersUserInteractionRemainingDays: 4" marker, but no equivalent "Histogram::Add" marker.

The relevant patch to apply to reproduce this test failure is https://phabricator.services.mozilla.com/D234068

Applying this change to the code makes the test pass:

diff --git a/toolkit/components/glean/api/src/private/timing_distribution.rs b/toolkit/components/glean/api/src/private/timing_distribution.rs
--- a/toolkit/components/glean/api/src/private/timing_distribution.rs
+++ b/toolkit/components/glean/api/src/private/timing_distribution.rs
@@ -560,6 +560,13 @@ impl TimingDistribution for TimingDistri
                         Some(TDMPayload::Sample(sample.clone())),
                     )
                 );
+                extern "C" {
+                    fn GIFFT_TimingDistributionAccumulateRawMillis(metric_id: u32, sample: u32);
+                }
+                // SAFETY: using only primitives, no return value.
+                unsafe {
+                    GIFFT_TimingDistributionAccumulateRawMillis(id.0, sample as u32);
+                }
                 inner.accumulate_single_sample(sample)
             }
             TimingDistributionMetric::Child(_c) => {

I have not submitted this change on phabricator as I think it would be good to review the entire file to verify if other calls are missing. I also wonder if GIFFT_TimingDistributionAccumulateRawMillis should be renamed to remove the "Millis" suffix as in this case I'm accumulating "days" rather than "millis".

This has uncovered a bit of a wrinkle in mirroring time-based histograms that don't use milliseconds.

By design, GIFFT will convert supplied instrumentation into milliseconds before mirroring from a timing_distribution to a histogram of kind: exponential or kind: linear. This was because FOG has no idea what time unit a given timing_distribution metric is, and needed some unit to supply samples to the Legacy Telemetry API. Since the Legacy Telemetry API defaults to milliseconds (on e.g. AccumulateTimeDelta), and the vast majority of timing-based histograms are based around millis, that's what we went with.

But now that we're using GIFFT in bulk and would prefer to support all time-based Legacy Telemetry Histograms as they currently exist (whether they use finer or coarser units), we're going to need a different design.

Now, for this bug specifically, the sample-based API would be the easiest way to support these non-millis timing histograms. Samples are samples, and we can "just" assume that the sample being handed to sample-based APIs are in the correct unit for both the timing_distribution (as is the contract of the API).

I think we desire, and it would be reasonable for users to expect, for the unit of the timing_distribution and the mirrored-to histogram to match.

Wrinkles and unknowns:

  • The sample-based APIs are available on JS and Rust, but are omitted in C++ in favour of AccumulateRawDuration
  • GIFFT on timing_distribution is roundabout due to supporting GIFFT from Rust (the only place that does so (bug 1751448)).
  • I'm tentatively optimistic that we can snag the TimeUnit of the timing_distribution on construction which should allow us to convert durations (from stop_and_accumulate, AccumulateRawDuration, etc) into samples.
  • It might be harder to get the correct TimeUnit when LabeledTimingDistributionMetric submetric instances are being created in Sealed::from_glean_metric
  • I haven't yet checked the existing mirrored timing_distribution + histogram pairs to ensure none of them depend on having different units
Assignee: nobody → chutten
Severity: -- → N/A
Status: NEW → ASSIGNED
Type: defect → task
Priority: -- → P1

Through the attached pythonic introspection of the metrics defined in-tree, I've found that the following mirrored timing_distribution metrics have non-millisecond time_units:

'paint.build_displaylist_time PAINT_BUILD_DISPLAYLIST_TIME', 
'wr.framebuild_time WR_FRAMEBUILD_TIME', 
'wr.gpu_wait_time WR_GPU_WAIT_TIME', 
'wr.rasterize_blobs_time WR_RASTERIZE_BLOBS_TIME', 
'wr.rasterize_glyphs_time WR_RASTERIZE_GLYPHS_TIME', 
'wr.renderer_time WR_RENDERER_TIME', 
'wr.renderer_time_no_sc WR_RENDERER_TIME_NO_SC_MS', 
'wr.scenebuild_time WR_SCENEBUILD_TIME', 
'wr.sceneswap_time WR_SCENESWAP_TIME', 
'wr.texture_cache_update_time WR_TEXTURE_CACHE_UPDATE_TIME', 
'wr.time_to_frame_build WR_TIME_TO_FRAME_BUILD_MS', 
'wr.time_to_render_start WR_TIME_TO_RENDER_START_MS', 
'test_only.what_time_is_it TELEMETRY_TEST_MIRROR_FOR_TIMING', 
'test_only.ipc.a_timing_dist TELEMETRY_TEST_EXPONENTIAL', 
'search.service.startup_time SEARCH_SERVICE_INIT2_MS']

My next step is to audit these to see whether they're intentional or not.

Audit:

  • 'paint.build_displaylist_time PAINT_BUILD_DISPLAYLIST_TIME', -- Uses the default time_unit (nanosecond) and mirrors to long-standing millis-resolution hgram. Was migrated from GVST.
  • 'wr.framebuild_time WR_FRAMEBUILD_TIME', 'wr.gpu_wait_time WR_GPU_WAIT_TIME', 'wr.rasterize_blobs_time WR_RASTERIZE_BLOBS_TIME', 'wr.rasterize_glyphs_time WR_RASTERIZE_GLYPHS_TIME', 'wr.renderer_time WR_RENDERER_TIME', 'wr.renderer_time_no_sc WR_RENDERER_TIME_NO_SC_MS', 'wr.scenebuild_time WR_SCENEBUILD_TIME', 'wr.sceneswap_time WR_SCENESWAP_TIME', 'wr.texture_cache_update_time WR_TEXTURE_CACHE_UPDATE_TIME', 'wr.time_to_frame_build WR_TIME_TO_FRAME_BUILD_MS', 'wr.time_to_render_start WR_TIME_TO_RENDER_START_MS', -- Migrated as part of GVST removal. time_unit: microsecond.
  • 'test_only.what_time_is_it TELEMETRY_TEST_MIRROR_FOR_TIMING', 'test_only.ipc.a_timing_dist TELEMETRY_TEST_EXPONENTIAL', -- Uses default time_unit (nanosecond) and doesn't really care about it.
  • 'search.service.startup_time SEARCH_SERVICE_INIT2_MS' -- Explicittime_unit: nanosecond added in bug 1832509.

All of these (except for the test-only ones which I will from now on ignore) would be sensitive to changing the resolution all of a sudden.

Possible resolutions, assuming we implement the change to GIFFT to perform time_unit-matching mirroring:

  1. Stop mirroring to the legacy histograms. This is the ideal case, but depends on these histograms not being useful.
  2. Introduce duplicate timing_distribution metrics with time_unit: millisecond to mirror to the named histograms. A little messy (introduces duplicate metrics), but straightforward.
  3. Add an API that does conversion mirroring like today, and migrate these users to it. Splits the API in a way that will require people to read documentation and is a layer violation.
  4. Put the "needs conversion" information into the metric definition (e.g. metadata: telemetry_mirror_time_unit: millisecond), codegen that information into HistogramGIFFTMap.h (e.g. HistogramSamplesNeedConversion(aId)), then teach TimingDistribution.cpp to read that info. Heaviest lift, but has the best ergonomics to my eye.

Let's start by asking about resolution 1:

ni?Standard8 for search.service.startup_time and SEARCH_SERVICE_INIT2_MS -- Is SEARCH_SERVICE_INIT2_MS something we could remove in favour of the higher-resolution search.service.startup_time?
ni?jnicol for paint.* and wr.* -- Are the legacy histograms something we could remove at this time now that it's been two years since we put higher-resolution data into the timing_distributionmetrics?

Flags: needinfo?(standard8)
Flags: needinfo?(jnicol)

The Labeled Wrinkle: FOG labeled_timing_distribution metrics don't know their time_unit (because up until now only the Glean labeled_timing_distribution metrics needed to know that), meaning we wouldn't be able to switch GIFFT over to match the specified time_unit... except...

I audited the 80 keyed exponential and linear histograms (found via $ cat toolkit/components/telemetry/Histograms.json | jq 'to_entries | .[] | {key: .key, descr: .value.description, kind: .value.kind, keyed: .value.keyed} | select(.kind == "exponential" or .kind == "linear") | select(.keyed) | .key + " " + .descr') and all of them are either not recording times or are recording times in units of milliseconds.

So we're clear to have GIFFT convert samples to time_unit units on mirror for timing_distribution metrics, but converting to millis for labeled_timing_distribution metrics if we want to go that route.

(In reply to Chris H-C :chutten from comment #3)

ni?Standard8 for search.service.startup_time and SEARCH_SERVICE_INIT2_MS -- Is SEARCH_SERVICE_INIT2_MS something we could remove in favour of the higher-resolution search.service.startup_time?

I'm fine with this. It is a health metric, and fine to be only in Glean.

Flags: needinfo?(standard8)
Depends on: 1942150

I expect it is okay to stop mirroring the wr.* and paint.* timing probes to legacy telemetry given the data is available elsewhere. I'm not aware of my team actively using the legacy probes. Perhaps worth checking with the performance team too. Bas?

Flags: needinfo?(jnicol) → needinfo?(bas)

I think it likely enough that perf will agree with you that I'll file a bug and put up a patch for removal.

Depends on: 1943115

That seems absolutely fine.

Flags: needinfo?(bas)
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d1024d5b455d Change GIFFT to mirror timing_distribution samples in time_unit units instead of just in millis r=janerik
Status: ASSIGNED → RESOLVED
Closed: 23 days ago
Resolution: --- → FIXED
Target Milestone: --- → 136 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: