Closed Bug 1751241 Opened 3 years ago Closed 3 years ago

Why is the gpu time in Firefox beta 97 reported by Glean on the "metrics" ping 8x lower than the same value reported by Telemetry on the "main" ping?

Categories

(Toolkit :: Telemetry, defect, P1)

Tracking

RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

(Whiteboard: [data quality])

Attachments

(2 files)

They're straightforward queries. Look at the counter metric power.total_gpu_time_ms in Glean and its mirror, the uint Scalar power.total_gpu_time_ms, in Telemetry. Sum the values reported in the "metrics" or "main" ping, as appropriate. (And in Telemetry's case, sum the gpu time reported by the gpu process with the gpu time reported by the parent process, since it separates those out.)

And then notice that Glean is lower. No matter, you think, Glean "metrics" pings come from fewer clients. This is true. They're coming from about 15% fewer clients in this case.

But the GPU time sums are 8x lower reported by Glean.

What's going on?

Glean Query
Telemetry Query

(( This effect appears in the sampled main_1pct but is noisy as all get out (I presume due to the population effects involved in using a 1% sample of beta). I've included a query run against the full depth of telemetry.main to be certain. ))

What kills me about this is that Telemetry is the higher one. No values can possibly get into Telemetry without being reported via a Glean API... so how is Telemetry reporting more than Glean (the one who gets first crack at the data)?

Now, one thing I can't help but notice is that the parent-process values of gpu time (Telemetry) are the same order of magnitude (120M millis vs 600M millis) as the all-process sums reported by Glean. IPC support in the "gpu" process is tested, but new, so maybe we're missing 88% of the gpu activity somehow?

But. The time when the information is recorded via a Glean API is right before we send Glean IPC. So it should be impossible for us to be missing IPC like this.

Local testing isn't coming up with anything, either. My Windows Fx97 beta has the exact same value in Glean as in Telemetry.

Something I noticed when checking the distribution of values is that the max value in Telemetry was 2^32 and the max in Glean was 2^31. This is expected, documented, tested, and in fact we have error reporting for when Glean saturates a counter metric.

Surely this much of a difference in possible maximum, spread over the reporting population, can account for all or at least most of the 8x?

Alas, not. It does get us down to something closer to 6x, though.

I guess I shouldn't have expected so much, given that only 150 or so pings a day complained they'd saturated their counters.

Oh. Oh fun. Turns out that over 1k pings per day on beta are coming in via Glean with at least one invalid_value error. This means that Glean was asked to take a negative or 0 value.

Now, it looks like we protect against 0 values. Unfortunately, the cast to int32_t (and the subsequent cast back to uint32_t) are hiding some stuff from us.

Because of how C++ treats casts, imagine the following:

  1. GetGpuTimeSinceProcessStartInMs returns some value above int32_t's max value (which is 2^31 - 1 aka 2147483647) but below uint32_t's max value (2^32 - 1 aka 4294967295)
  2. This isn't 0, so we'll cast it to int32_t. The value becomes negative, because the sign bit is set.
  3. This metric is mirrored, so we'll need to pass the value to Telemetry. But first we cast it to uint32_t, which makes it a big positive number again.
  4. Telemetry accepts this value. If it's the only value it's given, Telemetry reports a large value between 2^31 - 1 and 2^32 - 1. If Telemetry already has a largish value, it might overflow and wrap around.
  5. Glean forbids counters adding a negative number, so the value from step 2 is discarded and an error is logged: invalid_value

Repeat several times over and we can now understand a mechanism by which the values stored in Telemetry and Glean are likely to be wildly out of sync.

So how to fix this? To the code in-tree we'll want two changes:

  1. In the GIFFT code that's mirroring to Telemetry we should add a check for negative numbers before the int32_t -> uint32_t cast. If it's a negative value, we should not pass any value to Telemetry but continue to pass the negative value to Glean (for error reporting purposes).
  2. In the GPU/CPU time calculations we should clamp the maximum possible value to int32_t's max value. If GetGpuTimeSinceProcessStartInMs returns some value higher than it, too bad, we saturate. We should also instrument in Glean and Telemetry that we've encountered such a situation so we'll have the option in analysis to discount this ping/client.
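The two changes above can be sketched roughly as follows. The function names here are illustrative, not the actual in-tree APIs:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Fix 1 (sketch): in the GIFFT mirroring path, drop negative values
// instead of casting them to uint32_t. Glean still receives the
// negative value separately, so its invalid_value error reporting
// keeps working.
void MirrorCounterToTelemetry(int32_t aValue) {
  if (aValue < 0) {
    return;  // don't mirror; let Glean log invalid_value
  }
  // ... pass static_cast<uint32_t>(aValue) to the Telemetry Scalar ...
}

// Fix 2 (sketch): clamp the raw ms reading so it fits in int32_t
// before it reaches the Glean API; saturate instead of wrapping.
int32_t ClampGpuTimeMs(uint64_t aRawMs) {
  constexpr uint64_t kMax =
      static_cast<uint64_t>(std::numeric_limits<int32_t>::max());
  return static_cast<int32_t>(std::min(aRawMs, kMax));
}
```

With the clamp in place, the 3-billion-ms reading from the earlier example saturates at 2147483647 instead of going negative, so both Glean and Telemetry see the same (capped) value.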

Longer-term we'll need a better way to record these staggeringly-large values. I've filed bug 1751277 for Glean SDK support. But maybe also we should figure out how we're getting such stunningly-large values. Isn't 2^31 millis over three weeks? How are these clients accumulating that much GPU time?

(( Also: this is not guaranteed to fix everything. There's no guarantee this is the only effect causing these strange results. But this is definitely a contributing cause, and is a good place to start. ))

Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7d047cf994cc If Glean's given large or negative values, don't mirror them to Telemetry r=janerik
https://hg.mozilla.org/integration/autoland/rev/403e96a8429b Ensure CPU and GPU time measurements fit in instrumentation r=florian,janerik

Backed out 2 changesets (bug 1751241) for causing xpc failures in toolkit/components/glean/tests/xpcshell/test_GIFFT.js

Backout link: https://hg.mozilla.org/integration/autoland/rev/9298969861fe0b2ecd3fa983e938741f3e80f7ed

Push with failures

Failure log

INFO -  TEST-START | toolkit/components/glean/tests/xpcshell/test_GIFFT.js
[task 2022-01-24T17:05:49.507Z] 17:05:49     INFO -  adb launch_application: am startservice -W -n 'org.mozilla.geckoview.test_runner/org.mozilla.geckoview.test_runner.XpcshellTestRunnerService$i0' -a android.intent.action.MAIN --es env0 XPCOM_DEBUG_BREAK=stack-and-abort --es env1 MOZ_CRASHREPORTER=1 --es env2 MOZ_CRASHREPORTER_NO_REPORT=1 --es env3 MOZ_DISABLE_NONLOCAL_CONNECTIONS=1 --es env4 MOZ_DEVELOPER_REPO_DIR=/builds/worker/checkouts/gecko --es env5 MOZ_DISABLE_CONTENT_SANDBOX=1 --es env6 MOZ_FETCHES_DIR=/builds/worker/fetches --es env7 MOZ_DISABLE_SOCKET_PROCESS=1 --es env8 LD_LIBRARY_PATH=/data/local/tmp/test_root/xpcb --es env9 MOZ_LINKER_CACHE=/data/local/tmp/test_root/xpcb --es env10 GRE_HOME=/data/local/tmp/test_root/xpcb --es env11 XPCSHELL_TEST_PROFILE_DIR=/data/local/tmp/test_root/xpc/p/a733d5e0-71cb-4b83-9b0d-5a5a177bef18 --es env12 HOME=/data/local/tmp/test_root/xpc/p --es env13 XPCSHELL_TEST_TEMP_DIR=/data/local/tmp/test_root/xpc/tmp/b5af8339-e587-4141-ba93-1caa38c72da7 --es env14 MOZ_ANDROID_DATA_DIR=/data/local/tmp/test_root/xpcb --es env15 MOZ_IN_AUTOMATION=1 --es env16 MOZ_ANDROID_CPU_ABI=x86_64 --es env17 MOZHTTP2_PORT=36197 --es env18 MOZNODE_EXEC_PORT=41239 --es env19 TMPDIR=/data/local/tmp/test_root/xpc/p/a733d5e0-71cb-4b83-9b0d-5a5a177bef18 --es env20 XPCSHELL_MINIDUMP_DIR=/data/local/tmp/test_root/xpc/minidumps/a733d5e0-71cb-4b83-9b0d-5a5a177bef18 --es arg0 -g --es arg1 /data/local/tmp/test_root/xpcb --es arg2 --greomni --es arg3 /data/local/tmp/test_root/xpcb/geckoview-test_runner.apk --es arg4 -m --es arg5 -e --es arg6 'const _HEAD_JS_PATH = "/data/local/tmp/test_root/xpc/head.js";' --es arg7 -e --es arg8 'const _MOZINFO_JS_PATH = "/data/local/tmp/test_root/xpc/p/a733d5e0-71cb-4b83-9b0d-5a5a177bef18/mozinfo.json";' --es arg9 -e --es arg10 'const _PREFS_FILE = "/data/local/tmp/test_root/xpc/user.js";' --es arg11 -e --es arg12 'const _TESTING_MODULES_DIR = "/data/local/tmp/test_root/xpc/m";' --es arg13 -f --es arg14 
/data/local/tmp/test_root/xpc/head.js --es arg15 -e --es arg16 'const _HEAD_FILES = ["/data/local/tmp/test_root/xpc/toolkit/components/glean/tests/xpcshell/head.js"];' --es arg17 -e --es arg18 'const _JSDEBUGGER_PORT = 0;' --es arg19 -e --es arg20 'const _TEST_CWD = "/data/local/tmp/test_root/xpc/toolkit/components/glean/tests/xpcshell";' --es arg21 -e --es arg22 'const _TEST_FILE = ["test_GIFFT.js"];' --es arg23 -e --es arg24 'const _TEST_NAME = "toolkit/components/glean/tests/xpcshell/test_GIFFT.js";' --es arg25 -e --es arg26 '_execute_test(); quit(0);' --ez use_multiprocess True --es out_file /data/local/tmp/test_root/xpc/logs/xpcshell-0c14b8ab-5a43-4e66-8c75-a10c763aef6d.log
[task 2022-01-24T17:05:49.819Z] 17:05:49     INFO -  remotexpcshelltests.py | toolkit/components/glean/tests/xpcshell/test_GIFFT.js | 4359 | Launched Test App
[task 2022-01-24T17:05:50.845Z] 17:05:50     INFO -  remotexpcshelltests.py | toolkit/components/glean/tests/xpcshell/test_GIFFT.js | 4359 | Application ran for: 0:00:01.416023
[task 2022-01-24T17:05:50.934Z] 17:05:50  WARNING -  TEST-UNEXPECTED-FAIL | toolkit/components/glean/tests/xpcshell/test_GIFFT.js | xpcshell return code: 0
[task 2022-01-24T17:05:50.934Z] 17:05:50     INFO -  TEST-INFO took 1505ms
[task 2022-01-24T17:05:50.934Z] 17:05:50     INFO -  >>>>>>>
[task 2022-01-24T17:05:50.934Z] 17:05:50     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
[task 2022-01-24T17:05:50.934Z] 17:05:50     INFO -  (xpcshell/head.js) | test run_next_test 0 pending (2)
[task 2022-01-24T17:05:50.935Z] 17:05:50     INFO -  (xpcshell/head.js) | test MAIN run_test finished (2)
[task 2022-01-24T17:05:50.935Z] 17:05:50     INFO -  running event loop
Flags: needinfo?(chutten)

Ohhhhhhhhh, this is on Android.

testResetFOG isn't implemented on Android.

Whoops. Hm, okay. Lemme skip-if this. I've taken the "how do we reset glean on Android" discussion to bug 1670259.

Flags: needinfo?(chutten)
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8844d680a55 If Glean's given large or negative values, don't mirror them to Telemetry r=janerik
https://hg.mozilla.org/integration/autoland/rev/8fa6985ac570 Ensure CPU and GPU time measurements fit in instrumentation r=florian,janerik
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
See Also: → 1752139

Uptake of the fix into Nightly builds should now be at a good place for comparison:

To bring those together, I notice that (according to pings received January 30) Glean's reporting about 78% of the sum from about 80% of the clients. This is consistent with

  • Glean's approach of using many fewer pings (30% of the ping volume via Firefox Telemetry on Jan 30) to reach a slightly smaller population of users with "metrics" pings. (The "baseline" ping takes care of getting accurate population counts). (This accounts for the difference in client counts)
  • The fact that, even with the patches, Telemetry uint Scalars can represent values between 2^31 and 2^32-1 if multiple samples are provided that sum into that range, whereas Glean continues to saturate at 2^31-1. (This is consistent with the remaining 2% gap between the client-count difference and the value-sum difference)

In short, this appears to have fixed the existing bugs, and the GIFFT documentation has been updated as a reference for anyone puzzling out similar differences in the future.

Whiteboard: [data quality]
Regressions: 1755733
