Intermittent browser/modules/test/browser/browser_UsageTelemetry_interaction.js | single tracking bug
Categories
(Firefox :: Settings UI, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox137 | --- | fixed |
People
(Reporter: jmaher, Assigned: chutten)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure, intermittent-testcase)
Attachments
(1 file, 1 obsolete file)
Bug 1775795 - In testResetFOG, ensure to override custom ping registration after reinit r?TravisLong
48 bytes,
text/x-phabricator-request
|
Details | Review |
Reporter | ||
Comment 1•3 years ago
|
||
Additional information about this bug failures and frequency patterns can be found by running: ./mach test-info failure-report --bug 1775795
Comment 2•3 years ago
|
||
Blanket S4 for all the new tracking bugs because they don't seem to be high-frequency. We can increase the severity if frequency goes up (or clear it to re-surface this to triagers, please.)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 31•1 year ago
|
||
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
For more information, please visit BugBot documentation.
Comment 32•1 year ago
|
||
treeherder |
New failure instance: https://treeherder.mozilla.org/logviewer?job_id=433296354&repo=mozilla-central
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•1 year ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 96•2 months ago
|
||
This looks to be permafailing on artifact builds with:
TEST-UNEXPECTED-FAIL | browser/modules/test/browser/browser_UsageTelemetry_interaction.js | Uncaught exception in test bound toolbarButtons - at chrome://mochitests/content/browser/browser/modules/test/browser/browser_UsageTelemetry_interaction.js:184 - TypeError: can't access property "map", Glean.browserUsage.interaction.testGetValue() is null
That's pretty annoying. Why is this happening? I have vague recollections some glean metrics don't work well with artifact builds, but my trypush didn't touch glean and was on a clean central base, so I'm confused why this would be broken.
I'd ask :chutten but he's out and blocked needinfos. Jan-Erik, do you know?
Comment hidden (Intermittent Failures Robot) |
Comment 98•2 months ago
|
||
Uuuuh, nope. but :chutten is back now.
Assignee | ||
Comment 99•2 months ago
|
||
object
metrics are the ones that don't work in artefact builds (bug 1883857), but browser.usage.interaction
is an event
. So unless there's an unguarded object
metric that's throwing before this line (preventing that line from running), I don't know that that's necessarily the problem.
Comment hidden (Intermittent Failures Robot) |
Comment 101•1 month ago
|
||
(In reply to Chris H-C :chutten from comment #99)
object
metrics are the ones that don't work in artefact builds (bug 1883857), butbrowser.usage.interaction
is anevent
. So unless there's an unguardedobject
metric that's throwing before this line (preventing that line from running), I don't know that that's necessarily the problem.
OK, but there is clearly some kind of problem. I pushed to try again today, for something else, and my push has a bunch of failures. The top ones are this bug, with the same error message as comment 96, plus toolkit/components/nimbus/test/unit/test_nimbusTelemetry.js
with:
Unexpected exception TypeError: can't access property "set", Glean.nimbusTargetingEnvironment.prefValues is undefined at resource://nimbus/lib/TargetingContextRecorder.sys.mjs:330
[task 2025-01-16T17:40:57.185Z] 17:40:57 INFO - recordPrefValues@resource://nimbus/lib/TargetingContextRecorder.sys.mjs:330:3
[task 2025-01-16T17:40:57.185Z] 17:40:57 INFO - recordTargetingContext@resource://nimbus/lib/TargetingContextRecorder.sys.mjs:389:3
[task 2025-01-16T17:40:57.185Z] 17:40:57 INFO - test_nimbusTargetingEnvironment_recordAttrs@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/tests/toolkit/components/nimbus/test/unit/test_nimbusTelemetry.js:320:9
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - async*_run_next_test/<@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/head.js:1759:22
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - _run_next_test@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/head.js:1759:38
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - run@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/head.js:808:9
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - _do_main@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/head.js:245:6
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - _execute_test@/opt/worker/tasks/task_173704834952156/build/tests/xpcshell/head.js:596:5
[task 2025-01-16T17:40:57.186Z] 17:40:57 INFO - @-e:1:1
[task 2025-01-16T17:40:57.187Z] 17:40:57 INFO - exiting test
and toolkit/components/glean/tests/xpcshell/test_ProfilerMarkers.js
with:
[task 2025-01-16T17:51:17.941Z] 17:51:17 WARNING - TEST-UNEXPECTED-FAIL | toolkit/components/glean/tests/xpcshell/test_ProfilerMarkers.js | test_fog_counter_markers - [test_fog_counter_markers : 86] [{"type":"IntLikeMetric","id":"id not found","val":31}] deepEqual [{"type":"IntLikeMetric","id":"testOnly.badCode","val":31}]
all 3 of these are permafailing on artifact builds. I don't know why; my patch only touches how URIs are fixed up so is completely unrelated. The parent rev was from Wednesday so would have long finished building, so it shouldn't be mismatched artifacts either. But all 3 failures are glean related. (there's also a pile of failing a11y tests which are a separate thorn in my eye but that's fodder for a different bug)
Any ideas?
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 103•1 month ago
•
|
||
Ah, nimbus.targeting.environment.pref_values
is indeed a metric with type: object
. If that code's running in those tests, then in artefact builds it will throw, probably messing with tests.
ni?beth - Is it possible that recordPrefValues
is being run in these failing tests? If so, we'll need to guard the use of the metric on whether it's there like so.
Comment 104•1 month ago
|
||
All the actual code paths to recordPrefValues
should only run inside artifact builds, except test_nimbusTelemetry.js
. We should skip this test if we're not in an artifact build
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 108•21 days ago
|
||
I can trivially reproduce the issue in browser_UsageTelemetry_interaction.js
in a local artifact build on today's central. I don't know why it's broken though - I added logging in BrowserUsageTelemetry, here and after doing so, that logging is part of the test output. So it looks to me like the record
call succeeds, and occurs before the testGetValue
call, and yet the testGetValue
call returns null
. It's an artifact build on macOS so I cannot easily debug the C++/rust (though you could maybe do it on Windows with the source server?). I also don't know glean so don't know how to investigate further. Happy to do more digging myself but not sure where to start.
From very cursory looking at the code it would seem that this call:
to
if (!fog_event_test_has_value(mId, &aPingName)) {
must be returning false (and thus entering the if
and early returning) but I don't know why that is, and searchfox can't show me a definition for that function so I'm stumped.
Comment 109•21 days ago
|
||
In case it's helpful, I just ran with the profiler, and realized that these events appear to be hitting:
let raw_name_ptr = FOG_GetMetricIdentifier(**self);
...
"id not found"
because they are recording profiler markers with the metric id value set to "id not found", which sounds bad, and presumably explains why things aren't working...
Assignee | ||
Comment 110•21 days ago
|
||
Two things:
- Thank you so much for all your persistence on this issue, I think you've gotten this to a point from which I can take it. Especially now that I can confirm I can reproduce this on a faux artifact build (one where I copy over the
jogfile.json
from an artifact build to my full build'sobjdir/dist/bin
and use--setpref telemetry.fog.artifact_build=true
) "id not found"
is expected in artifact builds for now asFOG_GetMetricIdentifier
only works on metrics registered at compile-time. Artifact builds over-register their metrics at runtime and the profiler doesn't have a way to figure out what the name of a runtime metric is (yet. bug 1938145 ought to make that more possible)
I'm gonna turn on some logging and see if I can figure out what's going awry. Calling record()
followed by testGetValue
is straight up the first thing we test for events, so it should work.
I do notice that the event goes into the "protoype-no-code-events" ping which is enabled: false
and wonder if perhaps something like bug 1943977 is at play making that matter, but only at runtime.
Assignee | ||
Comment 111•21 days ago
|
||
I think I know what's going on.
When we call testResetFOG
it resets Glean. Glean doesn't know which custom pings exist and are enabled, and any metric against a ping that doesn't exist or isn't enabled doesn't get to be stored. FOG tells Glean about this in register_pings
.
Since the "prototype-no-code-events" ping is disabled: true
, that means that it'll be registered (so it's known) but it'll be disabled. No metrics will store data for this ping.
The test presently specifically enables this ping. And this is enough for full builds.
For artifact builds, though, the setEnabled
is being called on the runtime-registered instance of the ping. When Glean goes to ask "hey, can we store data on this ping?" it asks the compiled instance of the ping (which hasn't had set_enabled
called on it), not the runtime-registered instance.
So I think what needs to happen is that JOG needs to register its pings. In the event of a duplicate-registered ping, Glean will accept the newest instance, so when Glean asks its registry for the ping that has the name "prototype-no-code-events" it'll pull the runtime-registered one, not the compile-time one.
And we need to have a testResetJOG
because FOG will re-register its pings on test reset which will re-register the compile-time pings over the runtime ones (recall Glean accepts the newest instance). So JOG will need to reregister the runtime ones over the reregistered compile-time ones.
That ought to take care of it. I think. But I'm past my EOD so I can't check. +Cc Jan-Erik (already +Cc'd) with whom I've been talking about other JOG+ping+disabled+v63 testing wrinkles.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 113•13 days ago
|
||
With bug 1945663 and bug 1883857 resolved, this should disappear. ni?self to double-check and, if so, dupe against bug 1883857.
Assignee | ||
Comment 114•11 days ago
|
||
It was fairly rare to begin with, but there's been nothing for nearly a week so I think it's fair to assume bug 1883857 fixed this.
Comment 115•7 days ago
|
||
(In reply to Chris H-C :chutten from comment #114)
It was fairly rare to begin with, but there's been nothing for nearly a week so I think it's fair to assume bug 1883857 fixed this.
*** This bug has been marked as a duplicate of bug 1883857 ***
I don't think people star jobs on trypushes as a general rule, and we don't run tests against artifact builds on central. So I wouldn't expect much/anything to show up here either way. That's why the situation is as bad as it is.
I just saw this again on a trypush from today, against central from today ( https://hg.mozilla.org/mozilla-central/rev/749544d5d463ad3bc864266996d264131ebc026e ) , though: https://treeherder.mozilla.org/jobs?repo=try&revision=b6107bde6791a06ba5584a9cb4f716bb1924bf75&test_paths=browser%2Fmodules%2Ftest%2Fbrowser&selectedTaskRun=dD17MERBSdylXjyw15OiXQ.0
So I don't think this is fixed? I can file a new bug if that is better...
Assignee | ||
Comment 116•4 days ago
|
||
Ah, drat, I think there's still one step missing in the linked bugs which of course means this wouldn't be fixed. Sorry, I shouldn't have closed this ahead of its time.
Assignee | ||
Comment 117•4 days ago
|
||
The explanation still eludes me, but I have a minimally-reproducible case. Append to the bottom of toolkit/components/glean/tests/xpcshell/test_Glean.js
(or anywhere else):
add_task(function test_disabled_ping() {
Services.fog.testResetFOG();
GleanPings.prototypeNoCodeEvents.setEnabled(true);
Glean.browserUsage.interaction.record();
let events = Glean.browserUsage.interaction.testGetValue();
Assert.equal(1, events.length);
}).only();
And then run with JOG, either by running in artifact mode or using the instructions
The crucial piece appears to be testResetFOG()
. Investigation continues.
Assignee | ||
Comment 118•4 days ago
|
||
I've finally figured it out. Hooo this was a complex one.
The ultimate issue proved to be that, in test_reset, we specifically register pings before init. This has probably never been a good approach since ping registrations are cleared during init.
(( The only registrations that are replayed during test-reset are those registered before init was called initially. ))
(( You'd think we'd clear those preinit vecs as part of resetting glean, but keeping them around is also deliberate. ))
This isn't a problem if the Glean PingType
instance that we perform operations on in the test is the same as the one registered before init was called initially. In artifact builds, though, the instance we perform operations on is a fresh one that JOG creates.
So what're we to do?
First and foremost we're going to have test_reset re-register its pings after init is called so that the new ping instances override any old ones. This looks like it's in direct opposition to all current advice to definitely register your pings before init, but it's really not because in test_reset we're already after init so we're merely registering new pings. This will fix this test, and also the new, minimal, test that I'll be adding to ensure that we catch these failures inside the component.
Secondly, I'm going to have a think about whether there's a better model for ping registration vis a vis Glean init. In "normal" (non-JOG non-test) cases, we seem to be doing okay, so there's no bug yet.
Patch forthcoming.
Assignee | ||
Comment 119•4 days ago
|
||
Also, while we're here:
- Clear some state that was hanging around in JOG across resets
- Add a missing verbose log for registering pings in JOG
- Add a test so that any regression of this behaviour fails inside the component first.
Comment 120•3 days ago
|
||
Comment 121•2 days ago
|
||
bugherder |
Description
•