Closed Bug 1843178 Opened 2 years ago Closed 2 years ago

FOG IPC flushing during tests must be called before the ContentParent is removed

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
117 Branch
Tracking Status
firefox117 --- fixed

People

(Reporter: chutten|PTO, Assigned: chutten|PTO)

References

Details

Attachments

(1 file)

In looking at some migrating-to-Glean metrics from bug 1834570 I discovered that, despite everything happening deterministically with appropriate promise behaviour and awaits, a non-deterministic failure for metrics in the content process being tested in the parent.

My current suspicion is that this is because FOGIPC's FlushFOGData doesn't flush the dispatcher's queue before serializing the IPC buffer for transmission.

This doesn't tend to come up in real operation because IPC happens opportunistically on idle (at which time the dispatchers in the various processes have likely had lots of time to process). This doesn't tend to come up in other tests because what content-process instrumentation there is that's tested tends to be tested in ways that leave more room for the dispatcher to flush before IPC.

Just a theory for now, but it holds water.

Welp, no, this doesn't hold water after all because there's no dispatcher in non-parent processes. The dispatcher only exists for the Glean singleton, which is one-per-app, not one-per-process. All metrics calls in non-parent (non-Glean-singleton-having) processes are synchronous writes to the buffers that occasionally (or on test API demand) punt over IPC.

Then what the heck is going on... (time to reword the title of this). I see logs of things happening on the expected processes in the expected order. No pings are being submitted, so storage isn't getting occasionally cleared for that reason. Mayyyybe there's something about startup stuff triggered by testreset... but it doesn't seem likely, and the logs confirm that init is complete before the test resumes.

I'm going to have to add more tracelogging. IPC replay buffer, here we go again...

Summary: FOG IPC appears to not flush the dispatcher queue before transmission → FOG IPC appears to not flush deterministically

This might be something.

 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747257 UTC - [Parent 487922: Main Thread]: W/fog TestFlushAllChildren()
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747689 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488005
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747717 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488030
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747732 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488127
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747745 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488149
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747764 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488261
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747792 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488266
 0:06.40 GECKO(487922) 2023-07-13 18:19:48.747813 UTC - [Parent 487922: Main Thread]: W/fog Asking for flush from child 488315
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748080 UTC - [Child 488005: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748087 UTC - [Child 488030: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748088 UTC - [Child 488149: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748127 UTC - [Child 488266: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748148 UTC - [Child 488261: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748176 UTC - [Utility 488105: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748204 UTC - [Socket 487978: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748386 UTC - [Child 488315: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.748478 UTC - [Child 488127: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.750346 UTC - [Parent 487922: Main Thread]: W/fog FlushFOGDataPromise resolve IsResolve
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.750543 UTC - [Child 488200: Main Thread]: W/fog Flushing FOG data on this process
 0:06.41 GECKO(487922) [WARN  firefox_on_glean::ipc] Taking IPC buf with 2 events
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.750633 UTC - [Parent 487922: Main Thread]: W/fog TestFlushAllChildren done
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.750719 UTC - [Parent 487922: Main Thread]: W/fog TestGetValue for event with id 206
 0:06.41 GECKO(487922) 2023-07-13 18:19:48.751025 UTC - [Parent 487922: Main Thread]: W/fog TestGetValue got Nothing
 0:06.38 FAIL Uncaught exception in test bound test_popup_opened_form_without_autocomplete - at chrome://mochitests/content/browser/browser/extensions/formautofill/test/browser/creditCard/browser_creditCard_telemetry.js:300 - TypeError: can't access property 0, Glean.creditcardCcForm.detected.testGetValue() is undefined
Stack trace:
test_popup_opened_form_without_autocomplete@chrome://mochitests/content/browser/browser/extensions/formautofill/test/browser/creditCard/browser_creditCard_telemetry.js:300:138
async*handleTask@chrome://mochikit/content/browser-test.js:1131:26
_runTaskBasedTest@chrome://mochikit/content/browser-test.js:1203:18
async*Tester_execTest@chrome://mochikit/content/browser-test.js:1345:14
nextTest/<@chrome://mochikit/content/browser-test.js:1120:14
SimpleTest.waitForFocus/<@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1056:13
 0:06.38 INFO Leaving test bound test_popup_opened_form_without_autocomplete
 0:06.38 INFO Entering test bound test_popup_opened_form_without_autocomplete_separate_cc_number
 0:06.41 GECKO(487922) [WARN  firefox_on_glean::ipc] Event with static metric id 206 replaying
 0:06.41 GECKO(487922) [WARN  firefox_on_glean::ipc] Event with static metric id 209 replaying

The tab opened for this check has the child process with pid 488200 and it has the 2 events we're interested in (one detected, one popup_shown, for what it's worth). But notice in the raft of "Asking for flush from child <pid>" messages (which is the list of all ContentParents we're iterating over to flush) doesn't ask for 488200.

This suggests that the content process we want to hear from's ContentParent isn't present at the time Services.fog.testFlushAllChildren() is called.

I'm not 100% sure, but I'm guessing the tab under test doesn't often outlive BrowserTestUtils.withNewTab. This would normally be fine as one of the destruction actions of a content process is to submit the remaining Glean IPC for consumption... but this happens after the test is over (see event with static metric id 206 (detected) and 209 (popup_shown) being replayed there at the bottom). But the tab appears to have its ContentParent promptly removed from ContentParent::GetAll.

I think I can fix this by putting the call to flush the children inside withNewTab... and wouldn't you know it, it works.

Okay, so I think we can unblock bug 1834570 by augmenting the instrumentation test documentation to include a note about the importance of when to flush IPC, and ensuring we support the test rewrite.

Summary: FOG IPC appears to not flush deterministically → FOG IPC flushing during tests must be called before the ContentParent is removed
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/427d37777a4c Be more specific about when to flush IPC in instrumentation tests r=perry.mcmanis DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: