Closed Bug 1865536 Opened 1 year ago Closed 11 months ago

Permafailing "o" startup_about_home_paint_cached | Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))]

Categories

(Testing :: Talos, defect, P5)

defect

Tracking

(firefox-esr115 unaffected, firefox120 unaffected, firefox121 wontfix, firefox122 fixed)

RESOLVED FIXED
122 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox120 --- unaffected
firefox121 --- wontfix
firefox122 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: mconley)

References

(Regression)

Details

(Keywords: intermittent-failure, intermittent-testcase, regression, Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=436783246&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FtH2PGNDSru7iksx1H9rjw/runs/0/artifacts/public/logs/live_backing.log


[task 2023-11-19T15:01:43.545Z] 15:01:43     INFO -  TEST-INFO | started process 4930 (/opt/worker/tasks/task_170040083307754/build/application/Firefox Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/h3/snrtdstx3_5cydhz58ndp098000014/T/tmpf4qxighp/profile)
[task 2023-11-19T15:01:56.573Z] 15:01:56     INFO -  PID 4930 | Failed to get timestamps.about_home_topsites_first_paint scalar probe in time.
[task 2023-11-19T15:01:56.791Z] 15:01:56     INFO -  PID 4930 | console.error: ({})
[task 2023-11-19T15:01:56.925Z] 15:01:56     INFO -  TEST-INFO | 4930: exit 0
[task 2023-11-19T15:01:56.958Z] 15:01:56     INFO -  mozcrash checking /var/folders/h3/snrtdstx3_5cydhz58ndp098000014/T/tmpf4qxighp/profile/minidumps for minidumps...
[task 2023-11-19T15:01:56.958Z] 15:01:56     INFO -  TEST-UNEXPECTED-ERROR | startup_about_home_paint_cached | Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))]
[task 2023-11-19T15:01:56.958Z] 15:01:56    ERROR -  Traceback (most recent call last):
[task 2023-11-19T15:01:56.959Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/run_tests.py", line 370, in run_tests
[task 2023-11-19T15:01:56.959Z] 15:01:56     INFO -      mytest.runTest(browser_config, test, utility_path=utility_path)
[task 2023-11-19T15:01:56.959Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/ttest.py", line 62, in runTest
[task 2023-11-19T15:01:56.960Z] 15:01:56     INFO -      browser_config, test_config, setup, utility_path=utility_path
[task 2023-11-19T15:01:56.960Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/ttest.py", line 264, in _runTest
[task 2023-11-19T15:01:56.960Z] 15:01:56     INFO -      counter_management.results() if counter_management else None
[task 2023-11-19T15:01:56.961Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/results.py", line 96, in add
[task 2023-11-19T15:01:56.961Z] 15:01:56     INFO -      global_counters=self.global_counters,
[task 2023-11-19T15:01:56.961Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/results.py", line 347, in __init__
[task 2023-11-19T15:01:56.962Z] 15:01:56     INFO -      self.parse()
[task 2023-11-19T15:01:56.962Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/results.py", line 374, in parse
[task 2023-11-19T15:01:56.962Z] 15:01:56     INFO -      "Could not find report in browser output: %s" % self.report_tokens
[task 2023-11-19T15:01:56.963Z] 15:01:56     INFO -    File "/opt/worker/tasks/task_170040083307754/build/tests/talos/talos/results.py", line 358, in error
[task 2023-11-19T15:01:56.963Z] 15:01:56     INFO -      raise utils.TalosError(message)
[task 2023-11-19T15:01:56.963Z] 15:01:56     INFO -  talos.utils.TalosError: Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))]
[task 2023-11-19T15:01:56.964Z] 15:01:56     INFO -  TEST-INFO took 17563ms

All the occurrences from Bug 1855599 starting from the 14th of November are actually permafailures which shall be tracked in this bug.

Retriggers and backfills point to Bug 1803406 as being the culprit for the sudden permafail.

Jonathan, can you look into this, please?
Thank you!

Flags: needinfo?(jfkthame)

Set release status flags based on info from the regressing bug 1803406

Update

There have been 54 total failures within the last 7 days, all of them on OS X 10.15 WebRender Shippable opt.

Recent log: https://treeherder.mozilla.org/logviewer?job_id=437749829&repo=autoland&lineNumber=4183

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

:mconley, I wondered if you have any thoughts here? I don't doubt that bug 1803406 may have affected this, as it added work during content-process startup and will have affected timings etc., but I don't know why the test would be failing (rather than showing a regression).

It seems like we're simply hanging during startup, until eventually the harness times out; but that's not happening to other tests or in normal use of the browser. So what's different about how this test runs?

Flags: needinfo?(jfkthame) → needinfo?(mconley)

We should fix this to improve measurements, but this is not something that impacts users, so I don't think it needs to track the 121 release.

I'm actively looking at this. Leaving needinfo.

I can reproduce this locally, and have a lead - I think we've got a race with the mechanism that preloads the background newtab, and how messages are queued up from the about:home page before being sent to the parent. Still investigating...

Okay, the issue here is that the preloaded newtab on the getInfo part of the Talos test causes the about:home startup cache to get populated.

During the first run of the Talos test, the about:home document loads so quickly that the parent doesn't have enough time to initialize and setup a response. While about:home waits for a response from the parent, messages from content are queued up (including the one that sets the Telemetry value that the Talos test is waiting for).

Weird how bug 1803406 caused this to rear its head, but it's pretty unrelated. This is a race we've probably had for a while, and bug 1803406 probably just made us start losing the race.

Looking for the right solution to clear this up.

Before this patch, the about:home/about:newtab Redux store code had some
middleware that queued any messages sent from the page before the parent
had sent any messages. Presumably this was so that those messages wouldn't
be dropped if they were sent while the parent process was still setting
up its Feeds.

Unfortunately, there's a race here - if the parent process is ready and
just chooses not to send any messages right away, the loaded about:home/about:newtab
document will just hold on to any actions until the parent process has
sent something down to it.

The Talos test that was failing here was waiting for the initial about:home
page to send a message which would record a Telemetry probe. That message
wasn't arriving in time. Presumably, eventually the parent process would
have sent a message down to the about:home page which would flush the actions,
but the Talos test would time out before that would occur.

This patch changes things by having the parent process queue any messages
sent from the content in the event that the ActivityStreamMessageChannel
is not yet set up. Once it is set up, those messages are dispatched after
the simulated NEW_TAB_INIT and NEW_TAB_LOAD for those early tabs are
sent to the parent process Redux store.

Assignee: nobody → mconley
Status: NEW → ASSIGNED
Flags: needinfo?(mconley)
Pushed by mconley@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f5b09dd28197 Queue and flush early messages from about:newtab after the parent-process is ready. r=Mardak

My recommendation is that we don't fix this for 121, and let this ride the trains per usual. The Talos test permafailure has no user impact.

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: