Closed Bug 1840974 Opened 1 year ago Closed 1 year ago

High frequency windows tabswitch | could not convert string to float: 'smth'

Categories

(Firefox :: Tabbed Browser, defect, P5)

defect

Tracking

()

RESOLVED FIXED
117 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 --- unaffected
firefox114 --- unaffected
firefox115 --- unaffected
firefox116 --- wontfix
firefox117 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: nika)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [retriggered][stockwell disable-recommended])

Attachments

(3 files)

Filed by: sstanca [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=421024828&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/djFZgdFuRrSzSYsYhPhQdQ/runs/0/artifacts/public/logs/live_backing.log


[task 2023-06-29T01:06:14.222Z] 01:06:14     INFO -  TEST-INFO | 7764: exit 0
[task 2023-06-29T01:06:15.617Z] 01:06:15     INFO -  mozcrash checking C:\Users\task_1687999275\AppData\Local\Temp\tmpiq9p8czy\profile\minidumps for minidumps...
[task 2023-06-29T01:06:15.617Z] 01:06:15     INFO -  TEST-UNEXPECTED-ERROR | tabswitch | could not convert string to float: '7.07el_wi685796104488'
[task 2023-06-29T01:06:15.618Z] 01:06:15    ERROR -  Traceback (most recent call last):
[task 2023-06-29T01:06:15.618Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\run_tests.py", line 357, in run_tests
[task 2023-06-29T01:06:15.618Z] 01:06:15     INFO -      talos_results.add(mytest.runTest(browser_config, test))
[task 2023-06-29T01:06:15.618Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\ttest.py", line 61, in runTest
[task 2023-06-29T01:06:15.618Z] 01:06:15     INFO -      return self._runTest(browser_config, test_config, setup)
[task 2023-06-29T01:06:15.619Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\ttest.py", line 258, in _runTest
[task 2023-06-29T01:06:15.619Z] 01:06:15     INFO -      test_results.add(
[task 2023-06-29T01:06:15.619Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 99, in add
[task 2023-06-29T01:06:15.619Z] 01:06:15     INFO -      results = browserLog.results()
[task 2023-06-29T01:06:15.620Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 425, in results
[task 2023-06-29T01:06:15.620Z] 01:06:15     INFO -      return self.classes[self.format](
[task 2023-06-29T01:06:15.620Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 264, in __init__
[task 2023-06-29T01:06:15.620Z] 01:06:15     INFO -      result["runs"] = [float(i) for i in r[2:]]
[task 2023-06-29T01:06:15.620Z] 01:06:15     INFO -    File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 264, in <listcomp>
[task 2023-06-29T01:06:15.621Z] 01:06:15     INFO -      result["runs"] = [float(i) for i in r[2:]]
[task 2023-06-29T01:06:15.621Z] 01:06:15     INFO -  ValueError: could not convert string to float: '7.07el_wi685796104488'
[task 2023-06-29T01:06:15.621Z] 01:06:15     INFO -  TEST-INFO took 389691ms
[task 2023-06-29T01:06:15.621Z] 01:06:15     INFO -  SUITE-END | took 389s
[task 2023-06-29T01:06:15.857Z] 01:06:15     INFO - Return code: 2
[task 2023-06-29T01:06:15.857Z] 01:06:15  WARNING - setting return code to 2

This is failing frequently and backfills and retriggers are pointing to Bug 1838906.

Hello Nika! Could these failures be somehow influenced by your latest changes?

Flags: needinfo?(nika)
Keywords: regression
Regressed by: 1838906
Whiteboard: [retriggered]

Set release status flags based on info from the regressing bug 1838906

Looking at the log it seems that we're ending up in a slightly broken situation where the child process has somehow inherited an already-closed handle. There also seem to be some issues where we're trying to send a null HANDLE to a child process, which is unexpected.

[task 2023-06-29T00:59:47.414Z] 00:59:47     INFO -  [Parent 4104, IPC I/O Parent] WARNING: DuplicateHandle failed for handle 0 in TransferHandles: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc:621
[task 2023-06-29T00:59:47.414Z] 00:59:47     INFO -  [GPU 6840, IPC I/O Child] WARNING: pipe error: 232: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc:443

With regards to the first log line, notably WriteFileHandle is only called in one place as far as I am aware (https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_message_utils.h#941), and there's a check to ensure that it is not nullptr earlier in this function, meaning we should theoretically never be writing a NULL file handle into attached_handles_.

As far as I am aware, the only ways that we can end up with a null handle in atatched_handles_ is if ConsumeFileHandle has been called (happens during deserialization and transfers ownership out of the message, meaning that ownership over the handle is lost and we clear to null: https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_message.cc#122), or if SetAttachedFileHandles was called with an array which already contains invalid handles (https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_channel_win.cc#549-576).

It seems unlikely to me that we're deserializing a message multiple times, so the most likely case I can think of here is probably that something weird is happening when relaying HANDLEs between two child processes, and after duplicating from the child, we end up with a null handle somehow in the parent? It might be worth adding some assertions to the handle receiving code related to that.


The second line is quite confusing as well. I don't know how we'd end up inheriting an already-closed handle into the newly spawned content process.

Set release status flags based on info from the regressing bug 1838906

Nika, any updates on this one?
It has 141 total failures since it was filed 12 days ago. Fails only on windows shippable, latest log: https://treeherder.mozilla.org/logviewer?job_id=422154442&repo=autoland

Summary: High frequency tabswitch | could not convert string to float: 'smth' → High frequency windows tabswitch | could not convert string to float: 'smth'
Whiteboard: [retriggered][stockwell disable-recommended] → [retriggered][stockwell needswork:owner]

I've done a bit more looking into the code, and I think that the current logging is holding back my ability to understand what exactly is going on here (e.g. the HANDLE being transferred is probably not actually NULL, but the code clobbers it before logging). I'm doing some try pushes with some new logging which will hopefully help.

After digging into this more, the underlying issue makes more sense and I'll have a patch to work around it soon. Effectively, the changes in bug 1838906 meant that some new error logs started happening when content processes are rapidly starting & stopping, such as in the tab switch test. These logs were harmless but weren't being silenced because they hadn't come up often before those changes.

Because these logs were often coming from child processes and background threads, the log output would end up interleaving with error logs, producing strings which the test harness would then fail to parse. Due to the error spam, this was much more common than it otherwise would be.

I'll post some patches soon which refine the error log output from ipc_channel_win to make it less misleading, and then silence the spurious errors which are interfering with the test output.

Flags: needinfo?(nika)
Assignee: nobody → nika

Previously these logs would produce misleading errors due to the HANDLE being
duplicated having been clobbered by the failed DuplicateHandle call. This patch
changes them to include more information in the error log, and makes the error
logs more accurate.

There were a few types of errors which could spuriously happen on Windows, and
are being fixed by this patch. The ERROR_NO_DATA error is now being silenced
similar to ERROR_BROKEN_PIPE. This appears to happen in some specific
edge-cases around the pipe's lifecycle, such as when the pipe has been created
and then destroyed, but no data has been sent yet.

This patch also silences errors when transferring or accepting handles if
DuplicateHandle fails due to the target process terminating. In this
situation, DuplicateHandle returns ERROR_ACCESS_DENIED, which is
unfortunately ambiguous. To avoid that ambiguity, this patch takes the same
approach as Chromium, and uses the RtlGetLastNtStatus function from ntdll to
get the true error status, and compare it to STATUS_PROCESS_IS_TERMINATING.
This matches the behaviour in Chromium to silence the same error.

Depends on D183408

This improves the quality of the logging from IPC::Channel, and also
adds extra assertions to ensure that it's aligned with the values
sent/received in the HELLO message.

This patch also makes the other_pid type more consistent, using
base::ProcessId instead of int32_t in IPC::Channel.

Depends on D183409

Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0f05d5dc1c64 Part 1: Improve ipc_channel_win HANDLE transfer errors, r=ipc-reviewers,mccr8 https://hg.mozilla.org/integration/autoland/rev/182508d75971 Part 2: Silence spurious IPC channel errors on Windows, r=ipc-reviewers,jld https://hg.mozilla.org/integration/autoland/rev/1b034941fe02 Part 3: Make sure otherpid is set if known on IPC::Channel, r=ipc-reviewers,jld

The patch landed in nightly and beta is affected.
:nika, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox116 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(nika)

Are you seeing these frequent failures on Beta as well? This wouldn't be too bad to uplift & silence the errors with if it's causing errors, but I'd rather not uplift if it's not causing problems for Sherrifs on Beta.

Flags: needinfo?(nika) → needinfo?(csabou)

Hi Nika! The failure frequency is around 50/50 for this.

retriggers

Flags: needinfo?(csabou) → needinfo?(nika)

Comment on attachment 9343553 [details]
Bug 1840974 - Part 2: Silence spurious IPC channel errors on Windows, r=#ipc-reviewers!

Beta/Release Uplift Approval Request

  • User impact if declined: Intermittent talos test failures due to logspam
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Makes some fairly low-risk changes to silence some IPC errors from being logged in release builds.
  • String changes made/needed: None
  • Is Android affected?: No
Flags: needinfo?(nika)
Attachment #9343553 - Flags: approval-mozilla-beta?
Attachment #9343552 - Flags: approval-mozilla-beta?

Comment on attachment 9343552 [details]
Bug 1840974 - Part 1: Improve ipc_channel_win HANDLE transfer errors, r=#ipc-reviewers!

We have no more betas left this cycle for fx116. Since I don't think these talos tests run in release, we can let this ride the 117 trains which is in beta next week.

Attachment #9343552 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #9343553 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: