High frequency windows tabswitch | could not convert string to float: 'smth'
Categories
(Firefox :: Tabbed Browser, defect, P5)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox-esr115 | --- | unaffected |
firefox114 | --- | unaffected |
firefox115 | --- | unaffected |
firefox116 | --- | wontfix |
firefox117 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: nika)
References
(Regression)
Details
(Keywords: intermittent-failure, regression, Whiteboard: [retriggered][stockwell disable-recommended])
Attachments
(3 files)
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta-
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta-
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Filed by: sstanca [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=421024828&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/djFZgdFuRrSzSYsYhPhQdQ/runs/0/artifacts/public/logs/live_backing.log
[task 2023-06-29T01:06:14.222Z] 01:06:14 INFO - TEST-INFO | 7764: exit 0
[task 2023-06-29T01:06:15.617Z] 01:06:15 INFO - mozcrash checking C:\Users\task_1687999275\AppData\Local\Temp\tmpiq9p8czy\profile\minidumps for minidumps...
[task 2023-06-29T01:06:15.617Z] 01:06:15 INFO - TEST-UNEXPECTED-ERROR | tabswitch | could not convert string to float: '7.07el_wi685796104488'
[task 2023-06-29T01:06:15.618Z] 01:06:15 ERROR - Traceback (most recent call last):
[task 2023-06-29T01:06:15.618Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\run_tests.py", line 357, in run_tests
[task 2023-06-29T01:06:15.618Z] 01:06:15 INFO - talos_results.add(mytest.runTest(browser_config, test))
[task 2023-06-29T01:06:15.618Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\ttest.py", line 61, in runTest
[task 2023-06-29T01:06:15.618Z] 01:06:15 INFO - return self._runTest(browser_config, test_config, setup)
[task 2023-06-29T01:06:15.619Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\ttest.py", line 258, in _runTest
[task 2023-06-29T01:06:15.619Z] 01:06:15 INFO - test_results.add(
[task 2023-06-29T01:06:15.619Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 99, in add
[task 2023-06-29T01:06:15.619Z] 01:06:15 INFO - results = browserLog.results()
[task 2023-06-29T01:06:15.620Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 425, in results
[task 2023-06-29T01:06:15.620Z] 01:06:15 INFO - return self.classes[self.format](
[task 2023-06-29T01:06:15.620Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 264, in __init__
[task 2023-06-29T01:06:15.620Z] 01:06:15 INFO - result["runs"] = [float(i) for i in r[2:]]
[task 2023-06-29T01:06:15.620Z] 01:06:15 INFO - File "C:\Users\task_1687999275\build\tests\talos\talos\results.py", line 264, in <listcomp>
[task 2023-06-29T01:06:15.621Z] 01:06:15 INFO - result["runs"] = [float(i) for i in r[2:]]
[task 2023-06-29T01:06:15.621Z] 01:06:15 INFO - ValueError: could not convert string to float: '7.07el_wi685796104488'
[task 2023-06-29T01:06:15.621Z] 01:06:15 INFO - TEST-INFO took 389691ms
[task 2023-06-29T01:06:15.621Z] 01:06:15 INFO - SUITE-END | took 389s
[task 2023-06-29T01:06:15.857Z] 01:06:15 INFO - Return code: 2
[task 2023-06-29T01:06:15.857Z] 01:06:15 WARNING - setting return code to 2
Comment 1•1 year ago
|
||
This is failing frequently and backfills and retriggers are pointing to Bug 1838906.
Hello Nika! Could these failures be somehow influenced by your latest changes?
Updated•1 year ago
|
Comment 2•1 year ago
|
||
Set release status flags based on info from the regressing bug 1838906
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 4•1 year ago
|
||
Looking at the log it seems that we're ending up in a slightly broken situation where the child process has somehow inherited an already-closed handle. There also seem to be some issues where we're trying to send a null HANDLE to a child process, which is unexpected.
[task 2023-06-29T00:59:47.414Z] 00:59:47 INFO - [Parent 4104, IPC I/O Parent] WARNING: DuplicateHandle failed for handle 0 in TransferHandles: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc:621
[task 2023-06-29T00:59:47.414Z] 00:59:47 INFO - [GPU 6840, IPC I/O Child] WARNING: pipe error: 232: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc:443
With regards to the first log line, notably WriteFileHandle
is only called in one place as far as I am aware (https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_message_utils.h#941), and there's a check to ensure that it is not nullptr
earlier in this function, meaning we should theoretically never be writing a NULL file handle into attached_handles_
.
As far as I am aware, the only ways that we can end up with a null handle in atatched_handles_
is if ConsumeFileHandle
has been called (happens during deserialization and transfers ownership out of the message, meaning that ownership over the handle is lost and we clear to null: https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_message.cc#122), or if SetAttachedFileHandles
was called with an array which already contains invalid handles (https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/ipc/chromium/src/chrome/common/ipc_channel_win.cc#549-576).
It seems unlikely to me that we're deserializing a message multiple times, so the most likely case I can think of here is probably that something weird is happening when relaying HANDLEs between two child processes, and after duplicating from the child, we end up with a null handle somehow in the parent? It might be worth adding some assertions to the handle receiving code related to that.
The second line is quite confusing as well. I don't know how we'd end up inheriting an already-closed handle into the newly spawned content process.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 7•1 year ago
|
||
Set release status flags based on info from the regressing bug 1838906
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•1 year ago
|
||
Nika, any updates on this one?
It has 141 total failures since it was filed 12 days ago. Fails only on windows shippable, latest log: https://treeherder.mozilla.org/logviewer?job_id=422154442&repo=autoland
Assignee | ||
Comment 13•1 year ago
|
||
I've done a bit more looking into the code, and I think that the current logging is holding back my ability to understand what exactly is going on here (e.g. the HANDLE being transferred is probably not actually NULL, but the code clobbers it before logging). I'm doing some try pushes with some new logging which will hopefully help.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 15•1 year ago
|
||
After digging into this more, the underlying issue makes more sense and I'll have a patch to work around it soon. Effectively, the changes in bug 1838906 meant that some new error logs started happening when content processes are rapidly starting & stopping, such as in the tab switch test. These logs were harmless but weren't being silenced because they hadn't come up often before those changes.
Because these logs were often coming from child processes and background threads, the log output would end up interleaving with error logs, producing strings which the test harness would then fail to parse. Due to the error spam, this was much more common than it otherwise would be.
I'll post some patches soon which refine the error log output from ipc_channel_win
to make it less misleading, and then silence the spurious errors which are interfering with the test output.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 16•1 year ago
|
||
Previously these logs would produce misleading errors due to the HANDLE being
duplicated having been clobbered by the failed DuplicateHandle call. This patch
changes them to include more information in the error log, and makes the error
logs more accurate.
Assignee | ||
Comment 17•1 year ago
|
||
There were a few types of errors which could spuriously happen on Windows, and
are being fixed by this patch. The ERROR_NO_DATA
error is now being silenced
similar to ERROR_BROKEN_PIPE
. This appears to happen in some specific
edge-cases around the pipe's lifecycle, such as when the pipe has been created
and then destroyed, but no data has been sent yet.
This patch also silences errors when transferring or accepting handles if
DuplicateHandle
fails due to the target process terminating. In this
situation, DuplicateHandle
returns ERROR_ACCESS_DENIED
, which is
unfortunately ambiguous. To avoid that ambiguity, this patch takes the same
approach as Chromium, and uses the RtlGetLastNtStatus
function from ntdll to
get the true error status, and compare it to STATUS_PROCESS_IS_TERMINATING
.
This matches the behaviour in Chromium to silence the same error.
Depends on D183408
Assignee | ||
Comment 18•1 year ago
|
||
This improves the quality of the logging from IPC::Channel, and also
adds extra assertions to ensure that it's aligned with the values
sent/received in the HELLO message.
This patch also makes the other_pid type more consistent, using
base::ProcessId instead of int32_t in IPC::Channel.
Depends on D183409
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 21•1 year ago
|
||
Comment 22•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0f05d5dc1c64
https://hg.mozilla.org/mozilla-central/rev/182508d75971
https://hg.mozilla.org/mozilla-central/rev/1b034941fe02
Comment 23•1 year ago
|
||
The patch landed in nightly and beta is affected.
:nika, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox116
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 24•1 year ago
|
||
Are you seeing these frequent failures on Beta as well? This wouldn't be too bad to uplift & silence the errors with if it's causing errors, but I'd rather not uplift if it's not causing problems for Sherrifs on Beta.
Comment 25•1 year ago
|
||
Hi Nika! The failure frequency is around 50/50 for this.
Comment hidden (Intermittent Failures Robot) |
Updated•1 year ago
|
Assignee | ||
Comment 27•1 year ago
|
||
Comment on attachment 9343553 [details]
Bug 1840974 - Part 2: Silence spurious IPC channel errors on Windows, r=#ipc-reviewers!
Beta/Release Uplift Approval Request
- User impact if declined: Intermittent talos test failures due to logspam
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Makes some fairly low-risk changes to silence some IPC errors from being logged in release builds.
- String changes made/needed: None
- Is Android affected?: No
Assignee | ||
Updated•1 year ago
|
Comment 28•1 year ago
•
|
||
Comment on attachment 9343552 [details]
Bug 1840974 - Part 1: Improve ipc_channel_win HANDLE transfer errors, r=#ipc-reviewers!
We have no more betas left this cycle for fx116. Since I don't think these talos tests run in release, we can let this ride the 117 trains which is in beta next week.
Updated•1 year ago
|
Comment hidden (Intermittent Failures Robot) |
Description
•