Closed Bug 1548630 Opened 5 years ago Closed 5 years ago

Intermittent TEST-UNEXPECTED-FAIL : xperf: File 'c:\windows\system32\wintypes.dll' (normalized from 'C:\Windows\System32\WinTypes.dll') was accessed and we were not expecting it. DiskReadCount: 2, DiskWriteCount: 0, DiskReadBytes: 32768

Categories

(Toolkit :: Startup and Profile System, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 - unaffected
firefox67.0.1 - unaffected
firefox68 - unaffected
firefox69 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bugzilla)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [comment 50][stockwell disable-recommended])

Attachments

(1 file, 1 obsolete file)

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=244243523&repo=mozilla-inbound
Full log: https://queue.taskcluster.net/v1/task/OzT3vzZzRhWNioESM36rmA/runs/0/artifacts/public/logs/live_backing.log


17:44:34 INFO - The trace you have just captured "Z:\task_1556814716\build\test.etl.kernel" may contain personally identifiable information, including but not necessarily limited to paths to files accessed, paths to registry accessed and process names. Exact information depends on the events that were logged. Please be aware of this when sharing out this trace with other people.
17:44:34 INFO - The trace you have just captured "Z:\task_1556814716\build\test.etl.user" may contain personally identifiable information, including but not necessarily limited to paths to files accessed, paths to registry accessed and process names. Exact information depends on the events that were logged. Please be aware of this when sharing out this trace with other people.
17:45:05 INFO - Merged Etl: test.etl
17:52:12 INFO - reading etl filename: test.etl
17:52:12 INFO - etlparser: in readfile: test.etl.csv
17:52:12 INFO - TEST-UNEXPECTED-FAIL : xperf: File 'c:\windows\system32\wintypes.dll' (normalized from 'C:\Windows\System32\WinTypes.dll') was accessed and we were not expecting it. DiskReadCount: 2, DiskWriteCount: 0, DiskReadBytes: 32768, DiskWriteBytes: 0
17:52:14 INFO - extending with xperf!
17:52:14 INFO - Detected a regression for tp5n
17:52:14 INFO - TEST-UNEXPECTED-FAIL | tp5n | Talos has found a regression, if you have questions ask for help in irc on #perf
17:52:14 ERROR - Traceback (most recent call last):
17:52:14 INFO - File "Z:\task_1556814716\build\tests\talos\talos\run_tests.py", line 300, in run_tests
17:52:14 INFO - talos_results.add(mytest.runTest(browser_config, test))
17:52:14 INFO - File "Z:\task_1556814716\build\tests\talos\talos\ttest.py", line 64, in runTest
17:52:14 INFO - return self._runTest(browser_config, test_config, setup)
17:52:14 INFO - File "Z:\task_1556814716\build\tests\talos\talos\ttest.py", line 268, in _runTest
17:52:14 INFO - 'Talos has found a regression, if you have questions'
17:52:14 INFO - TalosRegression: Talos has found a regression, if you have questions ask for help in irc on #perf
17:52:14 INFO - TEST-INFO took 594333ms
17:52:14 INFO - SUITE-END | took 594s
17:52:14 ERROR - Return code: 1
17:52:14 WARNING - setting return code to 1
17:52:14 ERROR - # TBPL WARNING #

Currently trending down. Removing disable-recommended.

Whiteboard: [stockwell disable-recommended]

Hi Robert, can you please take a look at this failure?

It seems to also fails on retriggers on older pushes.

Flags: needinfo?(rwood)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Hi :igoldan, I believe you're familiar with xperf correct? Mind having a look (or fw on to the talos xperf test owner if need be)? Thanks!

Flags: needinfo?(rwood) → needinfo?(igoldan)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Could this be related to the machines running the test? The failures on the oldest commits are all retriggers, the initial xperf runs didn't fail.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Assignee: nobody → igoldan
Flags: needinfo?(igoldan)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #15)

Could this be related to the machines running the test? The failures on the oldest commits are all retriggers, the initial xperf runs didn't fail.

Flags: needinfo?(aklotz)

Aaron, I believe xperf is only monitoring files opened/imported by Firefox. Am I right?

There is a certain amount of non-determinism here, at least dependent in part on the host machine and OS.

I wouldn't mind knowing how wintypes.dll is being loaded before whitelisting, though.

Flags: needinfo?(aklotz)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

Aron, Ionut, any updates here? Should we consider disabling the test? Thanks.

Flags: needinfo?(igoldan)
Flags: needinfo?(aklotz)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

(In reply to Cosmin Sabou [:CosminS] from comment #25)

Aron, Ionut, any updates here? Should we consider disabling the test? Thanks.

So I have the patch waiting for review & it fixes the intermittent.

There is a certain amount of non-determinism here, at least dependent in part on the host machine and OS.
I wouldn't mind knowing how wintypes.dll is being loaded before whitelisting, though.

I don't have enough cycles nor the xperf experience to investigate this. Aaron, could you look into this instead?

Flags: needinfo?(igoldan)
Flags: needinfo?(dave.hunt)

I don't think we should be whitelisting this. In fact, I think this is a legitimate perf regression from bug 1260483.

I ran Nightly in a debugger set up to break when WinTypes.dll is loaded. The most useful frames are here:

00 ntdll!NtMapViewOfSection+0x14
01 firefox!patched_NtMapViewOfSection+0x6d
02 ntdll!LdrpMinimalMapModule+0x101
03 ntdll!LdrpMapDllWithSectionHandle+0x1b
04 ntdll!LdrpMapDllNtFileName+0x189
05 ntdll!LdrpMapDllFullPath+0xe0
06 ntdll!LdrpProcessWork+0x74
07 ntdll!LdrpDrainWorkQueue+0x16e
08 ntdll!LdrpLoadDllInternal+0x148
09 ntdll!LdrpLoadDll+0xa8
0a ntdll!LdrLoadDll+0xe4
0b mozglue!patched_LdrLoadDll+0x206
0c KERNELBASE!LoadLibraryExW+0x161
0d combase!LoadLibraryWithLogging+0x2c
0e combase!CClassCache::CDllPathEntry::LoadDll+0x56
0f combase!CClassCache::CDllPathEntry::Create+0x4f
10 combase!CClassCache::GetOrLoadWinRTInprocClass+0xd80
11 combase!CCGetOrLoadWinRTInprocClass+0x21
12 combase!WinRTGetActivationFactoryOfInprocClass+0x49
13 combase!_RoGetActivationFactory+0x361
14 mrmcorer!Microsoft::Resources::IsHighContrast+0x68
15 mrmcorer!Microsoft::Resources::GetContrastQualifierValue+0x28
16 mrmcorer!Microsoft::Resources::QualifierValueProviderBase::GetQualifierValue+0x55
17 mrmcorer!Microsoft::Resources::Runtime::CContextualHash::_GetHash+0xe2
18 mrmcorer!Microsoft::Resources::Runtime::CContext::s_GetHash+0x60
19 mrmcorer!Microsoft::Resources::Runtime::CResourceManagerQueueInternal::_FillCacheStrings+0x71
1a mrmcorer!Microsoft::Resources::Runtime::CResourceManagerQueueInternal::GetStringInternal+0x2b9
1b mrmcorer!ResourceManagerQueueGetString+0x3a
1c KERNELBASE!SHLoadIndirectStringInternal+0x161
1d windows_storage!CAssocElement::_QuerySourceAny<<lambda_dfb37409bb15dfff50d1c0491f1e76a6> >+0x1bf
1e windows_storage!CAssocElement::QueryString+0x2f
1f windows_storage!CAssocShellElement::QueryStringNoDelegate+0x25
20 windows_storage!CAssocProgidElement::QueryStringNoDelegate+0x24
21 windows_storage!CAssocShellVerbElement::QueryString+0x12d
22 windows_storage!CAssocArray::_QueryAllElements<<lambda_0ad978ee0b657b29c9332a702085a2f5> >+0x12b
23 windows_storage!CAssocArray::GetString+0xa5
24 shlwapi!AssocQueryStringW+0xb1
25 xul!nsOSHelperAppService::GetApplicationDescription+0x1c7
26 xul!nsOSHelperAppService::GetProtocolHandlerInfoFromOS+0x1c7
27 xul!nsExternalHelperAppService::GetProtocolHandlerInfo+0x2d
28 xul!XPTC__InvokebyIndex+0x72

In essence, AssocQueryStringW is triggering main thread I/O. It is intermittent because the Windows loader has transitioned to using a thread pool, so whether or not the I/O occurs on the main thread is nondeterministic.

Flags: needinfo?(aklotz)
Regressed by: 1260483

(In reply to Aaron Klotz [:aklotz] (PTO May 29 - June 5) from comment #29)

I don't think we should be whitelisting this. In fact, I think this is a legitimate perf regression from bug 1260483.

[...]
In essence, AssocQueryStringW is triggering main thread I/O. It is intermittent because the Windows loader has transitioned to using a thread pool, so whether or not the I/O occurs on the main thread is nondeterministic.

Ok, I've pointed this out in bug 1260483. Thanks, Aaron!

Whiteboard: [stockwell disable-recommended]

igoldan: could you schedule a time for you, me, jmaher, and aklotz to discuss xperf ownership and future?

Flags: needinfo?(dave.hunt)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #31)

igoldan: could you schedule a time for you, me, jmaher, and aklotz to discuss xperf ownership and future?

Yep, I've sent an email for that to better pick the date.

Whiteboard: [stockwell disable-recommended]

There are 108 total failures in the last 7 days on windows10-64, windows10-64-qr, windows10-64-shippable all opt

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=249653062&repo=mozilla-central&lineNumber=1613

03:16:30 INFO - TEST-START | tp5n
03:16:30 INFO - Initialising browser for tp5n test...

03:18:30 INFO - The trace you have just captured "Z:\task_1559526833\build\test.etl.user" may contain personally identifiable information, including but not necessarily limited to paths to files accessed, paths to registry accessed and process names. Exact information depends on the events that were logged. Please be aware of this when sharing out this trace with other people.
03:18:47 INFO - Merged Etl: test.etl
03:23:50 INFO - reading etl filename: test.etl
03:23:50 INFO - etlparser: in readfile: test.etl.csv
03:23:50 INFO - TEST-UNEXPECTED-FAIL : xperf: File 'c:\windows\system32\wintypes.dll' (normalized from 'C:\Windows\System32\WinTypes.dll') was accessed and we were not expecting it. DiskReadCount: 2, DiskWriteCount: 0, DiskReadBytes: 32768, DiskWriteBytes: 0
03:23:51 INFO - extending with xperf!
03:23:51 INFO - Detected a regression for tp5n
03:23:51 INFO - TEST-UNEXPECTED-FAIL | tp5n | Talos has found a regression, if you have questions ask for help in irc on #perf
03:23:51 ERROR - Traceback (most recent call last):
03:23:51 INFO - File "Z:\task_1559526833\build\tests\talos\talos\run_tests.py", line 276, in run_tests
03:23:51 INFO - talos_results.add(mytest.runTest(browser_config, test))
03:23:51 INFO - File "Z:\task_1559526833\build\tests\talos\talos\ttest.py", line 64, in runTest
03:23:51 INFO - return self._runTest(browser_config, test_config, setup)
03:23:51 INFO - File "Z:\task_1559526833\build\tests\talos\talos\ttest.py", line 263, in _runTest
03:23:51 INFO - 'Talos has found a regression, if you have questions'
03:23:51 INFO - TalosRegression: Talos has found a regression, if you have questions ask for help in irc on #perf
03:23:51 INFO - TEST-INFO took 440734ms
03:23:51 INFO - SUITE-END | took 440s
03:23:51 ERROR - Return code: 1

Aaron can you take a look?

Flags: needinfo?(aklotz)

Joel is there anything more we can do here?

Flags: needinfo?(jmaher)

we need to add 'C:\Windows\System32\WinTypes.dll' to the whitelist file:
https://searchfox.org/mozilla-central/source/testing/talos/talos/xtalos/xperf_whitelist.json

redirecting to :davehunt to get this fixed- code sheriffs can fix this as well- but be warned, adding a file to xperf_whitelist.json should only be done if it is expected- in this case why is wintypes.dll being called- did firefox change and now is calling it? I think this is unrelated to firefox, but in many other cases it is related (say a new font file is used, etc.)

Flags: needinfo?(jmaher) → needinfo?(dave.hunt)

Thank you Joel, we'll wait for Dave to reply.

According to :aklotz in comment 29 this should not be whitelisted as it is a genuine regression from bug bug 1260483. Is there new information that demonstrates this is not the case?

Flags: needinfo?(dave.hunt) → needinfo?(jmaher)

I missed that, thanks :davehunt for pointing it out. Given the high failure rate, this is nearing perma-fail territory, can we backout bug 1260483? If that landed 3 years ago I don't see how this could be the problem.

Can we disable this test or make it tier-2?

Flags: needinfo?(jmaher)
Attachment #9066027 - Attachment is obsolete: true

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #43)

Can we disable this test or make it tier-2?

Yes. Let's demote it to tier-2, to ease the burden on the code sheriffs.
Will re promote the test to tier-1, once this issue gets fixed.

Depends on: 1557246

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #45)

Yes. Let's demote it to tier-2, to ease the burden on the code sheriffs.

Tier 1 or 2 only matters for the question if something has to be backed out (1 = yes, 2 = no, file a bug and needinfo developer instead). Regarding the load of failure classifications, it doesn't make any difference.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #46)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #45)

Yes. Let's demote it to tier-2, to ease the burden on the code sheriffs.

Tier 1 or 2 only matters for the question if something has to be backed out (1 = yes, 2 = no, file a bug and needinfo developer instead). Regarding the load of failure classifications, it doesn't make any difference.

Ok, so we need to demote to Tier-3 then? Will the tests still run on trunk?

Flags: needinfo?(aryx.bugmail)

Yes, tier 3 only hides them from treeherder by default and they will still run on the same trees. If the latter shall be modified, run-on-projects has to be adjusted.

Flags: needinfo?(aryx.bugmail)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #43)

I missed that, thanks :davehunt for pointing it out. Given the high failure rate, this is nearing perma-fail territory, can we backout bug 1260483? If that landed 3 years ago I don't see how this could be the problem.

The reason why it is a problem now is probably due to Windows 10 upgrades on our test boxes. Also, the code in question is called by JS, which could have changed such that it moved the native call to an earlier point in browser startup.

Flags: needinfo?(aklotz)

I have a patch that will probably mitigate this. I'd rather we just fix this in Firefox than demote the test.

Assignee: igoldan → aklotz
Status: NEW → ASSIGNED
Component: Talos → Startup and Profile System
OS: Unspecified → Windows
Priority: P5 → P1
Product: Testing → Toolkit
Version: Version 3 → unspecified
Whiteboard: [stockwell disable-recommended] → [comment 50]

Aaron could you choose another reviewer? agashlin has blocking review.

Flags: needinfo?(aklotz)

"Blocking review" means that person must review the patch - it doesn't matter if anybody else grants review.

Flags: needinfo?(aklotz)

oh, okay. thanks for clarifying that. I thought it's just like in bugzilla when people block ni, you cannot set ni - block review, he won't see the review request.

Pushed by aklotz@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/29bf2cb2b9f3
Preload wintypes.dll off main thread; r=jmathies
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: